In short, the vector processing model is one in which the processor (CPU, GPU, etc.) takes a single instruction and applies it to multiple data elements or multiple data sets. To maximize the performance gains vector processing can deliver, it is best used whenever very large data sets, or even several very large data sets, need to be manipulated (processed). This is because vector processing instructions tend to be very complex in nature and form, as the following simplified processing example illustrates.
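As a rough illustration of the model (a sketch using Python's NumPy library, which dispatches to hardware vector instructions where available, rather than actual processor code), a single vectorized operation replaces an explicit element-by-element loop:

```python
import numpy as np

# Scalar style: one operation applied to one data element at a time.
def scale_scalar(values, factor):
    result = []
    for v in values:                        # each iteration handles a single element
        result.append(v * factor)
    return result

# Vector style: one operation applied to the whole data set at once.
def scale_vector(values, factor):
    return np.asarray(values) * factor      # NumPy applies the multiply across the array

data = [1.0, 2.0, 3.0, 4.0]
print(scale_scalar(data, 2.0))              # [2.0, 4.0, 6.0, 8.0]
print(scale_vector(data, 2.0))              # [2. 4. 6. 8.]
```

Both produce the same result; the difference is that the vectorized form expresses the work as one operation over the whole data set, which is what vector hardware exploits.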
Decoding and Translating High-Level Programming and Operating System Instructions
Before a processor (a CPU in this case) can perform any work on data, it must first prepare the mechanisms, routines, processes and operations required to perform the work being asked of it. It does this by decoding and translating the supplied higher-level, more "human friendly" programming and operating system instructions into a format that it, the processor, can understand and execute.
The combination of appropriately formatted instructions and data (including the correct byte ordering) with the processor's internal operations produces what are commonly referred to as the processor's micro-ops. These are native to each type, family and revision (also referred to as "stepping") of the processor(s) involved.
Traditional (Scalar) Processing
Many traditional (scalar) processing tasks vary to such an extent that the processor cannot immediately reuse the decoded and translated instructions it has just executed on the next processing task. Thus, as the processor's instruction cache fills, it discards these older "idle" instructions.
As a result, the next time the processor is asked to perform a task that does use the just-discarded instructions, it has no choice but to decode and retranslate them into the appropriate micro-ops all over again.
Vector Processing Instruction Complexity
Because vector processing instructions can be very complex, they generally require, in comparison to traditional scalar processing instructions, considerably more processor (CPU) cycles and time merely to be decoded and translated into processor-specific micro-ops that the processor can execute.
Processing Efficiency and Optimization
Considerable numbers of processor cycles would be wasted if the processor followed the original scalar-style practice of discarding decoded and translated instructions immediately after executing them: should it need a recently used instruction again, it would have no choice but to start over and decode and translate that instruction anew.
To overcome this, modern processors keep decoded and translated instructions for a longer period after use before discarding them. Adopting this simple strategy ultimately proved to enhance overall system performance considerably.
The easiest way to achieve longer retention times for decoded instructions was to increase the amount of cache memory available to the processor for this purpose. For modern vector-processing-capable processors, this has led manufacturers to design and fabricate processors with ever-increasing amounts of on-die "high-speed" cache (both L1 and L2) and a dedicated instruction cache.
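The keep-the-translation-around idea is essentially caching. As a loose software analogy only (real hardware decoders work nothing like this; `decode` and its string-splitting "translation" are invented for illustration), an expensive decode step can be memoized so repeated instructions skip the translation work:

```python
# Loose analogy: memoize an expensive "decode" step so repeated
# instructions reuse the earlier translation instead of redoing it,
# much like a processor retaining decoded micro-ops in cache.
decode_cache = {}

def decode(instruction):
    """Pretend-expensive translation of an instruction into 'micro-ops'."""
    if instruction in decode_cache:          # cache hit: reuse prior translation
        return decode_cache[instruction]
    micro_ops = tuple(instruction.split())   # stand-in for the real decoding work
    decode_cache[instruction] = micro_ops    # keep the result for next time
    return micro_ops

decode("add r1 r2")   # decoded and cached
decode("add r1 r2")   # served from the cache; no re-translation
```

The trade-off is the same as in hardware: retention costs storage (cache capacity) in exchange for skipping repeated decode work.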
Traditional scalar processing is not left out of the performance gains to be had from this strategy; it too benefits from the increased on-die cache.
Hybrid Processing Processor Designs
Unfortunately, these complex vector processing instructions perform comparatively poorly when simpler processing of small data sets is required. As a direct result, modern general-purpose microprocessors (CPUs) have vector processing capabilities built in, such that the vector unit runs alongside the main scalar processor and is supplied data only by programs that "know" it is there.
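A rough way to see why small data sets favor the simpler path (again a Python/NumPy analogy for fixed setup overhead, not a measurement of actual hardware behavior): each vectorized call pays a fixed dispatch cost, which only pays off once the data set is large enough to amortize it.

```python
import timeit
import numpy as np

def sum_scalar(values):
    # Plain element-by-element accumulation.
    total = 0.0
    for v in values:
        total += v
    return total

def sum_vector(values):
    # One vectorized reduction over the whole array.
    return float(np.sum(values))

small = [1.0, 2.0, 3.0]          # a tiny data set
arr = np.array(small)

# Both give the same answer; on tiny inputs the vectorized call's
# fixed per-call overhead can dominate its per-element advantage.
t_scalar = timeit.timeit(lambda: sum_scalar(small), number=10_000)
t_vector = timeit.timeit(lambda: sum_vector(arr), number=10_000)
print(f"scalar loop: {t_scalar:.4f}s, vectorized: {t_vector:.4f}s")
```

This is why the hybrid design routes small, simple work to the scalar unit and reserves the vector unit for large data sets.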
Mainstream Vector Processing Today
Today we find that the two most common vector processing implementations in mainstream consumer computing are:
- Single Instruction, Multiple Data (SIMD) – The modern Graphics Processing Unit (GPU) uses a form of vector processing named Single Instruction, Multiple Data (SIMD). This technique saves a great deal of instruction handling and processing cycles, as the relevant instruction is decoded and translated into the processor's native micro-ops once and then applied across a very large data set. Modern technologies based around SIMD vector processing include Intel's MMX and SSE, both of which are built into all Intel Pentium 4 and later CPUs; AMD's 3DNow! is another.
- Multiple Instruction, Multiple Data (MIMD) – The processor performs multiple vector processing instructions on multiple data sets at the same time. That is to say, different instruction streams execute simultaneously, each operating on its own data or data set. The data sets involved in such MIMD processing can be truly massive indeed.
Whenever a computer has multiple processing cores, multiple processors, or even multiple multi-core processors, each with multiple processing pipelines at its disposal, it can allocate different chunks of the data to be processed amongst these processing resources.
This works because each core of these multi-core microprocessors is itself a complete, independent Single Instruction, Multiple Data (SIMD) capable microprocessor, able to run a number of instructions simultaneously and execute many SIMD instructions per nanosecond. In essence, this is the simplest form of what is commonly referred to as massive parallelism, and it is the secret behind the raw number-crunching power of modern supercomputers.
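The divide-the-data-among-processing-resources scheme described above can be sketched with Python's standard library (threads stand in for cores here purely to keep the sketch portable; real hardware spreads the chunks across physical cores and pipelines, and `chunk_sum` and the chunk size are illustrative choices):

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(chunk):
    """The work applied independently to one chunk of the data."""
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Split the data set into one chunk per worker...
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # ...process the chunks concurrently, then combine the partial results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(chunk_sum, chunks))
    return sum(partials)

print(parallel_sum(list(range(1000))))   # 499500
```

Each worker handles its own chunk with no knowledge of the others, and only the partial results are combined at the end, which is the essential shape of data-parallel work distribution.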