Notes: Vector Processors
This report covers the fundamentals of vector processors: their architecture and functionality, the challenges they face, techniques for enhancing their performance, and common fallacies and pitfalls.
Paper Summary
Increasing processor performance involves issuing multiple instructions per clock cycle and deepening pipelines to exploit instruction-level parallelism. However, this approach escalates the need for independent instructions, leading to quadratic growth in dependency tracking logic and circuit complexity. Vector processors offer an alternative, employing deep pipelines and high-level vector operations. Vector instructions resolve issues like data hazards and latency by specifying substantial work in a single instruction, eliminating control hazards from loop branches. While vector processors excel in scientific, engineering, and multimedia applications, challenges persist in managing large, low-locality datasets, addressed through innovative cache architectures and compiler support.
A vector processor comprises a scalar unit and a vector unit, and comes in two main architectures: vector-register and memory-memory. In a vector-register architecture, all vector operations except load and store work on vector registers, whereas a memory-memory architecture operates directly on operands in memory. VMIPS, the vector-register example used here, includes eight vector registers, five fully pipelined vector functional units, a vector load-store unit, scalar registers that supply scalar operands and addresses, and two special registers: the vector-length register and the vector-mask register.
The vector instructions in focus include operations such as ADDV.D and ADDVS.D, whose inputs are either a pair of vector registers or a vector register and a scalar register; the latter uses the scalar register value as an operand for every element operation. Additionally, the LV and SV instructions handle vector load and store operations, loading or storing an entire vector of double-precision values; they take a vector register and a general-purpose register holding the starting memory address. The instruction set also encompasses variations such as LVWS and SVWS for strided memory access, as well as LVI and SVI for indexed (gather-scatter) access through a vector of offsets. Moreover, CVI creates an index vector from a given stride, which is useful for setting up such indexed accesses. Together these instructions offer versatile and efficient ways to manipulate vectors in memory, catering to various addressing modes and stride options.
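To make the semantics concrete, here is a minimal C sketch of what each of these instructions does to one vector register's worth of data; the function names, the MVL value of 64, and the use of plain arrays as stand-ins for vector registers are illustrative assumptions, not VMIPS definitions.

```c
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length, matching VMIPS */

/* ADDV.D V1, V2, V3 : element-wise add of two vector registers */
void addv_d(double V1[MVL], const double V2[MVL], const double V3[MVL], size_t vl) {
    for (size_t i = 0; i < vl; i++)
        V1[i] = V2[i] + V3[i];
}

/* ADDVS.D V1, V2, F0 : add a scalar to every element of a vector register */
void addvs_d(double V1[MVL], const double V2[MVL], double f0, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        V1[i] = V2[i] + f0;
}

/* LV V1, R1 : load a vector of doubles starting at the address in R1 */
void lv(double V1[MVL], const double *r1, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        V1[i] = r1[i];
}

/* SV R1, V1 : store a vector of doubles to the address in R1 */
void sv(double *r1, const double V1[MVL], size_t vl) {
    for (size_t i = 0; i < vl; i++)
        r1[i] = V1[i];
}
```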
The execution time of a sequence of vector operations depends on three main factors: the length of the operand vectors, structural hazards, and data dependencies. Given the vector length and the initiation rate, we can calculate the time for a single vector instruction. The concept of a “convoy” simplifies vector execution analysis: a convoy is a set of vector instructions that can potentially start execution together without structural or data hazards, and each convoy must finish execution before any subsequent convoy can begin. The “chime,” the unit of time taken to execute one convoy, makes it easy to estimate the performance of a vector sequence: the chime count is independent of vector length, and a sequence with \(m\) convoys takes about \(m\) chimes, or roughly \(m \times n\) clock cycles for vectors of length \(n\). However, the chime model overlooks vector start-up time, which arises from the pipelining latency of vector operations and increases the effective time to execute a convoy. The time for a convoy is therefore approximately the vector length plus the start-up time.
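As a rough worked illustration of these approximations (the convoy count and the neglect of start-up are assumptions made only for this example):

\[
T \;\approx\; m \times n \ \text{clock cycles (chime model)}, \qquad
T_{\text{convoy}} \;\approx\; T_{\text{start}} + n \ \text{clock cycles (with start-up)} ,
\]

so three convoys on 64-element vectors cost roughly \(3 \times 64 = 192\) clock cycles under the chime model, plus the accumulated start-up latencies in practice.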
The startup time for a load refers to the duration to fetch the first word from memory into a register. The vector initiation rate, determining the rate of fetching or storing new words, may not necessarily be one clock cycle due to memory bank stalls impacting effective throughput. To achieve an initiation rate of one word per clock, the memory system must handle this data rate, often accomplished by employing multiple memory banks. Vector processors commonly use memory banks instead of simple interleaving for three main reasons: supporting multiple simultaneous accesses, addressing non-sequential data words, and accommodating multiple processors sharing the same memory system. Real memory banks have access latency (time from address arrival to data return) and bank cycle time (occupied time per request). These factors influence the startup cost and effective bandwidth of fetching a vector from memory.
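A back-of-the-envelope consequence, with a bank busy time assumed purely for illustration: to sustain one word per clock, the number of independent banks must at least cover the time a bank stays busy with one request,

\[
\text{number of banks} \;\ge\; \text{bank busy time (in clock cycles)} ,
\]

so a bank that remains busy for 6 clock cycles per access would require at least 6 banks, and more in practice to tolerate non-unit strides and multiple access streams.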
A vector-register processor has a natural vector length determined by the number of elements in each vector register (e.g., 64 for VMIPS), which rarely matches the vector lengths that occur in real programs. To address this, a vector-length register (VLR) controls the length of any vector operation, such as a load or store, up to the maximum vector length (MVL). When a vector is longer than the maximum length, strip mining is employed: the loop is broken into segments, each processed within the maximum vector length. The running time of a strip-mined loop depends on the number of convoys and on the overhead of each strip-mined sequence, comprising the cost of executing the scalar bookkeeping code (\(T_{\mathrm{loop}}\)) and the total vector start-up cost of the convoys (\(T_{\mathrm{start}}\)). The total running time for a vector of length \(n\) is approximately \(T_n = \lceil n / \mathrm{MVL} \rceil \times (T_{\mathrm{loop}} + T_{\mathrm{start}}) + n \times T_{\mathrm{chime}}\), where the overhead terms are compiler- and processor-dependent. Register allocation and instruction scheduling affect both the makeup of each convoy and the start-up overhead.
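The sketch below shows the usual strip-mining structure for a DAXPY-style loop in C; the MVL value, the function name, and the handling of the odd-sized first strip are assumptions of this example rather than code from the report.

```c
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length, matching VMIPS */

/* Strip-mined DAXPY: Y = a*X + Y for an arbitrary length n.
   The inner loop never exceeds MVL elements, so each pass corresponds
   to one setting of the vector-length register. */
void daxpy_strip_mined(size_t n, double a, const double *X, double *Y) {
    size_t low = 0;
    size_t vl = n % MVL;                 /* odd-sized first strip (may be 0) */
    for (size_t j = 0; j <= n / MVL; j++) {
        for (size_t i = low; i < low + vl; i++)  /* would run in vector mode */
            Y[i] = a * X[i] + Y[i];
        low += vl;
        vl = MVL;                        /* all remaining strips use full MVL */
    }
}
```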
The challenge arises when adjacent vector elements are not sequential in memory, introducing the concept of “stride”: the distance in memory between elements to be gathered into a register. A vector-register processor can handle non-unit strides, accessing non-sequential memory locations and reshaping them into a dense structure using vector-load and vector-store operations with stride capability. This ability is a significant advantage over cache-based processors, which inherently deal with unit-stride data. The VMIPS instructions LVWS (load vector with stride) and SVWS (store vector with stride) facilitate working with non-unit strides, allowing efficient data access and storage. However, supporting strides greater than one can lead to memory bank conflicts, where multiple accesses contend for the same bank and cause stalls. The number of memory banks plays a crucial role in preventing conflicts and minimizing stalls. While unit strides benefit from special-case handling in the memory system, non-unit strides may require careful consideration of the bank configuration to avoid conflicts. In modern supercomputers, a multistage switching network is often used to connect memory pipelines to memory banks, but congestion can occur, leading to additional stalls in the memory system.
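A small C sketch of what a strided load amounts to; the function name and the row-major matrix-column example are illustrative assumptions.

```c
#include <stddef.h>

/* LVWS-style access: gather every 'stride'-th double from memory into a
   dense buffer (one element per vector-register slot). Reading a column
   of a row-major N x N matrix corresponds to stride == N.
   With B memory banks, successive accesses can collide in the same bank
   when B / gcd(stride, B) is smaller than the bank busy time, causing
   the stalls described above. */
void load_with_stride(double *dst, const double *base, size_t stride, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        dst[i] = base[i * stride];
}
```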
Chaining enables a vector operation to start as soon as the individual elements of its vector source operand become available, forwarding results directly between functional units. In early implementations, chaining worked like forwarding but placed tight constraints on the timing of the source and destination instructions. Flexible chaining, the modern approach, permits a vector instruction to chain to essentially any active vector instruction; it requires simultaneous access to the same vector register, achieved either by adding more read and write ports or by organizing the vector-register file into interleaved banks. While chaining allows dependent operations in the same convoy to proceed in parallel, reducing the chime count, it does not eliminate start-up overhead, so accurate running-time estimates must count start-up time within and across convoys. Even so, chaining contributes significantly to performance, and modern vector processors universally support flexible chaining.
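A worked timing comparison, using start-up latencies of 7 and 6 cycles and a 64-element vector assumed here only for the sake of the example: without chaining, the second, dependent instruction waits for the first to complete, whereas with chaining it starts as soon as the first result emerges,

\[
T_{\text{unchained}} = 7 + 64 + 6 + 64 = 141 \ \text{cycles}, \qquad
T_{\text{chained}} = 7 + 6 + 64 = 77 \ \text{cycles} .
\]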
Amdahl’s Law indicates that programs with low or moderate levels of vectorization achieve only limited speedup; two common causes of low vectorization are conditionals (IF statements) inside loops and sparse matrices. Vector-mask control, using a Boolean vector of length MVL, enables conditional execution of vector instructions: when the vector-mask register is enabled, operations affect only the elements whose corresponding mask bits are 1, leaving the others unchanged, and clearing the mask sets it to all 1s so that subsequent instructions operate on every element. Vector-mask usage has downsides, however, since instructions still consume execution time for elements whose mask bit is 0. Despite the potential inefficiency, the large performance gap between vector and scalar modes makes vector-mask instructions worthwhile. In some processors the vector mask only disables storing the result, not the operation itself, which can cause problems such as spurious floating-point exceptions.
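The kind of loop that vector-mask control targets looks like the following C sketch; the array names and the non-zero test are assumptions chosen for illustration. The comparison would fill the vector-mask register, and the subtraction would then take effect only for elements whose mask bit is 1.

```c
#include <stddef.h>

/* Conditional element update: only elements with A[i] != 0 are changed.
   On a vector processor, the comparison sets the vector mask and the
   subtraction executes under that mask. */
void masked_subtract(double *A, const double *B, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (A[i] != 0.0)          /* would set mask bit i */
            A[i] = A[i] - B[i];   /* performed only where the mask is 1 */
}
```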
Sparse matrices store vector elements in a compacted form and are often accessed indirectly through index vectors. One common representation uses a bit vector to indicate which elements exist and a dense vector for the non-zero elements. Scatter-gather operations, facilitated by index vectors, aid in transitioning between the dense (compacted) and normal representations. The load vector indexed (LVI) and store vector indexed (SVI) instructions in VMIPS support these functions. This technique enables code with sparse matrices to run in vector mode, either through programmer directives or, with advanced compilers, automatic vectorization with run-time checks for dependences. Scatter-gather on recent supercomputers, although potentially slower than unit-stride access, remains faster than the scalar alternative. Changes in matrix sparsity may require recomputing index vectors, and some processors offer efficient support for this, such as the CVI (create vector index) instruction in VMIPS. Whether the scatter-gather or the conditionally executed version is superior depends on the frequency of the condition and the cost of the operations, with the former excelling when the condition is infrequent or the index vectors can be reused.
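A C sketch of the sparse update that scatter-gather vectorizes; the array and index-vector names are illustrative. LVI would gather A[K[i]] and C[M[i]] into vector registers, the add would run in vector mode, and SVI would scatter the results back.

```c
#include <stddef.h>

/* Sparse update: the index vectors K and M point at the nonzero elements,
   so the arithmetic touches only those positions. */
void sparse_update(double *A, const double *C,
                   const size_t *K, const size_t *M, size_t n) {
    for (size_t i = 0; i < n; i++)
        A[K[i]] = A[K[i]] + C[M[i]];
}
```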
The VMIPS instruction set ensures that vector arithmetic instructions involve only corresponding elements from different vector registers, simplifying the construction of a highly parallel vector unit with multiple lanes. Each lane consists of a segment of the vector-register file and an execution pipeline from each vector functional unit. These lanes operate independently, executing vector instructions at a rate of one element group per cycle. The lack of interlane communication, except for accessing main memory, reduces wiring costs and required register file ports. This design allows current vector supercomputers to achieve high parallelism, completing up to 64 operations per cycle across multiple lanes. Adding lanes is a favored technique for enhancing vector performance, requiring minimal control complexity and no changes to existing machine code.
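A conceptual sketch of the element-to-lane mapping; the four-lane configuration and the program itself are assumptions chosen only to illustrate why element-wise arithmetic needs no inter-lane communication.

```c
#include <stdio.h>

#define NUM_LANES 4  /* assumed lane count for illustration */

/* Element i of every vector register lives in lane i % NUM_LANES, so the
   corresponding elements of different registers always sit in the same
   lane and can be combined without crossing lanes. */
int main(void) {
    for (int i = 0; i < 8; i++)
        printf("element %d -> lane %d, local slot %d\n",
               i, i % NUM_LANES, i / NUM_LANES);
    return 0;
}
```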
Increasing the number of lanes enhances peak performance, but it doesn’t impact start-up latency. To mitigate start-up overhead, overlapping the start of one vector instruction with the completion of the preceding ones becomes crucial. Some vector machines introduce recovery time or dead time between consecutive vector instructions to simplify control logic. Pipelining instruction start-up becomes more intricate with multiple instructions reading and writing the same vector register, especially when some instructions may unpredictably stall, such as a vector load encountering memory bank conflicts. However, in systems with numerous lanes and longer pipeline latencies, enabling fully pipelined instruction start-up becomes increasingly essential.
The success of running a program in vector mode depends on two main factors: the program’s inherent structure and the compiler’s capability. The program’s structure is influenced by the chosen algorithms and coding practices, determining whether loops have true data dependencies. Compiler capability varies in its ability to identify and vectorize loops with parallelism. Hand-optimized versions often show significant gains in vectorization, especially for codes that the compiler struggles to vectorize well. However, the level of vectorization alone doesn’t guarantee performance; alternative vectorization techniques can impact execution efficiency. The comparison of different compilers in vectorizing programs reveals significant variations in their effectiveness. For example, applying two different compilers to the same processor results in different levels of vectorization for various loops.
Results
To assess a vector processor’s performance on a vector problem, we examine both the start-up cost and the sustained rate. The preferred way to convey the performance of a vector processor on a loop is to state the execution time of the vector loop. Alternatively, for vector loops, people often quote a MFLOPS (millions of floating-point operations per second) rating rather than the execution time. The notation \(R_n\) denotes the MFLOPS rating on a vector of length \(n\). Whether one uses measurements like \(T_n\) (time) or \(R_n\) (rate), the two are equivalent as long as the number of FLOPs is agreed upon.
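Assuming \(T_n\) is measured in clock cycles, the two views are related by

\[
R_n \;=\; \frac{\text{FLOPs executed for a vector of length } n}{T_n} \times \text{clock rate} ,
\]

so fixing the FLOP count makes a time measurement and a rate measurement interchangeable.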
For the DAXPY loop on VMIPS, the sequence runs in three chimes, leading to an estimated MFLOPS rate of 2/3 times the clock rate. Performance can be improved by adding vector load-store units, allowing convoys to overlap, and optimizing vector-register allocation; calculating peak performance without start-up overhead yields a rate 1.33 times higher. For the Linpack benchmark, with a vector length of 66, sustained performance is estimated at 202 MFLOPS. \(N_{1/2}\), the vector length giving half of peak performance, is 13, and \(N_v\), the vector length at which vector mode becomes faster than scalar mode, is 2. Memory enhancements, such as adding memory pipelines and enabling overlap, further improve sustained performance, with \(T_{66}\) dropping to 130 clock cycles and yielding roughly a 2.5 times improvement. Examining these performance measures provides insight into the effectiveness of vector processors for specific applications.
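The 2/3 figure follows directly from the chime count: DAXPY performs two floating-point operations per element (one multiply and one add), and three chimes mean roughly three clock cycles per element, so

\[
R_{\text{VMIPS}} \;\approx\; \frac{2 \ \text{FLOPs}}{3 \ \text{clock cycles}} \times \text{clock rate} \;=\; \tfrac{2}{3} \times \text{clock rate} .
\]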
Critical Review
Watch out for the pitfall of focusing on peak performance while overlooking start-up overhead. Memory-memory vector architectures suffer particularly from high start-up overhead, which has contributed to their limited popularity.
Another pitfall is increasing vector performance without corresponding improvements in scalar performance. Good scalar performance is crucial for keeping overhead costs (e.g., strip mining) low and for mitigating the impact of Amdahl’s Law: despite having a higher peak, a vector processor with weak scalar performance can end up slower than a fast scalar processor when overall speed is summarized with the harmonic mean across programs.
Avoid the fallacy of assuming that vector performance can be achieved without providing sufficient memory bandwidth. In reality, codes like DAXPY, with 1.5 memory references per floating-point operation, and many scientific applications are memory-limited. Even if the floating-point operations took zero time, a Cray-1 could not improve the performance of the vector sequence, because it is memory-limited. The Cray-1’s performance on Linpack improved significantly only when the compiler applied clever transformations that reduced the number of memory references per FLOP and kept values in vector registers. This underscores the importance of memory bandwidth and of the ability to reuse values from vector registers, an advantage of vector-register architectures over memory-memory vector architectures.
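The 1.5 figure comes from counting the references in the inner loop: each DAXPY element requires two loads and one store against one multiply and one add,

\[
\frac{\text{memory references}}{\text{FLOP}} \;=\; \frac{2 \ \text{loads} + 1 \ \text{store}}{1 \ \text{multiply} + 1 \ \text{add}} \;=\; \frac{3}{2} \;=\; 1.5 .
\]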
In the 1980s and 1990s, rapid advancements in pipelined scalar processors narrowed the performance gap between traditional vector supercomputers and affordable, fast, pipelined superscalar VLSI microprocessors. By 2002, desktop computers under 1000 dollars surpassed vector supercomputers in CPU clock rate. Although vector supercomputers have lower clock rates, their support for greater parallelism through multiple lanes distinguishes them. The fastest microprocessors approach the peak floating-point performance of leading vector supercomputers, but high clock rates don’t guarantee sustained application performance. The key difference lies in main memory bandwidth, where vector supercomputers outshine microprocessors, sustaining around 50 GB/sec per CPU compared to 1 GB/sec for microprocessors.
Vector supercomputers used SRAM as main memory to reduce memory bank needs and vector start-up penalties. However, due to cost, modern vector supercomputers shifted to DRAM, leveraging higher-bandwidth interfaces like synchronous DRAM. They also adapted commodity technology, incorporating vector data caches for improved price-performance. These vector caches are designed to maintain high main memory bandwidth despite cache misses. Vector supercomputers now utilize CMOS technology, shared with microprocessors, offering competitive transistor performance, greater density, and reduced power dissipation. While microprocessors incorporate vector extensions for multimedia applications, newer extensions like AltiVec and SSE2 enhance vector length and compiler support.
In summary, the vector supercomputers of today adapt commodity technology, incorporate vector data caches, and use CMOS technology, while microprocessors extend support for vector operations, simplifying performance boosting for certain applications.
References
- D. A. Patterson and J. L. Hennessy, Computer Architecture: A Quantitative Approach. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1990.