9-4: remove periods at the end of section names
dankamongmen committed Sep 14, 2024
1 parent 35e3eec commit 0413f63
Showing 1 changed file with 12 additions and 12 deletions.
24 changes: 12 additions & 12 deletions chapters/9-Optimizing-Computations/9-4 Vectorization.md
@@ -2,27 +2,27 @@

## Vectorization {#sec:Vectorization}

On modern processors, the use of SIMD instructions can result in a great speedup over regular un-vectorized (scalar) code. When doing performance analysis, one of the top priorities of the software engineer is to ensure that the hot parts of the code are vectorized. This section guides engineers toward discovering vectorization opportunities. For a recap on the SIMD capabilities of modern CPUs, readers can take a look at [@sec:SIMD].
On modern processors, the use of SIMD instructions can result in a great speedup over regular un-vectorized (scalar) code. When doing performance analysis, one of the top priorities of the software engineer is to ensure that the hot parts of the code are vectorized. This section guides engineers toward discovering vectorization opportunities. For a recap of the SIMD capabilities of modern CPUs, readers can take a look at [@sec:SIMD].

Often vectorization happens automatically without any user intervention, this is called autovectorization. In such a situation, a compiler automatically recognizes the opportunity to produce SIMD machine code from the source code. Autovectorization could be a convenient solution because modern compilers generate fast vectorized code for a wide variety of programs.
Often vectorization happens automatically without any user intervention; this is called autovectorization. In such a situation, a compiler automatically recognizes the opportunity to produce SIMD machine code from the source code. Autovectorization could be a convenient solution because modern compilers generate fast vectorized code for a wide variety of programs.

However, in some cases, auto-vectorization does not succeed without intervention by the software engineer, perhaps based on feedback[^2] they get from, say, compiler optimization reports or profiling data. In such cases, programmers need to tell the compiler that a particular code region is vectorizable or that vectorization is profitable. Modern compilers have extensions that allow power users to control the auto-vectorization process and make sure that certain parts of the code are vectorized efficiently. However, this control is limited. We will provide several examples of using compiler hints in the subsequent sections.
However, in some cases, autovectorization does not succeed without intervention by the software engineer, perhaps based on feedback[^2] they get from, say, compiler optimization reports or profiling data. In such cases, programmers need to tell the compiler that a particular code region is vectorizable or that vectorization is profitable. Modern compilers have extensions that allow power users to control the autovectorization process and make sure that certain parts of the code are vectorized efficiently. However, this control is limited. We will provide several examples of using compiler hints in the subsequent sections.

It is important to note that there is a range of problems where SIMD is important and where auto-vectorization just does not work and is not likely to work in the near future. One example can be found in [@Mula_Lemire_2019]. Outer loop autovectorization is not currently attempted by compilers. They are less likely to vectorize floating-point code because results will differ numerically. Code involving permutations or shuffles across vector lanes is also less likely to auto-vectorize, and this is likely to remain difficult for compilers.
It is important to note that there is a range of problems where SIMD is important and where autovectorization just does not work and is not likely to work in the near future. One example can be found in [@Mula_Lemire_2019]. Outer loop autovectorization is not currently attempted by compilers. They are less likely to vectorize floating-point code because results will differ numerically. Code involving permutations or shuffles across vector lanes is also less likely to autovectorize, and this is likely to remain difficult for compilers.

There is one more subtle problem with autovectorization. As compilers evolve, optimizations that they make are changing. The successful auto-vectorization of code that was done in the previous compiler version may stop working in the next version, or vice versa. Also, during code maintenance or refactoring, the structure of the code may change, such that autovectorization suddenly starts failing. This may occur long after the original software was written, so it would be more expensive to fix or redo the implementation at this point.
There is one more subtle problem with autovectorization. As compilers evolve, optimizations that they make are changing. The successful autovectorization of code that was done in the previous compiler version may stop working in the next version, or vice versa. Also, during code maintenance or refactoring, the structure of the code may change, such that autovectorization suddenly starts failing. This may occur long after the original software was written, so it would be more expensive to fix or redo the implementation at this point.

When it is absolutely necessary to generate specific assembly instructions, one should not rely on compiler autovectorization. In such cases, code can instead be written using compiler intrinsics, which we will discuss in [@sec:secIntrinsics]. In most cases, compiler intrinsics provide a 1-to-1 mapping to assembly instructions. Intrinsics are somewhat easier to use than inline assembly because the compiler takes care of register allocation, and they allow the programmer to retain considerable control over code generation. However, they are still often verbose and difficult to read and subject to behavioral differences or even bugs in various compilers.

For a middle path between low-effort but unpredictable autovectorization, and verbose/unreadable but predictable intrinsics, one can use a wrapper library around intrinsics. These tend to be more readable, can centralize compiler fixes in a library as opposed to scattering workarounds in user code, and still allow developers control over the generated code. Many such libraries exist, differing in their coverage of recent or 'exotic' operations, and the number of platforms they support. To our knowledge, Highway is currently the only one that fully supports scalable vectors as seen in the SVE and RISC-V V instruction sets. Note that one of the authors is the tech lead for this library. It will be introduced in [@sec:secIntrinsics].
For a middle path between low-effort but unpredictable autovectorization, and verbose/unreadable but predictable intrinsics, one can use a wrapper library around intrinsics. These tend to be more readable, can centralize compiler fixes in a library as opposed to scattering workarounds in user code, and still allow developers control over the generated code. Many such libraries exist, differing in their coverage of recent or "exotic" operations, and the number of platforms they support. To our knowledge, Highway is currently the only one that fully supports scalable vectors as seen in the SVE and RISC-V V instruction sets. Note that one of the authors is the tech lead for this library. It will be introduced in [@sec:secIntrinsics].

Note that when using intrinsics or a wrapper library, it is still advisable to write the initial implementation using C++. This allows rapid prototyping and verification of correctness, by comparing the results of the original code against the new vectorized implementation.

In the remainder of this section, we will discuss several of these approaches, especially inner loop vectorization because it is the most common type of autovectorization. The other two types, outer loop vectorization, and SLP (Superword-Level Parallelism) vectorization, are mentioned in Appendix B.

### Compiler Auto-Vectorization.
### Compiler Autovectorization

Multiple hurdles can prevent auto-vectorization, some of which are inherent to the semantics of programming languages. For example, the compiler must assume that unsigned loop indices may overflow, and this can prevent certain loop transformations. Another example is the assumption that the C programming language makes: pointers in the program may point to overlapping memory regions, which can make the analysis of the program very difficult. Another major hurdle is the design of the processor itself. In some cases, processors don’t have efficient vector instructions for certain operations. For example, predicated (bitmask-controlled) load and store operations are not available on most processors. Another example is vector-wide format conversion from signed integers to doubles, because the source and the result occupy vector registers of different sizes. Despite these challenges, the software developer can work around many of them and enable vectorization. Later in this section, we provide guidance on how to work with the compiler and ensure that the hot code is vectorized.
Multiple hurdles can prevent autovectorization, some of which are inherent to the semantics of programming languages. For example, the compiler must assume that unsigned loop indices may overflow, and this can prevent certain loop transformations. Another example is the assumption that the C programming language makes: pointers in the program may point to overlapping memory regions, which can make the analysis of the program very difficult. Another major hurdle is the design of the processor itself. In some cases, processors don’t have efficient vector instructions for certain operations. For example, predicated (bitmask-controlled) load and store operations are not available on most processors. Another example is vector-wide format conversion from signed integers to doubles, because the source and the result occupy vector registers of different sizes. Despite these challenges, the software developer can work around many of them and enable vectorization. Later in this section, we provide guidance on how to work with the compiler and ensure that the hot code is vectorized.
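
To make the pointer-aliasing hurdle concrete, here is a minimal sketch that is not part of the original chapter. The function names `scaleMayAlias` and `scaleNoAlias` are hypothetical, and `__restrict` is a compiler extension (accepted by GCC, Clang, and MSVC) rather than standard C++. Whether a given compiler actually vectorizes either loop depends on its version and flags.

```cpp
#include <cstddef>

// `dst` and `src` may overlap as far as the compiler knows, so it must
// either emit a runtime overlap check or keep the loop scalar.
void scaleMayAlias(float *dst, const float *src, std::size_t n, float k) {
  for (std::size_t i = 0; i < n; ++i)
    dst[i] = src[i] * k;
}

// `__restrict` is the programmer's promise that the two arrays do not
// overlap, which removes the aliasing obstacle for the vectorizer.
// Calling this function with overlapping arrays is undefined behavior.
void scaleNoAlias(float *__restrict dst, const float *__restrict src,
                  std::size_t n, float k) {
  for (std::size_t i = 0; i < n; ++i)
    dst[i] = src[i] * k;
}
```

Both functions compute the same result for non-overlapping inputs; the only difference is how much the compiler is allowed to assume.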

The vectorizer is usually structured in three phases: legality-check, profitability-check, and transformation itself:

@@ -36,13 +36,13 @@ The vectorizer is usually structured in three phases: legality-check, profitabil

[Amdahl's law](https://en.wikipedia.org/wiki/Amdahl's_law)[^6] teaches us that we should spend time analyzing only those parts of code that are used the most during the execution of a program. Thus, performance engineers should focus on hot parts of the code that were highlighted by a profiling tool. As mentioned earlier, vectorization is most frequently applied to loops.

Discovering opportunities for improving vectorization should start by analyzing hot loops in the program and checking what optimizations were performed by the compiler. Checking compiler vectorization remarks (see [@sec:compilerOptReports]) is the easiest way to know that. Modern compilers can report whether a certain loop was vectorized, and provide additional details, e.g., vectorization factor (VF). In the case when the compiler cannot vectorize a loop, it is also able to tell the reason why it failed.
Discovering opportunities for improving vectorization should start by analyzing hot loops in the program and checking what optimizations were performed by the compiler. Checking compiler vectorization reports (see [@sec:compilerOptReports]) is the easiest way to know that. Modern compilers can report whether a certain loop was vectorized, and provide additional details, e.g., vectorization factor (VF). In the case when the compiler cannot vectorize a loop, it is also able to tell the reason why it failed.
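
As an illustration of requesting such reports (a sketch, not taken from the original chapter), the loop below can be compiled with the flags shown in the comments. The file name `saxpy.cpp` and the function `saxpy` are hypothetical. The flags themselves are the Clang `-Rpass` family and the GCC `-fopt-info-vec` family; the wording of the remarks they print varies between compiler versions, so no sample output is shown.

```cpp
// saxpy.cpp (hypothetical file name)
//
// Clang: clang++ -O3 -c saxpy.cpp -Rpass=loop-vectorize \
//          -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize
// GCC:   g++ -O3 -c saxpy.cpp -fopt-info-vec-optimized -fopt-info-vec-missed
#include <cstddef>

// A simple inner loop that is a typical autovectorization candidate.
void saxpy(float a, const float *x, float *y, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}
```

With Clang, `-Rpass` reports loops that were vectorized, `-Rpass-missed` reports loops that were not, and `-Rpass-analysis` adds the reason when one is available.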

An alternative to using compiler optimization reports is to check the assembly output. It is best to analyze the output from a profiling tool that shows the correspondence between the source code and generated assembly instructions for a given loop. That way you only focus on the code that matters, i.e., the hot code. However, understanding assembly language is much more difficult than understanding a high-level language like C++. It may take some time to figure out the semantics of the instructions generated by the compiler. However, this skill is highly rewarding and often provides valuable insights. Experienced developers can quickly tell whether the code was vectorized or not just by looking at instruction mnemonics and the register names used by those instructions. For example, in x86 ISA, vector instructions operate on packed data (thus have `P` in their name) and use `XMM`, `YMM`, or `ZMM` registers, e.g., `VMULPS XMM1, XMM2, XMM3` multiplies four single-precision floats in `XMM2` and `XMM3` and saves the result in `XMM1`. But be careful: people often conclude from seeing an `XMM` register being used that it is vector code -- not necessary. For instance, the `VMULSS` instruction will only multiply one single-precision floating-point value, not four.
An alternative to using compiler optimization reports is to check the assembly output. It is best to analyze the output from a profiling tool that shows the correspondence between the source code and generated assembly instructions for a given loop. That way you only focus on the code that matters, i.e., the hot code. However, understanding assembly language is much more difficult than understanding a high-level language like C++. It may take some time to figure out the semantics of the instructions generated by the compiler. However, this skill is highly rewarding and often provides valuable insights. Experienced developers can quickly tell whether the code was vectorized or not just by looking at instruction mnemonics and the register names used by those instructions. For example, in x86 ISA, vector instructions operate on packed data (thus have `P` in their name) and use `XMM`, `YMM`, or `ZMM` registers, e.g., `VMULPS XMM1, XMM2, XMM3` multiplies four single-precision floats in `XMM2` and `XMM3` and saves the result in `XMM1`. But be careful: people often conclude from seeing an `XMM` register being used that it is vector code---not necessarily. For instance, the `VMULSS` instruction will only multiply one single-precision floating-point value, not four.

There are a few common cases that developers frequently run into when trying to accelerate vectorizable code. Below we present four typical scenarios and give general guidance on how to proceed in each case.

#### Vectorization Is Illegal.
#### Vectorization Is Illegal

In some cases, the code that iterates over elements of an array is simply not vectorizable. Vectorization remarks are very effective at explaining what went wrong and why the compiler can’t vectorize the code. [@lst:VectDep] shows an example of dependence inside a loop that prevents vectorization.[^31]

@@ -55,7 +55,7 @@ void vectorDependence(int *A, int n) {
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
While some loops cannot be vectorized due to the hard limitations described above, others could be vectorized when certain constraints are relaxed. There are situations when the compiler cannot vectorize a loop because it simply cannot prove it is legal to do so. Compilers are generally very conservative and only do transformations when they are sure it doesn't break the code. Such soft limitations could be relaxed by providing additional hints to the compiler. For example, when transforming the code that performs floating-point arithmetic, vectorization may change the behavior of the program. The floating-point addition and multiplication are commutative, which means that you can swap the left-hand side and the right-hand side without changing the result: `(a + b == b + a)`. However, these operations are not associative, because rounding happens at different times: `((a + b) + c) != (a + (b + c))`. The code in [@lst:VectIllegal] cannot be auto-vectorized by the compiler. The reason is that vectorization would change the variable sum into a vector accumulator, and this will change the order of operations and may lead to different rounding decisions and a different result.
While some loops cannot be vectorized due to the hard limitations described above, others could be vectorized when certain constraints are relaxed. There are situations when the compiler cannot vectorize a loop because it simply cannot prove it is legal to do so. Compilers are generally very conservative and only do transformations when they are sure it doesn't break the code. Such soft limitations could be relaxed by providing additional hints to the compiler. For example, when transforming the code that performs floating-point arithmetic, vectorization may change the behavior of the program. The floating-point addition and multiplication are commutative, which means that you can swap the left-hand side and the right-hand side without changing the result: `(a + b == b + a)`. However, these operations are not associative, because rounding happens at different times: `((a + b) + c) != (a + (b + c))`. The code in [@lst:VectIllegal] cannot be autovectorized by the compiler. The reason is that vectorization would change the variable sum into a vector accumulator, and this will change the order of operations and may lead to different rounding decisions and a different result.
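
The effect of reordering can be reproduced without any SIMD instructions. The following is a minimal sketch, separate from the book's own listing: `sumScalar` and `sumFourLanes` are hypothetical names, and the second function only emulates in plain C++ what a 4-lane vector accumulator would do. It assumes IEEE 754 single-precision arithmetic and a build without relaxed floating-point flags such as `-ffast-math`.

```cpp
#include <cstddef>

// Strict left-to-right summation: the order the scalar loop uses.
float sumScalar(const float *a, std::size_t n) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < n; ++i)
    sum += a[i];
  return sum;
}

// Emulates a 4-lane vector accumulator: lane j sums elements j, j+4, j+8, ...
// and the lanes are combined only at the end.
// Assumes n is a multiple of 4 to keep the sketch short.
float sumFourLanes(const float *a, std::size_t n) {
  float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
  for (std::size_t i = 0; i < n; i += 4)
    for (std::size_t j = 0; j < 4; ++j)
      lane[j] += a[i + j];
  return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

For the input `{1e8f, 1.0f, -1e8f, 1.0f}`, the scalar order yields `1.0f` and the four-lane order yields `0.0f`, while the mathematically exact sum is 2. Neither order is "right"; they simply round at different points, which is why the compiler will not switch between them without permission.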
Listing: Vectorization: floating-point arithmetic.