
Which instructions can be paired:

  • Loads and stores (only simple ones)
  • Binary operators
  • Intrinsics (sqrt, pow, powi, sin, cos, log, log2, log10, exp, exp2, fma)
  • Casts (for non-pointer types)
  • Insert- and extract-element operations 

Loop Vectorizer

  • widens instructions in loops to operate on multiple consecutive iterations

  • enabled by default

  • complex loops can be vectorized
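
A minimal sketch of the kind of loop this targets. The function name `add_arrays` is hypothetical and not taken from any benchmark mentioned here. Every iteration is independent of the others, so several consecutive iterations could be computed by one vector instruction. Whether a given compiler actually does so depends on the target and the flags used.

```cpp
#include <cstddef>

// Hypothetical example: element-wise addition of two arrays.
// Iteration i only touches a[i], b[i] and out[i], so a vectorizer
// may process several consecutive values of i at once.
void add_arrays(const float *a, const float *b, float *out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```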

Different loops

  • can vectorize loops whose start and end points are unknown at compile time, including loops that count backwards

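  • runtime pointer checks, i.e. alias analysis performed at run time: when the compiler cannot prove that pointers do not overlap, the check is made while the program runs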

  • reduction support: for example, summing or multiplying all elements into one variable is split across temporary variables

  • inductions: the loop variable can be used normally inside the loop body

  • if statements: control flow is allowed inside the innermost loop

  • also works with C++ loops that iterate over pointers

  • Math functions

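  • min/max patterns that collect the minimum or maximum value

  • can accept small floating-point errors in exchange for speed

  • can also handle the following, although in many cases vectorizing them is not beneficial

    • non-consecutive memory accesses that become sequences of scalar instructions, e.g. A[i] += B[i * 4];

    • type conversions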
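
The loop shapes listed above can be sketched as follows. All function names are hypothetical and chosen only for illustration. The code shows the patterns, not what any particular compiler version will do with them.

```cpp
#include <cstddef>

// Reduction: the single accumulator can be split into several
// partial sums that are combined after the loop.
int sum(const int *a, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += a[i];
    return total;
}

// Backwards loop whose bounds are only known at run time.
void increment_reverse(int *a, int start, int end) {
    for (int i = end - 1; i >= start; --i)
        a[i] += 1;
}

// If statement inside the innermost loop.
int sum_greater(const int *a, const int *b, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (a[i] > b[i])
            total += a[i] + 5;
    return total;
}

// Min/max pattern: collects the largest value (assumes n > 0).
int max_value(const int *a, std::size_t n) {
    int best = a[0];
    for (std::size_t i = 1; i < n; ++i)
        best = a[i] > best ? a[i] : best;
    return best;
}

// C++ style loop that iterates with pointers instead of an index.
int sum_range(const int *first, const int *last) {
    int total = 0;
    for (const int *it = first; it != last; ++it)
        total += *it;
    return total;
}

// Non-consecutive access: b is read with a stride of 4, so b must
// hold at least 4 * (n - 1) + 1 elements.
void strided_add(int *a, const int *b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        a[i] += b[i * 4];
}
```

In `increment_reverse` and `strided_add` the pointers may alias other data, which is the situation where a runtime pointer check would be needed.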


This section shows the execution time of Clang on a simple benchmark: gcc-loops. This benchmark is a collection of loops from the GCC autovectorization page by Dorit Nuzman.

The chart below compares GCC-4.7, ICC-13, and Clang-SVN with and without loop vectorization at -O3, tuned for “corei7-avx”, running on a Sandy Bridge iMac. The Y-axis shows the time in milliseconds; lower is better. The last column shows the geometric mean of all the kernels.


Linpack-pc results with the same configuration. Results are in Mflops; higher is better.

Three components are applied to every loop

  • Loop vectorization legality

    • checks whether vectorizing is allowed, i.e. that there are no problematic dependencies etc.

    • checks whether the loop is an innermost loop

  • Cost modeling

    • checks whether vectorizing is worthwhile

    • relies on cost tables and estimates, which are sometimes wrong

  • Vectorization

    • performs the actual transformation
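
To illustrate what the legality step looks for, here is a minimal sketch with two hypothetical functions. In `prefix_sum` every iteration reads the value written by the previous one, which is the kind of problematic dependency a legality check would be expected to reject. In `scale` the iterations are independent, so legality is not an obstacle and the decision falls to the cost model.

```cpp
#include <cstddef>

// Loop-carried dependency: a[i] needs the updated a[i - 1], so
// consecutive iterations cannot simply be executed side by side.
void prefix_sum(int *a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i)
        a[i] += a[i - 1];
}

// Independent iterations: each a[i] is read and written on its own.
void scale(int *a, std::size_t n, int factor) {
    for (std::size_t i = 0; i < n; ++i)
        a[i] *= factor;
}
```

With Clang, the optimization remark options `-Rpass=loop-vectorize` and `-Rpass-missed=loop-vectorize` can be used to see which loops were vectorized and which were not.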


SLP Vectorizer (superword-level parallelism) 

  • merges scalars found in the code into vectors

  • enabled by default

  • combines similar independent instructions into vector instructions

  • handles memory accesses, arithmetic operations, comparisons and PHI nodes

  • processes the code bottom-up, across basic blocks
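
A minimal sketch of straight-line code with the shape described above. The function name `store_products` and its parameters are hypothetical. The four statements are independent, perform the same operations, and store to adjacent elements of `A`, so they are candidates for being merged into vector instructions. There is no loop involved.

```cpp
// Four similar, independent scalar computations whose results are
// stored to consecutive memory locations A[0]..A[3].
void store_products(int a1, int a2, int b1, int b2, int *A) {
    A[0] = a1 * (a1 + b1);
    A[1] = a2 * (a2 + b2);
    A[2] = a1 * (a1 + b1);
    A[3] = a2 * (a2 + b2);
}
```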



