The myth of auto-SIMD

As someone whose Ph.D. work was on auto-parallelization and who spent five years developing the auto-SIMDization feature of the xlc compiler for POWER processors, I have seen my view on auto-SIMDization turn 180 degrees over the years. Yes, my five years on auto-SIMD were among my most productive and the best collaboration experience ever. But nowadays, every time someone brings up auto-SIMD as the solution to the programmability difficulty of SIMD, I can’t help but shoot back.

Auto-SIMD is the holy grail of SIMD programming models. Everybody wants it: programmers, executives, program managers, academics, compiler designers (myself included). The problem is that the perceived capability of auto-SIMD is quite different from the realistic capability of an auto-SIMD compiler. To put it bluntly, auto-SIMD compilers rarely work when applied to real code. Many times, compiler users came to us with a piece of code that the compiler could not parallelize. Sometimes the code is simple to human eyes but complicated for the compiler because of aliasing and unknown side effects through function calls. Sometimes the loop is parallel but gets “messed up” by internal compiler transformations that confuse the SIMD analysis. This is no fault of the compiler. Extracting a parallel loop from a sequential program is simply the wrong task to give a compiler. Think about it: how many times do you rely on auto-parallelization to produce parallel code? Probably never. The same is true for auto-SIMD.
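
To make the aliasing point concrete, here is a minimal sketch in C. The function names add_one and add_one_noalias are hypothetical and used only for illustration. The first loop looks trivially parallel to a human, but nothing in its signature rules out dst and src overlapping, and when they do overlap the sequential result depends on the order of iterations. The second version uses the C99 restrict qualifier to state the no-overlap assumption explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Adds 1.0f to every element of src and stores the result in dst.
 * The signature does not say whether dst and src overlap, so a compiler
 * must either prove they do not or keep the sequential order. */
void add_one(float *dst, const float *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] + 1.0f;
}

/* Same loop body, but restrict is the programmer's promise that the two
 * arrays do not overlap. Calling this with overlapping arrays is
 * undefined behavior. */
void add_one_noalias(float *restrict dst, const float *restrict src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] + 1.0f;
}
```

With an array a of five zeros, the call add_one(a + 1, a, 4) makes each iteration read the value the previous iteration wrote, so the elements end up as 0, 1, 2, 3, 4. Processing four elements at once would instead produce 0, 1, 1, 1, 1, which is why the compiler cannot blindly use SIMD here.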

There are times when the compiler can indeed SIMDize a loop, but the loop is often so simple (e.g., matmul with all global side effects known) and the amount of compiler analysis required so large (e.g., inter-procedural analysis) that it is much easier for the programmer to indicate the SIMDizable region to the compiler through a programming interface (e.g., OpenMP SIMD directives).
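
As a rough illustration of what indicating the SIMDizable region looks like, here is a small C sketch using the OpenMP 4.0 simd construct. The function name saxpy_simd is hypothetical. The pragma is the programmer's assertion that the iterations are independent, so the compiler does not have to prove it; if x and y actually overlapped in a dependent way, the assertion would be wrong and the result undefined.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for i in [0, n).
 * The pragma tells the compiler that the iterations may be executed
 * together in SIMD fashion; the caller is responsible for making sure
 * x and y do not overlap. */
void saxpy_simd(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

With GCC or Clang the directive is typically honored when compiling with -fopenmp or -fopenmp-simd. Without such a flag the pragma is ignored and the loop still computes the same values, just without the programmer's guarantee reaching the vectorizer.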

I have seen it so many times: decision makers embrace auto-SIMD because it sounds like the best solution to the problem; compiler practitioners declare victory after SIMDizing a few self-selected kernels and publishing the initial results; and users give up on the feature after a few frustrating tries.

I will never forget the three questions that my boss often asks about a new research idea: 1) Does it work? 2) What does it do for me? 3) When is it available? Auto-SIMD fails the very first test.


4 thoughts on “The myth of auto-SIMD”

    Matt said:
    January 21, 2014 at 5:33 am

    One question I always have when the weakness of auto-parallel systems is raised: how much could we improve the situation through better languages[1]: 1) Making the language easier for the compiler to analyze 2) Making it more obvious to the user where vectorization may or may not be possible.

    Array languages are sort of an extreme here, but I wonder if there’s a middle ground to be found.

    [1]: I say this as someone who has never really looked terribly deeply into language design, but instead has learned what rough edges exist through the cutting difficulty of analysis.

      pengwu responded:
      January 21, 2014 at 2:20 pm

      I am for 1) explicit parallel programming models, which could be as simple as an OMP-like pragma approach to indicate SIMD regions or a slightly more evolved new language dialect like CUDA or OpenCL; or 2) domain-specific languages such as matrix/array languages with restrictions on aliasing rules that make it relatively easy to analyze parallelism.

      1) is already happening. There are many budding explicit parallel programming models for SIMD: the OMP 4.0 SIMD construct, OpenACC, OpenCL, CUDA, and ISPC, to name a few. The syntax may differ but the principles are similar, i.e., let programmers indicate parallelism and write a parallel program, and leave the mundane task of mapping parallel code to SIMD hardware to the compiler. I believe this is the right programming interface for both programmers and compilers to do what they are each good at.

    Jianbin Fang said:
    April 28, 2014 at 8:19 pm

    Hi Peng,

    This is Jianbin Fang, a PhD student from Delft University of Technology, the Netherlands. I do agree with your comments. There is a minor issue. That is, although OpenCL specifies parallelism explicitly, it also has vector data types. Do you think these vector types are a bit redundant?


      pengwu responded:
      May 2, 2014 at 3:26 pm

      Vector data types specify both parallelism (i.e., all elements of the vector can be processed together) and how to lay out data (i.e., all elements involved in parallel computation are packed into one vector), whereas traditional OpenCL parallelism does not cover the data layout aspect of parallelism.

      So for certain workloads that care very much about SIMD performance and on SIMD platforms whose performance is very sensitive to data layout (e.g., alignment and stride-one access on limited width SIMD vectors), using explicit vector programming may still be necessary.

      Looking ahead, SIMD hardware is moving more and more towards tolerating imperfect data layout and even some divergent control flow. With more mature SPMD-on-SIMD compilers, fewer people are likely to use the vector construct; instead they will get reasonable performance from compilers that can map parallelism to SIMD and multiple cores at the same time. Also, explicit vector code may not map efficiently to later SIMD architectures that support wider vectors.

      I believe Intel’s OpenCL compiler since v1.5 maps vectors back to parallelism and uses SPMD-on-SIMD techniques (so-called implicit vectorization) to generate SIMD and/or multi-threaded code.
