From CELL BE to GPGPU
To give some history, I started working on the CELL BE (the SIMD engine behind PlayStation 3) right after joining IBM in 2001. For the next five years, I worked in a research team that designed the intrinsics interfaces for CELL BE, added an automatic vectorization feature to the XLC compiler, and prototyped an OpenMP-based single-source compiler for CELL BE (with the lofty goal of writing one program that could be compiled to run on both the host and the SIMD accelerator). I left SIMD in 2006 to pursue other research interests and, to be honest, thought at the time that there was nothing major left to be done in SIMD.
I came back to SIMD in 2012, after ignoring all the research hype around GPGPUs for many years, only to find SIMD engines at the core of every GPGPU. If you had asked me in 2007 whether it was possible to support a programming model like CUDA on a SIMD engine, I would have said it was impossible and backed that up with our experience of building the single-source compiler for CELL, which shared the same vision as CUDA.
So why did CUDA succeed where single-source did not? I believe the key difference is in the hardware. The CELL SIMD design had many constraints: for example, memory accesses had to be aligned and stride-one, and there was no support for control flow. As a result, many common language constructs could not be mapped efficiently to CELL SIMD. GPGPUs, on the other hand, relaxed the key constraints of traditional SIMD hardware:
- gather/scatter support: allows arbitrary loads/stores in SIMD code
- predication or divergent branch support: allows diverging control-flow in SIMD code
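To make the two features above concrete, here is a minimal sketch that emulates one 4-lane SIMD unit with plain Python lists. It is an illustration only, not how any particular hardware or compiler implements these operations; the helper names (gather, scatter, predicated_select) and the 4-lane width are assumptions made up for this example.

```python
# Scalar emulation of one 4-lane SIMD unit. All names here are hypothetical.
LANES = 4

def gather(memory, indices, mask):
    """Each active lane loads from its own, arbitrary address."""
    return [memory[i] if m else 0 for i, m in zip(indices, mask)]

def scatter(memory, indices, values, mask):
    """Each active lane stores to its own, arbitrary address."""
    for i, v, m in zip(indices, values, mask):
        if m:
            memory[i] = v

def predicated_select(mask, then_vals, else_vals):
    """Both sides are computed for every lane; the mask picks one per lane."""
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

# Scalar source being emulated, once per lane:
#   x = a[idx[lane]]
#   if x % 2 == 0: x = x * 10
#   else:          x = x + 1
#   out[idx[lane]] = x
a = [10, 11, 12, 13, 14, 15, 16, 17]
idx = [7, 0, 3, 4]            # neither aligned nor stride-one
all_on = [True] * LANES

x = gather(a, idx, all_on)
is_even = [v % 2 == 0 for v in x]
result = predicated_select(is_even,
                           [v * 10 for v in x],   # "then" side, all lanes
                           [v + 1 for v in x])    # "else" side, all lanes
out = [0] * len(a)
scatter(out, idx, result, all_on)
```

With aligned, stride-one accesses only, the indexed load and store in this loop would have no direct SIMD equivalent, and without a per-lane mask the if/else could not be expressed in a single instruction stream.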
From traditional SIMD to SIMT
In the chart above, we divide SIMD architectures into two broad categories.
- To the left is the traditional SIMD architecture, which provides little support for arbitrary memory access (e.g., lack of gather/scatter capability) and no support for control-flow divergence.
- To the right is the new era of SIMD. Let’s call it SIMT (Single-Instruction-Multiple-Thread), a term coined by NVidia. The key hardware features of SIMT are gather/scatter and support for control-flow divergence. The latter can be provided via divergent branches (NVidia) or predication (Xeon Phi).
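As a rough illustration of what a divergent branch means for a group of lanes, the sketch below runs each side of a branch only for the lanes that take it, and skips a side entirely when no lane takes it. With pure predication, by contrast, both sides are typically issued regardless. This is a simplified model under my own assumptions (the function name run_divergent and the path counter are hypothetical); real hardware uses more involved reconvergence mechanisms.

```python
def run_divergent(values):
    """Emulate a divergent branch over one group of lanes.

    Scalar source per lane:  if v < 0: v = -v  else: v = v * 2
    Returns the per-lane results and how many branch paths were executed.
    """
    taken = [v < 0 for v in values]
    results = list(values)
    paths_executed = 0

    if any(taken):                      # at least one lane takes the "then" path
        paths_executed += 1
        for lane, t in enumerate(taken):
            if t:
                results[lane] = -values[lane]

    if not all(taken):                  # at least one lane takes the "else" path
        paths_executed += 1
        for lane, t in enumerate(taken):
            if not t:
                results[lane] = values[lane] * 2

    return results, paths_executed

# Lanes disagree: both paths run one after the other, each with some lanes idle.
mixed, mixed_paths = run_divergent([-1, 2, -3, 4])

# Lanes agree: only one path runs, so there is no divergence cost.
uniform, uniform_paths = run_divergent([1, 2, 3, 4])
```

In this model the cost of a branch depends on the data: it is paid only when the lanes of a group actually disagree.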
A final note on the chart is the trend line showing where the industry is going. While GPGPUs are rooted squarely in the SIMT domain today, Intel’s SIMD is gradually shifting towards SIMT in both hardware and software tools. I predict that in five years’ time, software support for SIMT will be mature enough that SIMT will become the mainstream programming model for SIMD.