With the maturity of general-purpose programming languages (e.g., C/C++, Java) and the availability of open-source compilers (e.g., gcc and LLVM), the traditional role of a compiler expert has become increasingly narrow and incremental.
Are we (the “compiler experts”) working ourselves out of relevance? Yes and no. Yes, because hard-core optimizing-compiler work is becoming increasingly rare at academic conferences; it is no longer at the frontier of what the field considers hard problems. No, because if we take a fresh look at our field with a more open mind, we will discover a much bigger space of productivity- and performance-optimization roles that are well suited to our expertise.
COMPILATION, A BY-PRODUCT OF COMPUTING ABSTRACTION
Let’s first step back and look at the origin of compilation. The figure to the left shows what programming was like more than half a century ago. At the time, programming was done by punching binary instructions directly onto cards. Out of the pain of punch-card programming arose the need for layers of abstraction. So remember the origin of compilation: it is really a by-product of computing abstractions.
As shown in the picture below, three levels of abstraction have emerged for mapping algorithms/applications all the way down to physical hardware.
- Hardware-centric abstraction. Assembly language emerged first to provide abstractions of the hardware, e.g., mnemonics and symbol resolution.
- Programming-centric abstraction. High-level programming languages (FORTRAN, C/C++, Java) emerged next to provide abstractions of programming and computation, such as structured control flow (as opposed to branch instructions in assembly), modularity (function calls, object-oriented programming), and abstractions of data (e.g., type systems, scoping).
- Application-/domain-centric abstraction. More recent trends focus on abstractions of application domains. Such abstractions often take the form of a programming framework (e.g., map-reduce for the big-data domain, Django/Ruby on Rails for web servers). Sometimes the abstraction is a domain-specific language (e.g., R for data analysis, or array languages for dense matrix computation).
Let us not forget that compilers, assemblers, frameworks, and runtimes are simply by-products of abstractions. Abstractions come first!
THE GOLDEN AGE OF DOMAIN-SPECIFIC ABSTRACTIONS
The past 50 years have seen rapid advances and growing maturity in the 1st-level (hardware-specific, especially for general-purpose, homogeneous systems) and 2nd-level (programming-specific, especially for statically typed languages) abstractions. A few remaining challenges at these two abstraction levels include:
- Abstraction for systems of hardware (especially heterogeneous systems) and its implementations. Some of our current ways of programming heterogeneous systems are not much better than the “punch card” programming model of the past. Sometimes the abstractions are defined but the implementations are nowhere near satisfactory.
Most important of all, I believe this is the golden age of domain-specific abstractions (the 3rd level of abstraction). Take big data as an example: the last decade has seen its programming model evolve from “punch card” programming for distributed systems to a rich space of big-data programming models (e.g., graph runtimes, map-reduce, key-value stores, and other domain-specific languages).
ABSTRACTION FIRST, IMPLEMENTATION FOLLOWS
The biggest challenge of application-specific abstractions is first and foremost the abstraction itself. The problem often manifests as a vague problem statement such as “the development cost of my software is too large,” “the software has grown into a beast, hard to maintain and extend,” or the perennial complaint that “the application is too slow.” The root cause of many of these problems is poor software architecture, which can be cured by providing layers of abstraction in the software architecture.
This is what our team (the SDK & compiler team) is doing with various internal product codebases. There is no universal rule for what an application-specific programming model should look like or how to derive one. One has to pore over the design documents and the implementation of the current software, again and again, to come up with a proper abstraction. That is the biggest challenge of this work. Once the abstraction is defined, the implementation (compiler or not) naturally follows.
When you have a software productivity or performance problem, remember the first picture shown above and ask yourself: “Am I doing punch-card programming for my domain/application?” If the answer is yes, think about abstraction first!
Two weeks into my new job at Huawei, I thought it would be nice to document this process, especially for those who may go through a similar mid-career transition. The last time I joined a new workplace was more than 12 years ago, when I was just out of graduate school. It was easy back then because the work was well defined and there were many senior colleagues to guide me through it. This time, though, I am serving in a role the group has never had before, and I am expected to lead rather than be led. So far, I have survived fine. Here are a few tips.
No idea what to do at first? Talk to people and listen attentively. My past two weeks were spent mainly talking to people: team members, my boss, peers from other teams, etc. On average, I spent at least 3 hours with each member of my team, first learning about the project, then about them, and finally sharing my own background and thoughts. One trick about talking to and getting to know new people is to listen with all your attention. And as I learned along the way, it is very important not to make assumptions and not to judge too quickly.
Record, observe, and reflect. In the last two weeks, I was submerged in all kinds of information and many human interactions. This is both stimulating and chaotic, so it is especially important to record the information and reflect on it. Evernote is a godsend, and asking team members for their resumes helps a little too. I also keep a daily journal and make sure to reserve quiet time for myself every day to reflect on all the information taken in.
Follow any lead and act upon it. With new information flowing in, and after putting your mind to digesting it, work items naturally emerge. Act upon any issue that emerges, as trust is often built gradually, and sometimes on the smallest commitments delivered. One of the first things I observed is that the team is used to impromptu communication, stopping by each other’s offices at any time. This clearly would not work for me, since I will work remotely most of the time. So I initiated an effort to build an internal group social-media presence, with a group website, personal blogs, a wiki, a forum, etc. Not only does it benefit remote members, but it also makes the entire team more productive, as communication is open to all and sharing thoughts and work is made easier. Along the way, my colleagues also got to know me better through my personal website and blogs.
Needless to say, the past two weeks were really intense for me. But I really enjoyed them. Going from an individual contributor to a team lead, into a vast space of technical possibilities, is an exciting transition for me. I feel lucky to have such a wonderful team to work with, and I am even more excited about my next trip to China to meet with the team over there.
I saw this quote at Adela’s school the other day, and I can’t agree more.
The only way that we can live is if we grow; the only way we can grow is if we change; the only way we can change is if we learn; the only way we can learn is if we are exposed; the only way we can be exposed is if we throw ourselves out into the open.
— C. JoyBell C.
2013 has been my year of new exposure and internal changes. The triggers are seemingly trivial: owning a smartphone, WeChat, high school reunion, my first open source project, starting a daily journal, and a good book.
Getting exposed to new sources of stimulus is often the first step of change.
As someone whose Ph.D. work was on automatic parallelization and who spent 5 years developing the auto-SIMDization feature of the xlc compiler for POWER processors, my view on auto-SIMDization has turned 180 degrees over the years. Yes, my 5 years on auto-SIMD were among my most productive and the best collaboration experience ever. But nowadays, every time someone brings up auto-SIMD as the solution to the programmability difficulty of SIMD, I can’t help but push back.
Auto-SIMD is the holy grail of SIMD programming models. Everybody wants it: programmers, executives, program managers, academics, compiler designers (myself included). The problem is that the perceived capability of auto-SIMD is quite different from the realistic capability of an auto-SIMD compiler. To put it bluntly, auto-SIMD compilers rarely work when applied to real code. Many times, compiler users came to us with a piece of their code that the compiler could not parallelize. Sometimes the code is simple to human eyes but complicated to the compiler, because of aliasing and unknown side effects through function calls. Sometimes the loop is parallel but gets “messed up” by internal compiler transformations that confuse the SIMD analysis. This is no fault of the compiler. It is simply the wrong task: asking the compiler to recover a parallel loop from a sequential program. Think about it: how many times have you relied on auto-parallelization to produce parallel code? Probably never. The same is true for auto-SIMD.
There are times when the compiler can indeed SIMDize a loop, but the loop is often so simple (e.g., matmul with all global side effects known) and the amount of compiler analysis required so humongous (e.g., interprocedural analysis) that it is much easier for the programmer to indicate the SIMDizable region to the compiler through a programming interface (e.g., OpenMP SIMD directives).
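To make the aliasing point concrete, here is a minimal sketch in C (the function and names are hypothetical, not from any product code). To a human the loop is obviously parallel, but without the `restrict` qualifiers the compiler must assume `dst` and `src` might overlap and typically refuses to vectorize. The OpenMP `simd` directive is the kind of programming interface mentioned above: the programmer, not the compiler, asserts that the region is SIMDizable.

```c
#include <stddef.h>

/* Trivially parallel to a human reader, yet an auto-SIMD compiler must
 * prove that dst and src never alias before vectorizing -- often
 * impossible once pointers cross function-call boundaries.
 * `restrict` states the no-alias guarantee; `#pragma omp simd`
 * (honored when compiled with OpenMP support, otherwise ignored)
 * marks the loop as safe to SIMDize. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        dst[i] = dst[i] + k * src[i];
}
```

With something like `gcc -O2 -fopenmp`, the programmer’s assertion replaces the humongous interprocedural analysis the compiler would otherwise need.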
I have seen it so many times: decision makers embrace auto-SIMD because it sounds like the best solution to the problem; compiler practitioners declare victory after SIMDizing a few self-selected kernels and publishing the initial results; and users give up on the feature after a few frustrating tries.
I will never forget the three questions my boss often asks about a new research idea: 1) does it work? 2) what does it do for me? 3) when is it available? Auto-SIMD fails the very first test.
In my community, it is not unusual to see someone box an entire career into one compiler infrastructure. Some of us are labelled by the infrastructure or the language we work on. I used to be an xlc person and would simply focus on solutions that involved xlc. The mantra used to be: “if it’s not an xlc problem, it’s not my problem,” or “if it’s not a compiler problem, it’s not my problem.” Yes, I was boxed in first by a compiler infrastructure, then by the field of compilers.
The reality is that real problems rarely land squarely in one’s specialty domain. Not only do the ultimate consumers of the compiler field (application developers) change their programming habits and tool requirements all the time, but for many real problems a compiler may not be the best solution (e.g., sometimes a small change to the algorithm or the program can go a long way, with a much more immediate effect).
This is a reflection on how a compiler person can be trapped by the infrastructures they are associated with and miss out on a much larger scope in which to apply their expertise.
While many of us started from the “common starting point” box, most of us do not have the luxury of staying in that box for an entire career (or would you want to, even if you could?). In this figure, I identify three “stretch” roles for compiler expertise in the systems and compiler research area that are becoming increasingly important:
- Enabling better hardware design. Increasingly, I see compiler people deeply involved in the concept phase of processor design: translating application-level requirements to the hardware, proposing new hardware features/instructions, identifying workloads, and evaluating performance benefits.
- Programming-interface design. This is really a process of extracting common components from user applications, in the form of middleware, a common runtime, a domain-specific language, or new language features, both to improve productivity and to create a common component for deep, platform-specific optimization (e.g., the graph runtime in System G).
- Performance engineering and analysis. This involves workload on-boarding, deep performance analysis (extremely time-consuming and an art in itself), tuning of OS/machine/compiler configurations, and code extraction/rewriting.
These three roles are a perfect match for a compiler expert, who has the rare understanding of the entire system stack: application, OS, runtime, compiler/JIT, and hardware. As the importance of optimization in a traditional compiler/JIT diminishes, we may look more and more to these new areas to expand into.
From CELL BE to GPGPU
To give some history, I started working on the CELL BE (the SIMD engine behind the PlayStation 3) right after joining IBM in 2001. For the next 5 years, I worked in a research team that designed the intrinsics interfaces for CELL BE, added the automatic vectorization feature to the XLC compiler, and prototyped an OpenMP-based single-source compiler for CELL BE (with the lofty goal of writing one program that could be compiled to run on both the host and the SIMD accelerator). I left SIMD in 2006 to pursue other research interests and, to be honest, thought at the time that there was nothing major left to be done about SIMD.
I came back to SIMD in 2012, after ignoring all the research hype around GPGPUs for many years, only to find SIMD engines at the core of every GPGPU. If you had asked me in 2007 whether it was possible to support a programming model like CUDA on a SIMD engine, I would have said impossible, and backed it up with our experience building the single-source compiler for CELL, which shared the same vision as CUDA.
So why did CUDA succeed where single-source did not? I believe the key difference is in the hardware. The CELL SIMD design had many constraints, e.g., memory accesses had to be aligned and stride-one, and there was no support for control flow. As a result, many common language constructs could not be mapped efficiently to CELL SIMD. GPGPUs, on the other hand, relaxed the key constraints of traditional SIMD hardware:
- Gather/scatter support: allows arbitrary loads/stores in SIMD code.
- Predication or divergent-branch support: allows diverging control flow in SIMD code.
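As a concrete (and entirely hypothetical) illustration of why these two features matter, consider the C loop below. Each iteration performs an indirect load (a gather) and then takes a data-dependent branch. On traditional SIMD hardware with only aligned, stride-one loads and no divergence support, this loop cannot be mapped to vector instructions; on a GPGPU, the indirect load maps to a gather and the branch to divergent or predicated lanes, with each iteration behaving like a thread.

```c
#include <stddef.h>

/* Each iteration gathers table[idx[i]] from an arbitrary address and
 * then branches on the loaded value. Relaxed SIMD hardware handles the
 * indirect load with a gather instruction and the data-dependent branch
 * with per-lane predication or divergent execution. */
void gather_clamp(float *out, const float *table,
                  const int *idx, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        float v = table[idx[i]];   /* gather: arbitrary memory access */
        if (v < 0.0f)              /* divergent control flow */
            v = 0.0f;
        out[i] = v;
    }
}
```

Written as scalar C, this is exactly the shape of kernel that CELL-era SIMD rejected but CUDA-style hardware executes naturally.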
From traditional SIMD to SIMT
In the chart above, we divide SIMD architectures into two broad categories.
- To the left is the traditional SIMD architecture, which provides little support for arbitrary memory accesses (e.g., no gather/scatter capability) and no support for control-flow divergence.
- To the right is the new era of SIMD; let’s call it SIMT (Single-Instruction, Multiple-Thread), as coined by NVIDIA. The key hardware features of SIMT are gather/scatter and support for control-flow divergence. The latter can be provided via divergent branches (NVIDIA) or predication (Xeon Phi).
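To sketch what predication means, here is a scalar model I made up for illustration (not any particular ISA). The compiler or hardware performs if-conversion: both sides of the branch are available for every lane, and the result is chosen by a per-lane predicate, so no lane ever takes an actual branch.

```c
#include <stddef.h>

/* If-conversion: the branchy form
 *     if (v[i] < lo) v[i] = lo;
 * becomes a branch-free select driven by a predicate. This is the
 * scalar analogue of what predicated SIMD/SIMT hardware does for each
 * lane: all lanes execute the same instructions, and the predicate
 * masks which result each lane keeps. */
void clamp_below(float *v, float lo, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        int pred = (v[i] < lo);    /* per-lane predicate */
        v[i] = pred ? lo : v[i];   /* select: result chosen by the mask */
    }
}
```

Divergent-branch hardware achieves the same effect differently, by letting subsets of lanes follow different paths and reconverging afterward; predication simply executes everything under a mask.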
A final note on the chart is the trend line of where the industry is going. While GPGPUs are rooted squarely in the SIMT domain today, Intel’s SIMD is gradually shifting toward SIMT in both hardware and software tools. I predict that in five years’ time, software support for SIMT will be mature enough that SIMT will become the mainstream programming model for SIMD.