—To see a world in a grain of sand ... and eternity in an hour (William Blake)
This course is designed to help someone think holistically about parallelism. Starting with "scale up" or "scale out" overlooks the fundamentals: first, we should "scale down." It is quite shocking how much parallelism is already available within a single core.
Too often, we jump to multiple processors (cores or nodes) without learning how to optimise the first one. Then, to our horror, parallel computing makes code slower! 1 + 1 = 0.5; it defies reason! This is hardly an encouraging initiation for the budding parallel programmer.
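A minimal sketch of how 1 + 1 = 0.5 can actually happen, assuming a typical 64-byte cache line (this toy program is my illustration, not part of the course materials): two threads increment *adjacent* counters, so every write invalidates the other thread's copy of the shared cache line ("false sharing"), and the pair can run slower than a single thread would.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Two counters that (very likely) share one 64-byte cache line:
// each thread's write invalidates the other's cached copy.
struct Counters {
    std::atomic<long> a{0};
    std::atomic<long> b{0};  // adjacent to 'a' in memory
};

int main() {
    Counters c;
    const long N = 50000000;
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] {
        for (long i = 0; i < N; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (long i = 0; i < N; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("%ld increments per thread: %lld ms\n", N, (long long)ms);
    // Giving each counter its own cache line (e.g., alignas(64) padding)
    // typically makes the two threads faster than one; sharing the line
    // often makes them slower.
    return 0;
}
```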
This course trains a parallel programmer before they ever touch a multicore (never mind distributed or co-processing!) device. If one can expose parallelism within a single core, adding more cores is (relatively) easy, regardless of whether those cores are connected by a network, the PCIe bus, or a shared L3 cache.
This patient approach spends considerable time tackling the memory wall, data access patterns, and cache optimisation. We learn to enable superscalar execution by exposing instruction-level parallelism, and we activate the SIMD units with vectorised code. We try to get so much out of an individual core that it is worth asking, "what's the need for another one?"
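For a flavour of the single-core material, consider exposing instruction-level parallelism in a simple reduction (a toy kernel of my own, not taken from the course). A single accumulator creates a loop-carried dependency, so each addition must wait for the previous one; splitting the sum across independent accumulators lets a superscalar core overlap the additions. Note that this reassociates floating-point addition, which slightly changes rounding and is exactly why a compiler will not do it for you at default flags.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// One accumulator: each addition depends on the previous result, so
// the loop is limited by the latency of a single floating-point add.
double sum_serial(const std::vector<double>& x) {
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) s += x[i];
    return s;
}

// Four independent accumulators: a superscalar, out-of-order core can
// keep several additions in flight at once, and the pattern also makes
// the loop easy for the compiler to vectorise with SIMD.
double sum_ilp(const std::vector<double>& x) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= x.size(); i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < x.size(); ++i) s0 += x[i];  // handle the leftovers
    return (s0 + s1) + (s2 + s3);
}

int main() {
    std::vector<double> x(1 << 20, 1.0);
    std::printf("%f %f\n", sum_serial(x), sum_ilp(x));
    return 0;
}
```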
The second half of the course focuses on "algorithmic primitives" for breaking dependencies, thereby exposing data-level parallelism. These primitives are then applied to multicore and GPU architectures to get the most out of each.
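As one example of such a primitive (my illustration; the course's own catalogue may differ): a prefix sum looks inherently sequential because each output depends on the previous one, but a block-wise, multi-pass scan confines the dependency to block boundaries, leaving the bulk of the work embarrassingly parallel.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Inclusive prefix sum, restructured so the loop-carried dependency
// only crosses block boundaries.  Pass 1 scans each block
// independently; a short serial pass computes per-block offsets;
// pass 2 adds the offsets to each block independently.
void scan_blocked(std::vector<int>& a, std::size_t num_blocks) {
    if (a.empty()) return;
    const std::size_t n = a.size();
    const std::size_t block = (n + num_blocks - 1) / num_blocks;
    std::vector<int> offset(num_blocks, 0);

    // Pass 1: independent per-block scans (parallel).
    std::vector<std::thread> pool;
    for (std::size_t b = 0; b < num_blocks; ++b)
        pool.emplace_back([&, b] {
            std::size_t lo = b * block, hi = std::min(n, lo + block);
            for (std::size_t i = lo + 1; i < hi; ++i) a[i] += a[i - 1];
        });
    for (auto& t : pool) t.join();

    // Serial pass over only num_blocks partial sums (cheap).
    for (std::size_t b = 1; b < num_blocks; ++b)
        offset[b] = offset[b - 1] + a[std::min(n, b * block) - 1];

    // Pass 2: independent per-block offset additions (parallel).
    pool.clear();
    for (std::size_t b = 1; b < num_blocks; ++b)
        pool.emplace_back([&, b] {
            std::size_t lo = b * block, hi = std::min(n, lo + block);
            for (std::size_t i = lo; i < hi; ++i) a[i] += offset[b];
        });
    for (auto& t : pool) t.join();
}
```

The same decomposition maps naturally onto a GPU, where each thread block scans its own tile and a second kernel applies the offsets.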
By the end of the course, one should have a substantial grasp of how to optimise the performance of a program, and of how to solve problems from a data-level parallel perspective.
The course is structured in four modules:
- Single-core parallelism (breaking the memory wall, latency hiding, cache-friendly data structures, superscalar out-of-order execution, ILP, and SIMD)
- Multi-core parallelism (hyper-threading, lock-free parallelism, data-parallel design, synchronisation, shared memory; a small lock-free sketch follows this list)
- GPU co-processing (lock-step execution, thread saturation, SIMT design)
- Advanced topics (e.g., processor-in-memory), if there is enough time
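As a taste of the multi-core module, the sketch below (my example, not from the course materials) shows lock-free parallelism via a compare-exchange loop: several threads race to record a maximum without any mutex, so no thread can ever block the others.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Lock-free "record the maximum": retry the compare-exchange until
// either our value is published or someone else published a bigger one.
void atomic_max(std::atomic<int>& target, int value) {
    int cur = target.load(std::memory_order_relaxed);
    while (value > cur &&
           !target.compare_exchange_weak(cur, value,
                                         std::memory_order_relaxed)) {
        // On failure, 'cur' is reloaded with the current value;
        // the loop condition rechecks whether we still need to write.
    }
}

int main() {
    std::atomic<int> best{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t)
        pool.emplace_back([&, t] {
            for (int i = 0; i < 1000; ++i)
                atomic_max(best, t * 1000 + i);  // stand-in workload
        });
    for (auto& th : pool) th.join();
    std::printf("max = %d\n", best.load());  // always prints 3999
    return 0;
}
```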
The lectures mostly follow two formats: either a class discussion of a paper (which students should have read in advance) or a live-coding session to illustrate a particular effect. For the paper discussions, I have included "minutes" of our class discussion, which usually match the notes I prepared in advance to direct it. The live-coding sessions are more involved: each has a sub-directory containing source (and header) files as well as a README that describes the overall lecture plan.
Two exams assess individual performance, but the grade is primarily based on one semester-long group project. Each group develops an efficient GPU algorithm to process a complex computing task of their choice. The project is staged to facilitate this: groups first optimise a single-threaded implementation, then port it to multi-core, and finally implement it for a GPU. Some pre-screening is necessary to ensure that each group chooses a problem early on that has the potential to fit the GPU architecture well.
This reflects the course concept that achieving a high degree of GPU parallelism requires first understanding how to optimise a single thread.
This course is designed for upper-level or graduate Computer Science students. Familiarity with asymptotic analysis, algorithm design, and data structures is assumed, preferably at the level of a systems course (e.g., Intro to Operating Systems). One should also be able to read generic C++11 code quickly enough to follow a live-coding session.