Stanford CS149, Winter 2019


From smartphones, to multi-core CPUs and GPUs, to the world's largest supercomputers and websites, parallel processing is ubiquitous in modern computing. The goal of this course is to provide a deep understanding of the fundamental principles and engineering trade-offs involved in designing modern parallel computing systems, as well as to teach the parallel programming techniques necessary to effectively utilize these machines. Because writing good parallel programs requires an understanding of key machine performance characteristics, this course covers both parallel hardware and software design.

Basic Info
Tues/Thurs 4:30-6:00pm
Room 420-040
Instructors: Kunle Olukotun and Kayvon Fatahalian
See the course info page for more info on course policies and logistics.
Winter 2019 Schedule
Jan 8
Motivations for parallel chip decisions, challenges of parallelizing code
Jan 10
Forms of parallelism: multicore, SIMD, threading + understanding latency and bandwidth
Jan 15
Ways of thinking about parallel programs and their corresponding hardware implementations, ISPC programming
Jan 17
Thought process of parallelizing a program in data parallel and shared address space models
Jan 22
Achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing
Jan 24
Message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention
Jan 29
CUDA programming abstractions, and how they are implemented on modern GPUs
Jan 31
Definition of memory coherence, invalidation-based coherence using MSI and MESI, false sharing
Feb 5
Consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics
Feb 7
Directory-based coherence, machine-level atomic operations, implementing locks, implementing barriers
Feb 12
Midterm Exam
Feb 14
Data parallel thinking: map, reduce, scan, prefix sum, groupByKey
Feb 19
Producer-consumer locality, RDD abstraction, Spark implementation and scheduling
Feb 21
Fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers
Feb 26
Motivation for transactions, design space of transactional memory implementations, lazy-optimistic HTM
Feb 28
Energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, mobile SoCs
Mar 5
Motivation for DSLs, OptiML, Delite, case study on Halide
Mar 7
GraphLab, Ligra, and GraphChi, streaming graph processing, graph compression
Mar 12
Scheduling convolution layers, exploiting precision and sparsity, DNN accelerators (e.g., GPU Tensor Cores, TPU)
Mar 14
Have a great spring break!
Programming Assignments
Jan 18  Assignment 1: Analyzing Parallel Program Performance on a Quad-Core CPU
Feb 1   Assignment 2: A Multi-Threaded Web Server in Java
Feb 17  Assignment 3: Big Graph Processing in OpenMP
Mar 1   Assignment 4: A Simple Renderer in CUDA
Mar 14  Assignment 5: Big Data Processing in Spark
Written Assignments
Jan 25  Written Assignment 1: Multi-threading Scheduling + The Most ALUs Wins + Angry Students
Feb 5   Written Assignment 2: Sending Messages + Processing Pipeline + Be a Processor Architect
Feb 20  Written Assignment 3: Spin Locks + PKPU + Data Parallel Thinking
Feb 27  Written Assignment 4: Spark Memory Footprint + Concurrent Hashtables and Graphs
Mar 8   Written Assignment 5: Comparing and Swapping + Transactions on Trees