Stanford CS149: Parallel Computing, Fall 2020

From smartphones to multi-core CPUs and GPUs to the world's largest supercomputers and websites, parallel processing is ubiquitous in modern computing. The goal of this course is to provide a deep understanding of the fundamental principles and engineering trade-offs involved in designing modern parallel computing systems, as well as to teach the parallel programming techniques necessary to use these machines effectively. Because writing good parallel programs requires an understanding of key machine performance characteristics, this course covers both parallel hardware and software design.

Basic Info
Tues/Thurs 2:30-3:50pm
Virtual Course Only
Instructors: Kayvon Fatahalian and Kunle Olukotun
See the course info page for more info on policies and logistics.
Fall 2020 Schedule
Sep 15: Motivations for parallel chip decisions, challenges of parallelizing code
Sep 17: Forms of parallelism (multicore, SIMD, threading) + understanding latency and bandwidth
Sep 22: Ways of thinking about parallel programs and their corresponding hardware implementations; ISPC programming
Sep 24: Thought process of parallelizing a program in data-parallel and shared address space models
Sep 29: Achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing
Oct 01: Message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention
Oct 06: CUDA programming abstractions and how they are implemented on modern GPUs
Oct 08: Data-parallel thinking: map, reduce, scan, prefix sum, groupByKey
Oct 13: Producer-consumer locality, RDD abstraction, Spark implementation and scheduling
Oct 15: Definition of memory coherence, invalidation-based coherence using MSI and MESI, false sharing
Oct 20: Consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics, implementing locks and atomic operations
Oct 22: Fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers
Oct 27: Midterm Exam (good luck to everyone!)
Oct 29: Motivation for transactions, design space of transactional memory implementations, lazy-optimistic HTM
Nov 03: Energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, mobile SoCs
Nov 05: Motivation for DSLs, case study on the Halide image processing DSL
Nov 10: GraphLab, Ligra, and GraphChi, streaming graph processing, graph compression
Nov 12: Performance programming for FPGAs and CGRAs
Nov 17: Scheduling convolutional layers, exploiting precision and sparsity, DNN accelerators (e.g., GPU Tensor Cores, TPU)
Nov 19: Enjoy your winter holiday break!
Programming Assignments
Sep 25 Assignment 1: Analyzing Parallel Program Performance on a Quad-Core CPU
Oct 8 Assignment 2: Scheduling Task Graphs
Oct 23 Assignment 3: A Simple Renderer in CUDA
Nov 10 Assignment 4: Big Graph Processing in OpenMP
Nov 19 Assignment 5: Optional Assignment
Written Assignments
Oct 6 Written Assignment 1
Oct 13 Written Assignment 2
Oct 20 Written Assignment 3
Nov 5 Written Assignment 4
Nov 17 Written Assignment 5