Stanford CS149, Fall 2022
PARALLEL COMPUTING

This page contains lecture slides and recommended readings for the Fall 2022 offering of CS149.

(Challenges of parallelizing code, motivations for parallel chips, processor basics)
(Forms of parallelism: multicore, SIMD, threading + understanding latency and bandwidth)
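
As a concrete illustration of the SIMD execution mentioned above (not course-provided code), the sketch below adds two float arrays eight lanes at a time using x86 AVX intrinsics; the array size and names are illustrative, and an AVX-capable CPU (compile with -mavx) is assumed.

    #include <immintrin.h>   // AVX intrinsics; assumes an x86 CPU with AVX support
    #include <cstdio>

    int main() {
        const int N = 16;                        // illustrative size, a multiple of the 8-float vector width
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

        // Each iteration processes 8 floats with one vector add instruction.
        for (int i = 0; i < N; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
        }
        printf("c[5] = %f\n", c[5]);             // expect 15.0
        return 0;
    }
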
(Finish up multi-threading and latency vs. bandwidth; ISPC programming, abstraction vs. implementation)
(Ways of thinking about parallel programs and their corresponding hardware implementations; the thought process of parallelizing a program in the data-parallel and shared address space models)
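
A minimal sketch of the shared address space model described above (illustrative, not from the course): each thread sums a contiguous block of a shared array into its own partial, and the partials are combined sequentially afterward. Thread count and array size are arbitrary.

    #include <thread>
    #include <vector>
    #include <cstdio>

    int main() {
        const int N = 1 << 20, T = 4;            // illustrative problem size and thread count
        std::vector<float> data(N, 1.0f);        // shared array: all threads see the same memory
        std::vector<double> partial(T, 0.0);     // one slot per thread avoids synchronization

        std::vector<std::thread> workers;
        for (int t = 0; t < T; t++) {
            workers.emplace_back([&, t] {
                int lo = t * (N / T), hi = (t + 1) * (N / T);   // static block assignment
                for (int i = lo; i < hi; i++) partial[t] += data[i];
            });
        }
        for (auto& w : workers) w.join();

        double total = 0.0;
        for (double p : partial) total += p;     // sequential reduction of the partial sums
        printf("sum = %f\n", total);             // expect 1048576.0
        return 0;
    }
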
(Achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing)
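
The fork-join style that a Cilk work-stealing scheduler executes can be sketched loosely with std::async (illustrative only; a real Cilk runtime reuses a fixed pool of workers and steals un-started subproblems rather than launching OS threads per spawn). The cutoff and names are made up for the example.

    #include <future>
    #include <numeric>
    #include <vector>
    #include <cstdio>

    // Recursively sum [lo, hi): "spawn" the left half, compute the right half ourselves,
    // then "sync" by waiting on the future. Idle workers in a work-stealing scheduler
    // would steal the larger, un-started pieces near the top of this recursion tree.
    long long psum(const std::vector<int>& v, size_t lo, size_t hi) {
        if (hi - lo < 100000)                                    // serial cutoff to limit overhead
            return std::accumulate(v.begin() + lo, v.begin() + hi, 0LL);
        size_t mid = lo + (hi - lo) / 2;
        auto left = std::async(std::launch::async, psum, std::cref(v), lo, mid);
        long long right = psum(v, mid, hi);
        return left.get() + right;
    }

    int main() {
        std::vector<int> v(1 << 20, 1);
        printf("sum = %lld\n", psum(v, 0, v.size()));            // expect 1048576
        return 0;
    }
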
(Message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention)
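
One of the ideas above, increasing arithmetic intensity, can be shown by fusing a producer loop with its consumer so the intermediate value stays in a register instead of round-tripping through memory; the snippet is an illustrative sketch, not course code.

    #include <vector>
    #include <cstdio>

    int main() {
        const int N = 1 << 20;
        std::vector<float> a(N, 1.0f), tmp(N), out(N);

        // Unfused: two passes, so every intermediate is written to memory and read back.
        for (int i = 0; i < N; i++) tmp[i] = 2.0f * a[i];
        for (int i = 0; i < N; i++) out[i] = tmp[i] + 1.0f;

        // Fused: the same 2 flops per element now move roughly half as many bytes,
        // raising the flops-per-byte ratio (arithmetic intensity).
        for (int i = 0; i < N; i++) out[i] = 2.0f * a[i] + 1.0f;

        printf("out[0] = %f\n", out[0]);   // expect 3.0
        return 0;
    }
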
(CUDA programming abstractions, and how they are implemented on modern GPUs)
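
A minimal CUDA sketch of the abstractions named above: a SPMD kernel launched over a grid of thread blocks, with each thread computing one element. Sizes and names are illustrative and error checking is omitted.

    #include <cstdio>
    #include <cuda_runtime.h>

    // blockIdx, blockDim, and threadIdx give each thread a unique element index.
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int N = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, N * sizeof(float));   // unified memory, for brevity
        cudaMallocManaged(&y, N * sizeof(float));
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        int threadsPerBlock = 256;
        int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
        saxpy<<<blocks, threadsPerBlock>>>(N, 3.0f, x, y);
        cudaDeviceSynchronize();

        printf("y[0] = %f\n", y[0]);                // expect 5.0
        cudaFree(x); cudaFree(y);
        return 0;
    }
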
(Data-parallel operations like map, reduce, scan, prefix sum, groupByKey)
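
For reference, the scan (prefix-sum) operation listed above is what C++17's std::exclusive_scan computes sequentially; the lecture's interest is in computing it in parallel (e.g., with an up-sweep/down-sweep over the array). The example below is illustrative.

    #include <numeric>
    #include <vector>
    #include <cstdio>

    int main() {
        std::vector<int> in = {3, 1, 7, 0, 4, 1, 6, 3};
        std::vector<int> out(in.size());

        // Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
        std::exclusive_scan(in.begin(), in.end(), out.begin(), 0);

        for (int v : out) printf("%d ", v);   // prints: 0 3 4 11 11 15 16 22
        printf("\n");
        return 0;
    }
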
(Producer-consumer locality, RDD abstraction, Spark implementation and scheduling)
(Definition of memory coherence, invalidation-based coherence using MSI and MESI, false sharing)
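
False sharing, one of the topics above, arises when two threads repeatedly write different variables that happen to share a cache line. The illustrative sketch below pads each counter to its own (assumed 64-byte) line; removing the alignas would put both counters on one line and trigger the coherence ping-pong, without changing the program's result.

    #include <thread>
    #include <cstdio>

    struct Padded { alignas(64) long value = 0; };   // one counter per assumed 64-byte cache line

    Padded counters[2];

    void work(int id) {
        for (long i = 0; i < 100000000; i++)
            counters[id].value++;                    // each thread writes only its own counter
    }

    int main() {
        std::thread t0(work, 0), t1(work, 1);
        t0.join(); t1.join();
        printf("%ld %ld\n", counters[0].value, counters[1].value);
        return 0;
    }
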
(Consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics)
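
The acquire/release semantics mentioned above can be sketched with C++ atomics: the writer publishes data and then sets a flag with a release store, and the reader's acquire load of the flag guarantees it also sees the earlier write. Illustrative sketch, not course code.

    #include <atomic>
    #include <thread>
    #include <cstdio>

    int payload = 0;
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;                                   // ordinary, non-atomic write
        ready.store(true, std::memory_order_release);   // release: publishes the write above
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire))  // acquire: pairs with the release store
            ;                                           // spin until the flag is set
        printf("payload = %d\n", payload);              // guaranteed to print 42
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join(); t2.join();
        return 0;
    }
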
(Implementation of locks, fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers)
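
A sketch of the lock-free push from the stack example above, using a compare-and-swap retry loop (a Treiber-style stack). Pop is deliberately omitted: doing it safely is exactly where the ABA problem and reclamation schemes such as hazard pointers come in. Names are illustrative, and the nodes are intentionally never freed in this sketch.

    #include <atomic>
    #include <thread>
    #include <cstdio>

    struct Node { int value; Node* next; };
    std::atomic<Node*> top{nullptr};

    // Lock-free push: read the current top, link the new node to it, and CAS it in.
    void push(int v) {
        Node* n = new Node{v, top.load()};
        while (!top.compare_exchange_weak(n->next, n))
            ;   // on failure, n->next is reloaded with the current top and we retry
    }

    int main() {
        std::thread a([] { for (int i = 0; i < 1000; i++) push(i); });
        std::thread b([] { for (int i = 0; i < 1000; i++) push(i); });
        a.join(); b.join();

        int count = 0;
        for (Node* p = top.load(); p; p = p->next) count++;
        printf("pushed %d nodes\n", count);   // expect 2000
        return 0;
    }
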
(Motivation for transactions, design space of transactional memory implementations)
(Finishing up transactional memory, focusing on implementations of STM and HTM)
(Energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, mobile SoCs; performance/productivity motivations for DSLs, case study on the Halide image-processing DSL)
(Domain-specific frameworks for graph processing, streaming graph processing, graph compression, DRAM basics)
(Programming reconfigurable hardware like FPGAs and CGRAs)
(Efficiently scheduling DNN layers, mapping to matrix multiplication, layer fusion, DNN accelerators (e.g., GPU Tensor Cores, TPU))
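
As a tiny illustration of two of the ideas above, the sketch below treats a fully connected layer as a matrix-vector product and fuses the bias add and ReLU into the same loop, so the pre-activation values are never written to memory. Shapes, values, and names are illustrative.

    #include <vector>
    #include <algorithm>
    #include <cstdio>

    int main() {
        const int IN = 4, OUT = 3;                         // illustrative layer sizes
        std::vector<float> W(OUT * IN, 0.5f), b(OUT, -0.25f), x(IN, 1.0f), y(OUT);

        // A dense layer is a matrix-vector product; fusing bias + ReLU ("layer fusion")
        // avoids materializing the intermediate activations.
        for (int o = 0; o < OUT; o++) {
            float acc = b[o];
            for (int i = 0; i < IN; i++)
                acc += W[o * IN + i] * x[i];               // row-major weight matrix
            y[o] = std::max(acc, 0.0f);                    // fused ReLU
        }

        for (float v : y) printf("%f ", v);                // expect 1.75 for each output
        printf("\n");
        return 0;
    }
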