Stanford CS149, Fall 2022

PARALLEL COMPUTING

From smartphones, to multi-core CPUs and GPUs, to the world's largest supercomputers and websites, parallel processing is ubiquitous in modern computing. The goal of this course is to provide a deep understanding of the fundamental principles and engineering trade-offs involved in designing modern parallel computing systems, and to teach the parallel programming techniques needed to use these machines effectively. Because writing good parallel programs requires an understanding of key machine performance characteristics, this course covers both parallel hardware and software design.

Basic Info

Time: Tues/Thurs 10:30-11:50am

Location: NVIDIA Auditorium

Instructors: Kayvon Fatahalian and Kunle Olukotun

See the course info page for more info on policies and logistics.

Fall 2022 Schedule

Sep 27		Why Parallelism? Why Efficiency?	Challenges of parallelizing code, motivations for parallel chips, processor basics
Sep 29		A Modern Multi-Core Processor	Forms of parallelism: multicore, SIMD, threading + understanding latency and bandwidth
Oct 04		Multi-Core Arch II + ISPC Programming Abstractions	Finish up multi-threading and latency vs. bandwidth. ISPC programming, abstraction vs. implementation
Oct 06		Parallel Programming Basics	Ways of thinking about parallel programs and their corresponding hardware implementations, thought process of parallelizing a program in data parallel and shared address space models
Oct 11		Performance Optimization I: Work Distribution and Scheduling	Achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing
Oct 13		Performance Optimization II: Locality, Communication, and Contention	Message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention
Oct 18		GPU Architecture and CUDA Programming	CUDA programming abstractions and how they are implemented on modern GPUs
Oct 20		Data-Parallel Thinking	Data-parallel operations like map, reduce, scan, prefix sum, groupByKey
Oct 25		Distributed Data-Parallel Computing Using Spark	Producer-consumer locality, RDD abstraction, Spark implementation and scheduling
Oct 27		Cache Coherence	Definition of memory coherence, invalidation-based coherence using MSI and MESI, false sharing
Nov 01		Memory Consistency	Consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics
Nov 03		Locks, Fine-Grained Synchronization, and Lock-Free Programming	Implementation of locks, fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers
Nov 08		Democracy Day (no class)	Take time to volunteer/educate yourself/take action!
Nov 10		Transactional Memory 1	Motivation for transactions, design space of transactional memory implementations
Nov 15		Midterm (no class)	The midterm will be an evening midterm. We may use the class period as a review period.
Nov 17		Transactional Memory 2	Finishing up transactional memory, focusing on implementations of STM and HTM
Nov 29		Hardware Specialization + Domain Specific Programming Languages	Energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, mobile SoCs, performance/productivity motivations for DSLs, case study on Halide image processing DSL
Dec 01		Parallel Graph Processing Frameworks + How DRAM Works	Domain-specific frameworks for graph processing, streaming graph processing, graph compression, DRAM basics
Dec 06		Programming for Hardware Specialization	Programming reconfigurable hardware like FPGAs and CGRAs
Dec 08		Efficiently Evaluating DNNs (+ Course Wrap Up)	Efficiently scheduling DNN layers, mapping to matrix-multiplication, layer fusion, DNN accelerators (e.g., GPU TensorCores, TPU)

Programming Assignments

Oct 7	Assignment 1: Analyzing Parallel Program Performance on a Quad-Core CPU
Oct 24	Assignment 2: Scheduling Task Graphs on a Multi-Core CPU
Nov 9	Assignment 3: A Simple Renderer in CUDA
Dec 1	Assignment 4: Big Graph Processing in OpenMP
Dec 9	Extra Credit: Implement Matrix Multiplication as Fast as You Can

Written Assignments

Oct 14	Written Assignment 1
Oct 28	Written Assignment 2
Nov 4	Written Assignment 3
Nov 11	Written Assignment 4
Dec 5	Written Assignment 5