Stanford CS149, Fall 2020
PARALLEL COMPUTING
From smartphones, to multi-core CPUs and GPUs, to the world's largest supercomputers and websites, parallel processing is ubiquitous in modern computing. The goal of this course is to provide a deep understanding of the fundamental principles and engineering trade-offs involved in designing modern parallel computing systems, as well as to teach the parallel programming techniques necessary to use these machines effectively. Because writing good parallel programs requires an understanding of key machine performance characteristics, this course covers both parallel hardware and software design.
Basic Info
Tues/Thurs 2:30-3:50pm
Virtual Course Only
Instructors: Kayvon Fatahalian and Kunle Olukotun
See the course info page for details on policies and logistics.
Fall 2020 Schedule
Sep 15 | Motivations for parallel chip decisions, challenges of parallelizing code
Sep 17 | Forms of parallelism: multicore, SIMD, threading + understanding latency and bandwidth
Sep 22 | Ways of thinking about parallel programs and their corresponding hardware implementations; ISPC programming
Sep 24 | Thought process of parallelizing a program in data parallel and shared address space models
Sep 29 | Achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing
Oct 01 | Message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention
Oct 06 | CUDA programming abstractions and how they are implemented on modern GPUs
Oct 08 | Data-parallel thinking: map, reduce, scan, prefix sum, groupByKey (see the scan sketch below the schedule)
Oct 13 | Producer-consumer locality, RDD abstraction, Spark implementation and scheduling
Oct 15 | Definition of memory coherence, invalidation-based coherence using MSI and MESI, false sharing (see the padding sketch below the schedule)
Oct 20 | Consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics, implementing locks and atomic operations (see the spinlock sketch below the schedule)
Oct 22 | Fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers
Oct 27 | Midterm Exam (good luck to everyone!)
Oct 29 | Motivation for transactions, design space of transactional memory implementations, lazy-optimistic HTM
Nov 03 | Energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, mobile SoCs
Nov 05 | Motivation for DSLs, case study on the Halide image-processing DSL
Nov 10 | GraphLab, Ligra, and GraphChi, streaming graph processing, graph compression
Nov 12 | Performance programming for FPGAs and CGRAs
Nov 17 | Scheduling convolutional layers, exploiting precision and sparsity, DNN accelerators (e.g., GPU Tensor Cores, TPU)
Nov 19 | Enjoy your winter holiday break!
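To make the Oct 08 topic concrete, here is a minimal sketch in C++ of the exclusive-scan (prefix-sum) primitive, written sequentially for clarity. This is our own illustration, not course-provided code; the function name and values are invented for the example.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Illustrative sketch (not course code): exclusive prefix sum, the "scan"
    // primitive from the Oct 08 lecture, written sequentially for clarity.
    // out[i] holds the sum of in[0..i-1]; out[0] is the identity (0).
    std::vector<int> exclusive_scan(const std::vector<int>& in) {
        std::vector<int> out(in.size());
        int running = 0;
        for (std::size_t i = 0; i < in.size(); i++) {
            out[i] = running;   // everything before element i
            running += in[i];
        }
        return out;
    }

    int main() {
        std::vector<int> v = {3, 1, 4, 1, 5};
        for (int x : exclusive_scan(v)) printf("%d ", x);   // prints: 0 3 4 8 9
        printf("\n");
        return 0;
    }

Scan looks inherently serial, which is exactly why it is a good exercise in data-parallel thinking: a typical parallel version computes per-chunk sums, scans the chunk totals, and then scans each chunk locally starting from its offset.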
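For the Oct 15 topic of false sharing, here is a small self-contained C++ experiment (again our own illustration, not course material; compile with -std=c++17 -pthread). Four threads increment private counters, first packed into adjacent memory, then padded so each counter occupies its own 64-byte cache line. On typical hardware the packed version runs measurably slower because the shared cache line bounces between cores.

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Illustrative sketch (not course code) of false sharing: the packed
    // counters share a cache line, so per-thread writes invalidate each
    // other's cached copies; the padded counters each get a 64-byte line.
    struct Packed             { std::atomic<long> v{0}; };
    struct alignas(64) Padded { std::atomic<long> v{0}; };

    template <typename Counter>
    void run(const char* label) {
        std::vector<Counter> counters(4);
        std::vector<std::thread> threads;
        auto start = std::chrono::steady_clock::now();
        for (int t = 0; t < 4; t++)
            threads.emplace_back([&counters, t] {
                for (int i = 0; i < 10000000; i++)
                    counters[t].v.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& th : threads) th.join();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start).count();
        printf("%-8s %lld ms\n", label, (long long)ms);
    }

    int main() {
        run<Packed>("packed:");   // counters adjacent: false sharing
        run<Padded>("padded:");   // one counter per cache line
        return 0;
    }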
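And for the Oct 20 topic of implementing locks with atomic operations: a minimal test-and-set spinlock built on C++11 std::atomic. The exchange in lock() uses acquire ordering and the store in unlock() uses release ordering, matching the acquire/release semantics named in that lecture. This is a sketch of the standard technique, not the course's reference implementation.

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Illustrative sketch (not course code): a test-and-set spinlock.
    // lock() spins until it is the thread that flips the flag false -> true.
    struct SpinLock {
        std::atomic<bool> locked{false};
        void lock() {
            // exchange is an atomic read-modify-write; acquire ordering keeps
            // the critical section's memory operations after lock acquisition
            while (locked.exchange(true, std::memory_order_acquire)) { /* spin */ }
        }
        void unlock() {
            // release ordering publishes the critical section's writes
            locked.store(false, std::memory_order_release);
        }
    };

    int main() {
        SpinLock lk;
        int counter = 0;   // plain int: protected only by the lock
        std::vector<std::thread> threads;
        for (int t = 0; t < 4; t++)
            threads.emplace_back([&] {
                for (int i = 0; i < 100000; i++) { lk.lock(); counter++; lk.unlock(); }
            });
        for (auto& th : threads) th.join();
        printf("%d\n", counter);   // 400000 every run
        return 0;
    }

A production lock would spin on a plain load before retrying the exchange (test-and-test-and-set) to reduce coherence traffic under contention, which ties directly back to the Oct 15 material.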
Programming Assignments
Sep 25 | Assignment 1: Analyzing Parallel Program Performance on a Quad-Core CPU
Oct 08 | Assignment 2: Scheduling Task Graphs
Oct 23 | Assignment 3: A Simple Renderer in CUDA
Nov 10 | Assignment 4: Big Graph Processing in OpenMP
Nov 19 | Assignment 5: Optional Assignment
Written Assignments
Oct 06 | Written Assignment 1
Oct 13 | Written Assignment 2
Oct 20 | Written Assignment 3
Nov 05 | Written Assignment 4
Nov 17 | Written Assignment 5