Stanford CS149, Fall 2025

PARALLEL COMPUTING

From smartphones to multi-core CPUs, GPUs, AI accelerators, and the world's largest supercomputers and websites, parallel processing is ubiquitous in modern computing. The goal of this course is to provide a deep understanding of the fundamental principles and engineering trade-offs involved in designing modern parallel computing systems, and to teach the parallel programming techniques needed to use these machines effectively. Because writing good parallel programs requires an understanding of key machine performance characteristics, this course covers both parallel hardware and software design.

Basic Info

Time: Tues/Thurs 10:30-11:50am

Location: NVIDIA Auditorium

Instructors: Kayvon Fatahalian and Kunle Olukotun

See the course info page for details on policies and logistics.

Fall 2025 Schedule

Sep 23		Why Parallelism? Why Efficiency?	Challenges of parallelizing code, motivations for parallel chips, processor basics
Sep 25		A Modern Multi-Core Processor (Part I)	Forms of parallelism: multi-core, SIMD, and multi-threading
Sep 30		Modern Multi-Core Architecture (Part II) + ISPC Programming Abstractions	Finish up multi-threading and latency vs. bandwidth; ISPC programming, abstraction vs. implementation
Oct 02		Parallelizing Code: An Example Thought Process	Process of parallelizing a program in data-parallel and shared address space models
Oct 07		Program Optimization 1: Work Distribution and Scheduling	Achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing
Oct 09		Program Optimization 2: Locality and Communication	Message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention
Oct 14		GPU Architecture and CUDA Programming	CUDA programming abstractions and how they are implemented on modern GPUs
Oct 16		Data-Parallel Thinking	Data-parallel operations like map, reduce, scan, prefix sum, groupByKey
Oct 21		Efficiently Evaluating DNNs on GPUs: Transformers and ConvNets	Efficiently scheduling DNN layers, mapping convs to matrix multiplication, transformers, layer fusion
Oct 23		Hardware Specialization	Energy-efficient computing, motivation for and design of hardware accelerators, case study on DNN accelerator design
Oct 28		Programming Systems for Specialized Hardware	Modern trends and programming systems for creating specialized hardware
Oct 30		Mapping AI Applications to the Datacenter Computer	How modern AI applications are served at datacenter scale
Nov 04		Democracy Day (no class)	Attend Stanford's many events!
Nov 06		Domain-Specific Programming Systems and AI-Driven Performance Optimization	Domain-specific programming abstractions for writing high-performance code, automatic program optimization, with a focus on optimization driven by AI agents
Nov 11		Cache Coherence	Invalidation-based coherence using MSI and MESI, false sharing
Nov 13		Implementing Synchronization + Memory Consistency	Fine-grained synchronization via locks, motivation for relaxed consistency, implications for programmers
Nov 18		Midterm Exam (no class)	This will be an evening exam, so there's no class
Nov 20		Fine-Grained Locking and Lock-Free Programming	Fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem
Dec 02		Transactional Memory (Part I)	Motivation for transactions, design space of transactional memory implementations, STM and HTM basics
Dec 04		Transactional Memory (Part II) + Ask Me Anything with Kunle and Kayvon	Suggestions for post-CS149 topics, AMA with the course staff
Dec 11		Final Exam	Held from 3:30-6:30pm

Lecture Videos

We cannot distribute lecture videos to the public this year, but videos from a prior offering of the course (2023) are available on Stanford's YouTube channel.

Programming Assignments

Oct 6	Assignment 1: Analyzing Parallel Program Performance on a Quad-Core CPU
Oct 16	Assignment 2: Scheduling Task Graphs on a Multi-Core CPU
Oct 30	Assignment 3: A Circle Renderer in CUDA
Nov 13	Assignment 4: Fused Conv+MaxPool on the Trainium2 Accelerator
Dec 4	Assignment 5: Make the World's Fastest CUDA Kernels

Written Assignments

Oct 9	Written Assignment 1
Oct 21	Written Assignment 2
Nov 6	Written Assignment 3
Dec 3	Written Assignment 4