Stanford CS149, Fall 2025

PARALLEL COMPUTING

This page contains lecture slides and recommended readings for the Fall 2025 offering of CS149.

We cannot distribute lecture videos to the public this year, but videos from a prior version of the course (2023) are available on Stanford's Youtube Channel.

Lecture 1: Why Parallelism? Why Efficiency?

(Challenges of parallelizing code, motivations for parallel chips, processor basics)

Lecture 2: A Modern Multi-Core Processor (Part I)

(Forms of parallelism: multi-core, SIMD, and multi-threading)

Lecture 3: Modern Multi-Core Architecture (Part II) + ISPC Programming Abstractions

(Finish up multi-threaded and latency vs. bandwidth. ISPC programming, abstraction vs. implementation)

Lecture 4: Parallelizing Code: An Example Thought Process

(Process of parallelizing a program in data parallel and shared address space models)

Lecture 5: Program Optimization 1: Work Distribution and Scheduling

(Achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing)

Lecture 6: Program Optimization 2: Locality and Communication

(Message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention)

Lecture 7: GPU Architecture and CUDA Programming

(CUDA programming abstractions, and how they are implemented on modern GPUs)

Lecture 8: Data-Parallel Thinking

(Data-parallel operations like map, reduce, scan, prefix sum, groupByKey)

Lecture 9: Efficiently Evaluating DNNs on GPUs: Transformers and ConvNets

(Efficiently scheduling DNN layers, mapping convs to matrix-multiplication, transformers, layer fusion)

Lecture 10: Hardware Specialization

(Energy-efficient computing, motivation for and design of hardware accelerators. Case study on DNN accelerator design.)

Lecture 11: Programming Systems for Specialized Hardware

(Modern trends and programming systems for creating specialized hardware)

Lecture 12: Mapping AI Applications to the Datacenter Computer

(How modern AI applications are served at datacenter scale)

Lecture 13: Domain-Specific Programming Systems and AI-Driven Performance Optimization

(Domain-specific programming abstractions for writing high-performance code, automatic program optimization, with a focus on optimization driven by AI agents)

Lecture 14: Cache Coherence

(Invalidation-based coherence using MSI and MESI, false sharing)

Lecture 15: Implementing Synchronization + Memory Consistency

(Fine-grained synchronization via locks, motivation for relaxed consistency, implications to programmers.)

Lecture 16: Fine-Grained Locking and Lock-Free Programming

(Fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem)

Lecture 17: Transactional Memory (Part I)

(Motivation for transactions, design space of transactional memory implementations, STM and HTM basics)

Lecture 18: Transactional Memory (Part II) + Ask Me Anything with Kunle and Kayvon

(Suggestions for post cs149 topics. AMA with the course staff.)