Lecture 1: Why Parallelism? Why Efficiency?

(Challenges of parallelizing code, motivations for parallel chips, processor basics)

Further Reading:

The Future of Microprocessors, by K. Olukotun and L. Hammond, ACM Queue 2005
Power: A First-Class Architectural Design Constraint, by Trevor Mudge, IEEE Computer 2001

Lecture 2: A Modern Multi-Core Processor

(Forms of parallelism: multicore, SIMD, threading + understanding latency and bandwidth)

Lecture 3: Parallel Programming Abstractions

(Ways of thinking about parallel programs, and their corresponding hardware implementations, ISPC programming)

Lecture 4: Parallel Programming Basics

(Thought process of parallelizing a program in data parallel and shared address space models)

Lecture 5: Performance Optimization I: Work Distribution and Scheduling

(Achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing)

Lecture 6: Performance Optimization II: Locality, Communication, and Contention

(Message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention)

Lecture 7: GPU Architecture and CUDA Programming

(CUDA programming abstractions, and how they are implemented on modern GPUs)

Lecture 8: Data-Parallel Thinking

(Data-parallel operations like map, reduce, scan, prefix sum, groupByKey)

Lecture 9: Distributed Computing Using Spark

(Producer-consumer locality, RDD abstraction, Spark implementation and scheduling)

Lecture 10: Cache Coherence

(Definition of memory coherence, invalidation-based coherence using MSI and MESI, false sharing)

Lecture 11: Memory Consistency

(Consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics)

Lecture 12: Locks, Fine-Grained Synchronization, and Lock-Free Programming

(Implementation of locks, fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers)

Lecture 13: Transactional Memory

(Motivation for transactions, design space of transactional memory implementations)

Lecture 14: Transactional Memory 2

(Finishing up transactional memory, focusing on implementations of STM and HTM)

Lecture 15: Heterogeneous Parallel Processing

(Energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, mobile SoCs)

Lecture 16: Domain-Specific Programming Languages (Case Study: Halide)

(Performance/productivity motivations for DSLs, case study on Halide image processing DSL)

Lecture 17: Parallel Graph Processing Frameworks + How DRAM Works

(Domain-specific frameworks for graph processing, streaming graph processing, graph compression, DRAM basics)

Lecture 18: Programming for Hardware Specialization

(Programming reconfigurable hardware like FPGAs and CGRAs)

Lecture 19: Efficiently Evaluating DNNs (+ Course Wrap Up)

(Scheduling conv layers, exploiting precision and sparsity, DNN accelerators (e.g., GPU Tensor Cores, TPU))