(Challenges of parallelizing code, motivations for parallel chips, processor basics)

(Forms of parallelism: multi-core, SIMD, and multi-threading)

(Finish up multi-threading and latency vs. bandwidth. ISPC programming, abstraction vs. implementation)

(Structuring parallel programs. Process of parallelizing a program in data parallel and shared address space models)

(Achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing)

(Message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention)

(CUDA programming abstractions, and how they are implemented on modern GPUs)

(Data-parallel operations like map, reduce, scan (prefix sum), groupByKey)

(Producer-consumer locality, RDD abstraction, Spark implementation and scheduling)

(Efficiently scheduling DNN layers, mapping convolutions to matrix multiplication, transformers, layer fusion)

(Energy-efficient computing, motivation for and design of hardware accelerators)

(Modern trends and programming systems for creating specialized hardware)

(Programming hardware, motivation for and definition of memory coherence)

(Invalidation-based coherence using MSI and MESI, false sharing)

(Fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem)

(Motivation for relaxed consistency, implications for programmers. Performance/productivity motivations for DSLs, case studies on several DSLs)

(Motivation for transactions, design space of transactional memory implementations, STM and HTM basics)

(Suggestions for post-CS149 topics. AMA with the course staff.)