From smartphones to multi-core CPUs and GPUs to the world's largest supercomputers and websites, parallel processing is ubiquitous in modern computing. The goal of this course is to provide a deep understanding of the fundamental principles and engineering trade-offs involved in designing modern parallel computing systems, as well as to teach the parallel programming techniques necessary to effectively utilize these machines. Because writing good parallel programs requires an understanding of key machine performance characteristics, this course will cover both parallel hardware and software design.
Challenges of parallelizing code, motivations for parallel chips, processor basics
Forms of parallelism: multi-core, SIMD, and multi-threading
Finishing up multi-threading and latency vs. bandwidth; ISPC programming, abstraction vs. implementation
Ways of thinking about parallel programs, and the thought process of parallelizing a program in the data-parallel and shared address space models
Achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing
Message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention
CUDA programming abstractions, and how they are implemented on modern GPUs
Data-parallel operations like map, reduce, scan, prefix sum, groupByKey
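As a rough illustration of these primitives (my own sketch, not course-provided code), the C++ standard library has sequential equivalents of map, reduce, and inclusive scan (prefix sum); a data-parallel framework would apply the same operations across many cores:

```cpp
// Hypothetical sketch: map, reduce, and scan expressed with C++ standard
// algorithms. These run sequentially here, which is enough to show the
// semantics a data-parallel framework would parallelize.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> x = {1, 2, 3, 4};

    // map: apply a function independently to every element
    std::vector<int> squares(x.size());
    std::transform(x.begin(), x.end(), squares.begin(),
                   [](int v) { return v * v; });

    // reduce: combine all elements with an associative operator
    int sum = std::reduce(squares.begin(), squares.end(), 0);

    // scan (prefix sum): element i receives the sum of elements 0..i
    std::vector<int> prefix(x.size());
    std::inclusive_scan(x.begin(), x.end(), prefix.begin());

    std::cout << "sum of squares = " << sum << "\n";   // 30
    std::cout << "prefix[3] = " << prefix[3] << "\n";  // 10
    return 0;
}
```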
Producer-consumer locality, RDD abstraction, Spark implementation and scheduling
Efficiently scheduling DNN layers, mapping convolutions to matrix multiplication, transformers, layer fusion
Definition of memory coherence, invalidation-based coherence using MSI and MESI, false sharing
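To make false sharing concrete, here is a hedged sketch (assuming 64-byte cache lines; not lecture code): two threads increment logically independent counters, but when the counters sit on the same cache line, invalidation-based coherence bounces that line between cores; padding each counter onto its own line removes the traffic.

```cpp
// Hypothetical sketch of false sharing. The counters are logically
// independent, but in the "shared" layout they occupy one cache line,
// so each thread's writes invalidate the line in the other core's cache.
#include <atomic>
#include <thread>

struct SharedLine {
    std::atomic<long> a{0};
    std::atomic<long> b{0};              // same cache line as 'a' -> false sharing
};

struct PaddedLines {
    alignas(64) std::atomic<long> a{0};  // assume 64-byte cache lines
    alignas(64) std::atomic<long> b{0};  // forced onto its own line
};

template <typename Counters>
void run(Counters& c) {
    std::thread t1([&] { for (int i = 0; i < 10'000'000; i++) c.a++; });
    std::thread t2([&] { for (int i = 0; i < 10'000'000; i++) c.b++; });
    t1.join();
    t2.join();
}

int main() {
    SharedLine s;   // typically slower: coherence traffic on one shared line
    PaddedLines p;  // typically faster: each counter owns its line
    run(s);
    run(p);
    return 0;
}
```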
Relaxed consistency models and their motivation, acquire/release semantics
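A minimal sketch of acquire/release semantics (my own illustration, not from the lecture) using C++ atomics: a release store on a flag guarantees that data written before it is visible to any thread whose acquire load observes the flag set.

```cpp
// Hypothetical sketch of release/acquire synchronization with C++ atomics.
// The producer publishes 'data' with a release store; a consumer that
// observes ready == true via an acquire load is guaranteed to see data == 42.
#include <atomic>
#include <cassert>
#include <thread>

int data = 0;                       // plain (non-atomic) payload
std::atomic<bool> ready{false};

void producer() {
    data = 42;                                      // ordinary write
    ready.store(true, std::memory_order_release);   // publish
}

void consumer() {
    while (!ready.load(std::memory_order_acquire))  // wait for publication
        ;                                           // spin
    assert(data == 42);   // acquire load synchronizes with the release store
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}
```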
Democracy Day (no class)
Take time to volunteer/educate yourself/take action!
Fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers
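As one concrete example of this material, here is a hedged sketch (assumptions: exactly one producer thread and one consumer thread, fixed capacity) of a lock-free single-reader/single-writer ring-buffer queue built from C++ atomics:

```cpp
// Hypothetical sketch of a lock-free single-producer/single-consumer queue.
// Correctness relies on exactly one thread calling push() and exactly one
// calling pop(); head_/tail_ are each written by only one of the two threads.
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, size_t Capacity>
class SpscQueue {
    T buf_[Capacity];
    std::atomic<size_t> head_{0};   // next slot to pop  (written by consumer)
    std::atomic<size_t> tail_{0};   // next slot to push (written by producer)

public:
    bool push(const T& v) {         // called only by the producer thread
        size_t t = tail_.load(std::memory_order_relaxed);
        size_t next = (t + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire))
            return false;           // queue full
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);  // publish the element
        return true;
    }

    std::optional<T> pop() {        // called only by the consumer thread
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return std::nullopt;    // queue empty
        T v = buf_[h];
        head_.store((h + 1) % Capacity, std::memory_order_release);
        return v;
    }
};

int main() {
    SpscQueue<int, 8> q;
    q.push(1);
    q.push(2);
    auto a = q.pop();               // yields 1
    return (a && *a == 1) ? 0 : 1;
}
```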
The midterm will be held in the evening on Nov 15th. We will use the class period that day for review.
Performance/productivity motivations for DSLs, case studies on several DSLs
Motivation for transactions, design space of transactional memory implementations.
Finishing up transactional memory, focusing on STM and HTM implementations.
Energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, mobile SoCs
How DRAM works, suggestions for post-cs149 topics
Held at 3:30pm. Location TBD
| Date   | Assignment           |
|--------|----------------------|
| Oct 10 | Written Assignment 1 |
| Oct 26 | Written Assignment 2 |
| Nov 3  | Written Assignment 3 |
| Nov 11 | Written Assignment 4 |
| Dec 6  | Written Assignment 5 |