### Hardware Acceleration of DNNs

**Visual Computing Systems** Stanford CS348K, Spring 2023

#### **Lecture 8:**

### Hardware acceleration of DNN inference/training



Google TPU3





Intel Deep Learning Inference Accelerator



**Cerebras Wafer Scale Engine** 





### **Investment in AI hardware**

#### SambaNova Systems Raises \$676M in Series D, Surpasses \$5B Valuation and Becomes World's Best-Funded AI Startup

SoftBank Vision Fund 2 leads round backing breakthrough platform that delivers unprecedented AI capability and accessibility to customers worldwide

#### April 13, 2021 09:00 AM Eastern Daylight Time

PALO ALTO, Calif.--(BUSINESS WIRE)--SambaNova Systems, the company building the industry's mo hardware and services to run AI applications, today announced a \$676 million Series D funding round I Fund 2\*. The round includes additional new investors Temasek and GIC, plus existing backers including managed by BlackRock, Intel Capital, GV (formerly Google Ventures), Walden International and WRVI.

"We're here to revolutionize the AI market, and this round greatly accelerates that mission"


This Series D brings SambaNova's total funding and rockets its valuation to more than \$5 billion.

Now the best-funded AI systems and services pl world, SambaNova will use its latest injection to legacy competitors as it continues to shatter the hardware and software currently on the market solutions for private and public sectors more acc

"We're here to revolutionize the AI market, and this round greatly accelerates that mission," said Rodrig founder and CEO. "Traditional CPU and GPU architectures have reached their computational limits. To to solve humanity's greatest technology challenges, a new approach is needed. We've figured out that to see a wealth of prudent investors validate that."

Artificial intelligence chip startup Cerebras Systems claims it has the "world's fastest AI supercomputer," thanks to its large Wafer Scale Engine processor that comes with 400,000 compute cores.

The Los Altos, Calif.-based startup introduced its CS-1 system at the Supercomputing conference in Denver last week after raising more than \$200 million in funding from investors, most recently with an \$88 million Series D round that was raised in November 2018, according to Andrew Feldman, the founder and CEO of Cerebras who was previously an executive at AMD.

SambaNova's flagship offering is Dataflow-as-a-Service (DaaS), a subscription-based, extensible AI services platform designed to jump-start enterprise-level AI initiatives, augmenting organizations' AI capabilities and accelerating the work of existing data centers, allowing the organization to focus on its business objectives instead of infrastructure.



#### AI chipmaker Graphcore raises \$222M at a \$2.77B valuation and puts an IPO in its sights

Ingrid Lunden @ingridlunden / 10:59 PM PST • December 28, 2020




#### **Groq Closes \$300 Million Fundraise**

Wed, April 14, 2021, 6:00 AM · 4 min read


**With Investment Co-Led by Tiger Global Management and D1 Capital, Groq Is Well Capitalized for Accelerated Growth**

MOUNTAIN VIEW, Calif., April 14, 2021 /PRNewswire/ -- Groq Inc., a leading innovator in compute accelerators for artificial intelligence (AI), machine learning (ML) and high performance computing, today announced that it has closed its Series C fundraising. Groq closed \$300 million in new funding, co-led by Tiger Global Management and D1 Capital, with participation from The Spruce House Partnership and Addition, the venture firm founded by Lee Fixel. This round brings Groq's total funding to \$367 million, of which \$300 million has been raised since the second half of 2020, a direct result of strong customer endorsement since the company launched its first product.




Applications based on artificial intelligence — whether they are systems running autonomous services, platforms being used in drug development or to predict the spread of a virus, traffic management for 5G networks or something else altogether require an unprecedented amount of computing power to run. And today, one of the big names in the world of designing and

#### Intel Acquires Artificial Intelligence Chipmaker Habana Labs

#### Combination Advances Intel's AI Strategy, Strengthens Portfolio of AI Accelerators for the Data Center

SANTA CLARA, Calif., Dec. 16, 2019 – Intel Corporation today announced that it has acquired Habana Labs, an Israel-based developer of programmable deep learning accelerators for the data center for approximately \$2 billion. The combination strengthens Intel's artificial intelligence (AI) portfolio and accelerates its efforts in the nascent, fast-growing AI silicon market, which Intel expects to be greater than \$25 billion by 2024<sup>1</sup>.

"This acquisition advances our AI strategy, which is to provide customers with solutions to fit every performance need - from the intelligent edge to the data center," said Navin Shenoy, executive vice president and general manager of the Data Platforms Group at Intel. "More specifically, Habana turbo-charges our AI offerings for the data center with a high-performance training processor family and a standards-based programming environment to address evolving AI workloads."

### Two computer architecture reminders (review, one more time)



### **Compute specialization = energy efficiency**

- Rules of thumb: compared to high-quality C code on CPU...
- Throughput-maximized processor architectures: e.g., GPU cores
  - Approximately 10x improvement in perf / watt
  - Assuming code maps well to wide data-parallel execution and is compute bound
- Fixed-function ASIC ("application-specific integrated circuit")
  - Can approach 100-1000x or greater improvement in perf/watt
  - Assuming code is compute bound and is not floating-point math

[Source: Chung et al. 2010, Dally 08]



### Data movement has high energy cost

#### "Ballpark" numbers

- Integer op: ~ 1 pJ\*
- Floating point op: ~20 pJ\*
- Reading 64 bits from small local SRAM (1mm away on chip): ~ 26 pJ
- Reading 64 bits from low power mobile DRAM (LPDDR): ~1200 pJ

#### Implications

- Reading 10 GB/sec from memory: ~1.6 watts
- Entire power budget for mobile GPU: ~1 watt (remember phone is also running CPU, display, radios, etc.)
- iPhone 6 battery: ~7 watt-hours (note: my MacBook Pro laptop: 99 watt-hour battery)
- Exploiting locality matters!!!

[Sources: Bill Dally (NVIDIA), Tom Olson (ARM)]

\* Cost to just perform the logical operation, not counting overhead of instruction decode, load data from registers, etc.

Rule of thumb in modern system design: always seek to reduce amount of data movement in a computer
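The "Implications" arithmetic above can be reproduced from the ballpark numbers. The following is a minimal sketch: the function names are illustrative, not from the lecture, and the only inputs are the per-operation energies quoted on the slide. With decimal gigabytes the DRAM read power comes out at about 1.5 W; counting 10 GB as 10 x 2^30 bytes gives about 1.6 W, matching the slide.

```python
PJ = 1e-12  # joules per picojoule

# Ballpark energy costs quoted on the slide, in picojoules.
FP_OP_PJ = 20.0              # one floating point op
SRAM_READ_64BIT_PJ = 26.0    # 64-bit read from small local SRAM
LPDDR_READ_64BIT_PJ = 1200.0 # 64-bit read from LPDDR DRAM


def read_power_watts(bytes_per_sec, pj_per_64bit_read):
    """Power drawn by streaming reads at the given rate (64 bits = 8 bytes per read)."""
    reads_per_sec = bytes_per_sec / 8.0
    return reads_per_sec * pj_per_64bit_read * PJ


def hours_on_battery(battery_watt_hours, watts):
    """How long a battery lasts if the given power draw were the only load."""
    return battery_watt_hours / watts


dram_watts = read_power_watts(10e9, LPDDR_READ_64BIT_PJ)  # 10 GB/sec from DRAM
sram_watts = read_power_watts(10e9, SRAM_READ_64BIT_PJ)   # same rate from SRAM
flops_per_dram_read = LPDDR_READ_64BIT_PJ / FP_OP_PJ      # energy-equivalent FP ops

print(f"DRAM: {dram_watts:.2f} W, SRAM: {sram_watts:.4f} W")
print(f"One 64-bit DRAM read costs as much energy as {flops_per_dram_read:.0f} FP ops")
print(f"7 Wh battery at DRAM read power alone: {hours_on_battery(7.0, dram_watts):.1f} h")
```

Under these numbers, serving the same traffic from on-chip SRAM instead of DRAM costs roughly 46x less energy, which is the quantitative version of "exploiting locality matters".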



### **On-chip caches locate data near processing**



Processors run efficiently when data is resident in caches

- **Caches reduce memory access latency \***
- **Caches reduce the energy cost of data access**

\* Caches also provide high bandwidth data transfer to CPU



### Memory stacking locates memory near chip

Example: NVIDIA A100 GPU

- Up to 80 GB HBM2 stacked memory
- 2 TB/sec memory bandwidth

Also note: A100 has 40 MB L2 cache (increased from 6.1 MB on V100)
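To see why even 2 TB/sec of bandwidth still calls for on-chip reuse, one can estimate how many arithmetic operations a kernel must perform per byte fetched from memory to keep the ALUs busy. This is a rough sketch: the function name is illustrative, and the peak throughput figures (19.5 TFLOPs fp32, 312 TFLOPs in tensor cores) are the A100 numbers quoted elsewhere in these slides.

```python
def flops_per_byte_to_saturate(peak_flops_per_sec, mem_bytes_per_sec):
    """Minimum arithmetic intensity (FLOPs per byte read) needed to be compute bound."""
    return peak_flops_per_sec / mem_bytes_per_sec


HBM_BYTES_PER_SEC = 2e12   # 2 TB/sec stacked memory bandwidth
FP32_PEAK_FLOPS = 19.5e12  # fp32 ALUs
TENSOR_PEAK_FLOPS = 312e12 # tensor cores, fp16/32 mixed

fp32_intensity = flops_per_byte_to_saturate(FP32_PEAK_FLOPS, HBM_BYTES_PER_SEC)
tensor_intensity = flops_per_byte_to_saturate(TENSOR_PEAK_FLOPS, HBM_BYTES_PER_SEC)

print(f"fp32 ALUs need ~{fp32_intensity:.2f} FLOPs per byte from memory")
print(f"tensor cores need ~{tensor_intensity:.0f} FLOPs per byte from memory")
```

Any kernel with lower arithmetic intensity than these values is limited by memory bandwidth rather than compute, which is one reason the larger L2 cache matters.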







### Improving hardware efficiency for DNN operations



### **Efficiency estimates \***

- Estimated overhead of programmability (instruction stream, control, etc.)
  - Half-precision FMA (fused multiply-add): 2000%
  - Half-precision DP4 (vec4 dot product): 500%
  - Half-precision 4x4 MMA (matrix-matrix multiply + accumulate): 27%

\* Estimates by Bill Dally using academic numbers, SysML talk, Feb 2018
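One way to read the three overhead figures (2000%, 500%, 27%) is as the energy spent on instruction handling relative to the energy of the arithmetic itself. Under that assumption, which is an interpretation and not stated on the slide, the fraction of energy going to useful math is 1 / (1 + overhead). The sketch below also lists the math ops each instruction performs (counting a multiply-add as 2 ops), which shows how larger instructions amortize the fixed per-instruction cost.

```python
def useful_energy_fraction(overhead_percent):
    """Fraction of total energy spent on arithmetic, assuming overhead is
    expressed as a percentage of the arithmetic energy (an assumption)."""
    return 1.0 / (1.0 + overhead_percent / 100.0)


# (instruction, overhead % from the slide, math ops per instruction)
INSTRUCTIONS = [
    ("Half-precision FMA", 2000.0, 2),               # 1 multiply + 1 add
    ("Half-precision DP4", 500.0, 8),                # 4 multiplies + 4 adds
    ("Half-precision 4x4 MMA", 27.0, 4 * 4 * 4 * 2), # 64 multiply-adds
]

fractions = {
    name: useful_energy_fraction(overhead)
    for name, overhead, _ops in INSTRUCTIONS
}

for name, overhead, ops in INSTRUCTIONS:
    print(f"{name}: {ops} ops/instr, {overhead:.0f}% overhead, "
          f"{fractions[name]:.1%} of energy in useful math")
```

Read this way, a scalar FMA spends under 5% of its energy on math, while the matrix instruction spends close to 80%.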



Features a Computer Vision Accelerator (CVA), a custom module for deep learning acceleration (large matrix multiply unit)

~2x more efficient than the NVIDIA V100 MMA instruction, despite being a highly specialized component (this includes the optimization of gating multipliers if either operand is zero).



### **Ampere GPU SM (A100)**

Each SM core has:

- 64 fp32 ALUs (mul-add)
- 32 int32 ALUs
- 4 "tensor cores"
  - Execute 8x4 x 4x8 matrix mul-add instr
  - A x B + C for matrices A, B, C
  - A, B stored as fp16, accumulation with fp32 C
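The semantics of the tensor core operation can be sketched in plain Python. This is a functional model only, not how the hardware or any NVIDIA API implements it: `to_fp16` and `mma_8x4_4x8` are illustrative names, inputs are rounded to half precision with the standard library's `struct` "e" format, and Python's 64-bit floats stand in for the fp32 accumulator.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


# Multiply-adds performed by one 8x4 x 4x8 instruction.
FMAS_PER_INSTR = 8 * 4 * 8


def mma_8x4_4x8(a, b, c):
    """Return D = A x B + C for an 8x4 A, a 4x8 B, and an 8x8 C.

    A and B are rounded to fp16 before multiplying; products are
    accumulated at higher precision on top of C.
    """
    assert len(a) == 8 and all(len(row) == 4 for row in a)
    assert len(b) == 4 and all(len(row) == 8 for row in b)
    assert len(c) == 8 and all(len(row) == 8 for row in c)
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    d = [row[:] for row in c]  # start from the accumulator C
    for i in range(8):
        for j in range(8):
            for k in range(4):
                d[i][j] += a16[i][k] * b16[k][j]
    return d


a = [[1.0] * 4 for _ in range(8)]
b = [[1.0] * 8 for _ in range(4)]
c = [[0.5] * 8 for _ in range(8)]
d = mma_8x4_4x8(a, b, c)
print(d[0][0], FMAS_PER_INSTR)
```

A single instruction therefore carries 256 multiply-adds, which is how the per-instruction overhead is amortized over far more arithmetic than a scalar FMA.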

There are 108 SM cores in the GA100 GPU:

- 6,912 fp32 mul-add ALUs
- 432 tensor cores
- 1.4 GHz max clock
- = 19.5 TFLOPs fp32 + 312 TFLOPs (fp16/32 mixed) in tensor cores
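The peak throughput figures follow directly from the unit counts and the clock. The sketch below assumes a multiply-add counts as 2 FLOPs, that each tensor core completes one 8x4 x 4x8 multiply-add per clock, and a 1.41 GHz clock (the slide rounds this to 1.4 GHz; with exactly 1.4 GHz the results are about 19.4 and 310 TFLOPs). Names are illustrative.

```python
SM_COUNT = 108
FP32_ALUS_PER_SM = 64
TENSOR_CORES_PER_SM = 4
FLOPS_PER_FMA = 2                 # one multiply + one add
FMAS_PER_TENSOR_OP = 8 * 4 * 8    # 8x4 times 4x8 matrix multiply-add
CLOCK_HZ = 1.41e9                 # assumption: slide quotes ~1.4 GHz


def peak_tflops(units, flops_per_unit_per_clock, clock_hz):
    """Peak throughput in TFLOPs for a number of identical units."""
    return units * flops_per_unit_per_clock * clock_hz / 1e12


fp32_alus = SM_COUNT * FP32_ALUS_PER_SM
tensor_cores = SM_COUNT * TENSOR_CORES_PER_SM

fp32_peak = peak_tflops(fp32_alus, FLOPS_PER_FMA, CLOCK_HZ)
tensor_peak = peak_tflops(
    tensor_cores, FMAS_PER_TENSOR_OP * FLOPS_PER_FMA, CLOCK_HZ
)

print(f"{fp32_alus} fp32 ALUs -> {fp32_peak:.1f} TFLOPs")
print(f"{tensor_cores} tensor cores -> {tensor_peak:.0f} TFLOPs")
```

The 16x gap between the two peaks comes entirely from the amount of arithmetic each unit performs per clock: 512 FLOPs per tensor core versus 2 per fp32 ALU, with 16 times fewer tensor cores than ALUs.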

[Figure: A100 SM block diagram. The SM has a shared L1 instruction cache. Each sub-partition contains an L0 instruction cache, a warp scheduler (32 thread/clk), a dispatch unit (32 thread/clk), a register file (16,384 x 32-bit), columns of INT32, FP32, and FP64 ALUs, and a tensor core.]
                                      |  |  |  |  |  |  |  |
| INT32 INT32 FP32 FP32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | FP64                                                                                                                           |                                                                                            | INT32 INT32 FP32 FP32 FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                      |  |  |  |  |  |  |  |
| INT32 INT32 FP32 FP32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | FP64                                                                                                                           |                                                                                            | INT32 INT32 FP32 FP32 FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                      |  |  |  |  |  |  |  |
| LD/ LD/ LD/ LD/<br>ST ST ST ST                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | LD/ LD/<br>ST ST                                                                                                               | LD/ LD/ SFU                                                                                | LD/ LD/ LD/ LD/ LD/ LD/ LD/ LD/ ST                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
  |  |  |  |  |  |  |  |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |                                                                                                                                |                                                                                            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                      |  |  |  |  |  |  |  |
| L0 In<br>Warp Sch<br>Dispatch<br>Register                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | nstruction C<br>neduler (32 th<br>h Unit (32 th<br>File (16,384                                                                | ache<br>hread/clk)<br>read/clk)<br>4 x 32-bit)                                             | L0 Instruction Cache<br>Warp Scheduler (32 thread/clk)<br>Dispatch Unit (32 thread/clk)<br>Register File (16,384 x 32-bit)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                      |  |  |  |  |  |  |  |
| L0 In<br>Warp Sch<br>Dispatch<br>Register                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | nstruction C<br>neduler (32 th<br>h Unit (32 th<br>File (16,384<br>FP64                                                        | ache<br>hread/clk)<br>read/clk)<br>4 x 32-bit)                                             | L0 Instruction Cache<br>Warp Scheduler (32 thread/clk)<br>Dispatch Unit (32 thread/clk)<br>Register File (16,384 x 32-bit)<br>INT32 INT32 FP32 FP32 FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                      |  |  |  |  |  |  |  |
| L0 In<br>Warp Sch<br>Dispatch<br>Register<br>INT32 INT32 FP32 FP32<br>INT32 INT32 FP32 FP32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | nstruction C<br>neduler (32 th<br>h Unit (32 th<br>File (16,384<br>FP64<br>FP64                                                | ache<br>hread/clk)<br>read/clk)<br>4 x 32-bit)                                             | L0 Instruction Cache         Warp Scheduler (32 thread/clk)         Dispatch Unit (32 thread/clk)         Register File (16,384 x 32-bit)         INT32 INT32       FP32       FP64         INT32 INT32       FP32       FP64         INT32 INT32       FP32       FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                      |  |  |  |  |  |  |  |
| L0 In<br>Warp Sch<br>Dispatch<br>Register<br>INT32 INT32 FP32 FP32<br>INT32 INT32 FP32 FP32<br>INT32 INT32 FP32 FP32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | nstruction C<br>neduler (32 th<br>h Unit (32 th<br>File (16,384<br>FP64<br>FP64<br>FP64                                        | ache<br>hread/clk)<br>read/clk)<br>4 x 32-bit)                                             | L0 Instruction CacheWarp Scheduler (32 thread/clk)Dispatch Unit (32 thread/clk)Register File (16,384 x 32-bit)INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                      |  |  |  |  |  |  |  |
| L0 In<br>Warp Sch<br>Dispatch<br>Register<br>INT32 INT32 FP32 FP32<br>INT32 INT32 FP32 FP32<br>INT32 INT32 FP32 FP32<br>INT32 INT32 FP32 FP32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | nstruction C<br>neduler (32 th<br>h Unit (32 th<br>File (16,384<br>FP64<br>FP64<br>FP64<br>FP64                                | ache<br>hread/clk)<br>read/clk)<br>4 x 32-bit)                                             | L0 Instruction CacheWarp Scheduler (32 thread/clk)Dispatch Unit (32 thread/clk)Dispatch Unit (32 thread/clk)Register File (16,384 x 32-bit)INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
                                      |  |  |  |  |  |  |  |
| L0 In<br>Warp Sch<br>Dispatch<br>Dispatch<br>Register<br>INT32 INT32 FP32 FP32<br>INT32 INT32 FP32 FP32<br>INT32 INT32 FP32 FP32<br>INT32 INT32 FP32 FP32<br>INT32 INT32 FP32 FP32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | nstruction C<br>neduler (32 th<br>h Unit (32 th<br>File (16,384<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64                        | ache<br>hread/clk)<br>read/clk)<br>4 x 32-bit)<br>TENSOR CORE                              | L0 Instruction CacheWarp Scheduler (32 thread/clk)Dispatch Unit (32 thread/clk)Dispatch Unit (32 thread/clk)Register File (16,384 x 32-bit)INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                      |  |  |  |  |  |  |  |
| L0 In<br>Warp Sch<br>Dispatch<br>Dispatch<br>Register<br>INT32 INT32 FP32 FP32<br>INT32 INT32 FP32 FP32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | nstruction C<br>heduler (32 th<br>h Unit (32 th<br>File (16,384<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64        | ache<br>hread/clk)<br>read/clk)<br>4 x 32-bit)<br>TENSOR CORE                              | L0 Instruction CacheWarp Scheduler (32 thread/clk)Dispatch Unit (32 thread/clk)Dispatch Unit (32 thread/clk)Register File (16,384 x 32-bit)INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64INT32 INT32FP32FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |  |  |  |  |  |  |  |
| L0 In<br>Warp Sch<br>Dispatch<br>Dispatch<br>Register<br>INT32 INT32 FP32 FP32<br>INT32 INT32 FP32 FP32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | nstruction C<br>neduler (32 th<br>h Unit (32 th<br>File (16,384<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP6 | ache<br>hread/clk)<br>read/clk)<br>4 x 32-bit)<br>TENSOR CORE                              | L0 Instruction CacheWarp Scheduler (32 thread/clk)Dispatch Unit (32 thread/clk)Dispatch Unit (32 thread/clk)Register File (16,384 x 32-bit)INT32 INT32FP32FP64INT32 INT32FP32FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |  |  |  |  |  |  |  |
| L0 In<br>Warp Sch<br>Dispatch<br>Dispatch<br>Register<br>INT32 INT32 FP32 FP32<br>INT32 INT32 FP32 FP32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | nstruction C<br>neduler (32 th<br>h Unit (32 th<br>File (16,384<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP6 | ache<br>hread/clk)<br>read/clk)<br>4 x 32-bit)<br>TENSOR CORE                              | L0 Instruction Cache           Warp Scheduler (32 thread/clk)           Dispatch Unit (32 thread/clk)           Dispatch Unit (32 thread/clk)           Register File (16,384 x 32-bit)           INT32 INT32           INT32 FP32 FP32           FP64           INT32 FP32 FP32           FP64           INT32 FP32 FP32 FP64           INT32 FP32 FP32 FP64           INT32 FP32 FP32 FP64           INT32 FP32 FP32 FP64           INT32 INT32 FP32 FP32 FP64 |  |  |  |  |  |  |  |
| L0 In           Warp Sch           Dispatch           Dispatch           Register           INT32 INT32         FP32         FP32           INT32         INT32         FP32         FP32           INT33         INT32         FP33         FP33            INT32         FP33< | nstruction C<br>neduler (32 th<br>h Unit (32 th<br>File (16,384<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP6 | ache<br>hread/clk)<br>read/clk)<br>4 x 32-bit)<br>TENSOR CORE                              | L0 Instruction CacheWarp Scheduler (32 thread/clk)Dispatch Unit (32 thread/clk)Dispatch Unit (32 thread/clk)Register File (16,384 x 32-bit)INT32 INT32FP32FP64INT32 INT32FP32FP64                                                                                                                                                                                                                                                                                         |  |  |  |  |  |  |  |
| L0 In<br>Warp Sch<br>Dispatch<br>Dispatch<br>Register<br>INT32 INT32 FP32 FP32<br>INT32 INT32 FP32 FP32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | nstruction C<br>neduler (32 th<br>h Unit (32 th<br>File (16,384<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP6 | ache<br>hread/clk)<br>read/clk)<br>4 x 32-bit)<br>TENSOR CORE<br>LD/ LD/ SFU<br>ST LD/ SFU | L0 Instruction CacheWarp Scheduler (32 thread/clk)Dispatch Unit (32 thread/clk)Dispatch Unit (32 thread/clk)Register File (16,384 x 32-bit)INT32 INT32FP32FP64INT32 INT33FP32FP32INT32 INT34FP32FP32INT35FI3SFU                                                                                                   |  |  |  |  |  |  |  |

| L1 Instruction Cache                |           |           |           |           |           |           |                                 |                                |       |                      |           |           |           |           |           |             |            |    |    |  |
|-------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|---------------------------------|--------------------------------|-------|----------------------|-----------|-----------|-----------|-----------|-----------|-------------|------------|----|----|--|
| L0 Instruction Cache                |           |           |           |           |           |           |                                 |                                |       | L0 Instruction Cache |           |           |           |           |           |             |            |    |    |  |
| Warp Scheduler (32 thread/clk) |  |  |  |  |  |  |  |  |  | Warp Scheduler (32 thread/clk) |  |  |  |  |  |  |  |  |  |  |
| Dispatch Unit (32 thread/clk) |  |  |  |  |  |  |  |  |  | Dispatch Unit (32 thread/clk) |  |  |  |  |  |  |  |  |  |  |
| Register File (16,384 x 32-bit) |  |  |  |  |  |  |  |  |  | Register File (16,384 x 32-bit) |  |  |  |  |  |  |  |  |  |  |
| INT32 IN                            | IT32      | FP32      | FP32      | FP        | 64        |           |                                 |                                |       | INT32                | INT32     | FP32      | FP32      | FP        | 64        |             |            |    |    |  |
| INT32 IN                            | IT32      | FP32      | FP32      | FP        | 64        |           |                                 |                                | INT32 | INT32                | FP32      | FP32      | FP        | 64        |           |             |            |    |    |  |
| INT32 IN                            | IT32      | FP32      | FP32      | FP        | 64        |           |                                 |                                |       | INT32                | INT32     | FP32      | FP32      | FP        | 64        |             |            |    |    |  |
| INT32 IN                            | 1T32      | FP32      | FP32      | FP        | 64        |           |                                 |                                |       |                      | INT32     | FP32      | FP32      | FP        | 64        | т           | TENSOR COR |    |    |  |
| INT32 IN                            | IT32      | FP32      | FP32      | FP        | 64        |           |                                 | COORL                          |       | INT32                | INT32     | FP32      | FP32      | FP        | 64        |             |            |    |    |  |
| INT32 IN                            | IT32      | FP32      | FP32      | FP        | 64        |           |                                 |                                |       | INT32                | INT32     | FP32      | FP32      | FP        | 64        |             |            |    |    |  |
| INT32 IN                            | IT32      | FP32      | FP32      | FP        | 64        |           |                                 |                                |       |                      | INT32     | FP32      | FP32      | FP        | 64        |             |            |    |    |  |
| INT32 IN                            | IT32      | FP32      | FP32      | FP        | 64        |           |                                 |                                |       | INT32                | INT32     | FP32      | FP32      | FP        | 64        |             |            |    |    |  |
| LD/<br>ST                           | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST                       | SFU                            |       | LD/<br>ST            | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST   | LD/<br>ST  | s  | FU |  |
| L0 Instruction Cache                |           |           |           |           |           |           |                                 |                                |       |                      |           | L0 Ir     | nstruc    | tion C    | ache      |             |            |    |    |  |
| Warp Scheduler (32 thread/clk)      |           |           |           |           |           | - 11      | Warp Scheduler (32 thread/clk)  |                                |       |                      |           |           |           |           |           |             |            |    |    |  |
| Dispatch Unit (32 thread/clk)       |           |           |           |           |           |           | Dispatch Unit (32 thread/clk)   |                                |       |                      |           |           |           |           |           |             |            |    |    |  |
| Register File (16,384 x 32-bit)     |           |           |           |           |           |           | Register File (16,384 x 32-bit) |                                |       |                      |           |           |           |           |           |             |            |    |    |  |
| INT32 IN                            | 1T32      | FP32      | FP32      | FP        | 64        |           |                                 |                                |       | INT32                | INT32     | FP32      | FP32      | FP        | 964       |             |            |    |    |  |
| INT32 IN                            | 1T32      | FP32      | FP32      | FP        | 64        |           |                                 |                                | INT32 | INT32                | FP32      | FP32      | FP        | 64        |           |             |            |    |    |  |
| INT32 IN                            | IT32      | FP32      | FP32      | FP        | 64        |           |                                 |                                |       | INT32                | INT32     | FP32      | FP32      | FP        | 964       | TENSOR CORE |            |    |    |  |
| INT32 IN                            | IT32      | FP32      | FP32      | FP        | 64        | т         | ENSO                            | CORE                           |       | INT32                | INT32     | FP32      | FP32      | FP        | 964       |             |            | RE |    |  |
| INT32 IN                            | 1T32      | FP32      | FP32      | FP        | 64        |           |                                 | CORE                           |       | INT32                | INT32     | FP32      | FP32      | FP        | 64        |             |            |    |    |  |
| INT32 IN                            | IT32      | FP32      | FP32      | FP        | 64        |           |                                 |                                |       | INT32                | INT32     | FP32      | FP32      | FP        | 964       |             |            |    |    |  |
| INT32 IN                            | IT32      | FP32      | FP32      | FP        | 64        |           |                                 |                                |       | INT32                | INT32     | FP32      | FP32      | FP        | 964       |             |            |    |    |  |
| INT32 IN                            | IT32      | FP32      | FP32      | FP        | 64        |           |                                 |                                |       | INT32                | INT32     | FP32      | FP32      | FP        | 64        |             |            |    |    |  |
| LD/<br>ST                           | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST                       | SFU                            |       | LD/<br>ST            | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST   | LD/<br>ST  | S  | FU |  |
| 192KB L1 Data Cache / Shared Memory |           |           |           |           |           |           |                                 |                                |       |                      |           |           |           |           |           |             |            |    |    |  |
|                                     |           |           |           | _         |           |           |                                 |                                |       |                      |           |           |           |           |           |             |            |    |    |  |

#### Single instruction to perform 2x8x4x8 FP16 + 8x8 TF32 ops

The NVIDIA tensor core approach is an evolutionary design: add DNN-specific instructions to a traditional programmable processor ("evolve, don't replace")
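A tensor-core-style instruction can be pictured as a small fused matrix multiply-accumulate, D = A·B + C, on register tiles. The sketch below is a minimal illustration in plain Python, assuming FP16 inputs with FP32 accumulation. The 4x4 tile size and the helper names `to_fp16`, `to_fp32`, and `mma_tile` are assumptions made for illustration, not NVIDIA's API or actual tile shape.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def mma_tile(A, B, C):
    """D = A @ B + C with FP16 inputs and FP32 accumulation (illustrative)."""
    n, k, m = len(A), len(B), len(B[0])
    D = [[to_fp32(C[i][j]) for j in range(m)] for i in range(n)]
    for i in range(n):
        for j in range(m):
            acc = D[i][j]
            for p in range(k):
                prod = to_fp16(A[i][p]) * to_fp16(B[p][j])  # inputs rounded to FP16
                acc = to_fp32(acc + prod)                    # running sum kept at FP32
            D[i][j] = acc
    return D

# Hypothetical 4x4 tiles: multiplying by the identity exposes the FP16 rounding of A.
I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A = [[0.1 * (i + j) for j in range(4)] for i in range(4)]
C = [[1.0] * 4 for _ in range(4)]
D = mma_tile(A, I4, C)
```

The point of the single instruction is that all of these multiplies and adds are issued with one instruction fetch and decode, instead of one per scalar operation.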

# **Google TPU** (version 1)



### Google's TPU (v1)



#### Figure credit: Jouppi et al. 2017

![](_page_12_Picture_4.jpeg)

# **TPU area proportionality**

![](_page_13_Figure_1.jpeg)

Figure credit: Jouppi et al. 2017

Arithmetic units ~ 30% of chip

Note low area footprint of control

**Key instructions:**

- read host memory
- write host memory
- read weights
- matrix\_multiply / convolve
- activate

![](_page_13_Picture_6.jpeg)

![](_page_13_Picture_7.jpeg)
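To make the instruction list concrete, here is a toy model of how one fully connected layer might be expressed as a sequence of these five instructions. It is only a sketch: the `state` dictionary, the choice of ReLU for `activate`, and the function name `run_layer` are assumptions for illustration, not the TPU's actual ISA encoding or semantics.

```python
def run_layer(host_in, host_weights):
    """Execute one fully connected layer as a sequence of the five key instructions."""
    state = {}
    program = [
        ("read_host_memory", host_in),     # host -> on-chip activation buffer
        ("read_weights", host_weights),    # weight memory -> matrix unit
        ("matrix_multiply", None),         # matrix unit -> accumulators
        ("activate", None),                # nonlinearity applied to accumulators
        ("write_host_memory", None),       # results -> host
    ]
    host_out = None
    for op, arg in program:
        if op == "read_host_memory":
            state["acts"] = list(arg)
        elif op == "read_weights":
            state["W"] = [list(row) for row in arg]
        elif op == "matrix_multiply":
            state["acc"] = [sum(w * x for w, x in zip(row, state["acts"]))
                            for row in state["W"]]
        elif op == "activate":
            state["acts"] = [max(0, v) for v in state["acc"]]  # ReLU (assumed)
        elif op == "write_host_memory":
            host_out = list(state["acts"])
    return host_out

result = run_layer([1, 2, 3], [[1, 0, -1], [2, 1, 0]])  # [0, 4]
```

Each instruction does a large amount of work, so very little control logic is needed per arithmetic operation.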

#### (matrix vector multiplication example: *y*=*Wx*)

![](_page_14_Figure_2.jpeg)

Accumulators (32-bit)

![](_page_14_Picture_5.jpeg)
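The step-by-step frames can be mimicked in software. The sketch below simulates one common weight-stationary arrangement for *y*=*Wx*, assuming each PE holds one weight, inputs move right, and partial sums move down into the accumulators. The grid orientation and the name `systolic_matvec` are assumptions; the figures may use a different layout.

```python
def systolic_matvec(W, x):
    """Compute y = W x on a weight-stationary grid, one wavefront per cycle.

    PE (r, c) permanently holds W[c][r]. Input x[r] enters row r at cycle r and
    moves one PE to the right per cycle. Partial sums move one PE down per cycle,
    so PE (r, c) is active exactly at cycle r + c.
    """
    n_out, n_in = len(W), len(x)
    psum = [[0] * n_out for _ in range(n_in)]   # partial sum leaving each PE
    total_cycles = n_in + n_out - 1
    for cycle in range(total_cycles):
        for r in range(n_in):
            c = cycle - r
            if 0 <= c < n_out:
                from_above = psum[r - 1][c] if r > 0 else 0  # produced last cycle
                psum[r][c] = from_above + W[c][r] * x[r]
    # The bottom row of PEs feeds the accumulators.
    return [psum[n_in - 1][c] for c in range(n_out)], total_cycles

y, cycles = systolic_matvec([[1, 2], [3, 4], [5, 6]], [1, 1])  # y = [3, 7, 11]
```

Every PE only talks to its neighbors, and each input value is read from memory once and then reused by a whole row of PEs.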

![](_page_15_Figure_2.jpeg)

![](_page_15_Picture_5.jpeg)

![](_page_16_Figure_2.jpeg)

![](_page_16_Picture_5.jpeg)

![](_page_17_Figure_2.jpeg)

![](_page_17_Picture_5.jpeg)

![](_page_18_Figure_2.jpeg)

![](_page_18_Picture_5.jpeg)

![](_page_19_Figure_2.jpeg)

![](_page_19_Picture_5.jpeg)

#### (matrix matrix multiplication example: *Y*=*WX*)

![](_page_20_Figure_2.jpeg)

#### Notice: need multiple 4x32bit accumulators to hold output columns

Accumulators (32-bit)

![](_page_20_Picture_6.jpeg)
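For *Y*=*WX*, columns of *X* stream through the array back to back, so several output columns are in flight at once and each needs its own accumulator vector. A minimal functional sketch follows (not cycle-accurate; the name `matmul_with_accumulators` is an assumption):

```python
def matmul_with_accumulators(W, X):
    """Y = W X, streaming one column of X at a time through a fixed weight array."""
    n_out, n_in, n_cols = len(W), len(X), len(X[0])
    # One accumulator vector (n_out entries, 32-bit in hardware) per column of X.
    accumulators = [[0] * n_out for _ in range(n_cols)]
    for col in range(n_cols):            # columns enter the array back to back
        for r in range(n_in):            # one input element per array row
            for c in range(n_out):       # weights W[c][r] never move
                accumulators[col][c] += W[c][r] * X[r][col]
    # Transpose so that Y[c][col] matches the usual matrix layout.
    return [[accumulators[col][c] for col in range(n_cols)] for c in range(n_out)]

Y = matmul_with_accumulators([[1, 2], [3, 4]], [[1, 2], [3, 4]])  # [[7, 10], [15, 22]]
```

With a 4-row output as in the slide, each entry of `accumulators` corresponds to one 4x32-bit accumulator group.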

![](_page_21_Figure_1.jpeg)

![](_page_21_Figure_3.jpeg)

![](_page_21_Picture_5.jpeg)

![](_page_22_Figure_1.jpeg)

![](_page_22_Figure_3.jpeg)

![](_page_22_Picture_5.jpeg)

![](_page_23_Figure_1.jpeg)

![](_page_23_Figure_3.jpeg)

![](_page_23_Picture_5.jpeg)

![](_page_24_Figure_1.jpeg)

![](_page_24_Figure_3.jpeg)

![](_page_24_Picture_5.jpeg)

### **TPU Performance/Watt**

![](_page_25_Figure_1.jpeg)

- GM = geometric mean over all apps
- WM = weighted mean over all apps
- total = cost of host machine + CPU
- incremental = only cost of TPU

![](_page_25_Picture_6.jpeg)
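As a reminder of how the chart's summary statistics are formed, here is a small sketch. All numbers are hypothetical, and weighting each app by its share of the workload is an assumption about how WM is defined.

```python
import math

def geometric_mean(values):
    """GM: n-th root of the product, computed via logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def weighted_mean(values, weights):
    """WM: arithmetic mean with per-app weights (assumed: share of the workload)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def perf_per_watt(perf, accel_watts, host_watts=0.0):
    """host_watts > 0 gives the 'total' metric; host_watts = 0 gives 'incremental'."""
    return perf / (accel_watts + host_watts)

relative_perf_per_watt = [1.0, 4.0, 16.0]   # hypothetical per-app results
workload_share = [0.5, 0.25, 0.25]          # hypothetical workload mix

gm = geometric_mean(relative_perf_per_watt)                  # about 4.0
wm = weighted_mean(relative_perf_per_watt, workload_share)   # 5.5
total = perf_per_watt(1000.0, accel_watts=40.0, host_watts=160.0)   # 5.0
incremental = perf_per_watt(1000.0, accel_watts=40.0)               # 25.0
```

The gap between `total` and `incremental` shows why the two bars can differ so much: the host's power is charged to the accelerator in one case and not in the other.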

# **Alternative scheduling strategies**

#### **Psum = partial sum**

![](_page_26_Figure_2.jpeg)

![](_page_26_Figure_4.jpeg)

(b) Output Stationary

![](_page_26_Figure_6.jpeg)

(c) No Local Reuse

Figure credit: Sze et al. 2017

![](_page_26_Figure_9.jpeg)

![](_page_26_Figure_10.jpeg)
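The strategies differ mainly in which operand stays put inside a PE while the others stream past. For a dense 1D convolution this amounts to a choice of loop order, sketched below. The function names are illustrative, and real designs tile these loops across many PEs.

```python
def conv1d_weight_stationary(x, w):
    """Outer loop over weights: each weight is fetched once and reused for every output."""
    n_out = len(x) - len(w) + 1
    out = [0] * n_out
    for r, weight in enumerate(w):        # weight held in the PE's register
        for i in range(n_out):            # partial sums (psums) are read and updated
            out[i] += weight * x[i + r]
    return out

def conv1d_output_stationary(x, w):
    """Outer loop over outputs: each psum stays in the PE until it is complete."""
    n_out = len(x) - len(w) + 1
    out = []
    for i in range(n_out):
        psum = 0                          # psum held in the PE's register
        for r, weight in enumerate(w):    # weights and inputs stream past
            psum += weight * x[i + r]
        out.append(psum)
    return out

ws = conv1d_weight_stationary([1, 2, 3, 4], [1, 0, -1])   # [-2, -2]
os_ = conv1d_output_stationary([1, 2, 3, 4], [1, 0, -1])  # [-2, -2]
```

Both orders compute the same result. What changes is which values must be moved every step, and data movement is what dominates energy.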

### Input stationary design (dense 1D conv example)

(matrix vector multiplication example: y=Wx)

Assume: 1D input/output, 3-wide filters, 2 output channels (K=2)

![](_page_27_Figure_3.jpeg)

# Stream of weights (2 1D filters of size 3)

Processing elements (implement multiply)

Accumulators (implement +=)

![](_page_27_Picture_8.jpeg)
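A functional sketch of the input stationary schedule under the slide's assumptions (1D input, 3-wide filters, K=2): each input element stays in its PE, the weights of every filter stream past, and each product is routed to the accumulator of the output it contributes to. The function name and the "valid outputs only" boundary handling are assumptions.

```python
def conv1d_input_stationary(x, filters):
    """Each PE holds one input x[i]; all filter weights stream past it.

    out[k][j] = sum over r of filters[k][r] * x[j + r]  (valid outputs only)
    """
    taps = len(filters[0])
    n_out = len(x) - taps + 1
    out = [[0] * n_out for _ in filters]       # accumulators implement +=
    for i, x_i in enumerate(x):                # x_i stays in its PE
        for k, filt in enumerate(filters):     # stream of weights
            for r, weight in enumerate(filt):
                j = i - r                      # output this product belongs to
                if 0 <= j < n_out:
                    out[k][j] += weight * x_i  # PE multiplies, accumulator adds
    return out

out = conv1d_input_stationary([1, 2, 3, 4, 5], [[1, 1, 1], [1, 0, -1]])
# out == [[6, 9, 12], [-2, -2, -2]]
```

Input `x[i]` contributes to outputs `i`, `i-1`, and `i-2` of every channel, which is why each PE feeds several neighboring accumulators.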

# Scaling up (for training big models)

#### **Example: GPT-3 language model**

![](_page_28_Figure_2.jpeg)

![](_page_28_Picture_5.jpeg)

### **TPU v3 supercomputer**

#### TPU v3 board 4 TPU3 chips

![](_page_29_Picture_2.jpeg)

#### **TPU supercomputer (1024 TPU v3 chips)**

![](_page_29_Picture_4.jpeg)

One TPU v3 board

TPUs connected by 2D torus interconnect

![](_page_29_Figure_6.jpeg)

![](_page_29_Picture_8.jpeg)
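In a 2D torus every chip has four neighbors, and the edges wrap around. A short sketch of neighbor lookup and hop count, assuming the 1024 chips are arranged as a 32 x 32 grid (the arrangement and function names are assumptions):

```python
def torus_neighbors(x, y, width, height):
    """Return the four neighbors of chip (x, y) in a 2D torus with wraparound links."""
    return [
        ((x - 1) % width, y),
        ((x + 1) % width, y),
        (x, (y - 1) % height),
        (x, (y + 1) % height),
    ]

def torus_hops(a, b, width, height):
    """Minimum hop count between chips a and b, taking wraparound into account."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, width - dx) + min(dy, height - dy)

WIDTH = HEIGHT = 32   # assumed arrangement: 32 x 32 = 1024 chips
corner_neighbors = torus_neighbors(0, 0, WIDTH, HEIGHT)
worst_case = torus_hops((0, 0), (16, 16), WIDTH, HEIGHT)   # 32 hops
```

The wraparound links halve the worst-case distance compared with a plain mesh of the same size.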

### Additional examples of "AI chips"

#### Key ideas:

1. **Huge numbers of compute units**
2. Huge amounts of on-chip storage to maintain input weights and intermediate values

![](_page_30_Picture_7.jpeg)

# **GraphCore MK2 GC200 IPU**

#### **IPU-Tiles**<sup>™</sup>

![](_page_31_Figure_2.jpeg)

(59B transistors, similar in size to the A100 GPU)

![](_page_31_Picture_4.jpeg)

# **Cerebras Wafer-Scale Engine (WSE)**

![](_page_32_Picture_1.jpeg)

**Tightly interconnected tile of chips (entire wafer)**

Many more transistors (1.2T) than largest single chips (Example: NVIDIA A100 GPU has 54B)

Compilation of DNN to platform involves "laying out" DNN layers in space on processing grid.

#### Neural network

![](_page_32_Picture_5.jpeg)

|                  | Cerebras WSE           |
|------------------|------------------------|
| Chip size        | 46,225 mm<sup>2</sup>  |
| Cores            | 400,000                |
| On-chip memory   | 18 gigabytes           |
| Memory bandwidth | 9 petabytes/s          |
| Fabric bandwidth | 100 petabits/s         |

![](_page_32_Picture_8.jpeg)
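Dividing the table's totals by the core count gives a feel for how storage and compute are co-located. This is simple arithmetic on the figures above, assuming decimal units (1 GB = 1e9 bytes):

```python
cores = 400_000
on_chip_memory_bytes = 18e9          # 18 gigabytes
memory_bandwidth_bytes_s = 9e15      # 9 petabytes/s

memory_per_core_kb = on_chip_memory_bytes / cores / 1e3            # 45 KB per core
bandwidth_per_core_gb_s = memory_bandwidth_bytes_s / cores / 1e9   # 22.5 GB/s per core

# Transistor counts quoted above: 1.2T for the WSE, 54B for an A100.
transistor_ratio = 1.2e12 / 54e9                                   # about 22x
```

Each core has only tens of kilobytes of local memory, so a large model must be spread across many cores.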

![](_page_32_Picture_10.jpeg)

### SambaNova reconfigurable dataflow unit

Again, notice tight integration of storage and compute

![](_page_33_Figure_2.jpeg)

![](_page_33_Figure_3.jpeg)

![](_page_33_Picture_5.jpeg)

### Another example of spatial layout

![](_page_34_Figure_1.jpeg)

#### Notice: inter-layer communication occurs through on-chip interconnect, not through off-chip memory.

![](_page_34_Figure_3.jpeg)

![](_page_34_Picture_5.jpeg)

### **Exploiting sparsity**

![](_page_35_Picture_2.jpeg)

## **Architectural** …

- **Consider operation** …
- If hardware determines an operand is zero:
  - Don't fire ALU (saves energy)
  - Don't move data
  - But ALU is idle (computation doesn't run faster, optimization only saves energy)

![](_page_36_Figure_4.jpeg)

![](_page_36_Figure_7.jpeg)

![](_page_36_Picture_11.jpeg)
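A small sketch of the gating idea, assuming the hardware checks for a zero operand before each multiply-accumulate. The counters and the name `gated_dot` are illustrative; the key observation is that the cycle count does not change.

```python
def gated_dot(xs, ys):
    """Dot product where the ALU is not fired when either operand is zero."""
    result, alu_fires, cycles = 0, 0, 0
    for x, y in zip(xs, ys):
        cycles += 1            # the issue slot is consumed either way
        if x == 0 or y == 0:
            continue           # gate the ALU: no operand movement, no multiply
        result += x * y
        alu_fires += 1
    return result, alu_fires, cycles

result, alu_fires, cycles = gated_dot([1, 0, 3, 0], [2, 5, 0, 7])
# result == 2, alu_fires == 1, cycles == 4
```

Three of the four multiplies are skipped, but the loop still takes four steps. Getting a speedup requires compacting away the zeros, which is what the sparse representations later in the lecture do.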

#### Model compression

- Step 1: sparsify weights by truncating weights with small values to zero
- Step 2: compress surviving non-zeros
  - Cluster weights via k-means clustering
  - Compress weights by only storing index of assigned cluster (lg(k) bits)
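The two steps can be sketched in a few lines. This is a toy 1D version, assuming a magnitude threshold for pruning and a plain k-means with linearly spaced initial centroids (both choices, and all names, are assumptions; it requires k >= 2). Positions of the surviving weights would be stored separately in a sparse format.

```python
import math

def prune(weights, threshold):
    """Step 1: set small-magnitude weights to zero."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def nearest(value, centroids):
    """Index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda c: abs(value - centroids[c]))

def kmeans_1d(values, k, iters=20):
    """Plain 1D k-means with linearly spaced initial centroids."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            clusters[nearest(v, centroids)].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def compress(weights, threshold, k):
    """Step 2: replace each surviving weight by the index of its cluster."""
    nonzeros = [w for w in prune(weights, threshold) if w != 0.0]
    centroids = kmeans_1d(nonzeros, k)
    indices = [nearest(w, centroids) for w in nonzeros]
    bits_per_weight = math.ceil(math.log2(k))      # lg(k) bits per index
    return centroids, indices, bits_per_weight

weights = [2.0, -1.0, 1.5, 0.05, -0.02, 2.1, -1.1, 1.4, 0.01, 0.0]
centroids, indices, bits = compress(weights, threshold=0.1, k=4)
```

With k=4 each surviving weight costs 2 bits plus a shared table of four centroid values, instead of a full-precision number.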

![](_page_37_Figure_5.jpeg)

[Han et al.]

#### [Figure credit: Han ICLR16]

![](_page_37_Figure_8.jpeg)

Figure labels: cluster index (2-bit uint) stored per weight; centroids: 3: 2.00, 2: 1.50, 1: 0.00, 0: -1.00

![](_page_37_Picture_13.jpeg)

### Sparse, weight-sharing fully-connected layer

$$b_i = ReLU\left(\sum_{j=0}^{n-1} W_{ij}a_j\right)$$

$$b_i = ReLU\left(\sum_{j \in X_i \cap Y} S[I_{ij}]a_j\right)$$

Note: activations are sparse due to ReLU

Fully-connected layer: Matrix-vector multiplication of activation vector *a* against weight matrix *W* 

Sparse, weight-sharing representation:

- $I_{ij}$ = index for weight $W_{ij}$
- $S[]$ = table of shared weight values
- $X_i$ = list of non-zero indices in row $i$
- $Y$ = list of non-zero indices in vector $a$

![](_page_38_Picture_8.jpeg)
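The two formulas can be checked against each other on a tiny example. The data below is made up for illustration; `I`, `X`, `S`, and `Y` follow the definitions above, with `I` stored as one dictionary per row (a representation chosen here for brevity).

```python
def relu(v):
    return max(0.0, v)

def dense_fc(W, a):
    """b_i = ReLU(sum over j of W_ij * a_j)"""
    return [relu(sum(w_ij * a_j for w_ij, a_j in zip(row, a))) for row in W]

def sparse_shared_fc(I, X, S, a):
    """b_i = ReLU(sum over j in (X_i intersect Y) of S[I_ij] * a_j)"""
    Y = {j for j, a_j in enumerate(a) if a_j != 0}   # non-zero activations
    return [
        relu(sum(S[I[i][j]] * a[j] for j in X[i] if j in Y))
        for i in range(len(X))
    ]

S = [0.0, -1.0, 1.5, 2.0]             # shared weight table (illustrative values)
W = [[2.0, 0.0, 1.5, 0.0],
     [0.0, -1.0, 0.0, 2.0]]           # dense weights, for reference
X = [[0, 2], [1, 3]]                  # non-zero column indices in each row
I = [{0: 3, 2: 2}, {1: 1, 3: 3}]      # I[i][j] = index into S for W_ij
a = [1.0, 0.0, 2.0, 0.5]

b_dense = dense_fc(W, a)                   # [5.0, 1.0]
b_sparse = sparse_shared_fc(I, X, S, a)    # [5.0, 1.0]
```

The sparse version touches three weights instead of eight: it skips zero weights via `X` and zero activations via `Y`.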

### **Sparse-matrix, vector multiplication**

Custom hardware to decode and evaluate sparse, compressed DNNs

# Represent weight matrix in compressed sparse column (CSC) format to exploit sparsity in activation vector:

```
for each nonzero a_j in a:
   for each nonzero M_ij in column M_j:
        b_i += M_ij * a_j
```

#### More detailed version (assumes CSC matrix):

![](_page_39_Picture_6.jpeg)
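One way to write the CSC version out in full is sketched below, using the conventional three-array layout (`values`, `row_idx`, `col_ptr`). The function names are illustrative, and this is not necessarily the exact formulation used on the slide.

```python
def dense_to_csc(M):
    """Convert a dense row-major matrix to CSC arrays (values, row_idx, col_ptr)."""
    n_rows, n_cols = len(M), len(M[0])
    values, row_idx, col_ptr = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            if M[i][j] != 0:
                values.append(M[i][j])
                row_idx.append(i)
        col_ptr.append(len(values))   # column j occupies col_ptr[j]:col_ptr[j+1]
    return values, row_idx, col_ptr

def csc_matvec(values, row_idx, col_ptr, a, n_rows):
    """b = M a, visiting only non-zero activations and non-zero weights."""
    b = [0] * n_rows
    for j, a_j in enumerate(a):
        if a_j == 0:
            continue                  # skip the whole column for a zero activation
        for p in range(col_ptr[j], col_ptr[j + 1]):
            b[row_idx[p]] += values[p] * a_j
    return b

M = [[1, 0, 2],
     [0, 3, 0]]
values, row_idx, col_ptr = dense_to_csc(M)
b = csc_matvec(values, row_idx, col_ptr, [1, 5, 0], n_rows=2)   # [1, 15]
```

Column-major storage is what makes skipping cheap: a zero activation `a_j` rules out one contiguous slice of `values`.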

### Parallelization of sparse-matrix-vector product

#### Stride rows of matrix across processing elements

#### Output activations strided across processing elements

| $\vec{a}$ | 0          | 0          | $a_2$      | 0          | $a_4$      | $a_5$      | 0          | $a_7$      |
|-----------|------------|------------|------------|------------|------------|------------|------------|------------|
| PE0       | $w_{0,0}$  | 0          | $w_{0,2}$  | 0          | $w_{0,4}$  | $w_{0,5}$  | $w_{0,6}$  | 0          |
| PE1       | 0          | $w_{1,1}$  | 0          | $w_{1,3}$  | 0          | 0          | $w_{1,6}$  | 0          |
| PE2       | 0          | 0          | $w_{2,2}$  | 0          | $w_{2,4}$  | 0          | 0          | $w_{2,7}$  |
| PE3       | 0          | $w_{3,1}$  | 0          | 0          | 0          | $w_{0,5}$  | 0          | 0          |
|           | 0          | $w_{4,1}$  | 0          | 0          | $w_{4,4}$  | 0          | 0          | 0          |
|           | 0          | 0          | 0          | $w_{5,4}$  | 0          | 0          | 0          | $w_{5,7}$  |
|           | 0          | 0          | 0          | 0          | $w_{6,4}$  | 0          | $w_{6,6}$  | 0          |
|           | $w_{7,0}$  | 0          | 0          | $w_{7,4}$  | 0          | 0          | $w_{7,7}$  | 0          |
|           | $w_{8,0}$  | 0          | 0          | 0          | 0          | 0          | 0          | $w_{8,7}$  |
|           | $w_{9,0}$  | 0          | 0          | 0          | 0          | 0          | $w_{9,6}$  | $w_{9,7}$  |
|           | 0          | 0          | 0          | 0          | $w_{10,4}$ | 0          | 0          | 0          |
|           | 0          | 0          | $w_{11,2}$ | 0          | 0          | 0          | 0          | $w_{11,7}$ |
|           | $w_{12,0}$ | 0          | $w_{12,2}$ | 0          | 0          | $w_{12,5}$ | 0          | $w_{12,7}$ |
|           | $w_{13,0}$ | $w_{13,2}$ | 0          | 0          | 0          | 0          | $w_{13,6}$ | 0          |
|           | 0          | 0          | $w_{14,2}$ | $w_{14,3}$ | $w_{14,4}$ | $w_{14,5}$ | 0          | 0          |
|           | 0          | 0          | $w_{15,2}$ | $w_{15,3}$ | 0          | $w_{15,5}$ | 0          | 0          |

- Weights stored local to PEs
- Must broadcast non-zero a\_j's to all PEs
- Accumulation of each output b\_i is local to a PE

![](_page_40_Figure_5.jpeg)

![](_page_40_Picture_7.jpeg)
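The parallel scheme can be simulated as follows, assuming row `i` of the matrix lives on PE `i % num_pes` and each PE keeps only the non-zeros of its own rows, grouped by column. The per-PE work counter and all names are illustrative additions.

```python
def parallel_spmv(W, a, num_pes=4):
    """Simulate b = W a with rows of W strided across PEs (row i on PE i % num_pes)."""
    n_rows = len(W)
    # local_cols[pe][j] = list of (i, W_ij) non-zeros in column j owned by this PE.
    local_cols = [
        [[(i, W[i][j]) for i in range(pe, n_rows, num_pes) if W[i][j] != 0]
         for j in range(len(a))]
        for pe in range(num_pes)
    ]
    b = [0] * n_rows
    work = [0] * num_pes                     # multiply-adds performed by each PE
    for j, a_j in enumerate(a):
        if a_j == 0:
            continue                         # zero activations are never broadcast
        for pe in range(num_pes):            # broadcast (j, a_j) to all PEs
            for i, w_ij in local_cols[pe][j]:
                b[i] += w_ij * a_j           # b_i is owned by this PE
                work[pe] += 1
    return b, work

W = [[1, 0], [0, 2], [3, 0], [0, 0], [0, 4]]
b, work = parallel_spmv(W, [2, 0], num_pes=2)   # b = [2, 0, 6, 0, 0], work = [2, 0]
```

The `work` counts show a practical concern with this scheme: non-zeros are not spread evenly, so some PEs can sit idle while others finish a column.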

### **Efficient Inference Engine (EIE) for quantized sparse-matrix vector product**

**Custom hardware for decoding compressed-sparse representation** 

![](_page_41_Figure_2.jpeg)

![](_page_41_Picture_6.jpeg)

# **EIE efficiency**

![](_page_42_Figure_1.jpeg)

![](_page_42_Figure_2.jpeg)

![](_page_42_Figure_3.jpeg)

- CPU: Core i7 5930k (6 cores)
- GPU: GTX Titan X
- mGPU: Tegra K1

Sources of energy savings:

- **Compression allows all weights to be stored in SRAM (reduce DRAM loads)**
- Low-precision 16-bit fixed-point math (5x more efficient than 32-bit fixed math)
- Skip math on input activations that are zero (65% less math)

Speedups of GPU, mobile GPU, and EIE compared with a CPU running the uncompressed DNN model. No batching is used in any case.

#### Warning: these are not end-to-end numbers: just results on fully connected layers!

![](_page_42_Picture_13.jpeg)

### Reminder: input stationary design (dense 1D)

(matrix vector multiplication example: *y*=*Wx*)

Assume: 1D input/output, 3-wide filters, 2 output channels (K=2)

![](_page_43_Figure_4.jpeg)

# Stream of weights (2 1D filters of size 3)

Processing elements (implement multiply)

Accumulators (implement +=)

![](_page_43_Picture_9.jpeg)

### Input stationary design (sparse example)

Assume: **1D input/output, 3-wide SPARSE filters**, 2 output channels (K=2)

![](_page_44_Figure_2.jpeg)

![](_page_44_Picture_5.jpeg)

# **SCNN: accelerating sparse conv layers**

- Like EIE: assume both activations and conv weights are sparse
- Input stationary design:
  - **Each PE receives:**
    - A set of I input activations from an input channel: a list of I (value, (x,y)) pairs
    - A list of F non-zero weights
  - Each PE computes the cross-product of these values: F x I values
  - Then scatters the F x I results to the correct accumulator buffer cells
  - Then repeats for a new set of F weights (reusing the I inputs)
- Then, after convolution:
  - **ReLU** sparsifies output
  - **Compress outputs into** sparse representation for use as input to next layer

[Parashar et al. ISCA17]
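The cross-product-then-scatter step can be sketched in 1D to match the earlier examples (SCNN itself works on 2D (x,y) coordinates). The tuple layouts, the output-index rule `x - r`, and the function name are assumptions made for this illustration.

```python
def scnn_pe_1d(activations, weights, n_out, n_channels_out):
    """activations: list of (value, x); weights: list of (value, k, r) non-zeros.

    Every weight is multiplied with every activation (F x I products), and each
    product is scattered to accumulator cell out[k][x - r].
    """
    out = [[0] * n_out for _ in range(n_channels_out)]   # accumulator buffer
    products = 0
    for w_val, k, r in weights:               # F non-zero weights
        for a_val, x in activations:          # I non-zero activations (reused)
            products += 1
            j = x - r                         # output coordinate of this product
            if 0 <= j < n_out:
                out[k][j] += w_val * a_val    # scatter into accumulator cell
    return out, products

# Dense equivalents: x = [1, 0, 3, 0, 5], filters = [[1, 0, 1], [0, 2, 0]]
activations = [(1, 0), (3, 2), (5, 4)]        # (value, position)
weights = [(1, 0, 0), (1, 0, 2), (2, 1, 1)]   # (value, output channel k, tap r)
out, products = scnn_pe_1d(activations, weights, n_out=3, n_channels_out=2)
# out == [[4, 0, 8], [0, 6, 0]], products == 9
```

No multiply ever involves a zero, but the output coordinate is only known after the multiply, so the accumulator buffer must support scattered writes.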

![](_page_45_Figure_15.jpeg)

![](_page_45_Figure_18.jpeg)

# SCNN results (on GoogLeNet)

![](_page_46_Figure_1.jpeg)

- **DCNN** = dense CNN evaluation
- **DCNN-opt** = includes ALU gating, and compression/decompression of activations


![](_page_46_Picture_8.jpeg)

# Summary of hardware accelerators for efficient inference

- **Specialized instructions for dense linear algebra computations**
  - Reduce overhead of control (compared to CPUs/GPUs)
- **Reduced precision operations (cheaper computation + reduce bandwidth requirements)**
- Systolic / dataflow architectures for efficient on-chip communication
  - Different scheduling strategies: weight-stationary, input/output stationary, etc.
- Huge amounts of on-chip memory to avoid off-chip communication
- **Exploit sparsity in activations and weights**
  - Skip computation involving zeros
  - Hardware to accelerate decompression of sparse representations like compressed sparse row/column

![](_page_47_Figure_13.jpeg)