

**Stanford CS149: Parallel Computing**  
**Written Assignment 1**

**Hardware Basics**

**Problem 1: (Graded for Correctness - 20 pts)**

A. (5 pts) Consider a multi-core processor that has three cores. Each core runs at 1 GHz (1 billion operations per second). Each core is single-threaded (meaning it only maintains state for a single execution context) and can complete one single-precision floating point arithmetic operation per clock. What is the peak arithmetic throughput of the processor in terms of floating point operations per second?

B. (5 pts) Now imagine the cores from part A are upgraded so that they perform 8-wide SIMD instructions. Assuming these cores still complete one of these SIMD instructions per clock, what is the peak arithmetic throughput of the processor (in terms of floating point operations per second)?

C. (5 pts) Finally, imagine that each core from part B was a multi-threaded core that maintain execution contexts for up to four hardware threads each. What is the peak arithmetic throughput of the processor in terms of floating operations per second?

D. (5 pts) Imagine that each core from part C was further modified to support superscalar execution where the core can complete one scalar floating point operation and one 8-wide SIMD instruction per clock from the same thread (if those instructions are independent). What is the peak arithmetic throughput of the processor (in terms of floating point operations per second)?

## Identifying Dependencies + A Bit on Superscalar Execution

### Problem 2: (Graded for Correctness - 20 pts)

A. (8 pts) Please draw the dependency graph for the following sequence of nine instructions. Label each node by the number of the instruction being performed.

1. ADD R2 <- R0, R1
2. MUL R3 <- R0, R1
3. MUL R3 <- R2, R3
4. SUB R4 <- R0, R1
5. MUL R2 <- R2, R3
6. MUL R3 <- R3, R4
7. MUL R2 <- R2, R3
8. MUL R4 <- R4, R1
9. MUL R0 <- R2, R4

B. (8 pts) **THIS PROBLEM IS INDEPENDENT FROM PART A.** Now consider the following dependency graph of instructions. Instructions are marked as ADDs and MULs.



Consider running the instruction stream on a processor that supports three-way superscalar processing. However the processor is constrained in three ways:

- (a) Instructions that begin to execute in the same clock must all be MUL instructions.
- (b) MUL instructions have a latency of two cycles to complete. In other words, if any instruction Y is dependent on MUL instruction X, and X executes in cycle  $c$ , then Y can only begin to execute in cycle  $c + 2$ . Other instructions that are independent of X can run in cycle  $c + 1$ .
- (c) ADD instructions have a latency of one cycle to complete.

Please determine the minimum number of cycles that the instruction stream takes, and in the diagram below, please list which instructions begin in which cycle.



C. (4 pts) If your goal was to only run the instruction stream given by the DAG in part B above, is it a good idea to consider modifying the processor to support 5-way superscalar execution? Why or why not? (Hint: using the term “ILP” in your answer would be helpful.)

## Pipelining

### Problem 3: (Graded on Effort Only - 20 pts)

A. (10 pts) Consider a final exam grading pipeline where the 10 CS149 CAs work together to grade the exams. There are seven questions, and assume that each question takes 1 CA 1 minute to grade, except question 1, which takes a CA 3 minutes to grade. Imagine that the CAs arrange themselves in a pipeline in the following way:

- 2 CAs team up work on grading question 1 (CAs process different exams in parallel)
- 3 CAs work on grading question 2 (CAs process different exams in parallel)
- 1 CA (per question) works on grading each of the remaining 5 questions.

After one of the question 1 grader's completes grading question 1 for an exam, they put in the pile for the three question 2 graders. When question 2 for an exam is graded, it goes in the pile for the one question 3 grader. When that is finished, it goes in the pile for the one question four grader, etc. Given this configuration, assuming the CAs have to grade a very large number of exams, **what is the steady state throughput of the grading pipeline in terms of exams per minute?**

exams per minute

B. (10 pts) Given the same setup as part A, once a Q1 grader starts working on an exam, what's the latency of completing grading for all questions of the exam? (You can assume that if a grader is free they pick up and start grading a new exam the moment it arrives in their pile.)

minutes

## Understanding Hardware Multi-Threading and Caches

### Problem 4: (Graded on Effort Only - 20 pts)

Please assume that an application spawns two C++ threads that each execute the following C++ function. (The thread spawn is not shown.) Note how the function does different work based on the thread id.

```
// assume threadId is the id of the current thread
// assume numThreads evenly divides N
void my_function(float* in, float* out, int N, int numThreads, int threadId) {
    int start = threadId * N/numThreads;
    int end = start + N/numThreads;
    for (int i=0; i<1000; i++) {
        for (int j=start; j<end; j++) {
            float tmp = in[j]; // memory load
            for (int k=0; k<ITERS; k++)
                tmp *= tmp; // 1 float op
            out[j] = tmp; // memory store
        }
    }
}
```

A. (10 pts) Assume you are running the application on a machine with: a single core processor that can perform one floating point operation per clock. The core has two hardware execution contexts. **The core is connected to a memory system with infinite bandwidth, a memory READ latency of 50 cycles, a memory write latency of 0, and with a cache that has four-byte cache lines and is 16 MB in size. Cache hits have zero latency.** When the application is executed on this processor with  $N = 1$  billion and  $ITERS=10$ , what is the steady-state utilization of the core? Please put your answer in the box and then provide a brief justification. (Keep in mind the application has spawned two threads.)

% utilization

B. (10 pts) **Now consider the case where the value of  $N$  is reduced from 1 billion to 1 million.** Assuming all other aspects of the program, processor, and memory stay the same, you now have the option of either increasing the clock rate of the processor by 25%, or reducing the memory latency to 20 cycles. Which option will yield the highest performance. Why? **Recall that the cache size has four-byte cache lines, and is 16 MB in size.**

## SIMD Divergence, and Avoiding It

### Problem 5: (Graded on Effort Only - 20 pts)

Consider the following image, which has a width of 16 pixels, and a height of  $N$  pixels. White pixels in the figure correspond to the value 1.0 and dark pixels correspond to the value 0.0. The first 16 rows are shown in the figure. You should assume that the pattern repeats vertically in chunks of 13 rows. (Yes, you can assume  $N$  is a multiple of 13 for simplicity.)



You write the following ISPC program to process the image above. Assume that `foo()` and `bar()` are helper functions that only perform arithmetic.

```
const int IMAGE_WIDTH = 16;
void myfunction(uniform float* input, uniform float* output, int imageHeight) {

    for (uniform int row=0; row<imageHeight; row++) {
        for (uniform int col=0; col<IMAGE_WIDTH; col+=programCount) {

            int idx = row*IMAGE_WIDTH + col + programIndex;
            float val = input[idx];    // load four bytes
            float result;

            if (val == 1.0) {
                result = foo(val);    // 1 cycle of arithmetic
            } else {
                result = bar(val);    // 10 cycles of arithmetic
            }
            output[idx] = result;    // store four bytes
        }
    }
}
```

A. (10 pts) Assume that you are running the code on the image above using a **ISPC gang size of 4** (**programCount=4**). You run the code on a single core, 1 GHz processor, with 4-wide SIMD instructions, and support for eight hardware threads (execution contexts) per core. If we assume all loop indexing and memory operations are “free”, how many total clocks does the code take? Your answer should be expressed in terms of  $N$  (which you can assume is divisible by 13). Please show your work and put your answer in the box.

clocks

B. (10 pts) Given your answer in part A, what is the overall average SIMD utilization of the processor over the duration of the computation? Your answer can just be an expression in terms of  $N$  or a final number between 0 and 100%. Please show your work and put your answer in the box. (Note that math does not work out to a clean integer number in this problem.)

utilization

## Be An ISPC Compiler

### PRACTICE PROBLEM 1:

A. Please study the ISPC function `my_ispc_function` given below. The function multiplies all elements of the array 'input' by two.

```
// Recall ISPC's built-in variables:  
//   'programCount' is the ISPC gang size  
//   'programIndex' is a per-instance identifier between 0 and programCount-1;  
  
void my_ispc_function(uniform int N, uniform float* input, uniform float* output) {  
  
    // do not assume programCount divides N  
    uniform int chunkSize = (N+programCount-1) / programCount;  
  
    int start = programIndex * chunkSize;  
    int end = start + chunkSize;  
  
    if (end > N)  
        end = N;  
  
    for (int i=start; i<end; i++) {  
        output[i] = 2 * input[i];  
    }  
}
```

Imagine you are implementing an ISPC compiler that translates ISPC programs into C programs that use vector intrinsics. To help, we have provided you a library of vector intrinsics similar to the library you used in Assignment 1. The library's methods are given on the next page. NOTE: YOU DO NOT need to study these functions in detail, but note that:

- (a) Although not listed, assume that all arithmetic instructions (add, subtract, multiply, divide, etc. are present in the library for your use) and versions are present for both vectors of floats and vectors of ints).
- (b) Assume that all binary operations on masks are present: and, or, equal
- (c) Just like Assignment 1, all operations can take an optional mask (1's in the mask mean the lane is ENABLED)
- (d) There are two types of vector load and store methods: **packed loads and stores** (loading consecutive elements) and **scatters and gathers** (loading non-consecutive elements).

On the next page, please translate this ISPC program into its vector equivalent `my_vector_function`. *Your implementation can be pseudocode, but it should produce the same mapping of work to vector lanes as the real ISPC compiler implementation.*

```

// Assume: intVec/floatVec are vectors of ints/floats of size PROGRAM_COUNT
// Assume: maskVec is a vector of bools: {111...111} = all lanes enabled
// ARITHMETIC EXAMPLES ****
intVec initVec(int value, maskVec mask);           // set all elements to 'value'
maskVec initMaskVec(bool value, maskVec mask);      // set mask to all ones or zeros
intVec copyVec(intVec a, maskVec mask);             // result = a;
intVec addVec(intVec a, intVec b, maskVec mask);    // add vectors
                                                    // (same for: 'mul', 'sub', 'div')
maskVec lessThanVec(intVec a, intVec b, maskVec mask); // result[i] = a[i] < b[i]
maskVec andVec(maskVec a, maskVec b, maskVec mask); // a && b
                                                    // (same for: 'equal', 'or')
maskVec notVec(maskVec a, maskVec mask);            // !a
int    countOnesVec(maskVec v, maskVec mask);        // returns number of 1's in v

// LOAD/STORE, GATHER/SCATTER EXAMPLES ****
// load A[0], A[1], A[2], ..., A[PROGRAM_COUNT-1] into vector
intVec loadVec(int* A, maskVec mask);

// store v into A[0], A[1], A[2], ..., A[PROGRAM_COUNT-1]
void    storeVec(int* A, intVec v, maskVec mask);

// load A[indices[0]], ..., A[indices[PROGRAM_COUNT-1]] into vector
intVec gatherVec(int* A, intVec indices, maskVec mask);

// store elements of v into A[indices[0]], ..., A[indices[PROGRAM_COUNT-1]]
void    scatterVec(int* A, intVec indices, intVec v, maskVec mask);

// YOUR IMPLEMENTATION GOES HERE (WE'VE STARTED IT FOR YOU) ****

void my_vector_function(uniform int N, uniform float* in, uniform float* out) {
    // assume programIndex = {0, 1, 2, 3, ..., PROGRAM_COUNT-1};
    intVec programIndex;
    intVec programCount = initVec(PROGRAM_COUNT);
    intVec chunkSize    = divVec(addVec(initVec(N),
                                         addVec(programCount, initVec(-1))), programCount);
    intVec start        = mulVec(programIndex, chunkSize);
    intVec end          = addVec(start, chunkSize);
}

```

B. There is a performance problem with the current implementation of `my_ispc_function`. **Please explain the problem and then re-write the original ISPC code (not your solution that uses intrinsics)** to remove this problem. (Hint: the problem has to do with the efficiency of memory access.) For simplicity, your implementation can assume `programCount` divides `N` equally if you wish – though it would be cooler if it did not. (Note: if you've forgotten *exact* ISPC syntax it's okay, just write good pseudocode.)

## Multi-Core Arch Review

### PRACTICE PROBLEM 2:

A. Consider a CPU with two cores running at 3 GHz. Each core can execute one eight-wide SIMD instruction per clock, and supports 4-way multi-threading. What is the peak arithmetic throughput of the processor in operations per second? **One SIMD operation should be counted as 8 floating point operations.**

floating point operations per second

B. Consider the following C program:

```
float A[1024];
float result = 0.0;
for (int i=0; i<10000; i++)
    for (int j=0; j<1024; j++)
        result += A[j];
```

Imagine that you work for a chip design company, and your team's job is to improve the performance of the code on single-core computer with the following properties:

- The core is single-threaded.
- It has a 32 KB cache with 4 byte cache lines. It employs a least recently used (LRU) replacement policy.
- Cache hits have 1 cycle of latency. (The data can be used immediately in the next clock.)
- The core's memory system has infinite memory bandwidth, but memory latency of 16 cycles.

Your boss gives your team two options: You can add 32-way hardware multi-threading to the existing core (and modify the program so that iterations of the outer *i*-loop are evenly partitioned across 32 threads). Or you can add 4-wide SIMD instruction support to the processor, and rewrite the *innermost loop* to use SIMD instructions. If your goal is to achieve the highest performance, which solution gives you the highest performance? Why? **In your answer state how much faster your chosen solution is than the alternative.** Hint: trace through the execution of the program and consider which memory accesses are cache misses and which are cache hits. If a majority are misses, what does that suggest? If a majority are hits, what does that suggest?

C. Please draw the dependency graph for the following sequence of nine instructions. Label each node by the number of the instruction being performed. Also imagine you are running the instructions on a two-way superscalar core that can execute two instructions per clock, BUT ONLY IF THE TWO INSTRUCTIONS ARE A MULTIPLY AND AN ADD. (It cannot run two multiplies or two adds simultaneously.) **After drawing the DAG, please give the minimum number of cycles needed to execute this sequence on this core?** To make grading easy, please draw each node in your DAG in the column for the cycle in which it occurs.

1. ADD R2 <- R0, R1
2. ADD R3 <- R2, R2
3. ADD R4 <- R0, R2
4. MUL R0 <- R3, R4
5. ADD R1 <- R2, R3
6. ADD R2 <- R0, R1
7. ADD R3 <- R0, R1
8. MUL R2 <- R1, R2
9. MUL R0 <- R2, R3



The instruction sequence takes **10** clocks.

D. Consider a program that creates two threads. Each thread performs one memory load operation, followed by ten arithmetic instructions that depend on the result of the load. This pattern then repeats many times for each thread. Assume that you are executing this program on a **two-threaded core** that can execute **one arithmetic instruction per clock**. The core is connected to a memory system that has a **infinite memory bandwidth** but a **memory latency of 90 cycles**. What is the steady-state utilization of the processor?

## NBA Superstars with the Most FLOPS

### PRACTICE PROBLEM 3:

As they near the twilight of their playing careers, NBA superstars Lebron James and Chris Paul decide to take on a new challenge and start a high-throughput processor company to compete with NVIDIA for the lucrative AI market. As an homage to their playing style and with an eye on high arithmetic intensity workloads, they decide to call their company "Big FLOPS Processors".

The Big FLOPS GPU, codenamed "NBA2K", has the following characteristics:

- 16 cores operating at 2 GHz
- 64-hardware threads per core
- Cores can run instructions from exactly one thread per clock (interleaved multi-threading)
- Superscalar capability to execute four 32-wide SIMD instructions per clock, *all from the same thread*
- A private 32 KB cache per core
- The processor is connected to a memory system with 0 memory latency and infinite memory bandwidth

The NBA2K processor implements SIMD using the "explicit SIMD" style typically of Intel CPUs. Each thread issues explicit SIMD vector instructions.

A. Under ideal code circumstances, what is the peak arithmetic throughput of the NBA2K processor in terms of **scalar operations per second**?

scalar operations per second

B. What is the maximum number of threads that are concurrently executed by the NBA2K processor? (Remember, "concurrent" means state for the threads is live on the processor at the same time.)

threads

C. Assume that the ISPC gang size is 32 and that a program creates 16 ISPC tasks executing the ISPC function below (taskIndex in the code ranges from 0 to 15 inclusive). Observe that each invocation of the ISPC function (a.k.a. each task) accesses the same values from the array `input`, so they all compute the same answer. But they write to a different per-task element of `output`.

```
const int N = 1024 * 1024 * 1024;

void my_ispc_code(uniform float* input, uniform float* output) {
    float accum = 0.0f;           // arithmetic operation

    for (int i=programIndex * 4; i<N; i+=programCount) {
        float val = input[i];      // load operation
        for (int j=0; j<4096; j++) {
            accum += val;          // arithmetic operation
        }
    }
    output[taskIndex] = reduceAdd(accum); // arithmetic operation
}
```

Given the code above, what is the percentage of peak throughput obtained when running the above code on the NBA2K processor? You can assume that the only floating point arithmetic instructions are the ones labeled. (assume integer math and comparisons are "free".) **Hint: You should inspect the code and consider instruction dependencies in the code.**

percent of peak throughput

D. Consider the following rewrite of the program. First, verify that the rewrite computes the same result as before. (It does!)

```
const int N = 1024 * 1024 * 1024;

void my_ispc_code(uniform float* input, uniform float* output) {
    float accum[4] = {0.0, 0.0, 0.0, 0.0}; // arithmetic operation

    int stepSize = 4 * programCount;
    for (int i=programIndex; i<N; i+=stepSize) {
        float val[4];
        val[0] = input[i]; // load operation
        val[1] = input[i+1]; // load operation
        val[2] = input[i+2]; // load operation
        val[3] = input[i+3]; // load operation
        for (int j=0; j<4096; j++) {
            accum[0] += val[0]; // arithmetic operation
            accum[1] += val[1]; // arithmetic operation
            accum[2] += val[2]; // arithmetic operation
            accum[3] += val[3]; // arithmetic operation
        }
    }
    float result = accum[0] + accum[1] + accum[2] + accum[3]; // arithmetic ops
    output[taskIndex] = result; // store operation
}
```

Given the code above, what is the percentage of peak throughput obtained when running the above code on the "FLOPS GPU"? You can assume that the only floating point arithmetic instructions are the ones labeled. (assume integer math and comparisons are "free".) **Hint: You should inspect the code and consider instruction dependencies in the code.**

percent of peak throughput

E. Now imagine we return to the code in part C, but change the NBA2K processor in the following ways.

- The processor can **only perform one 32-wide SIMD instruction per core per clock** (not 4)
- The processor only supports **2 hardware threads per core.** (not 64)
- We keep the infinite memory bandwidth, but **introduce a memory latency of 256 cycles.**

Assuming that you run the ISPC code in a manner where you create **one task for every hardware execution context of the processor**, do you expect the program to realize near peak arithmetic throughput of the machine? Why or why not?

Circle one:                    Yes                    No

Explain:

## A Task Queue on a Multi-Core, Multi-Threaded CPU

### PRACTICE PROBLEM 4:

The figure below shows a single-core CPU with an 32 KB L1 cache and execution contexts for up to two threads of control. The core executes threads assigned to contexts T0-T1 in an interleaved fashion by switching the active thread only on a memory stall; **Memory bandwidth is infinitely high in this system, but memory latency on a cache miss is 200 clocks.**

FAQ about the cache: To keep things simple, assume a cache hit takes only a one cycle. Assume cache lines are 4 bytes (a single floating point value), and the cache implements a least-recently used (LRU) replacement policy—meaning that when a cache line needs to be evicted, the line that was last accessed the furthest in the past is evicted. It may be helpful to think about how this cache behaves when a program reads 33 KB contiguous bytes of memory over and over. Hint: confirm to yourself that in this situation every load will be a cache miss.

In this problem assume the CPU performance no prefetching.



You are implementing a task queue for a system with this CPU. The task queue is responsible for executing independent tasks that are created as a part of a bulk launch (much like how an ISPC task launch creates many independent tasks). You implement your task system using a pool of worker threads, all of which are spawned at program launch. When tasks are added to the task queue, the worker threads grab the next task in the queue by atomically incrementing a shared counter `next_task_id`. Pseudocode for the execution of a worker thread is shown below.

```
mutex queue_lock;
int next_task_id;           // set to zero at time of bulk task launch
int total_tasks;            // set to total number of tasks at time of bulk task launch
float* task_args[MAX_NUM_TASKS]; // initialized elsewhere

while (1) {

    int my_task_id;

    LOCK(queue_lock);
    my_task_id = next_task_id++;
    UNLOCK(queue_lock);

    if (my_task_id < total_tasks)
        TASK_A(my_task_id, task_args[my_task_id]);
    else
        break;
}
```

A. Consider one possible implementation of TASK\_A from the code on the previous page:

```
function TASK_A(int task_id, float* X) {
    for (int i=0; i<1000; i++) {
        for (int j=0; j<1024*64; j++) {
            load X[j]    // assume this is a cold miss when i=0
            // ... 50 non-memory instructions using X
        }
    }
}
```

The inner loop of TASK\_A scans over 64K elements = 256 KB of elements of array X, performing 50 arithmetic instructions after each load. This process is repeated over the same data 1000 times. **Assume there are no other significant memory instructions in the program and that each task works on a completely different input array X (there is no sharing of data across tasks). Remember the cache is 32 KB.**

In order to process a bulk launch of TASK\_A, you create two worker threads, WT0 and WT1, and assign them to CPU execution contexts T0 and T1. Do you expect the program to execute *substantially faster* using the two-thread worker pool than if only one worker thread was used? If so, please calculate how much faster. (Your answer need not be exact, a back-of-the envelop calculation is fine.) If not, explain why.

*(Careful: please consider the program's execution behavior on average over the entire program's execution ("steady state" behavior). Past students have been tricked by only thinking about the behavior of the first loop iteration of the first task.) It may be helpful to draw when threads are running and stalled waiting for a load on the diagram below.*



B. Consider the same setup as the previous problem. How many hardware threads would the CPU core need in order for the machine to maintain peak throughput (100% utilization) on this workload?

C. **Now consider the case where the program is modified to contain 100,000 instructions in the inner-most loop.** Do you expect your two-thread worker pool to execute the program *substantially faster* than a one thread pool? If so, please calculate how much faster (your answer need not be exact, a back-of-the envelop calculation is fine). If not, explain why.

D. Now consider the case where the cache size is changed to 1 MB and you are running the original program from Part A (50 math instructions in the inner loop). When running the program from part A on this new machine, do you expect your two-thread worker pool to execute the program *substantially faster* than a one thread pool? If so, please calculate how much faster (your answer need not be exact, a back-of-the envelop calculation is fine). If not, explain why.



E. Now consider the case where the L1 cache size is changed to 384 KB. Assuming you cannot change the implementation of TASK\_A from Part A, would you choose to use a worker thread pool of one or two threads? Why does this improve performance and how much higher throughput does your solution achieve?

## Multi-Core Architecture Practice

## PRACTICE PROBLEM 5:

- A. Consider a multi-core processor that runs at 2 GHz and has 4 cores. Each core can perform up to one 8-wide SIMD vector instruction per clock and supports hardware multi-threading with 4 hardware execution context per core. What is the maximum throughput of the processor in units of scalar floating point operations per second? Please show your calculations.
  
- B. Consider the processor from part A running a program that perfectly parallelizes onto many cores, makes perfect usage of SIMD vector processing, and has an arithmetic intensity of 4. (4 scalar floating ops per byte of data transferred from memory.) If we assume the processor has a memory system providing 64 GB/sec of bandwidth, is the program compute-bound or bandwidth bound on this processor? You may assume for math simplicity that 1 GB/sec is a one billion bytes per second ( $10^9$ ). Please show your work.

C. Consider a cache that contains 32 KB of data, has a cache line size of 4 bytes, is fully associative (meaning any cache line can go anywhere in the cache), and uses an LRU (least recently used—the line evicted is the line that was last accessed the longest time ago) replacement policy. Please describe why the following code will take a cache miss on every data access to the array A.

```
const int SIZE = 1024 * 64;
float A[SIZE];
float sum = 0.0;
for (int reps=0; reps<32; reps++)
    for (int i=0, i<SIZE; i++)
        sum += A[i];
```

## Dependency Graphs, ILP, and Superscalar/Multi-Threaded Execution

### PRACTICE PROBLEM 6:

Consider the following sequence of scalar instructions running within a single thread.

```
1. LD  R0 <- mem[R4]    // load memory address given by R4 into R0
2. ADD R1 <- R0, R0    // R1 = R0 + R0
3. ADD R2 <- R0, R0    // R2 = R0 + R0
4. ADD R3 <- R0, R0    // R3 = R0 + R0
5. MUL R1 <- R1, R1
6. MUL R2 <- R2, R2
7. MUL R3 <- R3, R3
8. ADD R1 <- R1, R2
9. ADD R1 <- R1, R3
10. ST  mem[R5] <- R1  // store R1 into memory at address given by R5
```

A. Please draw the instruction dependency graph for this instruction sequence.

B. What is the maximum amount of instruction level parallelism (ILP) present in the program. Keep in mind that ILP is entirely a property of the program itself, not the machine it is run on.

C. Consider running this instruction stream on a single core, single-threaded processor that has superscalar execution capability to perform up to two instructions per clock, **PROVIDED THAT EXACTLY ONE INSTRUCTION IS A MUL**. In other words, the processor has two execution units (ALUs): one can execute add/load/store instructions and the other can execute only multiplications. Please assume all instructions (including loads/stores) individually complete in one cycle. You cannot modify the instructions in the program. What is the minimum number of cycles needed to execute this program? (Hint: think carefully about what instructions are allowed to run concurrently. The processing can run two instructions at the same time if they are independent and exactly one is a multiply.)

D. Now assume that the code above is changed so that it is: (1) running in a loop, where instructions 1-10 make up the body of the loop, and in each iteration of the loop the addresses stored in R4 and R5 are different. (2) Loop iterations are perfectly parallelized across `std::threads`. We are running this code on a **single core, multi-threaded processor that can process one instruction per clock** (arithmetic instructions, LDs, STs). However, from the moment a processor begins to execute a load instruction, there is a latency of 25 cycles before the data can be used by a dependent instruction. To be clear: if a core issues a LD in clock  $c$ , the LD occupies the core in clock  $c$ , but then the core cannot run an instruction that uses the value of the load until clock  $c + 25$ .

How many threads must be interleaved for the core to run at 100% efficiency? Please justify your answer.

The patterns are pretty, but the SIMD efficiency may not be

### PRACTICE PROBLEM 7:

Consider the following ISPC code that processes a  $16 \times 16$  input image (`input`) containing white, gray, or black pixels. It produces a  $16 \times 16$  output image (`output`).

```
const int IMAGE_SIZE = 16;
void myfunction(uniform float* input, uniform float* output) {

    for (uniform int row=0; row<IMAGE_SIZE; row++) {
        for (uniform int col=0; col<IMAGE_SIZE; col+=programCount) {

            int idx = row*IMAGE_SIZE + col + programIndex;
            float val = input[idx];      // load four bytes
            float result;

            if (!isWhite(val)) {
                if (isGray(val)) {
                    result = foo1(val);    // 10 cycles of arithmetic
                } else {
                    result = foo2(val);    // 10 cycles of arithmetic
                }
            } else {
                result = foo3(val);    // 10 cycles of arithmetic
            }
            output[idx] = result;      // store four bytes
        }
    }
}
```

The questions are on the next page...

A. Consider running the code on the image below. **Note the full image is  $16 \times 16$  pixels... the entire image is not shown in the figure but you can assume the pattern repeats as indicated, with only the diagonal being colored.**

Assume that the only arithmetic cycles we want you to count are the arithmetic instructions labeled in the code. (All conditionals and load/stores are “free”.) What is the overall SIMD efficiency (50%, 100%, etc.) achieved when running the code on a 1 GHz single-core CPU with a **SIMD-width (and corresponding ISPC gang size) of 4**. In other words, the core can perform one 4-wide SIMD operation per clock. (Hint: start by computing the number of cycles needed to perform one row worth of the computation.)



B. What is the SIMD efficiency obtained when running the same code, using the same processor, but on the image below? Again, please keep in mind the image is  $16 \times 16$  in size (e.g., the pattern of columns below just extends downward another 8 rows).



C. Now imagine that you are given the following function:

```
transpose(uniform float* input, uniform float* output)
```

This function transposes the  $16 \times 16$  input image and stores the result in a  $16 \times 16$  output. (Recall that a transpose is  $\text{output}[i][j] = \text{input}[j][i]$ .) Write down pseudocode for how you would use calls to `myfunction` and `transpose` to produce a computation that **eliminates all execution divergence** when running `myfunction` on the image from part B. Yes, you can allocate temp buffers as needed.

Fill in pseudocode in `myNewFunction` below. Your solution should compute the same output image `output` as `myfunction(input, output)`.

```
void myNewFunction(float* input, float* output)
```

D. Imagine that the code is run on the same processor, but now with a memory system with a memory bandwidth of **8 bytes per clock**. Please assume that the transpose operation is **bandwidth bound** and the cost of the operation is equal to the time needed to transfer required data to and from memory. You should also assume that **myfunction** is **compute bound** and that the required time is a function of the number of cycles of arithmetic performed.

**When running on the input image from part B** Is the solution you proposed in part C faster or slower than the single call to **myfunction** used in part B? How much faster is it (in clocks)?

Please keep in mind that the processor:

- Can execute one four-wide SIMD operation per clock
- Is attached to a memory bus that can transfer 8 bytes of data per clock
- The input/output images are  $16 \times 16$  elements in size.
- Each image element (a float) is 4 bytes

**Hint: compute the amount of time needed to transpose the data, and then compute the amount of time (in clocks) needed to perform **myfunction** on this modified input.**

## Caching Basics

### PRACTICE PROBLEM 8:

A. Assume we are running a program on a processor with a data cache. All data loaded from memory is first loaded into the processor's data cache, and then transferred from the cache to the processor's registers. (This is true of most systems.) **When new data is brought into the cache, the cache has a policy of evicting the least recently used data** (the data that has been accessed the longest time ago) to make room for the newly accessed data. Imagine that the cache is 32 KB, and consider running the following program:

```
const int SIZE = 64 * 1024;
float mydata[SIZE]; // hint: how much data is this?

float sum = 0.0;
for (int i=0; i<100; i++) {
    for (int j=0; j<SIZE; j++) {
        sum += mydata[j];
    }
}
```

A **cache miss** occurs when a processor accesses data from memory that is not present in the cache. When the program starts running, each each of data accessed during the  $i=0$  iteration of the outer loop is a cache miss, since that is the first time the data was accessed in the program. Now consider the entire program's execution. Please describe what fraction of the accesses to the `mydata` array will be cache misses. (For those that are familiar with the details of cache operation, please assume that the cache is fully-associative, and has a cache line size of one float. If these terms are unfamiliar to you, you can safely ignore them... or Google it!)

B. Now consider the case where `SIZE = 2048`. Does your answer to part A change with this new array size. Why or why not?

C. Now assume that you are running the following code. Would you rather have a processor with a 32 KB data cache, or would you rather add multi-threading to the processor. Why or why not?

```
float sum = 0.0;
for (int i=0; i<1000000; i++) {
    sum += 2.0 * myarray[i] + 1.0;
}
```

## Superscalar and Hardware Multi-Threading

### PRACTICE PROBLEM 9:

Consider the following sequence of 12 instructions. There is a load operation, followed by 10 math operations, followed by a store. Note that some of the instructions are scalar instructions, and others are vector instructions operating on vector registers (Vx registers). The vector operations have "V" at the beginning of their instruction names.

```
1. LD      R1    <- [R0]
2. VSPLAT  V0    <- R1      // copy R1 into all elements of V0
3. MUL     R2    <- R1, R1
4. ADD     R2    <- R2, 16   // R2 + 16
5. VSPLAT  V1    <- R2      // copy R2 into all elements of V1
6. VMUL    V2    <- V0, V0
7. VADD    V3    <- V0, V0
8. VMUL    V3    <- V1, V3
9. VMUL    V3    <- V2, V3
10. VRED   R1    <- V3      // reduction: sum all elements of V3 into R1
11. MUL     R1    <- R1, R2
12. ST      [R4]  <- R1
```

A. Please draw the dependency graph for the instruction sequence.

B. Imagine the instruction sequence is executed on a *single-core, single-threaded processor*. The processor supports **superscalar execution** in that it can fetch/decode up to **two instructions per clock**, but it has one scalar execution unit and one vector execution unit. Therefore, **it can run instructions IN ANY ORDER THAT RESPECTS THE PROGRAM DEPENDENCY GRAPH**, but it can only run two independent instructions per clock if and only if one instruction is a scalar instruction and the other instruction is a vector instruction. Assuming that all instructions take 1 cycle to complete, how many cycles does it take to complete this instruction sequence?

C. Now assume the sequence of instructions on the previous page is run in a loop. For example:

```
float in[VERY_BIG]; // let VERY_BIG = 100,000,000
float out[VERY_BIG];

// parallelize iterations of this loop using threads
#pragma omp parallel for
for (int i=0; i<VERY_BIG; i++) {
    // assume myfunc() is the 10 non-LD/ST instrs in the seq above
    out[i] = myfunc(in[i]);
}
```

The loop is run on the same single core, single-threaded processor as before, **but now the latency of a LD instruction is 20 clocks**. (That is, if the LD instruction begins on clock  $c$ , an instruction depending on the LD can begin on clock  $c + 20$ . (There is one cycle to execute the LD instruction on the processor's scalar unit, followed by 19 cycles of waiting before the dependent instruction can begin.) All other instructions still have a latency of 1 cycle.

In this setup, **what is the utilization of the core's vector execution unit?** (What fraction of cycles is the vector unit executing instructions? Hint: What are all the reasons the vector unit might not be utilized? Note: it's fine to give your answer as a fraction.)

D. Now imagine the core is multi-threaded, and can choose **to simultaneously execute two instructions from the same thread, or from different threads, provided they meet the one scalar and one vector instruction per clock constraint**. In this design, can you achieve full utilization of the vector unit? If so, how many total threads do you need (please explain why) If not, explain why.

## Picking the Right CPU for the Job

### PRACTICE PROBLEM 10:

You write a bit of ISPC code that modifies a grayscale image of size  $32 \times \text{height}$  pixels based on the contents of a black and white “mask” image of the same size. The code brightens input image pixels by a factor of 1000 if the corresponding pixel of the mask image is white (the mask has value 1.0) and by a factor of 10 otherwise.

The code partitions the image processing work into 128 ISPC tasks, which you can assume balance perfectly onto all available CPU processors.

```
void brighten_image(uniform int height, uniform float image[], uniform float mask_image[])
{
    uniform int NUM_TASKS = 128;
    uniform int rows_per_task = height / NUM_TASKS;
    launch[NUM_TASKS] brighten_chunk(rows_per_task, image, mask_image);
}

void brighten_chunk(uniform int rows_per_task, uniform float image[], uniform float mask_image[])
{
    // 'programCount' is the ISPC gang size.
    // 'programIndex' is a per-instance identifier between 0 and programCount-1.
    // 'taskIndex' is a per-task identifier between 0 and NUM_TASKS-1

    // compute starting image row for this task
    uniform int start_row = rows_per_task * taskIndex;

    // process all pixels in a chunk of rows
    for (uniform int j=start_row; j<start_row+rows_per_task; j++) {
        for (uniform int i=0; i<32; i+=programCount) {

            int idx = j*32 + i + programIndex;
            int iters = (mask_image[idx] == 1.f) ? 1000 : 10;

            float tmp = 0.f;
            for (int k=0; k<iters; k++)
                tmp += image[idx];           // these are the ops we want you to count

            image[idx] = tmp;
        }
    }
}
```

(question continued on next page)

You go to the store to buy a new CPU that runs this computation as fast as possible. On the shelf you see the following three CPUs on sale for the same price:

- (A) 1 GHz *single core* CPU capable of performing one 32-wide SIMD floating point addition per clock
- (B) 1 GHz *12-core* CPU capable of performing one 2-wide SIMD floating point addition per clock
- (C) 4 GHz *single core* CPU capable of performing one floating point addition per clock (no parallelism)



Figure 1: Image masks used to govern image manipulation by `brighten_image`

A. If your only use of the CPU will be to run the above code as fast as possible, and assuming the code will execute using mask image 1 above, rank all three machines in order of performance. Please explain how you determined your ranking by comparing execution times on the various processors. When considering execution time, you may assume that (1) the only operations you need to account for are the floating-point additions in the innermost 'k' loop. (2) The ISPC gang size will be set to the SIMD width of the CPU. (3) There are no stalls during execution due to data access.

(Hint: it may be easiest to consider the execution time of each row of the image.)

B. Rank all three machines in order of performance for mask image 2? Please justify your answer, but you are not required to perform detailed calculations like in part A.

## Be a Parallel Processor Architect

### PRACTICE PROBLEM 11:

You are hired to start the parallel processor design team at Lagunita Processors, Inc. Your boss tells you that you are responsible for designing the company's first shared address space multi-core processor, which will be constructed by cramming multiple copies of the company's best selling uniprocessor core on a single chip. Your boss expects the project to yield at least a  $5\times$  speedup on the performance of the program given below. You are not allowed to change the program, and assume that:

- Each Lagunita core can complete one floating point operation per clock
- Cores are clocked at 1 GHz, and each have a 1 MB cache using LRU replacement.
- All Lagunita processors (both single and multi-core) are attached to a 100 GB/s memory bus
- Memory latency is perfectly hidden (Lagunita processors have excellent pre-fetchers)

```
float A[N]; // let N = 100 million elements
float total = 0;

// ASSUME TIMER STARTS HERE //////////////////////////////

for (int i=0; i<N; i++)
    total += A[i];

for (int i=0; i<9; i++) {

    // made up syntax for brevity: 'parallel_for'
    // Assume iterations of this loop are perfectly partitioned
    // using blocked assignment among X pthreads each running on
    // one of the processor's X cores.
    parallel_for(int j=0; j<N; j++) {
        A[j] = A[j] / total;
    }
}

// ASSUME TIMER STOPS HERE //////////////////////////////
```

A. How do you respond to your boss' request? Do you believe you can meet the performance goal? If yes, how many cores should be included in the new multi-core processor? If no, explain why.

B. You tell your boss that if you were allowed to make a few changes to the code, you could deliver a much better speedup with your parallel processor design. How would you change the code to improve its performance by improving *speedup*? (A simple description of the new code is fine). If your answer was NO in part one, how many processors are required to achieve  $5\times$  speedup now? If your answer was YES, approximately what speedup do you expect from your previously proposed machine on the new code? (*Note: we are NOT looking for answers that optimize the program by rolling multiple divisions into one.*)

C. Assume that the following year, Lagunita Processors, Inc. decides to produce a 32-core version of your parallel CPU design. In addition to adding cores, your boss gives you the opportunity to further improve the processor through one of the following three options.

- You may double each processor's cache to 2 MB.
- You may increase memory bandwidth by 50%
- You may add a 4-wide SIMD unit to the core so that each core can perform 4 floating point operations per clock.

If each of these options has the same cost, given the code you produced in part B (and what you learned from assignment 1), which option do you recommend to your boss? Why?

## SPMD Tree Search

### PRACTICE PROBLEM 12:

NOTE: This question is tricky. If you can answer this question you really understand how the SPMD programming model maps to SIMD execution!

The figure below shows a collection of line segments in 1D. It also shows a binary tree data structure organizing the segments into a hierarchy. Leaves of the tree correspond to the line segments. Each interior tree node represents a spatial extent that bounds all its child segments. Notice that sibling leaves can (and do) overlap. Using this data structure, it is possible to answer the question “what is the largest segment that contains a specified point” without testing the point against all segments in the scene.

For example, the answer for point  $p = 0.15$  is segment 5 (in node N5). The answer for the point  $p = 0.75$  is segment 11 in node N11.

Line segments in 1D



Binary Search Tree:



```
struct Node {
    float min, max;      // if leaf: start/end of segment, else: bounds on all child segments.
    bool leaf;            // true if node is a leaf node
    int segment_id;        // segment id if this is a leaf
    Node* left, *right; // child tree nodes
};
```

On the following two pages, we provide you two ISPC functions, `find_segment_1` and `find_segment_2` that both compute the same thing: they use the tree structure above to find the id of the largest line segment that contains a given query point.

```

struct Node {
    float min, max;      // if leaf: start/end of segment, else: bounds on all child segments.
    bool leaf;           // true if node is a leaf node
    int segment_id;      // segment id if this is a leaf
    Node* left, *right; // child tree nodes
};

// -- computes segment id of the largest segment containing points[programIndex]
// -- root_node is the root of the search tree
// -- each program instance processes one query point
export void find_segment_1(uniform float* points, uniform int* results, uniform Node* root_node) {

    Stack<Node*> stack;
    Node* node;
    float max_extent = 0.0;

    // p is point this program instance is searching for
    float p = points[programIndex];
    results[programIndex] = NO_SEGMENT;

    stack.push(root_node);

    while(!stack.size() == 0) {
        node = stack.pop();

        while (!node->leaf) {
            // [I-test]: test to see if point is contained within this interior node
            if (p >= node->min && p <= node->max) {
                // [I-hit]: p is within interior node... continue to child nodes
                push(node->right);
                node = node->left;
            } else {
                // [I-miss]: point not contained within node, pop the stack
                if (stack.size() == 0)
                    return;
                else
                    node = stack.pop();
            }
        }

        // [S-test]: test if point is within segment, and segment is largest seen so far
        if (p >= node->min && p <= node->max && (node->max - node->min) > max_extent) {
            // [S-hit]: mark this segment as 'best-so-far'
            results[programIndex] = node->segment_id;
            max_extent = node->max - node->min;
        }
    }
}

```

```

export void find_segment_2(uniform float* points, uniform int* results, uniform Node* root_node) {

    Stack<Node*> stack;
    Node* node;
    float max_extent = 0.0;

    // p is point this program instance is search for
    float p = points[programIndex];

    results[programIndex] = NO_SEGMENT;

    stack.push(root_node);

    while(!stack.size() == 0) {
        node = stack.pop();

        if (!node->leaf) {
            // [I-test]: test to see if point is contained within interior node
            if (p >= node->min && p <= node->max) {
                // [I-hit]: p is within interior node... continue to child nodes
                push(node->right);
                push(node->left);
            }
        } else {
            // [S-test]: test if point is within segment, and segment is largest seen so far
            if (p >= node->min && p <= node->max && (node->max - node->min) > max_extent) {
                // [S-hit]: mark this segment as 'best-so-far'
                results[programIndex] = node->segment_id;
                max_extent = node->max - node->min;
            }
        }
    }
}

```

Begin by studying `find_segment_1`.

Given the input  $p = 0.1$ , the a single program instance will execute the following sequence of steps: (I-test,N0), (I-hit,N0), (I-test, N1), (I-hit, N1), (I-test, N2), (I-hit, N2) (S-test,N3), (S-hit, N3), (I-test, N4), (I-hit, N4), (S-test, N5), (S-hit, N5), (S-test, N6), (S-test,N7), (I-test, N8), (I-miss, N8). Where each of the above “steps” represents reaching a basic block in the code (see comments):

- $(I\text{-test}, Nx)$  represents a point-interior node test against node  $x$ .
- $(I\text{-hit}, Nx)$  represents logic of traversing to the child nodes of node  $x$  when  $p$  is determined to be contained in  $x$ .
- $(I\text{-miss}, Nx)$  represents logic of traversing to sibling/ancestor nodes when the point is not contained within node  $x$ .
- $(S\text{-test}, Nx)$  represents a point-segment (left node) test against the segment represented by node  $x$ .
- $(S\text{-hit}, Nx)$  represents the basic block where a new largest node is found  $x$ .

**The question is on the next page...**

A. Confirm you understand the above, then consider the behavior of a **gang of 4 program instances** executing the above two ISPC functions `find_segment_1` and `find_segment_2`. For example, you may wish to consider execution on the following array:

```
points = {0.15, 0.35, 0.75, 0.95}
```

Describe the difference between the traversal approach used in `find_segment_1` and `find_segment_2` in the context of SIMD execution. Your description might want to specifically point out conditions when `find_segment_1` suffers from divergence. (Hint 1: you may want to make a table of four columns, each row is a step by the entire gang and each column shows each program instance's execution. Hint 2: It may help to consider which solution is better in the case of large, heavily unbalanced trees.)

B. Consider a slight change to the code where as soon as a best-so-far line segment is found (inside [S-hit]) the code makes a call to a **very, very expensive function**. Which solution might be preferred in this case? Why?