### Stanford CS149: Parallel Computing Written Assignment 5

### **Feeling Relaxed**

## Problem 1. (20 points):

A. (10 pts) Consider the following code executed by three threads on a **cache-coherent**, **relaxed consistency** memory system. Specifically, the system allows reordering of writes (W->W reordering) and in these cases makes no guarantees about when notification of writes is delivered to other processors. You should assume that all variables are initialized to 0 prior to the code you see below.

| P1:                             | P2:            | P3:            |
|---------------------------------|----------------|----------------|
| =============================== |                |                |
| x = 10;                         | while (!flag); | while (!flag); |
| flag = true;                    | print x;       | print x;       |
| print x;                        |                |                |

You run the code and P2 prints "10". List what values might be printed by P1 and P3. Please also explain why your answer shows that the system does not provide sequentially consistent execution.

B. (10 pts) Imagine you are given a memory write fence instruction (wfence), which ensures that all writes prior to the fence are visible to all processors when the fence operation completes. Please add the **minimal number of write fences** to the code in Part A to ensure that the output is guaranteed to be that same as the output of a machine with sequentially consistent memory.

Transactions on a Doubly Linked List

# Problem 2. (40 points):

Consider a **SORTED doubly-linked list** that supports the following operations.

- insert\_front, which traverses the list from the front.
- delete\_front, which deletes a node by traversing from the front
- insert\_back, which traverses the list **backwards from the end** to insert a node in the opposite order as insert\_front.



In this problem, assume that the entire body of each function insert\_front, delete\_front, and insert\_back is placed in its own atomic block, and the code is run on a system **supporting optimistic (for both reads and writes) transactional memory**.

- A. (10 pts) Your friend writes three unit tests that each execute a pair of operations concurrently on the list shown above.
  - Test 1: insert\_front(2), delete\_front(14)
  - Test 2: insert\_front(12), delete\_front(6)
  - Test 3: insert\_front(13), insert\_back(4)

Assuming all unit tests start with the list in the state shown above, is the code correct? (By correct, we mean there are no race conditions and so all operations will modify the data structure according to their specification.) Why or why not?

B. (10 pts) Consider two transactions performing insert\_front(4) and delete\_front(14). Assume both transactions start at the same time on different cores and the transaction for insert\_front(4) proceeds to commit while the delete\_front(14) transaction has just iterated to the node with value 7. Must either of the two transactions abort in this situation? Why? (Remember this is an optimistic transactional memory system!)

C. (10 pts) Must either transaction abort if the transaction for delete\_front(14) proceeds to commit before the transaction for insert\_front(4) does? Why? Please assume that at the time of the attempted commit, insert\_front(4) has iterated to node 3, but has not begun to modify the list.

D. (10 pts) Must either transaction abort if the situation in part C is changed so that delete\_front(14) attempts to commit first, but by this time insert\_front(4) has made updates to the list (although not yet initiated its commit)? Why?

### **Coherence and Transactional Memory**

## Problem 3. (40 points):

- A. (20 pts) Consider a cache coherence protocol that implements an **eager**, **pessimistic hardware transactional memory**. Each cache line can be one of three states: Invalid (I), Shared (S), or Exclusive (E). The protocol has the following rules:
  - A cache miss occurs if the cache line is not present or is in the wrong state.
  - Reads change the line in the cache to shared (S) state if it is (I) or not present.
  - A line can be in the shared (S) state in multiple caches.
  - Writes change the line in the cache to exclusive state
  - Only one cache can have the line in exclusive state, in all other caches the line must be invalid (I) or not present.
  - On a cache miss data comes from memory... Unless another processor's cache has the line in exclusive state in which case data comes from that cache.
  - There are processor actions **Tbeg**: begin transaction, **Tend**: end and commit transaction. Remember that when a transaction commits, that might allow a stalled transaction to continue.
  - Aborts cause the processor's cache's read and write cache state to be invalidated.
  - If a conflict is detected on a write, the transaction issuing the current action "wins" (abort the other conflicting transaction). If a conflict is detected on a read, the transaction issuing the read should stall waiting for the conflicting transaction to commit. (This is exactly as we discussed in the pessimistic example diagrams in class.

Given the rules above, show what happens to the cache line state for address X for references made by three processors (P1, P2, P3) by filling in the table below (the first two rows are given). Initially none of the caches contain address X. **If an action causes a processor P to abort or stall a transaction, write "abort" or "stall" in the table entry at the row for that action and the column for P together with the state of P (e.g. "S, stall")**. If a transaction aborts, for simplicity assume that it never resumes for the rest of the duration of the table. (Also, if a transaction has already aborted, assume that the processor executing a **Tend** in a later row of the table does nothing.)

| Processor Action | Hit / Miss | P1 state | P2 state | P3 state | Data comes from |
|------------------|------------|----------|----------|----------|-----------------|
| P1,P2, P3 Tbeg   |            |          |          |          |                 |
| P1 read x        | miss       | S        |          |          | mem             |
| P3 read x        | miss       | S        |          | S        | mem             |
| P2 read x        |            |          |          |          |                 |
| P1 Tend, Tbeg    |            |          |          |          |                 |
| P2 Tend, Tbeg    |            |          |          |          |                 |
| P1 read x        |            |          |          |          |                 |
| P3 write x       |            |          |          |          |                 |
| P2 read x        |            |          |          |          |                 |
| P1 Tend          |            |          |          |          |                 |
| P3 Tend          |            |          |          |          |                 |
| P2 Tend          |            |          |          |          |                 |

- B. (20 pts) Now imagine a cache coherence protocol that implements a **lazy, optimistic** hardware transactional memory system. Each cache line can be one of four states: Invalid (I), Shared (S), Shared Write (SW) or Exclusive (E). The protocol has the following rules:
  - A cache miss occurs if the cache line is not present or is in the wrong state.
  - Reads change the line in the cache to shared (S) state if it is I or not present.
  - A line can be in shared (S) state in multiple caches.
  - Writes change the line in the cache to shared write (SW) state; allowed in multiple caches. (Note the SW state is functioning as the write log.)
  - SW and S can co-exist in different caches (convince yourself why this is true!)
  - Only one cache can have the line in exclusive state, in all other caches the line must be invalid (I) or not present.
  - On a cache miss data comes from memory... unless another processor's cache has the line in exclusive state in which case data comes from that cache.
  - There are processor actions **Tbeg:** begin transaction, **Tend:** end and commit transaction
  - Aborts cause read and write cache state to be invalidated
  - Transaction commit (Tend) causes cache lines to transition from shared write (SW) to exclusive (E) state and may cause other transactions to abort.

Given the rules above, show what happens to the cache line state for address X for references made by three processors (P1, P2, P3) by filling in the table below. Initially none of the caches contain address X. If an action causes a processor P to abort or stall a transaction, write "abort" or "stall" in the table entry at the row for that action and the column for P together with the state of P (e.g. "S, stall"). If a transaction aborts, assume that Tend does nothing.

| Processor Action | Hit / Miss | P1 state | P2 state | P3 state | Data comes from |
|------------------|------------|----------|----------|----------|-----------------|
| P1,P2, P3 Tbeg   |            |          |          |          |                 |
| P1 read x        |            |          |          |          |                 |
| P3 read x        |            |          |          |          |                 |
| P2 read x        |            |          |          |          |                 |
| P1 Tend, Tbeg    |            |          |          |          |                 |
| P2 Tend, Tbeg    |            |          |          |          |                 |
| P1 read x        |            |          |          |          |                 |
| P3 write x       |            |          |          |          |                 |
| P2 read x        |            |          |          |          |                 |
| P1 Tend          |            |          |          |          |                 |
| P3 Tend          |            |          |          |          |                 |
| P2 Tend          |            |          |          |          |                 |

#### **Practice Problem 1: Transactions on Trees**

Consider the binary search tree illustrated below.



The operations insert (insert value into tree, assuming no duplicates) and sum (return the sum of all elements in the tree) are implemented as transactional operations on the tree as shown below.

```
struct Node {
  Node *left, *right;
   int value;
};
Node* root; // root of tree, assume non-null
void insertNode(Node* n, int value) {
  if (value < n->value) {
    if (n->left == NULL)
       n->left = createNode(value);
    else
       insertNode(n->left, value);
  } else if (value > n->value) {
    if (n->right == NULL)
       n->right = createNode(value);
    else
       insertNode(n->right, value);
   // insert won't be called with a duplicate element, so there's no else case
  }
}
int sumNode(Node* n) {
   if (n == null) return 0;
   int total = n->value;
  total += sumNode(n->left);
  total += sumNode(n->right);
   return total;
}
void insert(int value) { atomic { insertNode(root, value); }
                                                                 }
                       { atomic { return sumNode(root); )
int sum()
                                                                 }
```

Consider the following four operations are executed against the tree in parallel by different threads.

insert(10); insert(25); insert(24); int x = sum();

A. Consider different orderings of how these four operations could be evaluated. Please draw all possible trees that may result from execution of these four transactions. (Note: it's fine to draw only subtrees rooted at node 20 since that's the only part of the tree that's effected.)

B. Please list all possible values that may be returned by sum().

C. Do your answers to parts A or B change depending on whether the implementation of transactions is optimistic or pessimistic? Why or why not?

D. Consider an implementation of lazy, optimistic transactional memory that manages transactions at the granularity of tree nodes (the read and writes sets are lists of nodes). Assume that the transaction insert(10) commits when insert(24) and insert(25) are currently at node 20, and sum() is at node 40. Which of the four transactions (if any) are aborted? Please describe why.

E. Assume that the transaction insert(25) commits when insert(10) is at node 15, insert(24) has already modified the tree but not yet committed , and sum() is at node 3. Which transactions (if any) are aborted? Again, please describe why.

F. Now consider a transactional implementation that is **pessimistic** with respect to writes (check for conflict on write) and **optimistic** with respect to reads. The implementation also employs a "writer wins" conflict management scheme – meaning that the transaction issuing a conflicting write will not be aborted (the other conflicting transaction will). Describe how a **livelock problem** could occur in this code.

G. Give one livelock avoidance technique that an implementation of a pessimistic transactional memory system might use. You only need to summarize a basic approach, but make sure your answer is clear enough to refer to how you'd schedule the *transactions*.

### **Practice Problem 2: Implementing Transactions**

In this problem we will explore the implementation of an **optimistic read**, **pessimistic write**, **eager versioning** software TM (STM). The STM operates over 32-bit values.

In your implementation, each transaction is encapsulated by a Txn object that maintains a local timestamp for the transaction as well as the transaction's read and write sets. Your implementation should have the following properties:

- 1. A global timestamp and a single global lock to protect commits.
- 2. A transaction's local timestamp is the value of the global timestamp when the transaction starts.
- 3. A table that maps memory locations to a version number.
- 4. Writes are stored to a write log wset as (address, value) pairs.
- 5. The version of committed writes is the current global timestamp; committing also increments the global timestamp.
- 6. The read set is validated on commit; if any read location has a version number greater than the local timestamp the transaction retries.

The skeleton code for the transactional memory system is given on the next page. You should write your answers in the space provided on the page after that. **Code for \_\_begin is provided, and you should provide code for read, write, and commit.** Don't get hung up on syntax; we don't expect you to pen down flawless, compilable C code - some pseudocode is acceptable as long as its meaning is clear. Example details of importance: What is added to the read and write sets, when are locks taken, when are conflicts validated (and how?).

```
// setjmp stores a snapshot of the registers (stack pointer, instruction pointer, etc.) into
// a buffer (t.rollback). A future call to longjmp restores the saved register values and thus
// restarts control flow at the point when setjmp was called.
#define TXN_BEGIN(t)
                       \ // TXN_BEGIN is called to begin a transaction.
       setjmp(t.rollback); \
       t.__begin();
typedef uint64_t timestamp_t;
class Txn {
public:
 Txn() {}
  virtual ~Txn() {}
  void retry() { longjmp(rollback, 1); } // return control flow to context saved by setjmp
  void __begin();
  void write(uint32_t* p, uint32_t v); // Students implement this!
  uint32_t read(uint32_t* p);
                                   // Students implement this!
  void
           commit();
                                      // Students implement this!
  jmp_buf
               rollback:
  typedef std::map<uint32_t*, mutex_t> write_lock_t; // Used for write locks
  write_lock_t wlock; // wlock is a map, so wlock[p] is the lock for the object p
private:
  #define TABLE_SZ
                     4096
  timestamp_t local_timestamp;
  static timestamp_t get_version(uint32_t* p) {
    return versions[(((intptr_t)(p)) / 4) % TABLE_SZ];
  }
  static void set_version(uint32_t* p, timestamp_t t) {
    versions[(((intptr_t)(p)) / 4) % TABLE_SZ] = t;
  }
  // Used to log writes
  typedef std::map<uint32_t*, uint32_t> write_set_t;
  write_set_t wset;
  // Used to keep track of reads that this transaction has made
  typedef std::set<uint32_t*> read_set_t;
  read_set_t
               rset;
  // Used to map memory addresses to a timestamp (e.g. to indicate most recent use)
  static timestamp_t versions[TABLE_SZ];
  static timestamp_t global_timestamp;
 static mutex_t
                     commit_lock;
};
timestamp_t Txn::global_timestamp = 0; // system-wide global
mutex_t
            Txn::commit_lock;
                                       // system-wide global
timestamp_t Txn::versions[TABLE_SZ]; // system-wide global
void Txn::__begin(void) {
 wset.clear();
  rset.clear();
  local_timestamp = global_timestamp;
}
```

```
void Txn::write(uint32_t* p, uint32_t v) {
    // YOUR CODE HERE
    // Your code can assume that a mutex supports lock(), unlock(),
    // and trylock() operations. Trylock() returns true if the lock
    // is currently locked, false otherwise.
    // e.g., trylock(wlock[p]) checks to see if the lock on object p is currently taken.
```

}

uint32\_t Txn::read(uint32\_t\* p) {
 // YOUR CODE HERE!
 // Your code can assume that a mutex supports lock(), unlock(),
 // and trylock() operations. Trylock() returns true if the lock
 // is currently locked, false otherwise
 // e.g., trylock(wlock[p]) checks to see if the lock on object p is currently taken.

void Txn::commit() {
 mutex\_lock(&commit\_lock);
 // YOUR CODE HERE!
 // Your code can assume that a mutex supports lock(), unlock(),
 // and trylock() operations. Trylock() returns true if the lock
 // is currently locked, false otherwise
 // e.g., trylock(wlock[p]) checks to see if the lock on object p is currently taken.

mutex\_unlock(&commit\_lock);
}

#### **Practice Problem 3: Controlling DRAM**

Consider a DRAM DIMM with 8 chips (8-bit interface per chip) just like what we talked about in class. Physical memory addresses are strided across the chips as in the figure below, so that 64 consecutive bits from the address space can be read in a single clock over the bus. The DRAM row size is 2 kilobits (256 bytes). There is only a single bank per chip. (We ignore banking in this problem.)



The memory controller processes requests with the following logic:

```
int active_row; // stores active row
handle_64bit_request(void* addr) {
    int row, col;
    compute_row_col(addr, &row, &col); // compute row/col from addr (0 cycles)
    if (row != active_row)
        activate_row(row); // this operation takes 15 cycles
        transfer_column(col); // this operation takes 1 cycle
}
```

Questions are on the next page...

Now consider the following C-program, which executes using two threads on a dual-core processor with a single shared cache.

```
struct ThreadArg {
  int threadId;
  double sum; // thread-local variable
  int N;
              // assume this is very large
  double* A; // pointer to shared array
};
// each thread processes one half of array A
void myfunc(ThreadArg* arg) {
  arg -> sum = 0.f;
  int offset = arg->threadId * arg->N / 2;
  for (int i=0; i<arg->N / 2; i++)
    arg->sum += arg->A[offset + i];
}
/* main code */
ThreadArg args[2];
args[0].threadId = 0; args[1].threadId = 1;
args[0].A = args[1].A = new double[N];
// initialize args[].sum, args[].N, args[].A, and launch two threads here that run myfunc
// Then wait for threads to complete
```

```
print("%f\n", args[0].sum + args[1].sum);
```

- A. Assume that the two threads run at approximately the same speed, so the memory controller receives requests from the two threads in interleaved order: thread0\_req0, thread1\_req0, thread0\_req1, thread1\_req1, etc. Given this stream, what is the effective bandwidth of the memory system as observed by the processor (the rate at which it receives data)? Assume that:
  - The program is bandwidth bound so that the memory system always has a deep queue of requests to process.
  - The granularity of transfer between the memory controller and the cache is 64 bits. (e.g., 8-byte cache line size)
  - Note that array elements are DOUBLES (8 bytes).

B. Modify the program code to significantly improve the effective memory system bandwidth. What is the new bandwidth you observe?

C. Return to the original code given in this assignment (ignore your solution to part B), and assume that requests now arrive at the memory controller every ten cycles. For example...

cycle 0: thread 0 req 0 cycle 10: thread 1 req 0 cycle 20: thread 0 req 1 cycle 30: thread 1 req 1 cycle 40: thread 0 req 2 cycle 50: thread 1 req 2 cycle 60: thread 0 req 3 ...

Write (rough) pseudocode for a memory request scheduling algorithm that allows the memory system to keep up with this request stream. Your implementation can assume there is an incoming request buffer called request\_buf that holds up to 4 requests. (The processor stalls if the request buffer is full.)

D. (TRICKY!) You add hardware multi-threading to your dual-core processor (2 threads-per core) and modify your code to spawn four threads. You assign contiguous blocks of the input array to each thread. Assuming the request arrival rate stays the same (but now requests from four threads, rather than two, are interleaved), how would you change your solution in part C to keep up with the request stream? (you may modify the buffer size if need be). Is overall memory latency higher or lower than in part C? Why?