What's Going On

The key is to grab enough locks, and in the correct order, that no competing operation can interject during this operation and change what we expect the pointers in the nodes we are examining to point to.
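A minimal sketch of this "lock coupling" (hand-over-hand) idea on a linked list, in C++ with std::mutex; the Node layout and contains function are illustrative, not from the lecture:

```cpp
#include <mutex>

// Each node carries its own lock (illustrative layout).
struct Node {
    int value;
    Node* next;
    std::mutex lock;
};

// Returns true if `target` is in the list. While examining a node we
// hold its lock, and we grab the next node's lock *before* releasing
// the current one, so no concurrent insert/delete can slip in between
// and change the pointers we are following.
bool contains(Node* head, int target) {
    if (head == nullptr) return false;
    head->lock.lock();
    Node* cur = head;
    while (cur != nullptr) {
        if (cur->value == target) {
            cur->lock.unlock();
            return true;
        }
        Node* next = cur->next;
        if (next != nullptr) next->lock.lock();  // grab next first...
        cur->lock.unlock();                      // ...then release current
        cur = next;
    }
    return false;
}
```

At any instant we hold at most two locks, so the traversal is fine-grained, but the acquire order (always head-to-tail) is what rules out a competing writer sneaking in between two nodes.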

GraphCore, a company that makes AI processors especially for graph-related purposes, similarly replaces caches on their chips with scratch space.

bmehta22 commented on Memory Consistency, Slide 73

int atomic_increment(int* addr, int x) {
    int val = *addr;
    while (atomicCAS(addr, val, val + x) != val) {
        val = *addr;
    }
    return val + x;
}
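For comparison, a CPU-side sketch of the same retry loop using C++'s std::atomic, where compare_exchange_weak plays the role of CUDA's atomicCAS (the function name just mirrors the one above):

```cpp
#include <atomic>

// Atomically adds x to *addr via a compare-and-swap retry loop and
// returns the new value, mirroring the CUDA atomicCAS pattern.
int atomic_increment(std::atomic<int>* addr, int x) {
    int val = addr->load();
    // On failure, compare_exchange_weak reloads `val` with the value
    // currently in memory, so each retry uses a fresh snapshot.
    while (!addr->compare_exchange_weak(val, val + x)) {
        // retry with updated val
    }
    return val + x;
}
```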

bmehta22 commented on Transactional Memory 2, Slide 60

Not all operations possible on a CPU are even needed for a specific purpose chip.

Not all of these threads can be executed at once, though, as only one warp per core executes at a time

bmehta22 commented on A Modern Multi-Core Processor, Slide 55

So does adding registers (memory built into the execution unit), but there is a cost to everything. Faster memory is more expensive, and adding more of it increases the search space, slowing it down, so there are limits to how much of each cache you can add in a given price range.

Atomic operations are still slower than ordinary ones, so avoid them if possible. Sometimes other synchronization methods (__syncthreads, multiple kernels) may be faster.

bmehta22 commented on Transactional Memory, Slide 53

These sets are saved in software (the "S" in STM).

@sanketg Agreed, but if we know that a chip will be used for a particular purpose for a long time, and we can make it in large enough quantities to take advantage of economies of scale, the tradeoffs in power and speed (we use the chip for less time with less energy, which brings further benefits in heat, battery life, etc.) make ASICs the better choice, especially considering that chips are not reused as often as one would wish.

Another reason why arithmetic intensity is a useful heuristic for finding a better algorithm: not only is arithmetic generally easier to parallelize, it is also less costly in terms of energy. So even if we rearrange an algorithm so that it has many more arithmetic instructions for the sake of eliminating a few memory instructions, this is not Goodhart's Law at work; arithmetic intensity is still a useful metric.
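As a back-of-the-envelope check (the SAXPY numbers here are my own illustration, not from the slide):

```cpp
// Arithmetic intensity = arithmetic operations / bytes transferred.
// For SAXPY (y[i] = a * x[i] + y[i]) with 4-byte floats, each element
// needs 2 flops (one multiply, one add) and moves 12 bytes
// (read x[i], read y[i], write y[i]).
double arithmetic_intensity(double flops, double bytes) {
    return flops / bytes;
}
// SAXPY comes out to 2/12, roughly 0.17 flops per byte: heavily
// memory-bound, so trading a few memory ops for extra arithmetic
// can be a win even though total instruction count goes up.
```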

Given that we have a fixed piece of hardware to use on networks of ever-increasing size, it takes more power to run the network, because we have to reuse the same ALUs multiple times: https://twitter.com/ylecun/status/1469690849357410306.

There are lots of challenges involved in efficiency because of the number of filters, trying to maintain numerical stability while using half/single-precision, etc., but using repetitive filters can allow for optimizations that can be seen in VGGNet.

How easy is it to deploy models built with these kinds of software, since GPUs often won't be available to consumers?

Is this why Apple silicon performs so much better than the previous Intel chips did? They have the money to design extremely specialized hardware for their machines.

Have graph libraries like these ever been used to optimize programs in other domains (like how cuda was used for more than just graphics back in the day)?

How is this different than a convolution with a filter size of 3?

Would libraries that operate with very specific optimized data structures also fall under this category?

Are we just assuming the cache blocks it up like this rather than storing data that lives on the same row? If the cache worked that way, we wouldn't want to block it up like this, right?

@12345 if the cache doesn't fit a single row or col, why would it be useful to grab data from the next row(s)?

Is GraphLab still widely used today? What's the relationship between it and the other DSLs like Ligra in terms of how they are used by programmers in practice?

How do you program on an FPGA? Is it like CUDA where you can write specialized code in a library that interacts directly with the hardware?

Does the name have anything to do with Franz Liszt? If so, I wonder why :P

How common are these complex instructions in specialized hardware?

It looks like, because there are "no threads" as Prof. Kunle mentioned in lecture, all we need to do in the code is describe how to wire up the Accel module. This was a bit confusing the first time I saw the loops in Spatial; I wasn't really sure how that worked, but now it makes sense that it's almost like a map over the different inputs, with no sequential thread of control.

It looks like Ligra stores graphs in the same way we used for assign 4 using two adjacency matrices for in/out edges: https://people.csail.mit.edu/jshun/ligra+.pdf

Liszt was actually developed by Stanford affiliates: https://www.researchgate.net/publication/220782958_Liszt_A_Domain_Specific_Language_for_Building_Portable_Mesh-based_PDE_Solvers Pretty cool!

How much work is done by the processor itself to schedule/assign work to these different specialized components?

amohamdy commented on Transactional Memory 2, Slide 28

To add to what @parthiv mentioned, we'd also need a write set and a read set to log the objects referenced by a transaction and do the necessary validation. We'd also need a way to keep a global time stamp that all processors can read/write.

amohamdy commented on Transactional Memory, Slide 54

The tradeoff here is that if we store transaction records on the object level, we might get cases where two processors that are acting on different fields of the same object have to pointlessly check and potentially stall or abort each other since they're not aware that they'd be acting on different parts of the data and can act independently. The cost is obviously a memory overhead that we incur to save transaction records for each of the fields instead of one per object.

I'm curious if there are more formal techniques to resolve livelock other than exponential back-off and random. Waiting a random amount of time seems like it should work, but it adds this unnecessary delay. Can there be a piece of software in the OS that detects and handles livelock situations? What would that look like?

amohamdy commented on Memory Consistency, Slide 47

I was looking into how processors reorder instructions while keeping the program's logic unchanged, and I ran across this StackOverflow answer, which is interesting and links some interesting videos: https://stackoverflow.com/questions/39062100/how-does-cpu-reorder-instructions

@ccheng In contrast, specialized hardware has a different "processing pipeline" than a general purpose CPU. One analogy you could use is with moving cargo. A CPU would be like putting a single package on your car and driving to/from the destination. A piece of hardware specializing in an algorithm like FFTs would be like a train: Very limited flexibility, but incredibly high throughput due to not needing to consider different use cases.

I would say that DSL's aren't quite the same as hardware accelerators and ASICs, though they both use domain-specific knowledge to improve the speed of your program.

@ccheng18 There's nothing inherent to these libraries that make it "more efficient" than code that could be written by hand. The reason they usually outperform code written in the traditional way is that the library has domain-specific optimizations enabled. In other words, it is possible to get the same speedup in C by using your computer's resources efficiently (meaning you use SIMD in cases of low divergence, use the cache intelligently, etc...)

For example, we made a thread pool in Assignment 2 that was subject to some design considerations. If we wanted to make a thread-pool that we can use to parallelize graph algorithms, we could make different decisions that could potentially speed it up more.

What are some other ways to introduce non-linearity?

How are these addresses assigned to the variables in code? And how do the L1/L2/L3 caches work in terms of updating? If a particular variable is stored in each cache, does updating them take the sum of the individual update times, or can they update in sync? Also, does each thread have its own cache?

How do warps and blocks relate? My understanding is that we have a core that runs x threads in x/block_size blocks, and within each block there are block_size/32 warps. But how do these warps communicate with each other within the block, if at all? And are all these warps and blocks sharing execution contexts? At what level (warp, block, core) do the contexts live to be shared?

ccheng18 commented on Data-Parallel Thinking, Slide 16

On the fourth line from the bottom, why do we assign the 7th index as 0? And what are the steps to even begin thinking about these kinds of solutions? Do they just come with experience?

Do we create a new RDD with every function call?

ccheng18 commented on Cache Coherence, Slide 12

I understand that addresses are assigned to cache lines based on their address number and the size of the cache, but what confuses me is how variables are assigned these addresses. Particularly those that are initialized one after another.

ccheng18 commented on Memory Consistency, Slide 41

How does relaxing W-R ordering help hide write latency? It means we can read before a prior write reaches memory, so we can go on with our reads without waiting for stores to finish, which is how the latency is hidden. But what about the case where there are many stores and many reads of the same variable? Wouldn't this be super prone to error?

Are the MSI and MESI transitions those most commonly used by caches? I've learned about the MSI protocol in the past, but I was wondering if you had any suggestion on where to learn about the most recently used / most popular and important protocols.

ccheng18 commented on Transactional Memory, Slide 7

Which type of abstractions is most common in the workforce?

ccheng18 commented on Transactional Memory 2, Slide 9

So is it generally better not to use pessimistic writes? It seems this case would happen often.

I once read about using larger CPUs alongside smaller ones in the same chip. Why would we design chips this way? Wouldn't it be advantageous to just use a larger, more capable processor?

What is the difference between performance and productivity here? I would assume they go hand in hand?

So Ligra and other very specialized programming languages, like specialized hardware, are considered more energy efficient? How is this implemented to be the case? Would we not be able to get the same speedup by programming in parallel in C or OMP?

Are all of these considered programmable chips? Or are processors considered part of a chip, or are these all CPUs/GPUs?

What is the point of getting these gradients? I don't really see where the edge detection comes into play?

ccheng18 commented on Parallel Programming Basics, Slide 18

How are memory buses implemented? Is the MSI versioning an example of one?

How is speed up calculated if we increase the bandwidth? Particularly if the processor becomes arithmetic bound prior to utilizing the max speed up given by the increased bandwidth.

ccheng18 commented on A Modern Multi-Core Processor, Slide 68

These threads are executed on a multicore processor, right? A processor has a number of ALUs and processing units that are shared by the cores, and the cores hold the execution contexts that represent each thread, right?

Each core can run one thread, unless it is a hyperthreaded core, in which case it can run up to two threads with two instruction contexts.

This Catapult project is apparently very different from the Catapult HLS used for Verification. Project Catapult combines programmable hardware and software that uses field-programmable gate arrays (FPGAs) to deliver performance improvements.

I think there's a typo: it's Liszt, not Lizst.

Just to cite the AMD website, which defines HBM and mentions many of the details on this slide: https://www.amd.com/en/technologies/hbm

Had a chance to implement a rasterizer in HLS in another course. The sacrifice in performance and the difficulty of use make total sense.

Had an opportunity to learn this in a machine learning class. Correlating it to the concepts taught there makes total sense.

amohamdy commented on Cache Coherence, Slide 40

@stao18 It looks like there are a bunch of other cache coherence protocols that have certain advantages over both MSI and MESI. My guess is that different processors implement different ones, but I'm curious what a general-purpose Intel or AMD x86 chip implements too. https://en.wikipedia.org/wiki/Cache_coherency_protocols_(examples)

This is yet another example of how changing the thinking model itself allowed for greater performance: operations are split into transformations and actions, where only the actions run when called, while the transformations are simply registered until an action is performed.
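A toy sketch of that split (not Spark's actual API): a LazySeq that records map transformations as composed functions and only executes them when a collect action is called:

```cpp
#include <functional>
#include <vector>

// Transformations like `map` are only *recorded* as composed
// functions; nothing runs until the action `collect` is called
// on concrete data.
struct LazySeq {
    std::function<int(int)> pipeline = [](int x) { return x; };

    LazySeq map(std::function<int(int)> f) const {
        LazySeq next;
        auto prev = pipeline;
        next.pipeline = [prev, f](int x) { return f(prev(x)); };
        return next;  // just registers the transformation
    }

    // The "action": only here does the pipeline actually execute.
    std::vector<int> collect(const std::vector<int>& data) const {
        std::vector<int> out;
        out.reserve(data.size());
        for (int x : data) out.push_back(pipeline(x));
        return out;
    }
};
```

Because nothing runs until collect, the system is free to fuse the two maps into one pass over the data, which is exactly the optimization opportunity lazy evaluation buys.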

amohamdy commented on Data-Parallel Thinking, Slide 49

In a way the Data Parallel operations form a DSL with the domain of problems with less complicated, often single-level, dependencies.

I know that trying to get a function like __syncthreads() to work across different blocks would introduce deadlock situations, but I'm curious if it would be helpful for some applications to have some CUDA sync primitives across blocks, as long as either the CUDA compiler or the programmer can guarantee that at least one block stays free so that useful work can be done, or perhaps some primitives for priority scheduling in the block scheduler. I'm not sure if this would even be possible in CUDA.

The first time we were introduced to this concept I thought "clearly the shared address space is a better model," and, with the exception of having your resources already split (perhaps you have a few different devices, each with its own memory), there's no reason to use MPM over SAM. But I'm curious how the cache coherence, memory consistency, and transactional memory protocols affect this. I think they might make a combination that includes MPM more effective than having all the resources share the same address space, as, I suppose, there would be some kind of waiting queue that can take care of some of the synchronization that needs to happen, instead of relying on the mechanisms in main memory and the caches.

As @alrcdilaliaou said, Electron is definitely inefficient, but easy to use, and therefore common. Even poor tools can be optimized, though: Visual Studio Code uses Electron but is relatively efficient.

bmehta22 commented on Cache Coherence, Slide 4

What happens if the master node crashes?

This was made to deal with the size of the Netflix Prize dataset efficiently!

Sambanova was co-founded by Professor Kunle!

Might it be better to have different hardware for different types of layers?

There are increasing tradeoffs between speed and accuracy here; however, a drop in accuracy may act like adding noise to the input and make the final output more robust, so it may ultimately be better.

These are often called half-precision (short), single-precision (float) and double precision (double).
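A quick illustration of what the narrower formats give up (a sketch; fp16 is not a built-in C++ type, so only float and double are shown):

```cpp
// fp32 (single precision) has a 24-bit significand, so the integer
// 2^24 + 1 = 16777217 is not representable and rounds to 2^24, while
// fp64 (double precision, 53-bit significand) stores it exactly.
// fp16 (half precision) has an even smaller 11-bit significand.
const float  f32 = 16777217.0f;  // rounds to 16777216.0f
const double f64 = 16777217.0;   // exact
```

This kind of silent rounding is why maintaining numerical stability in half/single precision takes care.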

Having the ability to specify a cache is important, but having one with the right specifications (size, cache line size, etc.) is even more important. Spatial is allowing for specifying cache size/line size, etc. here.

bmehta22 commented on Data-Parallel Thinking, Slide 25

This 'scan' method may seem inefficient until you consider the number of elements such algorithms are used on and how much parallelism it allows for.

Even if our solution required many more arithmetic instructions for the sake of reducing memory instructions, thereby increasing the total instruction count, arithmetic is faster and more energy-efficient than memory access, so arithmetic intensity is a good heuristic for good moves.

It could be either, but it makes sense to have three different threads only if there are three execution contexts, since each can then work in parallel; there is no reason to create more threads than execution contexts and increase overhead.

bmehta22 commented on Parallel Programming Basics, Slide 62

Gauss-Seidel is itself an optimization to speed up convergence by reducing the number of iterations needed, so one must be careful that increasing parallelism does not come at the cost of so many extra iterations that the tradeoff isn't worth it.

"Side-effect free" - big advantage of functional programming

As someone from an AI background, I didn't previously understand why pytorch/tensorflow convert NCHW inputs to NHWC before applying various image operations, but this makes total sense. I find it fascinating how well-optimized these libraries are (especially having tried to implement my own vectorized convolutions before and still ending up considerably slower).

In a pipelined schedule, you have different stages for different computations. The longer the pipeline, the greater the parallelism. As noted, "Closest thing to free lunch in hardware." -John Hennessy

This can be great for parallelism, as tasks specific to certain areas of memory may be relegated to the specific CPUs that can access that block of memory. This will allow for increased optimization and less competition for memory, because there may be less need for message passing.

However, as soon as CPUs need memory from blocks of memory owned by other CPUs, there will be increased message passing, higher latency and synchronization will be slow, so this could be worse for parallelism.

A great way to think about filters or convolutions is that they simply add another channel (or dimension) to an image, as in RGB, so it follows that you could have more channels.

The classic model of the CPU is a five-stage RISC/MIPS CPU (although nowadays more stages are involved to help with throughput and allow for higher clock speed, though there are diminishing returns on adding stages).

The first two stages are instruction fetch and decode. The next stage is execute. The two final stages are memory and register write-back (reading/writing from cache or memory). Stalls are necessarily added when reading from memory (unless other work can be done in the interim and the CPU supports out-of-order execution), because memory is slower than cache.

The effects of compilation, especially with flags like -O3, may complicate how effective one's parallelization techniques end up being, because the compiler is already making its own optimizations, and trying to prematurely optimize in code may in fact hinder the compiler's efforts. The optimal strategy is to "expose" as much parallelism as possible, like writing vectorizable loops when possible, so that when the compiler attempts its own optimizations, they help your efforts rather than hinder them.

What is the requirement for proving the language's correctness? Presumably the bar would be higher than for other programming languages?

FPGAs represent a middle ground between a general-purpose processor and an Application-Specific Integrated Circuit: they can encode a specific circuit (say, matrix multiply), but not permanently (they can be repurposed).

From the equation and the code, it seems like alpha can be toggled to increase, or decrease, how much one's rank is determined by its neighbors, and thus how much variance there can be across the different ranks.

@gomi On a previous page, someone mentioned that Spatial compiles into Chisel which compiles into Verilog, so it may be very likely that there's a strong connection between the Accel block and Verilog.

It seems like these filters as represented in the middle can be useful for detecting features, like edges, shapes, or other spatial patterns in the input image.

I think this idea is especially important for deploying machine learning models on edge devices, which can have a wide variety of different parallel systems, CPUs, and GPUs.

If you think of an automatic scheduler as a "pre-compiler" or part of the compiler, then it's remarkable that the "compiler" is doing so much work, even writing this gray region for you!!

lordpancake commented on Heterogeneous Parallel Processing, Slide 7

By the end of this example, there's a race condition whereby C can be either 4 or 7, but it seems that by the rules, or by our choice in programming, we are okay with that.

Predictable access patterns: Data accesses follow a pattern (so you can pre-fetch things, which should lead to future cache hits).

I don't think T4 would've been aborted earlier, because neither a pessimistic read nor a pessimistic write policy would have intersecting operations (from the perspective of T4)

lordpancake commented on Transactional Memory 2, Slide 19

@narwhale, yes, or rather the local timestamp is a saved copy of the global timestamp that represents the state version the thread thinks it's acting on. So if the memory addresses the thread uses still have an old timestamp, then it has permission to make changes on those addresses, and it is okay to ignore whatever the greater world around the thread is doing.

gsamp commented on Transactional Memory 2, Slide 41

@pizza, I believe that either the cache line should be moved to the invalid state or it should be wiped out. We can't use the information in the cache since there were conflicts and we had to abort.

gsamp commented on Transactional Memory 2, Slide 37

Read and Write bits are kept per-transaction.

gsamp commented on Transactional Memory 2, Slide 19

@narwhale, I think this is the case. Global = "what does everyone else think the version of this object is?" Local = "what is the state this object was when I first opened it for read/write?" And then when committing, "do these two agree?"

gsamp commented on Transactional Memory 2, Slide 18

@narwhale, I think so, but this does not have to do with it being eager or lazy. Eager or lazy has to do with how the system resets the state when it needs to restart the transaction. Eager writes to memory immediately, saving the old value in a log; lazy writes to a write buffer, which is committed at the end (or not).
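A toy sketch of the two policies (not any real TM system's code; `Mem` stands in for memory keyed by address):

```cpp
#include <unordered_map>

using Mem = std::unordered_map<int, int>;

// Eager versioning: write in place immediately, keep an undo log of
// old values so an abort can restore them. Commit is cheap.
struct EagerTx {
    Mem& mem;
    Mem undo_log;
    void write(int addr, int val) {
        if (!undo_log.count(addr)) undo_log[addr] = mem[addr];
        mem[addr] = val;                 // visible right away
    }
    void abort()  { for (auto& [a, v] : undo_log) mem[a] = v; }
    void commit() { undo_log.clear(); }  // writes already in place
};

// Lazy versioning: buffer writes privately; memory is untouched until
// commit applies the buffer. Abort is cheap.
struct LazyTx {
    Mem& mem;
    Mem write_buf;
    void write(int addr, int val) { write_buf[addr] = val; }
    void abort()  { write_buf.clear(); }
    void commit() { for (auto& [a, v] : write_buf) mem[a] = v; }
};
```

The asymmetry is the point: eager makes commit fast and abort slow, lazy makes abort fast and commit slow, so the right choice depends on how often transactions are expected to abort.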

gsamp commented on Transactional Memory, Slide 60

@martigp and @apappu, we stall because we wanted to run a read operation followed by a write (we wait until the write is done and then read the updated value).

The contract here is that we think about programs in a declarative, mathematical way, and the DSL (e.g. GraphLab) takes care of the rest. In some sense, we would think in logic and math (e.g. latex), not code.

DSLs invent language so that we may do what we do best as humans.

I am thinking about it as "Abstractions with gears." Imagining compute blended into the edges of these abstractions (and also within them).

chenyecharchar commented on Programming for Hardware Specialization, Slide 9

Typically in specialized hardware there is no threading (so no atomics, barriers, etc.).

What's the difference between an embedded DSL and an ordinary library package?

(Sorry, I meant *Chisel for the edge TPU…)

What does it take to get a language like Spatial actually used in practice, at places like Google (E.g. for the edge TPU)?