I believe the R and W bits are enough to disambiguate?
Here we are mixing cache coherence and transactional memory concepts. But in cache coherence the bus transactions are visible to every processor's cache. So the Rd, RdX, WB requests of one transaction are seen by other transactions. Doesn't this violate the isolation property of TM?
My above question is answered in one of the next slides by Prof. Kunle. For example, the exclusive state is given to a processor for write operation only during the commit if the transaction does not get aborted due to conflicts with other on-going or pending transactions.
It' brilliant to see the use of a single bit to designate the read and write sets. However, it seems like the checking of these bits may be a little tedious. Maybe I missed this in class but how is the checking actually done? Is there a list that we store all of these writes and reads in?
Why is HTM so much faster than STM? Wouldn't an STM scheme using an object/struct with 3 bits for M, R, and W be somewhat equivalent to what's described above?
I believe that in almost all cases an optimized Hardware system will beat a Software system which is why building custom hardware for the task could theoretically provide 1000x of speedups over an optimized software system on a CPU. The reason for that is all software is limited by hardware but hardware isn't (necessarily?) limited by software. For example when you define a C struct the smallest granularity you'll achieve is 1 byte (so 8 bits). Doesn't matter if you're storing 1 bit of that byte or full 8. You can't save space/bandwidth more than a HW system allows you. but a hardware system's implementation of TM will define all parameters to be as efficient as possible for the task at hand. If you only need 1 bit the HW will only use 1 bit, if you need to double the bandwidth then it's as easy as changing a wire in your design. That level of design customization allows for a much higher efficiency but is of course much more costly and complicated than a software system.
@jchao1 Expanding on what @AnonyMouse said, the fact that these are all hardware paths makes it orders of magnitude more efficient. STM incurs a lot of overhead over HTM just because it runs on top of the CPU, rather than as part of it: instructions have to be loaded from memory, it has to run through the pipelines, etc., etc. Because it runs through the CPU's general purpose circuits it just can't be as fast as the dedicated paths in the cache hardware. This is part of the heterogeneous computing ideas that Kayvon talked about.
Please log in to leave a comment.
Does the "MESI state bit for line" actually need to be 2 bits (to represent 4 states, M, E, S, and I) or can we get away with 1 bit?