Wouldn't Blocked Assignment be better from a memory traffic standpoint? Each middle point is computed from the 4 corners, which have already been computed. If those results are stored in process-local memory, the processor computing the middle point only needs to fetch from local memory. In case 2, the corner points are stored in different processors' memories, so computing a middle point requires much more inter-process communication and uses interconnect bandwidth to fetch from non-local storage. That inter-process communication also adds latency. So why is the answer "it depends on the system...."?
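To make the concern concrete, here is a minimal sketch that counts how many input reads would cross a process boundary under each assignment. It rests on assumptions not stated above: the grid is square, rows are the unit of assignment, each point reads its four adjacent points, and a read is "remote" whenever the row it touches belongs to a different process. The function names (`owner_blocked`, `owner_interleaved`, `remote_reads`) are hypothetical and only for illustration.

```python
def owner_blocked(row, n_rows, n_procs):
    """Assumed blocked assignment: each process owns a contiguous band of rows."""
    rows_per_proc = -(-n_rows // n_procs)  # ceiling division
    return row // rows_per_proc


def owner_interleaved(row, n_rows, n_procs):
    """Assumed interleaved assignment: rows are dealt out round-robin."""
    return row % n_procs


def remote_reads(n, n_procs, owner):
    """Count reads of adjacent points whose row is owned by another process."""
    count = 0
    for i in range(n):
        for j in range(n):
            # Assumed stencil: the four adjacent points (up, down, left, right).
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < n and 0 <= nj < n:
                    if owner(ni, n, n_procs) != owner(i, n, n_procs):
                        count += 1
    return count


if __name__ == "__main__":
    n, p = 8, 4
    print("blocked:    ", remote_reads(n, p, owner_blocked))
    print("interleaved:", remote_reads(n, p, owner_interleaved))
```

Under these assumptions, an 8x8 grid split across 4 processes gives 48 remote reads for the blocked case and 112 for the interleaved case. The sketch only counts reads; it says nothing about what each read costs on a particular machine.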