I believe we need 3 threads in this case for 100% utilization, since (12+6)/6 = 3. With a higher ratio of math to memory latency, we need fewer threads to hide the latency, since each thread spends more of its time doing useful work on the ALUs and less of it waiting on a load from memory.
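To make the arithmetic concrete, here is a rough sketch in Python. It assumes an idealized model where each thread alternates between a fixed run of math cycles and a fixed memory stall, and where switching threads is free. I'm reading the 6 as the math cycles and the 12 as the stall cycles, which is my interpretation of the numbers above; the function name is made up for illustration.

```python
def threads_for_full_utilization(math_cycles: int, stall_cycles: int) -> int:
    """Smallest thread count that keeps the ALUs busy every cycle.

    Idealized model (an assumption): every thread repeats `math_cycles`
    of ALU work followed by `stall_cycles` of waiting on memory, and
    context switches cost nothing.
    """
    if math_cycles <= 0 or stall_cycles < 0:
        raise ValueError("math_cycles must be positive, stall_cycles non-negative")
    period = math_cycles + stall_cycles
    # Each thread uses the ALUs for math_cycles out of every period,
    # so we need ceil(period / math_cycles) threads to fill the period.
    return (period + math_cycles - 1) // math_cycles


if __name__ == "__main__":
    print(threads_for_full_utilization(6, 12))   # the case above: (12+6)/6
    print(threads_for_full_utilization(12, 12))  # more math per stall
```

Under this model the thread count is 1 + stall/math rounded up, so doubling the math per stall drops the requirement from 3 threads to 2.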