I think the way we increase efficiency through parallelism in this program (rather than having every parallel instance execute the same code, i.e. duplicating our work) is by building into the structure of the code our knowledge that each iteration of the (sequential version of the) outer for-loop is independent.
Defining idx as a non-uniform value is what allows us to increment the outer loop by programCount rather than by 1.
As the programmers, we know the iterations of the inner loop are not independent (because value depends on the previous iterations of the inner loop), so we must simply increment j by 1, as we would in the sequential program.
I guess both i and j are uniform because they must be processed by every program instance. The difference is that i is incremented by programCount (and then combined with programIndex to give each instance a different idx), whereas j is incremented by 1 (and denotes the same jth term for every program instance).
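For concreteness, here is a sketch of the kind of ISPC function I believe this thread is discussing (a sinx-style Taylor-series example; names like terms are my assumption, and the actual lecture code may differ in details):

```
export void sinx(uniform int N, uniform int terms,
                 uniform float x[], uniform float result[])
{
    // Assumes N is a multiple of programCount.
    // Outer loop: uniform i advances by programCount because we know
    // (as programmers) that the per-element iterations are independent.
    for (uniform int i = 0; i < N; i += programCount)
    {
        // Non-uniform (varying) idx: each of the programCount instances
        // in the gang works on a different array element.
        int idx = i + programIndex;
        float value = x[idx];
        float numer = x[idx] * x[idx] * x[idx];
        uniform int denom = 6;  // 3!
        uniform int sign = -1;

        // Inner loop: uniform j advances by 1 because each term of the
        // series depends on the previous one (value is a running sum).
        for (uniform int j = 1; j <= terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[idx] * x[idx];
            denom *= (2 * j + 2) * (2 * j + 3);
            sign *= -1;
        }
        result[idx] = value;
    }
}
```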
At an abstract level, a single program instance executes just one instruction of the program at a time, serially and in order.
I am still having a hard time wrapping my head around ISPC tasks and gangs of program instances. Is it too much of an oversimplification to say that ISPC gangs take advantage of SIMD execution, while tasks leverage multi-core parallelism?
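For reference, a task-based version of the same kind of function looks roughly like this (function names, chunk size, and task count are my own choices, not from the assignment): within a task the gang still executes in SIMD, while the launch-ed tasks give the runtime independent pieces of work to spread across cores.

```
// Each launched task may be scheduled on a different core by the
// ISPC runtime; inside a task, the gang still executes in SIMD.
task void sinx_chunk(uniform int span, uniform int terms,
                     uniform float x[], uniform float result[])
{
    uniform int start = taskIndex * span;   // this task's chunk of the array
    for (uniform int i = start; i < start + span; i += programCount)
    {
        int idx = i + programIndex;         // SIMD parallelism within the task
        float value = x[idx];
        // ... same Taylor-series inner loop as the non-task version ...
        result[idx] = value;
    }
}

export void sinx_withtasks(uniform int N, uniform int terms,
                           uniform float x[], uniform float result[])
{
    uniform int nTasks = 64;                // assumed task count; N % 64 == 0
    launch[nTasks] sinx_chunk(N / nTasks, terms, x, result);
}
```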
Does ISPC do anything to combat low utilization of vector instructions?
We talked about interleaving for SIMD instructions (ISPC non-task) as the best way to reap the benefit of spatial locality in memory-intensive programs (reading contiguous chunks is more efficient). What's the best equivalent memory access pattern for tasks? I find it hard to see what the most efficient memory access pattern is when we have M tasks, each performing N-wide SIMD instructions. Anyone have ideas?
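My intuition (happy to be corrected): keep the interleaved assignment at the gang level, so the programCount lanes of each vector load touch contiguous addresses, and use a blocked assignment at the task level, so each core streams through its own contiguous region instead of contending with other cores for the same cache lines. A sketch (the name and the per-element work are made up):

```
task void process_block(uniform int span, uniform float data[])
{
    // Blocked across tasks: task t owns elements [t*span, (t+1)*span).
    uniform int base = taskIndex * span;
    // Interleaved within the gang: foreach gives lane p the element
    // base + k + p, so each gang-wide read is one contiguous vector load.
    foreach (i = base ... base + span)
    {
        data[i] = 2.0f * data[i];  // hypothetical per-element work
    }
}
```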
Still kind of confused about why there is no parallelism within an ISPC program instance. Although each instruction seems to depend on the previous one, why couldn't the work for the different idx values be executed in parallel? (Would greatly appreciate someone answering my very basic question)
@kristinayige. You are observing that the outermost loop could be parallelized within a program instance because its iterations are independent. That is correct.
However, this discussion was about exactly how ISPC parallelizes the code. The ISPC compiler does not attempt to analyze the ISPC function to discover parallelism for the programmer; that would be challenging, since it's basically general-purpose C code. Instead, it yields SIMD code by always executing a gang of program instances simultaneously, mapping that gang onto SIMD instructions. The logic expressed in the code above is carried out sequentially by each instance in the gang.
Consider the inner block of code, which is where most of the arithmetic happens. Because both i and j are uniform, we can imagine (i, j) as being fixed over some interval of time (the same across all program instances). During this time, idx has 8 different values, one unique to each of the 8 program instances, and those instances all run the code in the block in parallel using SIMD instructions. Zooming out to the entire program, it's clear that an isolated program instance does not execute any of the loops in parallel.
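To make that instant-in-time view concrete, here is a minimal, self-contained sketch (a made-up scale-style function, assuming an 8-wide gang):

```
export void scale(uniform int N, uniform float a[], uniform float b[])
{
    // Assumes N is a multiple of programCount (8 here).
    for (uniform int i = 0; i < N; i += programCount)
    {
        // At the instant when i == 16, idx holds the varying values
        // (16, 17, ..., 23): one per program instance in the gang.
        int idx = i + programIndex;
        // Each statement below maps to one 8-wide SIMD instruction
        // covering all 8 lanes at once, but any single program
        // instance executes the statements strictly one after another.
        float v = a[idx];      // one 8-wide vector load
        v = 2.0f * v + 1.0f;   // one 8-wide multiply-add
        b[idx] = v;            // one 8-wide vector store
    }
}
```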
I find the idea of the nested for loop in the ISPC function hard to wrap my head around. On the one hand, I can see that if one program instance is running the entire function, it can run a for loop inside. But this for loop also seems to have a uniform parameter controlling it. To me, that would imply that this program instance is spawning other gangs, which doesn't make sense to me. Why would the int j have to be uniform?
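Tying this back to the explanation above: uniform doesn't spawn anything; it just declares a single value shared by the whole gang, so the one existing gang steps through the inner loop's iterations together. A quick illustrative contrast (made-up function):

```
export void sum_columns(uniform int terms, uniform float x[], uniform float out[])
{
    float acc = 0.0f;
    // uniform j: one shared counter; the existing gang executes each
    // iteration together. No new gang or task is created.
    for (uniform int j = 0; j < terms; j++)
        acc += x[j * programCount + programIndex];
    // If j were a plain (varying) int, every instance would carry its
    // own copy of j holding identical values: same result, but the
    // compiler would have to keep a vector counter and mask the loop
    // exit per lane. uniform is simply the cheaper, accurate choice.
    out[programIndex] = acc;
}
```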