Is the idea here that you first construct a matrix of size N x W x H (N filter elements, W x H image) that we later read from during the multiplication? I guess building it would cost O(NWH), which is O(N) if W and H are treated as constants.
It seems like this requires each pixel's data to be stored N times (for a filter with N elements). I wonder if there's a way to order the accesses so that we get a lot of cache hits: pixel X's data is already in the cache from the previous dot product, so we wouldn't have to fetch it again from the second place it's stored in the new input matrix. That would reduce the number of elements that need to be fetched from memory by a factor of N, but I'm not sure yet whether a viable implementation of this exists.
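To make the question concrete, here is a minimal pure-Python sketch of the two approaches as I understand them. The function names (`im2col`, `conv_via_im2col`, `conv_direct`) are my own, and I'm assuming a square k x k filter with no padding, so only interior pixels get copied the full N = k*k times. It is meant to show the duplication and that both paths give the same answer, not to say anything about real cache behavior.

```python
def im2col(image, k):
    """Copy every k x k window of `image` into one row of a new matrix.

    Assumes no padding, so the result has (H-k+1)*(W-k+1) rows of k*k values.
    """
    h, w = len(image), len(image[0])
    rows = []
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            rows.append([image[y + dy][x + dx]
                         for dy in range(k) for dx in range(k)])
    return rows


def conv_via_im2col(image, kernel):
    """One dot product per row of the expanded matrix."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    return [sum(a * b for a, b in zip(row, flat_kernel))
            for row in im2col(image, k)]


def conv_direct(image, kernel):
    """Same result, but each pixel is read from its single original location."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            acc = 0
            for dy in range(k):
                for dx in range(k):
                    acc += image[y + dy][x + dx] * kernel[dy][dx]
            out.append(acc)
    return out


# Hypothetical 5x5 image with unique pixel values, so copies can be counted.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

expanded = im2col(image, 3)
copies_of_center = sum(row.count(image[2][2]) for row in expanded)
print("copies of the center pixel in the expanded matrix:", copies_of_center)
print("same output:", conv_via_im2col(image, kernel) == conv_direct(image, kernel))
```

In `conv_direct`, neighboring windows overlap in the original image, which is the reuse the question is about; whether that beats one large matrix multiply in practice depends on the hardware and the matrix-multiply library, and this sketch does not measure it.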