The buffering here is needed because of the pipelining. Pipelining basically separates a large circuit into smaller stages so you can get higher throughput. However, you would need some ways of storing the temporary data at each stage until the next step is ready to process the data.
pipeline == free_lunch