To get better performance from the parallel merging code, we will use a technique called blocking. Rather than transferring the whole input and output arrays in one shot, we split the arrays into blocks that can be transferred and operated on in parallel. The following diagram demonstrates creating blocks and overlapping data transfers with kernel execution:
The preceding diagram shows that the blocks are transferred separately and that the kernel execution for each block is independent of the other blocks. For this to happen, the data transfer commands and kernel calls must be issued and executed asynchronously. To achieve blocking, we will introduce more directives and clauses in this section: the structured and unstructured data directives and the async clause. We will showcase...
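As a rough illustration of the blocking idea described above, the following is a minimal sketch rather than the chapter's actual implementation. The function name `merge_blocked`, the element-wise averaging used as the merge step, and the `NUM_QUEUES` constant are all assumptions made for this example. It uses the unstructured data directives (`enter data` and `exit data`) to allocate device memory once, then issues a host-to-device update, a kernel, and a device-to-host update for each block on an `async` queue, so that work on different queues may overlap:

```c
#include <assert.h>

#define NUM_QUEUES 3  /* assumed number of async queues to cycle through */

/* Hypothetical merge step: out[i] is the average of in1[i] and in2[i].
 * The arrays are processed in nblocks chunks so that the transfers for
 * one block can overlap with the kernel of another. */
void merge_blocked(const float *in1, const float *in2, float *out,
                   int n, int nblocks)
{
    assert(n >= 0 && nblocks > 0);

    /* Ceiling division so that the blocks cover all n elements. */
    int block_size = (n + nblocks - 1) / nblocks;

    /* Unstructured data region: allocate on the device without copying. */
    #pragma acc enter data create(in1[0:n], in2[0:n], out[0:n])

    for (int b = 0; b < nblocks; b++) {
        int start = b * block_size;
        int end = start + block_size;
        if (end > n) {
            end = n;
        }
        int len = end - start;
        if (len <= 0) {
            break;  /* nothing left to process */
        }

        /* Operations on the same queue run in order;
         * operations on different queues may overlap. */
        int q = b % NUM_QUEUES + 1;

        /* Copy this block's inputs to the device. */
        #pragma acc update device(in1[start:len], in2[start:len]) async(q)

        /* Launch the kernel for this block only. */
        #pragma acc parallel loop present(in1[0:n], in2[0:n], out[0:n]) async(q)
        for (int i = start; i < end; i++) {
            out[i] = 0.5f * (in1[i] + in2[i]);
        }

        /* Copy this block's result back to the host. */
        #pragma acc update self(out[start:len]) async(q)
    }

    /* Block the host until every queue has finished. */
    #pragma acc wait

    #pragma acc exit data delete(in1[0:n], in2[0:n], out[0:n])
}
```

A call such as `merge_blocked(a, b, c, n, 8)` would process the arrays in eight blocks. When the file is compiled without OpenACC support, the pragmas are ignored and the function runs serially on the host, which is a convenient way to check the block index arithmetic before measuring any overlap on a GPU.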