## GPU Programming Principles
## Learning Objectives
* Learn about coalesced global memory access for performance
* Learn about local memory and how to use it
#### Coalesced global memory
* Reading from and writing to global memory is generally very expensive.
* It often involves copying data across an off-chip bus.
* This means you generally want to avoid unnecessary accesses.
* Memory access operations are performed in chunks (memory transactions).
* This means accessing data that is physically close together in memory is more efficient, since neighbouring accesses can be combined into fewer transactions (see the sketch below).
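As a rough sketch (not taken from the exercise code; names such as `inBuf`, `outBuf` and `stride` are illustrative), the kernel below performs a coalesced copy, with an uncoalesced strided alternative left in a comment:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t N = 1024;
  [[maybe_unused]] constexpr size_t stride = 32;  // hypothetical bad stride

  std::vector<int> input(N, 1), output(N, 0);

  sycl::queue q;
  {
    sycl::buffer<int, 1> inBuf{input.data(), sycl::range{N}};
    sycl::buffer<int, 1> outBuf{output.data(), sycl::range{N}};

    q.submit([&](sycl::handler& cgh) {
      sycl::accessor in{inBuf, cgh, sycl::read_only};
      sycl::accessor out{outBuf, cgh, sycl::write_only};

      cgh.parallel_for(sycl::range{N}, [=](sycl::id<1> idx) {
        // Coalesced: adjacent work-items read adjacent elements, so their
        // accesses fall into the same memory transactions.
        out[idx] = in[idx];

        // Strided (uncoalesced) alternative: adjacent work-items would read
        // elements `stride` apart, touching many more transactions.
        // out[idx] = in[(idx[0] * stride) % N];
      });
    });
  }
}
```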
#### Row-major vs Column-major
* Coalescing global memory access is particularly important when working in multiple dimensions.
* This is because you have to convert a position in 2D space into an index into linear memory.
* There are two conventions for doing this, generally referred to as row-major and column-major (see the sketch below).
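A minimal sketch of the two mappings for a `rows x cols` matrix (the helper names are illustrative, not part of any particular API):

```cpp
#include <cstddef>

// Row-major: elements of the same row are contiguous in memory.
inline std::size_t rowMajorIndex(std::size_t row, std::size_t col,
                                 std::size_t cols) {
  return row * cols + col;
}

// Column-major: elements of the same column are contiguous in memory.
inline std::size_t colMajorIndex(std::size_t row, std::size_t col,
                                 std::size_t rows) {
  return col * rows + row;
}
```

Which layout gives coalesced access depends on which index varies between adjacent work-items: the index that is multiplied by a matrix dimension produces strided accesses when it is the one that varies.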
#### Cost of accessing global memory
* Global memory is very expensive to access.
* Even with coalesced access, reading the same elements from global memory multiple times is still expensive.
* Instead you want to cache those values in a lower-latency memory.
#### Using local memory

* Local memory is a "manually managed cache", often referred to as scratchpad.
* Local memory is a dedicated on-chip cache, shared per work-group.
* Local memory can be accessed in an uncoalesced fashion without much performance degradation.
#### Tiling

* The iteration space of the kernel function is mapped across multiple work-groups.
* Each work-group has its own allocation of local memory.
* You want to split the input image data into tiles, one for each work-group (see the sketch below).
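A minimal sketch of how an `nd_range` expresses this tiling for an `N x N` iteration space (names such as `launchTiled` and `tileSize` are illustrative; the kernel body only computes indices):

```cpp
#include <sycl/sycl.hpp>

// Sketch: each work-group covers one tileSize x tileSize tile.
// `q`, `N` and `tileSize` are assumed; N must be divisible by tileSize.
void launchTiled(sycl::queue& q, size_t N, size_t tileSize) {
  q.submit([&](sycl::handler& cgh) {
    cgh.parallel_for(
        sycl::nd_range<2>{sycl::range{N, N}, sycl::range{tileSize, tileSize}},
        [=](sycl::nd_item<2> item) {
          // Which tile (work-group) this work-item belongs to.
          size_t tileRow = item.get_group(0);
          size_t tileCol = item.get_group(1);

          // Position within the tile: the index you would use into the
          // work-group's own local memory allocation.
          size_t localRow = item.get_local_id(0);
          size_t localCol = item.get_local_id(1);

          (void)tileRow; (void)tileCol;
          (void)localRow; (void)localCol;
        });
  });
}
```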
#### Local accessors
```cpp
auto scratchpad = sycl::local_accessor<int, dims>(sycl::range{workGroupSize}, cgh);
```
* Local memory is allocated via a `local_accessor`.
* Unlike regular `accessor`s, they are not created from a `buffer`; instead they allocate memory per work-group for the duration of the kernel function.
* The `range` provided is the number of elements of the specified type to allocate per work-group (see the sketch below).
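A minimal sketch of allocating a 2-D tile of local memory inside a command group (the function name, `N` and `tileSize` are illustrative; the kernel body just touches the tile):

```cpp
#include <sycl/sycl.hpp>

// Sketch: tileSize * tileSize floats of local memory per work-group,
// alive for the duration of the kernel function. Assumes N % tileSize == 0.
void allocatesTile(sycl::queue& q, size_t N, size_t tileSize) {
  q.submit([&](sycl::handler& cgh) {
    auto tile =
        sycl::local_accessor<float, 2>(sycl::range{tileSize, tileSize}, cgh);

    cgh.parallel_for(
        sycl::nd_range<2>{sycl::range{N, N}, sycl::range{tileSize, tileSize}},
        [=](sycl::nd_item<2> item) {
          // Index into the tile using the work-item's local id.
          tile[item.get_local_id(0)][item.get_local_id(1)] = 0.0f;
        });
  });
}
```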
#### Synchronization

* Local memory can be used to share partial results between work-items.
* When doing so it's important to synchronize between writes and reads to memory, to ensure all work-items have reached the same point in the program.
* Remember that work-items within a work-group are not guaranteed to execute in lockstep.
* A work-item can share results with other work-items via local (or global) memory.
* This means it's possible for a work-item to read a result that hasn't been written yet, which creates a data race.
* This problem can be solved with a synchronization primitive called a work-group barrier.
* When a work-group barrier is inserted, work-items wait until all work-items in the work-group have reached that point; only then can any work-item in the work-group continue execution.
* So now you can be sure that all of the results you want to read have been written.
* However, note that this does not apply across work-group boundaries: if you write in a work-item of one work-group and then read it in a work-item of another work-group, you again have a data race.
* Furthermore, remember that work-items can only access their own work-group's local memory, not that of any other work-group.
#### `group_barrier`
```cpp
sycl::group_barrier(item.get_group());
```
* Work-group barriers can be invoked by calling `group_barrier` and passing a `group` object.
* You can retrieve a `group` object representing the current work-group by calling `get_group` on an `nd_item`.
* Note that this requires the `nd_range` variant of `parallel_for` (see the sketch below).
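A minimal sketch of the write/barrier/read pattern (names such as `neighbourExchange` are illustrative): each work-item publishes its value to local memory, the group synchronizes, then each work-item safely reads a neighbour's value.

```cpp
#include <sycl/sycl.hpp>

// Sketch: share values between work-items of the same work-group through
// local memory, with a group_barrier between the write and the read.
// `q`, `globalSize` and `workGroupSize` are assumed.
void neighbourExchange(sycl::queue& q, size_t globalSize, size_t workGroupSize) {
  q.submit([&](sycl::handler& cgh) {
    auto scratchpad =
        sycl::local_accessor<int, 1>(sycl::range{workGroupSize}, cgh);

    cgh.parallel_for(
        sycl::nd_range<1>{sycl::range{globalSize}, sycl::range{workGroupSize}},
        [=](sycl::nd_item<1> item) {
          size_t localId = item.get_local_id(0);

          // Write: each work-item publishes its own value.
          scratchpad[localId] = static_cast<int>(item.get_global_id(0));

          // Wait until every work-item in the work-group has written.
          sycl::group_barrier(item.get_group());

          // Read: now safe to read a value written by another work-item.
          size_t neighbour = (localId + 1) % item.get_local_range(0);
          int fromNeighbour = scratchpad[neighbour];
          (void)fromNeighbour;
        });
  });
}
```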
#### Matrix Transpose

* In the next exercise we will transpose a matrix.
* Reading naively from global memory and writing naively back to global memory gives poor performance.
* This is because at least one of the memory transactions will be uncoalesced: adjacent work-items read a contiguous block from memory, but write into the output array in a strided fashion.
* Using scratchpad memory allows us to make the uncoalesced loads or stores into local memory instead of global memory.
* Uncoalesced local memory transactions are much less detrimental to performance than uncoalesced global memory transactions.
* With this approach, global memory loads and stores are both coalesced: adjacent work-items read and write contiguous blocks (see the sketch below).
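Below is a minimal sketch of this tiled transpose pattern (illustrative, not the exercise solution; it assumes a square row-major `N x N` matrix with `N` divisible by `tileSize`): each work-group loads one tile with coalesced reads, synchronizes, then writes the transposed tile back with coalesced writes, keeping the strided access inside local memory.

```cpp
#include <sycl/sycl.hpp>

// Sketch of a tiled matrix transpose: out[j][i] = in[i][j].
void transposeTiled(sycl::queue& q, sycl::buffer<float, 2>& in,
                    sycl::buffer<float, 2>& out, size_t N, size_t tileSize) {
  q.submit([&](sycl::handler& cgh) {
    sycl::accessor inAcc{in, cgh, sycl::read_only};
    sycl::accessor outAcc{out, cgh, sycl::write_only, sycl::no_init};

    // One tile of local memory per work-group.
    auto tile =
        sycl::local_accessor<float, 2>(sycl::range{tileSize, tileSize}, cgh);

    cgh.parallel_for(
        sycl::nd_range<2>{sycl::range{N, N}, sycl::range{tileSize, tileSize}},
        [=](sycl::nd_item<2> item) {
          size_t localRow = item.get_local_id(0);
          size_t localCol = item.get_local_id(1);
          size_t groupRow = item.get_group(0) * item.get_local_range(0);
          size_t groupCol = item.get_group(1) * item.get_local_range(1);

          // Coalesced read from global memory into the local tile:
          // adjacent work-items read adjacent columns of `in`.
          tile[localRow][localCol] =
              inAcc[groupRow + localRow][groupCol + localCol];

          // Make sure the whole tile has been loaded before anyone reads it.
          sycl::group_barrier(item.get_group());

          // The swapped tile indices are an uncoalesced access, but only
          // into local memory; the global write stays coalesced because
          // adjacent work-items still write adjacent columns of `out`.
          outAcc[groupCol + localRow][groupRow + localCol] =
              tile[localCol][localRow];
        });
  });
}
```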
`Code_Exercises/Matrix_Transpose`

Use good memory access patterns to transpose a matrix.