## GPU Programming Principles
## Learning Objectives
* Learn about coalesced global memory access for performance
* Learn about local memory and how to use it
#### Coalesced global memory
* Reading from and writing to global memory is generally very expensive.
* It often involves copying data across an off-chip bus.
* This means you generally want to avoid unnecessary accesses.
* Memory access operations are performed in chunks (memory transactions).
* This means accessing data that is physically close together in memory is more efficient, since neighbouring accesses can be combined into fewer transactions (see the sketch below).
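As a rough sketch (not taken from the exercise code; names such as `inBuf`, `outBuf` and `stride` are illustrative), the kernel below performs a coalesced copy, with an uncoalesced strided alternative left in a comment:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t N = 1024;
  [[maybe_unused]] constexpr size_t stride = 32;  // hypothetical bad stride

  std::vector<int> input(N, 1), output(N, 0);

  sycl::queue q;
  {
    sycl::buffer<int, 1> inBuf{input.data(), sycl::range{N}};
    sycl::buffer<int, 1> outBuf{output.data(), sycl::range{N}};

    q.submit([&](sycl::handler& cgh) {
      sycl::accessor in{inBuf, cgh, sycl::read_only};
      sycl::accessor out{outBuf, cgh, sycl::write_only};

      cgh.parallel_for(sycl::range{N}, [=](sycl::id<1> idx) {
        // Coalesced: adjacent work-items read adjacent elements, so their
        // accesses fall into the same memory transactions.
        out[idx] = in[idx];

        // Strided (uncoalesced) alternative: adjacent work-items would read
        // elements `stride` apart, touching many more transactions.
        // out[idx] = in[(idx[0] * stride) % N];
      });
    });
  }
}
```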
#### Row-major vs Column-major
* Coalescing global memory access is particularly important when working in multiple dimensions.
* This is because you have to convert a position in 2D space into an index into linear memory.
* There are two conventions for doing this, generally referred to as row-major and column-major (see the sketch below).
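A minimal sketch of the two mappings for a `rows x cols` matrix (the helper names are illustrative, not part of any particular API):

```cpp
#include <cstddef>

// Row-major: elements of the same row are contiguous in memory.
inline std::size_t rowMajorIndex(std::size_t row, std::size_t col,
                                 std::size_t cols) {
  return row * cols + col;
}

// Column-major: elements of the same column are contiguous in memory.
inline std::size_t colMajorIndex(std::size_t row, std::size_t col,
                                 std::size_t rows) {
  return col * rows + row;
}
```

Which layout gives coalesced access depends on which index varies between adjacent work-items: the index that is multiplied by a matrix dimension produces strided accesses when it is the one that varies.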
#### Cost of accessing global memory
* Global memory is very expensive to access.
* Even with coalesced access, reading the same elements from global memory multiple times is still expensive.
* Instead you want to cache those values in a lower-latency memory.
#### Using local memory

* Local memory is a "manually managed cache", often referred to as scratchpad.
* Local memory is a dedicated on-chip cache, shared per work-group.
* Local memory can be accessed in an uncoalesced fashion without much performance degradation.
#### Tiling

* The iteration space of the kernel function is mapped across multiple work-groups.
* Each work-group has its own allocation of local memory.
* You want to split the input image data into tiles, one for each work-group (see the sketch below).
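A minimal sketch of how an `nd_range` expresses this tiling for an `N x N` iteration space (names such as `launchTiled` and `tileSize` are illustrative; the kernel body only computes indices):

```cpp
#include <sycl/sycl.hpp>

// Sketch: each work-group covers one tileSize x tileSize tile.
// `q`, `N` and `tileSize` are assumed; N must be divisible by tileSize.
void launchTiled(sycl::queue& q, size_t N, size_t tileSize) {
  q.submit([&](sycl::handler& cgh) {
    cgh.parallel_for(
        sycl::nd_range<2>{sycl::range{N, N}, sycl::range{tileSize, tileSize}},
        [=](sycl::nd_item<2> item) {
          // Which tile (work-group) this work-item belongs to.
          size_t tileRow = item.get_group(0);
          size_t tileCol = item.get_group(1);

          // Position within the tile: the index you would use into the
          // work-group's own local memory allocation.
          size_t localRow = item.get_local_id(0);
          size_t localCol = item.get_local_id(1);

          (void)tileRow; (void)tileCol;
          (void)localRow; (void)localCol;
        });
  });
}
```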
#### Local accessors
```cpp
auto scratchpad = sycl::local_accessor<int, dims>(sycl::range{workGroupSize}, cgh);
```
* Local memory is allocated via a `local_accessor`.
* Unlike regular `accessor`s, they are not created from a `buffer`; instead they allocate memory per work-group for the duration of the kernel function.
* The `range` provided is the number of elements of the specified type to allocate per work-group (see the sketch below).
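A minimal sketch of allocating a 2-D tile of local memory inside a command group (the function name, `N` and `tileSize` are illustrative; the kernel body just touches the tile):

```cpp
#include <sycl/sycl.hpp>

// Sketch: tileSize * tileSize floats of local memory per work-group,
// alive for the duration of the kernel function. Assumes N % tileSize == 0.
void allocatesTile(sycl::queue& q, size_t N, size_t tileSize) {
  q.submit([&](sycl::handler& cgh) {
    auto tile =
        sycl::local_accessor<float, 2>(sycl::range{tileSize, tileSize}, cgh);

    cgh.parallel_for(
        sycl::nd_range<2>{sycl::range{N, N}, sycl::range{tileSize, tileSize}},
        [=](sycl::nd_item<2> item) {
          // Index into the tile using the work-item's local id.
          tile[item.get_local_id(0)][item.get_local_id(1)] = 0.0f;
        });
  });
}
```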
#### Synchronization

* Local memory can be used to share partial results between work-items.
* When doing so it's important to synchronize between writes and reads to memory, to ensure all work-items have reached the same point in the program.
* Remember that work-items within a work-group are not guaranteed to execute in lockstep.
* A work-item can share results with other work-items via local (or global) memory.
* This means it's possible for a work-item to read a result that hasn't been written yet, which creates a data race.
* This problem can be solved with a synchronization primitive called a work-group barrier.
* When a work-group barrier is inserted, work-items wait until all work-items in the work-group have reached that point; only then can any work-item in the work-group continue execution.
* So now you can be sure that all of the results you want to read have been written.
* However, note that this does not apply across work-group boundaries: if you write in a work-item of one work-group and then read it in a work-item of another work-group, you again have a data race.
* Furthermore, remember that work-items can only access their own work-group's local memory, not that of any other work-group.
#### `group_barrier`
```cpp
sycl::group_barrier(item.get_group());
```
* Work-group barriers can be invoked by calling `group_barrier` and passing a `group` object.
* You can retrieve a `group` object representing the current work-group by calling `get_group` on an `nd_item`.
* Note that this requires the `nd_range` variant of `parallel_for` (see the sketch below).
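A minimal sketch of the write/barrier/read pattern (names such as `neighbourExchange` are illustrative): each work-item publishes its value to local memory, the group synchronizes, then each work-item safely reads a neighbour's value.

```cpp
#include <sycl/sycl.hpp>

// Sketch: share values between work-items of the same work-group through
// local memory, with a group_barrier between the write and the read.
// `q`, `globalSize` and `workGroupSize` are assumed.
void neighbourExchange(sycl::queue& q, size_t globalSize, size_t workGroupSize) {
  q.submit([&](sycl::handler& cgh) {
    auto scratchpad =
        sycl::local_accessor<int, 1>(sycl::range{workGroupSize}, cgh);

    cgh.parallel_for(
        sycl::nd_range<1>{sycl::range{globalSize}, sycl::range{workGroupSize}},
        [=](sycl::nd_item<1> item) {
          size_t localId = item.get_local_id(0);

          // Write: each work-item publishes its own value.
          scratchpad[localId] = static_cast<int>(item.get_global_id(0));

          // Wait until every work-item in the work-group has written.
          sycl::group_barrier(item.get_group());

          // Read: now safe to read a value written by another work-item.
          size_t neighbour = (localId + 1) % item.get_local_range(0);
          int fromNeighbour = scratchpad[neighbour];
          (void)fromNeighbour;
        });
  });
}
```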
#### Matrix Transpose

* In the next exercise we will transpose a matrix.
* Reading naively from global memory and writing naively back to global memory gives poor performance.
* This is because at least one of the memory transactions will be uncoalesced: adjacent work-items read a contiguous block from memory, but write into the output array in a strided fashion.
* Using scratchpad memory allows us to make the uncoalesced loads or stores into local memory instead of global memory.
* Uncoalesced local memory transactions are much less detrimental to performance than uncoalesced global memory transactions.
* With this approach, global memory loads and stores are both coalesced: adjacent work-items read and write contiguous blocks (see the sketch below).
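Below is a minimal sketch of this tiled transpose pattern (illustrative, not the exercise solution; it assumes a square row-major `N x N` matrix with `N` divisible by `tileSize`): each work-group loads one tile with coalesced reads, synchronizes, then writes the transposed tile back with coalesced writes, keeping the strided access inside local memory.

```cpp
#include <sycl/sycl.hpp>

// Sketch of a tiled matrix transpose: out[j][i] = in[i][j].
void transposeTiled(sycl::queue& q, sycl::buffer<float, 2>& in,
                    sycl::buffer<float, 2>& out, size_t N, size_t tileSize) {
  q.submit([&](sycl::handler& cgh) {
    sycl::accessor inAcc{in, cgh, sycl::read_only};
    sycl::accessor outAcc{out, cgh, sycl::write_only, sycl::no_init};

    // One tile of local memory per work-group.
    auto tile =
        sycl::local_accessor<float, 2>(sycl::range{tileSize, tileSize}, cgh);

    cgh.parallel_for(
        sycl::nd_range<2>{sycl::range{N, N}, sycl::range{tileSize, tileSize}},
        [=](sycl::nd_item<2> item) {
          size_t localRow = item.get_local_id(0);
          size_t localCol = item.get_local_id(1);
          size_t groupRow = item.get_group(0) * item.get_local_range(0);
          size_t groupCol = item.get_group(1) * item.get_local_range(1);

          // Coalesced read from global memory into the local tile:
          // adjacent work-items read adjacent columns of `in`.
          tile[localRow][localCol] =
              inAcc[groupRow + localRow][groupCol + localCol];

          // Make sure the whole tile has been loaded before anyone reads it.
          sycl::group_barrier(item.get_group());

          // The swapped tile indices are an uncoalesced access, but only
          // into local memory; the global write stays coalesced because
          // adjacent work-items still write adjacent columns of `out`.
          outAcc[groupCol + localRow][groupRow + localCol] =
              tile[localCol][localRow];
        });
  });
}
```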
`Code_Exercises/Matrix_Transpose`

Use good memory access patterns to transpose a matrix.