SYCL and the SYCL logo are trademarks of the Khronos Group Inc.

## Further Optimizations

## Learning Objectives * Learn about compute and memory bound algorithms * Learn about optimizing for occupancy * Learn about optimizing for throughput

#### Compute-bound vs memory-bound

* When a particular algorithm is applied to a processor such as a GPU it will generally be either compute-bound or memory-bound. * If an algorithm is compute-bound then the limiting factor is occupancy, utilizing the available hardware resources. * If an algorithm is memory-bound then the limiting factor is throughput, reducing memory latency.

#### Optimizing for occupancy

* Occupancy is a metric used to measure how much of the computing power of a processor such as a GPU is being used. * Optimizing for occupancy means getting the most out of the resources of the GPU. * For memory-bound algorithms you don't necessarily need 100% occupancy, you simply need enough to hide the latency of global memory access. * For compute-bound algorithms occupancy becomes more crucial, even then it's not always possible to achieve 100% occupancy.

#### Limiting factors

There are a number of factors which can limit occupancy for a given kernel function and the data you pass to it.

* Number of work-items required. * Amount of private memory (registers) required by each work-item. * Number of work-items per work-group. * Amount of local memory required by each work-group.

#### Choosing an optimal work-group size

An important optimization for occupancy is to choose an optimal work-group size.

#### Local memory image convolution performance

![SYCL](../common-revealjs/images/image_convolution_performance_local_mem.png "SYCL")

#### Choosing an optimal work-group size

* It must be smaller than the maximum work-group size for the device you are targeting. * It must be large enough that you are utilizing the hardware concurrency (warp or wavefront) * It should be a power of 2 to allow multiple work-groups to fit into a compute unit. * It's best to experiment with different sizes and benchmark the performance of each.

#### Varying work-group size image convolution performance

![SYCL](../common-revealjs/images/image_convolution_performance_work_group_sizes.png "SYCL")

#### Reusing data

Another important optimization for occupancy is to re-use data as much as possible.

* Larger work-group sizes often mean that more work-items can share partial results and overall less memory operations are required. * Though you can hit limitations in the number of work-items required or the amount of local memory required. * If you hit this limitation then you can perform work in batches.

#### Work batching

![SYCL](../common-revealjs/images/work_distribution.png "SYCL")

* If you hit occupancy limitations then each compute unit must process multiple rounds of work-groups and work-items. * This often means reading and writing often the same data again. * Batching work for each work-item that share neighboring data allows you to further share local memory and registers.

#### Optimizing for throughput

* Throughput is a metric used to measure how much data is being passed through the GPU. * Optimizing for throughput means getting the most out of the memory bandwidth at the various levels of the memory hierarchy. * The key is to hide the latency of the memory transfers and avoid GPU resources being idle. * This is most important for memory-bound algorithms.

#### Optimizing for throughput

We have seen a number of the techniques to optimize throughput already.

* Coalescing global memory access. * Using local memory. * Explicit vectorization.

#### Optimizing for throughput

There are a number of other optimization to consider, some depend on the application or the target device.

* Double buffering. * Using constant memory. * Using texture memory.

#### Double buffering

* With particularly large data you can hit global memory limitations. * When this happens you need to pipeline the work, execute the kernel function itself multiple times, each time operating on a tile of the input data. * But copying to global memory can be very expensive. * To maintain performance you need to hide the latency of moving the data to the device.

#### Double buffering

![SYCL](../common-revealjs/images/double_buffer_1.png "SYCL")

* So in this case you have multiple invocation of the kernel function. * And between each invocation the data of the next tile must be moved to the GPU.

#### Double buffering

![SYCL](../common-revealjs/images/double_buffer_2.png "SYCL")

* With the cost of moving data to the GPU being very high so this causes significant gaps where the GPU is idle.

#### Double buffering

![SYCL](../common-revealjs/images/double_buffer_3.png "SYCL")

* An option for optimizing this is to double buffer the data being moved to and accessed on the GPU.

#### Double buffering

![SYCL](../common-revealjs/images/double_buffer_4.png "SYCL")

* This allows overlapping of compute and data movement. * This means that while one kernel invocation computes a tile of data the data for the next tile is being moved.

#### Use constant memory

![SYCL](../common-revealjs/images/constant_memory.png "SYCL")

* Some SYCL devices provide a benefit from using constant memory. * This a dedicated region of global memory that is read-only, and therefore can be faster to access. * Not all devices will benefit from using constant memory. * There is generally a much lower limit to what you can allocate in constant memory. * To use constant memory simply create an `accessor` with the `access::target::constant_buffer` access target.

#### Use texture memory

![SYCL](../common-revealjs/images/texture_memory.png "SYCL")

* Some SYCL devices provide a benefit from using constant memory. * This the texture memory used by the render pipeline. * This can be more efficient when accessing data in a pixel format. * To use texture memory use the `image` class.

#### Further tips

* Profile your kernel functions. * Follow vendor optimization guides. * Tune your algorithms.

## Questions

#### Exercise

Code_Exercises/Work_Group_Sizes/source

Try out different work-group sizes and measure the performance.

Try out some of the other optimization techniques we've looked at here.