SYCL and the SYCL logo are trademarks of the Khronos Group Inc.

## Coalesced Global Memory

## Learning Objectives

* Learn about coalesced global memory access
* Learn about the performance impact
* Learn about row-major vs column-major
* Learn about SoA vs AoS

#### Coalesced global memory

* Reading from and writing to global memory is generally very expensive.
* It often involves copying data across an off-chip bus.
* This means you generally want to avoid unnecessary accesses.
* Memory access operations are done in chunks.
* This means accessing data that is physically close together in memory is more efficient.
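To make the "chunks" point concrete, here is a minimal host-side sketch in plain C++ (no SYCL required). It assumes a simplified model in which memory is fetched in fixed 64-byte chunks and a 4-byte `float`; real chunk sizes vary by device. The function name `chunks_touched` and its parameters are hypothetical, introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Simplified model (an assumption, not a device specification):
// memory is fetched in fixed-size chunks of `chunk_bytes` bytes.
// Returns how many distinct chunks are touched when work-item i
// reads the float stored at element index i * stride.
std::size_t chunks_touched(std::size_t work_items, std::size_t stride,
                           std::size_t chunk_bytes = 64) {
  assert(chunk_bytes > 0);
  std::set<std::size_t> chunks;
  for (std::size_t i = 0; i < work_items; ++i) {
    const std::size_t byte_offset = i * stride * sizeof(float);
    chunks.insert(byte_offset / chunk_bytes);  // which chunk this read lands in
  }
  return chunks.size();
}
```

Under this model, 16 work-items reading consecutive floats (`stride == 1`) are served by a single chunk, while the same 16 work-items reading with a stride of 16 elements each land in a different chunk.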

#### Coalesced global memory

![SYCL](../common-revealjs/images/coalesced_global_memory_1.png "SYCL")

#### Coalesced global memory

![SYCL](../common-revealjs/images/coalesced_global_memory_2.png "SYCL")

#### Coalesced global memory

![SYCL](../common-revealjs/images/coalesced_global_memory_3.png "SYCL")

#### Coalesced global memory

![SYCL](../common-revealjs/images/coalesced_global_memory_4.png "SYCL")

#### Coalesced global memory

![SYCL](../common-revealjs/images/coalesced_global_memory_5.png "SYCL")

#### Coalesced global memory

![SYCL](../common-revealjs/images/coalesced_global_memory_6.png "SYCL")

#### Row-major vs Column-major

* Coalescing global memory access is particularly important when working in multiple dimensions.
* This is because when doing so you have to convert from a position in 2D space to a position in linear memory.
* There are two ways to do this, generally referred to as row-major and column-major.
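The two conversions can be sketched as plain C++ helper functions. The names `row_major_index` and `col_major_index` are hypothetical and chosen for this illustration; they are not part of SYCL.

```cpp
#include <cassert>
#include <cstddef>

// Row-major: elements of the same row are adjacent in memory,
// so stepping `col` by one moves one element in linear memory.
std::size_t row_major_index(std::size_t row, std::size_t col,
                            std::size_t width) {
  assert(col < width);
  return row * width + col;
}

// Column-major: elements of the same column are adjacent in memory,
// so stepping `row` by one moves one element in linear memory.
std::size_t col_major_index(std::size_t row, std::size_t col,
                            std::size_t height) {
  assert(row < height);
  return col * height + row;
}
```

For a 3x4 grid (height 3, width 4), moving one column to the right changes the row-major index by 1 but the column-major index by 3. Which layout gives coalesced access depends on which coordinate varies between neighbouring work-items: the coordinate that changes fastest across neighbours should be the one that maps to adjacent memory.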

#### Row-major vs Column-major

![SYCL](../common-revealjs/images/row_col_1.png "SYCL")

![SYCL](../common-revealjs/images/row_col_2.png "SYCL")

![SYCL](../common-revealjs/images/row_col_3.png "SYCL")

#### AoS vs SoA

* Another area where this is a factor is when composing data structures.
* It's often instinctive to have a struct representing a collection of data and then have an array of these, often referred to as Array of Structs (AoS).
* But for data parallel architectures such as a GPU it's more efficient to have sequential elements of the same type stored contiguously in memory, often referred to as Struct of Arrays (SoA).
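As a rough illustration of the two layouts, the following plain C++ sketch defines both and measures the byte distance between the `x` field of consecutive elements. The type names `ParticleAoS` and `ParticlesSoA` and the helper functions are hypothetical, used only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of Structs: one struct per element, so x, y, z and w
// are interleaved in memory.
struct ParticleAoS {
  float x, y, z, w;
};

// Struct of Arrays: one array per field, so all x values are contiguous.
struct ParticlesSoA {
  std::vector<float> x, y, z, w;
  explicit ParticlesSoA(std::size_t n) : x(n), y(n), z(n), w(n) {}
};

// Byte distance between the x field of element 0 and element 1 (AoS).
std::size_t aos_x_stride(const std::vector<ParticleAoS>& p) {
  assert(p.size() >= 2);
  return static_cast<std::size_t>(
      reinterpret_cast<const char*>(&p[1].x) -
      reinterpret_cast<const char*>(&p[0].x));
}

// Byte distance between the x field of element 0 and element 1 (SoA).
std::size_t soa_x_stride(const ParticlesSoA& p) {
  assert(p.x.size() >= 2);
  return static_cast<std::size_t>(
      reinterpret_cast<const char*>(&p.x[1]) -
      reinterpret_cast<const char*>(&p.x[0]));
}
```

In the AoS layout, neighbouring work-items that each read only `x` are separated by `sizeof(ParticleAoS)` bytes, so most of each fetched chunk holds fields they do not use. In the SoA layout the same reads are `sizeof(float)` bytes apart and can be coalesced.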

#### AoS vs SoA

![SYCL](../common-revealjs/images/soa_vs_aos_1.png "SYCL")

#### AoS vs SoA

![SYCL](../common-revealjs/images/soa_vs_aos_2.png "SYCL")

#### AoS vs SoA

![SYCL](../common-revealjs/images/soa_vs_aos_3.png "SYCL")

#### AoS vs SoA

![SYCL](../common-revealjs/images/soa_vs_aos_4.png "SYCL")

#### AoS vs SoA

![SYCL](../common-revealjs/images/soa_vs_aos_5.png "SYCL")

#### AoS vs SoA

![SYCL](../common-revealjs/images/soa_vs_aos_6.png "SYCL")

#### AoS vs SoA

![SYCL](../common-revealjs/images/soa_vs_aos_7.png "SYCL")

#### AoS vs SoA

![SYCL](../common-revealjs/images/soa_vs_aos_8.png "SYCL")

#### AoS vs SoA

![SYCL](../common-revealjs/images/soa_vs_aos_9.png "SYCL")

#### AoS vs SoA

![SYCL](../common-revealjs/images/soa_vs_aos_10.png "SYCL")

#### Coalesced image convolution performance

![SYCL](../common-revealjs/images/image_convolution_performance_coalesced.png "SYCL")

#### Vec types

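    // Construct from four scalars.
    auto f4_scalars = sycl::float4{1.0f, 2.0f, 3.0f, 4.0f}; // {1.0f, 2.0f, 3.0f, 4.0f}

    // Construct from a mix of scalars and a smaller vec.
    auto f2 = sycl::float2{2.0f, 3.0f}; // {2.0f, 3.0f}
    auto f4_mixed = sycl::float4{1.0f, f2, 4.0f}; // {1.0f, 2.0f, 3.0f, 4.0f}

    // Construct from a single scalar, broadcast to every element.
    auto f4_splat = sycl::float4{0.0f}; // {0.0f, 0.0f, 0.0f, 0.0f}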

* A `vec` can be constructed with any combination of scalar and vector values which add up to the correct number of elements.
* A `vec` can also be constructed from a single scalar, in which case it will initialize every element to that value.

#### Vec operators

    auto f4a = sycl::float4{1.0f, 2.0f, 3.0f, 4.0f}; // {1.0f, 2.0f, 3.0f, 4.0f}

    auto f4b = sycl::float4{2.0f}; // {2.0f, 2.0f, 2.0f, 2.0f}

    auto f4r = f4a * f4b; // {2.0f, 4.0f, 6.0f, 8.0f}

* The `vec` class provides a number of operators such as `+`, `-`, `*`, `/` and many more, which perform the operation element-wise.

#### Vec sizes

    sycl::int2
    sycl::int3   // N.B. sizeof(int3) == sizeof(int4)
    sycl::int4
    sycl::int8
    sycl::int16

* Vectors can be made from all char, integer or floating point types.
* Using vector types:
  * Can make code more readable.
  * Can give better memory access patterns.

## Questions

#### Exercise

Code_Exercises/Coalesced_Global_Memory/source

Try inverting the dimensions when calculating the linear address in memory and measure the performance.