## Coalesced Global Memory
## Learning Objectives
* Learn about coalesced global memory access
* Learn about the performance impact
* Learn about row-major vs column-major
* Learn about SoA vs AoS
#### Coalesced global memory
* Reading from and writing to global memory is generally very expensive.
* It often involves copying data across an off-chip bus.
* This means you generally want to avoid unnecessary accesses.
* Memory accesses are performed in chunks.
* This means accessing data that is physically close together in memory is more efficient, as the sketch below illustrates.
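
As a minimal sketch (the queue `q`, the buffers, and the size `N` are illustrative assumptions, not part of the exercise code), the kernel below shows the coalesced pattern: consecutive work-items read and write consecutive elements, so the hardware can combine their accesses into a small number of memory transactions.

```cpp
#include <sycl/sycl.hpp>

// Coalesced copy: work-item i touches element i, so neighbouring
// work-items access neighbouring addresses in global memory.
void copy_coalesced(sycl::queue& q, sycl::buffer<float, 1>& in,
                    sycl::buffer<float, 1>& out, size_t N) {
  q.submit([&](sycl::handler& cgh) {
    sycl::accessor inAcc{in, cgh, sycl::read_only};
    sycl::accessor outAcc{out, cgh, sycl::write_only};
    cgh.parallel_for(sycl::range{N}, [=](sycl::id<1> idx) {
      outAcc[idx] = inAcc[idx];
    });
  });
}

// By contrast, indexing with a large stride, e.g. inAcc[idx[0] * stride],
// spreads consecutive work-items across different memory chunks and
// typically wastes most of each chunk that is fetched.
```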
#### Coalesced global memory
![SYCL](../common-revealjs/images/coalesced_global_memory_1.png "SYCL")
#### Coalesced global memory
![SYCL](../common-revealjs/images/coalesced_global_memory_2.png "SYCL")
#### Coalesced global memory
![SYCL](../common-revealjs/images/coalesced_global_memory_3.png "SYCL")
#### Coalesced global memory
![SYCL](../common-revealjs/images/coalesced_global_memory_4.png "SYCL")
#### Coalesced global memory
![SYCL](../common-revealjs/images/coalesced_global_memory_5.png "SYCL")
#### Coalesced global memory
![SYCL](../common-revealjs/images/coalesced_global_memory_6.png "SYCL")
#### Row-major vs Column-major
* Coalescing global memory access is particularly important when working in multiple dimensions.
* This is because you have to convert a position in 2D space into an address in linear memory.
* There are two ways to do this, generally referred to as row-major and column-major order; see the sketch below.
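
As a plain C++ sketch (the helper names `row_major` and `col_major` are illustrative), the two ways of linearizing the element at `(row, col)` of a `height` × `width` matrix are:

```cpp
#include <cstddef>

// Row-major: elements of the same row sit next to each other in memory.
inline std::size_t row_major(std::size_t row, std::size_t col,
                             std::size_t width) {
  return row * width + col;
}

// Column-major: elements of the same column sit next to each other in memory.
inline std::size_t col_major(std::size_t row, std::size_t col,
                             std::size_t height) {
  return col * height + row;
}
```

Which of the two gives coalesced access depends on which index varies fastest between consecutive work-items.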
#### Row-major vs Column-major
![SYCL](../common-revealjs/images/row_col_1.png "SYCL")
![SYCL](../common-revealjs/images/row_col_2.png "SYCL")
![SYCL](../common-revealjs/images/row_col_3.png "SYCL")
#### AoS vs SoA
* Memory layout is also a factor when composing data structures.
* It's often instinctive to define a struct representing a collection of data and then store an array of these structs - often referred to as Array of Structs (AoS).
* But for data-parallel architectures such as GPUs it's more efficient to store sequential elements of the same type contiguously in memory - often referred to as Struct of Arrays (SoA); see the sketch below.
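
A minimal sketch of the two layouts (the `PointAoS` and `PointsSoA` types are illustrative, not taken from the exercise code):

```cpp
#include <vector>

// Array of Structs (AoS): the fields of one element are adjacent, but the
// same field of consecutive elements is separated by sizeof(PointAoS).
struct PointAoS {
  float x, y, z;
};
using PointsAoS = std::vector<PointAoS>;

// Struct of Arrays (SoA): the same field of consecutive elements is
// contiguous, so work-item i reading xs[i] lands next to its neighbours.
struct PointsSoA {
  std::vector<float> xs;
  std::vector<float> ys;
  std::vector<float> zs;
};
```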
#### AoS vs SoA
![SYCL](../common-revealjs/images/soa_vs_aos_1.png "SYCL")
#### AoS vs SoA
![SYCL](../common-revealjs/images/soa_vs_aos_2.png "SYCL")
#### AoS vs SoA
![SYCL](../common-revealjs/images/soa_vs_aos_3.png "SYCL")
#### AoS vs SoA
![SYCL](../common-revealjs/images/soa_vs_aos_4.png "SYCL")
#### AoS vs SoA
![SYCL](../common-revealjs/images/soa_vs_aos_5.png "SYCL")
#### AoS vs SoA
![SYCL](../common-revealjs/images/soa_vs_aos_6.png "SYCL")
#### AoS vs SoA
![SYCL](../common-revealjs/images/soa_vs_aos_7.png "SYCL")
#### AoS vs SoA
![SYCL](../common-revealjs/images/soa_vs_aos_8.png "SYCL")
#### AoS vs SoA
![SYCL](../common-revealjs/images/soa_vs_aos_9.png "SYCL")
#### AoS vs SoA
![SYCL](../common-revealjs/images/soa_vs_aos_10.png "SYCL")
#### Coalesced image convolution performance
![SYCL](../common-revealjs/images/image_convolution_performance_coalesced.png "SYCL")
#### Vec types
```cpp
auto f4a = sycl::float4{1.0f, 2.0f, 3.0f, 4.0f}; // {1.0f, 2.0f, 3.0f, 4.0f}
auto f2  = sycl::float2{2.0f, 3.0f};             // {2.0f, 3.0f}
auto f4b = sycl::float4{1.0f, f2, 4.0f};         // {1.0f, 2.0f, 3.0f, 4.0f}
auto f4c = sycl::float4{0.0f};                   // {0.0f, 0.0f, 0.0f, 0.0f}
```
* A `vec` can be constructed from any combination of scalar and vector values that add up to the correct number of elements.
* A `vec` can also be constructed from a single scalar, in which case every element is initialized to that value.
#### Vec operators
```cpp
auto f4a = sycl::float4{1.0f, 2.0f, 3.0f, 4.0f}; // {1.0f, 2.0f, 3.0f, 4.0f}
auto f4b = sycl::float4{2.0f};                   // {2.0f, 2.0f, 2.0f, 2.0f}
auto f4r = f4a * f4b;                            // {2.0f, 4.0f, 6.0f, 8.0f}
```
* The `vec` class provides a number of operators such as `+`, `-`, `*`, `/` and many more, which perform the operation element-wise.
#### Vec sizes
```cpp
sycl::int2
sycl::int3   // N.B. sizeof(sycl::int3) == sizeof(sycl::int4)
sycl::int4
sycl::int8
sycl::int16
```
* Vectors can be made from all char, integer, and floating-point types.
* Using vector types:
  * Can make code more readable.
  * Can give better memory access patterns, as in the sketch below.
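
For example, the sketch below (the queue `q`, the `float4` buffer, and the `scale` function are illustrative assumptions) uses a vector type so that each work-item loads and stores 16 contiguous bytes, and neighbouring work-items remain contiguous with each other:

```cpp
#include <sycl/sycl.hpp>

// Scale every element of a buffer of sycl::float4 by a factor.
void scale(sycl::queue& q, sycl::buffer<sycl::float4, 1>& data, float factor) {
  q.submit([&](sycl::handler& cgh) {
    sycl::accessor acc{data, cgh, sycl::read_write};
    cgh.parallel_for(data.get_range(), [=](sycl::id<1> idx) {
      acc[idx] = acc[idx] * factor;  // element-wise multiply on all four lanes
    });
  });
}
```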
#### Exercise
Code_Exercises/Coalesced_Global_Memory/source
Try inverting the dimensions when calculating the linear address in memory and measure the performance.
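
As a hint, a sketch of the two index orderings (illustrative, not the actual exercise source): in SYCL the last dimension of the range is the one that varies fastest between consecutive work-items, so only one of the two calculations below gives coalesced accesses.

```cpp
#include <sycl/sycl.hpp>

// Copies a height x width image; swap the two linearId lines to compare
// the coalesced and inverted address calculations.
void copy_image(sycl::queue& q, sycl::buffer<float, 1>& in,
                sycl::buffer<float, 1>& out, size_t height, size_t width) {
  q.submit([&](sycl::handler& cgh) {
    sycl::accessor inAcc{in, cgh, sycl::read_only};
    sycl::accessor outAcc{out, cgh, sycl::write_only};
    cgh.parallel_for(sycl::range<2>{height, width}, [=](sycl::id<2> idx) {
      // Coalesced: consecutive work-items differ in idx[1] and access
      // consecutive elements.
      auto linearId = idx[0] * width + idx[1];
      // Inverted: consecutive work-items end up a full column apart.
      // auto linearId = idx[1] * height + idx[0];
      outAcc[linearId] = inAcc[linearId];
    });
  });
}
```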