## Coalesced Global Memory
## Learning Objectives
* Learn about coalesced global memory access
* Learn about the performance impact
* Learn about row-major vs column-major
* Learn about SoA vs AoS
#### Coalesced global memory
* Reading from and writing to global memory is generally very expensive.
* It often involves copying data across an off-chip bus.
* This means you generally want to avoid unnecessary accesses.
* Memory access operations are performed in chunks.
* This means accessing data that is physically close together in memory is more efficient.
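The effect of chunked access can be sketched on the host with a small counting helper. This is only an illustration: `chunks_touched` is a hypothetical name, and the 64-byte chunk size is an assumption, since the real transaction size depends on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical helper (not part of SYCL): counts how many distinct memory
// chunks are touched when work-item i reads element (i * stride) of a float
// array. chunk_bytes is an assumed chunk size, not a hardware constant.
std::size_t chunks_touched(std::size_t num_work_items, std::size_t stride,
                           std::size_t chunk_bytes = 64) {
  assert(chunk_bytes > 0);
  std::set<std::size_t> chunks;
  for (std::size_t i = 0; i < num_work_items; ++i) {
    // Byte offset of the element read by work-item i.
    const std::size_t byte_offset = i * stride * sizeof(float);
    chunks.insert(byte_offset / chunk_bytes);
  }
  return chunks.size();
}
```

Under these assumptions, 16 work-items reading adjacent floats (`stride == 1`) touch a single chunk, while a stride of 16 makes every work-item touch a different chunk, so the same amount of useful data costs many more memory transactions.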
#### Row-major vs Column-major
* Coalescing global memory access is particularly important when working in multiple dimensions.
* This is because you have to convert a position in 2D space to an address in linear memory.
* There are two ways to do this, generally referred to as row-major and column-major.
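The two linearizations can be written as a pair of small functions. This is a minimal sketch: the function names and the `width`/`height` parameters are illustrative assumptions, not part of the SYCL API.

```cpp
#include <cassert>
#include <cstddef>

// Row-major: elements of the same row are adjacent in memory, so moving
// one column to the right moves one element forward.
std::size_t row_major_index(std::size_t row, std::size_t col,
                            std::size_t width) {
  assert(col < width);
  return row * width + col;
}

// Column-major: elements of the same column are adjacent in memory, so
// moving one column to the right jumps a whole column (height elements).
std::size_t column_major_index(std::size_t row, std::size_t col,
                               std::size_t height) {
  assert(row < height);
  return col * height + row;
}
```

If neighbouring work-items differ in their column index, the row-major form gives them neighbouring addresses (stride 1), whereas the column-major form separates them by `height` elements. Which form coalesces therefore depends on which dimension of the work-item id varies fastest.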
#### AoS vs SoA
* Coalescing is also a factor when composing data structures.
* It's often instinctive to define a struct representing a collection of data and then create an array of that struct, often referred to as Array of Structs (AoS).
* But for data parallel architectures such as a GPU, it's more efficient to store sequential elements of the same type contiguously in memory, often referred to as Struct of Arrays (SoA).
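The two layouts can be compared with a short host-side sketch. The pixel type with `r`, `g`, `b`, `a` channels and the helper names are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of Structs: the four channels of one pixel sit next to each other,
// and an image is stored as an array (or vector) of PixelAoS.
struct PixelAoS {
  float r, g, b, a;
};

// Struct of Arrays: each channel is stored in its own contiguous array.
struct PixelsSoA {
  std::vector<float> r, g, b, a;
  explicit PixelsSoA(std::size_t n) : r(n), g(n), b(n), a(n) {}
};

// Byte distance between the `r` values of two consecutive pixels in AoS.
std::size_t aos_stride_bytes() { return sizeof(PixelAoS); }

// Byte distance between the `r` values of two consecutive pixels in SoA.
std::size_t soa_stride_bytes(const PixelsSoA& p) {
  assert(p.r.size() >= 2);
  const char* first = reinterpret_cast<const char*>(&p.r[0]);
  const char* second = reinterpret_cast<const char*>(&p.r[1]);
  return static_cast<std::size_t>(second - first);
}
```

When every work-item reads only the `r` channel of its own pixel, the AoS layout places those reads a whole struct apart, while the SoA layout places them one `float` apart, so fewer chunks need to be fetched for the same data.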
#### Coalesced image convolution performance
[Figure: coalesced image convolution performance chart]
#### Vec types
auto f4a = sycl::float4{1.0f, 2.0f, 3.0f, 4.0f}; // {1.0f, 2.0f, 3.0f, 4.0f}
auto f2 = sycl::float2{2.0f, 3.0f}; // {2.0f, 3.0f}
auto f4b = sycl::float4{1.0f, f2, 4.0f}; // {1.0f, 2.0f, 3.0f, 4.0f}
auto f4c = sycl::float4{0.0f}; // {0.0f, 0.0f, 0.0f, 0.0f}
* A `vec` can be constructed with any combination of scalar and vector values which add up to the correct number of elements.
* A `vec` can also be constructed from a single scalar in which case it will initialize every element to that value.
#### Vec operators
auto f4a = sycl::float4{1.0f, 2.0f, 3.0f, 4.0f}; // {1.0f, 2.0f, 3.0f, 4.0f}
auto f4b = sycl::float4{2.0f}; // {2.0f, 2.0f, 2.0f, 2.0f}
auto f4r = f4a * f4b; // {2.0f, 4.0f, 6.0f, 8.0f}
* The `vec` class provides a number of operators such as `+`, `-`, `*`, `/` and many more, which perform the operation element-wise.
#### Vec sizes
sycl::int2
sycl::int3 (N.B. sizeof(sycl::int3) == sizeof(sycl::int4))
sycl::int4
sycl::int8
sycl::int16
* Vectors can be made from any char, integer or floating-point type.
* Using vector types:
  * Can make code more readable.
  * Can give better memory access patterns.
#### Exercise
Code_Exercises/Coalesced_Global_Memory/source
Try inverting the dimensions when calculating the linear address in memory and measure the performance.