## Learning Objectives
* Learn about the SYCL execution and memory model
* Learn how to enqueue an nd-range kernel function
#### SYCL execution model
* SYCL kernel functions are executed by **work-items**
* You can think of a work-item as a thread of execution
* Each work-item will execute a SYCL kernel function from start to end
* A work-item can run on CPU threads, SIMD lanes, GPU threads, or any other kind of processing element
![Work-Item](../common-revealjs/images/workitem.png "Work-Item")
#### SYCL execution model
* Work-items are collected together into **work-groups**
* The size of work-groups is generally relative to what is optimal on the device being targeted
* It can also be affected by the resources used by each work-item
![Work-Group](../common-revealjs/images/workgroup.png "Work-Group")
#### SYCL execution model
* SYCL kernel functions are invoked within an **nd-range**
* An nd-range has a number of work-groups and subsequently a number of work-items
* Work-groups always have the same number of work-items
![ND-Range](../common-revealjs/images/ndrange.png "ND-Range")
#### SYCL execution model
* The nd-range describes an **iteration space**: how it is composed in terms of work-groups and work-items
* An nd-range can be 1, 2 or 3 dimensions
* An nd-range has two components
* The **global-range** describes the total number of work-items in each dimension
* The **local-range** describes the number of work-items in a work-group in each dimension
![ND-Range](../common-revealjs/images/ndrange-example.png "ND-Range")
#### SYCL execution model
* Each invocation in the iteration space of an nd-range is a work-item
* Each invocation knows which work-item it is on and can query certain information about its position in the nd-range
* Each work-item has the following:
* **Global range**: {12, 12}
* **Global id**: {5, 6}
* **Group range**: {3, 3}
* **Group id**: {1, 1}
* **Local range**: {4, 4}
* **Local id**: {1, 2}
![ND-Range](../common-revealjs/images/ndrange-example-work-item.png "ND-Range")
#### SYCL execution model
Typically an nd-range invocation SYCL will execute the SYCL kernel function on a very large number of work-items, often in the thousands
![ND-Range](../common-revealjs/images/ndrange-invocation.png "ND-Range")
#### SYCL execution model
* Multiple work-items will generally execute concurrently
* On vector hardware this is often done in lock-step, which means the same hardware instructions
* The number of work-items that will execute concurrently can vary from one device to another
* Work-items will be batched along with other work-items in the same work-group
* The order work-items and work-groups are executed in is implementation defined
![ND-Range](../common-revealjs/images/ndrange-lock-step.png "ND-Range")
#### SYCL execution model
* Work-items in a work-group can be synchronized using a work-group barrier
* All work-items within a work-group must reach the barrier before any can continue on
![ND-Range](../common-revealjs/images/work-group-0.png "ND-Range")
#### SYCL execution model
* SYCL does not support synchronizing across all work-items in the nd-range
* The only way to do this is to split the computation into separate SYCL kernel functions
![ND-Range](../common-revealjs/images/work-group-0-1.png "ND-Range")
#### SYCL memory model
* Each work-item can access a dedicated region of **private memory**
* A work-item cannot access the private memory of another work-item
![Private Memory](../common-revealjs/images/workitem-privatememory.png "Private Memory")
#### SYCL memory model
![Local Memory](../common-revealjs/images/workitem-localmemory.png "Local Memory")
* Each work-item can access a dedicated region of **local memory** accessible to all work-items in a work-group
* A work-item cannot access the local memory of another work-group
#### SYCL memory model
![Constant Memory](../common-revealjs/images/workitem-constantmemory.png "Constant Memory")
* Each work-item can access a single region of **global memory** that's accessible to all work-items in a ND-range
* Each work-item can also access a region of global memory reserved as **constant memory**, which is read-only
#### SYCL memory model
* Each memory region has a different size and access latency
* Global / constant memory is larger than local memory and local memory is larger than private memory
* Private memory is faster than local memory and local memory is faster than global / constant memory
![Memory Regions](../common-revealjs/images/memory-regions.png "Memory Regions")
#### Expressing parallelism
cgh.parallel_for<kernel>(range<1>(1024),
[=](id<1> idx){
/* kernel function code */
});
cgh.parallel_for<kernel>(range<1>(1024),
[=](item<1> item){
/* kernel function code */
});
cgh.parallel_for<kernel>(nd_range<1>(range<1>(1024),
range<1>(32)),[=](nd_item<1> ndItem){
/* kernel function code */
});
* Overload taking a **range** object specifies the global range, runtime decides local range
* An **id** parameter represents the index within the global range
____________________________________________________________________________________________
* Overload taking a **range** object specifies the global range, runtime decides local range
* An **item** parameter represents the global range and the index within the global range
____________________________________________________________________________________________
* Overload taking an **nd_range** object specifies the global and local range
* An **nd_item** parameter represents the global and local range and index
#### Accessing Data With Accessors
* There are a few different ways to access the data represented by an accessor
* The subscript operator can take an **id**
* Must be the same dimensionality of the accessor
* For dimensions > 1, linear address is calculated in row major
* Nested subscript operators can be called for each dimension taking a **size_t**
* E.g. a 3-dimensional accessor: acc[x][y][z] = …
* A pointer to memory can be retrieved by calling **get_pointer**
* This returns a raw pointer to the data
#### Accessing Data With Accessors
buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
buffer<float, 1> bufB(dB.data(), range<1>(dB.size()));
buffer<float, 1> bufO(dO.data(), range<1>(dO.size()));
gpuQueue.submit([&](handler &cgh){
sycl::accessor inA{bufA, cgh, sycl::read_only};
sycl::accessor inB{bufB, cgh, sycl::read_only};
sycl::accessor out{bufO, cgh, sycl::write_only};
cgh.parallel_for<add>(range<1>(dA.size()),
[=](id<1> i){
out[i] = inA[i] + inB[i];
});
});
* Here we access the data of the `accessor` by
passing in the `id` passed to the SYCL kernel
function.
#### Accessing Data With Accessors
buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
buffer<float, 1> bufB(dB.data(), range<1>(dB.size()));
buffer<float, 1> bufO(dO.data(), range<1>(dO.size()));
gpuQueue.submit([&](handler &cgh){
sycl::accessor inA{bufA, cgh, sycl::read_only};
sycl::accessor inB{bufB, cgh, sycl::read_only};
sycl::accessor out{bufO, cgh, sycl::write_only};
cgh.parallel_for<add>(rng, [=](item<3> i){
auto ptrA = inA.get_pointer();
auto ptrB = inB.get_pointer();
auto ptrO = out.get_pointer();
auto linearId = i.get_linear_id();
ptrA[linearId] = ptrB[linearId] + ptrO[linearId];
});
});
* Here we retrieve the underlying pointer for each
of the `accessor`s.
* We then access the pointer using the linearized
`id` by calling the `get_linear_id` member function
on the `item`.
* Again this linearization is calculated in
row-major order.
#### Exercise
Code_Exercises/ND_Range_Kernel/source
Implement a SYCL application that will perform a vector add using `parallel_for`, adding multiple elements in parallel.