## Learning Objectives
* Learn about task parallelism and data parallelism
* Learn about the SPMD model for describing data parallelism
* Learn about SYCL execution and memory models
* Learn about enqueuing kernel functions with `parallel_for`
#### Task vs data parallelism
![Task vs Data](../common-revealjs/images/task_parallelism_data_parallelism.png "Task parallelism vs data parallelism")
* **Task parallelism** is where you have several,
possibly distinct tasks executing in parallel.
* In task parallelism you optimize for latency.
* **Data parallelism** is where you have the same
task being performed on multiple elements of data.
* In data parallelism you optimize for throughput.
#### Vector processors
* Many processors are vector processors, which means
they can naturally perform data parallelism.
* GPUs are designed to be parallel.
* CPUs have SIMD instructions which perform the
same instruction on a number elements of data.
#### SPMD model for describing data parallelism
Sequential CPU code
void calc(const int in[], int out[]) {
// all iterations are run in the same
// thread in a loop
for (int i = 0; i < 1024; i++){
out[i] = in[i] * in[i];
}
}
// calc is invoked just once and all
// iterations are performed inline
calc(in, out);
Parallel SPMD code
void calc(const int in[], int out[], int id) {
// function is described in terms of
// a single iteration
out[id] = in[id] * in[id];
}
// parallel_for invokes calc multiple
// times in parallel
parallel_for(calc, in, out, 1024);
#### SYCL execution model
* In SYCL kernel functions are executed by
**work- items**.
* You can think of a work-item as a thread of
execution.
* Each work-item will execute a SYCL kernel function from start to end.
* A work-item can run on CPU threads, SIMD lanes,
GPU threads, or any other kind of processing
element.
![Work-Item](../common-revealjs/images/workitem.png "Work-Item")
#### SYCL execution model
* SYCL kernel functions are invoked within an **nd-range**
* An nd-range has a number of work-groups and subsequently a number of work-items
* Work-groups always have the same number of work-items
![ND-Range](../common-revealjs/images/ndrange.png "ND-Range")
#### SYCL execution model
* The nd-range describes an **iteration space**: how it is composed in terms of work-groups and work-items
* An nd-range can be 1, 2 or 3 dimensions
* An nd-range has two components
* The **global-range** describes the total number of work-items in each dimension
* The **local-range** describes the number of work-items in a work-group in each dimension
![ND-Range](../common-revealjs/images/ndrange-example.png "ND-Range")
#### SYCL execution model
* Each invocation in the iteration space of an nd-range is a work-item
* Each invocation knows which work-item it is on and can query certain information about its position in the nd-range
* Each work-item has the following:
* **Global range**: {12, 12}
* **Global id**: {5, 6}
* **Group range**: {3, 3}
* **Group id**: {1, 1}
* **Local range**: {4, 4}
* **Local id**: {1, 2}
![ND-Range](../common-revealjs/images/ndrange-example-work-item.png "ND-Range")
#### SYCL execution model
Typically an nd-range invocation SYCL will execute the SYCL kernel function on a very large number of work-items, often in the thousands
![ND-Range](../common-revealjs/images/ndrange-invocation.png "ND-Range")
#### SYCL execution model
* Multiple work-items will generally execute concurrently
* On vector hardware this is often done in lock-step, which means the same hardware instructions
* The number of work-items that will execute concurrently can vary from one device to another
* Work-items will be batched along with other work-items in the same work-group
* The order work-items and work-groups are executed in is implementation defined
![ND-Range](../common-revealjs/images/ndrange-lock-step.png "ND-Range")
#### SYCL execution model
* Work-items in a work-group can be synchronized using a work-group barrier
* All work-items within a work-group must reach the barrier before any can continue on
![ND-Range](../common-revealjs/images/work-group-0.png "ND-Range")
#### SYCL execution model
* SYCL does not support synchronizing across all work-items in the nd-range
* The only way to do this is to split the computation into separate SYCL kernel functions
![ND-Range](../common-revealjs/images/work-group-0-1.png "ND-Range")
#### SYCL execution model
* SYCL also provides a simplified execution model with `sycl::range` in place of `sycl::nd_range`
* Caller only provides the global range
* Local range is decided by the runtime and cannot be inspected
* No synchronization is possible between work items
* Useful for simple problems which don't require synchronization, local memory and ultimate performance
* Runtime may not always have enough information to choose the best-performing size
#### Parallel_for
cgh.parallel_for<my_kernel>(nd_range{{1024, 16}, {32, 4}},
[=](nd_item<2> item){
// SYCL kernel function is executed
// on a range of work-items
});
* In SYCL, kernel functions can be enqueued to execute
over a range of work-items using `parallel_for`
* The first argument to `parallel_for` is an `nd_range` or
a `range` which describes the iteration space over which
the kernel is to be executed
* The kernel function has to take an `nd_item` or `item`,
respectively, as the parameter (or any type they can be
implicitly converted to, commonly from `item` to `id`)
#### Expressing parallelism
cgh.parallel_for<kernel>((nd_range<1>{1024,32},
[=](nd_item<1> ndItem){
/* kernel function code */
id globalId = ndItem.get_global_id();
id localId = ndItem.get_local_id();
});
* Overload taking an `nd_range` object specifies the global and local range
* An `nd_item` parameter represents the global and local range and index
cgh.parallel_for<kernel>(range<1>{1024},
[=](item<1> item){
/* kernel function code */
id globalId = item.get_id();
});
* Overload taking a `range` object specifies the global range, runtime decides local range
* An `item` parameter represents the global range and the index within the global range
cgh.parallel_for<kernel>(range<1>{1024},
[=](id<1> globalId){
/* kernel function code */
});
* Overload taking a `range` object specifies the global range, runtime decides local range
* An `id` parameter represents the index within the global range
#### SYCL memory model
* Each work-item can access a dedicated region of **private memory**
* A work-item cannot access the private memory of another work-item
![Private Memory](../common-revealjs/images/workitem-privatememory.png "Private Memory")
#### SYCL memory model
![Local Memory](../common-revealjs/images/workitem-localmemory.png "Local Memory")
* Each work-item can access a dedicated region of **local memory** accessible to all work-items in a work-group
* A work-item cannot access the local memory of another work-group
#### SYCL memory model
![Global Memory](../common-revealjs/images/workitem-constantmemory.png "Global Memory")
* Each work-item can access a single region of **global memory** that's accessible to all work-items in a ND-range
#### SYCL memory model
* Each memory region has a different size and access latency
* Global memory is larger than local memory and local memory is larger than private memory
* Private memory is faster than local memory and local memory is faster than global memory
![Memory Regions](../common-revealjs/images/memory-regions.png "Memory Regions")
#### Accessing Data With Accessors
* There are a few different ways to access the data represented by an accessor
* The subscript operator can take an **id**
* Must be the same dimensionality of the accessor
* For dimensions > 1, linear address is calculated in row major
* Nested subscript operators can be called for each dimension taking a **size_t**
* E.g. a 3-dimensional accessor: acc[x][y][z] = …
* A pointer to memory can be retrieved by calling **get_pointer**
* This returns a raw pointer to the data
#### Accessing Data With Accessors
buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
buffer<float, 1> bufB(dB.data(), range<1>(dB.size()));
buffer<float, 1> bufO(dO.data(), range<1>(dO.size()));
gpuQueue.submit([&](handler &cgh){
sycl::accessor inA{bufA, cgh, sycl::read_only};
sycl::accessor inB{bufB, cgh, sycl::read_only};
sycl::accessor out{bufO, cgh, sycl::write_only};
cgh.parallel_for<add>(range<1>(dA.size()),
[=](id<1> i){
out[i] = inA[i] + inB[i];
});
});
* Here we access the data of the `accessor` by
passing in the `id` passed to the SYCL kernel
function.
#### Accessing Data With Accessors
buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
buffer<float, 1> bufB(dB.data(), range<1>(dB.size()));
buffer<float, 1> bufO(dO.data(), range<1>(dO.size()));
gpuQueue.submit([&](handler &cgh){
sycl::accessor inA{bufA, cgh, sycl::read_only};
sycl::accessor inB{bufB, cgh, sycl::read_only};
sycl::accessor out{bufO, cgh, sycl::write_only};
cgh.parallel_for<add>(rng, [=](item<3> i){
auto ptrA = inA.get_pointer();
auto ptrB = inB.get_pointer();
auto ptrO = out.get_pointer();
auto linearId = i.get_linear_id();
ptrA[linearId] = ptrB[linearId] + ptrO[linearId];
});
});
* Here we retrieve the underlying pointer for each
of the `accessor`s.
* We then access the pointer using the linearized
`id` by calling the `get_linear_id` member function
on the `item`.
* Again this linearization is calculated in
row-major order.
#### Exercise
Code_Exercises/Data_Parallelism/source.cpp
Implement a SYCL application using `parallel_for` to add two arrays of values
* Use buffers and accessors to manage data
* Try the `sycl::range` and `sycl::nd_range` variants