## Learning Objectives
* Learn about task parallelism and data parallelism
* Learn about the SPMD model for describing data parallelism
* Learn about SYCL execution and memory models
* Learn about enqueuing kernel functions with `parallel_for`
#### Task vs data parallelism
![Task vs Data](../common-revealjs/images/task_parallelism_data_parallelism.png "Task parallelism vs data parallelism")
* **Task parallelism** is where you have several, possibly distinct, tasks executing in parallel.
* In task parallelism you optimize for latency.
* **Data parallelism** is where you have the same task being performed on multiple elements of data.
* In data parallelism you optimize for throughput (both styles are sketched in plain C++ below).
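A minimal plain-C++ illustration of the two styles (the task functions here are illustrative placeholders, not part of any SYCL API):

```cpp
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

void task_a() { std::cout << "task A\n"; }
void task_b() { std::cout << "task B\n"; }

int main() {
  // Task parallelism: two distinct tasks execute concurrently.
  std::thread t1{task_a};
  std::thread t2{task_b};
  t1.join();
  t2.join();

  // Data parallelism: the same operation (squaring) is applied to every
  // element; each iteration is independent, so they could in principle
  // run on separate processing elements.
  std::vector<int> in(1024, 2), out(1024);
  for (std::size_t i = 0; i < in.size(); ++i) {
    out[i] = in[i] * in[i];
  }
}
```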
#### Vector processors
* Many processors are vector processors, which means they can naturally exploit data parallelism.
* GPUs are designed to be massively parallel.
* CPUs have SIMD instructions which perform the same operation on a number of elements of data at once, as in the sketch below.
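For example, a simple element-wise loop like the following is a typical candidate for compiler auto-vectorization into SIMD instructions (whether it actually vectorizes depends on the compiler and flags; this sketch is illustrative):

```cpp
// With optimizations enabled (e.g. -O2/-O3), compilers can often emit SIMD
// instructions here, squaring several adjacent elements per instruction.
void square(const float* in, float* out, int n) {
  for (int i = 0; i < n; ++i) {
    out[i] = in[i] * in[i];
  }
}
```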
#### SPMD model for describing data parallelism
Sequential CPU code:

```cpp
void calc(const int in[], int out[]) {
  // all iterations are run in the same
  // thread in a loop
  for (int i = 0; i < 1024; i++) {
    out[i] = in[i] * in[i];
  }
}

// calc is invoked just once and all
// iterations are performed inline
calc(in, out);
```
Parallel SPMD code:

```cpp
void calc(const int in[], int out[], int id) {
  // function is described in terms of
  // a single iteration
  out[id] = in[id] * in[id];
}

// parallel_for invokes calc multiple
// times in parallel
parallel_for(calc, in, out, 1024);
```
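For comparison, a sketch of how the same SPMD kernel might look in SYCL (assuming a queue `q` and buffers `inBuf` and `outBuf` have been set up elsewhere; the `parallel_for` API is introduced below):

```cpp
q.submit([&](sycl::handler& cgh) {
  sycl::accessor inAcc{inBuf, cgh, sycl::read_only};
  sycl::accessor outAcc{outBuf, cgh, sycl::write_only};
  // The kernel body is written in terms of a single iteration,
  // identified by its id, just like the SPMD calc above.
  cgh.parallel_for<class calc_kernel>(sycl::range<1>{1024},
      [=](sycl::id<1> id) { outAcc[id] = inAcc[id] * inAcc[id]; });
});
```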
#### SYCL execution model
* In SYCL, kernel functions are executed by **work-items**.
* You can think of a work-item as a thread of
execution.
* Each work-item will execute a SYCL kernel function from start to end.
* A work-item can run on a CPU thread, a SIMD lane, a GPU thread, or any other kind of processing element.
![Work-Item](../common-revealjs/images/workitem.png "Work-Item")
#### SYCL execution model
* Work-items are launched in parallel over a `sycl::range`.
* To maximize parallelism, the range should correspond to the problem size (see the sketch below for examples).
![Work-Group](../common-revealjs/images/SYCL_range.png "Work-Group")
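For instance, a 1024-element problem could be covered by a 1D range of 1024 work-items, or a 64 x 64 image by a 2D range (a sketch):

```cpp
sycl::range<1> r1{1024};    // 1024 work-items in one dimension
sycl::range<2> r2{64, 64};  // 64 x 64 = 4096 work-items in two dimensions
// r2.size() == 4096; r2.get(0) == 64
```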
#### Parallel_for
```cpp
cgh.parallel_for<my_kernel>(range{64, 64},
    [=](id<2> idx) {
      // SYCL kernel function is executed
      // on a range of work-items
    });
```
* In SYCL, kernel functions can be enqueued to execute over a range of work-items using `parallel_for`.
* When using `parallel_for` you must also pass a `range`, which describes the iteration space over which the kernel is executed.
#### Parallel_for
```cpp
cgh.parallel_for<my_kernel>(range{64, 64},
    [=](id<2> idx) {
      // SYCL kernel function is executed
      // on a range of work-items
    });
```
* When using `parallel_for`, the function object representing the kernel function must take an `id` parameter.
* This represents the current work-item and its position within the iteration space; a complete command group is sketched below.
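Putting this together, a complete command group might look like the following sketch (the queue, buffer, and kernel name are illustrative):

```cpp
sycl::queue q;
sycl::buffer<float, 2> buf{sycl::range<2>{64, 64}};

q.submit([&](sycl::handler& cgh) {
  sycl::accessor acc{buf, cgh, sycl::write_only};
  cgh.parallel_for<class fill_kernel>(sycl::range<2>{64, 64},
      [=](sycl::id<2> idx) {
        // idx is this work-item's position in the 64 x 64 iteration space
        acc[idx] = 1.0f;
      });
});
```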
#### Expressing parallelism
```cpp
cgh.parallel_for<kernel>(range<1>(1024),
    [=](id<1> idx) {
      /* kernel function code */
    });
```

```cpp
cgh.parallel_for<kernel>(range<1>(1024),
    [=](item<1> item) {
      /* kernel function code */
    });
```

```cpp
cgh.parallel_for<kernel>(nd_range<1>(range<1>(1024),
    range<1>(32)), [=](nd_item<1> ndItem) {
      /* kernel function code */
    });
```
* The overload taking a **range** object specifies the global range; the runtime decides the local range.
* An **id** parameter represents the index within the global range.
____________________________________________________________________________________________
* The overload taking a **range** object specifies the global range; the runtime decides the local range.
* An **item** parameter represents the global range and the index within the global range.
____________________________________________________________________________________________
* The overload taking an **nd_range** object specifies both the global and local ranges.
* An **nd_item** parameter represents the global and local ranges and indices (see the sketch below).
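As a sketch of the **nd_range** overload, the kernel below launches 1024 work-items in work-groups of 32 and queries both indices (`inAcc` and `outAcc` are assumed to be accessors set up in the surrounding command group):

```cpp
cgh.parallel_for<class nd_square>(
    sycl::nd_range<1>{sycl::range<1>{1024}, sycl::range<1>{32}},
    [=](sycl::nd_item<1> item) {
      std::size_t globalId = item.get_global_id(0);  // index in global range
      std::size_t localId  = item.get_local_id(0);   // index within work-group
      // localId could be used e.g. for indexing into local memory
      outAcc[globalId] = inAcc[globalId] * inAcc[globalId];
    });
```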
#### Exercise
`Code_Exercises/Data_Parallelism/source.cpp`
Implement a SYCL application that adds two arrays of
values together in parallel using `parallel_for`.
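One possible shape for a solution (a sketch, not the reference answer; names and sizes are illustrative):

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t N = 1024;
  std::vector<float> a(N, 1.0f), b(N, 2.0f), r(N, 0.0f);

  sycl::queue q;
  {
    sycl::buffer bufA{a};
    sycl::buffer bufB{b};
    sycl::buffer bufR{r};

    q.submit([&](sycl::handler& cgh) {
      sycl::accessor inA{bufA, cgh, sycl::read_only};
      sycl::accessor inB{bufB, cgh, sycl::read_only};
      sycl::accessor out{bufR, cgh, sycl::write_only};
      cgh.parallel_for<class vector_add>(sycl::range<1>{N},
          [=](sycl::id<1> idx) { out[idx] = inA[idx] + inB[idx]; });
    });
  } // buffers go out of scope here, copying results back to r
}
```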