## A Quick Introduction to SYCL
## Learning Objectives
* Quick SYCL introduction * Writing a one-page application
#### What is SYCL?
![SYCL](../common-revealjs/images/sycl_and_cpp.png "SYCL")
SYCL is a single source, high-level, standard C++ programming model, that can target a range of heterogeneous platforms
#### What is SYCL?
A first example of SYCL code.
![SYCL](../common-revealjs/images/sycl_first_sample_code.png "SYCL")
#### SYCL Key concepts
* SYCL is a C++-based programming model: * Device code and host code exist in the same file * Device code can use templates and other C++ features * Designed with "modern" C++ in mind * SYCL provides high-level abstractions over common boilerplate code * Platform/device selection * Buffer creation and data movement * Kernel function compilation * Dependency management and scheduling
#### Image convolution sample application
* The algorithm is **embarrassingly parallel**. * Each work-item in the computation can be calculated entirely independently. * The algorithm is computation heavy. * A large number of operations are performed for each work-item in the computation, particularly when using large filters. * The algorithm requires a large bandwidth. * A lot of data must be passed through the GPU to process an image, particularly if the image is very high resolution.
#### Image convolution definition
![SYCL](../common-revealjs/images/image_convolution_definition.png "SYCL")
* A filter of a given size is applied as a stencil to the position of each pixel in the input image. * Each pixel covered by the filter is then multiples with the corresponding element in the filter. * The result of these multiplications is then summed to give the resulting output pixel. * Here we have a 3x3 gaussian blur approximation as an example.
#### SYCL System Topology
* A SYCL application can execute work across a range of different heterogeneous devices. * The devices that are available in any given system are determined at runtime through topology discovery.
#### Platforms and devices
* The SYCL runtime will discover a set of platforms that are available in the system. * Each platform represents a backend implementation such as Intel OpenCL or Nvidia CUDA. * The SYCL runtime will also discover all the devices available for each of those platforms. * CPU, GPU, FPGA, and other kinds of accelerators.
![SYCL](../common-revealjs/images/devices-1.png "SYCL-Devices")
#### Querying with a device selector
![Device Topology](../common-revealjs/images/topology-1.png "Device-Topology")
* To simplify the process of traversing the system topology SYCL provides device selectors. * A device selector is is a callable C++ object which defines a heuristic for scoring devices. * SYCL provides a number of standard device selectors, e.g. `default_selector_v`, `gpu_selector_v`, etc. * Users can also create their own device selectors.
#### Querying with a device selector
auto gpuDevice = device(gpu_selector_v); 
							
![Device Topology](../common-revealjs/images/topology-1.png "Device-Topology")
* A device selector takes a parameter of type `const device &` and gives it a "score". * Used to query all devices and return the one with the highest "score". * A device with a negative score will never be chosen.
#### SYCL Queues
![SYCL](../common-revealjs/images/out_of_order_execution.png "SYCL")
* Commands are submitted to devices in SYCL by means of a Queue * SYCL `queue`s are by default out-of-order. * This means commands are allowed to be overlapped, re-ordered, and executed concurrently, providing dependencies are honoured to ensure consistency.
#### In-of-order execution
![SYCL](../common-revealjs/images/in_order_execution.png "SYCL")
* SYCL `queue`s can be configured to be in-order. * This mean commands must execute strictly in the order they were enqueued.
#### Memory Models
* In SYCL there are two models for managing data: * The buffer/accessor model. * The USM (unified shared memory) model. * Which model you choose can have an effect on how you enqueue kernel functions.
#### CPU and GPU Memory
* A GPU has its own memory, separate to CPU memory. * In order for the GPU to use memory from the CPU, the following actions must take place (either explicitly or implicitly): * Memory allocation on the GPU. * Data migration from the CPU to the allocation on the GPU. * Some computation on the GPU. * Migration of the result back to the CPU.
#### SYCL Buffers & Accessors
* The buffer/accessor model separates the storage and access of data * A SYCL buffer manages data across the host and any number of devices * A SYCL accessor requests access to data on the host or on a device for a specific SYCL kernel function * Accessors are also used to access data within a SYCL kernel function * This means they are declared in the host code but captured by and then accessed within a SYCL kernel function
#### SYCL Buffers & Accessors
* When a buffer object is constructed it will not allocate or copy to device memory at first * This will only happen once the SYCL runtime knows the data needs to be accessed and where it needs to be accessed
![Buffer Host Memory](../common-revealjs/images/buffer-hostmemory.png "Buffer Host Memory")
#### SYCL Buffers & Accessors
* Constructing an accessor specifies a request to access the data managed by the buffer * There are a range of different types of accessor which provide different ways to access data
![Buffer Host Memory Accessor](../common-revealjs/images/buffer-hostmemory-accessor.png "Buffer Host Memory Accessor")
#### Accessing Data With Accessors
buffer<float, 1> bufA(dA.data(), range<1>(dA.size())); 
buffer<float, 1> bufB(dB.data(), range<1>(dB.size())); 
buffer<float, 1> bufO(dO.data(), range<1>(dO.size()));

gpuQueue.submit([&](handler &cgh){
  sycl::accessor inA{bufA, cgh, sycl::read_only};
  sycl::accessor inB{bufB, cgh, sycl::read_only};
  sycl::accessor out{bufO, cgh, sycl::write_only};
  cgh.parallel_for<add>(range<1>(dA.size()), 
    [=](id<1> i){ 
    out[i] = inA[i] + inB[i];
  });
});
							
* Here we access the data of the `accessor` by passing in the `id` passed to the SYCL kernel function.
#### USM: Malloc_device
void* malloc_device(size_t numBytes, const queue& syclQueue, const property_list &propList = {});

template <typename T>
T* malloc_device(size_t count, const queue& syclQueue,  const property_list &propList = {});
						
* A USM device allocation is performed by calling one of the `malloc_device` functions. * Both of these functions allocate the specified region of memory on the `device` associated with the specified `queue`. * The pointer returned is only accessible in a kernel function running on that `device`. * Synchronous exception if the device does not have aspect::usm_device_allocations * This is a blocking operation.
#### USM: Free
void free(void* ptr, queue& syclQueue);
						
* In order to prevent memory leaks USM device allocations must be free by calling the `free` function. * The `queue` must be the same as was used to allocate the memory. * This is a blocking operation.
#### USM: Memcpy
event queue::memcpy(void* dest, const void* src, size_t numBytes, const std::vector &depEvents);
						
* Data can be copied to and from a USM device allocation by calling the `queue`'s `memcpy` member function. * The source and destination can be either a host application pointer or a USM device allocation. * This is an asynchronous operation enqueued to the `queue`. * An `event` is returned which can be used to synchronize with the completion of copy operation. * May depend on other events via `depEvents`
#### SYCL execution model
* SYCL kernel functions are executed by **work-items** * You can think of a work-item as a thread of execution * Each work-item will execute a SYCL kernel function from start to end * A work-item can run on CPU threads, SIMD lanes, GPU threads, or any other kind of processing element
![Work-Item](../common-revealjs/images/workitem.png "Work-Item")
#### SYCL execution model
* Work-items are collected together into **work-groups** * The size of work-groups is generally relative to what is optimal on the device being targeted * It can also be affected by the resources used by each work-item
![Work-Group](../common-revealjs/images/workgroup.png "Work-Group")
#### SYCL execution model
* SYCL kernel functions are invoked within an **nd-range** * An nd-range has a number of work-groups and subsequently a number of work-items * Work-groups always have the same number of work-items
![ND-Range](../common-revealjs/images/ndrange.png "ND-Range")
#### SYCL execution model
* The nd-range describes an **iteration space**: how it is composed in terms of work-groups and work-items * An nd-range can be 1, 2 or 3 dimensions * An nd-range has two components * The **global-range** describes the total number of work-items in each dimension * The **local-range** describes the number of work-items in a work-group in each dimension
![ND-Range](../common-revealjs/images/ndrange-example.png "ND-Range")
#### Expressing parallelism
							
cgh.parallel_for<kernel>(range<1>(1024), 
  [=](id<1> idx){
    /* kernel function code */
});
							
							
cgh.parallel_for<kernel>(range<1>(1024), 
  [=](item<1> item){
    /* kernel function code */
});
							
							
cgh.parallel_for<kernel>(nd_range<1>(range<1>(1024), 
  range<1>(32)),[=](nd_item<1> ndItem){
    /* kernel function code */
});
							
* Overload taking a **range** object specifies the global range, runtime decides local range * An **id** parameter represents the index within the global range ____________________________________________________________________________________________ * Overload taking a **range** object specifies the global range, runtime decides local range * An **item** parameter represents the global range and the index within the global range ____________________________________________________________________________________________ * Overload taking an **nd_range** object specifies the global and local range * An **nd_item** parameter represents the global and local range and index
#### SYCL Kernels
* SYCL kernels (i.e. the device function the programmer wants executed) are expressed either as C++ function objects or lambdas. * Comparing with other GPU paradigms, kernel arguments are either data members or lambda captures, respectively * For function objects, the member function `operator()(sycl::id)` is the compute function, which is equivalent to the lambda style // then add slides explaining the practical work and how to // glue it all together from scratch
#### Function object
```c++ class MyKernel { sycl::accessor input_; float* output_; MyKernel(sycl::buffer buf, float* output, sycl::handler& h) : input{buf.get_access(h)}, output_{output} {} // const is required void operator()(sycl::item<1> i) const { ; // computation here } }; ```
The members are accessible on the device inside the function call operator.
#### Lambda function
```c++ sycl::buffer buf = /* normal init */; float * output = sycl::malloc_device(/* params */); ... queue submit as normal ... auto acc = buf.get_access(h); auto func = [=](sycl::item<1> i) { acc[i] = someVal; output[i.get_global_linear_id()] = someOtherVal; }); handler.parallel_for(range, func); ```
The variables used implicitly are captured by value and are usable in the kernel.
#### SYCL Kernels
* These forms are equivalent (as in normal C++) and which one to use depends on preference and use case * Each instance of the kernel has a uniquely valued `sycl::item` describing its position in the iteration space as covered in "SYCL execution model" * Can be used to index into accessors, pointers etc.
#### First Exercise
* Use Data Parallelism and write a SYCL kernel * The exercise README is in the "Code_Exercises/Data_Parallelism" folder * Follow the guidelines in the README and comments in the source.cpp file * There is a solution file but only use it if you need to
#### Image convolution example
![SYCL](../common-revealjs/images/image_convolution_example.png "SYCL")
#### Image convolution data flow
![SYCL](../common-revealjs/images/image_convolution_data_flow.png "SYCL")
* The kernel must read from the input image data and writes to the output image data. * It must also read from the filter. * The input image data and the filter don't need to be copied back to the host. * The output image data can be uninitialized.
#### Gaussian Blur
* The blur is implementable in two passes: one vertical, one horizontal * Each item in the iteration space would load a series of pixels to the left and right (or top and bottom) of its position in the image * Then, multiply with the corresponding element in the filter * Then sum these intermediate results and store them to the output * A dscrete convolution, in other words
#### Walkthrough
* Explain a more complex example through image convolution * Solution available in `Code_Exercises/Functors/solution.cpp`
#### Reference image
![SYCL](../common-revealjs/images/dogs.png "SYCL")
* We provide a reference image to use in the exercise. * This is in Code_Exercises/Images * This image is a 512x512 RGBA png. * Feel free to use your own image but we recommend keeping to this format. * The read and write functions can be found in `image_conv.h`.
#### Convolution filters
* `auto filter = util::generate_filter(util::filter_type filterType, int width);` * The utility for generating the filter data takes a `filter_type` enum which can be either `identity` or `blur` and a width. * Feel free to experiment with different variations. * Note that the filter width should always be an odd value. * The function can be found in `image_conv.h`