## Enqueueing a Kernel

## Learning Objectives

* Learn about queues and how to submit work to them
* Learn how to compose command groups
* Learn how to define kernel functions
* Learn about the rules and restrictions on kernel functions
* Learn how to stream text from a kernel function to the console

#### The queue

* In SYCL, all work is submitted via commands to a `queue`.
* The `queue` has an associated device that any commands enqueued to it will target.
* There are several different ways to construct a `queue`.
* The most straightforward is to default construct one.
* This has the SYCL runtime choose a device for you.

#### Precursor

* In SYCL there are two models for managing data:
  * The buffer/accessor model.
  * The USM (unified shared memory) model.
* Which model you choose can have an effect on how you enqueue kernel functions.
* For now we are going to focus on the buffer/accessor model.

#### Command groups

![SYCL](../common-revealjs/images/command_group.png "SYCL")

* In the buffer/accessor model commands must be enqueued via command groups.
* A command group represents a series of commands to be executed by a device.
* These commands include:
  * Invoking kernel functions on a device.
  * Copying data to and from a device.
  * Waiting on other commands to complete.

#### Composing command groups

![SYCL](../common-revealjs/images/composing_a_command_group.png "SYCL")

* Command groups are composed by calling the `submit` member function on a `queue`.
* The `submit` function takes a command group function which acts as a factory for composing the command group.
* The `submit` function creates a `handler` and passes it into the command group function.
* The `handler` then composes the command group.

#### Composing command groups

```cpp
gpuQueue.submit([&](handler &cgh){

  /* Command group function */

});
```

* The `submit` member function takes a C++ function object, which takes a reference to a `handler`.
* The function object can be a lambda expression or a class with a function call operator.
* The body of the function object represents the command group function.

#### Composing command groups

```cpp
gpuQueue.submit([&](handler &cgh){

  /* Command group function */

});
```

* The command group function is processed exactly once when `submit` is called.
* At this point all the commands and requirements declared inside the command group function are processed to produce a command group.
* The command group is then submitted asynchronously to the scheduler.

#### Composing command groups

```cpp
gpuQueue.submit([&](handler &cgh){

  /* Command group function */

}).wait();
```

* The `queue` will not wait for commands to complete on destruction.
* However, `submit` returns an `event` to allow you to synchronize with the completion of the commands.
* Here we call `wait` on the `event` to immediately wait for it to complete.
* There are other ways to do this, which will be covered in later lectures.

#### Scheduling

![SYCL](../common-revealjs/images/scheduling.png "SYCL")

* Once `submit` has created a command group it will submit it to the scheduler.
* The scheduler will then execute the commands on the target device once all dependencies and requirements are satisfied.

#### Scheduling

![SYCL](../common-revealjs/images/common_scheduler.png "SYCL")

* The same scheduler is used for all queues.
* This allows sharing dependency information.

#### Enqueueing SYCL Kernel Functions

```cpp
class my_kernel;

gpuQueue.submit([&](handler &cgh){

  cgh.single_task<my_kernel>([=]() {
    /* kernel code */
  });
}).wait();
```

* SYCL kernel functions are defined using one of the kernel function invoke APIs provided by the `handler`.
* These add a SYCL kernel function command to the command group.
* There can only be one SYCL kernel function command in a command group.
* Here we use `single_task`.

```cpp
class my_kernel;

gpuQueue.submit([&](handler &cgh){

  cgh.single_task<my_kernel>([=]() {
    /* kernel code */
  });
}).wait();
```

* The kernel function invoke APIs take a function object representing the kernel function.
* This can be a lambda expression or a class with a function call operator.
* This is the entry point to the code that is compiled to execute on the device.

```cpp
class my_kernel;

gpuQueue.submit([&](handler &cgh){

  cgh.single_task<my_kernel>([=]() {
    /* kernel code */
  });
}).wait();
```

* Different kernel invoke APIs take different parameters describing the iteration space to be invoked in.
* Different kernel invoke APIs can also expect different arguments to be passed to the function object.
* The `single_task` function describes a kernel function that is invoked exactly once, so there are no additional parameters or arguments.

```cpp
class my_kernel;

gpuQueue.submit([&](handler &cgh){

  cgh.single_task<my_kernel>([=]() {
    /* kernel code */
  });
}).wait();
```

* The template parameter passed to `single_task` is used to name the kernel function.
* This is necessary when defining kernel functions with lambdas to allow the host and device compilers to communicate.
* SYCL 2020 allows kernel lambdas to be unnamed, but not all implementations support that yet.

#### SYCL kernel function rules

* Must be defined using a C++ lambda or function object; they cannot be a function pointer or `std::function`.
* Must always capture or store members by value.
* SYCL kernel functions declared with a lambda can be anonymous since SYCL 2020. Before that, they had to be named using a forward declarable C++ type, declared in global scope.
* SYCL kernel function names follow C++ ODR rules, which means you cannot have two kernels with the same name.

#### SYCL kernel function restrictions

* No dynamic allocation
* No dynamic polymorphism
* No function pointers
* No recursion
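These restrictions also apply to any function that kernel code calls. As an illustration only, the hypothetical helper below computes a factorial in a way that fits them: a loop instead of recursion, no heap allocation and no virtual calls. It is plain C++ and does not need a SYCL header.

```cpp
#include <cstdint>

// Hypothetical helper written to respect the kernel restrictions:
// it iterates instead of recursing and allocates nothing dynamically.
constexpr std::uint64_t factorial(std::uint32_t n) {
  std::uint64_t result = 1;
  for (std::uint32_t i = 2; i <= n; ++i) {
    result *= i;
  }
  return result;
}

// Checked at compile time on the host.
static_assert(factorial(5) == 120, "5! should be 120");
```

A recursive `factorial` would be fine on the host, but calling it from a kernel function would break the "no recursion" rule.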

#### Kernels as function objects

```cpp
class my_kernel;

queue gpuQueue;
gpuQueue.submit([&](handler &cgh){

  cgh.single_task<my_kernel>([=]() {
    /* kernel code */
  });
}).wait();
```

* All the examples of SYCL kernel functions up until now have been defined using lambda expressions.

#### Kernels as function objects

```cpp
struct my_kernel {
  void operator()() const {
    /* kernel function */
  }
};
```

* As well as defining SYCL kernels using lambda expressions, you can also define a SYCL kernel using a regular C++ function object.
* To do so, define a type with a public const-qualified `operator()` member function.

#### Kernels as function objects

```cpp
struct my_kernel {
  void operator()() const {
    /* kernel function */
  }
};

queue gpuQueue;
gpuQueue.submit([&](handler &cgh){

  cgh.single_task(my_kernel{});
}).wait();
```

* To use a C++ function object you simply construct an instance of the type and pass it to `single_task`.
* Notice you no longer need to name the SYCL kernel, because the type of the function object identifies it.

#### Streams

* A `stream` can be used in a kernel function to print text to the console from the device, similarly to how you would with `std::cout`.
* The `stream` is a buffered output stream, so the output may not appear until the kernel function is complete.
* The `stream` is useful for debugging, but should not be relied on in performance-critical code.

#### Streams

```cpp
sycl::stream(size_t bufferSize, size_t workItemBufferSize, handler &cgh);
```

* A `stream` must be constructed in the command group function, as a `handler` is required.
* The constructor also takes a `size_t` parameter specifying the total size of the buffer that will store the text.
* It also takes a second `size_t` parameter specifying the work-item buffer size.
* The work-item buffer size represents the cache that each invocation of the kernel function (exactly one in the case of `single_task`) has for composing a stream of text.

#### Streams

```cpp
class my_kernel;

queue gpuQueue;
gpuQueue.submit([&](handler &cgh){

  auto os = sycl::stream(1024, 1024, cgh);

  cgh.single_task<my_kernel>([=]() {
    /* kernel code */
  });
}).wait();
```

* Here we construct a `stream` in our command group function with a buffer size of `1024` and a work-item buffer size that is also `1024`.
* This means that the total text that the stream can receive is 1024 bytes.

#### Streams

```cpp
class my_kernel;

queue gpuQueue;
gpuQueue.submit([&](handler &cgh){

  auto os = sycl::stream(1024, 1024, cgh);

  cgh.single_task<my_kernel>([=]() {
    os << "Hello world!\n";
  });
}).wait();
```

* Next we capture the `stream` in the kernel function's lambda expression.
* Then we can print `"Hello world!"` to the console using the `<<` operator.
* This is where the work-item buffer size comes in: it is the cache available to store text on the right-hand side of the `<<` operator.

## Questions

#### Exercise

Code_Exercises/Enqueueing_a_Kernel/source

Implement a SYCL application which enqueues a kernel function to a device and streams "Hello world!" to the console.