## Asynchronous Execution
## Learning Objectives

* Learn about how commands are enqueued asynchronously
* Learn about the different reasons for synchronization
* Learn about the different ways to perform synchronization
#### Asynchronous execution
* All commands submitted to a `queue` are enqueued asynchronously.
* The functions return immediately and the command is run in the background by the SYCL runtime.
* This includes individual commands like `memcpy` and collections of commands derived from a command group.
* This means you have to synchronize with those commands, as the sketch below illustrates.
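A minimal sketch of this behavior (assuming a default-constructed `sycl::queue` and a placeholder kernel, both hypothetical here): `submit` returns an `event` immediately, and the kernel may still be running while subsequent host code executes.

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue gpuQueue;  // assumption: default device selection

  // submit() returns immediately; the command group is executed
  // asynchronously by the SYCL runtime.
  auto event = gpuQueue.submit([&](sycl::handler &cgh){
    cgh.single_task<class async_demo>([=]{ /* some computation */ });
  });

  // Host code here may run concurrently with the kernel.

  event.wait();  // explicit synchronization point
}
```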
#### Synchronization
There are a number of reasons why you need to synchronize with commands:
* Await completion of a kernel function.
* Await the results of a computation.
* Await error conditions which come from a failure to execute any of the commands.
#### Synchronization with kernel functions
There are two ways to synchronize with kernel functions.
* Calling `wait` on an `event` object returned from enqueuing a kernel function command, either via a command group or a shortcut function.
* Calling `wait` or `wait_and_throw` on the `queue` itself.
#### Synchronizing with kernel functions (buffers/accessors)
```cpp
auto buf = sycl::buffer{data, sycl::range{1024}};

gpuQueue.submit([&](sycl::handler &cgh){
  auto acc = sycl::accessor{buf, cgh};

  cgh.parallel_for<kernel_a>(sycl::range{1024},
    [=](sycl::id<1> idx){
      acc[idx] = /* some computation */;
    });
}).wait();
```
* Calling `wait` on an `event` object returned from enqueuing a command group will wait for the commands from that command group to complete.
* This is how we have synchronized in our examples so far.
* This effectively creates a blocking operation that completes in place by immediately synchronizing.
#### Synchronizing with kernel functions (buffers/accessors)
```cpp
auto buf = sycl::buffer{data, sycl::range{1024}};

gpuQueue.submit([&](sycl::handler &cgh){
  auto acc = sycl::accessor{buf, cgh};

  cgh.parallel_for<kernel_a>(sycl::range{1024},
    [=](sycl::id<1> idx){
      acc[idx] = /* some computation */;
    });
});

gpuQueue.wait();
```
* Calling `wait` or `wait_and_throw` on a `queue` will wait for all commands enqueued to it to complete.
* Note that command groups do not create commands to copy data back to the host application.
#### Synchronizing with kernel functions (USM)
```cpp
auto devicePtr = usm_wrapper<int>(
  sycl::malloc_device<int>(1024, gpuQueue));

gpuQueue.memcpy(devicePtr, data, sizeof(int) * 1024).wait();

gpuQueue.parallel_for<kernel_a>(sycl::range{1024},
  [=](sycl::id<1> idx){
    devicePtr[idx] = /* some computation */;
  }).wait();
```
* Calling `wait` on an `event` object returned from functions such as `memcpy` or the `queue` shortcuts will wait for that specific command to complete.
* Again, this is how we have synchronized in our examples so far.
#### Synchronizing with kernel functions (USM)
```cpp
auto devicePtr = usm_wrapper<int>(
  sycl::malloc_device<int>(1024, gpuQueue));

gpuQueue.memcpy(devicePtr, data, sizeof(int) * 1024);

gpuQueue.wait();

gpuQueue.parallel_for<kernel_a>(sycl::range{1024},
  [=](sycl::id<1> idx){
    devicePtr[idx] = /* some computation */;
  });

gpuQueue.wait();
```
* Again, calling `wait` or `wait_and_throw` on a `queue` will wait for all commands enqueued to it to complete.
* Note that you generally don't want to call `wait` on the `queue` after every command; instead you want to create dependencies between commands, which we cover in the next lecture.
#### Synchronizing with data
There are multiple ways to synchronize with data, and they differ depending on the data management model you are using.
* When using the USM data management model you can synchronize the same way you would for kernel functions, calling `wait` on an `event` or on the `queue`.
* When using the buffer/accessor data management model command groups don't automatically copy data back, so there are other ways to synchronize with the data:
  * Creating a `host_accessor`.
  * Destroying the `buffer`.
#### Synchronizing with data (USM)
```cpp
gpuQueue.memcpy(data, devicePtr, sizeof(int) * 1024).wait();
```

```cpp
gpuQueue.memcpy(data, devicePtr, sizeof(int) * 1024);
gpuQueue.wait();
```
* Simply call `wait` on the `event` returned from `memcpy`.
* Alternatively, call `wait` on the `queue`.
#### Synchronizing with data (buffer/accessor)
```cpp
auto buf = sycl::buffer{data, sycl::range{1024}};

gpuQueue.submit([&](sycl::handler &cgh){
  auto acc = sycl::accessor{buf, cgh};

  cgh.parallel_for<kernel_a>(sycl::range{1024},
    [=](sycl::id<1> idx){
      acc[idx] = /* some computation */;
    });
});

{
  auto hostAcc = buf.get_host_access();

  hostAcc[/* some index */] = /* some computation */;
}
```
* A `host_accessor` gives immediate access to the data managed by a `buffer` in the host application.
* Creating one will wait for any kernel functions accessing the `buffer` to complete and then copy the data back to the host.
* It will also block any other `accessor` from accessing the `buffer` until it is destroyed.
* Note that the data managed by the `buffer` may not be copied back to the original address on the host, in this case `data`.
#### Synchronizing with data (buffer/accessor)
```cpp
{
  auto buf = sycl::buffer{data, sycl::range{1024}};

  gpuQueue.submit([&](sycl::handler &cgh){
    auto acc = sycl::accessor{buf, cgh};

    cgh.parallel_for<kernel_a>(sycl::range{1024},
      [=](sycl::id<1> idx){
        acc[idx] = /* some computation */;
      });
  });
}
```
* A `buffer` will also synchronize the data it manages on destruction.
* It will wait for any kernel functions accessing it to complete and copy the data back to the original address before completing destruction.
#### Synchronizing with errors
* Errors are handled by a `queue`, and any asynchronous errors can be produced during any of the synchronization methods we've looked at.
* The best way to ensure all errors are caught is to synchronize by calling `wait` or `wait_and_throw` on the `queue`, as in the sketch below.
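A minimal sketch of this (the `asyncHandler` lambda and the kernel here are illustrative placeholders, not from the exercises): the `queue` stores asynchronous errors and delivers them to its handler when `wait_and_throw` is called.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // Handler invoked with any asynchronous errors the queue has collected.
  auto asyncHandler = [](sycl::exception_list exceptions) {
    for (auto &e : exceptions) {
      try {
        std::rethrow_exception(e);
      } catch (const sycl::exception &ex) {
        std::cerr << "Async SYCL exception: " << ex.what() << "\n";
      }
    }
  };

  sycl::queue gpuQueue{sycl::default_selector_v, asyncHandler};

  gpuQueue.submit([&](sycl::handler &cgh){
    cgh.single_task<class error_demo>([=]{ /* some computation */ });
  });

  // Waits for all enqueued commands to complete and passes any
  // asynchronous errors that occurred to asyncHandler.
  gpuQueue.wait_and_throw();
}
```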
## Questions
#### Exercise
`Code_Exercises/Asynchronous_Execution/source`
Try out the different methods of synchronizing with a kernel function and the resulting data from the computation.