## Learning Objectives
* Learn how to create dependencies between kernel functions
* Learn how to move data between the host and device(s)
* Learn the differences between the buffer/accessor and USM data management models
* Learn how to represent basic data flow graphs
#### Buffer/accessor and USM
There are two ways to move data and create dependencies between kernel functions in SYCL:
Buffer/accessor data movement model
* Data dependency analysis
* Implicit data movement
USM data movement model
* Manual chaining of dependencies
* Explicit data movement
#### Creating dependencies
![SYCL](../common-revealjs/images/data_dependency.png "SYCL")
* Kernel A first writes to the data
* Kernel B then reads from and writes to the data
* This creates a read-after-write (RAW) relationship
* There must be a dependency created between Kernel A and Kernel B
#### Moving data
![SYCL](../common-revealjs/images/data_movement.png "SYCL")
* Here both kernel functions are enqueued to the same device, in this case a GPU
* The data must be copied to the GPU before Kernel A executes
* The data must remain on the GPU for Kernel B to be executed
* The data must be copied back to the host after Kernel B has executed
#### Data flow
![SYCL](../common-revealjs/images/data_flow.png "SYCL")
* Combining the kernel function dependencies and the data movement dependencies gives the final data flow graph
* This graph defines the order in which all commands must execute in order to maintain consistency
* In more complex data flow graphs there may be multiple orderings which can achieve the same consistency
#### Data flow with buffers and accessors
```cpp
sycl::buffer buf {data, sycl::range{1024}};

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor acc {buf, cgh};

  cgh.parallel_for<kernel_a>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      acc[idx] = /* some computation */
  });
});

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor acc {buf, cgh};

  cgh.parallel_for<kernel_b>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      acc[idx] = /* some computation */
  });
});

gpuQueue.wait();
```
* The buffer/accessor data management model is descriptive
* Dependencies and data movement are inferred from the access requirements of command groups
* The SYCL runtime is responsible for guaranteeing that data dependencies and consistency are maintained
#### Data flow with buffers and accessors
```cpp
sycl::buffer buf {data, sycl::range{1024}};

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor acc {buf, cgh};

  cgh.parallel_for<kernel_a>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      acc[idx] = /* some computation */
  });
});

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor acc {buf, cgh};

  cgh.parallel_for<kernel_b>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      acc[idx] = /* some computation */
  });
});

gpuQueue.wait();
```
* A `buffer` object is responsible for managing data between the host and one or more devices
* It is also responsible for tracking dependencies on the data it manages
* It also allocates memory and moves data when necessary
* Note that a `buffer` is lazy and will not allocate or move data until it is asked to
#### Data flow with buffers and accessors
```cpp
sycl::buffer buf {data, sycl::range{1024}};

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor acc {buf, cgh};

  cgh.parallel_for<kernel_a>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      acc[idx] = /* some computation */
  });
});

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor acc {buf, cgh};

  cgh.parallel_for<kernel_b>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      acc[idx] = /* some computation */
  });
});

gpuQueue.wait();
```
* An `accessor` object is responsible for describing data access requirements
* It describes what data a kernel function is accessing and how it is accessing it
* The `buffer` object uses this information to infer dependencies and data movement
#### Data flow with buffers and accessors
```cpp
sycl::buffer buf {data, sycl::range{1024}};

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor acc {buf, cgh};

  cgh.parallel_for<kernel_a>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      acc[idx] = /* some computation */
  });
});

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor acc {buf, cgh};

  cgh.parallel_for<kernel_b>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      acc[idx] = /* some computation */
  });
});

gpuQueue.wait();
```
* Associating the `accessor` object with the `handler` connects the access dependency to the kernel function
* It also associates the access requirement with the device being targeted
#### Data flow with USM
```cpp
auto devicePtr = sycl::malloc_device<int>(1024, gpuQueue);

auto e1 = gpuQueue.memcpy(devicePtr, data, sizeof(int) * 1024);

auto e2 = gpuQueue.parallel_for<kernel_a>(
  sycl::range{1024}, e1, [=](sycl::id<1> idx) {
    devicePtr[idx] = /* some computation */
});

auto e3 = gpuQueue.parallel_for<kernel_b>(
  sycl::range{1024}, e2, [=](sycl::id<1> idx) {
    devicePtr[idx] = /* some computation */
});

auto e4 = gpuQueue.memcpy(data, devicePtr, sizeof(int) * 1024, e3);

e4.wait();

sycl::free(devicePtr, gpuQueue);
```
* The USM data management model is prescriptive
* Dependencies are defined explicitly by passing around `event` objects
* Data movement is performed explicitly by enqueuing `memcpy` operations
* The user is responsible for ensuring data dependencies and consistency are maintained
#### Data flow with USM
```cpp
auto devicePtr = sycl::malloc_device<int>(1024, gpuQueue);

auto e1 = gpuQueue.memcpy(devicePtr, data, sizeof(int) * 1024);

auto e2 = gpuQueue.parallel_for<kernel_a>(
  sycl::range{1024}, e1, [=](sycl::id<1> idx) {
    devicePtr[idx] = /* some computation */
});

auto e3 = gpuQueue.parallel_for<kernel_b>(
  sycl::range{1024}, e2, [=](sycl::id<1> idx) {
    devicePtr[idx] = /* some computation */
});

auto e4 = gpuQueue.memcpy(data, devicePtr, sizeof(int) * 1024, e3);

e4.wait();

sycl::free(devicePtr, gpuQueue);
```
* Each command enqueued to the `queue` produces an `event` object which can be used to synchronize with the completion of that command
* Passing those `event` objects when enqueueing other commands creates dependencies
#### Data flow with USM
```cpp
auto devicePtr = sycl::malloc_device<int>(1024, gpuQueue);

auto e1 = gpuQueue.memcpy(devicePtr, data, sizeof(int) * 1024);

auto e2 = gpuQueue.parallel_for<kernel_a>(
  sycl::range{1024}, e1, [=](sycl::id<1> idx) {
    devicePtr[idx] = /* some computation */
});

auto e3 = gpuQueue.parallel_for<kernel_b>(
  sycl::range{1024}, e2, [=](sycl::id<1> idx) {
    devicePtr[idx] = /* some computation */
});

auto e4 = gpuQueue.memcpy(data, devicePtr, sizeof(int) * 1024, e3);

e4.wait();

sycl::free(devicePtr, gpuQueue);
```
* The `memcpy` member functions are used to enqueue data movement commands, moving the data to the GPU and then back again
#### Concurrent data flow
![SYCL](../common-revealjs/images/concurrent_data_flow.png "SYCL")
* If two kernels are accessing different buffers then there is no dependency between them
* In this case the two kernels and their respective data movement are independent
* By default, `queue`s are out-of-order, which means that these commands can execute in any order
* They could also execute concurrently if the target device is able to do so
#### Concurrent data flow with buffers and accessors
```cpp
sycl::buffer bufA {dataA, sycl::range{1024}};
sycl::buffer bufB {dataB, sycl::range{1024}};

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor accA {bufA, cgh};

  cgh.parallel_for<kernel_a>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      accA[idx] = /* some computation */
  });
});

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor accB {bufB, cgh};

  cgh.parallel_for<kernel_b>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      accB[idx] = /* some computation */
  });
});

gpuQueue.wait();
```
* The buffer/accessor data management model automatically infers dependencies
* Since the two kernel functions access different `buffer` objects, the SYCL runtime can infer that there is no dependency between them
* Data movement is still performed for the two kernels as normal
* The two kernels and their respective copies can therefore execute in any order
#### Concurrent data flow with USM
```cpp
auto devicePtrA = sycl::malloc_device<int>(1024, gpuQueue);
auto devicePtrB = sycl::malloc_device<int>(1024, gpuQueue);

auto e1 = gpuQueue.memcpy(devicePtrA, dataA, sizeof(int) * 1024);
auto e2 = gpuQueue.memcpy(devicePtrB, dataB, sizeof(int) * 1024);

auto e3 = gpuQueue.parallel_for<kernel_a>(
  sycl::range{1024}, e1, [=](sycl::id<1> idx) {
    devicePtrA[idx] = /* some computation */
});

auto e4 = gpuQueue.parallel_for<kernel_b>(
  sycl::range{1024}, e2, [=](sycl::id<1> idx) {
    devicePtrB[idx] = /* some computation */
});

auto e5 = gpuQueue.memcpy(dataA, devicePtrA, sizeof(int) * 1024, e3);
auto e6 = gpuQueue.memcpy(dataB, devicePtrB, sizeof(int) * 1024, e4);

e5.wait(); e6.wait();

sycl::free(devicePtrA, gpuQueue);
sycl::free(devicePtrB, gpuQueue);
```
* Dependencies are defined explicitly
* There are no dependencies between the two kernel functions, only between each kernel function and its associated data movement commands
#### Concurrent data flow with USM
```cpp
auto devicePtrA = sycl::malloc_device<int>(1024, gpuQueue);
auto devicePtrB = sycl::malloc_device<int>(1024, gpuQueue);

auto e1 = gpuQueue.memcpy(devicePtrA, dataA, sizeof(int) * 1024);
auto e2 = gpuQueue.memcpy(devicePtrB, dataB, sizeof(int) * 1024);

auto e3 = gpuQueue.parallel_for<kernel_a>(
  sycl::range{1024}, e1, [=](sycl::id<1> idx) {
    devicePtrA[idx] = /* some computation */
});

auto e4 = gpuQueue.parallel_for<kernel_b>(
  sycl::range{1024}, e2, [=](sycl::id<1> idx) {
    devicePtrB[idx] = /* some computation */
});

auto e5 = gpuQueue.memcpy(dataA, devicePtrA, sizeof(int) * 1024, e3);
auto e6 = gpuQueue.memcpy(dataB, devicePtrB, sizeof(int) * 1024, e4);

e5.wait(); e6.wait();

sycl::free(devicePtrA, gpuQueue);
sycl::free(devicePtrB, gpuQueue);
```
* Each chain of commands is independent of the other
* The two kernels and their respective copies can therefore execute in any order
#### Which should you choose?
When should you use the buffer/accessor or USM data management models?
Buffer/accessor data movement model
* If you want to guarantee consistency and avoid errors
* If you want to iterate on your data flow more quickly
USM data movement model
* If you need to use USM
* If you want more fine grained control over data movement
#### Exercise
`Code_Exercises/Data_and_Dependencies/source`
![SYCL](../common-revealjs/images/diamond_data_flow.png "SYCL")
Put together what you've seen here to create the diamond data flow graph above, using either the buffer/accessor or USM data management model