## Data and Dependencies

## Learning Objectives * Learn about how to create dependencies between kernel functions * Learn about how to move data between the host and device(s) * Learn about the differences between the buffer/accessor and USM data management models * Learn how to represent basic data flow graphs

#### Access/buffer and USM

There are two ways to move data and create dependencies between kernel functions in SYCL

Buffer/accessor data movement model
* Data dependencies analysis * Implicit data movement

USM data movement model
* Manual chaining of dependencies * Explicit data movement

#### Creating dependencies

![SYCL](../common-revealjs/images/data_dependency.png "SYCL")

* Kernel A first writes to the data * Kernel B then reads from and writes to the data * This creates a read-after-write (RAW) relationship * There must be a dependency created between Kernel A and Kernel B

#### Moving data

![SYCL](../common-revealjs/images/data_movement.png "SYCL")

* Here both kernel functions are enqueued to the same device, in this case a GPU * The data must be copied to the GPU before the Kernel A is executed * The data must remain on the GPU for Kernel B to be executed * The data must be copied back to the host after Kernel B has executed

#### Data flow

![SYCL](../common-revealjs/images/data_flow.png "SYCL")

* Combining kernel function dependencies and the data movement dependencies we have a final data flow graph * This graph defines the order in which all commands must execute in order to maintain consistency * In more complex data flow graphs there may be multiple orderings which can achieve the same consistency

#### Data flow with buffers and accessors

sycl::buffer buf {data, sycl::range{1024}};

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor acc {buf, cgh};

  cgh.parallel_for<kernel_a>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      acc[idx] = /* some computation */
    });
});

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor acc{buf, cgh};

  cgh.parallel_for<kernel_b>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      acc[idx] = /* some computation */
    });
});

gpuQueue.wait();

* The buffer/accessor data management model data model is descriptive * Dependencies and data movement is inferred from the access requirements of command groups * The SYCL runtime is responsible for guaranteeing that data dependencies and consistency are maintained

#### Data flow with buffers and accessors

sycl::buffer buf {data, sycl::range{1024}};

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor acc {buf, cgh};

  cgh.parallel_for<kernel_a>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      acc[idx] = /* some computation */
    });
});

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor acc {buf, cgh};

  cgh.parallel_for<kernel_b>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      acc[idx] = /* some computation */
    });
});

gpuQueue.wait();

* A `buffer` object is responsible for managing data between the host and one or more devices * It is also responsible for tracking dependencies on the data it manages * It will also allocating memory and move data when necessary. * Note that a `buffer` is lazy and will not allocate or move data until it is asked to

#### Data flow with buffers and accessors

sycl::buffer buf {data, sycl::range{1024}};

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor acc{buf, cgh};

  cgh.parallel_for<my_kernel>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      acc[idx] = /* some computation */
    });
});

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor acc{buf, cgh};

  cgh.parallel_for<my_kernel>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      acc[idx] = /* some computation */
    });
});

gpuQueue.wait();

* An `accessor` object is responsible for describing data access requirements * It describes what data a kernel function is accessing and how it is accessing it * The `buffer` object uses this information to create infer dependencies and data movement

#### Data flow with buffers and accessors

buf = sycl::buffer(data, sycl::range{1024});

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor acc {buf, cgh};

  cgh.parallel_for<my_kernel>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      acc[idx] = /* some computation */
    });
});

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor acc {buf, cgh};

  cgh.parallel_for<my_kernel>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      acc[idx] = /* some computation */
    });
});

gpuQueue.wait();

* Associating the `accessor` object with the `handler` connects the access dependency to the kernel function * It also associates the access requirement with the device being targeted

#### Data flow with USM

auto devicePtr =
  sycl::malloc_device<int>(1024, gpuQueue);

auto e1 = gpuQueue.memcpy(devicePtr, data, sizeof(int));

auto e2 = gpuQueue.parallel_for<kernel_a>(
  sycl::range{1024}, e1, [=](sycl::id<1> idx) {
    devicePtr[idx] = /* some computation */
  });

auto e3 = gpuQueue.parallel_for<kernel_b>(
  sycl::range{1024}, e2, [=](sycl::id<1> idx) {
    devicePtr[idx] = /* some computation */
  });

auto e4 = gpuQueue.memcpy(data, devicePtr,
                          sizeof(int), e3);

e4.wait();

sycl::free(devicePtr, gpuQueue);

* The USM data management model data model is prescriptive * Dependencies are defined explicitly by passing around `event` objects * Data movement is performed explicitly by enqueuing `memcpy` operations * The user is responsible for ensuring data dependencies and consistency are maintained

#### Data flow with USM

auto devicePtr =
  sycl::malloc_device<int>(1024, gpuQueue);

auto e1 = gpuQueue.memcpy(devicePtr, data, sizeof(int));

auto e2 = gpuQueue.parallel_for<kernel_a>(
  sycl::range{1024}, e1, [=](sycl::id<1> idx) {
    devicePtr[idx] = /* some computation */
  });

auto e3 = gpuQueue.parallel_for<kernel_b>(
  sycl::range{1024}, e2, [=](sycl::id<1> idx) {
    devicePtr[idx] = /* some computation */
  });

auto e4 = gpuQueue.memcpy(data, devicePtr,
                          sizeof(int), e3);

e4.wait();

* Each command enqueued to the `queue` produces an `event` object which can be used to synchronize with the completion of that command * Passing those `event` objects when enqueueing other commands creates dependencies

#### Data flow with USM

auto devicePtr =
  sycl::malloc_device<int>(1024, gpuQueue);

auto e1 = gpuQueue.memcpy(devicePtr, data, sizeof(int));

auto e2 = gpuQueue.parallel_for<kernel_a>(
  sycl::range{1024}, e1, [=](sycl::id<1> idx) {
    devicePtr[idx] = /* some computation */
  });

auto e3 = gpuQueue.parallel_for<kernel_b>(
  sycl::range{1024}, e2, [=](sycl::id<1> idx) {
    devicePtr[idx] = /* some computation */
  });

auto e4 = gpuQueue.memcpy(data, devicePtr,
                          sizeof(int), e3);

e4.wait();

sycl::free(devicePtr, gpuQueue);

* The `memcpy` member functions are used to enqueue data movement commands, moving the data to the GPU and then back again

#### Concurrent data flow

![SYCL](../common-revealjs/images/concurrent_data_flow.png "SYCL")

* If two kernels are accessing different buffers then there is no dependency between them * In this case the two kernels and their respective data movement are independent * By default `queue`s are out-of-order which means that these commands can execute in any order * They could also execute concurrently if the target device is able to do so

#### Concurrent data flow with buffers and accessors

sycl::buffer bufA {dataA, sycl::range{1024}};
sycl::buffer bufB {dataB, sycl::range{1024}};

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor accA {bufA, cgh};

  cgh.parallel_for<kernel_a>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      accA[idx] = /* some computation */
    });
});

gpuQueue.submit([&](sycl::handler &cgh) {
  sycl::accessor accB {bufB, cgh};

  cgh.parallel_for<kernel_b>(sycl::range{1024},
    [=](sycl::id<1> idx) {
      accB[idx] = /* some computation */
    });
});

gpuQueue.wait();

* The buffer/accessor data management model automatically infers dependencies * As each of the two kernel functions are accessing different `buffer` objects the SYCL runtime can infer there is no dependency between them * Data movement is still performed for the two kernels as normal * The two kernels and their respective copies collectively can be executed in any order

#### Concurrent data flow with USM

auto devicePtrA = sycl::malloc_device<int>(1024, gpuQueue);
auto devicePtrB = sycl::malloc_device<int>(1024, gpuQueue);

auto e1 = gpuQueue.memcpy(devicePtrA, dataA, sizeof(int));
auto e2 = gpuQueue.memcpy(devicePtrB, dataB, sizeof(int));

auto e3 = gpuQueue.parallel_for<kernel_a>(sycl::range{1024}, e1, [=](sycl::id<1> idx) {
  devicePtrA[idx] = /* some computation */ });

auto e4 = gpuQueue.parallel_for<kernel_b>(sycl::range{1024}, e2, [=](sycl::id<1> idx) {
  devicePtrB[idx] = /* some computation */ });

auto e5 = gpuQueue.memcpy(dataA), devicePtrA, sizeof(int), e3);

auto e6 = gpuQueue.memcpy(dataB, devicePtrB, sizeof(int), e4);

e5.wait(); e6.wait();

sycl::free(devicePtrA, gpuQueue);
sycl::free(devicePtrB, gpuQueue);

* Dependencies are defined explicitly * We don't create dependencies between kernel functions but we do create dependencies on the data movement

#### Concurrent data flow with USM

auto devicePtrA = sycl::malloc_device<int>(1024, gpuQueue);
auto devicePtrB = sycl::malloc_device<int>(1024, gpuQueue);

auto e1 = gpuQueue.memcpy(devicePtrA, dataA, sizeof(int));
auto e2 = gpuQueue.memcpy(devicePtrB, dataB, sizeof(int));

auto e3 = gpuQueue.parallel_for<kernel_a>(sycl::range{1024}, e1, [=](sycl::id<1> idx) {
  devicePtrA[idx] = /* some computation */ });

auto e4 = gpuQueue.parallel_for<kernel_b>(sycl::range{1024}, e2, [=](sycl::id<1> idx) {
  devicePtrB[idx] = /* some computation */ });

auto e5 = gpuQueue.memcpy(dataA), devicePtrA, sizeof(int), e3);

auto e6 = gpuQueue.memcpy(dataB, devicePtrB, sizeof(int), e4);

e5.wait(); e6.wait();

sycl::free(devicePtrA, gpuQueue);
sycl::free(devicePtrB, gpuQueue);

* The dependencies of each chain of commands is independant of the other * The two kernels and their respective copies collectively can be executed in any order

#### Which should you choose?

When should you use the buffer/accessor or USM data management models?

Buffer/accessor data movement model
* If you want to guarantee consistency and avoid errors * If you want to iterate over your data flow quicker

USM data movement model
* If you need to use USM * If you want more fine grained control over data movement

## Questions

#### Exercise

Code_Exercises/Data_and_Dependencies/source

![SYCL](../common-revealjs/images/diamond_data_flow.png "SYCL")

Put together what you've seen here to create the above diamond data flow graph in either buffer/accessor or USM data management models