## Learning Objectives
* Learn about tiling using local memory
* Learn about how to synchronize work-groups
#### Cost of accessing global memory
* As we covered earlier global memory is very expensive to access.
* Even with coalesced global memory access if you are accessing the same elements multiple times that can be expensive.
* Instead you want to cache those values in a lower latency memory.
#### Why are image convolutions good on a GPU?

* Looking at the image convolution example.
* For each output pixel we are reading up to NxM pixels from the input image, where N and M are the dimensions of the filter.
* This means each input pixel is being read up to NxM times:
* 3x3 filter: up to 9 ops.
* 5x5 filter: up to 25 ops.
* 7x7 filter: up to 49 ops.
* If each of these operations is a separate load from global memory this becomes very expensive.
#### Using local memory

* The solution is local memory.
* Local memory is generally on-chip and doesn't have a cache as it's managed manually so is much lower latency.
* Local memory is a smaller dedicated region of memory per work-group.
* Local memory can be used to cache, allowing us to read from global memory just once and then read from local memory instead, often referred to as a scratchpad.
#### Tiling

* The iteration space of the kernel function is mapped across multiple work-groups.
* Each work-group has it's own region of local memory.
* You want to split the input image data into tiles, one for each work-group.
#### Local accessors
auto scratchpad = sycl::accessor<int, 1, sycl::access::target::local>(sycl::range{workGroupSize}, cgh);
* Local memory is allocated via an `accessor` with the `access::target::local` access target.
* Unlike regular `accessor`s they are not created with a `buffer`, they allocate memory per work-group for the duration of the kernel function.
* The `range` provided is the number of elements of the specified type to allocate per work-group.
#### Synchronization
* Local memory can be used to share partial results between work-items.
* When doing so it's important to synchronize between writes and read to memory to ensure all work-items have reached the same point in the program.
#### Synchronization

* Remember that work-items are not guaranteed to all execute at the same time (in parallel).
#### Synchronization

* A work-item can share results with other work-items via local (or global) memory.
#### Synchronization

* This means it's possible for a work-item to read a result that hasn't been written to yet.
* This creates a data race.
#### Synchronization

* This problem can be solved with a synchronization primitive called a work-group barrier.
#### Synchronization

* When a work-group barrier is inserted work-items will wait until all work-items in the work-group have reached that point.
#### Synchronization

* Only then can any work-items in the work-group continue execution.
#### Synchronization

* So now you can be sure that all of the results that you want to read have been written to.
#### Synchronization

* However note that this does not apply across work-group boundaries.
* So if you write in a work-item of one work-group and then read it in a work-item of another work-group you again have a data race.
* Furthermore, remember that work-items can only access their own local memory and not that of any other work-groups.
#### Group_barrier
sycl::group_barrier(item.get_group());
* Work-group barriers can be invoked by calling `group_barrier` and passing a `group` object.
* You can retrieve a `group` object representing the current work-group by calling `get_group` on an `nd_item`.
* Note this requires the `nd_range` variant of `parallel_for`.
#### Local memory image convolution performance

#### Exercise
Code_Exercises/Local_Memory_Tiling/source
Use local memory to cache a tile of the input image data per work-group.