# Challenges Faced When Porting Ginkgo to the SYCL Ecosystem

Yu-Hsiang Tsai<sup>1</sup>, Terry Cojean<sup>1</sup>, Tobias Ribizel<sup>1</sup>, Hartwig Anzt<sup>1,2</sup>

<sup>1</sup>Karlsruhe Institute of Technology, <sup>2</sup>University of Tennessee

#### **Overview**

- Ginkgo
- Challenges and Workarounds
  - kernel attributes propagation rules changes
  - device\_code\_split per\_kernel and per\_source
  - devices with varying parameters
- Ginkgo Performance on Intel GPUs

# **Cross-Platform Sparse Linear Algebra Library**



# **Ginkgo for cross-platform**

Ginkgo is an open source C++ math library with focus on sparse linear algebra functionality on GPUs. It provides the same interface across different devices to users.





Library core contains architectureagnostic algorithm implementation;

#### Core

Library Infrastructure Algorithm Implementations

- Iterative Solvers
- Preconditioners
- ...

Library core contains architectureagnostic algorithm implementation;

Architecture-specific kernels execute the algorithm on target architecture;

#### Reference

**Reference kernels** 

- SpMV
- Solver kernels
- Precond kernels
- ...

Reference kernels are sequential kernels to check correctness of algorithm design and optimized kernels;

#### Core

Library Infrastructure Algorithm Implementations

- Iterative Solvers
- Preconditioners
- ...





# **Ginkgo supports oneAPI**

oneMKL has comprehensive support for dense linear algebra but features few basic routines for sparse linear algebra.

Ginkgo provides sparse linear algebra functionality and numerical algorithms beyond basic building blocks like SpMV. With Ginkgo, complete numerical simulation workflows can be realized: iterative solvers, preconditioners, algebraic multigrid, etc.

oneAPI allows to execute on GPUs and CPUs from Intel and oneDPL gives similar functions like Thrust.

## Ginkgo brings more functionality to oneAPI

Matrix: Dense, Csr, Coo, Ell, Sellp, Hybrid, fixed-blockCsr, SparseCsr

Solver: CG, BiCG, BiCGStab, CGS, FCG, (CB-)GMRES, IDR, Multigrid

Preconditioner: Jacobi, ISAI, (Par)ILU, (Par)IC

Mixed Precision support:

Mixed precision preconditioners Mixed precision GMRES (CB-GMERS) Mixed precision Multigrid

# **Challenges and Workarounds**

## Kernel attributes propagation rule changes

When SYCL 1.2.1 with Intel extension (before DPC++ 2022), the kernel attributes from inner components can be propagated to the top level.

Rely on it, Ginkgo introduces the cooperative group for DPC++. It allows that we set the subgroup\_size inside the kernel, and we can use it like CUDA cooperative group.

However, SYCL 2020 and DPC++ 2022 disabled this rule. It can only set the kernel attributes from the top call.

We also show our high interest about the sub-subgroup size

# SYCL 1.2.1: use subgroup like cooperative groups

template <unsigned Size, typename Group>

[[intel::reqd\_sub\_group\_size(Size)]] (const Group& group)

return detail::thread\_block\_tile<Size>(group);

template<unsigned subgroup\_size, typename ValueType>
void reduce(float \*a, sycl::nd\_item<3> item\_ct1) {
 auto subwarp = group::tiled\_partition<subgroup\_size>[]
 group::this\_thread\_block(item\_ct1)];
 auto local\_data = a[item\_ct1.get\_local\_id(2)];
 #pragma unroll
 for (int bitmask = 1; bitmask < subwarp.size(); bitmask <<= 1) {
 const auto remote\_data = subwarp.shfl\_xor(local\_data, bitmask);
 local\_data = local\_data + remote\_data;
 }
 a[item\_ct1.get\_local\_id(2)] = local\_data;
</pre>

create the subgroup with template the kernel attributes are propagated to outside

```
q_ct1.submit([&](sycl::handler &cgh) {
    cgh.parallel for(
```

```
range,
[=](svcl::n
```

[=](sycl::nd\_item<3> item\_ct1) {
 reduce<sg\_size>(d\_A, item\_ct1);

});

# SYCL 1.2.1: use subgroup like cooperative groups

template <unsigned Size, typename Group>

#### \_dpct\_inline\_

tiled\_partition
[[intel::reqd sub group size(Size)]] (const Group& group)

return detail::thread\_block\_tile<Size>(group);



template<unsigned subgroup\_size, typename ValueType>
void reduce(float \*a, sycl::nd\_item<3> item\_ct1) {
 auto subwarp = group::tiled\_partition<subgroup\_size>[]
 group::this\_thread\_block(item\_ct1)];
 auto local\_data = a[item\_ct1.get\_local\_id(2)];
 #pragma unroll
 for (int bitmask = 1; bitmask < subwarp.size(); bitmask <<= 1) {
 const auto remote\_data = subwarp.shfl\_xor(local\_data, bitmask);
 local\_data = local\_data + remote\_data;
 }
</pre>

a[item\_ct1.get\_local\_id(2)] = local\_data;

create the subgroup with template the kernel attributes are propagated to outside

```
q_ct1.submit([&](sycl::handler &cgh) {
    cgh.parallel_for(
        range,
        [=](sycl::nd_item<3> item_ct1) {
        reduce<sg_size>(d_A, item_ct1);
        ]);
});
```

## SYCL 2020: decide the subgroup size before using it



The kernel attributes can not be propagated. Need to put it in the top call. q\_ct1.submit([&](sycl::handler &cgh) {
 cgh.parallel\_for(
 range,
 [=](sycl::nd\_item<3> item\_ct1) {
 reduce<sg\_size>(d\_A, item\_ct1);
 ]);
});

### Why we need to adjust the subgroup size

The performance of subset reduction on 10,000,000 elements on Gen12 GPU (Intel(R) Iris(R) Xe MAX Graphics - Gen12LP) "float per 32 elements" means using float as the elements type and performing reduction 32 by 32 elements



### Auto generate selection

We need to generate the kernels for different situation in compile time and then use the desired one to fit the current situation in runtime.

possible configs list = <A, B, C, ...>

#### Auto generate selection

We need to generate the kernels for different situation in compile time and then use the desired one to fit the current situation in runtime.

possible configs list = <A, B, C, ...>

Compile time: template expanding is in compile time, so the kernel<A>, kernel <B> are already compiled



#### Auto generate selection

We need to generate the kernels for different situation in compile time and then use the desired one to fit the current situation in runtime.

possible configs list = <A, B, C, ...>

Compile time: template expanding is in compile time, so the kernel<A>, kernel <B> are already compiled



Run time: we generate the nested if-else function. We pass the desired config into the function in runtine, and the nested conditions run the matched kernel.

# Selection example - usage

```
21 template <typename ValueType>
22
23
   void reduce_add_array(shared_ptr <DpcppExecutor> exec, ...) {
       auto queue = exec->get_queue();
       // get first cfg will return the first valid number in the first input array
24
25
26
       const auto cfg = get_first_cfg(cfg1d_array, [&](int cfg){
           if (cfg_t::decode<0>(cfg) in exec->get_subgroup_list()
27
               and cfg_t::decode<1>(cfg) < exec->get_max_workgroup()) {
28
               return true;
29
30
      });
      // generate the corresponding grid and block according to the cfg
31
32
33
       // each block performs reduction on partial array and results in a block-size array
       reduction_call(cfg, grid, block, 0, queue, ...);
34
       // one block performs reduction on the block-size array
35
36
       reduction_call(cfg, 1, block, 0, queue, ....);
```

Choose the valid configiration and pass it to function.



DPC++ provides different way to split the device code: **per\_kernel** or

per\_source

template <typename VT>
void kernel\_1(){}

template <typename VT>
void kernel 2(){}

device\_code

: dependency or kernel essential components



essential components

DPC++ provides different way to split the device code: per\_kernel or



per\_kernel

# per\_kernel

Using **per\_kerne1** will make each kernel instantiation in different units.

Pro:

- Putting invalid kernel is okay.
- JIT compilation time comes with its own kernel only, so its JIT relatively faster than

#### per\_source

Con:

- Give more complexity to the compiler because each instantiation needs to be complete
- Compilation time is long
- Dependency duplication
- It will make the library big especially for debug build.
- The corresponding error is

It will throw relocation truncated to fit: R\_X86\_64\_GOTPCREL.... and PC-relative offset overflows in PLT entry ...

The error makes sense because each instantiation is isolated and complete.

# device\_code\_split

DPC++ provides different way to split the device code: per\_kernel or



#### per\_source

Pro:

- Reduce the size of debug library.
- Compile faster than **per\_kernel**.
- Reuse the kernel essential part or dependency

Con:

- All kernel instantiations need to be valid on the device.
- Takes more time on the first kernel of each device source file.

Issue: Too many kernels lead OOM issue on CPU. GPU does not face this issue. With the Intel team, we already submitted this issue and they are working on it

For example, we have a function which needs to select valuetype, workgroup size, subgroup size, (virtual) sub-subgroup size. It gives ~360 kernels in one function and leads this issue.

## **Devices with varying parameters**

CPU can support 4, 8, 16, 32, 64 subgroup size and larger max workgroup size than 1024.

(32, 64 can be used after one of DPC++ 2021 release.)

GPU can support 8, 16, 32 subgroup size. However, different GPUs support different max workgroup size like 256, 512.

Gen9 Integrated GPU uses 256 as max workgroup size. Gen12 Integrated GPU/Gen12LP Discrete GPU use 512 as max workgroup size.

# Changes from per\_kernel to per\_source

Originally, we went for the **per\_kernel** way which instantiated all possible kernels into Ginkgo library.

However, we faced the too large debug library issue and we need to support the AOT compilation in the future.

Thus, we need to make subgroup size and workgroup size adjustable such that we can use the valid configuration for using Ahead of Time(AOT) compilation or **per\_source**.

## CMake adjusts the subgroup/workgroup size list

Ginkgo introduces the cmake options for subgroup/workgroup size.

-DGINKGO\_DPCPP\_WORKGROUPS=256,512

-DGINKGO\_DPCPP\_SUBGROUPS=8,16,32

Ginkgo will produce the pair from these two lists.

<8, 256>, <16, 256>, <32, 256>, <8, 512>, <16, 512>, <32, 512>

#### **Two level selection - split device and kernel selection**

We use two levels selection to generate the corresponding kernels.

The first level only chooses which device attributes to use.

Then, the second level chooses which kernel arguments to run.

Note. The kernel is invalid only when the kernel is compiled with invalid subgroup or workgroup size. The local memory larger than the allowed size of device is not considered invalid in JIT.

## **Readability - type for device attributes**

Originally, we use ConfigSet to embed the device attributes and the kernel arguments together. However, it leads some challenges when reading the codes.

```
wg_size = Config::decode<0>(cfg);
sg size = Config::decode<1>(cfg);
```

Because we use two level selection, those attributes are splitted now. We can also transfer the device attributes to type information.

```
Cfg::workgroup_size;
Cfg::subgroup_size;
```

We support both in the selection phase, but we will transfer it to the corresponding type in the device kernels.

## **Ginkgo Performance on Intel GPUs**

#### **Benchmark detail**

We use two Intel GPUs from devcloud in the following benchmark.

- Intel Integrated GPU: Intel(R) UHD Graphics P630, which is a integrated gpu with Intel(R) Xeon(R) E-2146G CPU (Gen9)
   It supports float and double precision.
- Intel Discrete GPU: Intel(R) Iris(R) Xe MAX Graphics, which is a Gen12 discrete GPU.

It only supports float precision. (Gen12LP)

We write the GPU kernels. We do not use OpenMP offloading for GPU.

## Ginkgo SpMV performance on Intel Gen9/Gen12 GPU



#nonzeros

Intel Gen9 GPU [double precision results] Intel Gen12LP GPU [single precision results]

#nonzeros

Sparse Matrix vector Multiplication (SpMV) kernel, performance comparison of Ginkgo functionality and oneMKL kernels.



#### Ginkgo solvers performance on Intel Gen9/Gen12 GPU



Intel Gen9 GPU [double precision results]



[single precision results]

#### Ginkgo preconditioner effectiveness on Gen12LP GPU



Effectiveness of Ginkgo's preconditioners on Intel GPUs in terms of convergence improvement [iteration count] and time-to-solution acceleration. [solver runtime]

# Ginkgo Algebraic Multigrid (AMG)



Standalone AMG on Gen9

Ginkgo's Algebraic Multigrid (AMG) method in double precision and mixed precision mode running on Intel GPUs.

## Conclusion

We use the oneAPI ecosystem to prepare Ginkgo for Intel GPUs

Ginkgo provides comprehensive sparse linear algebra support for devices supporting DPC++/SYCL including intel GPUs.

We use oneAPI to make Ginkgo be a SYCL-available library. We demonstrate our workaround for some issues on SYCL.

We are looking forward to the addition of the sub-subgroup feature, more oneDPL functionality, and adoption of oneDPL by other architectures.

#### **Thanks!**

Y

ZX







