Wenyin 的拾萃园
CUDA/OpenCL Scientific Computation Library

This post is not yet complete; it will be updated in two weeks.

Recently I have been interested in exploiting the full potential of the GPU, using it to replace most of the floating-point computation in my scientific code, both in Python and in C/C++. This trend was sparked by vendors like NVIDIA, and the HPC community is now actively adapting to GPGPU.

Here is a list of the popular libraries I would like to introduce (as of Dec 12, 2020). In general, users need to install the CUDA Toolkit before trying any of the following packages.

For C/C++

PETSc

PETSc, pronounced PET-see (the S is silent), is a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations. It supports MPI, and GPUs through CUDA or OpenCL, as well as hybrid MPI-GPU parallelism. PETSc (sometimes called PETSc/Tao) also contains the Tao optimization software library.

ViennaCL

The ViennaCL author has shifted to supporting PETSc, so I would suggest that new users consider PETSc first.

OpenCL Libraries

The clMath libraries consist of four repositories: clBLAS, clFFT, clSPARSE, and clRNG.

CUDA Official Libraries

Math Libraries

For these math libraries, I would suggest that scientists and engineers use CuPy or Numba to enjoy GPGPU computation, because most of these libraries are already well wrapped by those Python packages.


Thrust

Thrust is a powerful library of parallel algorithms and data structures. Thrust provides a flexible, high-level interface for GPU programming that greatly enhances developer productivity. Using Thrust, C++ developers can write just a few lines of code to perform GPU-accelerated sort, scan, transform, and reduction operations orders of magnitude faster than the latest multi-core CPUs. For example, the thrust::sort algorithm delivers 5x to 100x faster sorting performance than STL and TBB.

Communication Libraries


Deep Learning Libraries


For Python

At the moment I have no motivation to move from Python to Julia; after all, Python's hardware compatibility, ease of use, and how smoothly it works with CUDA are beyond expectation. Of course, there is plenty of time; ten years from now, I might well be using Julia.

Python Virtual Environment

For Python users, it is much safer to handle CUDA-related packages inside a virtual environment. Since most of these packages are maintained mainly on PyPI, users are advised to learn some basic commands of the virtualenv package.

CuPy

Website | Github | Docs | Install Guide | Tutorial | Examples | API Reference | Forum

CuPy gives users a seamless experience combining NumPy-style code with CUDA power. Most code does not need to be modified if the NumPy-style code is already good enough.
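To illustrate the drop-in idea, here is a minimal device-agnostic sketch; the try/except fallback to NumPy is my own addition (not part of CuPy's API) so that the snippet also runs on machines without a GPU:

```python
# Device-agnostic sketch: use CuPy when available, otherwise fall back to NumPy.
try:
    import cupy as xp   # runs on the GPU when CUDA is available
except ImportError:
    import numpy as xp  # CPU fallback exposing the same array API

a = xp.arange(6).reshape(2, 3)      # [[0, 1, 2], [3, 4, 5]]
result = xp.sum(a * 2, axis=1)      # identical call in NumPy and CuPy
print(result)                       # [ 6 24]
```

The same line of code dispatches to GPU or CPU depending solely on which module was imported, which is exactly what makes the NumPy-to-CuPy migration so painless.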

Prerequisites

These components must be installed to use CuPy:

  • NVIDIA CUDA GPU with Compute Capability 3.0 or higher.

  • CUDA Toolkit: v9.0 / v9.2 / v10.0 / v10.1 / v10.2 / v11.0 / v11.1

  • Python: v3.5.1+ / v3.6.0+ / v3.7.0+ / v3.8.0+ / v3.9.0+

Python 3.9 is not yet officially supported by CuPy, as shown on the cupy page on pypi.org (as of Dec 12, 2020). Just be careful not to use an unusual Python or package version when working with CUDA.

Installation

The easiest way is to install the pre-built package matching your CUDA version.

# For users who have CUDA 11.1 installed
pip install cupy-cuda111
# Other packages for various CUDA versions 
# could be cupy-cuda90, cupy-cuda92, cupy-cuda110, etc.

Attention: do not use conda to install CuPy; that package has been outdated for years.

Users may want to fully exploit the power of CuPy by combining it with additional packages such as cuTENSOR, NCCL, and cuDNN; for more detailed installation help, refer to the official documentation.

Techniques

Modification on Numpy Code

CuPy's compatibility with NumPy has reached a remarkably polished level; in many places in my programs, simply swapping numpy for cupy works without any extra code.

# import numpy as np  # comment out numpy and replace it with cupy
import cupy as np

Usually no code changes are needed at all, but reading from and writing to disk still requires NumPy, so if file I/O is involved, the following lines need to be added.

# Transfer a NumPy array to the GPU
x_gpu = cupy.asarray(x_cpu)
# Transfer a CuPy array back to NumPy on the CPU
x_cpu = cupy.asnumpy(x_gpu)

Usually it is just a matter of wrapping the computation with cupy.asarray() and cupy.asnumpy(): move the arrays from CPU memory onto the GPU, run the computation, then move the results back.
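A minimal end-to-end sketch of this CPU-to-GPU-and-back round trip; the fallback branch is my own addition so the snippet also runs on CPU-only machines:

```python
import numpy as np

x_cpu = np.linspace(0.0, 1.0, 5)      # host array, e.g. just read from disk
try:
    import cupy
    x_gpu = cupy.asarray(x_cpu)       # host -> device copy
    y_gpu = cupy.sqrt(x_gpu)          # computation runs on the GPU
    y_cpu = cupy.asnumpy(y_gpu)       # device -> host copy, ready for file I/O
except ImportError:
    y_cpu = np.sqrt(x_cpu)            # CPU fallback producing the same values
print(y_cpu.shape)
```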

Custom Kernel

However, custom kernel functions still follow the C/CUDA approach of handing a source-code string to the nvcc compiler, which is not particularly elegant; for this I would recommend numba instead.
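For reference, a sketch of what this string-based workflow looks like with cupy.RawKernel; the kernel name double_it is hypothetical, and the NumPy fallback is my addition so the snippet runs without a GPU:

```python
import numpy as np

# A trivial element-wise doubling kernel, written as a CUDA C source string.
kernel_src = r'''
extern "C" __global__ void double_it(const float* x, float* y, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) y[i] = 2.0f * x[i];
}
'''

try:
    import cupy
    double_it = cupy.RawKernel(kernel_src, "double_it")
    x = cupy.arange(4, dtype=cupy.float32)
    y = cupy.empty_like(x)
    double_it((1,), (4,), (x, y, np.int32(4)))     # (grid, block, kernel args)
    result = cupy.asnumpy(y)
except ImportError:
    result = 2.0 * np.arange(4, dtype=np.float32)  # CPU stand-in, same values
print(result)                                      # [0. 2. 4. 6.]
```

The kernel source really is just an opaque string compiled at runtime, which is the inelegance mentioned above.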

numba

Website | Github

Accelerate Python Functions

Numba translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. Numba-compiled numerical algorithms in Python can approach the speeds of C or FORTRAN.

You don't need to replace the Python interpreter, run a separate compilation step, or even have a C/C++ compiler installed. Just apply one of the Numba decorators to your Python function, and Numba does the rest.

numba does not aim for CuPy's seamless NumPy integration, but it offers a Pythonic way to define custom CUDA kernels. Of course, numba's main strength is still CPU-side optimization: just adding a jit or cuda.jit decorator can substantially and automatically speed up execution, which is quite smart.
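A small sketch of the decorator-only workflow; the no-op fallback decorator is my own addition so the snippet still runs if numba is not installed:

```python
import numpy as np

try:
    from numba import njit            # njit = jit(nopython=True)
except ImportError:
    def njit(func):                   # no-op stand-in when numba is absent
        return func

@njit
def pairwise_sum(a):
    # An explicit loop: slow in pure Python, but compiled by numba
    # to machine code on the first call.
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i]
    return total

x = np.ones(1000)
print(pairwise_sum(x))                # 1000.0
```

No separate compilation step, no C compiler: the decorator alone is the entire interface, which is what the paragraph above means by "smart".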

Installation