keywords: ParallelComputing, CUDA, HIP, DPC++


CUDA Books

Learn CUDA Programming, published by Packt

SYCL Books

Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL


Heterogeneous Computing

Heterogeneous computing

OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems

A Heterogeneous Parallel Processor for High-Speed Vision Chip


Numba supports CUDA GPU programming by directly compiling a restricted subset of Python code into CUDA kernels and device functions following the CUDA execution model.

This notebook is an attempt to teach beginner GPU programming in a completely interactive fashion. Instead of providing text with concepts, it throws you right into coding and building GPU kernels.

AMD Official Docs

AMD’s Performance Guide is a nice collection of tips on how to program the GCN and RDNA architectures efficiently.

ROCm Docs

AMD ROCm Tensorflow


Building PyTorch on ROCm

DPC++ Docs

A Standards-Based, Cross-Architecture Language

Intel Data Parallel C++ Tutorial


Look ma, no CUDA! Programming GPUs with modern C++ and SYCL

Accelerating your C++ on GPU with SYCL

C++ Single-source Heterogeneous Programming for Acceleration Offload

GPU based Source

Cross-Platform Frameworks

stdgpu: Efficient STL-like Data Structures on the GPU

CUDA Source

Samples for CUDA developers that demonstrate features of the CUDA Toolkit.

Thin C++-flavored wrappers for the CUDA Runtime API

HIP Source

HIP: C++ Heterogeneous-Compute Interface for Portability

OpenCL Source

A C++ GPU Computing Library for OpenCL

SYCL Source (OpenCL Based)

Open Source Parallel STL implementation

Experimental fusion of triSYCL with Intel SYCL upstreaming effort into Clang/LLVM.

CPU based Source

TBB (CPU) Source

Official Threading Building Blocks (TBB) GitHub repository.

SIMD Instructions (CPU) Source

The Vector Class Library is a C++ tool that allows programmers to use Single Instruction Multiple Data (SIMD) instructions to process data in parallel.

Vector class library, latest version

Data parallel C++ mathematical object library

Concurrent Data Structures

A C++ library of Concurrent Data Structures

Parallel Utils & Frameworks

Simple header-only implementation of “parallel_for” and “parallel_map” for C++11
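To give a feel for what a header-only "parallel_for" / "parallel_map" pair can look like in plain C++11, here is a minimal sketch built on std::thread. It is not the API of the library listed above: the function names, signatures, and the static chunking strategy are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (illustration only, not the linked library's API).
// Calls body(i) for every i in [begin, end). The range is split into
// contiguous chunks, and each chunk runs on its own std::thread.
template <typename Body>
void parallel_for(std::size_t begin, std::size_t end, Body body,
                  unsigned num_threads = std::thread::hardware_concurrency()) {
    if (begin >= end) return;
    if (num_threads == 0) num_threads = 1;  // hardware_concurrency() may return 0

    const std::size_t total = end - begin;
    const std::size_t workers = std::min<std::size_t>(num_threads, total);
    const std::size_t chunk = (total + workers - 1) / workers;  // ceiling division

    std::vector<std::thread> threads;
    threads.reserve(workers);
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t lo = begin + w * chunk;
        const std::size_t hi = std::min(end, lo + chunk);
        if (lo >= hi) break;  // trailing chunks can be empty after rounding up
        threads.emplace_back([lo, hi, &body] {
            for (std::size_t i = lo; i < hi; ++i) body(i);
        });
    }
    // All workers are joined before returning, so capturing body by
    // reference in the worker lambdas is safe.
    for (auto& t : threads) t.join();
}

// Hypothetical helper: applies f to every element of in and returns the
// results in the same order. Each index is written by exactly one thread.
template <typename T, typename F>
auto parallel_map(const std::vector<T>& in, F f)
    -> std::vector<decltype(f(in[0]))> {
    std::vector<decltype(f(in[0]))> out(in.size());
    parallel_for(0, in.size(), [&](std::size_t i) { out[i] = f(in[i]); });
    return out;
}
```

A call such as parallel_map(values, [](int x) { return x * x; }) returns the squared values in input order. The sketch assumes the body is safe to run concurrently for different indices, and on most toolchains it needs to be built with thread support enabled (for example -pthread with GCC or Clang).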

Powerful multi-threaded coroutine dispatcher and parallel execution engine


Distributed Memory Dense Matrix Computations

Distributed-memory, arbitrary-precision, dense and sparse-direct linear algebra, conic optimization, and lattice reduction


ROCm Platform

ROCm Software Platform Repository

Tensors and Dynamic neural networks in Python with strong GPU acceleration

"How often," he said, "does a man ruin his disciples by remaining always with them." ― Romain Rolland, Life of Vivekananda and the Universal Gospel