[ParallelComputing]CUDA & HIP & DPC++ & TBB Notes
keywords: ParallelComputing, CUDA, HIP, DPC++
Learn CUDA Programming, published by Packt
OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems
A Heterogeneous Parallel Processor for High-Speed Vision Chip
Numba supports CUDA GPU programming by directly compiling a restricted subset of Python code into CUDA kernels and device functions following the CUDA execution model.
This notebook is an attempt to teach beginner GPU programming in a completely interactive fashion. Instead of providing text with concepts, it throws you right into coding and building GPU kernels.
AMD Official Docs
AMD’s Performance Guide is a nice collection of tips on how to program the GCN and RDNA architectures efficiently.
AMD ROCm TensorFlow
Building PyTorch on ROCm
A Standards-Based, Cross-Architecture Language
Intel Data Parallel C++ Tutorial
Look ma, no CUDA! Programming GPUs with modern C++ and SYCL
Accelerating your C++ on GPU with SYCL
C++ Single-source Heterogeneous Programming for Acceleration Offload
GPU based Source
stdgpu: Efficient STL-like Data Structures on the GPU
Samples for CUDA developers that demonstrate features of the CUDA Toolkit.
Thin C++-flavored wrappers for the CUDA Runtime API
HIP: C++ Heterogeneous-Compute Interface for Portability
A C++ GPU Computing Library for OpenCL
SYCL Source (OpenCL Based)
Open Source Parallel STL implementation
Experimental fusion of triSYCL with Intel SYCL upstreaming effort into Clang/LLVM.
CPU based Source
TBB (CPU) Source
SIMD Instructions (CPU) Source
The Vector Class Library is a C++ tool that allows programmers to use Single Instruction Multiple Data (SIMD) instructions to process data in parallel
Vector class library, latest version
Data parallel C++ mathematical object library
Concurrent Data Structures
A C++ library of Concurrent Data Structures
Parallel Utils & Frameworks
Simple header-only implementation of “parallel_for” and “parallel_map” for C++11
Powerful multi-threaded coroutine dispatcher and parallel execution engine
Distributed Memory Dense Matrix Computations
Distributed-memory, arbitrary-precision, dense and sparse-direct linear algebra, conic optimization, and lattice reduction
ROCm Software Platform Repository
Tensors and Dynamic neural networks in Python with strong GPU acceleration
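The "parallel_for" and "parallel_map" entry under Parallel Utils & Frameworks names a pattern that is easy to sketch with nothing but C++11 and std::thread. The code below is only an illustration of that pattern, not the API of the linked library: the function names, signatures, the num_threads parameter, and the static chunking strategy are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (an assumption, not the linked library's API):
// call body(i) for every i in [begin, end), splitting the range into
// contiguous chunks and giving each chunk to its own std::thread.
template <typename Body>
void parallel_for(std::size_t begin, std::size_t end, Body body,
                  unsigned num_threads = std::thread::hardware_concurrency()) {
    assert(begin <= end && "parallel_for expects begin <= end");
    if (begin == end) return;
    if (num_threads == 0) num_threads = 1;  // hardware_concurrency() may report 0

    const std::size_t total = end - begin;
    // Never start more threads than there are iterations.
    num_threads = static_cast<unsigned>(std::min<std::size_t>(num_threads, total));
    const std::size_t chunk = (total + num_threads - 1) / num_threads;  // ceiling division

    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t lo = begin + t * chunk;
        const std::size_t hi = std::min(end, lo + chunk);
        if (lo >= hi) break;  // remaining chunks would be empty
        workers.emplace_back([lo, hi, &body] {
            for (std::size_t i = lo; i < hi; ++i) body(i);
        });
    }
    // Join before returning so that 'body' outlives every worker.
    for (auto& w : workers) w.join();
}

// Hypothetical helper: apply f to each element of 'in' in parallel.
// The result type must be default-constructible, and f must be safe to
// call from several threads at once.
template <typename T, typename F>
auto parallel_map(const std::vector<T>& in, F f)
    -> std::vector<decltype(f(in[0]))> {
    std::vector<decltype(f(in[0]))> out(in.size());
    // Each index is written by exactly one thread, so no locking is needed.
    parallel_for(0, in.size(), [&](std::size_t i) { out[i] = f(in[i]); });
    return out;
}

// Example use: square the numbers 0..7 in parallel.
std::vector<int> squares_example() {
    std::vector<int> in = {0, 1, 2, 3, 4, 5, 6, 7};
    return parallel_map(in, [](int x) { return x * x; });
}
```

Calling squares_example() from main returns the squares in input order, because every worker writes only to its own slice of the output. Static chunking keeps the sketch short; libraries such as TBB instead use work stealing, which balances load better when iterations take uneven time. With GCC or Clang on Linux, building may require the -pthread flag.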
"How often," he said, "does a man ruin his disciples by remaining always with them." ― Romain Rolland, Life of Vivekananda and the Universal Gospel