Parallel Computing
Introduction
Some Useful Parallel Algorithms
Reduce
Generates a single element from a list of elements. Can be parallelized if the reduction operator is such that it operates on two input elements (binary) and the order of operation does not matter (associativity). Example operators: Addition, multiplication, logical OR, bitwise AND. Parallelization can be achieved by working on pairs of values at a time. N elements can then be processed in log2(N) steps.
Scan
From an input list generates an output list in which the output element at position i is the result of applying a binary, associative operation to all the input elements at positions j < i (exclusive scan) or j <= i (inclusive scan). There are two popular parallel algorithms:
Hillis & Steele Scan
The figure below shows the inclusive variant of the Hillis & Steele algorithm.
Blelloch Scan
Although this exclusive scan algorithm is more complicated and requires twice as many steps than the Hillis & Steele algorithm, for large enough input arrays it requires fewer (2N vs. N*log(N)) operations and is therefore more work efficient.
Histogram
A histogram is a discrete distribution chart that shows us how many items of a data set fall into given consecutive and non-overlapping value ranges. Usually the ranges have equal length. For example, in image processing the histogram of a digital image shows how the intensity values of its pixels are distributed. Parellelization can be achieved by dividing the image into equally-sized blocks, for which the compute units create a local histogram, and then reducing the multiple block histograms into a single histogram.
File:Parallel computing histogram.png
Popular Platforms
OpenCL
OpenCL (https://www.khronos.org/opencl) is an API dedicated to generic parallel computing and can distribute compute kernels to CPUs and/or GPUs.
CUDA
Nvidia's parallel programming language and runtime for use with Nvidia stream processor hardware (GPUs or Tesla cards).
Metal
Metal is developed by Apple (https://developer.apple.com/metal) and is mostly a graphics-focused API but also provides convenience functions for running compute kernels without the hassle of configuring a rendering pipeline.
OpenGL / Vulkan / DirectX
These APIs are primarily for hardware-accelerated graphics rendering on consumer graphics cards. They are not designed for general purpose parallel computing. Nevertheless, graphics cards are becoming more capable in that regard and are still the only affordable option for most people. In these graphics-oriented APIs compute kernels are called shader programs. Shader programming languages include many graphics-related convenience functions and can be called in either the vertex processing stage or the fragment processing stage. See GPGPU on Wikipedia for more information.