Tensor Parallelism: Splitting Operations within a Layer
Pipeline parallelism splits a model “vertically,” layer by layer. Tensor parallelism instead splits specific operations within a layer “horizontally.” For many modern models, such as the Transformer, the main computational bottleneck is multiplying an activation batch matrix by a large weight matrix. This matrix multiplication can be parallelized either by computing independent dot products on different GPUs, or by having each GPU compute part of every dot product and combining the partial results afterwards. In the first approach, the weight matrix is sliced into evenly sized “shards,” each shard is assigned to a different GPU, and the full result is recovered by combining the per-shard outputs.
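The column-sharding variant can be sketched on a single machine, with plain NumPy arrays standing in for per-GPU shards; the function names here (`shard_columns`, `tensor_parallel_matmul`) are illustrative, not from any particular library:

```python
import numpy as np

def shard_columns(W, num_shards):
    """Split the weight matrix W into evenly sized column shards,
    one per simulated GPU."""
    return np.split(W, num_shards, axis=1)

def tensor_parallel_matmul(X, shards):
    """Each 'device' computes X @ W_i on its own shard; concatenating
    the partial outputs corresponds to an all-gather across GPUs."""
    partials = [X @ W_i for W_i in shards]
    return np.concatenate(partials, axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))   # activation batch
W = rng.standard_normal((16, 32))  # large weight matrix

shards = shard_columns(W, num_shards=4)
Y = tensor_parallel_matmul(X, shards)
assert np.allclose(Y, X @ W)  # identical to the unsharded product
```

Each shard's matmul is independent, so the only communication needed is the final gather of output columns.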
One example of this technique is Megatron-LM, which parallelizes the matrix multiplications within the Transformer’s self-attention and MLP layers. Another approach, PTD-P, combines tensor, data, and pipeline parallelism, assigning multiple non-consecutive layers to each device under a carefully designed pipeline schedule. Although this increases network communication, it reduces bubble overhead and improves overall throughput.
Sequence Parallelism: Parallelizing Inputs for Efficient Computation
Moreover, some workloads allow the input to the network to be parallelized along a dimension that offers a high ratio of parallel computation to cross-device communication. This idea, known as sequence parallelism, splits an input sequence along the time dimension into multiple sub-examples. Because each sub-example is processed separately, peak memory consumption drops proportionally and computation can be scheduled at a finer granularity.
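For operations that are applied independently per position (position-wise MLPs, for instance), splitting along the sequence dimension is exact. A hypothetical sketch, again simulating the chunked devices with NumPy:

```python
import numpy as np

def per_token_mlp(X, W):
    """A position-wise op: the same weights applied to every token
    (ReLU MLP used as a stand-in)."""
    return np.maximum(X @ W, 0.0)

def sequence_parallel(X, W, num_chunks):
    """Split the sequence (time) dimension into sub-examples, process
    each independently, then reassemble. Peak activation memory per
    device shrinks by roughly a factor of num_chunks."""
    chunks = np.split(X, num_chunks, axis=0)
    return np.concatenate([per_token_mlp(c, W) for c in chunks], axis=0)

rng = np.random.default_rng(2)
X = rng.standard_normal((12, 16))  # sequence_length x hidden_size
W = rng.standard_normal((16, 16))

result = sequence_parallel(X, W, num_chunks=4)
assert np.allclose(result, per_token_mlp(X, W))  # exact for per-token ops
```

Operations that mix information across positions (such as attention) need extra communication between chunks, which is where the compute-to-communication ratio mentioned above becomes the deciding factor.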