Title: FlashAttention-2: The Next Leap in Language Models
The field of natural language processing has advanced rapidly in the past year. New language models such as GPT-4, MosaicML’s MPT, and Anthropic’s Claude have emerged with longer context windows, enabling applications such as long-document querying and story writing. However, scaling up the context length of Transformers is challenging because the time and memory cost of standard attention grows quadratically with sequence length. FlashAttention, an exact attention algorithm that reorders the computation to reduce memory traffic, gained popularity for accelerating attention without sacrificing accuracy. FlashAttention-2 now builds on it with substantially better performance.
Enhancements in FlashAttention-2:
1. Improved Parallelism: FlashAttention-2 uses better parallelism and work-partitioning strategies. In addition to parallelizing over the batch size and number of heads, it now also parallelizes over the sequence length dimension, which yields significant speedups for long sequences with small batch sizes or few heads, where the original scheme could not keep all of the GPU's compute units busy.
2. Efficient Work Partitioning: FlashAttention-2 also partitions work more efficiently between the warps within each thread block. By splitting Q across warps while keeping K and V accessible to all of them (instead of splitting K and V, as the first version did), warps no longer need to exchange intermediate results through shared memory, reducing synchronization and improving performance.
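The core idea underlying both points is that attention can be computed block by block with an "online" softmax, so the full score matrix never has to be materialized, and independent blocks of queries can be processed in parallel. The following is a minimal pure-Python sketch of that blockwise computation (not the actual CUDA kernel; function names and the toy list-of-lists layout are illustrative), checked against a naive reference implementation:

```python
import math

def naive_attention(Q, K, V):
    # Reference: out = softmax(Q K^T / sqrt(d)) V, computed row by row.
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block=2):
    # FlashAttention-style sketch: visit K/V in blocks, keeping a running
    # max and normalizer per query row (online softmax), so only one
    # block of scores exists at a time. Each query row is independent,
    # which is what allows parallelizing over the sequence length.
    d = len(Q[0])
    out = []
    for q in Q:
        m = float("-inf")        # running max of scores seen so far
        l = 0.0                  # running sum of exp(score - m)
        acc = [0.0] * len(V[0])  # unnormalized output accumulator
        for start in range(0, len(K), block):
            Kb, Vb = K[start:start + block], V[start:start + block]
            scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                      for k in Kb]
            m_new = max(m, max(scores))
            # Rescale previous partial results to the new max.
            scale = math.exp(m - m_new)
            l *= scale
            acc = [a * scale for a in acc]
            for s, v in zip(scores, Vb):
                w = math.exp(s - m_new)
                l += w
                acc = [a + w * vj for a, vj in zip(acc, v)]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

Because the rescaling keeps the running normalizer exact, the tiled version matches the naive one to floating-point precision while only ever holding one block of scores in memory.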
New Features of FlashAttention-2:
1. Expanded Head Dimensions: FlashAttention-2 supports head dimensions up to 256 (up from 128), making it compatible with models such as GPT-J, CodeGen, CodeGen2, and StableDiffusion 1.x. This extends its speedup and memory savings to more architectures.
2. Multi-Query Attention (MQA) and Grouped-Query Attention (GQA): FlashAttention-2 adds support for MQA and GQA, attention variants in which multiple query heads attend to the same head of key and value. Because fewer key/value heads need to be stored, these variants shrink the KV cache and increase inference throughput.
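The sharing pattern behind MQA and GQA can be sketched as a simple index mapping from query heads to key/value heads (a toy illustration; the function names and the per-layer cache-size formula are illustrative, not FlashAttention-2's API):

```python
def kv_head_for(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    # Map a query head to the key/value head it shares.
    #   n_kv_heads == n_q_heads -> standard multi-head attention (MHA)
    #   n_kv_heads == 1         -> multi-query attention (MQA)
    #   1 < n_kv_heads < n_q    -> grouped-query attention (GQA)
    assert n_q_heads % n_kv_heads == 0, "query heads must split into equal groups"
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def kv_cache_elems(n_kv_heads: int, seq_len: int, head_dim: int) -> int:
    # Elements stored in the KV cache per layer per sequence
    # (factor 2 covers both keys and values).
    return 2 * n_kv_heads * seq_len * head_dim
```

For example, a model with 32 query heads and 8 KV heads (GQA) stores a KV cache 4x smaller than the equivalent MHA model, which is where the inference-throughput gain comes from.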
FlashAttention-2 delivers substantial speedups over both its predecessor and standard attention implementations. Benchmarked on an A100 80GB SXM4 GPU, it is up to 2x faster than the original FlashAttention and up to 9x faster than a standard attention implementation in PyTorch, reaching up to 225 TFLOPs/s on A100 GPUs for end-to-end training of GPT-style models.
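Throughput figures like these are typically obtained by counting the floating-point operations attention performs and dividing by measured wall-clock time. A minimal sketch of that accounting (the function names are illustrative, and the formula counts only the two dominant matrix multiplies in the forward pass):

```python
def attention_fwd_flops(batch: int, n_heads: int, seq_len: int, head_dim: int) -> int:
    # Two matmuls dominate the attention forward pass:
    #   Q @ K^T : 2 * seq_len^2 * head_dim FLOPs
    #   P @ V   : 2 * seq_len^2 * head_dim FLOPs
    # per head and per batch element; softmax FLOPs are negligible.
    return 4 * batch * n_heads * seq_len ** 2 * head_dim

def achieved_tflops(flops: int, seconds: float) -> float:
    # Achieved throughput in TFLOPs/s given a measured runtime.
    return flops / seconds / 1e12
```

The quadratic `seq_len ** 2` term in this formula is exactly why long contexts are expensive, and why a kernel that sustains a higher fraction of the GPU's peak TFLOPs/s translates directly into faster long-context training.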
Future Applications and Collaborations:
FlashAttention-2 opens up possibilities for analyzing long books and reports, high-resolution images, audio, and video. The developers are working on broader device support, including H100 and AMD GPUs, as well as optimizations for new data types. Combining these low-level optimizations with high-level algorithmic changes could enable training AI models with even longer context, and a collaboration with compiler researchers aims to make such optimizations easier to program, helping usher in the next generation of language models.
FlashAttention-2 is a significant advance in efficient attention, offering substantial performance improvements. Its improved parallelism, work partitioning, and new features make it a valuable tool for training and serving long-context models. With ongoing development and collaborations, FlashAttention-2 holds considerable promise for the future of AI.