How to Optimize Mixed-Input Matrix Multiplication for AI Hardware Accelerators
AI-driven technologies are reshaping productivity and knowledge access. Many of these applications are built on large language models (LLMs), which demand specialized hardware to deliver compute power efficiently. One way to address the computational cost is to optimize memory usage.
Memory and compute consumption in LLMs is dominated by the weights used in matrix multiplication. Storing those weights in a narrower data type shrinks the footprint substantially: 8-bit integer (U8) weights take 4x less memory than single-precision (float32) and 2x less than half-precision (float16) or bfloat16.
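To make the savings concrete, here is a small back-of-the-envelope calculation. The 7B parameter count is a hypothetical example chosen only for illustration, not a figure from this work:

```cuda
#include <cstdio>

int main() {
    const double n_params = 7e9;  // hypothetical 7B-parameter model
    printf("float32 : %5.1f GB\n", n_params * 4 / 1e9);  // 4 bytes per weight
    printf("float16 : %5.1f GB\n", n_params * 2 / 1e9);  // 2 bytes per weight
    printf("int8    : %5.1f GB\n", n_params * 1 / 1e9);  // 1 byte per weight
    return 0;
}
```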
Previous work has shown that quantizing weights to 8-bit integers while keeping inputs in half precision is an effective way to increase efficiency with acceptable accuracy trade-offs. Exploiting this, however, requires an efficient implementation of mixed-input matrix multiplication (e.g., u8 weights multiplied by f16 inputs), which must be realized through software transformations.
In this blog post, we focus on mapping mixed-input matrix multiplication onto the NVIDIA Ampere architecture. An efficient implementation must address two problems in software, data type conversion and layout conformance; we show that both can be handled with minimal overhead, enabling performance close to the hardware's peak.
Modern AI hardware accelerators, such as Google's TPU and NVIDIA's GPU, include specialized processing elements that perform matrix multiplication natively in hardware. On NVIDIA Ampere Tensor Cores this is the mma instruction, which natively supports mixed precision (e.g., f16 inputs with f32 accumulation) but requires both input operands to have the same data type. Mixed-input matrix multiplication, where the two operands differ (such as u8 weights and f16 inputs), isn't supported by the hardware and needs to be implemented in software.
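For illustration, here is a minimal sketch (our own, not code from this work) of issuing that instruction for the m16n8k16 shape via CUDA inline PTX on sm_80. Note that both the A and B fragments must already hold f16 values in the warp-distributed layout the instruction defines, which is exactly why a u8 operand has to be converted and rearranged in software first:

```cuda
#include <cstdint>

// d = a * b + c for one m16n8k16 tile per warp. The fragments are assumed
// to be already distributed across the warp's 32 threads in the register
// layout the mma instruction requires.
__device__ void mma_f16_f32(float (&d)[4],
                            uint32_t const (&a)[4],  // 8 f16 values per thread
                            uint32_t const (&b)[2],  // 4 f16 values per thread
                            float const (&c)[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```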
Addressing mixed-input matrix multiplication therefore comes down to two software problems: converting the narrower operand to the compute data type, and rearranging the converted values into the register layout that the mma instruction expects (layout conformance). Our software techniques solve both efficiently, reducing latency and improving performance.
To minimize the overhead of data type conversion and layout conformance, we have implemented FastNumericArrayConvertor and FragmentShuffler. FastNumericArrayConvertor operates on 4xU8 values packed in a single 32-bit register and uses the byte-permute instruction to rearrange the bytes into two registers, replacing per-element conversion instructions and reducing the total instruction count of the data type conversion. FragmentShuffler handles layout conformance by shuffling data so that wider load operations can be used, increasing shared memory bandwidth utilization and reducing the total number of operations. Together, these software strategies deliver the enhanced performance of mixed-input matrix multiplication.
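A minimal sketch of the byte-permute conversion idea follows (standard CUDA intrinsics; the function names are ours for illustration, not the actual FastNumericArrayConvertor API). It exploits the fact that an f16 with bit pattern 0x64XX encodes the value 1024 + XX, so one permute plus one packed subtract converts two u8 values at a time, with no per-element convert instructions:

```cuda
#include <cuda_fp16.h>
#include <cstdint>

// Convert two of the four u8 values packed in `src` to a half2.
// `selector` picks which two bytes to use, interleaving each with the
// constant byte 0x64: the f16 bit pattern 0x64XX encodes 1024 + XX,
// so subtracting 1024 recovers the original byte value exactly.
__device__ half2 two_u8_to_half2(uint32_t src, uint32_t selector) {
    uint32_t biased = __byte_perm(src, 0x64646464u, selector);
    half2 h = *reinterpret_cast<half2*>(&biased);
    return __hsub2(h, __half2half2(__ushort_as_half(0x6400u)));  // minus 1024
}

// 4xU8 in one 32-bit register -> two registers holding four f16 values.
__device__ void u8x4_to_f16x4(uint32_t src, half2& lo, half2& hi) {
    lo = two_u8_to_half2(src, 0x4140u);  // bytes 0 and 1
    hi = two_u8_to_half2(src, 0x4342u);  // bytes 2 and 3
}
```

This sketch handles unsigned 8-bit values; signed int8 requires an extra bias step, and the exact register arithmetic inside FastNumericArrayConvertor may differ. The point is the structure of the trick: each permute-plus-subtract pair yields two converted f16 values at once.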
By addressing the computational challenges of mixed-input matrix multiplication, we can improve the efficiency and performance of AI-related applications, ultimately enhancing user experience and productivity.