AI-driven technologies are weaving themselves into the fabric of our daily routines, with the potential to enhance our access to knowledge and boost our overall productivity. The backbone of these applications lies in large language models (LLMs). LLMs are memory-intensive and typically require specialized hardware accelerators to efficiently deliver tens of exaflops of computing power. This blog post shows how we can start addressing the computational challenges by utilizing memory more effectively.

The bulk of an LLM’s memory and compute are consumed by weights in matrix multiplication operations. Using narrower *data types* reduces memory consumption. For example, storing weights in the 8-bit integer (i.e., U8 or S8) data type reduces the memory footprint by 4× relative to single-precision (F32) and 2× relative to half-precision (F16) or bfloat16 (BF16). Furthermore, previous work has shown that running an LLM’s matrix multiplications with *weights* in S8 and *input* in F16 (preserving the higher precision of the user input) is an effective method for increasing efficiency with acceptable trade-offs in accuracy. This technique is known as *weight-only quantization* and requires an efficient implementation of matrix multiplication with *mixed inputs*, e.g., half-precision input multiplied by 8-bit integer weights. Hardware accelerators, including GPUs, support a fixed set of data types, and thus mixed-input matrix multiplication requires software transformations to map to the hardware operations.

To that end, in this blog we focus on mapping mixed-input matrix multiplication onto the NVIDIA Ampere architecture. We present software techniques addressing data type conversion and layout conformance to map mixed-input matrix multiplication efficiently onto hardware-supported data types and layouts. Our results show that the overhead of additional work in software is minimal and enables performance close to the peak hardware capabilities. The software techniques described here are released in the open-source NVIDIA/CUTLASS repository.

Memory footprint for a 175B-parameter LLM with various data types.
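
The ratios quoted above follow directly from the size of each data type. As a rough illustration, the Python sketch below computes weight storage for the 175B-parameter model named in the caption. It counts weights only (no activations, caches, or runtime overhead) and takes 1 GB to mean 10^9 bytes; both are simplifying assumptions, and the helper name `weight_footprint_gb` is ours.

```python
# Bytes per parameter for the data types named in the text.
BYTES_PER_PARAM = {"F32": 4, "F16": 2, "BF16": 2, "U8": 1, "S8": 1}


def weight_footprint_gb(num_params, dtype):
    """Weight storage in gigabytes (1 GB = 10**9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 175 * 10**9  # the 175B-parameter model from the caption

footprints = {d: weight_footprint_gb(NUM_PARAMS, d) for d in BYTES_PER_PARAM}
for dtype, gb in footprints.items():
    print(f"{dtype:>4}: {gb:6.0f} GB")
```

Under these assumptions, 8-bit weights take a quarter of the F32 footprint and half of the F16 or BF16 footprint.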

## The matrix-multiply-accumulate operation

Modern AI hardware accelerators such as Google’s TPU and NVIDIA’s GPU multiply matrices natively in hardware by targeting Tensor Cores, which are specialized processing elements that accelerate matrix operations, particularly for AI workloads. In this blog, we focus on NVIDIA Ampere Tensor Cores, which provide the *matrix-multiply-accumulate* (`mma`) operation. For the rest of the blog, `mma` refers to the Ampere Tensor Core operation. The supported data types, shapes, and data layout of the two input matrices (called operands) for the `mma` operation are fixed in hardware. This means that matrix multiplications with various data types and larger shapes are implemented in software by tiling the problem onto hardware-supported data types, shapes, and layouts.

The Tensor Core `mma` operation is defined by specifying two input matrices (e.g., *A* & *B*, shown below) to produce a result matrix, *C*. The `mma` operation natively supports mixed-precision. *Mixed-precision Tensor Cores* allow mixing input (*A* and *B*) data type with the result (*C*) data type. In contrast, *mixed-input* matrix multiplication involves mixing the input data types, and it is not supported by the hardware, so it needs to be implemented in the software.

Tensor Core operation of M-by-N-by-K on input matrix A of M-by-K and matrix B of K-by-N produces output matrix C of M-by-N.
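
To make the tiling idea concrete, here is a minimal pure-Python sketch that builds a larger matrix multiplication out of fixed-size multiply-accumulate tiles. The tile shape 16-by-8-by-16 is an assumption chosen for illustration, and `mma_tile` is a hypothetical stand-in for the hardware operation, not a real API. The sketch ignores data types, registers, and warp layouts entirely.

```python
import random

TILE_M, TILE_N, TILE_K = 16, 8, 16  # assumed tile shape, for illustration only


def mma_tile(A, B, C, m0, n0, k0):
    """Stand-in for one hardware mma: C[tile] += A[tile] @ B[tile]."""
    for i in range(m0, m0 + TILE_M):
        for j in range(n0, n0 + TILE_N):
            acc = C[i][j]
            for k in range(k0, k0 + TILE_K):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def tiled_matmul(A, B):
    """Multiply an M-by-K matrix by a K-by-N matrix using fixed-size tiles."""
    M, K, N = len(A), len(B), len(B[0])
    assert M % TILE_M == 0 and N % TILE_N == 0 and K % TILE_K == 0
    C = [[0] * N for _ in range(M)]
    for m0 in range(0, M, TILE_M):
        for n0 in range(0, N, TILE_N):
            for k0 in range(0, K, TILE_K):  # accumulate partial sums along K
                mma_tile(A, B, C, m0, n0, k0)
    return C


def reference_matmul(A, B):
    """Plain triple-loop product used to check the tiled version."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


random.seed(0)
M, N, K = 32, 16, 48
A = [[random.randint(-3, 3) for _ in range(K)] for _ in range(M)]
B = [[random.randint(0, 255) for _ in range(N)] for _ in range(K)]
C = tiled_matmul(A, B)
```

Each call to `mma_tile` accumulates into the same output tile as the loop walks along K, which is the "accumulate" part of matrix-multiply-accumulate.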

## Challenges of mixed-input matrix multiplication

To simplify the discussion, we restrict ourselves to a specific example of mixed-input matrix multiplication: F16 for user input and U8 for the model weights (written as F16 * U8). The techniques described here work for various combinations of mixed-input data types.

A GPU programmer can access a hierarchy of memory, including global memory, shared memory, and registers, which are arranged in order of decreasing capacity but increasing speed. NVIDIA Ampere Tensor Core `mma` operations consume input matrices from registers. Furthermore, input and output matrices are required to conform to a layout of data within a group of 32 threads known as a *warp*. The supported data type *and* layout within a warp are fixed for an `mma` operation, so to implement mixed-input multiplication efficiently, it is necessary to solve the challenges of data type conversion and layout conformance in software.

### Data type conversion

The `mma` operation requires two input matrices with the same data type. Thus, mixed-input matrix multiplication, where one of the operands is stored in U8 in global memory and the other in F16, requires a data type conversion from U8 to F16. The conversion brings both operands to F16, mapping the *mixed-input* matrix multiplication to hardware-supported *mixed-precision* Tensor Cores. Given the large number of weights, there are a large number of such conversions, and our techniques show how to reduce their latency and improve performance.

### Layout conformance

The `mma` operation also requires the layout of the two input matrices, within the registers of a warp, to be conformant with the hardware specification. The layout for the input matrix *B* of U8 data type in mixed-input matrix multiplication (F16 * U8) needs to conform with the converted F16 data type. This is called *layout conformance* and needs to be achieved in the software.

The figure below shows an `mma` operation consuming matrix *A* and matrix *B* from registers to produce matrix *C* in registers, distributed across one warp. The thread *T0* is highlighted and zoomed in to show that the weight matrix *B* goes through data type conversion and needs layout conformance to map to the hardware-supported Tensor Core operation.

## Software strategies addressing challenges

A typical data type conversion involves a sequence of operations on 32-bit registers, shown below. Each rectangular block represents a register, and the adjoining text names the operation applied to it. The entire sequence shows the conversion from 4xU8 to 2x(2xF16). The sequence involves roughly 10 operations.
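
As a rough picture of what such a sequence does, the Python sketch below emulates a straightforward conversion on one 32-bit register: unpack each byte with a shift and a mask, convert it to F16, and pack pairs of F16 values into two output registers. The exact instruction sequence and operation count in the figure may differ; the helper names here are ours, and the pairing of values into registers is an assumption.

```python
import struct


def f16_bits(value):
    """Return the 16-bit IEEE half-precision encoding of value."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def f16_value(bits):
    """Decode a 16-bit IEEE half-precision encoding."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def convert_4xu8_naive(reg):
    """Convert one 32-bit register of 4xU8 into two registers of 2xF16."""
    halves = []
    for i in range(4):
        byte = (reg >> (8 * i)) & 0xFF        # shift + mask: unpack one U8
        halves.append(f16_bits(float(byte)))  # integer-to-float conversion
    lo = halves[0] | (halves[1] << 16)        # pack the first two F16 values
    hi = halves[2] | (halves[3] << 16)        # pack the last two F16 values
    return lo, hi


def unpack_2xf16(reg):
    """Read back the two F16 values held in a 32-bit register."""
    return [f16_value(reg & 0xFFFF), f16_value(reg >> 16)]


reg = 0xFF800A01  # bytes from least significant: 1, 10, 128, 255
lo, hi = convert_4xu8_naive(reg)
print(unpack_2xf16(lo), unpack_2xf16(hi))
```

Every byte is handled on its own, so the cost grows with the number of unpack, convert, and pack steps per value.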

There are many ways of achieving layout conformance. Two of the existing solutions are:

- *Narrower bitwidth shared memory loads*: In this approach, threads issue narrow bitwidth memory loads moving the U8 data from shared memory to registers. This results in *two* 32-bit registers, with each register containing 2xF16 values (shown above for the matrix *B*’s thread *T0*). The narrower shared memory load achieves layout conformance directly into registers without needing any shuffles; however, it does not utilize the full shared memory bandwidth.
- *Pre-processing in global memory*: An alternative strategy involves rearranging the data within the global memory (one level above the shared memory in memory hierarchy), allowing wider shared memory loads. This approach maximizes the shared memory bandwidth utilization and ensures that the data is loaded in a conformant layout directly in the registers. Although the rearrangement process can be executed offline prior to the LLM deployment, ensuring no impact on the application performance, it introduces an additional, non-trivial hardware-specific pre-processing step that requires an extra program to rearrange the data. NVIDIA/FasterTransformer adopts this method to effectively address layout conformance challenges.

## Optimized software strategies

To further optimize and reduce the overhead of data type conversion and layout conformance, we have implemented `FastNumericArrayConvertor` and `FragmentShuffler`, respectively.

`FastNumericArrayConvertor` operates on 4xU8 in 32-bit registers without unpacking individual 1xU8 values. Furthermore, it uses less expensive arithmetic operations, which reduces the number of instructions and increases the speed of the conversion.

The conversion sequence for U8-to-F16 is shown below. The operations use packed 32-bit registers, avoiding explicit unpacking and packing. `FastNumericArrayConvertor` uses the `permute byte` operation to rearrange the bytes of 4xU8 into two registers. Additionally, `FastNumericArrayConvertor` does not use expensive integer to floating-point conversion instructions and employs vectorized operations to obtain the packed results in *two* 32-bit registers containing 2x(2xF16) values. The `FastNumericArrayConvertor` for U8-to-F16 uses approximately six operations, a 1.6× reduction relative to the approach shown above.

`FastNumericArrayConvertor` utilizes `permute byte` and packed arithmetic, reducing the number of instructions in the data type conversion.
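
One well-known way to convert U8 to F16 without an integer-to-float instruction relies on the F16 bit pattern `0x6400`, which encodes 1024.0. Writing a byte *b* into the low mantissa bits gives `0x64bb`, which encodes 1024 + *b*, so a single packed subtraction of 1024.0 recovers *b* in both halves of a register. The Python sketch below emulates that idea. Whether it matches the `FastNumericArrayConvertor` sequence exactly is an assumption: `prmt` here is a simplified model of a byte-permute instruction, the selector constants are illustrative, and the operation count is not meant to match the figure.

```python
import struct

MAGIC = 0x64006400  # two packed F16 values, each equal to 1024.0


def f16_bits(value):
    """Return the 16-bit IEEE half-precision encoding of value."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def f16_value(bits):
    """Decode a 16-bit IEEE half-precision encoding."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def prmt(a, b, selector):
    """Simplified byte permute: each selector nibble picks one of the eight
    bytes in {a: bytes 0-3, b: bytes 4-7}. The sign-replication modes of a
    real permute instruction are not modeled."""
    pool = (b << 32) | a
    out = 0
    for i in range(4):
        idx = (selector >> (4 * i)) & 0x7
        out |= ((pool >> (8 * idx)) & 0xFF) << (8 * i)
    return out


def sub_f16x2(x, y):
    """Packed subtract: treat each 32-bit register as two F16 lanes."""
    out = 0
    for shift in (0, 16):
        diff = f16_value((x >> shift) & 0xFFFF) - f16_value((y >> shift) & 0xFFFF)
        out |= f16_bits(diff) << shift
    return out


def convert_4xu8_fast(reg):
    """4xU8 in one register -> 2x(2xF16) in two registers, no per-byte unpack."""
    lo = prmt(reg, MAGIC, 0x7150)  # bytes b0, 0x64, b1, 0x64 -> 1024+b0, 1024+b1
    hi = prmt(reg, MAGIC, 0x7352)  # bytes b2, 0x64, b3, 0x64 -> 1024+b2, 1024+b3
    return sub_f16x2(lo, MAGIC), sub_f16x2(hi, MAGIC)


def unpack_2xf16(reg):
    """Read back the two F16 values held in a 32-bit register."""
    return [f16_value(reg & 0xFFFF), f16_value(reg >> 16)]


lo, hi = convert_4xu8_fast(0xFF800A01)  # bytes: 1, 10, 128, 255
print(unpack_2xf16(lo), unpack_2xf16(hi))
```

The result is exact for every U8 value because 1024 + *b* fits in the 10-bit F16 mantissa for all *b* in 0 to 255.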

`FragmentShuffler` handles the layout conformance by shuffling data in a way that allows the use of wider bitwidth load operation, increasing shared memory bandwidth utilization and reducing the total number of operations.

NVIDIA Ampere architecture provides a load matrix instruction (`ldmatrix`). The `ldmatrix` is a warp-level operation, where 32 threads of a warp move the data from shared memory to registers in the *shape* and *layout* that the `mma` operands *A* and *B* consume. The use of `ldmatrix` *reduces* the number of load instructions and *increases* the memory bandwidth utilization. Since the `ldmatrix` instruction moves U8 data to registers, the layout after the load conforms with the U8*U8 `mma` operation, and not with the F16*F16 `mma` operation. We implemented `FragmentShuffler` to rearrange the data within registers using shuffle (`shfl.sync`) operations to achieve the layout conformance.
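
The toy Python model below illustrates why a shuffle is needed and what it accomplishes. It simulates only four lanes and a single row of 16 values along K, and both layouts are assumptions made for illustration: after a wide load of U8 data, lane *t* is assumed to hold the four consecutive values k = 4t..4t+3, while the F16 operation is assumed to expect lane *t* to hold k = 2t, 2t+1 and k = 2t+8, 2t+9. The real fragment layouts span all 32 lanes of a warp, and the shuffle pattern used by `FragmentShuffler` may differ from this one.

```python
QUAD = 4  # lanes simulated here; a real warp has 32


def shfl(regs, src_lane):
    """Emulate a warp shuffle: lane t reads the register of lane src_lane(t)."""
    return [regs[src_lane(t)] for t in range(len(regs))]


def pack_u8(values):
    """Pack up to four U8 values into one 32-bit register, lowest byte first."""
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & 0xFF) << (8 * i)
    return reg


def unpack_u8(reg, count=4):
    """Unpack count U8 values from a register, lowest byte first."""
    return [(reg >> (8 * i)) & 0xFF for i in range(count)]


def loaded_layout(row):
    """Assumed layout after a wide U8 load: lane t holds k = 4t..4t+3."""
    return [pack_u8(row[4 * t: 4 * t + 4]) for t in range(QUAD)]


def shuffle_to_f16_layout(regs):
    """Rearrange so lane t holds k = 2t, 2t+1, 2t+8, 2t+9 (assumed F16 layout)."""
    first = shfl(regs, lambda t: t // 2)       # lane that owns k = 2t, 2t+1
    second = shfl(regs, lambda t: t // 2 + 2)  # lane that owns k = 2t+8, 2t+9
    out = []
    for t in range(QUAD):
        shift = 16 * (t % 2)  # even lanes keep the low half, odd lanes the high
        pair0 = (first[t] >> shift) & 0xFFFF
        pair1 = (second[t] >> shift) & 0xFFFF
        out.append(pair0 | (pair1 << 16))
    return out


row = list(range(100, 116))  # 16 U8 values along K
shuffled = shuffle_to_f16_layout(loaded_layout(row))
for t, reg in enumerate(shuffled):
    print(f"lane {t}: {unpack_u8(reg)}")
```

In this model each lane issues two shuffles and a byte selection, after which its register can be converted to two registers of 2xF16 in the order the F16 operation expects. The data stays in registers throughout, with no narrower loads and no offline rearrangement.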

The most significant contribution of this work is to achieve layout conformance through register shuffles, avoiding offline pre-processing in global memory or narrower bitwidth shared memory loads. Furthermore, we provide implementations for `FastNumericArrayConvertor` covering data type conversion from U8-to-F16, S8-to-F16, U8-to-BF16, and S8-to-BF16.

## Performance results

We measured the performance of eight mixed-input variants of *our method* (shown below in blue and red; varying the data types of matrix *A* and *B*) and two *mixed-precision* data types (shown in green) on an NVIDIA A100 SXM chip. The performance results are shown in FLOPS (higher is better). Notably, the first eight matrix multiplications require additional operations relative to the last two, because the mixed-precision variants directly target hardware-accelerated Tensor Core operations and do not need data type conversion and layout conformance. Even so, our approach demonstrates mixed-input matrix multiplication performance only slightly below or on par with mixed-precision.

Mixed-input matrix multiplication performance on NVIDIA A100 40GB SXM4 chip for a compute-bound matrix problem shape `m=3456, n=4096, k=2048`.

## Acknowledgements

*We would like to mention several folks who have contributed through technical brainstorming and improving the blog post, including Quentin Colombet, Jacques Pienaar, Allie Culp, Calin Cascaval, Ashish Gondimalla, Matt Walsh, Marek Kolodziej, and Aman Bhatia. We would like to thank our NVIDIA partners Rawn Henry, Pradeep Ramani, Vijay Thakkar, Haicheng Wu, Andrew Kerr, Matthew Nicely, and Vartika Singh.*