A transformer generates tokens autoregressively: each new token is produced from all previous inputs, and the process repeats until the response concludes. Throughout this generation, the key and value vectors of every earlier token are reused at each step, so it is computationally efficient to cache them rather than recompute them.
However, caching increases memory usage: the system must store the key and value vectors of every token, for every attention head, across all layers. The number of entries in the cache is 2 (one key and one value vector) × number of layers × heads per layer × head dimension × number of tokens:
| Variable | Description | Value (R1) |
|---|---|---|
| $n_l$ | Number of layers | 61 |
| $n_h$ | Number of attention heads per layer | 128 |
| $d_h$ | Dimension of attention head | 128 |
| $n_t$ | Input tokens | 100,000 |
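Plugging these numbers in makes the scale concrete. A quick back-of-the-envelope sketch (the 2-byte entry size assumes BF16/FP16 precision):

```python
# KV cache size for the R1 configuration in the table above.
n_layers = 61       # number of layers
n_heads = 128       # attention heads per layer
d_head = 128        # dimension of each attention head
n_tokens = 100_000  # input tokens

# One key vector and one value vector (the factor of 2) per token,
# per head, per layer, each with d_head entries.
entries = 2 * n_layers * n_heads * d_head * n_tokens
size_gb = entries * 2 / 1e9  # assuming 2 bytes per entry (BF16)
print(f"{size_gb:.0f} GB")   # → 400 GB
```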
In this setting, roughly 400GB of memory is required for the cache alone, making it infeasible. Several methods have been introduced to reduce this memory usage.
## Methods
### Multi-Query Attention
In multi-query attention, all attention heads share a single key and value matrix; only the query matrix differs between heads. This significantly reduces memory usage, but at the cost of each head's specialization.
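The sharing pattern can be seen in a minimal NumPy sketch (toy sizes and hypothetical names, not R1's configuration):

```python
import numpy as np

# Multi-query attention sketch: per-head queries, one shared key/value.
n_heads, d_model, d_head, seq = 4, 32, 8, 5
rng = np.random.default_rng(0)
x = rng.standard_normal((seq, d_model))

W_q = rng.standard_normal((n_heads, d_model, d_head))  # one query projection per head
W_k = rng.standard_normal((d_model, d_head))           # single shared key projection
W_v = rng.standard_normal((d_model, d_head))           # single shared value projection

k, v = x @ W_k, x @ W_v          # computed once, cached once, used by every head
outputs = []
for h in range(n_heads):
    q = x @ W_q[h]               # only the queries differ per head
    scores = q @ k.T / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    outputs.append(weights @ v)
out = np.concatenate(outputs, axis=-1)   # (seq, n_heads * d_head)
```

The cache holds only `k` and `v`: 2 × d_head entries per token per layer, versus 2 × n_heads × d_head for full multi-head attention.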
### Grouped-Query Attention
In grouped-query attention, heads are divided into groups, and each group shares one key and value matrix. It is less destructive than multi-query attention, but still incurs a performance hit relative to full multi-head attention.
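A sketch of the grouping, again with toy sizes and hypothetical names:

```python
import numpy as np

# Grouped-query attention sketch: 8 heads sharing 2 key/value groups.
n_heads, n_groups, d_model, d_head, seq = 8, 2, 32, 8, 6
rng = np.random.default_rng(1)
x = rng.standard_normal((seq, d_model))

W_q = rng.standard_normal((n_heads, d_model, d_head))
W_k = rng.standard_normal((n_groups, d_model, d_head))  # one K per group, not per head
W_v = rng.standard_normal((n_groups, d_model, d_head))

heads_per_group = n_heads // n_groups
outputs = []
for h in range(n_heads):
    g = h // heads_per_group          # heads 0-3 -> group 0, heads 4-7 -> group 1
    q, k, v = x @ W_q[h], x @ W_k[g], x @ W_v[g]
    scores = q @ k.T / np.sqrt(d_head)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    outputs.append(w @ v)
out = np.concatenate(outputs, axis=-1)
```

The cache holds 2 × n_groups × d_head entries per token per layer. Setting `n_groups = 1` recovers multi-query attention; `n_groups = n_heads` recovers full multi-head attention.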
### Multi-Head Latent Attention
In multi-head latent attention, the input is projected into a low-dimensional latent space and then projected back up to the key and value matrices by learnable weight matrices that are unique to each attention head. Only the shared latent vector needs to be cached.
#### Training Process
In the training process, the weight matrices $W^{DKV}$, $W^{UK}$, and $W^{UV}$ are trained to effectively compress the input into the latent space and decompress the key and value matrices from it:

$$c_t^{KV} = W^{DKV} h_t, \qquad k_t = W^{UK} c_t^{KV}, \qquad v_t = W^{UV} c_t^{KV}$$

where $h_t$ is the hidden state of token $t$, $c_t^{KV}$ is its compressed latent vector, $W^{DKV}$ is the shared down-projection matrix, and $W^{UK}$, $W^{UV}$ are the per-head up-projection matrices for keys and values.
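A minimal sketch of the compress/decompress path (toy sizes, hypothetical names):

```python
import numpy as np

# MLA sketch: shared down-projection to a latent, per-head up-projections.
n_heads, d_model, d_head, d_latent, seq = 4, 32, 8, 4, 5
rng = np.random.default_rng(2)
h_states = rng.standard_normal((seq, d_model))

W_dkv = rng.standard_normal((d_model, d_latent))          # shared down-projection
W_uk = rng.standard_normal((n_heads, d_latent, d_head))   # per-head key up-projection
W_uv = rng.standard_normal((n_heads, d_latent, d_head))   # per-head value up-projection

c_kv = h_states @ W_dkv   # (seq, d_latent): the only tensor that must be cached
k = np.stack([c_kv @ W_uk[i] for i in range(n_heads)])  # (n_heads, seq, d_head)
v = np.stack([c_kv @ W_uv[i] for i in range(n_heads)])
```

Per-head keys and values are reconstructed on demand, so the cache holds `d_latent` entries per token per layer instead of 2 × n_heads × d_head.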
#### Inference Process
In the inference process, the same expression is rearranged using linear algebra to prevent redundant operations and save computational cost. With query $q_t = W^{Q} h_t$ and key $k_j = W^{UK} c_j^{KV}$, the attention score can be rewritten as

$$q_t^\top k_j = h_t^\top \left( (W^{Q})^\top W^{UK} \right) c_j^{KV}$$

so attention is computed directly against the cached latent vectors $c_j^{KV}$ without ever materializing the keys; the same rearrangement absorbs $W^{UV}$ into the output projection $W^{O}$.

Since the products $(W^{Q})^\top W^{UK}$ and $W^{O} W^{UV}$ don't depend on the input, each can be pre-calculated and used as a single matrix.
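The rearrangement is easy to verify numerically. A sketch (toy sizes, hypothetical names) checking that the absorbed matrix produces the same scores as first decompressing the keys:

```python
import numpy as np

# Verify: absorbing the key up-projection into the query path gives the same
# attention scores while touching only the cached latent vectors.
d_model, d_latent, d_head, seq = 32, 4, 8, 5
rng = np.random.default_rng(3)
h_t = rng.standard_normal(d_model)           # current token's hidden state
c_kv = rng.standard_normal((seq, d_latent))  # cached latents of earlier tokens

W_q = rng.standard_normal((d_model, d_head))   # query projection
W_uk = rng.standard_normal((d_latent, d_head)) # key up-projection

# Naive path: decompress every key, then score.
k = c_kv @ W_uk                       # (seq, d_head)
scores_naive = k @ (h_t @ W_q)        # (seq,)

# Absorbed path: one precomputed, input-independent matrix.
W_absorbed = W_q @ W_uk.T             # (d_model, d_latent), computed once
scores_fast = c_kv @ (h_t @ W_absorbed)

assert np.allclose(scores_naive, scores_fast)
```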
## Required Storage Comparison
| Attention Mechanism | KV Cache Entries per Token | KV Cache Size per Token |
|---|---|---|
| Multi-Head Attention (MHA) | 2 × 61 × 128 × 128 ≈ 2.0M | 4MB |
| Multi-Query Attention (MQA) | 2 × 61 × 128 ≈ 15.6K | 31KB |
| Grouped-Query Attention (GQA) | 2 × 61 × 16 × 128 ≈ 250K | 500KB (16 groups) |
| Multi-Head Latent Attention (MLA) | 61 × (512 + 64) ≈ 35.1K | 70KB |

Sizes assume 2 bytes (BF16) per entry.
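The sizes in the table above can be checked with a few lines of arithmetic. This sketch assumes 2-byte entries; the 16 GQA groups and the MLA per-layer width (a 512-dimensional latent plus a 64-dimensional decoupled RoPE key, matching DeepSeek's published configuration) are assumptions consistent with the stated sizes:

```python
# Per-token KV cache sizes for the R1 configuration, in bytes.
n_layers, n_heads, d_head = 61, 128, 128
B = 2  # bytes per entry (BF16)

mha = 2 * n_layers * n_heads * d_head * B  # one K and V per head per layer
mqa = 2 * n_layers * d_head * B            # one shared K and V per layer
gqa = 2 * n_layers * 16 * d_head * B       # assuming 16 KV groups
mla = n_layers * (512 + 64) * B            # assuming latent 512 + RoPE key 64

print(mha, mqa, gqa, mla)  # → 3997696 31232 499712 70272
```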