In multi-head latent attention (MLA), the input is first projected into a low-dimensional latent space; the key and value matrices are then reconstructed from this latent representation by learnable up-projection weight matrices, with a separate pair of weights for each attention head.

Training Process

In the training process, the weight matrices $W^{DKV}$, $W^{UK}$, $W^{UV}$ are trained to effectively compress the input into the latent space and decompress the key and value matrices from it:

$c^{KV}_t = W^{DKV} h_t, \qquad k_t = W^{UK} c^{KV}_t, \qquad v_t = W^{UV} c^{KV}_t,$

where $h_t$ is the input hidden state at position $t$, $c^{KV}_t$ is its low-dimensional latent representation, $W^{DKV}$ is the shared down-projection matrix, and $W^{UK}$, $W^{UV}$ are the per-head up-projection matrices for the keys and values.
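The compression and decompression above can be sketched numerically. This is a minimal illustration, not an implementation of any particular model: the dimensions (`d`, `d_c`, `d_h`) and the random weights standing in for trained parameters are assumptions for the example.

```python
import numpy as np

# Hypothetical dimensions: model width d, latent width d_c (d_c << d),
# per-head key/value width d_h. Real models use much larger values.
d, d_c, d_h = 64, 8, 16
rng = np.random.default_rng(0)

# Random stand-ins for the trained weight matrices:
W_dkv = rng.standard_normal((d_c, d))    # down-projection: input -> latent
W_uk = rng.standard_normal((d_h, d_c))   # up-projection: latent -> key (one head)
W_uv = rng.standard_normal((d_h, d_c))   # up-projection: latent -> value (one head)

h = rng.standard_normal((d,))            # hidden state of one token

c_kv = W_dkv @ h                         # latent vector: all the KV cache stores
k = W_uk @ c_kv                          # key decompressed from the latent
v = W_uv @ c_kv                          # value decompressed from the latent

print(c_kv.shape, k.shape, v.shape)      # (8,) (16,) (16,)
```

Caching the 8-dimensional latent vector instead of the two 16-dimensional key and value vectors is the source of the memory savings.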

Inference Process

In the inference process, the same expression is rearranged using the associativity of matrix multiplication to prevent redundant operations and save computational cost:

$q_t^\top k_s = q_t^\top \left(W^{UK} c^{KV}_s\right) = \left((W^{UK})^\top q_t\right)^\top c^{KV}_s,$

where $q_t$ is the query at position $t$ and $c^{KV}_s$ is the cached latent vector at position $s$, so the keys never need to be reconstructed explicitly; analogously, $W^{UV}$ can be folded into the output projection.

Since the matrices $W^{UK}$ and $W^{UV}$ do not depend on the input, each product with its neighboring projection (the query projection and the output projection, respectively) can be pre-calculated once and used as a single matrix during inference.
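The rearrangement can be checked directly: folding the key up-projection into the query projection ahead of time yields exactly the same attention score as first reconstructing the key. The dimensions and random weights below are illustrative assumptions.

```python
import numpy as np

d, d_c, d_h = 64, 8, 16
rng = np.random.default_rng(1)
W_q = rng.standard_normal((d_h, d))      # query projection (one head)
W_dkv = rng.standard_normal((d_c, d))    # down-projection: input -> latent
W_uk = rng.standard_normal((d_h, d_c))   # up-projection: latent -> key

h_t = rng.standard_normal((d,))          # current token's hidden state
h_s = rng.standard_normal((d,))          # earlier token; only its latent is cached
c_s = W_dkv @ h_s

# Naive: reconstruct the key, then take its dot product with the query.
score_naive = (W_q @ h_t) @ (W_uk @ c_s)

# Absorbed: pre-multiply the input-independent matrices once, then
# score the query directly against the cached latent vector.
W_q_absorbed = W_uk.T @ W_q              # pre-computed, reused every step
score_fast = (W_q_absorbed @ h_t) @ c_s

assert np.allclose(score_naive, score_fast)
```

Both paths compute $q_t^\top W^{UK} c^{KV}_s$; the absorbed form simply reassociates the products so the per-step work involves only the small latent vector.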