In multi-query attention, all attention heads share a single key matrix and a single value matrix; only the query matrix differs from head to head. This significantly reduces memory usage, most notably the size of the key-value cache during inference, but it comes at the cost of some specialization in each head.
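The sharing described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `multi_query_attention`, the weight names `w_q`, `w_k`, `w_v`, and the toy dimensions are all assumptions made for this example, and masking and the output projection are left out for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, w_q, w_k, w_v):
    """Hypothetical sketch of multi-query attention (no mask, no output projection).

    x:   (seq_len, d_model) input sequence
    w_q: (n_heads, d_model, d_head) one query projection per head
    w_k: (d_model, d_head) single key projection shared by all heads
    w_v: (d_model, d_head) single value projection shared by all heads
    """
    q = np.einsum("sd,hde->hse", x, w_q)  # (n_heads, seq_len, d_head)
    k = x @ w_k                           # (seq_len, d_head), shared
    v = x @ w_v                           # (seq_len, d_head), shared
    scores = np.einsum("hse,te->hst", q, k) / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)    # (n_heads, seq_len, seq_len)
    out = np.einsum("hst,te->hse", weights, v)
    # Concatenate the heads along the feature axis.
    return out.transpose(1, 0, 2).reshape(x.shape[0], -1)

# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_head = 5, 16, 4, 4
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(n_heads, d_model, d_head))
w_k = rng.normal(size=(d_model, d_head))
w_v = rng.normal(size=(d_model, d_head))

out = multi_query_attention(x, w_q, w_k, w_v)
print(out.shape)  # (5, 16)
```

In this sketch, only one `k` and one `v` of shape `(seq_len, d_head)` would need to be cached per layer, whereas standard multi-head attention would cache `n_heads` of each. That factor of `n_heads` is where the memory saving comes from.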