Large language models have demonstrated remarkable performance; however, their massive
parameter counts make deployment highly expensive. Low-rank approximation offers a
promising compression solution, yet existing approaches have two main limitations:
(1) they focus on minimizing the output error of individual linear layers,
without considering the architectural characteristics of Transformers, and
(2) they decompose a large weight matrix into two small low-rank matrices, introducing
runtime overhead such as extra GEMM kernel launches.
To address these limitations, we propose A3, a post-training
low-rank approximation framework. A3 splits a Transformer layer into three
functional components (QK, OV, and MLP) and provides
analytical solutions that reduce the hidden dimension inside each
component while minimizing the component's functional loss. This directly reduces model
sizes, KV cache sizes, and FLOPs without introducing any runtime overhead.
Through extensive experiments, we show that A3 consistently outperforms state-of-the-art (SoTA)
low-rank compression methods. For example, our low-rank approximated LLaMA 3.1-70B achieves a
perplexity of 4.69 on WikiText-2, outperforming the previous SoTA's
7.87 by 3.18.
Key Result: Under the same reduction budget in computation and memory,
A3 on LLaMA 3.1-70B achieves a WikiText-2 perplexity of
4.69 vs. the previous SoTA's 7.87, a 58.6% reduction in perplexity error (the perplexity increase over the uncompressed model).
Method Overview
Most prior low-rank methods target individual linear layers, optimizing weight
reconstruction without regard to the Transformer's functional structure. A3
instead decomposes each Transformer layer into three components and derives closed-form
optimal solutions for each:
A3-QK: minimizes pre-softmax attention score error by reducing the query/key head dimension.
A3-OV: minimizes per-head attention output error by reducing the value/output head dimension.
A3-MLP: minimizes MLP output error via CUR decomposition, reducing the intermediate size.
Overview of A3. A Transformer layer is decomposed into
three components: QK, OV, and MLP. For each component, A3 derives
an analytical solution that reduces the hidden dimension (head dimension or intermediate
size) while minimizing the component's functional loss. This results in reduced model
size, KV cache, and FLOPs with no runtime overhead—unlike classical low-rank methods
that introduce extra GEMM operations.
A3-QK: Query-Key Attention Score Approximation
Objective. For each head, the pre-softmax attention score is
\( A_i = X_q W_{qk,i} X_{kv}^T \). We seek a rank-\(r\) approximation
\(\widetilde{W}_{qk,i}\) minimising the score error
\[
\min_{\operatorname{rank}(\widetilde{W}_{qk,i}) \le r}
\big\| X_q \big(W_{qk,i}-\widetilde{W}_{qk,i}\big) X_{kv}^T \big\|_F^2 .
\]
This is equivalent to minimising
\(\| R_{xx,q}^{1/2}(W_{qk,i}-\widetilde{W}_{qk,i})R_{xx,kv}^{1/2} \|_F^2\),
where \(R_{xx,q} = \tfrac{1}{l_q}X_q^T X_q\) and
\(R_{xx,kv} = \tfrac{1}{l_{kv}}X_{kv}^T X_{kv}\) are the activation autocorrelation matrices.
The query and key weights are then assigned as two separate projections with the
new (smaller) head dimension \(r < d_{qk}\), reducing both model size and KV-cache
without any extra GEMM at inference.
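To make the recipe concrete, here is a minimal PyTorch sketch of the QK solution for one head of a plain MHA layer without RoPE or GQA. It is an illustration of the whitened truncated SVD implied by the objective above, not the paper's reference implementation; the names `W_q`, `W_k` (one head's query/key weights, hidden size by head dimension), `X_q`, `X_kv` (calibration activations), and `r` are placeholders.

```python
import torch

def a3_qk_head(W_q, W_k, X_q, X_kv, r, eps=1e-6):
    """Rank-r approximation of one head's query-key product W_q @ W_k.T that
    minimises the whitened error || R_q^{1/2} (W_qk - W~_qk) R_kv^{1/2} ||_F.
    Returns new query/key projections with the smaller head dimension r.
    (The 1/sqrt(d) attention scale is assumed folded into W_q.)"""
    # Activation autocorrelation matrices from calibration data.
    R_q = X_q.T @ X_q / X_q.shape[0]      # (H, H)
    R_kv = X_kv.T @ X_kv / X_kv.shape[0]  # (H, H)

    # Symmetric square roots and their inverses (eps keeps them invertible).
    def sqrt_and_inv(R):
        vals, vecs = torch.linalg.eigh(R)
        vals = vals.clamp_min(eps)
        return (vecs * vals.sqrt()) @ vecs.T, (vecs * vals.rsqrt()) @ vecs.T

    Rq_half, Rq_inv_half = sqrt_and_inv(R_q)
    Rkv_half, Rkv_inv_half = sqrt_and_inv(R_kv)

    # Truncated SVD of the whitened QK product: rank-r truncation is optimal
    # for the whitened Frobenius objective (Eckart-Young).
    W_qk = W_q @ W_k.T                    # (H, H)
    U, S, Vh = torch.linalg.svd(Rq_half @ W_qk @ Rkv_half)
    U_r, S_r, V_r = U[:, :r], S[:r], Vh[:r, :].T

    # Un-whiten and split the rank-r factor into new Q / K projections.
    W_q_new = Rq_inv_half @ U_r * S_r.sqrt()   # (H, r)
    W_k_new = Rkv_inv_half @ V_r * S_r.sqrt()  # (H, r)
    return W_q_new, W_k_new
```

The returned projections satisfy \(W_q^{\text{new}} (W_k^{\text{new}})^T \approx W_{qk,i}\) under the whitened metric, so the attention scores are preserved while the head dimension drops from \(d_{qk}\) to \(r\).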
A3-OV: Attention Output Approximation
Objective. The attention output of head \(i\) is
\(O_i = P_i W_{vo,i}\), where \(P_i = A'_i X_{kv}\) is the post-softmax context matrix.
We minimise the per-head output error
\[
\min_{\operatorname{rank}(\widetilde{W}_{vo,i}) \le r}
\big\| P_i \big(W_{vo,i}-\widetilde{W}_{vo,i}\big) \big\|_F^2 .
\]
Value and output weights are assigned with a new head dimension
\(r < d_{vo}\). Treating each head independently upper-bounds the total attention
output error and admits a simple parallel implementation.
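The OV solution admits the same kind of sketch, again illustrative only: `W_v`, `W_o` are one head's value and output weights (math convention, inputs multiply on the left) and `P` is that head's post-softmax context matrix collected from calibration data; all names are placeholders, not the released API.

```python
import torch

def a3_ov_head(W_v, W_o, P, r, eps=1e-6):
    """Rank-r approximation of one head's value-output product W_v @ W_o that
    minimises || P (W_vo - W~_vo) ||_F over calibration contexts P.
    Returns new value/output weights with the smaller head dimension r."""
    W_vo = W_v @ W_o                       # (H, H): value proj followed by output proj

    # Autocorrelation of the post-softmax context and its square roots.
    R_p = P.T @ P / P.shape[0]             # (H, H)
    vals, vecs = torch.linalg.eigh(R_p)
    vals = vals.clamp_min(eps)
    Rp_half = (vecs * vals.sqrt()) @ vecs.T
    Rp_inv_half = (vecs * vals.rsqrt()) @ vecs.T

    # Truncated SVD of the whitened product (Eckart-Young optimal).
    U, S, Vh = torch.linalg.svd(Rp_half @ W_vo)
    U_r, S_r, Vh_r = U[:, :r], S[:r], Vh[:r, :]

    W_v_new = Rp_inv_half @ U_r * S_r.sqrt()   # (H, r): new value projection
    W_o_new = S_r.sqrt()[:, None] * Vh_r       # (r, H): new output projection
    return W_v_new, W_o_new
```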
A3-MLP: MLP Intermediate-Size Reduction
Objective. The non-linear activation in MLP prevents direct SVD.
Instead, we find a diagonal selection matrix
\(U = \operatorname{diag}(u_1,\dots,u_{d_{\text{inter}}})\)
that keeps the \(r\) most important intermediate neurons, ranked by the score
\[
\lambda_i = \| r_i w_i \|_F = \| r_i \|_2 \, \| w_i \|_2 ,
\]
where \(r_i\) is the \(i\)-th column of \(R_{xx,d}^{1/2}\) and \(w_i\) the
\(i\)-th row of \(W_{\text{down}}\). Keep the top-\(r\) indices; the corresponding
rows/columns of \(W_{\text{down}}\), \(W_{\text{up}}\), and \(W_{\text{gate}}\)
are retained, directly reducing \(d_{\text{inter}}\) to \(r\) with no extra GEMM.
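A hedged sketch of the selection step, assuming a LLaMA-style gated MLP and a precomputed autocorrelation `R_xx_d` of the down-projection input. Weight shapes follow the math convention above (inputs multiply on the left), so for `nn.Linear` modules the slices would apply to the transposed weights; all names are illustrative.

```python
import torch

def a3_mlp(W_gate, W_up, W_down, R_xx_d, r, eps=1e-6):
    """Keep the r most important intermediate neurons of a gated MLP.
    W_gate, W_up: (hidden, inter); W_down: (inter, hidden);
    R_xx_d: (inter, inter) autocorrelation of the down-projection input."""
    # lambda_i = ||r_i||_2 * ||w_i||_2: column norms of R_xx_d^{1/2}
    # times row norms of W_down.
    vals, vecs = torch.linalg.eigh(R_xx_d)
    R_half = (vecs * vals.clamp_min(eps).sqrt()) @ vecs.T
    score = R_half.norm(dim=0) * W_down.norm(dim=1)

    # Indices of the retained neurons, kept in their original order.
    keep = score.topk(r).indices.sort().values

    # Slice the corresponding columns/rows: the intermediate size shrinks to r
    # while the MLP keeps its structure, so no extra GEMM is introduced.
    return W_gate[:, keep], W_up[:, keep], W_down[keep, :]
```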
A3 supports vanilla MHA, MHA with RoPE (via CUR approximation that respects
RoPE frequency pairing), and GQA (via joint SVD across query heads sharing a key head),
making it applicable to most modern LLMs. In this work, we report results on the LLaMA-2/3, Phi-3, and MPT families.
Experiments
We evaluate A3 on a range of LLM architectures: MHA-NoPE (MPT),
MHA-RoPE (LLaMA 1&2), and GQA-RoPE (LLaMA 3.1, Phi 3). Evaluations cover
pretraining perplexity (WikiText-2, C4, SlimPajama) and KV cache compression. All results are post-training
without fine-tuning.
Main Results: Perplexity
Perplexity (↓) on WikiText-2, C4, and SlimPajama at 10% and 20% compression ratios.
Bold indicates the best result per setting. A3 outperforms SVD-LLM
by a large margin across all models and benchmarks. The gap is especially pronounced on
GQA-RoPE models (LLaMA 3.1), where existing methods struggle.
| Model | Method | WikiText-2 (10%) | C4 (10%) | SlimPajama (10%) | WikiText-2 (20%) | C4 (20%) | SlimPajama (20%) |
|---|---|---|---|---|---|---|---|
| LLaMA-2-7B (MHA-RoPE) | SVD-LLM | 8.78 | 11.73 | 9.49 | 11.58 | 14.91 | 11.93 |
| LLaMA-2-7B (MHA-RoPE) | A3 (ours) | **5.96** | **8.34** | **6.68** | **7.22** | **9.91** | **7.91** |
| LLaMA-2-13B (MHA-RoPE) | SVD-LLM | 7.09 | 9.98 | 7.95 | 9.03 | 12.35 | 9.75 |
| LLaMA-2-13B (MHA-RoPE) | A3 (ours) | **5.32** | **7.65** | **7.65** | **6.24** | **8.99** | **7.15** |
| LLaMA-3.1-8B (GQA-RoPE) | SVD-LLM | 19.12 | 19.37 | 15.14 | 42.28 | 33.60 | 27.44 |
| LLaMA-3.1-8B (GQA-RoPE) | A3 (ours) | **7.93** | **12.56** | **9.52** | **11.36** | **17.87** | **13.58** |
| LLaMA-3.1-70B (GQA-RoPE) | SVD-LLM | 7.87 | 11.30 | 8.43 | 9.75 | **13.77** | 10.44 |
| LLaMA-3.1-70B (GQA-RoPE) | A3 (ours) | **4.69** | **8.83** | **6.59** | **8.32** | 13.94 | **10.02** |
| Phi-3-medium-14B (GQA-RoPE) | SVD-LLM | 6.81 | 10.47 | 8.40 | 8.14 | 11.90 | 9.67 |
| Phi-3-medium-14B (GQA-RoPE) | A3 (ours) | **5.44** | **9.48** | **7.28** | **6.40** | **10.59** | **8.16** |
Main Results: KV Cache Compression
Perplexity (↓) when compressing both the KV cache and parameters simultaneously,
compared against CLOVER and Palu on MPT-7B and MPT-30B (MHA-NoPE). A3 achieves
the best perplexity at every compression ratio, with an especially large margin at high
compression (60–80%) where baselines degrade catastrophically.
| Model | Compression ratio | CLOVER (SlimPajama) | Palu (SlimPajama) | A3 (SlimPajama) | CLOVER (C4) | Palu (C4) | A3 (C4) | CLOVER (WikiText-2) | Palu (WikiText-2) | A3 (WikiText-2) |
|---|---|---|---|---|---|---|---|---|---|---|
| MPT-7B (MHA-NoPE) | 20% | 48.11 | 9.67 | 8.88 | 53.29 | 11.74 | 10.77 | 77.78 | 8.73 | 8.05 |
| MPT-7B (MHA-NoPE) | 40% | 383 | 11.51 | 9.90 | 408 | 14.18 | 12.20 | 795 | 10.60 | 9.19 |
| MPT-7B (MHA-NoPE) | 60% | 5397 | 25.73 | 15.34 | 4919 | 32.26 | 18.71 | 7895 | 25.09 | 15.58 |
| MPT-7B (MHA-NoPE) | 80% | 15467 | 5270 | 388 | 11661 | 3210 | 373 | 14434 | 13714 | 849 |
| MPT-30B (MHA-NoPE) | 20% | 11.52 | 7.91 | 7.71 | 14.53 | 9.87 | 9.59 | 13.07 | 7.04 | 6.73 |
| MPT-30B (MHA-NoPE) | 40% | 18.00 | 8.99 | 8.33 | 22.43 | 11.30 | 10.44 | 23.47 | 8.40 | 7.40 |
| MPT-30B (MHA-NoPE) | 60% | 54.97 | 15.59 | 11.52 | 70.65 | 18.91 | 14.22 | 95.45 | 18.88 | 11.28 |
| MPT-30B (MHA-NoPE) | 80% | 779 | 211 | 37.09 | 732 | 253 | 42.85 | 1524 | 339 | 46.72 |
Runtime Performance
⚡
Zero inference overhead — works with existing stacks out of the box.
Unlike low-rank methods that replace \(W\) with two matrices \(AB\) and therefore
require an extra GEMM kernel launch per layer, A3 simply
reduces the hidden dimensions inside each component
(QK head dimension \(D\), OV head dimension \(D\), and MLP intermediate size \(I\)).
The compressed model retains the exact same architecture layout as
the original — same number of linear layers, same data flow — so it runs unchanged
on any inference stack (PyTorch Eager, FlashAttention, SDPA, vLLM, etc.)
without custom kernels or operator fusion.
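The structural difference is easy to see in code. Below is a small illustrative comparison (module names and sizes are placeholders, not the released implementation): the classical factorisation replaces one linear layer with two, while A3 keeps a single, narrower layer.

```python
import torch.nn as nn

d_model, d, d_new = 4096, 1024, 768   # illustrative sizes only

# Classical low-rank factorisation (e.g. SVD-LLM): W (d_model x d) becomes A @ B,
# i.e. two nn.Linear modules and therefore two GEMM kernel launches per call.
svd_style = nn.Sequential(
    nn.Linear(d_model, d_new, bias=False),  # B
    nn.Linear(d_new, d, bias=False),        # A
)

# A3: the projection keeps its single-GEMM layout; only its output width
# (the head dimension) shrinks, so any inference stack runs it unchanged.
a3_style = nn.Linear(d_model, d_new, bias=False)
```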
How we count theoretical FLOPs
We measure theoretical FLOPs for a single decoder block during prefill by summing
four contributions — attention projections + dot-products, MLP, layer-norm, and residuals:
\[
\text{FLOPs}_\text{total}
= \underbrace{8BLHAD + 4BL^2 A D + BL^2 A}_{\text{attention}}
+ \underbrace{6BLIH}_{\text{MLP}}
+ \underbrace{4BLH}_{\text{norm + residual}}
\]
\(B\) = batch size, \(L\) = sequence length, \(H\) = hidden size,
\(D\) = head dimension, \(A\) = number of heads, \(I\) = MLP intermediate size.
A3 compresses by reducing \(D\) (QK and OV) and \(I\) (MLP) proportionally,
so the theoretical FLOPs reduction closely tracks the compression ratio.
The small gap arises from dimension-independent terms (normalization, residual, softmax).
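The formula translates directly into a small helper. The sketch below simply mirrors the expression above; the example uses the LLaMA-2-13B block dimensions, and assumes batch size 2 and sequence length 2048 (the settings of the later TPS comparison), which reproduces the \(2.77\times10^{12}\) figure in the table below.

```python
def decoder_block_flops(B, L, H, D, A, I):
    """Theoretical prefill FLOPs of one decoder block, mirroring the formula above."""
    attention = 8 * B * L * H * A * D + 4 * B * L**2 * A * D + B * L**2 * A
    mlp = 6 * B * L * I * H
    norm_residual = 4 * B * L * H
    return attention + mlp + norm_residual

# LLaMA-2-13B block: H=5120, A=40, D=128, I=13824. With the assumed B=2, L=2048
# this gives ~2.77e12, matching the "Original" row of the table below.
orig = decoder_block_flops(B=2, L=2048, H=5120, D=128, A=40, I=13824)
# Shrinking D and I by ~20% tracks the budget closely (~0.80x here).
comp = decoder_block_flops(B=2, L=2048, H=5120, D=102, A=40, I=11059)
print(f"{orig:.3e}  {comp:.3e}  ratio={comp / orig:.2f}")
```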
LLaMA-2-13B — Throughput & Memory on 1× H100
Measured throughput (tokens/s ↑) and peak GPU memory (MB ↓) for Eager and SDPA
attention kernels. Speedup and memory ratios are relative to the uncompressed original.
| Compression | Theoretical FLOPs | Kernel | Throughput (tok/s ↑) | Peak mem (MB ↓) | FLOPs ratio | Speedup ↑ | Mem ratio ↓ |
|---|---|---|---|---|---|---|---|
| Original (\(d_{qk}/d_{vo}/d_{\text{inter}}\) = 128/128/13824) | \(2.77{\times}10^{12}\) | Eager | 7,285 | 35,004 | 1.00× | 1.00× | 1.00× |
| | | SDPA | 12,319 | 32,917 | 1.00× | 1.69× | 0.94× |
| 20% compression | \(2.16{\times}10^{12}\) | Eager | 8,077 | 28,114 | 0.78× | 1.11× | 0.80× |
| | | SDPA | 15,096 | 26,037 | 0.78× | 2.07× | 0.74× |
| 40% compression | \(1.56{\times}10^{12}\) | Eager | 9,350 | 21,336 | 0.56× | 1.28× | 0.61× |
| | | SDPA | 20,237 | 19,270 | 0.56× | 2.78× | 0.55× |
| 60% compression | \(1.08{\times}10^{12}\) | Eager | 10,405 | 16,139 | 0.39× | 1.43× | 0.46× |
| | | SDPA | 25,554 | 14,078 | 0.39× | 3.51× | 0.40× |
Throughput at Scale
To validate robustness, we benchmark across GPU types (A6000, H100), batch sizes,
model sizes (1B–32B), compression ratios (20%, 40%), sequence lengths (1024, 2048),
and attention kernels. A3 consistently achieves speedups close to the
theoretical FLOPs reduction. Larger models and SDPA benefit the most, since the
reduced head dimension also lowers the attention FLOPs that scale quadratically with sequence length.
Throughput gains (tokens/sec ↑) across a wide range of deployment settings.
Each bar compares A3-compressed vs. original model throughput.
TPS Comparison with SVD-LLM
A3 always achieves a speedup; SVD-LLM only gains at aggressive compression due to extra kernel launches.
Low-rank methods that target general linear layers (e.g. SVD-LLM) split each
weight \(W\) into two smaller matrices \(AB\), which saves GEMM FLOPs in linear
layers but introduces extra kernel launches and memory read/writes for
the decomposed matrices. These overheads offset the savings at moderate compression
ratios, meaning SVD-LLM only achieves a net speedup when compression is aggressive
enough to overcome them.
A3 avoids this entirely: it reduces head dimension \(D\) and
intermediate size \(I\) directly, so both the linear-layer GEMMs
and the quadratic attention computation shrink, with no extra operations
added. As a result, A3 always achieves a TPS speedup regardless of
compression ratio.
Tokens per second (TPS ↑) of A3 vs. SVD-LLM on LLaMA-2-13B
(A100 40GB, batch size 2, sequence length 2048, Eager and SDPA backends).
SVD-LLM only reaches a net speedup at high compression, whereas A3
consistently outperforms the uncompressed baseline at every compression ratio.
Ablation Study
We ablate each component to understand its contribution. Under the same overall compression budget, applying QK, OV, and MLP
compression together is consistently better than compressing any subset alone.
Local Objective Reduction vs. End-to-End Perplexity
A3 minimises the error of each component (QK,
OV, MLP)
independently. A natural question is how well these local objectives predict the
end-to-end perplexity when components are compressed jointly.
The table below shows the ΔPPL (perplexity increase over the uncompressed baseline) on
WikiText-2 for each component individually, their arithmetic sum, and the actual joint result
across five compression ratios.
At small compression ratios (≤ 20%), the joint ΔPPL closely tracks the sum of individual
contributions — particularly for MPT-7B (MHA-NoPE), where the
QK and OV
objectives are truly independent. This validates A3's design: minimising
each component's local loss is an effective proxy for end-to-end performance.
At larger ratios, interactions emerge, but the joint perplexity remains on the same
order of magnitude as the sum, confirming that the independence assumption does not
severely compound errors.
LLaMA-3.1-8B (GQA-RoPE)

| Component | 5% | 10% | 15% | 20% | 40% |
|---|---|---|---|---|---|
| QK only | 0.07 | 0.16 | 0.32 | 0.56 | 13.28 |
| OV only | 0.27 | 0.39 | 0.58 | 0.78 | 2.79 |
| QK + OV (sum) | 0.34 | 0.55 | 0.90 | 1.34 | 16.07 |
| Both (joint) | 0.35 | 0.59 | 1.00 | 1.58 | 25.07 |
MPT-7B (MHA-NoPE)

| Component | 5% | 10% | 15% | 20% | 40% |
|---|---|---|---|---|---|
| QK only | −0.004 | 0.005 | 0.040 | 0.092 | 0.75 |
| OV only | 0.048 | 0.097 | 0.166 | 0.248 | 0.98 |
| QK + OV (sum) | 0.045 | 0.103 | 0.206 | 0.340 | 1.73 |
| Both (joint) | 0.044 | 0.102 | 0.197 | 0.313 | 1.52 |
All values are ΔPPL (perplexity increase over uncompressed baseline) on WikiText-2.
For MPT-7B, the joint result at low compression is actually slightly below
the sum, suggesting mild beneficial interaction between components.
Combination with Quantization (HQQ)
A3 is orthogonal to weight-only quantization and improves the Pareto frontier.
A3 is orthogonal to weight-only quantization methods such as HQQ.
Applying HQQ 4-bit quantization on top of an A3-compressed model
introduces only a small additional perplexity overhead, comparable to quantizing
the uncompressed model — confirming that the two techniques do not interfere.
In extreme compression regimes, combining A3 with quantization
enables compression levels that are otherwise unreachable. Sub-3-bit quantization
alone destabilizes the model, whereas A3 + 4-bit HQQ achieves a
markedly better accuracy–compression trade-off and a continuous Pareto frontier
beyond what quantization alone can offer.
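The combination itself is sequential: compress with A3 first, then quantize the remaining linear layers with HQQ. The sketch below illustrates that second step; the `BaseQuantizeConfig` and `HQQLinear` calls follow the HQQ library's documented usage, but treat the exact signatures and the `group_size=64` choice as assumptions rather than our exact configuration.

```python
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

def hqq_quantize_linears(model, nbits=4, group_size=64):
    """Replace every nn.Linear in an (already A3-compressed) model with an
    HQQ-quantized equivalent. In practice one would usually skip the LM head."""
    cfg = BaseQuantizeConfig(nbits=nbits, group_size=group_size)
    for module in list(model.modules()):
        for name, child in list(module.named_children()):
            if isinstance(child, nn.Linear):
                setattr(module, name,
                        HQQLinear(child, quant_config=cfg,
                                  compute_dtype=torch.float16))
    return model
```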
Perplexity of LLaMA-3.1-8B with HQQ 4-bit quantization, with and without
A3 compression. The small additional degradation from A3
confirms orthogonality to quantization.
Table: LLaMA-3.1-8B perplexity (↓) with HQQ 4-bit + A3

| Method | WikiText-2 (10%) | C4 (10%) | WikiText-2 (20%) | C4 (20%) |
|---|---|---|---|---|
| Original | 6.25 | 10.04 | 6.26 | 10.04 |
| Original + HQQ | 6.72 (+0.47) | 10.76 (+0.72) | 6.72 (+0.46) | 10.76 (+0.72) |
| A3 | 7.93 | 12.56 | 12.63 | 19.09 |
| A3 + HQQ | 8.73 (+0.80) | 13.61 (+1.05) | 12.86 (+0.22) | 20.49 (+1.40) |
Table: MPT-30B ΔPPL on WikiText-2 under extreme compression (4-bit HQQ + A3)

| Method | Compression ratio | Without fine-tuning | With fine-tuning |
|---|---|---|---|
| Dense | 1.00× | 0 | 0 |
| 4-bit HQQ | 4.00× | +0.11 | — |
| 2-bit HQQ | 8.00× | +12.78 | +2.80 |
| 4-bit HQQ + A3 @ 20% | 5.00× | +0.99 | +0.59 |
| 4-bit HQQ + A3 @ 40% | 6.67× | +1.15 | +0.99 |
| 4-bit HQQ + A3 @ 60% | 10.00× | +18.15 | +2.74 |
Pareto frontier of perplexity vs. compression rate. A3 combined
with quantization extends the frontier continuously beyond what quantization
alone can achieve, especially in the sub-4-bit regime.
Combination with LoRA Fine-Tuning
A3's strong initialization makes it highly receptive to lightweight fine-tuning.
Following the SVD-LLM setup, we apply LoRA (rank 8) on A3-compressed
models using 50K Alpaca-cleaned samples over 2 epochs. Because A3
produces a structurally clean compressed model (no extra matrices, same
architecture), LoRA adapts it efficiently from a strong starting point.
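Because the compressed model keeps the stock module names, the recovery step is the standard PEFT recipe. A minimal sketch, assuming a LLaMA-style model held in the placeholder `compressed_model` and the rank-8 setting stated above; the `target_modules` list and the other hyperparameters are illustrative, not our exact configuration.

```python
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=8,                 # rank-8 adapters, matching the SVD-LLM setup we follow
    lora_alpha=16,       # illustrative value
    lora_dropout=0.05,   # illustrative value
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# The A3-compressed model exposes the same projection names as the original,
# so the adapters attach without any special handling.
model = get_peft_model(compressed_model, lora_cfg)
model.print_trainable_parameters()
# Then fine-tune on the 50K Alpaca-cleaned samples for 2 epochs as described above.
```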
Table: LLaMA-2-7B WikiText-2 perplexity (↓), A3 with and without LoRA fine-tuning

| Compression ratio | A3 | A3 + LoRA fine-tuning |
|---|---|---|
| 20% | 7.22 (+1.73) | 6.94 (+1.45) |
| 40% | 32.04 (+24.31) | 10.53 (+5.04) |
Δ values are perplexity increases relative to the uncompressed original.
Fine-tuning provides the largest gains at high compression (40%), recovering
most of the lost performance.
Table: MPT-30B — ΔPPL under 4-bit HQQ + A3, with and without fine-tuning
At a 6.67× overall compression (4-bit HQQ + A3 @ 40%), fine-tuning
reduces ΔPPL from +1.15 to just +0.99. Even at 10× compression (60% + 4-bit),
fine-tuning brings ΔPPL from +18.15 down to +2.74, highlighting A3's
receptiveness to LoRA recovery.
| Method | Compression | w/o fine-tuning | w/ fine-tuning |
|---|---|---|---|
| Dense | 1.00× | 0 | 0 |
| 2-bit HQQ | 8.00× | +12.78 | +2.80 |
| 4-bit HQQ + A3 @ 20% | 5.00× | +0.99 | +0.59 |
| 4-bit HQQ + A3 @ 40% | 6.67× | +1.15 | +0.99 |
| 4-bit HQQ + A3 @ 60% | 10.00× | +18.15 | +2.74 |
Open Questions
A3 leaves several directions open that we find genuinely interesting.
We hope these questions inspire follow-up work.
1. Can non-uniform rank allocation unlock more performance?
A3 currently uses a uniform rank across all layers and
heads. Yet different layers are known to vary in their sensitivity to compression —
earlier layers tend to be more robust, while later layers carry more task-specific
information and degrade faster under rank reduction.
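As a purely illustrative sketch of what non-uniform allocation could look like (this is not part of A3), one could distribute a global rank budget in proportion to a measured per-layer sensitivity, for example each layer's local objective error at a probe rank:

```python
import numpy as np

def allocate_ranks(sensitivity, budget, r_min, r_max):
    """Split a global rank budget across layers in proportion to a per-layer
    sensitivity estimate, clipped to [r_min, r_max]. A heuristic illustration,
    not a method from the paper."""
    s = np.asarray(sensitivity, dtype=float)
    shares = s / s.sum()                  # more sensitive layers get more rank
    ranks = np.round(shares * budget)
    return np.clip(ranks, r_min, r_max).astype(int)

# Example: 4 layers, total rank budget 256, per-layer ranks kept within [32, 96].
print(allocate_ranks([0.2, 0.5, 1.1, 2.3], budget=256, r_min=32, r_max=96))
# -> [32 32 69 96]
```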
2. Is there a better way to compress the MLP at high compression ratios?
A3-MLP uses CUR decomposition — it selects the
top-\(r\) intermediate neurons by a magnitude-based score \(\lambda_i\).
CUR is fast and analytically motivated, but unlike SVD it does not guarantee
the globally optimal rank-\(r\) approximation. At high compression ratios (≥ 40%),
this sub-optimality becomes significant: the selected neurons may leave substantial
residual error, and performance degrades much faster than for the SVD-based
QK and OV components.
3. How far can joint low-rank decomposition and quantization (as in QERA, CALDERA, SLiM, and OATS) go?
A3 combined with HQQ 4-bit quantization can push the
compression Pareto frontier beyond what either technique achieves alone. But the current combination is
simple: A3 compresses the weights first, then the compressed weights are quantized.
More advanced methods explore tighter ways to combine low-rank decomposition and quantization:
QERA
reconstructs quantisation error with a low-rank correction;
CALDERA
decomposes weights into a quantised plus low-rank sum;
SLiM
combines one-shot quantisation with sparse and low-rank components; and
OATS
jointly handles outliers via sparse and low-rank decomposition.
How to effectively integrate A3 with these ideas is another interesting open question.
BibTeX
@article{wong2025a3,
title={A3: an analytical low-rank approximation framework for attention},
author={Wong, Jeffrey TH and Zhang, Cheng and Cao, Xinye and Gimenes, Pedro and Constantinides, George A and Luk, Wayne and Zhao, Yiren},
journal={arXiv preprint arXiv:2505.12942},
year={2025}
}