[R] PCA rotation makes non-Matryoshka embeddings truncatable — 27x compression at 99% recall with reranking
Posted by ahbond@reddit | LocalLLaMA | View on Reddit | 12 comments
Most embedding models (BGE-M3, E5, ada-002, Cohere) weren't trained with Matryoshka losses, so you can't just drop trailing dimensions. We tried: truncating BGE-M3 from 1024 to 256 dims leaves only 0.467 cosine similarity to the original vectors. Unusable.
The fix is embarrassingly simple. Fit PCA on a sample of your embeddings (~5K vectors is enough), then rotate all vectors into the principal component basis before truncating. The eigenvalues reorder dimensions by importance, so truncation now discards the least important ones instead of arbitrary ones.
Result: PCA truncation to 256 dims gives 0.974 cosine similarity. That's a 109% improvement from a one-line linear transformation with no retraining.
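The rotate-then-truncate step is easy to sketch. Here's a minimal NumPy version (assuming the embeddings fit in memory; `sklearn.decomposition.PCA` would work just as well, and whether to mean-center and renormalize is an implementation choice, not something the post specifies):

```python
import numpy as np

def fit_pca_rotation(sample, out_dim):
    """Fit a PCA rotation on a sample of embeddings (a few thousand rows).
    Returns (mean, components): the top out_dim principal directions,
    ordered by explained variance."""
    mean = sample.mean(axis=0)
    # rows of vt are principal directions, ordered by singular value
    _, _, vt = np.linalg.svd(sample - mean, full_matrices=False)
    return mean, vt[:out_dim]

def rotate_truncate(emb, mean, components):
    """Rotate into the PCA basis, keep the top components, renormalize
    so downstream cosine scoring still makes sense."""
    reduced = (emb - mean) @ components.T
    return reduced / np.linalg.norm(reduced, axis=1, keepdims=True)

rng = np.random.default_rng(0)
emb = rng.normal(size=(5000, 1024))        # stand-in for BGE-M3 vectors
mean, comps = fit_pca_rotation(emb, out_dim=256)
emb256 = rotate_truncate(emb, mean, comps)
```

On real non-Matryoshka embeddings this is where the 0.467 → 0.974 jump comes from; the random data above has a flat spectrum, so it only demonstrates the shapes and the transform, not the quality gain.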
The compression pipeline
Stack PCA dimension reduction with scalar quantization (3-bit per coordinate, using the PolarQuant rotation trick from Zandieh et al. ICLR 2026):
- PCA rotate + truncate to 384 dims (from 1024)
- Random orthogonal rotation (makes coordinates ~Gaussian)
- Lloyd-Max 3-bit quantization + bit-packing
Result: 27x compression (4096 bytes → 148 bytes per embedding).
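A hedged sketch of that three-step pipeline. The levels below are the standard Lloyd-Max optimal 8-level (3-bit) quantizer for a unit Gaussian; the actual PolarQuant/TurboQuant details differ, and bit-packing of the 3-bit indices is only noted in a comment:

```python
import numpy as np

# Lloyd-Max optimal 8-level (3-bit) output levels for a unit Gaussian
LEVELS = np.array([-2.1520, -1.3439, -0.7560, -0.2451,
                    0.2451,  0.7560,  1.3439,  2.1520])

def random_rotation(dim, seed=0):
    """Random orthogonal matrix via QR; spreads energy so coordinates
    look roughly Gaussian before scalar quantization."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix for a uniform rotation

def quantize3(x):
    """Map each coordinate to the nearest of the 8 levels.
    Returns (indices, scale); a real store would pack indices to 3 bits."""
    scale = x.std()
    idx = np.abs(x[..., None] / scale - LEVELS).argmin(axis=-1)
    return idx.astype(np.uint8), scale

def dequantize3(idx, scale):
    return LEVELS[idx] * scale

rng = np.random.default_rng(1)
v = rng.normal(size=(100, 384))        # post-PCA 384-dim vectors
R = random_rotation(384)
idx, s = quantize3(v @ R)
v_hat = dequantize3(idx, s) @ R.T      # rotate back after dequantizing
```

The post's 148 bytes per vector is consistent with 384 × 3 bits = 144 bytes plus a few bytes of per-vector metadata; the sketch keeps unpacked indices for clarity.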
The recall numbers (this is the part that matters)
We benchmarked on a 2.4M-vector cross-civilizational ethics corpus (BGE-M3 embeddings). Here's what we found:
| Method | Compression | Recall@10 |
|---|---|---|
| Scalar int8 | 4x | 97.2% |
| TurboQuant 3-bit | 10.6x | 83.8% |
| PCA-384 + TQ3 | 27.7x | 77.0% |
| PCA-256 + TQ3 | 41.0x | 78.2% |
| Binary quantization | 32x | 66.6% |
| Product quantization (M=16) | 256x | 41.4% |
77% recall single-stage isn't great. But with standard 5x oversampling + exact reranking (fetch 50 candidates, rescore with original vectors), it jumps to 99.4% recall@10. We verified this on 50K production embeddings, not synthetic data.
For comparison, TQ3 alone goes from 81% to 100% with the same reranking trick. The reranking cost is negligible — you're rescoring 50 vectors, not 2.4M.
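The oversample-and-rerank trick is only a few lines. This is an illustrative version (brute-force stage 1, synthetic noise standing in for quantization error), not the paper's implementation:

```python
import numpy as np

def search_with_rerank(query, approx_db, exact_db, k=10, oversample=5):
    """Stage 1: rank everything with cheap approximate vectors (e.g. the
    dequantized ones). Stage 2: rescore only k*oversample candidates
    with the exact vectors and return the top k."""
    cand = np.argsort(approx_db @ query)[::-1][:k * oversample]
    exact = exact_db[cand] @ query
    return cand[np.argsort(exact)[::-1][:k]]

# toy check: noisy approximations, exact rerank recovers the true match
rng = np.random.default_rng(2)
db = rng.normal(size=(10_000, 64))
db /= np.linalg.norm(db, axis=1, keepdims=True)
q = db[0] + 0.1 * rng.normal(size=64)            # query near item 0
noisy = db + 0.15 * rng.normal(size=db.shape)    # stand-in for quantization error
top = search_with_rerank(q, noisy, db, k=10)
```

The cost asymmetry is the whole point: stage 2 touches 50 vectors regardless of corpus size, so the original fp32 vectors can live on disk.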
The surprising finding: cosine similarity lies to you
This was the most interesting part of the paper. Look at these two rows:
- PCA-384 + TQ3: 0.979 cosine similarity, 76.4% recall@10
- PCA-256 + TQ3: 0.963 cosine similarity, 78.2% recall@10
PCA-256 has lower cosine similarity but higher recall. The per-vector reconstruction fidelity metric diverges from the ranking quality metric at high compression. Small perturbations distributed across many vectors can swap the order of closely-ranked items even when each individual vector looks good.
Takeaway: If you're evaluating embedding compression for retrieval, report recall@k, not just cosine similarity. We almost made this mistake ourselves — the cosine numbers made PCA-384 look better than PCA-256, but recall tells the opposite story.
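Recall@k against exact search is straightforward to compute if you want to measure the right thing. A small sketch (brute force, fine for sanity checks, not for 2.4M vectors):

```python
import numpy as np

def recall_at_k(exact_db, approx_db, queries, k=10):
    """Fraction of true top-k neighbors (by the exact vectors) that the
    approximate/compressed index also returns in its top k."""
    hits = 0
    for q in queries:
        true_top = set(np.argsort(exact_db @ q)[::-1][:k])
        approx_top = set(np.argsort(approx_db @ q)[::-1][:k])
        hits += len(true_top & approx_top)
    return hits / (k * len(queries))

rng = np.random.default_rng(4)
db = rng.normal(size=(2000, 64))
perfect = recall_at_k(db, db, db[:20], k=10)  # identical index: recall 1.0
```

Note this compares returned sets, not reconstruction error, which is exactly why it can diverge from per-vector cosine similarity at high compression.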
What doesn't work
- Naive truncation of non-Matryoshka models. Just dropping dims is catastrophic (0.467 cosine at 50% dims, 0.333 at 25% dims). The information is distributed roughly uniformly — you need PCA to concentrate it.
- Product quantization at the same compression range. PQ (M=16 K=256) gets 256x compression but only 41% recall. PCA-128 + TQ3 gets 79x compression at 79% recall — strictly dominates PQ in the 30-80x range.
- Relying on cosine similarity to evaluate compression quality. We keep repeating this because it's the easiest trap to fall into.
Two bonus findings from the implementation work
Learned codebooks: The standard Lloyd-Max quantization assumes rotated coordinates are Gaussian. They're not — the tails are heavier. Training a codebook on your actual rotated data (just 1D k-means, 50 iterations) reduces quantization error by 22% at the same 3 bits. Works consistently across models.
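The learned-codebook variant is just Lloyd's algorithm (1D k-means) on the empirical coordinate distribution. A sketch, using Laplace samples as a stand-in for heavy-tailed rotated coordinates (the 22% figure is the post's claim, not reproduced here):

```python
import numpy as np

def learn_codebook_1d(samples, bits=3, iters=50):
    """1D k-means on actual rotated coordinates, instead of assuming
    they are Gaussian. Returns 2**bits sorted reconstruction levels."""
    k = 2 ** bits
    # init centroids at evenly spaced quantiles of the data
    codebook = np.quantile(samples, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - codebook).argmin(axis=1)
        for j in range(k):
            sel = samples[idx == j]
            if len(sel):                 # keep centroid if cluster empties
                codebook[j] = sel.mean()
    return np.sort(codebook)

rng = np.random.default_rng(3)
coords = rng.laplace(size=20_000)        # heavier tails than Gaussian
cb = learn_codebook_1d(coords)
idx = np.abs(coords[:, None] - cb).argmin(axis=1)
mse = ((coords - cb[idx]) ** 2).mean()
```

Because Lloyd's algorithm never increases distortion from its init, training on the real distribution can only match or beat fixed Gaussian-optimal levels on that distribution.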
Asymmetric K/V allocation for KV caches: Keys are more sensitive to quantization than values because softmax amplifies errors in K. Using 4-bit keys / 2-bit values gives 0.995 key cosine similarity at the same storage as uniform 3-bit. Free quality win on the dimension that matters.
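The storage claim is simple arithmetic: 4-bit keys plus 2-bit values average 3 bits, the same as uniform 3-bit. A quick check with a hypothetical model shape (32 layers, 8 KV heads, head dim 128; not any specific model):

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, k_bits, v_bits):
    """Per-token KV cache size: one key and one value vector per layer,
    each holding n_kv_heads * head_dim numbers, at the given bit widths."""
    nums_per_side = n_layers * n_kv_heads * head_dim
    return nums_per_side * (k_bits + v_bits) / 8

uniform = kv_bytes_per_token(32, 8, 128, k_bits=3, v_bits=3)
asym = kv_bytes_per_token(32, 8, 128, k_bits=4, v_bits=2)
# identical storage: (4 + 2) / 2 == 3 bits per number on average
```

So the K4/V2 split reallocates bits toward the softmax-sensitive side at zero storage cost, which is what makes it a "free" win.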
The paper is under review at IEEE TAI. Code: https://github.com/ahb-sjsu/turboquant-pro (pip install turboquant-pro)
Happy to discuss the methodology or the cosine-vs-recall finding — that's the part I think has the broadest implications beyond our specific use case.
Lolzyyy@reddit
yet another ai post on crazy made up compression that has wrong numbers cause the model hallucinated half of it, wow!
ahbond@reddit (OP)
Benchmarks being incomplete != hallucination.
Lolzyyy@reddit
suuuuure
ahbond@reddit (OP)
Yes, I'm sure. It's pip installable. You could verify it for yourself in five minutes, if you knew how.
Lolzyyy@reddit
Your post is fully AI-written, your readme is AI-written, legit one glance: the supported models (as usual with AI readmes, the models are ancient) are gemma 4 12b and 27b, which don't even exist. It's gemma 3 models there...
ahbond@reddit (OP)
Update: turns out Gemma 4 does exist — gemma-4-26B-A4B-it (MoE, 26B total / 4B activated, 262K context). Just showed up on r/LocalLLaMA with people running it at 245K context in llama.cpp.
Incidentally, the Gemma 4 long-context use case is exactly where KV cache compression matters most. At 262K context, the KV cache alone is 192 GB in fp16. The r/LocalLLaMA poster uses -ctk q8_0 -ctv q8_0 (96 GB). TurboQuant's asymmetric K4/V3 would bring that to 43.5 GB — saving 52.5 GB. That's the difference between two A100s and one.
Cheers,
Andrew.
Lolzyyy@reddit
Yes, Gemma 4 exists, but not in the variants you mentioned in the readme. That's what I said...
ahbond@reddit (OP)
Fair catch on the model names: Gemma 4 doesn't exist, those should be Gemma 2. Fixed in the latest commit. That was a hallucination from using Claude to help with the README.
You're right that AI was used in the development process (it's credited as co-author on commits). The benchmarks, experimental results, and paper arguments are mine. The model registry clearly shows where human review failed. I appreciate you catching it.
-Cubie-@reddit
Okay, but you're:

1. Only doing that 5x oversampling + exact rescoring on your approach, and not on any of the other approaches. If this is how you want it to be used, then why not benchmark all approaches like this?
2. Making a big point about how you should always use recall@10 to measure instead of cosine similarity, but then most of your benchmarks use cosine similarity.
I just can't take this seriously with these mistakes. You really submitted this as a paper? People are going to have to review this? I really pity folks who work on scientific journals, they must be getting buried in 100% AI-generated papers from people who don't understand the work.
ahbond@reddit (OP)
You're right on both counts, and thank you.
These are exactly the kind of criticisms that make a paper better.
On reranking only our method: Fixed.
We ran all six methods with identical 5x oversampling + exact reranking on 50K production embeddings:
| Method | Compression | Single-stage R@10 | 5x rerank R@10 |
|---|---|---|---|
| Scalar int8 | 4x | 99.0% | 100% |
| TQ3 | 10.5x | 83.4% | 100% |
| PCA-384 + TQ3 | 27.7x | 79.2% | 99.8% |
| Binary | 32x | 54.4% | 85.6% |
| PQ (M=16) | 256x | 38.4% | 73.6% |
The dominance holds under reranking. Binary at 32x only reaches 85.6% with the same treatment.
On cosine-first tables: Also fixed. Every table in the paper now has Recall@10 as the first quality column, cosine second. Fair point.
Thanks for the pushback, the paper is stronger for it.
Cheers,
Andrew.
jkflying@reddit
Yeah this is typical of current AI-driven research, the models try to make the author happy by setting up the experiments to show better results for their 'findings' even if it isn't a fair comparison. I had this happen to me when benchmarking stuff, you have to be super careful to check exactly what it is comparing, and that it is a fair comparison. The sycophancy runs deep, and can end up being very sneaky.
Luke2642@reddit
Interesting point about cosine similarity breaking down. It would work a lot better if all high-dimensional embedding spaces were subdivided into n-sphere groups. Denoising and regularisation work a lot better too, as does cosine similarity.
n=5 is the peak of sphere-volume-to-cube efficiency before you lose volume to those pointy hypercorners, and all the mass clusters around the equator of the sphere.