Open-source embeddings give better results than OpenAI and Cohere on cross-lingual retrieval of EPG data for a low-resource language
Posted by FigAltruistic2086@reddit | LocalLLaMA | 5 comments
TL;DR: On Armenian cross-lingual retrieval, free local models beat every paid API. On EN↔HY, LaBSE R@1 = 0.83 vs OpenAI R@1 = 0.21 (same pairs, same 245 candidates). OpenAI is best on EN↔RU (0.89), but fails to generalize to Armenian. Bonus: mean cosine can disagree sharply with R@1 — measure retrieval, not alignment.
I'm building a recommendation system for an IPTV operator in a CIS country. Most programs have English, Russian, and Armenian titles — Armenian has its own alphabet (non-Latin, non-Cyrillic), and most embedding models have seen very little of it during training.
Started with OpenAI text-embedding-3-large as the baseline. My assumption going in: commercial embeddings are the best option, just pricey. Bi-encoder retrieval looked great — until Armenian titles started coming back wrong. Quietly, systematically wrong.
That kicked off a full benchmark: 19 runs across 18 unique checkpoints — 14 local (SentenceTransformers + FlagEmbedding; bge-m3 tested on both) and 5 paid APIs — on 245 trilingual triplets (238 from TMDB + 7 hand-written EPG) plus 783 abbreviation duplets. Sample size is modest — absolute scores may not generalize to noisier real-world EPG, but relative ranking was stable (Spearman ρ = 0.80 between a 7-triplet pilot and the full 245-triplet set).
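(For the curious: the stability check is just Spearman's ρ between the two model rankings. Minimal sketch below with made-up ranks — the real rankings are in the repo.)

```python
from scipy.stats import spearmanr

# Illustration only: model ranks from the 7-triplet pilot vs the full
# 245-triplet run. These numbers are made up, not the real rankings.
pilot_ranks = [1, 2, 4, 3, 6, 5, 8, 7]
full_ranks  = [1, 3, 2, 4, 5, 6, 7, 8]
rho, _ = spearmanr(pilot_ranks, full_ranks)
print(rho)  # rho near 1 => the relative ordering held between pilot and full set
```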
I was very wrong. For a low-resource language with a unique script, free local models crush paid APIs — the retrieval winner is LaBSE (2022), a 4-year-old free model beating every paid API from 2024–2025. And a reminder that's easy to miss in practice: alignment (mean cosine) and retrieval (R@1 / MRR) can rank the same models completely differently — e5-large-v2 is #5 by alignment but #17 by R@1, because it maps every non-Latin pair into one dense cluster, so cosine stays high but discrimination is gone. If you work with anything else off the Latin/Cyrillic path, this might be useful.
Alignment vs Retrieval: two different stories
We measured two things:
- Alignment (mean cosine between correct translation pairs) — how close are the right answers?
- Retrieval R@1 (find the correct match among 245 candidates) — can the model actually pick the right one?
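Both metrics fall out of the same N×N similarity matrix. A minimal sketch (illustrative, not the repo's exact code; embeddings assumed unit-norm):

```python
import numpy as np

def eval_pair(src: np.ndarray, tgt: np.ndarray):
    """src[i] and tgt[i]: unit-norm embeddings of the same title in two
    languages; every row of tgt is a candidate (N = 245 here)."""
    sims = src @ tgt.T                        # N x N cosine matrix
    alignment = float(np.diag(sims).mean())   # mean cosine of correct pairs
    # rank of each correct candidate (ties count against the model)
    ranks = (sims >= np.diag(sims)[:, None]).sum(axis=1)
    r_at_1 = float((ranks == 1).mean())
    mrr = float((1.0 / ranks).mean())
    return alignment, r_at_1, mrr
```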
These rankings don't match:
| Model | Alignment rank | R@1 rank | Shift |
|---|---|---|---|
| e5-large-v2 | #5 | #17 | +12 |
| e5-large | #6 | #18 | +12 |
| bge-m3 | #15 | #4 | -11 |
| LaBSE | #8 | #1 | -7 |
e5-large and e5-large-v2 are monolingual traps. They map all non-Latin text into one dense cluster — cosine is high for every pair, but R@1 = 0.12-0.16. The model "matches" everything equally, which means it matches nothing.
LaBSE, purpose-built in 2022 for cross-lingual sentence retrieval (parallel corpora + contrastive loss), has moderate alignment (0.746) but the best retrieval in the benchmark (R@1 = 0.834, MRR = 0.864). Task-fit matters more than recency — a 2022 model designed for exactly this job still beats general-purpose 2024/2025 APIs.
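Reproducing the LaBSE setup is plain SentenceTransformers (minimal sketch; the Armenian strings are stand-ins, not benchmark data):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

titles_en = ["The Matrix", "Friends"]
titles_hy = ["Մատրիցա", "Ընկերներ"]  # stand-in Armenian titles

emb_en = model.encode(titles_en, normalize_embeddings=True)
emb_hy = model.encode(titles_hy, normalize_embeddings=True)

sims = emb_en @ emb_hy.T    # cosine similarities (embeddings are unit-norm)
print(sims.argmax(axis=1))  # a good cross-lingual model picks the diagonal: [0 1]
```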
Results — Retrieval ranking (sorted by MRR)
Note: E5 family models (multilingual-e5-*, e5-*) were run without the documented "query: " prefix, so their scores are a lower bound — real performance may be higher.
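For reference, the documented usage looks like this (sketch per the intfloat model cards; the Armenian string is a stand-in):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# Model card: symmetric tasks (title <-> title) should prefix BOTH sides
# with "query: "; the benchmark runs omitted this.
en = model.encode(["query: The Matrix"], normalize_embeddings=True)
hy = model.encode(["query: Մատրիցա"], normalize_embeddings=True)  # stand-in title
print((en @ hy.T).item())
```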
| # | Model | R@1 | MRR | Cost |
|---|---|---|---|---|
| 1 | LaBSE | 0.834 | 0.864 | free |
| 2 | multilingual-e5-large | 0.802 | 0.837 | free |
| 3 | armenian-text-embeddings-1 | 0.778 | 0.816 | free |
| 4 | bge-m3 (SentenceTransformers) | 0.766 | 0.807 | free |
| 5 | bge-m3 (FlagEmbedding, fp16) | 0.766 | 0.807 | free |
| 6 | multilingual-e5-base | 0.754 | 0.794 | free |
| 7 | jina-embeddings-v3 (API) | 0.756 | 0.791 | $$ |
| 8 | embed-multilingual-v3.0 (Cohere 2023) | 0.731 | 0.783 | $$ |
| 9 | gte-multilingual-base | 0.705 | 0.752 | free |
| 10 | voyage-multilingual-2 | 0.684 | 0.730 | $$ |
| 11 | paraphrase-multilingual-mpnet-base-v2 | 0.632 | 0.690 | free |
| 12 | distiluse-base-multilingual-cased | 0.629 | 0.688 | free |
| 13 | jina-embeddings-v3 (local ST) | 0.605 | 0.659 | free |
| 14 | embed-v4.0 (Cohere 2025) | 0.556 | 0.607 | $$ |
| 15 | paraphrase-multilingual-MiniLM-L12-v2 | 0.540 | 0.597 | free |
| 16 | text-embedding-3-large (OpenAI) | 0.438 | 0.482 | $$ |
| 17 | e5-large-v2 | 0.159 | 0.211 | free (trap) |
| 18 | e5-large | 0.121 | 0.169 | free (trap) |
| 19 | all-MiniLM-L6-v2 | 0.031 | 0.063 | free (EN only) |
Top 5 by retrieval — all free, all local.
OpenAI: strong on high-resource pairs, fails to generalize
OpenAI text-embedding-3-large achieves the best R@1 on EN↔RU (0.894) in the benchmark.
But performance does not transfer to Armenian:
- EN↔HY: R@1 = 0.210
- RU↔HY: R@1 = 0.210
Same model, same task, same candidate pool — but a 4× drop depending on script.
Why? The cl100k_base tokenizer has zero Armenian tokens in its 100K vocabulary (verified — no token decodes to the Armenian Unicode range U+0530–U+058F). Armenian text is tokenized byte-by-byte (tok/byte = 1.00). One Armenian title = 37 tokens vs 6 tokens with SentencePiece. That's ~10× token inflation, and you're paying per token for worse results.
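Both claims are easy to re-check with tiktoken (sketch; the Armenian title is a stand-in):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# 1) No vocab entry decodes into the Armenian block U+0530-U+058F
armenian_tokens = []
for t in range(enc.n_vocab):
    try:
        b = enc.decode_single_token_bytes(t)
    except KeyError:
        continue  # a few ids in the range are unassigned
    s = b.decode("utf-8", errors="ignore")
    if any("\u0530" <= ch <= "\u058f" for ch in s):
        armenian_tokens.append(t)
print(len(armenian_tokens))  # expect 0

# 2) Armenian falls back to byte-level tokens (~1 token per byte)
title = "Հարրի Փոթեր"  # stand-in Armenian title
n_tok = len(enc.encode(title))
n_byte = len(title.encode("utf-8"))
print(n_tok, n_byte, round(n_tok / n_byte, 2))  # ratio ≈ 1.0
```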
Cohere v4 regressed vs v3
Cohere embed-v4.0 (2025) vs embed-multilingual-v3.0 (2023):
- Alignment: 0.472 vs 0.749
- R@1: 0.556 vs 0.731
Newer model, worse results on low-resource languages. Don't blindly upgrade.
Practical recommendations
| Need | Model | MRR | VRAM |
|---|---|---|---|
| Best retrieval | LaBSE | 0.864 | ~1.9 GB |
| Best balance | multilingual-e5-large | 0.837 | ~2.2 GB |
| Smallest | multilingual-e5-base | 0.794 | ~1.1 GB |
| API | jina-embeddings-v3 | 0.791 | — |
All local models run fine on a single RTX 4000 (20GB) or even CPU.
What NOT to use
- Monolingual e5 (`e5-large`, `e5-large-v2`) — alignment looks great (0.76-0.78), R@1 is garbage (0.12-0.16). Classic trap.
- all-MiniLM-L6-v2 — English only, R@1 = 0.03
- OpenAI — great for EN-RU, near-random retrieval on Armenian (R@1 ≈ 0.21)
- Cohere v4 — regression vs v3
Repo
GitHub: s1mb1o/epg-embedding-benchmark. Everything open: code, data, results. MIT.
Anyone running cross-lingual matching on EPG/TV metadata in other non-Latin markets (e.g. Arabic, Thai, Georgian)? Curious whether the alignment vs retrieval gap is as dramatic there.
Hope you find this useful — and if I missed something or got it wrong, point it out so I can improve.
mr_Owner@reddit
Amazing, thanks! Do I understand correctly that MiniLM-L12 is the lightest one to run locally, instead of all-MiniLM, which your post suggests is broken? I wonder how that's possible, since that one is used a lot?
FigAltruistic2086@reddit (OP)
The "all-" in `all-MiniLM` means "all-purpose English", not "all languages": the "all-*" models were trained on all available (English) training data; that's the authors' naming scheme. So `all-MiniLM-L6-v2` (22M, English-only) was fine-tuned on English pairs and is excellent at English STS / RAG. The MiniLM-L12 on my chart is `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` (118M, multilingual), a different model entirely: 250K-token XLM-R-derived tokenizer + parallel-corpus training. It's the lightest cross-lingual local option here.
FigAltruistic2086@reddit (OP)
Per-pair R@1 heatmap — OpenAI's Armenian columns (EN↔HY, RU↔HY) are the visible failure. LaBSE and multilingual-e5 stay green across all three pairs.
mr_Owner@reddit
Could you put the llm parameters size also to see in one go the trade-off to make vs speed and size?
FigAltruistic2086@reddit (OP)
Done. Split into two panels because the cost axes aren't comparable across local vs paid:
Local (left):
- X = ms/text on M2 Max
- Y = worst-pair R@1 (min over the three pairs)
- marker area ∝ VRAM (fp32) ≈ params (in millions) × 4 / 1024 GB; I picked it over raw param count
Paid (right):
- X = USD per 1M input tokens (vendor list price)
- Y = worst-pair R@1 (min over the three pairs)
Worst-pair instead of mean R@1 because averaging hides the catastrophic failures — OpenAI's 0.21 R@1 on EN↔HY drowns in the 0.89 EN↔RU number; min-over-pairs surfaces it. Pareto frontier drawn on each panel.
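Both derived numbers are one-liners (sketch; LaBSE's ~471M parameter count is my figure from the model card, used for the fp32 estimate):

```python
# Worst-pair R@1: min over the three language pairs (mean would hide the failure)
pair_r1 = {"en-ru": 0.894, "en-hy": 0.210, "ru-hy": 0.210}  # OpenAI's numbers
print(min(pair_r1.values()))  # 0.210; the mean, 0.438, is the table's headline number

# Marker area: fp32 VRAM estimate, params in millions x 4 bytes
def vram_gb(params_m: float) -> float:
    return params_m * 4 / 1024

print(round(vram_gb(471), 2))  # LaBSE (~471M params) -> 1.84, the ~1.9 GB in the table
```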
Two bbox callouts on the local side mark zones to avoid:
- `e5-large`/`e5-large-v2` look healthy on alignment but worst-pair R@1 collapses to ≈ 0.08 — the model just clusters non-Latin text into one region.
- `all-MiniLM-L6-v2`, an English-only 22M model, broken across every metric.