Open-source embeddings give better results than OpenAI and Cohere on cross-lingual retrieval of EPG data for a low-resource language
Posted by FigAltruistic2086@reddit | LocalLLaMA | 5 comments
TL;DR: On Armenian cross-lingual retrieval, free local models beat every paid API. On EN↔HY, LaBSE R@1 = 0.83 vs OpenAI R@1 = 0.21 (same pairs, same 245 candidates). OpenAI is best on EN↔RU (0.89), but fails to generalize to Armenian. Bonus: mean cosine can disagree sharply with R@1 — measure retrieval, not alignment.
I'm building a recommendation system for an IPTV operator in a CIS country. Most programs have English, Russian, and Armenian titles — Armenian has its own alphabet (non-Latin, non-Cyrillic), and most embedding models have seen very little of it during training.
Started with OpenAI text-embedding-3-large as the baseline. My assumption going in: commercial embeddings are the best option, just pricey. Bi-encoder retrieval looked great — until Armenian titles started coming back wrong. Quietly, systematically wrong.
That kicked off a full benchmark: 19 runs across 18 unique checkpoints — 14 local (SentenceTransformers + FlagEmbedding; bge-m3 tested on both) and 5 paid APIs — on 245 trilingual triplets (238 from TMDB + 7 hand-written EPG) plus 783 abbreviation duplets. Sample size is modest — absolute scores may not generalize to noisier real-world EPG, but relative ranking was stable (Spearman ρ = 0.80 between a 7-triplet pilot and the full 245-triplet set).
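(For the curious: the stability check is just Spearman's ρ between the two model rankings. Minimal sketch below with made-up ranks — the real rankings are in the repo.)

```python
from scipy.stats import spearmanr

# Illustration only: model ranks from the 7-triplet pilot vs the full
# 245-triplet run. These numbers are made up, not the real rankings.
pilot_ranks = [1, 2, 4, 3, 6, 5, 8, 7]
full_ranks  = [1, 3, 2, 4, 5, 6, 7, 8]
rho, _ = spearmanr(pilot_ranks, full_ranks)
print(rho)  # rho near 1 => the relative ordering held between pilot and full set
```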
I was very wrong. For a low-resource language with a unique script, free local models crush paid APIs — the retrieval winner is LaBSE (2022), a 4-year-old free model beating every paid API from 2024–2025. And a reminder that's easy to miss in practice: alignment (mean cosine) and retrieval (R@1 / MRR) can rank the same models completely differently — e5-large-v2 is #5 by alignment but #17 by R@1, because it maps every non-Latin pair into one dense cluster, so cosine stays high but discrimination is gone. If you work with anything else off the Latin/Cyrillic path, this might be useful.
Alignment vs Retrieval: two different stories
We measured two things:
- Alignment (mean cosine between correct translation pairs) — how close are the right answers?
- Retrieval R@1 (find the correct match among 245 candidates) — can the model actually pick the right one?
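Both metrics fall out of the same N×N similarity matrix. A minimal sketch (illustrative, not the repo's exact code; embeddings assumed unit-norm):

```python
import numpy as np

def eval_pair(src: np.ndarray, tgt: np.ndarray):
    """src[i] and tgt[i]: unit-norm embeddings of the same title in two
    languages; every row of tgt is a candidate (N = 245 here)."""
    sims = src @ tgt.T                        # N x N cosine matrix
    alignment = float(np.diag(sims).mean())   # mean cosine of correct pairs
    # rank of each correct candidate (ties count against the model)
    ranks = (sims >= np.diag(sims)[:, None]).sum(axis=1)
    r_at_1 = float((ranks == 1).mean())
    mrr = float((1.0 / ranks).mean())
    return alignment, r_at_1, mrr
```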
These rankings don't match:
| Model | Alignment rank | R@1 rank | Shift |
|---|---|---|---|
| e5-large-v2 | #5 | #17 | +12 |
| e5-large | #6 | #18 | +12 |
| bge-m3 | #15 | #4 | -11 |
| LaBSE | #8 | #1 | -7 |
e5-large and e5-large-v2 are monolingual traps. They map all non-Latin text into one dense cluster — cosine is high for every pair, but R@1 = 0.12-0.16. The model "matches" everything equally, which means it matches nothing.
LaBSE, purpose-built in 2022 for cross-lingual sentence retrieval (parallel corpora + contrastive loss), has moderate alignment (0.746) but the best retrieval in the benchmark (R@1 = 0.834, MRR = 0.864). Task-fit matters more than recency — a 2022 model designed for exactly this job still beats general-purpose 2024/2025 APIs.
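Reproducing the LaBSE setup is plain SentenceTransformers (minimal sketch; the Armenian strings are stand-ins, not benchmark data):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

titles_en = ["The Matrix", "Friends"]
titles_hy = ["Մատրիցա", "Ընկերներ"]  # stand-in Armenian titles

emb_en = model.encode(titles_en, normalize_embeddings=True)
emb_hy = model.encode(titles_hy, normalize_embeddings=True)

sims = emb_en @ emb_hy.T    # cosine similarities (embeddings are unit-norm)
print(sims.argmax(axis=1))  # a good cross-lingual model picks the diagonal: [0 1]
```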
Results — Retrieval ranking (sorted by MRR)
Note: E5 family models (multilingual-e5-*, e5-*) were run without the documented "query: " prefix, so their scores are a lower bound — real performance may be higher.
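For reference, the documented usage looks like this (sketch per the intfloat model cards; the Armenian string is a stand-in):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# Model card: symmetric tasks (title <-> title) should prefix BOTH sides
# with "query: "; the benchmark runs omitted this.
en = model.encode(["query: The Matrix"], normalize_embeddings=True)
hy = model.encode(["query: Մատրիցա"], normalize_embeddings=True)  # stand-in title
print((en @ hy.T).item())
```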
| # | Model | R@1 | MRR | Cost |
|---|---|---|---|---|
| 1 | LaBSE | 0.834 | 0.864 | free |
| 2 | multilingual-e5-large | 0.802 | 0.837 | free |
| 3 | armenian-text-embeddings-1 | 0.778 | 0.816 | free |
| 4 | bge-m3 (SentenceTransformers) | 0.766 | 0.807 | free |
| 5 | bge-m3 (FlagEmbedding, fp16) | 0.766 | 0.807 | free |
| 6 | multilingual-e5-base | 0.754 | 0.794 | free |
| 7 | jina-embeddings-v3 (API) | 0.756 | 0.791 | $$ |
| 8 | embed-multilingual-v3.0 (Cohere 2023) | 0.731 | 0.783 | $$ |
| 9 | gte-multilingual-base | 0.705 | 0.752 | free |
| 10 | voyage-multilingual-2 | 0.684 | 0.730 | $$ |
| 11 | paraphrase-multilingual-mpnet-base-v2 | 0.632 | 0.690 | free |
| 12 | distiluse-base-multilingual-cased | 0.629 | 0.688 | free |
| 13 | jina-embeddings-v3 (local ST) | 0.605 | 0.659 | free |
| 14 | embed-v4.0 (Cohere 2025) | 0.556 | 0.607 | $$ |
| 15 | paraphrase-multilingual-MiniLM-L12-v2 | 0.540 | 0.597 | free |
| 16 | text-embedding-3-large (OpenAI) | 0.438 | 0.482 | $$ |
| 17 | e5-large-v2 | 0.159 | 0.211 | free (trap) |
| 18 | e5-large | 0.121 | 0.169 | free (trap) |
| 19 | all-MiniLM-L6-v2 | 0.031 | 0.063 | free (EN only) |
Top 5 by retrieval — all free, all local.
OpenAI: strong on high-resource pairs, fails to generalize
OpenAI text-embedding-3-large achieves the best R@1 on EN↔RU (0.894) in the benchmark.
But performance does not transfer to Armenian:
- EN↔HY: R@1 = 0.210
- RU↔HY: R@1 = 0.210
Same model, same task, same candidate pool — but a 4× drop depending on script.
Why? The cl100k_base tokenizer has zero Armenian tokens in its 100K vocabulary (verified — no token decodes to the Armenian Unicode range U+0530–U+058F). Armenian text is tokenized byte-by-byte (tok/byte = 1.00). One Armenian title = 37 tokens vs 6 tokens with SentencePiece. That's ~10× token inflation, and you're paying per token for worse results.
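Both claims are easy to re-check with tiktoken (sketch; the Armenian title is a stand-in):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# 1) No vocab entry decodes into the Armenian block U+0530-U+058F
armenian_tokens = []
for t in range(enc.n_vocab):
    try:
        b = enc.decode_single_token_bytes(t)
    except KeyError:
        continue  # a few ids in the range are unassigned
    s = b.decode("utf-8", errors="ignore")
    if any("\u0530" <= ch <= "\u058f" for ch in s):
        armenian_tokens.append(t)
print(len(armenian_tokens))  # expect 0

# 2) Armenian falls back to byte-level tokens (~1 token per byte)
title = "Հարրի Փոթեր"  # stand-in Armenian title
n_tok = len(enc.encode(title))
n_byte = len(title.encode("utf-8"))
print(n_tok, n_byte, round(n_tok / n_byte, 2))  # ratio ≈ 1.0
```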
Cohere v4 regressed vs v3
Cohere embed-v4.0 (2025) vs embed-multilingual-v3.0 (2023):
- Alignment: 0.472 vs 0.749
- R@1: 0.556 vs 0.731
Newer model, worse results on low-resource languages. Don't blindly upgrade.
Practical recommendations
| Need | Model | MRR | VRAM |
|---|---|---|---|
| Best retrieval | LaBSE | 0.864 | ~1.9 GB |
| Best balance | multilingual-e5-large | 0.837 | ~2.2 GB |
| Smallest | multilingual-e5-base | 0.794 | ~1.1 GB |
| API | jina-embeddings-v3 | 0.791 | — |
All local models run fine on a single RTX 4000 (20GB) or even CPU.
What NOT to use
- Monolingual e5 (`e5-large`, `e5-large-v2`) — alignment looks great (0.76-0.78), R@1 is garbage (0.12-0.16). Classic trap.
- all-MiniLM-L6-v2 — English only, R@1 = 0.03
- OpenAI — great for EN-RU, near-random retrieval on Armenian (R@1 ≈ 0.21)
- Cohere v4 — regression vs v3
Repo
GitHub: s1mb1o/epg-embedding-benchmark. Everything open: code, data, results. MIT.
Anyone running cross-lingual matching on EPG/TV metadata in other non-Latin markets (e.g. Arabic, Thai, Georgian)? Curious whether the alignment vs retrieval gap is as dramatic there.
Hope you find this useful — and if I missed something or got it wrong, point it out so I can improve.
mr_Owner@reddit
Amazing, thanks! Do I understand correctly that MiniLM-L12 is the lightest one to run locally, instead of all-MiniLM, which your post suggests is broken? I wonder how that's possible, since that one is used a lot?
FigAltruistic2086@reddit (OP)
The "all-" in `all-MiniLM` means "all-purpose English", not "all languages": the "all-*" models were trained on all available (English) training data; that's the authors' naming scheme. So `all-MiniLM-L6-v2` (22M, English-only) was fine-tuned on English pairs and is excellent at English STS / RAG. The MiniLM-L12 on my chart is `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` (118M, multilingual), a different model entirely: 250K-token XLM-R-derived tokenizer + parallel-corpus training. It's the lightest cross-lingual local option here.
FigAltruistic2086@reddit (OP)
Per-pair R@1 heatmap — OpenAI's Armenian columns (EN↔HY, RU↔HY) are the visible failure. LaBSE and multilingual-e5 stay green across all three pairs.
mr_Owner@reddit
Could you put the llm parameters size also to see in one go the trade-off to make vs speed and size?
FigAltruistic2086@reddit (OP)
Done. Split into two panels because the cost axes aren't comparable across local vs paid:
Local (left):
- X = ms/text on M2 Max
- Y = worst-pair R@1 (min over the three pairs)
- marker area ∝ VRAM (fp32) ≈ params (in millions) × 4 / 1024 GB; I picked it over raw param count
Paid (right):
- X = USD per 1M input tokens (vendor list price)
- Y = worst-pair R@1 (min over the three pairs)
Worst-pair instead of mean R@1 because averaging hides the catastrophic failures — OpenAI's 0.21 R@1 on EN↔HY drowns in the 0.89 EN↔RU number; min-over-pairs surfaces it. Pareto frontier drawn on each panel.
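Both derived numbers are one-liners (sketch; LaBSE's ~471M parameter count is my figure from the model card, used for the fp32 estimate):

```python
# Worst-pair R@1: min over the three language pairs (mean would hide the failure)
pair_r1 = {"en-ru": 0.894, "en-hy": 0.210, "ru-hy": 0.210}  # OpenAI's numbers
print(min(pair_r1.values()))  # 0.210; the mean, 0.438, is the table's headline number

# Marker area: fp32 VRAM estimate, params in millions x 4 bytes
def vram_gb(params_m: float) -> float:
    return params_m * 4 / 1024

print(round(vram_gb(471), 2))  # LaBSE (~471M params) -> 1.84, the ~1.9 GB in the table
```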
Two bbox callouts on the local side mark zones to avoid:
- `e5-large`/`e5-large-v2` look healthy on alignment but worst-pair R@1 collapses to ≈ 0.08 — the model just clusters non-Latin text into one region.
- `all-MiniLM-L6-v2`, an English-only 22M model, broken across every metric.