Open-source embeddings give better results than OpenAI and Cohere on cross-lingual retrieval of EPG data for a low-resource language

Posted by FigAltruistic2086@reddit | LocalLLaMA | View on Reddit | 5 comments

TL;DR: On Armenian cross-lingual retrieval, free local models beat every paid API. On EN↔HY, LaBSE R@1 = 0.83 vs OpenAI R@1 = 0.21 (same pairs, same 245 candidates). OpenAI is best on EN↔RU (0.89), but fails to generalize to Armenian. Bonus: mean cosine can disagree sharply with R@1 — measure retrieval, not alignment.

I'm building a recommendation system for an IPTV operator in a CIS country. Most programs have English, Russian, and Armenian titles — Armenian has its own alphabet (non-Latin, non-Cyrillic), and most embedding models have seen very little of it during training.

Started with OpenAI text-embedding-3-large as the baseline. My assumption going in: commercial embeddings are the best option, just pricey. Bi-encoder retrieval looked great — until Armenian titles started coming back wrong. Quietly, systematically wrong.

That kicked off a full benchmark: 19 runs across 18 unique checkpoints — 14 local (SentenceTransformers + FlagEmbedding; bge-m3 tested on both) and 5 paid APIs — on 245 trilingual triplets (238 from TMDB + 7 hand-written EPG) plus 783 abbreviation duplets. Sample size is modest — absolute scores may not generalize to noisier real-world EPG, but relative ranking was stable (Spearman ρ = 0.80 between a 7-triplet pilot and the full 245-triplet set).
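
That ρ is just a Spearman rank correlation between the two model orderings; a minimal sketch of the check, with made-up rank lists:

```python
from scipy.stats import spearmanr

# Made-up illustration: each model's rank (1 = best) in the 7-triplet
# pilot vs. the full 245-triplet run, listed in the same model order.
pilot_ranks = [1, 2, 3, 4, 5, 6, 7, 8]
full_ranks  = [1, 3, 2, 4, 6, 5, 8, 7]

rho, p_value = spearmanr(pilot_ranks, full_ranks)
print(f"Spearman rho = {rho:.2f}")  # the post's real rankings give 0.80
```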

I was very wrong. For a low-resource language with a unique script, free local models crush paid APIs — the retrieval winner is LaBSE (2022), a 4-year-old free model beating every paid API from 2024–2025.

And a reminder that's easy to miss in practice: alignment (mean cosine) and retrieval (R@1 / MRR) can rank the same models completely differently — e5-large-v2 is #5 by alignment but #17 by R@1, because it maps every non-Latin pair into one dense cluster, so cosine stays high but discrimination is gone. If you work with anything else off the Latin/Cyrillic path, this might be useful.

Alignment vs Retrieval: two different stories

We measured two things:

- Alignment: mean cosine similarity between embeddings of translations of the same title.
- Retrieval: R@1 and MRR when one language's title queries the full 245-candidate pool in another language.

These rankings don't match:

| Model | Alignment rank | R@1 rank | Shift |
|---|---|---|---|
| e5-large-v2 | #5 | #17 | +12 |
| e5-large | #6 | #18 | +12 |
| bge-m3 | #15 | #4 | -11 |
| LaBSE | #8 | #1 | -7 |

e5-large and e5-large-v2 are monolingual traps. They map all non-Latin text into one dense cluster — cosine is high for every pair, but R@1 = 0.12-0.16. The model "matches" everything equally, which means it matches nothing.

LaBSE, purpose-built in 2022 for cross-lingual sentence retrieval (parallel corpora + contrastive loss), has moderate alignment (0.746) but the best retrieval in the benchmark (R@1 = 0.834, MRR = 0.864). Task-fit matters more than recency — a 2022 model designed for exactly this job still beats general-purpose 2024/2025 APIs.
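
To make the distinction concrete, here's a minimal sketch of both metrics (my own notation, not code from the benchmark repo). `A` and `B` are L2-normalized embedding matrices where row i of each is the same title in two languages:

```python
import numpy as np

def alignment_and_retrieval(A: np.ndarray, B: np.ndarray):
    """A, B: (n, d) L2-normalized embeddings; row i of A and B is the same title."""
    sims = A @ B.T                        # (n, n) cosine similarities
    true = np.diag(sims)                  # cosine of each true translation pair
    alignment = float(true.mean())        # "alignment": mean cosine over true pairs
    # Rank of the true match among all n candidates (1 = best; ties count against us).
    ranks = (sims >= true[:, None]).sum(axis=1)
    r_at_1 = float((ranks == 1).mean())
    mrr = float((1.0 / ranks).mean())
    return alignment, r_at_1, mrr
```

A model can keep `alignment` high while `r_at_1` collapses: a large diagonal says nothing about whether the true pair is the row-wise max, which is exactly the e5 failure mode.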

Results — Retrieval ranking (sorted by MRR)

Note: E5 family models (multilingual-e5-*, e5-*) were run without the documented "query: " prefix, so their scores are a lower bound — real performance may be higher (see the prefix sketch after the table).

| # | Model | R@1 | MRR | Cost |
|---|---|---|---|---|
| 1 | LaBSE | 0.834 | 0.864 | free |
| 2 | multilingual-e5-large | 0.802 | 0.837 | free |
| 3 | armenian-text-embeddings-1 | 0.778 | 0.816 | free |
| 4 | bge-m3 (SentenceTransformers) | 0.766 | 0.807 | free |
| 5 | bge-m3 (FlagEmbedding, fp16) | 0.766 | 0.807 | free |
| 6 | multilingual-e5-base | 0.754 | 0.794 | free |
| 7 | jina-embeddings-v3 (API) | 0.756 | 0.791 | $$ |
| 8 | embed-multilingual-v3.0 (Cohere, 2023) | 0.731 | 0.783 | $$ |
| 9 | gte-multilingual-base | 0.705 | 0.752 | free |
| 10 | voyage-multilingual-2 | 0.684 | 0.730 | $$ |
| 11 | paraphrase-multilingual-mpnet-base-v2 | 0.632 | 0.690 | free |
| 12 | distiluse-base-multilingual-cased | 0.629 | 0.688 | free |
| 13 | jina-embeddings-v3 (local, SentenceTransformers) | 0.605 | 0.659 | free |
| 14 | embed-v4.0 (Cohere, 2025) | 0.556 | 0.607 | $$ |
| 15 | paraphrase-multilingual-MiniLM-L12-v2 | 0.540 | 0.597 | free |
| 16 | text-embedding-3-large (OpenAI) | 0.438 | 0.482 | $$ |
| 17 | e5-large-v2 | 0.159 | 0.211 | free (trap) |
| 18 | e5-large | 0.121 | 0.169 | free (trap) |
| 19 | all-MiniLM-L6-v2 | 0.031 | 0.063 | free (EN only) |

Top 5 by retrieval — all free, all local.
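
Re the E5 note above: the documented usage prepends instruction prefixes before encoding. A minimal sketch (titles are made-up examples, not benchmark data):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# E5 models are trained with "query: " / "passage: " prefixes;
# encoding raw text (as in this benchmark) understates their scores.
queries  = ["query: " + t for t in ["The Godfather"]]
passages = ["passage: " + t for t in ["Կնքահայրը"]]  # made-up Armenian title

q = model.encode(queries, normalize_embeddings=True)
p = model.encode(passages, normalize_embeddings=True)
print(q @ p.T)  # cosine similarity, since embeddings are normalized
```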

OpenAI: strong on high-resource pairs, fails to generalize

OpenAI text-embedding-3-large achieves the best R@1 on EN↔RU (0.894) in the benchmark.

But performance does not transfer to Armenian:

- EN↔RU: R@1 = 0.894
- EN↔HY: R@1 = 0.21

Same model, same task, same candidate pool — but a 4× drop depending on script.

Why? The cl100k_base tokenizer has zero Armenian tokens in its 100K vocabulary (verified — no token decodes to the Armenian Unicode range U+0530–U+058F). Armenian text is tokenized byte-by-byte (tok/byte = 1.00). One Armenian title = 37 tokens vs 6 tokens with SentencePiece. That's roughly 6× token inflation, and you're paying per token for worse results.
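
That vocabulary claim is easy to reproduce with tiktoken. A minimal sketch (the sample title is made up; cl100k_base is the tokenizer OpenAI documents for text-embedding-3-large):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer behind text-embedding-3-large

# Scan the vocabulary for any token whose text contains an Armenian character.
armenian_tokens = 0
for tok_id in range(enc.n_vocab):
    try:
        piece = enc.decode_single_token_bytes(tok_id).decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        continue  # unused ids / raw byte fragments that aren't valid UTF-8
    if any("\u0530" <= ch <= "\u058f" for ch in piece):
        armenian_tokens += 1
print("tokens containing Armenian:", armenian_tokens)  # the post reports 0

# Token inflation on a made-up Armenian title: byte-level fallback ≈ 1 token per byte.
print(len(enc.encode("Կնքահայրը")))
```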

Cohere v4 regressed vs v3

Cohere embed-v4.0 (2025) vs embed-multilingual-v3.0 (2023):

- embed-multilingual-v3.0: R@1 = 0.731, MRR = 0.783
- embed-v4.0: R@1 = 0.556, MRR = 0.607

Newer model, worse results on low-resource languages. Don't blindly upgrade.

Practical recommendations

| Need | Model | MRR | VRAM |
|---|---|---|---|
| Best retrieval | LaBSE | 0.864 | ~1.9 GB |
| Best balance | multilingual-e5-large | 0.837 | ~2.2 GB |
| Smallest | multilingual-e5-base | 0.794 | ~1.1 GB |
| API | jina-embeddings-v3 | 0.791 | — |

All local models run fine on a single RTX 4000 (20GB) or even CPU.
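
If you want to try the retrieval winner, here's a minimal end-to-end sketch (titles are made-up placeholders, not benchmark data):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/LaBSE")  # ~1.9 GB, CPU works too

en_titles = ["The Godfather", "Evening News"]    # made-up EPG titles
hy_titles = ["Կնքահայրը", "Երեկոյան լուրեր"]      # made-up Armenian counterparts

en = model.encode(en_titles, normalize_embeddings=True)
hy = model.encode(hy_titles, normalize_embeddings=True)

# Cross-lingual nearest neighbor: for each EN title, the closest HY candidate.
best = np.argmax(en @ hy.T, axis=1)
for i, j in enumerate(best):
    print(en_titles[i], "->", hy_titles[j])
```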

What NOT to use

- e5-large / e5-large-v2 for anything cross-lingual: cosine stays high on everything, retrieval collapses (R@1 = 0.12-0.16).
- all-MiniLM-L6-v2 for anything beyond English (R@1 = 0.031).
- text-embedding-3-large for Armenian: byte-level tokenization, roughly 6× token inflation, R@1 = 0.21 on EN↔HY.

Repo

GitHub: s1mb1o/epg-embedding-benchmark. Everything open: code, data, results. MIT license.

Anyone running cross-lingual matching on EPG/TV metadata in other non-Latin markets (e.g., Arabic, Thai, Georgian)? Curious whether the alignment vs retrieval gap is as dramatic there.

Hope you find this useful — and if I missed something or got it wrong, point it out so I can improve.