My personal benchmark
Posted by olddoglearnsnewtrick@reddit | LocalLLaMA | View on Reddit | 3 comments
I am tasked to do several tasks of knowledge extraction from Italian language news articles. The following is the comparison of several LLMs against a human curated gold set of entities:
https://preview.redd.it/aqtlu8r78tqe1.png?width=2840&format=png&auto=webp&s=b26e8f68d5ebff8bb08caf5ee703fb04b086ea75
1. **Overall Top Performer.**
* **google/gemini‐2.0‐flash‐001** achieves by far the highest F1 score (0.8638), driven by a very strong precision (0.9448).
* It also posts a high recall (0.7957) relative to its peers, so it is excelling at both correctly identifying entities and minimizing false positives.
2. **Precision–Recall Trade‐offs.**
* Most of the other models have lower recall, suggesting they are missing more true mentions (FN).
* The precision–recall balance for google/gemini‐2.0‐flash‐001 stands out as the best overall compromise, whereas others (e.g., **qwen/qwen2.5‐32b‐instruct**) sacrifice quite a bit of recall for higher precision.
3. **Speed Considerations.**
* **qwen/qwen2.5‐32b‐instruct** is the fastest at **2.86 s/article** but underperforms in F1 (0.6516).
* **google/gemini‐2.0‐flash‐001** is both highly accurate (top F1) and still quite fast at **3.74 s/article**, which is among the better speeds in the table.
* By contrast, **qwen/qwq‐32b** takes over **70 s/article**—much slower—yet still only achieves an F1 of 0.7339.
4. **Secondary Tier of Performance.**
* Several models cluster around the mid‐to‐high 0.70s in F1 (e.g., **mistralai/mistral‐small**, **meta‐llama/Llama‐3.3‐70B**, **deepseek/deepseek‐chat**), which are respectable but noticeably lower than google/gemini‐2.0’s 0.86.
* Within this cluster, **mistralai/mistral‐small** gets slightly above 0.77 in F1, and **meta‐llama** is at 0.7688, indicating close but still clearly behind the leader.
5. **False Positives vs. False Negatives.**
* Looking at the “FP” and “FN” columns shows how each model’s mistakes break down. For example:
* **google/gemini‐2.0** has only 69 FPs but 303 FNs, indicating it errs more by missing entities (as do most NER systems).
* Models with lower recall (higher FN counts) pay the F1 penalty more sharply, as can be seen with openai/gpt‐40‐mini (FN=470) and qwen2.5‐32b (FN=528).
6. **Implications for Deployment.**
* If maximum accuracy is the priority, **google/gemini‐2.0‐flash‐001** is the clear choice.
* If extremely tight inference speed is needed and some accuracy can be sacrificed, **qwen/qwen2.5‐32b** might be appealing.
* For general use, models in the 0.75–0.77 F1 range represent a middle ground but do not match the best combination of speed and accuracy offered by google/gemini‐2.0.
In summary, **google/gemini‐2.0‐flash‐001** stands out both for its top‐tier F1 and low inference time, making it the leader in these NER evaluations. Several other models do reasonably well but either trail on accuracy, speed, or both.
3 Comments
kmouratidis@reddit
olddoglearnsnewtrick@reddit (OP)
kmouratidis@reddit