Running full benchmark for Mellum2-12B-A2.5B-Instruct-Q4_K_M.gguf
[battery] Starting 11-test battery...
→ [H1] JSON contract...
10/10 (143.0 t/s) judge: Perfect compact JSON. All 6 keys present with exact values. risk_score is number 0.37, contacts array correct order, reviewed is boolean false, no markdown or extra fields.
→ [H2] CRM attribution logic...
5/10 (249.9 t/s) judge: Company first touch model correct but deal-contact model wrong - response incorrectly assumes Contact A was on Deal D when Deal D only has Contact B. Data integrity issue misidentified.
→ [H3] Data quality diagnosis...
5/10 (252.2 t/s) judge: Identified phone/email swap and several secondary issues but missed the most critical issues: firstname contains full name Robert Johnson that should be split to firstname/lastname, and email uppercase causing dedup issues on import.
→ [H4] Logic & retention...
4/10 (162.9 t/s) judge: Color, project code, and non-inference identification are correct, but 'no zed is a lor' is missing.
→ [P1] Tool calling (nested)...
10/10 (230.5 t/s) judge: All criteria met: valid JSON, correct nested structure, contact name and company correct, segment is enterprise, intent_signal references the migration guide, priority correctly set to high for enterprise segment, follow_up.action is a string, follow_up.within_hours is an integer (2), no markdown or extra fields.
→ [P2] Email writing...
10/10 (204.8 t/s) judge: All 8 rules passed: 35 words under 75, no 'hope', no 'checking in', starts with 'James', ends with '?', no subject line, references HubSpot CRM demo and lead tracking, professional tone.
→ [P3] Instruction precision (competing constraints)...
10/10 (217.6 t/s) judge: All 6 constraints met: 3 numbered items, no intro/conclusion, each item 1 sentence, item 2 names Breeze AI, no item starts with HubSpot, no cost/price/cheaper used.
→ [P4] Code generation + execution...
[sandbox] Running code...
[sandbox] PASS: Called MagicMock: OK SANDBOX_OK
→10/10 (250.9 t/s) judge: All requirements met: correct URL, Bearer auth, pagination via after cursor, max_contacts limit, 429 retry with sleep(1), ValueError on non-200/429, returns flat list, no api_key logging, valid Python syntax.
→ [P5] Planning & reasoning...
10/10 (254.1 t/s) judge: Explicitly acknowledges the three-way conflict in Step 1, proposes specific sequencing (integration stabilization first, then data migration, then training), clearly states all three compromises in Step 3, provides 5 numbered practical steps, and addresses data migration, integration, and training throughout.
→ [P6] Hallucination resistance...
10/10 (231.2 t/s) judge: All three questions handled correctly: expressed uncertainty on pricing, gave accurate CEO name with caveats, and correctly identified the fictional API without inventing a URL.
→ [P7] Business reporting...
10/10 (233.4 t/s) judge: All criteria met: 3 sentences, no bullets, 40% win rate stated (correct calculation 8/20), Marcus mentioned, forward-looking ending included.
→ [P8] Reporting discrepancy...
7/10 (250.9 t/s) judge: Provides 3 valid HubSpot reasons with diagnostic steps. Reason 2 and 3 are solid (lead source field verification, sequence of property assignment). Reason 1 is valid but surface-level. Missing deeper HubSpot property specifics like hs_analytics_source vs hs_latest_source, lifecycle stage vs deal properties, or company-contact association issues.
→ [P9] Contact data normalization...
3/10 (253.6 t/s) judge: Response provides generic advice but misses critical HubSpot constraints. Wrongly suggests Bulk Editor can split full names (it cannot). Fails to mention Operations Hub data formatting. Does not flag the critical lifecycle stage limitation (cannot bulk downgrade). Vague on practical approaches like export-normalize-reimport for company casing.
→ [H1] JSON contract...
10/10 (179.3 t/s) judge: Perfect compact JSON. All 6 keys present with exact values. risk_score is number 0.37, reviewed is boolean false, contacts array in correct order, no markdown or prose.
→ [H2] CRM attribution logic...
5/10 (254.4 t/s) judge: Company first touch correct (Contact A, LinkedIn). Deal-contact model wrong - says both contacts get credit when only Contact B should. Data integrity issue mentioned but not the specific gap issue.
→ [H3] Data quality diagnosis...
5/10 (254.6 t/s) judge: Missed critical firstname/lastname split issue (full name in wrong field). Email analysis was factually wrong (said missing @, but @ exists - actual issue is uppercase causing dedup issues). Correctly identified 6 of 8 issues but 2 key errors reduced score.
→ [H4] Logic & retention...
7/10 (207.6 t/s) judge: Correctly identified cerulean color and project code RSM-7429, and correctly identified all mors are zeds does not follow. However, failed to derive the logical conclusion no zed is a lor, instead merely restated the given rule no lor is a mor.
→ [P1] Tool calling (nested)...
10/10 (235.4 t/s) judge: All criteria met: valid nested JSON, correct contact fields, enterprise segment gets priority=high, within_hours is integer, no markdown
→ [P2] Email writing...
10/10 (214.3 t/s) judge: All 8 rules passed: 35 words (under 75), no 'hope', no 'checking in', starts with 'James', ends with '?', no subject line, references 'HubSpot CRM demo', professional tone.
→ [P3] Instruction precision (competing constraints)...
10/10 (224.0 t/s) judge: All 6 constraints satisfied: 3 numbered items, no intro/conclusion, each item is 1 sentence, item 2 names Breeze AI (specific product), no item starts with HubSpot, no cost/price/cheaper terms used.
→ [P4] Code generation + execution...
[sandbox] Running code...
[sandbox] PASS: Called MagicMock: OK SANDBOX_OK
10/10 (251.6 t/s) judge: All requirements met: correct URL, Bearer auth, pagination with after cursor, max_contacts limit, 429 retry with sleep(1), ValueError on non-200/429, returns flat list, no api_key logging.
→ [P5] Planning & reasoning...
10/10 (255.7 t/s) judge: Fully meets criteria: explicit conflict acknowledgment in Step 1, specific sequencing with integration first then data then training, clearly defined compromises for all three stakeholders in Step 3, and 5 numbered practical steps.
→ [P6] Hallucination resistance...
10/10 (246.8 t/s) judge: All three questions handled correctly. No price given for Q1, Q2 correctly named Marc Benioff with appropriate uncertainty caveat, Q3 correctly identified the API as non-existent without inventing a URL.
→ [P7] Business reporting...
10/10 (239.2 t/s) judge: All five criteria met exactly: 3 sentences, no bullets, win rate correctly stated as 40% (8/20), Marcus mentioned by name, ends with forward-looking statement.
→ [P8] Reporting discrepancy...
7/10 (252.1 t/s) judge: Identifies 3 valid HubSpot reasons with diagnostic steps. Covers lead source filtering, deal-lead source linkage, and timing issues. Points 1 and 2 are solid. Point 3 is valid but timing issue overlaps with point 2. Missing deeper technical issues like hs_analytics_source vs hs_latest_source property differences or company-contact association gaps.
→ [P9] Contact data normalization...
4/10 (254.1 t/s) judge: Gives generic bulk edit advice for all 5 issues but misses critical HubSpot constraints. Workflows cannot split names natively; lifecycle stage cannot bulk downgrade (only moves forward). Suggests HubSpot can standardize casing when export/reimport needed.
Composite: 8.62/10 | Speed: 236.3 t/s | Rank: #23/38
── Model Assessment ──
The Mellum2-12B model achieves a solid 8.62/10 overall with exceptional performance on business, email, and code tasks but moderate results on data operations and CRM-specific workflows. Its key strengths lie in processing structured outputs, planning, and instruction-following capabilities, making it reliable for multi-step HubSpot automation tasks. The main weakness is inconsistent data handling performance and lower scores on CRM-centric operations compared to specialized alternatives. For HubSpot CRM consulting, this model works well for email automation and business logic but consider MiniMax or GPT-4 for data-heavy CRM integrations where accuracy is critical.
for me it looks awesome, considering that it has only 2.5B active params (12B/A2.5B) and with "Grouped-Query Attention \[1\] with only 4 KV heads, Sliding Window Attention \[6\] on three of every four layers, and a single Multi-Token Prediction (MTP)"
10 Comments
uhuge@reddit
rawdikrik@reddit
Dany0@reddit
kevinlch@reddit
Jeidoz@reddit
rawdikrik@reddit
AVX_Instructor@reddit
foldl-li@reddit
vasileer@reddit
boxtlandfickerel@reddit