jd_3d

Meta has not given up on open-source

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 73 comments
I created a single-prompt benchmark (with 5 questions) that anyone can use to easily evaluate LLMs. Mistral-Next somehow vastly outperformed all others. Prompt and more details in the post.

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 41 comments
Meta released a paper last month that seems to have gone under the radar. ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization. This is a better solution than BitNet and means if Meta wanted (for 10% extra compute) they could give us extremely performant 2-bit models.

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 62 comments
What's the best free/open-source memory bandwidth benchmarking software?

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 1 comment
New open weight 1 trillion param total / 69B active MOE model released by YuanLab (Yuan3.0 Ultra)

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 0 comments
AIME 2026 Results are out and both closed and open models score above 90%. DeepSeek V3.2 only costs $0.09 to run the entire test.

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 50 comments
GLM-5 vs Opus 4.6

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 17 comments
RAM Memory Bandwidth measurement numbers (for both Intel and AMD with instructions on how to measure your system)

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 60 comments
1 year later and people are still speedrunning NanoGPT. Last time this was posted the WR was 8.2 min. It's now 127.7 sec.

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 24 comments
New challenging benchmark called FrontierMath was just announced where all problems are new and unpublished. Top scoring LLM gets 2%.

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 273 comments
That jump in ARC-AGI-2 score from Gemini 3

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 6 comments
Interesting to finally have some real param numbers on these bigger closed-source models (Grok). I listed a few other big models for reference. See source in text

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 0 comments
Simple Bench (from AI Explained YouTuber) really matches my real-world experience with LLMs

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 228 comments
Meta's Byte Latent Transformer (BLT) paper looks like the real deal. Outperforming tokenization models even up to their tested 8B param model size. 2025 may be the year we say goodbye to tokenization.

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 190 comments
New GPU startup Bolt Graphics detailed their upcoming GPUs. The Bolt Zeus 4c26-256 looks like it could be really good for LLMs. 256GB @ 1.45TB/s

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 139 comments
Meta on track to be first lab with a 1GW supercluster

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 91 comments
Meta on track to be the first with a 1GW supercluster

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 0 comments
Did Mark just casually drop that they have a 100,000+ GPU datacenter for llama4 training?

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 174 comments
Implementing Reflexion into LLaMA/Alpaca would be a really interesting project

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 5 comments
Does Google not understand that DeepSeek R1 was trained in FP8?

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 112 comments
University of Hong Kong releases Dream 7B (Diffusion reasoning model). Highest performing open-source diffusion model to date. You can adjust the number of diffusion timesteps for speed vs accuracy

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 172 comments
NoLiMa: Long-Context Evaluation Beyond Literal Matching - Finally a good benchmark that shows just how bad LLM performance is at long context. Massive drop at just 32k context for all models.

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 111 comments
SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 129 comments
Anole - First multimodal LLM with Interleaved Text-Image Generation

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 85 comments
A new Microsoft paper lists sizes for most of the closed models

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 152 comments
LlamaCon is less than a week away. Anyone want to put down some concrete predictions?

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 0 comments
Inspired by the spinning heptagon test I created the forest fire simulation test (prompt in comments)

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 45 comments
With no update in 4 months, livebench was getting saturated and benchmaxxed, so I'm really looking forward to this one.

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 2 comments
DeepSeek does not need 5 hours to generate $1 worth of tokens. Due to batching, they can get that in about 1 minute

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 28 comments
New LiveBench results just released. Sonnet 3.7 reasoning now tops the charts and Sonnet 3.7 is also top non-reasoning model

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 59 comments
Elon's bid for OpenAI is about making the for-profit transition as painful as possible for Altman, not about actually purchasing it (explanation in comments).

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 283 comments
What do you think of the rabbit r1 and its Large Action Model (LAM)?

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 47 comments
Over-Tokenized Transformer - New paper shows massively increasing the input vocabulary (100x larger or more) of a dense LLM significantly enhances model performance for the same training cost

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 49 comments
Mistral Small 3 one-shotting Unsloth's Flappy Bird coding test in 1 min (vs 3hrs for DeepSeek R1 using NVME drive)

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 74 comments
Gemini 2.0 Flash beating Claude Sonnet 3.5 on SWE-Bench was not on my bingo card

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 169 comments
Mistral Small 3 one-shotting Unsloth's Flappy Bird coding test in 1 min (vs 3hrs for DeepSeek R1 using NVME drive)

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 0 comments
The first performant open-source byte-level model without tokenization has been released. EvaByte is a 6.5B param model that also has multibyte prediction for faster inference (vs similar sized tokenized models)

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 81 comments
Aider has released a new much harder code editing benchmark since their previous one was saturated. The Polyglot benchmark now tests on 6 different languages (C++, Go, Java, JavaScript, Python and Rust).

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 43 comments
Chinese AI startup StepFun up near the top on livebench with their new 1 trillion param MOE model

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 84 comments
Meta releases the Apollo family of Large Multimodal Models. The 7B is SOTA and can comprehend a 1 hour long video. You can run this locally.

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 159 comments
Kudos to the LMArena folks, the new WebDev arena is showing great separation of ELO scores and shows Claude 3.5 Sonnet's dominance in this domain

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 34 comments
Livebench updates - Gemini 1206 with one of the biggest score jumps I've seen recently and Llama 3.3 70b nearly on par with GPT-4o.

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 77 comments
A team from MIT built a model that scores 61.9% on ARC-AGI-PUB using an 8B LLM plus Test-Time-Training (TTT). Previous record was 42%.

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 65 comments
Qwen 2.5 casually slotting above GPT-4o and o1-preview on Livebench coding category

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 105 comments
Meta needs to create a skunkworks AI team that can work on projects with quick turnaround times to stay relevant in-between Llama releases. They can use the old 25k H100 clusters while Llama 4 trains on the 100k+ H100 cluster. Thoughts?

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 5 comments
If an RTX 4090/5090 with 48GB of VRAM were introduced how much extra (over a standard 24GB version) would you pay for it?

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 17 comments
Gemini 1.5 Pro 002 putting up some impressive benchmark numbers

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 57 comments
Is this marketing BS, or how did NVIDIA speed up inference by 15x on Blackwell (and will any of that trickle down to RTX 5090)? VRAM bandwidth is only 2.5x faster

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 71 comments
LiveBench results now with o1-preview slotting in 2nd place. Apparently o1-mini is the reasoning king.

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 18 comments
Updated benchmarks from Artificial Analysis using Reflection Llama 3.1 70B. Long post with good insight into the gains

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 144 comments