I changed my mind about DeepSeek-R1-Distill-Llama-70B
Posted by fairydreaming@reddit | LocalLLaMA | 33 comments

AvidCyclist250@reddit
Can any of those solve this? It's been my go-to reasoning test, and even the latest Gemini can't apply the correct reasoning:
48 : 610 :: 39 : ? 362 975 511 602 353
vha4@reddit
what is it about 610/48 that isn't ~12?
thesuperbob@reddit
I'm not having much luck getting the Llama-70B distill to work. It starts on two 3090s and answers a few prompts, but soon it stops reasoning and just replies right away, even to complex prompts. It also typically crashes llama-cli after a few prompts. Any suggestions for better parameters? The Qwen-32B distill works fine, but it's not as smart.
fairydreaming@reddit (OP)
It looks like I simply used the wrong provider for this model on OpenRouter. With the Groq provider and a temperature of 0.5 it beats o3-mini in https://github.com/fairydreaming/lineage-bench
While o3-mini is clearly better in lineage-8, lineage-16 and lineage-32, in lineage-64 it almost always chooses wrong answers. DeepSeek-R1-Distill-Llama-70B performed much better in lineage-64, selecting the correct answer more than half of the time. That's how it beat o3-mini.
But it has some issues: it loves to create different variations of the required answer format.
Now if only I could find reliable providers for the remaining distills...
CptKrupnik@reddit
So what does it actually mean? If we instruct it better, would it perform better on the easier tasks?
fairydreaming@reddit (OP)
No, it means that the model quality (quantization?) and proper settings matter a lot for this model.
With the previous model provider (DeepInfra) I had much worse results. With the Groq provider and temperature 0.5, the score went up from 0.552 to 0.734 just from changing the provider and temperature settings!
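For reference, this is roughly how you can pin a specific provider and temperature through OpenRouter's API. This is only a sketch: the provider-preferences field, the model slug and the prompt are my assumptions, so check OpenRouter's docs before relying on them.

```python
# Hedged sketch: pinning an OpenRouter chat request to one provider and a fixed
# temperature. The "provider" preferences object (with "order"/"allow_fallbacks")
# and the model slug are assumptions based on my reading of OpenRouter's docs.
import os
import requests

response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "deepseek/deepseek-r1-distill-llama-70b",
        "temperature": 0.5,
        "provider": {"order": ["Groq"], "allow_fallbacks": False},
        "messages": [{"role": "user", "content": "Solve the quiz step by step..."}],
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```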
and_human@reddit
Please, name and shame the providers? :)
fairydreaming@reddit (OP)
Previously I was looking for a reliable provider for DeepSeek-R1-Distill-Qwen-32B:
- DeepInfra had Max Output 131k, but cut the generated tokens to 4k regardless of my settings
- Fireworks had Max Output 64k, but cut the generated tokens short to 8k regardless of my settings
- Cloudflare didn't cut the output but often got stuck in a loop regardless of my temperature settings (tried 0.01, 0.5, 0.7)
For DeepSeek-R1-Distill-Llama-70B I tried DeepInfra, Together and NovitaAI, but it was a few weeks ago, so I don't remember the exact settings (maybe my temp was too low).
selipso@reddit
I run the DeepSeek distill Qwen 32B locally on LM Studio. Something as good as o1-mini for free :) gotta love open source
Feztopia@reddit
That's neat. I sometimes use similar but easier questions to check much smaller models. I wouldn't expect Sonnet to score so low, but then these are all big models.
fairydreaming@reddit (OP)
Claude has personality issues: it almost always selects a wrong answer. The last answer in each quiz, "None of the above is correct", is always a wrong choice, but for some reason it's also Sonnet's favorite one.
Christosconst@reddit
Sonnet always has a better answer than the author of the benchmark
SomeOddCodeGuy@reddit
I run the Distill 32b and I love it. Honestly it's become my favorite model in my workflows. I had tried the 70b, but I didn't see massive gains in the response quality, while I did see massive slowdown in the responses, so I went back to the 32b.
These R1 Distill models are absolute magic. I've been tinkering with the 14b lately and it's honestly really impressive as well.
ortegaalfredo@reddit
I tried the 32B and 70B and they were good, but then I realized QwQ had better results so I went back to it.
InterstellarReddit@reddit
How much VRAM to run 70B Q4? ~35 GB right now?
Cergorach@reddit
The one at Ollama is 43GB...
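As a rough back-of-the-envelope check (weights only, ignoring KV cache and runtime overhead; the bits-per-weight figures below are my approximations for common llama.cpp quant types):

```python
# Rough sketch: VRAM needed just for the model weights of a 70B model at
# various quantization levels. Bits-per-weight values are approximate averages
# (my assumption); KV cache and runtime buffers come on top of this.
PARAMS = 70e9
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q4_K_M": 4.8, "IQ3_XS": 3.3, "IQ2_XS": 2.4}

for quant, bpw in BITS_PER_WEIGHT.items():
    gib = PARAMS * bpw / 8 / 1024**3
    print(f"{quant:7s} ~{gib:5.1f} GiB for weights alone")
```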
InterstellarReddit@reddit
Dammit I have 32GB 🥺
xor_2@reddit
You can use lower quants - the IQ quants, e.g. IQ2_XS, surprisingly punch way above their weight, and a 2-bit quant can fit into even a single 24GB GPU with a usable context length. So you might try e.g. a 3-bit version, or use 2-bit and get a decent context length running at full speed. It's an option, and you can always re-run harder problems/questions through a higher-precision version to validate what you got with the lower quant.
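If you go that route, a minimal llama-cpp-python sketch for loading a low-bit GGUF entirely on the GPU might look like this. The file name is a placeholder for whatever IQ2/IQ3 quant you download, not an official artifact name.

```python
# Minimal sketch, assuming llama-cpp-python is installed with GPU support and
# you have downloaded an IQ2_XS GGUF of the 70B distill (placeholder file name).
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-Distill-Llama-70B-IQ2_XS.gguf",
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=8192,        # trade context length against remaining VRAM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Think step by step: what is 17 * 24?"}],
    temperature=0.5,
)
print(out["choices"][0]["message"]["content"])
```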
Calcidiol@reddit
Yes.
InterstellarReddit@reddit
Dammit I have 32GB
some_user_2021@reddit
I just bought 96GB RAM to be able to run 70B models. It's going to be slow but that's ok!
xor_2@reddit
With quantized versions you can run this model on just two 24GB GPUs with a decent context length. With the more butchered IQ quants you can even run it on a single GPU, though then the context length is somewhat limited, and of course model performance drops the more you drop precision. I mean at very usable speed - tokens/s drop sharply once you involve the CPU and its slow RAM.
xor_2@reddit
The thing with these distilled DeepSeek-R1 models is that they could be even better if more training were done on them - specifically, taking the logits and trying to match the distribution of the full DeepSeek-R1, which is not how these models were produced. There is nice work on re-distilling these models - only the smaller 1.5B and 8B models so far, but the results look quite promising for bigger models too: https://mobiusml.github.io/r1_redistill_blogpost/
This means someone with enough compute could re-distill this model to get an even better one.
Then again, someone with that much compute could also do a proper logit distillation onto Qwen2.5-72B to make an even better model - though I guess re-distilling an already distilled model requires far less compute than a full distillation from scratch.
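For what it's worth, the core of that kind of logit distillation is just a KL term between the teacher's and student's output distributions. A minimal PyTorch sketch of the general technique (my own illustration, not the recipe used for DeepSeek-R1 or the re-distilled models linked above):

```python
# Minimal sketch of logit (soft-label) distillation: train the student to match
# the teacher's output distribution via KL divergence. Illustrative only.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then compute KL(teacher || student).
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return kl * temperature ** 2  # standard scaling for the softened targets

# Example with random logits: batch of 4 positions over a 32k-token vocabulary.
student = torch.randn(4, 32000, requires_grad=True)
teacher = torch.randn(4, 32000)
loss = distillation_loss(student, teacher)
loss.backward()
print(loss.item())
```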
RedEnergy-US@reddit
We are currently running Distill Llama 70B at TrialRadar.com, and the balance of speed and intelligence is great! It could even power our "Advanced" mode, which currently runs o3-mini - quite slow, especially with long prompts (expected).
Sockand2@reddit
Is o3-mini low, medium or high? There are clear differences between the three.
fairydreaming@reddit (OP)
Default medium
Shivacious@reddit
I can provide a not-so-limited API for R1 if you want to try it, OP?
fairydreaming@reddit (OP)
You mean for 32b or 14b distills? Sure, I'm interested.
Shivacious@reddit
Nah, DeepSeek R1.
fairydreaming@reddit (OP)
Umm, but I already benchmarked DeepSeek R1 - it's in second place, almost tied with o1. But if you want to check whether the model behind your API performs the same as the official one, then sure, we can try it.
Shivacious@reddit
dm'd endpoint. feel free to test
Caffdy@reddit
OOTL, what are lineage benchmarks?
fairydreaming@reddit (OP)
Just a logical reasoning benchmark I created.
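From the thread: the quizzes come in sizes like lineage-8 through lineage-64 and are multiple choice, with "None of the above is correct" as the last option. A toy illustration of that kind of problem (my own sketch of the idea, not the actual lineage-bench generator):

```python
# Toy sketch of a lineage-style reasoning quiz: build a random ancestry chain,
# then ask for the relationship between two people. Illustration only, not the
# real lineage-bench code.
import random

def make_quiz(n_people=8, seed=0):
    rng = random.Random(seed)
    people = [f"Person_{i}" for i in range(n_people)]
    rng.shuffle(people)

    # Each person is the parent of the next one in the shuffled chain.
    facts = [f"{people[i]} is the parent of {people[i + 1]}."
             for i in range(n_people - 1)]
    rng.shuffle(facts)

    a, b = sorted(rng.sample(range(n_people), 2))
    question = f"What is the relationship between {people[a]} and {people[b]}?"
    options = [
        f"1. {people[a]} is an ancestor of {people[b]}.",  # correct by construction
        f"2. {people[b]} is an ancestor of {people[a]}.",
        "3. They are not related by ancestry.",
        "4. None of the above is correct.",
    ]
    return "\n".join(facts + [question] + options)

print(make_quiz(n_people=8))
```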