No GGUFs for DeepSeek V4-Flash as yet?
Posted by rm-rf-rm@reddit | LocalLLaMA | View on Reddit | 51 comments
Wondering why there aren't any "name brand" (like unsloth, bartowski) GGUFs as yet for DeepSeek V4 Flash?
echowin@reddit
It's a massive model. Reliably running it on consumer hardware is not easy.
FullstackSensei@reddit
Why do you need to run it on consumer hardware? Old server grade hardware is both cheaper and faster.
A 10 year old Skylake Xeon with DDR4-2666 memory is still faster than anything you can get in the consumer market. The Cascade Lake refresh from 7 years ago with DDR4-2933 is even faster. Both run on socket LGA3647, which has plenty of ATX motherboards and even some mATX boards.
echowin@reddit
While those old systems may have the memory capacity, they lack the memory bandwidth and processing power, making them basically unusable. The system you mentioned would probably get around 5 tokens per second with a Qwen 27B model.
Conscious-content42@reddit
Depends on how many memory channels you have. An EPYC Rome/Milan with 8-12 channels will have at least 140-200GB/s of bandwidth, and with a MoE model like DSv4-flash you can probably get at least 10 tokens/sec generation; for the big boi (V4 Pro) maybe 4-5 or so, depending on how much you quantize it down from the mixed 8- and 4-bit quant. Token gen isn't super fast, but that's reasonable for a $2-3k system. Those setups usually need at least one GPU for the KV cache, prompt processing and the shared expert, while all the routed experts sit in RAM.
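Rough back-of-the-envelope for those numbers: token generation is mostly memory-bandwidth bound, so t/s is roughly bandwidth divided by the bytes of active parameters read per token. The active-parameter count and quant size below are assumptions just to show the math:

```python
# Generation is usually memory-bandwidth bound: each token reads every
# *active* parameter once, so t/s ~= bandwidth / active-parameter bytes.
bandwidth_gb_s = 170        # middle of the 140-200 GB/s range for 8-channel EPYC
active_params = 15e9        # ASSUMPTION: ~15B active params for a "flash"-sized MoE
bytes_per_param = 0.55      # ASSUMPTION: ~4.4 bits/param average after quantization

active_gb = active_params * bytes_per_param / 1e9            # ~8.3 GB touched per token
print(f"~{bandwidth_gb_s / active_gb:.0f} t/s upper bound")  # ~21 t/s; real-world lands lower
```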
FullstackSensei@reddit
Milan/Rome have 8 channels. I have a 48-core Milan and in practice it isn't faster than a Cascade Lake Xeon with half the cores. Part of that is the limited bandwidth between CCDs, which creates a bottleneck during any cache synchronization or gather operation in the software, and part is that Intel has a better memory controller.
Epyc is great if you want a ton of PCIe lanes, but for hybrid inference it doesn't bring much benefit considering the additional cost.
Conscious-content42@reddit
Yes, but if you have two processors you can take advantage of NUMA. It's perhaps not 2x faster, but it does give you some speed-up.
FullstackSensei@reddit
No open source inference library currently supports NUMA. That's why I run two instances in parallel.
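For reference, the workaround is just pinning one instance per socket so each only touches node-local memory. A minimal sketch, assuming numactl is installed; the model path and ports are placeholders:

```python
# Launch one llama-server per NUMA node, pinned to that node's CPUs and memory.
# --cpunodebind/--membind keep allocations node-local, so the two instances
# never fight over the inter-socket link.
import subprocess

MODEL = "/models/model.gguf"  # placeholder

procs = [
    subprocess.Popen([
        "numactl", f"--cpunodebind={node}", f"--membind={node}",
        "llama-server", "-m", MODEL, "--port", str(port),
    ])
    for node, port in [(0, 8080), (1, 8081)]
]
for p in procs:
    p.wait()
```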
Conscious-content42@reddit
There is this https://github.com/ztxz16/fastllm/blob/master/README_EN.md
FullstackSensei@reddit
Oh, I wasn't aware of this. Have you tried it? I tried ktransformers last year but for the life of me I couldn't get it to run
Conscious-content42@reddit
I haven't, /u/a_beautiful_rhind/ mentioned it a few months back, maybe he has some more thoughts on it: https://www.reddit.com/r/LocalLLaMA/comments/1ngfxn8/dual_xeon_scalable_gen_45_lga_4677_vs_dual_epyc/
FullstackSensei@reddit
Hope he chimes in to say if he's still using it. The project's issues page isn't very encouraging, with people unable to compile it for ROCm, and his comment from 7 months ago about no CUDA TP with NUMA doesn't help either.
FullstackSensei@reddit
It's way faster than any consumer platform on whatever metric you want.
Cascade Lake has 140GB/s memory bandwidth and up to 28 cores with AVX-512. I have the 24 core variant and that's more than enough to saturate the memory controller.
Yes, you get 5t/s on a 27B, which, while still faster than any consumer DDR5 platform, is a very dumb way of using it.
The post is about DS4-Flash, and for that, if you pair this Xeon with a single 16GB-or-larger GPU, you'll get ~13t/s. Not very fast, but not bad either for running up-to-400B models on a limited budget.
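The usual trick for that hybrid split is keeping attention, KV cache and the shared expert on the GPU and pushing the routed experts to system RAM. With llama.cpp that's the override-tensor flag; a rough sketch, where the model path, quant and tensor-name regex are guesses until DS4-Flash support actually lands:

```python
# Hybrid CPU/GPU split: everything on the GPU except the routed experts,
# which the -ot regex sends back to system RAM.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "/models/ds4-flash-Q4_K_M.gguf",  # placeholder path/quant
    "-ngl", "99",                           # offload all layers to the GPU...
    "-ot", r"\.ffn_.*_exps\.=CPU",          # ...then kick routed expert tensors back to RAM
    "-c", "32768",
])
```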
ambient_temp_xeno@reddit
No point arguing with these people. They're just salty they didn't buy ram when it was cheap.
FullstackSensei@reddit
DDR4 prices have somewhat come down. You can now get 32GB DDR4 sticks for like 120 each, possibly even 100. That means you can assemble a motherboard + CPU + 192GB RAM for about 1k, if not less. Modified SXM2 V100s are going for like 150. That's 600 for four of them. Add in a PSU, fans and a case, and you're comfortably below 2k for a system with 64GB VRAM and 256GB total memory, that can comfortably run up to 400B MoE models with full 256k context at double digit t/s TG speeds.
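Rough tally with those street prices (all ballpark, obviously, and the motherboard/CPU and case figures are just my assumptions):

```python
# Ballpark build cost in EUR, using the prices mentioned above.
parts = {
    "motherboard + CPU": 280,            # assumption, within the ~1k board+CPU+RAM figure
    "192GB DDR4 (6x 32GB)": 6 * 120,
    "4x modded SXM2 V100 16GB": 4 * 150,
    "PSU + fans + case": 200,            # assumption
}
print(sum(parts.values()))  # ~1800, comfortably under 2k
```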
ambient_temp_xeno@reddit
I'm not going to say what I paid for my 256gb but it was a lot less than that.
I don't really use it, that's the worst part. I think the 'best' model it makes possible is GLM 4.6 for personal shrink type advice you don't want people reading.
FullstackSensei@reddit
I paid 0.50-0.55€/GB back in the before times. I do use RAM all the time to run large models with 3090s or Mi50s.
ambient_temp_xeno@reddit
I think I'll use minimax m2.7 in the winter when using it is essentially free because of the heating.
FullstackSensei@reddit
By then we'll probably have minimax 4.
Running two instances on my dual Xeon isn't that bad. ~600W with both instances under full load.
ambient_temp_xeno@reddit
Toasty!
FullstackSensei@reddit
I don't know. I think it's not much considering it's two instances running in parallel. That's about 34-35t/s aggregate.
ambient_temp_xeno@reddit
Well, a toaster is at least 800 watts so it's about 3/4 there!
FullstackSensei@reddit
Reddit needs a special love up vote for Amiga hardware
jacek2023@reddit
This is my speed on the Gemma 26B I use for agentic coding right now:
What's your use case for a 5 t/s model?
FullstackSensei@reddit
Like I said, that's a very dumb way to use the system.
First, you pair it with one or two GPUs. So, for such models, you'd actually get the same speeds.
I get ~10t/s running minimax 2.7 Q8, or 17t/s at Q4, paired with three Mi50s. Show me how many t/s you get running minimax at Q4 or Q8.
The whole point, and what this post is about, is DS4-flash.
jacek2023@reddit
I am asking WHY you run minimax, what's the goal.
FullstackSensei@reddit
I can throw quite complicated tasks at it, or at Qwen 3.5 397B, and let it run unattended. When rubber-ducking an idea or working on the design of a new feature, those models have a much more nuanced understanding of the problem and of the tradeoffs of different solutions than anything under or around 100B, including Gemma 4 31B or Qwen 3.6 27B, both at Q8_K_XL (the latter runs at 30t/s on two 3090s).
These 200B+ models are great for more advanced work, where you need more intelligence and/or more nuance.
If you don't see the point of using such models, or at the very least having the option to do so, then this whole conversation is moot, TBH.
jacek2023@reddit
do you use opencode or pi or something or just a single prompt?
FullstackSensei@reddit
Roo with modified prompts, and I'm building my own agentic harness.
This isn't about running 200-400B models all the time. It's about having the option to.
For example: I use a mix of dense and MoE "small" models (26-35B), running fully in VRAM, to generate low-level (function/method and class) documentation for existing projects, and then use the large ones to synthesize high-level documentation about the whole project, like architecture, project organization and core design decisions.
When developing a new feature or functionality, I plan big chunks of work using the large models, feeding them the high-level documentation generated above. Often, I run the plan by a couple of models to see what feedback I get (different models tend to ask different questions). That gives me a detailed action plan that, in real life, I could distribute to an entire software development team to implement. I then hand this plan back to the dense/MoE smaller models to implement. In Roo with the modified prompts, I've had up to 41 sub-tasks/agents complete the work, fully unattended, and gotten 97% of the work I want done, the way I want it done, while I'm not even sitting at the computer.
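If it helps picture it, the doc-generation half is basically a two-tier loop against local OpenAI-compatible endpoints. A minimal sketch; the ports, model names and prompts are placeholders for whatever you have loaded:

```python
# Two-tier documentation pass against local OpenAI-compatible servers.
# Small model: per-file function/class docs. Large model: whole-project synthesis.
from pathlib import Path
from openai import OpenAI

small = OpenAI(base_url="http://localhost:8080/v1", api_key="local")  # e.g. a 27-35B model in VRAM
large = OpenAI(base_url="http://localhost:8081/v1", api_key="local")  # e.g. a 200B+ MoE, hybrid CPU/GPU

def ask(client, system, text):
    r = client.chat.completions.create(
        model="local",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": text}],
    )
    return r.choices[0].message.content

# Pass 1: low-level docs, one source file at a time, with the small model.
low_level = [
    ask(small, "Document every function and class in this file.", p.read_text())
    for p in Path("src").rglob("*.py")
]

# Pass 2: feed pass 1 to the big model for architecture-level documentation.
print(ask(large, "Synthesize architecture, project organization and core design decisions.",
          "\n\n".join(low_level)))
```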
AvidCyclist250@reddit
kinda not helpful if a used 3090 barely costs more and is like 10 times faster
FullstackSensei@reddit
A single 3090 on its own won't run a 248B model. The post is about DS4-flash. If you pair that single 3090 with such a Xeon, you'll get 13-15t/s for much less than any consumer platform.
AvidCyclist250@reddit
yes and with far larger models as an option
coder543@reddit
As a related note, half of the reason they released these "preview" models is to allow the community to have time to build support for the DS4 architecture before the models are fully trained.
rm-rf-rm@reddit (OP)
Huh? These aren't the final V4 models??
EffectiveCeilingFan@reddit
That’s my understanding. While they’re technically Chinchilla optimal, they’re trained on significantly fewer tokens than, say, Qwen. I think Qwen is like 30T or so and DSV4 preview has only been trained on 20T or so.
Technical-Earth-3254@reddit
Check this. I'm pretty sure more things are to come once they get their boatload of ascend 950s in Q2/26.
Pixer---@reddit
But training these models can take like 3 months
Minute_Attempt3063@reddit
Considering that these models are massive already, I assume the final ones will be bigger then.
coder543@reddit
No, models don't get bigger while training.
shing3232@reddit
It depends on the amount of tokens and the hardware, but it could easily exceed 70T if they really want.
coder543@reddit
DeepSeek calls them Preview: https://x.com/deepseek_ai/status/2047516922263285776
jon23d@reddit
I'm a little lost. I have a mac studio 512, should I be downloading one of the MLX community quants or waiting for something else?
FoxiPanda@reddit
I think a lot of people are struggling with how the Deepseek team released it - llama.cpp needs a good bit of surgery to make it work, so until that's there the GGUFs aren't really going to appear.
I got it kinda working on Apple Silicon with a fork of mlx, a couple of open PRs that haven't been merged (https://github.com/ml-explore/mlx-lm/pull/1189 being one of them), and a bunch of trial and error with chat template nonsense/encodings, but man, I wouldn't release what I have to anyone. It's messy, I'd say it's only ~90% working, and it's not in good enough shape that I'd consider trying to share it yet.
Minute_Attempt3063@reddit
I think that is also kind of the magic of deepseek lol.
Completely new architecture, massive models, and from what I understand, these are not even the final models yet
Then-Topic8766@reddit
Just downloaded the GGUF and a fork of llama.cpp from https://www.reddit.com/r/LocalLLaMA/comments/1sw3stb/llamacpp_deepseek_v4_flash_experimental_inference/ Can confirm it works: about 5 t/s on Linux, CPU only, i5-14600K with 128GB DDR5.
ortegaalfredo@reddit
It's taking a long time to implement in all the inference engines, but it makes sense: it's different from every other LLM, and the KV cache is 10x smaller! Remember all the noise turboquant caused for being just 4x smaller.
thereisonlythedance@reddit
It’s a shame the Deepseek people don’t work with llama.cpp the way Qwen seems to.
HeavyConfection9236@reddit
Deepseek just released all of the V4 models about 5 days ago. The groups making quants and GGUFs are probably working on other stuff right now and it's on their to-do list. And with these large models, making quants isn't fast or free; we should probably be grateful that they'll be released at all, for us to use at no cost.
SM8085@reddit
I think they have to wait for llama.cpp support so they can make the ggufs.
rm-rf-rm@reddit (OP)
Ah, should've checked that first. You're right, support is in flight:
https://github.com/ggml-org/llama.cpp/pull/22359 https://github.com/ggml-org/llama.cpp/pull/22378
MotokoAGI@reddit
It needs to be supported first. If you have Apple hardware or want to run it on CPU, you can get support for it from here.
https://github.com/antirez/llama.cpp-deepseek-v4-flash
I ran it on CPU and the result is very coherent.
jacek2023@reddit
It's not supported by llama.cpp; these people are just running a converter.