Deepseek V4 Flash and Non-Flash Out on HuggingFace
Posted by MichaelXie4645@reddit | LocalLLaMA | View on Reddit | 315 comments
https://huggingface.co/collections/deepseek-ai/deepseek-v4
toothpastespiders@reddit
I think this is the most annoyed I've ever been at myself for not going overboard with RAM when I was putting my machine together.
Zymedo@reddit
I did and bought 4x48 GB (192 GB) sticks. Still not enough for any current big model, AND I cursed everything because of how badly my CPU runs them (DDR5-3600 or something)...
desktop4070@reddit
What CPU is it? I have a 12900K that also struggles running my DDR5 at advertised speeds.
Zymedo@reddit
9800X3D. Maybe I'm just unlucky, of course, but same - it doesn't want to run even just 2 sticks at 6000, drops them to 5600, or 3600 with 4.
PhilippeEiffel@reddit
The DeepSeek v4 flash model is 160 GB. If I had your memory size, I would give it a try (13B active).
Zymedo@reddit
Flash version might be okay. I tried Step-3.5-Flash (199B A11B) and it was ~10-13 tokens/sec (depending on the exact quant) with careful -ot. But it reasons for literal thousands of tokens - not ideal, considering that I managed to cram only 40K of context in my VRAM. The biggest problem would be waiting for llama.cpp to implement CSA and HCA, probably. They didn't implement DSA, did they?
Monad_Maya@reddit
Didn't jump to AM5 / DDR5 for the same reason. You need a very good board + binned CPU to get stable high-capacity DDR5 on consumer-grade AM5 stuff.
We really need to jump to workstation parts for stable high-capacity DDR5.
species__8472__@reddit
This used to be true, but isn't anymore. Multiple motherboard vendors have 192gb and 256gb 6000/6400MT/s kits on their QVL.
I have 192gb at 5200 stable. Just loaded the expo profile and was off and running.
Better-Monk8121@reddit
running ddr5 192gb 6000mhz with 9950x3d stable. Mobo X870E Tomahawk
DistanceSolar1449@reddit
284B A13B is not gonna fit on 128GB, you'll need at least 256GB for that. That's a lot of overboard.
BestGirlAhagonUmiko@reddit
But if you have like 128GB RAM + a couple of RTX 3090, quantize it down to IQ4XS or Q3KXL and it'll fit with a pretty usable context size. Am I wrong?
Vaguswarrior@reddit
Right, just a couple 3090s lol
Radiant_Bag_5007@reddit
WHO_IS_3R@reddit
Those split-second decisions, man. I'm still beating myself up for not choosing 128GB of RAM and throwing in 2-3 random 3090s instead of an i5/16GB laptop. I was this close, man, if only I knew
Vaguswarrior@reddit
I'm surprised you're noticing a difference, I mean, an i5... That's the big leagues, right?
IrisColt@reddit
heh
madhan4u@reddit
Good to know that I just need 96GB more RAM and a couple of RTX 3090s
DistanceSolar1449@reddit
IQ4XS will be around the size of 1/2 the param count of the model. So 284b/2 = 142GB
Assuming 8GB for the operating system and a few small apps, you have 120GB of RAM. So you need 22GB of VRAM before you even look at KV cache. You need 2x 3090 to even have ANY context.
It's doable but ugly.
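If you want to sanity-check the fit for your own box, here's a rough sketch of that arithmetic (the 0.5 bytes/param for IQ4_XS and the 8GB OS reserve are ballpark assumptions, not measurements):

```python
# Back-of-the-envelope fit check for a RAM+VRAM split.
params_b = 284          # total params, billions
bytes_per_param = 0.5   # IQ4_XS is ~4.1 bits/weight, so ~0.5 bytes/param
ram_gb, os_reserve_gb = 128, 8

weights_gb = params_b * bytes_per_param                  # ~142 GB of weights
vram_needed_gb = weights_gb - (ram_gb - os_reserve_gb)   # what has to spill into VRAM
print(f"weights ~{weights_gb:.0f} GB, VRAM needed before KV cache: {vram_needed_gb:.0f} GB")
# -> weights ~142 GB, VRAM needed before KV cache: 22 GB (hence 2x 3090 minimum)
```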
Karyo_Ten@reddit
According to release notes, it's 10GiB of KV cache for 1M context.
So this fits comfortably on 192 GiB VRAM.
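For what it's worth, taking those release-note numbers at face value, the per-token cost is tiny:

```python
# Per-token KV footprint implied by "10 GiB for 1M context".
kv_gib, ctx_tokens = 10, 1_000_000
bytes_per_token = kv_gib * 2**30 / ctx_tokens
print(f"~{bytes_per_token / 1024:.1f} KiB of KV cache per token")
# -> ~10.5 KiB/token; a classic full-attention cache on a model this size
#    would run to hundreds of KiB per token.
```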
DistanceSolar1449@reddit
192GB != 128GB
Karyo_Ten@reddit
You said
192GB != 256GB
DistanceSolar1449@reddit
I'm not talking about mixing RAM sticks. Yes, you could mix and match RAM sticks and somehow stick exactly 142GB into a motherboard, but nobody does that.
FullstackSensei@reddit
Or you can run a six channel memory platform
Karyo_Ten@reddit
Not sure what you're talking about. I'm talking about 192 GiB not 142 and 192GiB is a very standard size: https://www.corsair.com/us/en/p/memory/cmk192gx5m4b5200c38/vengeance-192gb-4x48gb-ddr5-dram-5200mhz-c38-memory-kit-black-cmk192gx5m4b5200c38
florinandrei@reddit
So easy, you could run it on a Nokia.
ResidentPositive4122@reddit
The weights are ~160GB on hf, and they use mixed fp4/fp8 already. You won't get much out of quanting it again, and might lose a lot of precision. Realistically this will be served on 2x PRO6000, since the kv cache seems to be really efficient.
-dysangel-@reddit
Smug mode: engaged
JLeonsarmiento@reddit
Sell everything, get m3 ultra with 256gb for 6k and go on holiday trip with the rest of cash.
draetheus@reddit
I passed up on a chance to buy an open box 256gb mac studio from microcenter for extremely cheap. What a fool I was...
RoomyRoots@reddit
Could be like me, prices exploded in the week of my birthday. I had to cancel my wishlist because 64GB was more expensive than 128 when I bought it one year before.
SnooPaintings8639@reddit
I built mine for the Llama 3 release. It is serving me well and is worth more today than two years ago when it was new.
But...
I do a daily ritual of looking for cheap options to extend it anyway... but it only gets more and more expensive :(
PWCIV@reddit
is it producing something good?
Wooden_Yam1924@reddit
I feel the same. In January last year I built a PC and got 512GB of DDR5 ECC - it cost me ~$3000, now I can buy one 64GB stick for the same price... Looking at the current models I wish I had 1TB
ambient_temp_xeno@reddit
Rare instance of me being right about something.
kaisurniwurer@reddit
With A49B, you probably would have had a hard time actually using it anyway, so don't feel so bad.
Flash on the other hand looks quite juicy...
WAFFLED_II@reddit
This is the one time I’m GLAD I went overboard on ram and nothing else
Zyj@reddit
Looks like it's 158GB in size. Fits dual Strix Halo.
RazsterOxzine@reddit
I feel you! I feel you! * cries in corner *
Monad_Maya@reddit
I've maxed out (128gb / AM4) but it's still not enough.
Monkey_1505@reddit
Beautiful month. Incredible really.
Makes me wonder how rich one has to be to run flash locally though.
Zyj@reddit
Dual Strix Halo, I paid ~3300€ total.
pmttyji@reddit
when did you buy? Thought single one alone cost $2K+
Zyj@reddit
Fall last year, they were 1580€ each
Zeeplankton@reddit
seriously is this like the heaviest release month in.. ever? Gemma, Kimi, Qwen, Ling, Hy3, MiMo, Deepseek, am I missing anything..? I guess lots of closed model releases too.
grumd@reddit
Yeah, all of that AND GPT-5.5 and Opus 4.7, it's crazy
WildContribution8311@reddit
I'd rather pretend Opus 4.7 never happened.
steny007@reddit
We do not use those words here.
7734128@reddit
There's been a lot more. Plenty of specialized non-llm models, but also more llms like https://huggingface.co/ibm-granite/granite-4.1-8b
Oh, wait you're talking about the whole month? This week alone has been insane.
jadbox@reddit
Now we just need proper benchmarks to compare all of them in real world writing and coding.
Zc5Gwu@reddit
Minimum $5k for cluster of 2 strix halo 128gbs I'd say.
LagOps91@reddit
2 gpus with 24GB vram each and 128gb ram would be viable for consumer hardware for q4, but q3 should be fine too, so that's not too absurd. Or at least was before ram price hikes.
Monkey_1505@reddit
You'd probably need a fairly beefy cpu/ram setup to handle offloading a significant majority of a 13b active model that's in the 100+gb territory and get usable speeds, I suspect.
Not that I've tried.
LagOps91@reddit
I do have 24gb vram and 128gb ddr5 ram. On comparable models, i am getting about 8 t/s at 32k context. No good for agentic coding, but fine for regular assistant usage.
Nobby_Binks@reddit
Sigh, I haven't even set up Qwen 3.6 27B yet. I can't keep up
Iory1998@reddit
You can't set up these either.
the3dwin@reddit
Available on LM Studio 2 days ago
Iory1998@reddit
??
the3dwin@reddit
Qwen 3.6 27B available on LM Studio 2 days ago
AXYZE8@reddit
Both models use FP4 + FP8 mixed: MoE expert parameters use FP4 precision; most other parameters use FP8.
So these DeepSeek models are not as big as the param count suggests. Very important when comparing to other models that are usually BF16 or FP8 - quantizing those down to DeepSeek's precision would reduce their quality.
That being said, as Kimi K2.6 is INT4, the new whale is still the fattest OSS model. Love that we got a Flash variant for (beefy) desktops - 284B at FP4+FP8 fits in the 256GB limit of AM5/LGA1700/1851 platforms.
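You can roughly reproduce the on-disk size from that split; note the expert share below is my own guess for illustration, not a number from the model card:

```python
# Rough size estimate for FP4 experts + FP8 for everything else.
total_params_b = 284
expert_fraction = 0.90   # assumption: ~90% of params sit in the MoE experts
size_gb = total_params_b * (expert_fraction * 0.5 + (1 - expert_fraction) * 1.0)
print(f"~{size_gb:.0f} GB")  # ~156 GB, close to the ~160 GB of safetensors on HF
```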
waruby@reddit
Shit, that means it can't be quantized much more.
Minute_Attempt3063@reddit
they did the work for us XD
and yet it's massive LOOL
maybe we're finally reaching ChatGPT levels, but locally
Unusual_Guidance2095@reddit
Kind of disappointing that indeed it seems to only be a single modality
traveddit@reddit
I think just from this Qwen already beats the flash model.
Hoodfu@reddit
Qwen is dry as a bone for writing. Deepseek (at least from 0324) has been a creative writing dynamo.
traveddit@reddit
Prompt:
Deepseek:
Qwen 397:
Tell me the dynamo that is Deepseek here?
Outrageous-Wait-8895@reddit
In what way is Qwen superior here?
traveddit@reddit
Why, did you run them through an LLM and find that it prefers Deepseek's?
I'll entertain you if you tell me why Deepseek's is better in your own words.
Outrageous-Wait-8895@reddit
Did I?
You didn't answer the question by the way.
RuthlessCriticismAll@reddit
Deepseek's is like infinitely better. Neither is particularly good, but it's actually about loneliness rather than a highly used snack machine, which evokes literally the opposite emotion.
traveddit@reddit
Is it? A lone passenger at a subway?
Gen Z level of literacy you have right there. You can't even properly locate the subject of Qwen's poem.
You can barely read, but you're confident enough to tell the world what you think is good writing? Nobody should care, because 99% of you on this sub don't know good writing from bad.
tengo_harambe@reddit
Qwen's poem is abstract and metaphorical (the snack is alone and nobody wants it), Deepseek's is literal. Subjectively I prefer Qwen's, which is odd because I usually find it to be terrible at creative writing.
Hoodfu@reddit
I've used the Qwens/Deepseeks/Gemmas/Mistrals for text-to-image prompt enhancement for a good while now on an M3 Ultra 512. The images Qwen makes are simplistic and unimaginative, even when upping the temp. Gemma is a good deal better, along with the older Mistrals. Deepseek however, especially 0324, thinks of stuff that's amazing. If you want it to be snarky and sarcastic, it especially comes alive. There's a reason why it's always near the top of the creative writing benchmarks whereas these others are far below it.
traveddit@reddit
What makes you think you can tell good writing from bad? Why should I care about LLM evaluated writing benchmarks?
The_Rational_Gooner@reddit
V3.2 is really dry though. Hopefully this one strikes a good balance
dark-light92@reddit
In the report they mention that they are working on multimodal capabilities.
Monkey_1505@reddit
I think they were wise to focus on agentic first, and circle back to multi-modality.
breadfruitcore@reddit
Yeah I'm super grateful they released this, but it's a bit sad. Guess I'll still be doing model switching in my workflows.
MichaelXie4645@reddit (OP)
Agreed, I kinda looked forward to the 284B with vision support.
Specter_Origin@reddit
I would keep my fingers crossed...
ReadyCelebration2774@reddit
tested the flash model, seems really really fast
Whiz_Markie@reddit
How much VRAM is required?
power97992@reddit
Approximately 160GB for the model if you count the xet files. Plus full context, I guess you are looking at another 20-40GB if it is 5% of V3.2's (since Pro is 10% of V3.2 per token)
thrownawaymane@reddit
What about those of us sitting on a bunch of DDR4/5 RAM?
the__storm@reddit
284B params; we'll see what the quants look like but you could probably squeeze it into 192 GB (2x Pro 6000 Blackwell). With 13B active, a 256GB Mac might do okay as well.
200206487@reddit
I have one of the 256s, looking forward to testing. I just learned about this while sitting on the porcelain throne. I hear that since it's QAT (prequantized during training), quantizing below what's already released as the base will reduce quality considerably more than it would for a non-QAT release? Did I get that right? Also, from what I'm reading, QAT is great to lead with, to set a trend for others to do the same. I imagine QAT is best since they can control quality while compressing the model straight from the source - again, IIRC.
the__storm@reddit
Yeah, that's right. Looks like the instruct QAT is 160GB though so you should be good.
OC2608@reddit
Yes.
Zyj@reddit
160GB + context
Thomas-Lore@reddit
But context is very efficient so maybe 192GB will be enough.
RedBull555@reddit
Yes.
ReadyCelebration2774@reddit
not local, through their api
thread-e-printing@reddit
All of it
True_Requirement_891@reddit
Looks like Minimax is gonna be out of business...
power97992@reddit
They will release a bigger model like 390-400b
silenceimpaired@reddit
DeepSeek-V4-Flash with 284B parameters (13B activated) is an interesting setup. Hardly a flash in my mind but I'm curious how it will compare at 4bit against GLM 5 at 2bit.
andy2na@reddit
need a 0.01bit quant of that
Unusual_Guidance2095@reddit
It does seem that the entire model size is only 896 GB though, so seemingly mostly Q4, but the model card said a mix of both
-dysangel-@reddit
if this model is using engram (though I'm not reading anything saying that yet..), a lot of those weights could be stored on disk
Hoodfu@reddit
So at q4 that's around 450 gigs. So I can run it on an m3 ultra 512 gig mac with about 3 sentences of context.
moar1176@reddit
Nah, did you not see the kv chart? They fit 1M context into < 10 gigs on the full size model. Absolutely masterful engineering. Of course on a Mac that would probably take a few hours to Prefill.
Silver-Champion-4846@reddit
How lossy is the cache?
ResidentPositive4122@reddit
No, the ~865GB is already quantised in fp4/fp8 mixed. So you can't reduce that too much without basically bricking it.
po_stulate@reddit
3 sentences context costs 60GB RAM?
Nodja@reddit
Parent is saying that the model was mostly trained at 4 bits already, because 1.6T * 4 bits = 6.4 Tbit = 800GB. If the size on disk is 896GB, only 96GB is possibly 16-bit, so for Q4 quants you're seeing savings of ~50GB, not 450GB.
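Spelled out with the same numbers:

```python
params = 1.6e12
pure_4bit_gb = params * 4 / 8 / 1e9   # 4 bits = half a byte per param
on_disk_gb = 896
print(f"{pure_4bit_gb:.0f} GB at pure 4-bit, "
      f"{on_disk_gb - pure_4bit_gb:.0f} GB left for higher-precision tensors")
# -> 800 GB at pure 4-bit, 96 GB left for higher-precision tensors
```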
shing3232@reddit
the base is 1.6T, and it's QAT'd into the instruct at ~900GB
Caffdy@reddit
can you expand on this? it's kinda confusing
the__storm@reddit
They're both 1.6T parameters, but the base model (not trained to talk in turns or do tasks or anything, purely predicts text) is at higher numeric precision. The instruct version (model with additional training that we normally use to chat and do tasks) is basically pre-quantized to take up less space by storing the parameters with less precision. Since the model was trained taking into account in advance how it would be quantized, we would expect this version to perform better than an equivalent model that had been quantized separately after training. (Downside is that you also have less flexibility to change the quantization after the fact.)
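For intuition, the standard QAT trick is "fake quantization": round the weights in the forward pass but let gradients flow through as if no rounding happened (a straight-through estimator). A minimal sketch, using a toy symmetric 4-bit grid rather than DeepSeek's actual FP4/FP8 scheme:

```python
import torch

def fake_quant(w: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize in the forward pass; identity in the backward pass."""
    scale = w.abs().amax() / 7.0                           # fit weights onto a 4-bit grid
    q = torch.clamp(torch.round(w / scale), -8, 7) * scale
    return w + (q - w).detach()                            # forward: q, backward: grads flow to w

# During QAT a layer computes with fake_quant(weight), so the optimizer learns
# weights that still work well once the rounding is made permanent at export.
```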
34574rd@reddit
Not really; according to their technical paper it can be dequantized to BF16 losslessly
florinandrei@reddit
Just quoting that fact because it's important.
Caffdy@reddit
is this what they call QAT?
shing3232@reddit
yes
redimkira@reddit
if you find that, you'll win the Nobel prize in medicine for lobotomy
cafedude@reddit
Yeah, it would be nice if they took a hint from the Qwens and released some under 100B models. Flash-lite or something.
synn89@reddit
MIT license? Nice.
SufficientPie@reddit
Hear that, Qwen?
popiazaza@reddit
Minimax perhaps? or less likely, Moonshot? Qwen and Zai are quite open.
SufficientPie@reddit
Qwen3.6-397B-A17B was released as open weights?
popiazaza@reddit
There is no such model? Plus and Max variant are proprietary as usual for Qwen.
SufficientPie@reddit
Qwen3.5-Plus was released as open weights:
popiazaza@reddit
Not sure what you want, but you seem to be able to find the answer by yourself. Have a nice day.
popiazaza@reddit
Yes. Only their plus/max variant are proprietary.
onil_gova@reddit
what's wrong with Apache 2.0?
SufficientPie@reddit
Qwen3.6-397B-A17B was released under Apache 2.0??
Embarrassed_Adagio28@reddit
Kinda disappointed to see there is no 30B, 80B or even a 122B. This doesn't help the local LLM community much.
Right-Law1817@reddit
Multimodal models soon.
steny007@reddit
Doesn't necessarily mean soon.
AFruitShopOwner@reddit
Oof those hallucinations on flash are baaaaad
Zeeplankton@reddit
the whole idea of positive RLHF against hallucination is still new. In Flash's defense, all these scores suck balls
silenceimpaired@reddit
Sounds like a great creative writing model.
AFruitShopOwner@reddit
Seems like it has more world knowledge at the cost of thinking it knows everything
silenceimpaired@reddit
I'm okay with that. I'm not using LLMs to code... Just brainstorm
AFruitShopOwner@reddit
Big yikes
Altruistic_Heat_9531@reddit
So lemme get this straight, in 1-2 weeks there are
- Qwen 3.6
- Deepseek V4
- Gemma 4
- Opus 4.7
- GPT 5.5
And in past 24 hours
- DeepSeek V4
- 27B Qwen 3.5
- GPT 5.5
ndrewpj@reddit
In the past 24hrs it was Qwen 3.6 27B, not 3.5
Mashic@reddit
Anthropic is the only one to still not release an open model.
VampiroMedicado@reddit
Did xAI release a model yet?
Altruistic_Heat_9531@reddit
They did. If I'm not mistaken, Anthropic is the only one that hasn't released any model yet, and also hasn't released any major contribution in terms of training. Only MCP.
Microsoft even donated DeepSpeed, Uber with Horovod, and of course Meta with Torch and its predecessor Caffe2.
Mashic@reddit
Well, that one too.
arbv@reddit
They did! Grok 2, IIRC.
Altruistic_Heat_9531@reddit
whoops mb
xspider2000@reddit
u forgot GLM 5.1
Zeeplankton@reddit
I think this is the biggest release month ever
Sky-kunn@reddit
zdy132@reddit
On par with SOTA models, with reduced compute and memory load. DeepSeek did a great job here.
coder543@reddit
At 1.6T A49B, I don't know that we can confidently say that for DeepSeek V4 Pro. We don't know how big the frontier models are. Anyone who throws out a number is just wildly guessing.
But it is still very cool that they released the model openly.
the__storm@reddit
Well the API is like 1/4 the price of Opus/GPT, so they're probably either accepting a lower margin or have improved inference efficiency. The technical report has a lot of stuff about how they're serving it.
kurtcop101@reddit
The inference costs of open source models are more representative of actual inference costs.
It's why I keep saying the big companies are not losing money on their subs or anything like that. Random data centers are not hosting models trying to gain market share, they run at a profit.
The place where big labs are losing money is only training the models - these subscriptions ARE profitable. For the frontier US labs the training is subsidized by VC money, for Chinese labs the training is subsidized by the Chinese government. The Chinese government isn't expecting a return - they're in an arms race - so they release open source instead, because it makes it harder for the frontier US labs to charge higher pricing and profit more, and disrupts them. It generally just disrupts the US market.
coder543@reddit
They definitely accept lower margins. If they tried to charge high margins on an open model, every other AI host on OpenRouter would undercut them with their own model.
HiddenoO@reddit
Likely, yes, but you have no idea about OpenAI/Claude/Google margins. They cannot charge arbitrarily high either because they don't want people to go to their competitor models and possibly never return.
34574rd@reddit
I'm not sure how exactly they are going to undercut with a 1.6T parameter model
shing3232@reddit
it's gonna be cheaper once more 950 accelerator cluster is online.
Zeeplankton@reddit
I think it's funny when just last year we were in awe of an open 1T model. I can't remember which one it was, but how times have changed lol.
Silver-Champion-4846@reddit
Kimi K2
Winter_Educator_2496@reddit
Either same or bigger. It depends on a lot of factors I won't get in to, but it is not less. Remember that the single easiest way to make a model better is to make it bigger.
Monkey_1505@reddit
I assume you did not browse the paper?
zdy132@reddit
I am talking about the reduction in Single-Token FLOPS, and Accumulated KV Cache, the two diagrams on the right.
Both of these mean that they will be less taxing on the hardware to run, making it cheaper to serve. And in the (hopefully near) future when consumer hardware can run it, we will be able to run it more efficiently as well.
power97992@reddit
On page 6 of the technical report, they said it is approaching the level of Opus 4.5.
DistanceSolar1449@reddit
power97992@reddit
GLM 5.1 is BF16, but DS V4 is Q4+Q8 mixed precision, so V4 actually uses less VRAM.
NandaVegg@reddit
According to the GLM 5 paper, GLM 5 (and 5.1 I guess) had int4 QAT (also mixed precision: W8A8 and W4A8 for experts) at the post-training phase, but it is kind of vaguely written. I would consider it native 8-bit at least; I had no issues running GLM 5.1 with the official fp8 quant (vllm, 8xH200).
BillDStrong@reddit
Yes, but isn't GLM-5.1 a dense model? Take the sqrt of the active params times the total params of the MoE to get a ballpark for the size of dense model it will act like.
So it acts like a ~277B model at the speed of a 49B model, with the knowledge of a 1.6T model.
If it is sticking head to head with the 744B model, it is doing pretty darn good, really.
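Plugging the numbers into that rule of thumb:

```python
from math import sqrt
active_b, total_b = 49, 1600   # A49B out of 1.6T total
print(f"~{sqrt(active_b * total_b):.0f}B effective dense size")
# -> ~280B, close to the ~277B figure above
```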
DistanceSolar1449@reddit
It's 744B with 40B active
NandaVegg@reddit
If the FLOPS reduction claim stands true (and can be properly implemented in inference engines) it would be a fantastic choice for a faster interactive experience.
I tested the official API a bit; so far it is not that fast for either Pro or Flash, but then it has always been the case that DeepSeek's official API is slow, as they apparently do not have enough compute.
As for parameter count, it uses mixed precision (experts are 4-bit), so the real VRAM footprint for the weights should be around 860GB. Still larger than GLM-5.1, which is 744B, or Kimi, which is around 500GB with 4-bit QAT.
zdy132@reddit
It varies a lot right now. Some replies are almost instant, 100-200 t/s, but some are only 10-20 t/s.
It's probably their servers being hammered by all the people checking the new model out.
Few_Water_1457@reddit
https://artificialanalysis.ai/?models=deepseek-v4-pro%2Cdeepseek-v4-flash%2Cminimax-m2-7%2Cqwen3-6-27b wait what?
More-Curious816@reddit
Recoloring the chart because I can't see shit
RazsterOxzine@reddit
May I ask how this helps? Just wondering because it's not doing it for my eyes ☼_☼;
More-Curious816@reddit
I have shit eyesight, that white and gray wasn't doing it for me. I recolored it originally for myself, and posted it for people who suffer the same headache from grey and white. It wasn't meant as a better alternative.
layer4down@reddit
Yeah and thanks for it. I’ve got crazy floaters and blind spots so this is definitely preferable.
RazsterOxzine@reddit
I feel ya. It works, it works. Probably more people are in the same situation. I was just curious. 🖖
Salaja@reddit
YOU MADE A MISTAKE
the "SWE Verified" middle column should be Claude (green), instead of GPT (orange).
I feel like there is a "blind leading the blind" joke here, but i can't "see" it... ha ha.
Desm0nt@reddit
The fact that GPT-5.4-high performs worse than (the very strange and erratic) Gemini 3.1 Pro in some tests makes me seriously doubt the reliability of these results...
Thomas-Lore@reddit
You should try Gemini 3.1 Pro on API or AI Studio, not in the Gemini App. It is much more capable than people think. I use it for one shotting solutions that I then give to minimax to implement.
Desm0nt@reddit
In AI Studio - maybe. But for coding Google forces me to use Antigravity and Gemini CLI (and bans me if I proxy it to any more useful tools), and in Antigravity it's way dumber than even Sonnet, and sometimes (during prime hours) even dumber than Gemini 3 Flash, with typical quantization problems (losing the meaning of the context, getting stuck repeating the same phrase, etc.).
So maybe 3.1 Pro is actually pretty good in its original, full-fledged FP16 form, but it feels like they only ran it that way once in its lifetime, for benchmarks. And then the "little, poor indie company GOOGLE" apparently quantized it to Q2 and rolled it out into their tools; otherwise I can't explain such a difference in real-world use.
power97992@reddit
Gemini 3.1 pro in the antigravity seems to be way worse than opus 4.6
Ardalok@reddit
Not as benchmaxed as Qwen, good.
danigoncalves@reddit
Waiting for a micro version since their Flash version is almost 300B 😬
GlossyCylinder@reddit
Interesting, they say it's a preview version, but looking at the benchmarks it's on par with K2.6 on coding and agentic tasks and slightly better at math and reasoning (expected). I honestly thought it would perform worse than Kimi or GLM on coding, but the gap between OSS models is very tight.
And a detailed report/paper from DS, as always. Seems like they were able to incorporate all their recent ideas into V4.
Karyo_Ten@reddit
They didn't incorporate Engram ;)
Re theorem proving, interesting, though are you aware of Mistral's Leanstral and Meituan's Longcat Prover?
segmond@reddit
you must not be aware of these, which long predate those:
https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-671B - "DeepSeek-Prover-V2, an open-source large language model designed for formal theorem proving in Lean 4"
https://huggingface.co/deepseek-ai/DeepSeek-Math-V2 - "DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute."
Karyo_Ten@reddit
I am aware, but Leanstral and the Meituan prover are less than 2 months old and didn't get much coverage, so they could easily have been missed.
segmond@reddit
They have the best and only math models: DeepSeekMath-V2, which is capable of winning gold at the IMO, and DeepSeek-Prover. I think they incorporated those lessons and data into this model.
power97992@reddit
It looks good, and maybe Engrams will come out later?
Ok-Mess-3317@reddit
IT’S FINALLY HERE
More-Curious816@reddit
AND IT CHANGES EVERYTHING. (I can see the click baiting slop already)
bnolsen@reddit
ITS A GAME CHANGER. WE WERE ALL WRONG, DEEPSEEK IS BACK.
Ok-Mess-3317@reddit
Lmao unfortunately
200206487@reddit
Looking at Unsloth / Bartowski 5-bit ftw!
SnooPaintings8639@reddit
Wake me up when the GGUF is there!
This_Maintenance_834@reddit
This is pretty much already quantized. GGUF won’t really help to reduce the size, unless it goes below Q4.
tarruda@reddit
I don't have high expectations for deepseek, but Qwen 3.5 397b quantizes extremely well even down to 2-bit. I was able to run it on my 128G mac and got excellent benchmark results: https://huggingface.co/tarruda/Qwen3.5-397B-A17B-GGUF/discussions/1#69d142b4f17676f98e53c16a
If Q4 fits in 160G, maybe there will be a good ~3-bit quant for 128G machines.
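The hope in numbers (naive linear scaling, ignoring whatever has to stay at higher precision):

```python
q4_gb = 160
q3_gb = q4_gb * 3 / 4
print(f"~{q3_gb:.0f} GB")  # ~120 GB: tight on a 128 GB machine once KV cache is added
```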
Zeeplankton@reddit
WHATEVER IT TAKES
SnooPaintings8639@reddit
Not gonna lie - I am waiting for Q2, lol
Recoil42@reddit
Gank Goodness Its Friday
manipp@reddit
So is this the same thing you get from their website? E.g. is the instant mode DS4 Flash?
david_0_0@reddit
the flash variant with 13B active params is interesting — does anyone know if there's a q4 quant that fits on 32-64GB VRAM without completely tanking quality? curious where the actual quality cliff shows up vs the full activated count
Nepherpitu@reddit
The point is, it is already Q4 at 170GB. It will not fit into 64GB VRAM even at Q1.
david_0_0@reddit
ah that's brutal, so even the flash version is basically multi-GPU territory regardless of quant level
dampflokfreund@reddit
Disappointing release. Too hard to run, even the Flash, no multimodality, no revolutionary stuff like Engram... expected much more after the long wait.
Mochila-Mochila@reddit
Their work on solving context size shouldn't be dismissed tho.
Pristine-Tax4418@reddit
What is the real size? I want to believe that it is not 284B
Look_0ver_There@reddit
The real size is 284B, however it uses a mix of FP4/FP8/FP16/FP32 as well as some INT8 weights. It's sort of like GPT-OSS-120B, where it's mostly MXFP4.
As a consequence, the full weight size is ~170GB, and not the ~284GB you might expect.
edward-dev@reddit
How much RAM? Yes.
Jokes aside, Qwen3.6 27B seems on par or even a little bit better than V4 Flash, at least on benchmarks
alex20_202020@reddit
So I have noticed. Why run DS instead, when it uses so much space?
Former-Tangerine-723@reddit
Because benchmarks don't tell the whole truth
34574rd@reddit
*on benchmarks*
kevin_1994@reddit
Anyone able to compare Flash to Minimax M2.7? Similar sizes, but I don't see any direct comparison and I'm on my phone.
Sinister1066@reddit
Zc5Gwu@reddit
Wow, that pricing and speed is crazy. Minimax needs new architecture, fast.
Sinister1066@reddit
I think flash takes the win on this one, but K2.6 wins over v4 flash
power97992@reddit
K2.6 is also 13x more expensive
Few_Water_1457@reddit
really
nFunctor@reddit
Spoke to it about some philosophical/social stuff to check its style and analytical effort. Sent outputs back to Opus. We both agreed it's Opus 4.6 high-think level, both in style and substance.
It was set free.
uniVocity@reddit
Ha just when I was beginning to feel rich with my 128gb… now I’m hoping qwen releases another model to compete with DeepSeek. Maybe qwen3.6-397b or a new qwen-coder version? One can only dream.
alex20_202020@reddit
The benchmarks list is very different for 3.6 27B, but where lines are named similarly (what does EM mean? 5-shot?), I see Qwen is higher: https://huggingface.co/Qwen/Qwen3.6-27B vs https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
Due_Net_3342@reddit
32B reap version when? :))
Ferilox@reddit
What do they mean by "Think Max" mode? Is it like Think High but with a special system prompt? Have they specified what system prompt they used for the benched values of DS-V4-Pro Max? Was it a single system prompt for the entire benchmark suite, or did they change it benchmark to benchmark?
power97992@reddit
Max means max reasoning
Ferilox@reddit
I was going over the DeepSeek API docs and that's how it's turned on, but what do they mean by the system prompt in the response format in the model card?
I find it confusing
jnmi235@reddit
V4-flash only being 160GB is wild
Eyelbee@reddit
It's not that wild considering it's barely any better than qwen 3.6 27B
Expensive-Paint-9490@reddit
So I was going to download MiMo and Hy3 to test but now my priorities have changed.
power97992@reddit
U can test all of them on openrouter then decide which one u will download
power97992@reddit
I'm surprised they didn't use Engrams in this model, maybe in the future. And there's no multimodality for the Pro model
Dany0@reddit
It's a preview, apparently multimodality + Engrams are still coming
An early checkpoint that beats Opus 4.6. Thank you DeepSeek, we may not be GPU rich, but with friends like you, we are rich
power97992@reddit
It seems like they are saying it is almost as good as Opus 4.5, so they are about 5 months behind
DarkArtsMastery@reddit
It's happening aight?
megadonkeyx@reddit
Hand cranking my Dell R720xd as we speak, 384GB DDR3 with a 12GB RTX 3060.
4 tokens/sec is all anyone ever needs
power97992@reddit
Just rent 5 b200s or use the api… it is faster
jselby81989@reddit
This is the moment I regret not maxing out my RAM.
Dany0@reddit
I thought I was a baller with my 5090, but I'll be renting H200s in the cloud with the rest of y'all :/
Bestlife73@reddit
I was here!
markovianmind@reddit
I am here
onil_gova@reddit
Zyj@reddit
It's only open source if you release the training data, like Nvidia has been doing lately. Otherwise it's open weights.
Mashic@reddit
If they were trained on copyrighted materials (books, scientific papers...) without acquiring a proper license, don't expect them to release the training data and put themselves at risk of lawsuits.
Ardalok@reddit
And all the research.
onil_gova@reddit
context
mrjackspade@reddit
Holy cringe.
FlyingCC@reddit
me too my friend me too
More-Curious816@reddit
Add me to the celebrity group chat.
TheRealMasonMac@reddit
It seems to be a native FP4 model? Their card says: “FP4 + FP8 Mixed: MoE expert parameters use FP4 precision; most other parameters use FP8.”
Lowkey_LokiSN@reddit
Yup! And that's what I'm most excited for!
It's the one thing I've been really missing out on since gpt-oss-120b.
Independent-Date393@reddit
52% of deepseek's own engineers switched to V4-Pro as their primary coding model. that data is in the paper section 5.4.4 and it's more interesting than any public benchmark
RuthlessCriticismAll@reddit
To be clear, it's that fewer than 9% said no. So almost all the DeepSeek engineers are willing to use V4.
ilintar@reddit
Is it better than Qwen3.6 27B?
Noxusequal@reddit
Do I see correctly that Engrams are, at the very least, not mentioned in the model descriptions?
faschu@reddit
The V4 version also has no native image support?
Mr-I17@reddit
284B Flash 🫠. *Sad 128GiB UMA noises*
beneath_steel_sky@reddit
Flash (non base) is 158B, quants should fit in 128GB
Mr-I17@reddit
Nah, it's 284B, as written on the model card. The 158B seems to be a miscalculation by HuggingFace. DS V4 uses a mix of FP4 and FP8 natively; the total size of the safetensors files is about 160GB. 128GiB UMA owners will have to use 2-bit quants.
beneath_steel_sky@reddit
My hopes just went out the window
Karyo_Ten@reddit
It's already quanted to FP8 + FP4 (or INT4? unsure), so you'll need to requant to INT8 + ~INT3
Zyj@reddit
It's 158B because it's quantized to mostly Q4
cafedude@reddit
Maybe we can get one of those 1-bit quants?
Mr-I17@reddit
Yeah, might as well just pick a 1T model. It'd fit into 128GiB RAM perfectly. /s
MDSExpro@reddit
AWQ when?
power97992@reddit
They need to make a 120B Q4/Q8 mixed-precision model
Then-Topic8766@reddit
History in the making.
Material_Soft1380@reddit
AleksHop@reddit
its extremely bad in terraform
weiyong1024@reddit
As a developer from China, this is what I respect most about DeepSeek, they just keep shipping MIT license and 1M context while the rest of the field is busy marketing, in a noisy race a bit of rational focus goes a long way. It's also a solid self hostable option in my multi provider agent rotation, not a hedge exactly, more like a core slot that happens to also be free of external policy exposure.
TinyDetective110@reddit
`For the Think Max reasoning mode, we recommend setting the context window to at least 384K tokens.`
Accomplished_Ad9530@reddit
Holy shit, that's a lot of thinking tokens
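If you want to poke at it over the API, DeepSeek's endpoint is OpenAI-compatible; a minimal sketch (the model id here is a placeholder, and how Think Max is actually selected isn't clear from the card, so check the official docs):

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")
resp = client.chat.completions.create(
    model="deepseek-reasoner",   # placeholder id for the V4 thinking model
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    max_tokens=32_768,           # leave plenty of headroom for reasoning tokens
)
print(resp.choices[0].message.content)
```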
Wibong@reddit
The distill dataset already came out!!!
https://huggingface.co/datasets/beyoru/deepseek-v4-pro-max-distillation-preview-shot
AlbeHxT9@reddit
>Opus level at these prices wtf
MDSExpro@reddit
Flash size is perfect! Finally a good model for that parameter band.
Jackalzaq@reddit
Is the base model for the Pro the one that's 1.6T parameters, with the instruct one half of that (862B)? Or is the HuggingFace parameter count bugged?
Caffdy@reddit
someone mentioned that it should be corrected in the next 1-2 days
Jackalzaq@reddit
ah, thanks!
Karyo_Ten@reddit
It's QAT. Delivered in FP8+FP4 (or INT4, didn't check)
Zyj@reddit
The instruct model's weights are mostly FP4
rm-rf-rm@reddit
odd that they haven't shared any benchmarks for the flash model
hdmcndog@reddit
They have: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro#comparison-across-modes
rm-rf-rm@reddit
those are comparisons across modes, not against other models like Qwen3.6, Minimax 2.7, Sonnet 4.6, etc.
hdmcndog@reddit
Luckily, we can just join the tables ourselves ;)
But I see what you meant now.
ortegaalfredo@reddit
They are on the hugging face page
rm-rf-rm@reddit
I didn't see it, looking again
Different_Fix_2217@reddit
It does not seem very good... Hopefully it's just broken, because this is nowhere near Kimi / GLM.
Aaaaaaaaaeeeee@reddit
In case you were wondering about Engram, it's not part of these models yet.
It's saved for future work.
Both models have post-trained QAT experts in MXFP4. Very happy that they do QAT release too, so it can be the norm.
Real_Ebb_7417@reddit
Ok so, can someone smarter than me tell me what new techniques they used that will make our lives easier? (They always implement something new, so I guess this time is no exception?)
Caffdy@reddit
1M context with very low KV cache memory requirements
Zyj@reddit
Will help with larger context size. Less RAM, more speed.
Finanzamt_Endgegner@reddit
Better hybrid attention it seems, some long context improvements and if I'm not mistaken something to make the active params smarter, but that one is hearsay lol
ComplexType568@reddit
BEEN WAITING SO LONG TO SEE A PERFORMANT >1T MODEL OTHER THAN KIMI
I hope these stats are from pure performance and not benchmaxxing. Because if this is just pure performance it'll be a glorious step up
reefine@reddit
No image input :(
uniVocity@reddit
I guess most of us are left with the option of running it on a spare nvme
Major_Olive7583@reddit
API price, any changes?
TinFoilHat_69@reddit
Someone should start vibe coding optimizations for older 24GB P100 cards, and maybe even older P40s or K80s. I'm pretty sure P100s support NVLink 1.0, which makes them an interesting choice at 85 dollars each
Sakatard@reddit
My p40 agrees with this idea
johnxreturn@reddit
Now on to test whether it’s benchmaxxed
GlossyCylinder@reddit
DS is one of the least benchmaxxed models out there lol.
Finanzamt_Endgegner@reddit
Doubt it, this seems not to be that good on the typical benchmaxxed benchmarks for its size, but I could be wrong ofc
mikumikubeeeeaaaammm@reddit
Imma die due to happiness oh my its thiccc
tassa-yoniso-manasi@reddit
gguf when?
I can't wait to run this at 0.00000000006884 t/s
popiazaza@reddit
Unsloth dynamic 3.0 gguf 0.1 bit when?
Raredisarray@reddit
Straight up lol
Finanzamt_Endgegner@reddit
Same
jakegh@reddit
Great pricing in the API. Not many gonna run a 1.6T param model locally, though.
They just waited too long; Opus 4.6/GPT-5.4 but open-source and cheaper won't shake the earth like R1 did. If they matched 4.7/5.5 that would be a different story.
SufficientPie@reddit
I'll always choose open source models over proprietary ones that scrape my open source code without following the license, sell the result back to me, then give all my data to people who want to hunt me down with autonomous weapons.
Middle_Bullfrog_6173@reddit
New hybrid attention + mHC. Is this supported in any inference software yet?
hdmcndog@reddit
vLLM has support for it.
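For reference, a sketch of what loading it through vLLM's Python API might look like once a release ships with it (the model id follows the HF pages above but is an assumption; the parallelism size is a placeholder for whatever your hardware needs):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # assumed repo id, per the HF collection
    tensor_parallel_size=8,                 # placeholder: enough GPUs for ~160 GB of weights
    max_model_len=131_072,
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```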
Iory1998@reddit
Man up guys and release a nano version.
Lopsided_Dot_4557@reddit
DeepSeek V4 Pro & Flash are HERE
and they just made every GPU cluster look overbuilt
🔹 1.6T parameter Pro + 284B Flash — both with 1M token context
🔹 27% of compute cost vs V3.2 — 10% of KV cache
🔹 Beats GPT-5.4 on Codeforces (3206 rating) — first open model to match closed frontier on code
🔹 New architecture: CSA + HCA + mHC + Muon optimizer — built different from the ground up
🔹 Fully open source — MIT license — run it yourself
Full breakdown video below 👇
https://youtu.be/Owzn47EBsow
HeavenBeach777@reddit
insane release
FoxiPanda@reddit
Guys, I can only test so many models per day.
KeikakuAccelerator@reddit
The goats!
marhalt@reddit
You can just feel the machine going 'wtf are you doing to me' when you are downloading it.
Lazy-Pattern-5171@reddit
Section 5.4.4 Code Agent in their report
To benchmark our coding agent capability, we curate tasks from real internal R&D workloads. We collect ~200 challenging tasks from 50+ internal engineers, spanning feature development, bug fixing, refactoring, and diagnostics across diverse technology stacks including PyTorch, CUDA, Rust, and C++. Each task is accompanied by its original repository, the corresponding execution environment, and human-annotated scoring rubrics; after rigorous quality filtering, 30 tasks are retained as the evaluation set. As shown in Table 8, DeepSeek-V4-Pro significantly outperforms Claude Sonnet 4.5 and approaches the level of Claude Opus 4.5.
(There’s a table in the middle with information that DeepSeek v4 pro reaches 67 where on the same benchmark Opus 4.6 reaches 80 and 4.5 reaches 70)
In a survey asking DeepSeek developers and researchers (N = 85) — all with experience of using DeepSeek-V4-Pro for agentic coding in their daily work — whether DeepSeek-V4-Pro is ready to serve as their default and primary coding model compared to other frontier models, 52% said yes, 39% leaned toward yes, and fewer than 9% said no. Respondents find DeepSeek-V4-Pro to deliver satisfactory results across most tasks, but note trivial mistakes, misinterpretation of vague prompts, and occasional over-thinking.
This sounds like the best “DeepSeek helped develop DeepSeek” moment for me and that’s amazing.
26YrVirgin@reddit
Multi-modal? Does it support image input?
MichaelXie4645@reddit (OP)
No, HF flags it as “text generation” not “image text to text”
Kahvana@reddit
Glad it's finally released.
larrytheevilbunnie@reddit
Finally
pmttyji@reddit
Most expected news! Finally!