Deepseek V4 Flash and Non-Flash Out on HuggingFace
Posted by MichaelXie4645@reddit | LocalLLaMA | View on Reddit | 315 comments
https://huggingface.co/collections/deepseek-ai/deepseek-v4
toothpastespiders@reddit
I think this is the most annoyed I've ever been at myself for not going overboard with RAM when I was putting my machine together.
Zymedo@reddit
I did and bought 4x48 GB (192 GB) sticks. Still not enough for any current big model, AND I cursed everything because of how badly my CPU runs them (DDR5-3600 or something)...
desktop4070@reddit
What CPU is it? I have a 12900K that also struggles running my DDR5 at advertised speeds.
Zymedo@reddit
9800X3D. Maybe I'm just unlucky, of course, but same - it doesn't want to run even just 2 sticks at 6000, drops them to 5600, or 3600 with 4.
PhilippeEiffel@reddit
The DeepSeek v4 flash model is 160 GB. If I had your memory size, I would give it a try (13B active).
Zymedo@reddit
Flash version might be okay. I tried Step-3.5-Flash (199B A11B) and it was ~10-13 tokens/sec (depending on the exact quant) with careful -ot. But it reasons for literal thousands of tokens - not ideal, considering that I managed to cram only 40K of context in my VRAM. The biggest problem would be waiting for llama.cpp to implement CSA and HCA, probably. They didn't implement DSA, did they?
Monad_Maya@reddit
Didn't jump to AM5 / DDR5 for the same reason. You need a very good board + binned CPU to get stable high-capacity DDR5 on consumer-grade AM5 stuff.
We really need to jump to workstation parts for stable high-capacity DDR5.
species__8472__@reddit
This used to be true, but isn't anymore. Multiple motherboard vendors have 192gb and 256gb 6000/6400MT/s kits on their QVL.
I have 192gb at 5200 stable. Just loaded the expo profile and was off and running.
Better-Monk8121@reddit
running ddr5 192gb 6000mhz with 9950x3d stable. Mobo X870E Tomahawk
DistanceSolar1449@reddit
284B A13B is not gonna fit on 128GB, you'll need at least 256GB for that. That's a lot of overboard.
BestGirlAhagonUmiko@reddit
But if you have like 128GB RAM + a couple of RTX 3090, quantize it down to IQ4XS or Q3KXL and it'll fit with a pretty usable context size. Am I wrong?
Vaguswarrior@reddit
Right, just a couple 3090s lol
Radiant_Bag_5007@reddit
WHO_IS_3R@reddit
Those split-second decisions, man. I'm still beating myself up for not choosing 128GB of RAM and throwing in 2-3 random 3090s instead of an i5/16GB laptop. I was this close, man, if only I knew
Vaguswarrior@reddit
I'm surprised you're noticing a difference, I mean, an i5... That's the big leagues, right?
IrisColt@reddit
heh
madhan4u@reddit
Good to know that I just need 96GB more RAM and a couple of RTX 3090s
DistanceSolar1449@reddit
IQ4XS will be around the size of 1/2 the param count of the model. So 284b/2 = 142GB
Assuming 8GB for the operating system and a few small apps, you have 120GB of RAM. So you need 22GB of VRAM before you even look at KV cache. You need 2x 3090 to even have ANY context.
It's doable but ugly.
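If you want to sanity-check the fit for your own box, here's a rough sketch of that arithmetic (the 0.5 bytes/param for IQ4_XS and the 8GB OS reserve are ballpark assumptions, not measurements):

```python
# Back-of-the-envelope fit check for a RAM+VRAM split.
params_b = 284          # total params, billions
bytes_per_param = 0.5   # IQ4_XS is ~4.1 bits/weight, so ~0.5 bytes/param
ram_gb, os_reserve_gb = 128, 8

weights_gb = params_b * bytes_per_param                  # ~142 GB of weights
vram_needed_gb = weights_gb - (ram_gb - os_reserve_gb)   # what has to spill into VRAM
print(f"weights ~{weights_gb:.0f} GB, VRAM needed before KV cache: {vram_needed_gb:.0f} GB")
# -> weights ~142 GB, VRAM needed before KV cache: 22 GB (hence 2x 3090 minimum)
```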
Karyo_Ten@reddit
According to release notes, it's 10GiB of KV cache for 1M context.
So this fits comfortably on 192 GiB VRAM.
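For what it's worth, taking those release-note numbers at face value, the per-token cost is tiny:

```python
# Per-token KV footprint implied by "10 GiB for 1M context".
kv_gib, ctx_tokens = 10, 1_000_000
bytes_per_token = kv_gib * 2**30 / ctx_tokens
print(f"~{bytes_per_token / 1024:.1f} KiB of KV cache per token")
# -> ~10.5 KiB/token; a classic full-attention cache on a model this size
#    would run to hundreds of KiB per token.
```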
DistanceSolar1449@reddit
192GB != 128GB
Karyo_Ten@reddit
You said
192GB != 256GB
DistanceSolar1449@reddit
I'm not talking about mixing RAM sticks. Yes, you could mix and match RAM sticks and somehow stick exactly 142GB into a motherboard, but nobody does that.
FullstackSensei@reddit
Or you can run a six channel memory platform
Karyo_Ten@reddit
Not sure what you're talking about. I'm talking about 192 GiB not 142 and 192GiB is a very standard size: https://www.corsair.com/us/en/p/memory/cmk192gx5m4b5200c38/vengeance-192gb-4x48gb-ddr5-dram-5200mhz-c38-memory-kit-black-cmk192gx5m4b5200c38
florinandrei@reddit
So easy, you could run it on a Nokia.
ResidentPositive4122@reddit
The weights are ~160GB on hf, and they use mixed fp4/fp8 already. You won't get much out of quanting it again, and might lose a lot of precision. Realistically this will be served on 2x PRO6000, since the kv cache seems to be really efficient.
-dysangel-@reddit
Smug mode: engaged
JLeonsarmiento@reddit
Sell everything, get m3 ultra with 256gb for 6k and go on holiday trip with the rest of cash.
draetheus@reddit
I passed up on a chance to buy an open box 256gb mac studio from microcenter for extremely cheap. What a fool I was...
RoomyRoots@reddit
Could be like me, prices exploded in the week of my birthday. I had to cancel my wishlist because 64GB was more expensive than 128 when I bought it one year before.
SnooPaintings8639@reddit
I built mine for the Llama 3 release. It is serving me well and is worth more today than two years ago when it was new.
But...
I do a daily ritual of looking for cheap options to extend it anyway... but it only gets more and more expensive :(
PWCIV@reddit
is it producing something good?
Wooden_Yam1924@reddit
I feel the same. In January last year I built a PC and got 512GB of DDR5 ECC - it cost me ~$3000, now I can buy one 64GB stick for the same price... Looking at the current models I wish I had 1TB
ambient_temp_xeno@reddit
Rare instance of me being right about something.
kaisurniwurer@reddit
With A49B, you probably would have had a hard time actually using it anyway, so don't feel so bad.
Flash on the other hand looks quite juicy...
WAFFLED_II@reddit
This is the one time I’m GLAD I went overboard on ram and nothing else
Zyj@reddit
Looks like it's 158GB in size. Fits dual Strix Halo.
RazsterOxzine@reddit
I feel you! I feel you! * cries in corner *
Monad_Maya@reddit
I've maxed out (128gb / AM4) but it's still not enough.
Monkey_1505@reddit
Beautiful month. Incredible really.
Makes me wonder how rich one has to be to run flash locally though.
Zyj@reddit
Dual Strix Halo, I paid ~3300€ total.
pmttyji@reddit
when did you buy? Thought single one alone cost $2K+
Zyj@reddit
Fall last year, they were 1580€ each
Zeeplankton@reddit
seriously is this like the heaviest release month in.. ever? Gemma, Kimi, Qwen, Ling, Hy3, MiMo, Deepseek, am I missing anything..? I guess lots of closed model releases too.
grumd@reddit
Yeah, all of that AND GPT-5.5 and Opus 4.7, it's crazy
WildContribution8311@reddit
I'd rather pretend Opus 4.7 never happened.
steny007@reddit
We do not use those words here.
7734128@reddit
There's been a lot more. Plenty of specialized non-llm models, but also more llms like https://huggingface.co/ibm-granite/granite-4.1-8b
Oh, wait you're talking about the whole month? This week alone has been insane.
jadbox@reddit
Now we just need proper benchmarks to compare all of them in real world writing and coding.
Zc5Gwu@reddit
Minimum $5k for cluster of 2 strix halo 128gbs I'd say.
LagOps91@reddit
2 gpus with 24GB vram each and 128gb ram would be viable for consumer hardware for q4, but q3 should be fine too, so that's not too absurd. Or at least was before ram price hikes.
Monkey_1505@reddit
You'd probably need a fairly beefy cpu/ram setup to handle offloading a significant majority of a 13b active model that's in the 100+gb territory and get usable speeds, I suspect.
Not that I've tried.
LagOps91@reddit
I do have 24gb vram and 128gb ddr5 ram. On comparable models, i am getting about 8 t/s at 32k context. No good for agentic coding, but fine for regular assistant usage.
Nobby_Binks@reddit
Sigh, I haven't even set up Qwen 3.6 27B yet. I can't keep up
Iory1998@reddit
You can't set up these either.
the3dwin@reddit
Available on LM Studio 2 days ago
Iory1998@reddit
??
the3dwin@reddit
Qwen 3.6 27B available on LM Studio 2 days ago
AXYZE8@reddit
Both models use FP4 + FP8 mixed: MoE expert parameters use FP4 precision; most other parameters use FP8.
So these DeepSeek models are not as big as the param count suggests. Very important when comparing to other models that are usually BF16 or FP8 - quantizing those down to DeepSeek's precision would reduce their quality.
That being said, as Kimi K2.6 is INT4, the new whale is still the fattest OSS model. Love that we got a Flash variant for (beefy) desktops - 284B at FP4+FP8 fits in the 256GB limit of AM5/LGA1700/1851 platforms.
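You can roughly reproduce the on-disk size from that split; note the expert share below is my own guess for illustration, not a number from the model card:

```python
# Rough size estimate for FP4 experts + FP8 for everything else.
total_params_b = 284
expert_fraction = 0.90   # assumption: ~90% of params sit in the MoE experts
size_gb = total_params_b * (expert_fraction * 0.5 + (1 - expert_fraction) * 1.0)
print(f"~{size_gb:.0f} GB")  # ~156 GB, close to the ~160 GB of safetensors on HF
```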
waruby@reddit
Shit, that means it can't be quantized much more.
Minute_Attempt3063@reddit
they did the work for us XD
and yet it's massive LOOL
maybe we're finally reaching ChatGPT levels, but locally
Unusual_Guidance2095@reddit
Kind of disappointing that indeed it seems to only be a single modality
traveddit@reddit
I think just from this Qwen already beats the flash model.
Hoodfu@reddit
Qwen is dry as a bone for writing. Deepseek (at least from 0324) has been a creative writing dynamo.
traveddit@reddit
Prompt:
Deepseek:
Qwen 397:
Tell me the dynamo that is Deepseek here?
Outrageous-Wait-8895@reddit
In what way is Qwen superior here?
traveddit@reddit
Why, did you run them through an LLM and find that it prefers Deepseek's?
I'll entertain you if you tell me why Deepseek's is better in your own words.
Outrageous-Wait-8895@reddit
Did I?
You didn't answer the question by the way.
RuthlessCriticismAll@reddit
Deepseek's is like infinitely better. Neither is particularly good, but it's actually about loneliness rather than a highly used snack machine, which evokes literally the opposite emotion.
traveddit@reddit
Is it? A lone passenger at a subway?
Gen Z level of literacy you have right there. You can't even properly locate the subject of Qwen's poem.
You can barely read, but you're confident enough to tell the world what you think is good writing? Nobody should care, because 99% of you on this sub don't know good writing from bad.
tengo_harambe@reddit
Qwen's poem is abstract and metaphorical (the snack is alone and nobody wants it), Deepseek's is literal. Subjectively I prefer Qwen's, which is odd because I usually find it to be terrible at creative writing.
Hoodfu@reddit
I've used the Qwens/Deepseeks/Gemmas/Mistrals for text-to-image prompt enhancement for a good while now on an M3 Ultra 512. The images Qwen makes are simplistic and unimaginative, even when upping the temp. Gemma is a good deal better, along with the older Mistrals. Deepseek however, especially 0324, thinks of stuff that's amazing. If you want it to be snarky and sarcastic, it especially comes alive. There's a reason why it's always near the top of the creative writing benchmarks whereas these others are far below it.
traveddit@reddit
What makes you think you can tell good writing from bad? Why should I care about LLM evaluated writing benchmarks?
The_Rational_Gooner@reddit
V3.2 is really dry though. Hopefully this one strikes a good balance
dark-light92@reddit
In the report they mention that they are working on multimodal capabilities.
Monkey_1505@reddit
I think they were wise to focus on agentic first, and circle back to multi-modality.
breadfruitcore@reddit
Yeah I'm super grateful they released this, but it's a bit sad. Guess I'll still be doing model switching in my workflows.
MichaelXie4645@reddit (OP)
Agreed, I kinda looked forward to the 284B with vision support.
Specter_Origin@reddit
I would keep my fingers crossed...
ReadyCelebration2774@reddit
tested the flash model, seems really really fast
Whiz_Markie@reddit
How much VRAM is required?
power97992@reddit
Approximately 160GB for the model if you count the xet files. Plus full context, I guess you are looking at another 20-40GB if it is 5% of V3.2's (since Pro is 10% of V3.2 per token)
thrownawaymane@reddit
What about those of us sitting on a bunch of DDR4/5 RAM?
the__storm@reddit
284B params; we'll see what the quants look like but you could probably squeeze it into 192 GB (2x Pro 6000 Blackwell). With 13B active, a 256GB Mac might do okay as well.
200206487@reddit
I have one of the 256s, looking forward to testing. I just learned about this while sitting on the porcelain throne. I hear that since it's QAT (prequantized during training), quantizing below what's already released as the base will reduce quality considerably more than it would for a non-QAT release? Did I get that right? Also, from what I'm reading, QAT is great to lead with, to set a trend for others to do the same. I imagine QAT is best since they can control quality while compressing the model straight from the source - again, IIRC.
the__storm@reddit
Yeah, that's right. Looks like the instruct QAT is 160GB though so you should be good.
OC2608@reddit
Yes.
Zyj@reddit
160GB + context
Thomas-Lore@reddit
But context is very efficient so maybe 192GB will be enough.
RedBull555@reddit
Yes.
ReadyCelebration2774@reddit
not local, through their api
thread-e-printing@reddit
All of it
True_Requirement_891@reddit
Looks like Minimax is gonna be out of business...
power97992@reddit
They will release a bigger model like 390-400b
silenceimpaired@reddit
DeepSeek-V4-Flash with 284B parameters (13B activated) is an interesting setup. Hardly a flash in my mind but I'm curious how it will compare at 4bit against GLM 5 at 2bit.
andy2na@reddit
need a 0.01bit quant of that
Unusual_Guidance2095@reddit
It does seem that the entire model size is only 896 GB though, so seemingly mostly Q4, but the model card said a mix of both
-dysangel-@reddit
if this model is using engram (though I'm not reading anything saying that yet..), a lot of those weights could be stored on disk
Hoodfu@reddit
So at q4 that's around 450 gigs. So I can run it on an m3 ultra 512 gig mac with about 3 sentences of context.
moar1176@reddit
Nah, did you not see the kv chart? They fit 1M context into < 10 gigs on the full size model. Absolutely masterful engineering. Of course on a Mac that would probably take a few hours to Prefill.
Silver-Champion-4846@reddit
How lossy is the cache?
ResidentPositive4122@reddit
No, the ~865GB is already quantised in fp4/fp8 mixed. So you can't reduce that too much without basically bricking it.
po_stulate@reddit
3 sentences context costs 60GB RAM?
Nodja@reddit
Parent is saying that the model was mostly trained at 4 bits already, because 1.6T * 4 bits = 6.4 Tbit = 800GB. If the size on disk is 896GB, only 96GB is possibly 16-bit, so for Q4 quants you're seeing savings of ~50GB, not 450GB.
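Spelled out with the same numbers:

```python
params = 1.6e12
pure_4bit_gb = params * 4 / 8 / 1e9   # 4 bits = half a byte per param
on_disk_gb = 896
print(f"{pure_4bit_gb:.0f} GB at pure 4-bit, "
      f"{on_disk_gb - pure_4bit_gb:.0f} GB left for higher-precision tensors")
# -> 800 GB at pure 4-bit, 96 GB left for higher-precision tensors
```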
shing3232@reddit
the base is 1.6T, and it's QAT'd into the instruct at ~900GB
Caffdy@reddit
can you expand on this? it's kinda confusing
the__storm@reddit
They're both 1.6T parameters, but the base model (not trained to talk in turns or do tasks or anything, purely predicts text) is at higher numeric precision. The instruct version (model with additional training that we normally use to chat and do tasks) is basically pre-quantized to take up less space by storing the parameters with less precision. Since the model was trained taking into account in advance how it would be quantized, we would expect this version to perform better than an equivalent model that had been quantized separately after training. (Downside is that you also have less flexibility to change the quantization after the fact.)
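For intuition, the standard QAT trick is "fake quantization": round the weights in the forward pass but let gradients flow through as if no rounding happened (a straight-through estimator). A minimal sketch, using a toy symmetric 4-bit grid rather than DeepSeek's actual FP4/FP8 scheme:

```python
import torch

def fake_quant(w: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize in the forward pass; identity in the backward pass."""
    scale = w.abs().amax() / 7.0                           # fit weights onto a 4-bit grid
    q = torch.clamp(torch.round(w / scale), -8, 7) * scale
    return w + (q - w).detach()                            # forward: q, backward: grads flow to w

# During QAT a layer computes with fake_quant(weight), so the optimizer learns
# weights that still work well once the rounding is made permanent at export.
```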
34574rd@reddit
Not really; according to their technical paper it can be dequantized to BF16 losslessly
florinandrei@reddit
Just quoting that fact because it's important.
Caffdy@reddit
is this what they call QAT?
shing3232@reddit
yes
redimkira@reddit
if you find that, you'll win the Nobel prize in medicine for lobotomy
cafedude@reddit
Yeah, it would be nice if they took a hint from the Qwens and released some under 100B models. Flash-lite or something.
synn89@reddit
MIT license? Nice.
SufficientPie@reddit
Hear that, Qwen?
popiazaza@reddit
Minimax perhaps? or less likely, Moonshot? Qwen and Zai are quite open.
SufficientPie@reddit
Qwen3.6-397B-A17B was released as open weights?
popiazaza@reddit
There is no such model? Plus and Max variant are proprietary as usual for Qwen.
SufficientPie@reddit
Qwen3.5-Plus was released as open weights:
popiazaza@reddit
Not sure what you want, but you seem to be able to find the answer by yourself. Have a nice day.
popiazaza@reddit
Yes. Only their plus/max variant are proprietary.
onil_gova@reddit
what's wrong with Apache 2.0?
SufficientPie@reddit
Qwen3.6-397B-A17B was released under Apache 2.0??
Embarrassed_Adagio28@reddit
Kinda disappointed to see there is no 30B, 80B or even a 122B. This doesn't help the local LLM community much.
Right-Law1817@reddit
Multimodal models soon.
steny007@reddit
Doesn't necessarily mean soon.
AFruitShopOwner@reddit
Oof those hallucinations on flash are baaaaad
Zeeplankton@reddit
the whole idea of positive RLHF against hallucination is still new. In Flash's defense, all these scores suck balls
silenceimpaired@reddit
Sounds like a great creative writing model.
AFruitShopOwner@reddit
Seems like it has more world knowledge at the cost of thinking it knows everything
silenceimpaired@reddit
I'm okay with that. I'm not using LLMs to code... Just brainstorm
AFruitShopOwner@reddit
Big yikes
Altruistic_Heat_9531@reddit
So lemme get this straight, in 1-2 weeks there are
- Qwen 3.6
- Deepseek V4
- Gemma 4
- Opus 4.7
- GPT 5.5
And in past 24 hours
- DeepSeek V4
- 27B Qwen 3.5
- GPT 5.5
ndrewpj@reddit
In the past 24hrs it was Qwen 3.6 27B, not 3.5
Mashic@reddit
Anthropic is the only one to still not release an open model.
VampiroMedicado@reddit
Did xAI release a model yet?
Altruistic_Heat_9531@reddit
They did. If I'm not mistaken, Anthropic is the only one that hasn't released any model yet, and also hasn't released any major contribution in terms of training. Only MCP.
Microsoft even donated DeepSpeed, Uber with Horovod, and of course Meta with Torch and its predecessor Caffe2.
Mashic@reddit
Well, that one too.
arbv@reddit
They did! Grok 2, IIRC.
Altruistic_Heat_9531@reddit
whoops mb
xspider2000@reddit
u forgot GLM 5.1
Zeeplankton@reddit
I think this is the biggest release month ever
Sky-kunn@reddit
zdy132@reddit
On par with SOTA models, with reduced compute and memory load. DeepSeek did a great job here.
coder543@reddit
At 1.6T A49B, I don't know that we can confidently say that for DeepSeek V4 Pro. We don't know how big the frontier models are. Anyone who throws out a number is just wildly guessing.
But it is still very cool that they released the model openly.
the__storm@reddit
Well the API is like 1/4 the price of Opus/GPT, so they're probably either accepting a lower margin or have improved inference efficiency. The technical report has a lot of stuff about how they're serving it.
kurtcop101@reddit
The inference costs of open source models are more representative of actual inference costs.
It's why I keep saying the big companies are not losing money on their subs or anything like that. Random data centers are not hosting models trying to gain market share, they run at a profit.
The place where big labs are losing money is only training the models - these subscriptions ARE profitable. For the frontier US labs the training is subsidized by VC money, for Chinese labs the training is subsidized by the Chinese government. The Chinese government isn't expecting a return - they're in an arms race - so they release open source instead, because it makes it harder for the frontier US labs to charge higher pricing and profit more, and disrupts them. It generally just disrupts the US market.
coder543@reddit
They definitely accept lower margins. If they tried to charge high margins on an open model, every other AI host on OpenRouter would undercut them with their own model.
HiddenoO@reddit
Likely, yes, but you have no idea about OpenAI/Claude/Google margins. They cannot charge arbitrarily high either because they don't want people to go to their competitor models and possibly never return.
34574rd@reddit
I'm not sure how exactly they are going to undercut with a 1.6T parameter model
shing3232@reddit
it's gonna be cheaper once more 950 accelerator cluster is online.
Zeeplankton@reddit
I think it's funny when just last year we were in awe of an open 1T model. I can't remember which one it was, but how times have changed lol.
Silver-Champion-4846@reddit
Kimi K2
Winter_Educator_2496@reddit
Either same or bigger. It depends on a lot of factors I won't get in to, but it is not less. Remember that the single easiest way to make a model better is to make it bigger.
Monkey_1505@reddit
I assume you did not browse the paper?
zdy132@reddit
I am talking about the reduction in Single-Token FLOPS, and Accumulated KV Cache, the two diagrams on the right.
Both of these mean that they will be less taxing on the hardware to run, making it cheaper to serve. And in the (hopefully near) future when consumer hardware can run it, we will be able to run it more efficiently as well.
power97992@reddit
On page 6 of the technical report, they said it is approaching the level of Opus 4.5.
DistanceSolar1449@reddit
power97992@reddit
GLM 5.1 is BF16, but DS V4 is Q4+Q8 mixed precision, so V4 actually uses less VRAM.
NandaVegg@reddit
According to the GLM 5 paper, GLM 5 (and 5.1 I guess) had int4 QAT (also mixed precision: W8A8 and W4A8 for experts) at the post-training phase, but it is kind of vaguely written. I would consider it native 8-bit at least; I had no issues running GLM 5.1 with the official fp8 quant (vllm, 8xH200).
BillDStrong@reddit
Yes, but isn't GLM-5.1 a dense model? Take the sqrt of the active params times the total params of the MoE to get a ballpark for the size of dense model it will act like.
So it acts like a ~277B model at the speed of a 49B model, with the knowledge of a 1.6T model.
If it is sticking head to head with the 744B model, it is doing pretty darn good, really.
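Plugging the numbers into that rule of thumb:

```python
from math import sqrt
active_b, total_b = 49, 1600   # A49B out of 1.6T total
print(f"~{sqrt(active_b * total_b):.0f}B effective dense size")
# -> ~280B, close to the ~277B figure above
```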
DistanceSolar1449@reddit
It's 744B with 40B active
NandaVegg@reddit
If the FLOPS reduction claim stands true (and can be properly implemented in inference engines) it would be a fantastic choice for a faster interactive experience.
I tested the official API a bit; so far it is not that fast for either Pro or Flash, but then it has always been the case that DeepSeek's official API is slow, as they apparently do not have enough compute.
As for parameter count, it uses mixed precision (experts are 4-bit), so the real VRAM footprint for the weights should be around 860GB. Still larger than GLM-5.1, which is 744B, or Kimi, which is around 500GB with 4-bit QAT.
zdy132@reddit
It varies a lot right now. Some replies are almost instant, 100-200 t/s, but some are only 10-20 t/s.
It's probably their servers being hammered by all the people checking the new model out.
Few_Water_1457@reddit
https://artificialanalysis.ai/?models=deepseek-v4-pro%2Cdeepseek-v4-flash%2Cminimax-m2-7%2Cqwen3-6-27b wait what?
More-Curious816@reddit
Recoloring the chart because I can't see shit
RazsterOxzine@reddit
May I ask how this helps? Just wondering because it's not doing it for my eyes ☼_☼;
More-Curious816@reddit
I have shit eyesight, that white and gray wasn't doing it for me. I recolored it originally for myself, and posted it for people who suffer the same headache from grey and white. It wasn't meant as a better alternative.
layer4down@reddit
Yeah and thanks for it. I’ve got crazy floaters and blind spots so this is definitely preferable.
RazsterOxzine@reddit
I feel ya. It works, it works. Probably more people are in the same situation. I was just curious. 🖖
Salaja@reddit
YOU MADE A MISTAKE
the "SWE Verified" middle column should be Claude (green), instead of GPT (orange).
I feel like there is a "blind leading the blind" joke here, but i can't "see" it... ha ha.
Desm0nt@reddit
The fact that GPT-5.4-high performs worse than (the very strange and erratic) Gemini 3.1 Pro in some tests makes me seriously doubt the reliability of these results...
Thomas-Lore@reddit
You should try Gemini 3.1 Pro on API or AI Studio, not in the Gemini App. It is much more capable than people think. I use it for one shotting solutions that I then give to minimax to implement.
Desm0nt@reddit
In AI Studio - maybe. But for coding Google forces me to use Antigravity and Gemini CLI (and bans me if I proxy it to any more useful tools), and in Antigravity it's way dumber than even Sonnet, and sometimes (during prime hours) even dumber than Gemini 3 Flash, with typical quantization problems (losing the meaning of the context, getting stuck repeating the same phrase, etc.).
So maybe 3.1 Pro is actually pretty good in its original, full-fledged FP16 form, but it feels like they only ran it that way once in its lifetime, for benchmarks. And then the "little, poor indie company GOOGLE" apparently quantized it to Q2 and rolled it out into their tools; otherwise I can't explain such a difference in real-world use.
power97992@reddit
Gemini 3.1 pro in the antigravity seems to be way worse than opus 4.6
Ardalok@reddit
Not as benchmaxed as Qwen, good.
danigoncalves@reddit
Waiting for a micro version since their Flash version is almost 300B 😬
GlossyCylinder@reddit
Interesting, they say it's a preview version, but looking at the benchmarks it's on par with K2.6 on coding and agentic tasks and slightly better at math and reasoning (expected). I honestly thought it would perform worse than Kimi or GLM on coding, but the gap between OSS models is very tight.
And a detailed report/paper from DS, as always. Seems like they were able to incorporate all their recent ideas into V4.
Karyo_Ten@reddit
They didn't incorporate Engram ;)
Re theorem proving, interesting, though are you aware of Mistral's Leanstral and Meituan's Longcat Prover?
segmond@reddit
you must not be aware of these, which long predate those:
https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-671B - "DeepSeek-Prover-V2, an open-source large language model designed for formal theorem proving in Lean 4"
https://huggingface.co/deepseek-ai/DeepSeek-Math-V2 - "DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute."
Karyo_Ten@reddit
I am aware, but Leanstral and the Meituan prover are less than 2 months old and didn't get much coverage, so they could easily have been missed.
segmond@reddit
They have the best and only math models: DeepSeekMath-V2, which is capable of winning gold at the IMO, and DeepSeek-Prover. I think they incorporated those lessons and data into this model.
power97992@reddit
It looks good, and maybe Engrams will come out later?
Ok-Mess-3317@reddit
IT’S FINALLY HERE
More-Curious816@reddit
AND IT CHANGES EVERYTHING. (I can see the click baiting slop already)
bnolsen@reddit
ITS A GAME CHANGER. WE WERE ALL WRONG, DEEPSEEK IS BACK.
Ok-Mess-3317@reddit
Lmao unfortunately
200206487@reddit
Looking at Unsloth / Bartowski 5-bit ftw!
SnooPaintings8639@reddit
Wake me up when the GGUF is there!
This_Maintenance_834@reddit
This is pretty much already quantized. GGUF won’t really help to reduce the size, unless it goes below Q4.
tarruda@reddit
I don't have high expectations for deepseek, but Qwen 3.5 397b quantizes extremely well even down to 2-bit. I was able to run it on my 128G mac and got excellent benchmark results: https://huggingface.co/tarruda/Qwen3.5-397B-A17B-GGUF/discussions/1#69d142b4f17676f98e53c16a
If Q4 fits in 160G, maybe there will be a good ~3-bit quant for 128G machines.
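The hope in numbers (naive linear scaling, ignoring whatever has to stay at higher precision):

```python
q4_gb = 160
q3_gb = q4_gb * 3 / 4
print(f"~{q3_gb:.0f} GB")  # ~120 GB: tight on a 128 GB machine once KV cache is added
```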
Zeeplankton@reddit
WHATEVER IT TAKES
SnooPaintings8639@reddit
Not gonna lie - I am waiting for Q2, lol
Recoil42@reddit
Gank Goodness Its Friday
manipp@reddit
So is this the same thing you get from their website? E.g. is the instant mode DS4 Flash?
david_0_0@reddit
the flash variant with 13B active params is interesting — does anyone know if there's a q4 quant that fits on 32-64GB VRAM without completely tanking quality? curious where the actual quality cliff shows up vs the full activated count
Nepherpitu@reddit
The point is, it is already Q4 at 170GB. It will not fit into 64GB VRAM even at Q1.
david_0_0@reddit
ah that's brutal, so even the flash version is basically multi-GPU territory regardless of quant level
dampflokfreund@reddit
Disappointing release. Too hard to run, even the Flash, no multimodality, no revolutionary stuff like Engram... expected much more after the long wait.
Mochila-Mochila@reddit
Their work on solving context size shouldn't be dismissed tho.
Pristine-Tax4418@reddit
What is the real size? I want to believe that it is not 284B
Look_0ver_There@reddit
The real size is 284B, however it uses a mix of FP4/FP8/FP16/FP32 as well as some INT8 weights. It's sort of like GPT-OSS-120B, where it's mostly MXFP4.
As a consequence, the full weight size is ~170GB, and not the ~284GB you might expect.
edward-dev@reddit
How much RAM? Yes.
Jokes aside, Qwen3.6 27B seems on par or even a little bit better than V4 Flash, at least on benchmarks
alex20_202020@reddit
So I have noticed. Why run DS instead, when it uses so much space?
Former-Tangerine-723@reddit
Because benchmarks don't tell the whole truth
34574rd@reddit
*on benchmarks*
kevin_1994@reddit
Anyone able to compare Flash to Minimax M2.7? Similar sizes, but I don't see any direct comparison and I'm on my phone.
Sinister1066@reddit
Zc5Gwu@reddit
Wow, that pricing and speed is crazy. Minimax needs new architecture, fast.
Sinister1066@reddit
I think flash takes the win on this one, but K2.6 wins over v4 flash
power97992@reddit
K2.6 is also 13x more expensive
Few_Water_1457@reddit
really
nFunctor@reddit
Spoke to it about some philosophical/social stuff to check its style and analytical effort. Sent outputs back to Opus. We both agreed it's Opus 4.6 high-think level, both in style and substance.
It was set free.
uniVocity@reddit
Ha just when I was beginning to feel rich with my 128gb… now I’m hoping qwen releases another model to compete with DeepSeek. Maybe qwen3.6-397b or a new qwen-coder version? One can only dream.
alex20_202020@reddit
The benchmarks list is very different for 3.6 27B, but where lines are named similarly (what does EM mean? 5-shot?), I see Qwen is higher: https://huggingface.co/Qwen/Qwen3.6-27B vs https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
Due_Net_3342@reddit
32B reap version when? :))
Ferilox@reddit
What do they mean by "Think Max" mode? Is it like Think High but with a special system prompt? Have they specified what system prompt they used for the benched values of DS-V4-Pro Max? Was it a single system prompt for the entire benchmark suite, or did they change it benchmark to benchmark?
power97992@reddit
Max means max reasoning
Ferilox@reddit
I was going over the DeepSeek API docs and that's how it's turned on, but what do they mean by the system prompt in the response format in the model card?
I find it confusing
jnmi235@reddit
V4-flash only being 160GB is wild
Eyelbee@reddit
It's not that wild considering it's barely any better than qwen 3.6 27B
Expensive-Paint-9490@reddit
So I was going to download MiMo and Hy3 to test but now my priorities have changed.
power97992@reddit
U can test all of them on openrouter then decide which one u will download
power97992@reddit
I'm surprised they didn't use Engrams in this model, maybe in the future. And there's no multimodality for the Pro model
Dany0@reddit
It's a preview, apparently multimodality + Engrams are still coming
An early checkpoint that beats Opus 4.6. Thank you DeepSeek, we may not be GPU rich, but with friends like you, we are rich
power97992@reddit
It seems like they are saying it is almost as good as Opus 4.5, so they are about 5 months behind
DarkArtsMastery@reddit
It's happening aight?
megadonkeyx@reddit
Hand cranking my Dell R720xd as we speak, 384GB DDR3 with a 12GB RTX 3060.
4 tokens/sec is all anyone ever needs
power97992@reddit
Just rent 5 b200s or use the api… it is faster
jselby81989@reddit
This is the moment I regret not maxing out my RAM.
Dany0@reddit
I thought I was a baller with my 5090, but I'll be renting H200s in the cloud with the rest of y'all :/
Bestlife73@reddit
I was here!
markovianmind@reddit
I am here
onil_gova@reddit
Zyj@reddit
It's only open source if you release the training data, like Nvidia has been doing lately. Otherwise it's open weights.
Mashic@reddit
If they were trained on copyrighted materials (books, scientific papers...) without acquiring a proper license, don't expect them to release the training data and put themselves at risk of lawsuits.
Ardalok@reddit
And all the research.
onil_gova@reddit
context
mrjackspade@reddit
Holy cringe.
FlyingCC@reddit
me too my friend me too
More-Curious816@reddit
Add me to the celebrity group chat.
TheRealMasonMac@reddit
It seems to be a native FP4 model? Their card says: “FP4 + FP8 Mixed: MoE expert parameters use FP4 precision; most other parameters use FP8.”
Lowkey_LokiSN@reddit
Yup! And that's what I'm most excited for!
It's the one thing I've been really missing out on since gpt-oss-120b.
Independent-Date393@reddit
52% of deepseek's own engineers switched to V4-Pro as their primary coding model. that data is in the paper section 5.4.4 and it's more interesting than any public benchmark
RuthlessCriticismAll@reddit
To be clear, it's that fewer than 9% said no. So almost all the DeepSeek engineers are willing to use V4.
ilintar@reddit
Is it better than Qwen3.6 27B?
Noxusequal@reddit
Do I see correctly that Engrams are, at the very least, not mentioned in the model descriptions?
faschu@reddit
The V4 version also has no native image support?
Mr-I17@reddit
284B Flash 🫠. *Sad 128GiB UMA noises*
beneath_steel_sky@reddit
Flash (non base) is 158B, quants should fit in 128GB
Mr-I17@reddit
Nah, it's 284B, as written on the model card. The 158B seems to be a miscalculation by HuggingFace. DS V4 uses a mix of FP4 and FP8 natively; the total size of the safetensors files is about 160GB. 128GiB UMA owners will have to use 2-bit quants.
beneath_steel_sky@reddit
My hopes just went out the window
Karyo_Ten@reddit
It's already quanted to FP8 + FP4 (or INT4? unsure), so you'll need to requant to INT8 + ~INT3
Zyj@reddit
It's 158B because it's quantized to mostly Q4
cafedude@reddit
Maybe we can get one of those 1-bit quants?
Mr-I17@reddit
Yeah, might as well just pick a 1T model. It'd fit into 128GiB RAM perfectly. /s
MDSExpro@reddit
AWQ when?
power97992@reddit
They need to make a 120B Q4/Q8 mixed-precision model
Then-Topic8766@reddit
History in the making.
Material_Soft1380@reddit
AleksHop@reddit
its extremely bad in terraform
weiyong1024@reddit
As a developer from China, this is what I respect most about DeepSeek, they just keep shipping MIT license and 1M context while the rest of the field is busy marketing, in a noisy race a bit of rational focus goes a long way. It's also a solid self hostable option in my multi provider agent rotation, not a hedge exactly, more like a core slot that happens to also be free of external policy exposure.
TinyDetective110@reddit
`For the Think Max reasoning mode, we recommend setting the context window to at least 384K tokens.`
Accomplished_Ad9530@reddit
Holy shit, that's a lot of thinking tokens
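If you want to poke at it over the API, DeepSeek's endpoint is OpenAI-compatible; a minimal sketch (the model id here is a placeholder, and how Think Max is actually selected isn't clear from the card, so check the official docs):

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")
resp = client.chat.completions.create(
    model="deepseek-reasoner",   # placeholder id for the V4 thinking model
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    max_tokens=32_768,           # leave plenty of headroom for reasoning tokens
)
print(resp.choices[0].message.content)
```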
Wibong@reddit
The distill dataset already came out!!!
https://huggingface.co/datasets/beyoru/deepseek-v4-pro-max-distillation-preview-shot
AlbeHxT9@reddit
>Opus level at these prices wtf
MDSExpro@reddit
Flash size is perfect! Finally a good model for that parameter band.
Jackalzaq@reddit
Is the base model for the Pro the one that's 1.6T parameters, with the instruct one half of that (862B)? Or is the HuggingFace parameter count bugged?
Caffdy@reddit
someone mentioned that it should be corrected in the next 1-2 days
Jackalzaq@reddit
ah, thanks!
Karyo_Ten@reddit
It's QAT. Delivered in FP8+FP4 (or INT4, didn't check)
Zyj@reddit
The instruct model's weights are mostly FP4
rm-rf-rm@reddit
odd that they haven't shared any benchmarks for the flash model
hdmcndog@reddit
They have: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro#comparison-across-modes
rm-rf-rm@reddit
those are comparisons across modes, not against other models like Qwen3.6, Minimax 2.7, Sonnet 4.6, etc.
hdmcndog@reddit
Luckily, we can just join the tables ourselves ;)
But I see what you meant now.
ortegaalfredo@reddit
They are on the hugging face page
rm-rf-rm@reddit
I didn't see it, looking again
Different_Fix_2217@reddit
It does not seem very good... Hopefully it's just broken, because this is nowhere near Kimi / GLM.
Aaaaaaaaaeeeee@reddit
In case you were wondering about Engram, it's not part of these models yet.
It's saved for future work.
Both models have post-trained QAT experts in MXFP4. Very happy that they do QAT release too, so it can be the norm.
Real_Ebb_7417@reddit
Ok so, can someone smarter than me tell me what new techniques they used that will make our lives easier? (They always implement something new, so I guess this time is no exception?)
Caffdy@reddit
1M context with very low KV cache memory requirements
Zyj@reddit
Will help with larger context size. Less RAM, more speed.
Finanzamt_Endgegner@reddit
Better hybrid attention it seems, some long context improvements and if I'm not mistaken something to make the active params smarter, but that one is hearsay lol
ComplexType568@reddit
BEEN WAITING SO LONG TO SEE A PERFORMANT >1T MODEL OTHER THAN KIMI
I hope these stats are from pure performance and not benchmaxxing. Because if this is just pure performance it'll be a glorious step up
reefine@reddit
No image input :(
uniVocity@reddit
I guess most of us are left with the option of running it on a spare nvme
Major_Olive7583@reddit
API price, any changes?
TinFoilHat_69@reddit
Someone should start vibe coding optimizations for older 24GB P100 cards, and maybe even older P40s or K80s. I'm pretty sure P100s support NVLink 1.0, which makes them an interesting choice at 85 dollars each
Sakatard@reddit
My p40 agrees with this idea
johnxreturn@reddit
Now on to test whether it’s benchmaxxed
GlossyCylinder@reddit
DS is one of the least benchmaxxed models out there lol.
Finanzamt_Endgegner@reddit
Doubt it, this seems not to be that good on the typical benchmaxxed benchmarks for its size, but I could be wrong ofc
mikumikubeeeeaaaammm@reddit
Imma die due to happiness oh my its thiccc
tassa-yoniso-manasi@reddit
gguf when?
I can't wait to run this at 0.00000000006884 t/s
popiazaza@reddit
Unsloth dynamic 3.0 gguf 0.1 bit when?
Raredisarray@reddit
Straight up lol
Finanzamt_Endgegner@reddit
Same
jakegh@reddit
Great pricing in the API. Not many gonna run a 1.6T param model locally, though.
They just waited too long; Opus 4.6/GPT-5.4 but open-source and cheaper won't shake the earth like R1 did. If they matched 4.7/5.5 that would be a different story.
SufficientPie@reddit
I'll always choose open source models over proprietary ones that scrape my open source code without following the license, sell the result back to me, then give all my data to people who want to hunt me down with autonomous weapons.
Middle_Bullfrog_6173@reddit
New hybrid attention + mHC. Is this supported in any inference software yet?
hdmcndog@reddit
vLLM has support for it.
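For reference, a sketch of what loading it through vLLM's Python API might look like once a release ships with it (the model id follows the HF pages above but is an assumption; the parallelism size is a placeholder for whatever your hardware needs):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # assumed repo id, per the HF collection
    tensor_parallel_size=8,                 # placeholder: enough GPUs for ~160 GB of weights
    max_model_len=131_072,
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```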
Iory1998@reddit
Man up guys and release a nano version.
Lopsided_Dot_4557@reddit
DeepSeek V4 Pro & Flash are HERE
and they just made every GPU cluster look overbuilt
🔹 1.6T parameter Pro + 284B Flash — both with 1M token context
🔹 27% of compute cost vs V3.2 — 10% of KV cache
🔹 Beats GPT-5.4 on Codeforces (3206 rating) — first open model to match closed frontier on code
🔹 New architecture: CSA + HCA + mHC + Muon optimizer — built different from the ground up
🔹 Fully open source — MIT license — run it yourself
Full breakdown video below 👇
https://youtu.be/Owzn47EBsow
HeavenBeach777@reddit
insane release
FoxiPanda@reddit
Guys, I can only test so many models per day.
KeikakuAccelerator@reddit
The goats!
marhalt@reddit
You can just feel the machine going 'wtf are you doing to me' when you are downloading it.
Lazy-Pattern-5171@reddit
Section 5.4.4 Code Agent in their report
To benchmark our coding agent capability, we curate tasks from real internal R&D workloads. We collect ~200 challenging tasks from 50+ internal engineers, spanning feature development, bug fixing, refactoring, and diagnostics across diverse technology stacks including PyTorch, CUDA, Rust, and C++. Each task is accompanied by its original repository, the corresponding execution environment, and human-annotated scoring rubrics; after rigorous quality filtering, 30 tasks are retained as the evaluation set. As shown in Table 8, DeepSeek-V4-Pro significantly outperforms Claude Sonnet 4.5 and approaches the level of Claude Opus 4.5.
(There’s a table in the middle with information that DeepSeek v4 pro reaches 67 where on the same benchmark Opus 4.6 reaches 80 and 4.5 reaches 70)
In a survey asking DeepSeek developers and researchers (N = 85) — all with experience of using DeepSeek-V4-Pro for agentic coding in their daily work — whether DeepSeek-V4-Pro is ready to serve as their default and primary coding model compared to other frontier models, 52% said yes, 39% leaned toward yes, and fewer than 9% said no. Respondents find DeepSeek-V4-Pro to deliver satisfactory results across most tasks, but note trivial mistakes, misinterpretation of vague prompts, and occasional over-thinking.
This sounds like the best “DeepSeek helped develop DeepSeek” moment for me and that’s amazing.
26YrVirgin@reddit
Multi-modal? Does it support image input?
MichaelXie4645@reddit (OP)
No, HF flags it as “text generation” not “image text to text”
Kahvana@reddit
Glad it's finally released.
larrytheevilbunnie@reddit
Finally
pmttyji@reddit
Most expected news! Finally!