Takeaways & discussion about the DeepSeek V4 architecture
Posted by benja0x40@reddit | LocalLLaMA | View on Reddit | 84 comments
Spent the morning looking at the V4 tech report. The benchmarks are getting deserved attention, but I think the architecture is also worth digging into.
Quick thoughts below to encourage feedback and discussions.
TL;DR
- Significant novelties compared to DeepSeek V3
- Hybrid attention: CSA (compressed sparse) + HCA (heavily compressed), instead of going pure MLA or involving SSM / Gated DeltaNet like Qwen3.5+, Mamba, etc.
- Manifold-Constrained Hyper-Connections replacing standard residuals (original mHC paper)
- FP4 QAT training at frontier scale
Hybrid attention
The CSA + HCA approach is interesting because it does not replace quadratic attention layers with linear ones. Instead, it performs attention on compressed (coarser grain) token streams, concatenated with sliding window attention tokens. This means that all layers remain attention-based, which is a novel direction compared to existing hybrid architectures.
Residual streams
Standard residual connections have been a largely untouched part of transformers. V4 uses manifold-constrained hyper-connections, which redesign how information flows between blocks. As far as I know DeepSeek is the only lab that has solved the training stability issues and is shipping this in production (happy to be corrected).
Realistically, almost nobody here will be able to run DeepSeek V4 locally. For that you'd need at least a cluster of the recently discontinued M3 Ultra 512GB, or an even more expensive NVIDIA setup.
V4-Flash and community distillations are where this release will probably get more interesting and accessible for local inference.
Would love to know what you think.
mineyevfan@reddit
Deterministic output as well, I don't think anyone else has done that in production.
Looz-Ashae@reddit
umh, you can achieve that with any LLM. Same seed, maximized/minimized top k, top p, same prompt = same outputs. LLMs are deterministic by design.
mineyevfan@reddit
From section 3.3
mineyevfan@reddit
No, floating point differences from accumulation order etc. cause nondeterminism
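A quick illustration of the accumulation-order point (plain Python, nothing model-specific):

```python
# Floating point addition is not associative, so the same values summed in a
# different order (e.g. different kernel tiling or batch size) can give
# slightly different results.
a, b, c = 0.1, 1e16, -1e16

left = (a + b) + c    # 0.1 is absorbed by the huge value and lost
right = a + (b + c)   # the huge values cancel first, so 0.1 survives

print(left)   # 0.0
print(right)  # 0.1
```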
mineyevfan@reddit
The paper linked in the OP contents goes into it, it's worth a read.
ThePixelHunter@reddit
As in, guaranteed reproducible output given the same inputs? I'm curious if this is accomplished at the model level, or in the infra/serving stack.
mineyevfan@reddit
benja0x40@reddit (OP)
Good catch!
dark-light92@reddit
The graph seems to indicate that they can fit 1M context in about 5GB. That's the biggest takeaway.
CryptoUsher@reddit
the v4 architecture seems to be introducing some significant changes, especially with the hybrid attention mechanism. what's the potential impact of manifold-constrained hyper-connections on the model's ability to generalize to unseen data, tbh?
benja0x40@reddit (OP)
Nice question, non-obvious answer.
The original mHC paper showed modest benchmark gains. The V4 report describes it as a third scaling axis alongside depth and width, decoupling the residual width from the actual hidden size, with minimal computational overhead. But their main point is that mHC enables stable training for configurations that would otherwise diverge.
Ironically, this is a circular argument since mHC solves instability introduced by hyper-connections themselves, whereas standard skip connections are stable by design.
I think the real promise of mHC is richer representation routing, with multiple residual streams carrying different "views" of each token through the layer stack. But the report doesn't provide evidence about associated gains.
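For intuition, here's a toy sketch of the general hyper-connections idea, based on my reading of the original paper rather than V4's exact mHC formulation (the class and parameter names are mine):

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Toy hyper-connection wrapper: n parallel residual streams per token."""
    def __init__(self, n_streams: int = 4):
        super().__init__()
        # How the n streams are combined into the block's input (uniform init).
        self.read = nn.Parameter(torch.ones(n_streams) / n_streams)
        # How the block output is written back into each stream.
        self.write = nn.Parameter(torch.ones(n_streams))
        # Stream-to-stream mixing; identity init reproduces plain residuals.
        self.mix = nn.Parameter(torch.eye(n_streams))

    def forward(self, streams, block):
        # streams: [n_streams, batch, seq, hidden]
        x = torch.einsum("n,nbsh->bsh", self.read, streams)   # read a single input
        y = block(x)                                           # ordinary attention/FFN block
        mixed = torch.einsum("nm,mbsh->nbsh", self.mix, streams)
        return mixed + self.write[:, None, None, None] * y    # distribute output back

# Usage: expand the residual to n copies after the embedding, replace every
# `x = x + block(x)` with `streams = hc(streams, block)`, and average the
# streams before the final norm. The "manifold-constrained" part would
# additionally restrict `mix`/`write` (e.g. to norm-preserving maps) so the
# extra routing doesn't destabilise training.
```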
CryptoUsher@reddit
tbh, that stability angle might be the unsung hero for tougher datasets
Lazy-Syllabub-3571@reddit
Ok_Warning2146@reddit
Wow. That's huge for a 1.6T model. In contrast, Kimi Linear, which is a 48B-A3B hybrid MLA transformer/delta-net architecture, uses 7.785GB for 1M. This advance basically makes the KV cache cost nothing. Hopefully, Kimi and Zhipu will steal this approach and make it 10M context.
This_Maintenance_834@reddit
now we are talking about RTX PRO 6000. two of them is 192GB. the model takes ~180GB. that leaves us ~12GB for 2 concurrent queries at 1M context length for KV cache plus CUDA graph consumption. this is actually local friendly
SignalCompetitive582@reddit
This is indeed the biggest takeaway! It now means that hosting any LLM is compute bound and no longer memory bound.
So, in theory, we should see way more AI Coding Plans that offer very generous subscription limits compared to what we’re used to.
The moment Zhipu introduces this novel approach into a GLM-6 for instance, it instantly becomes one of the best open source LLMs available.
It means that it is now economically viable to offer good prices to a lot of customers.
Long_comment_san@reddit
Well, it's a Kimi-class model, no shit nobody can run it at home!
"flash" (HILARIOUS naming) is the most interesting one to be completely honest.
Karyo_Ten@reddit
Why hilarious?
Flash models in the past have always been around 10~15B active params
Long_comment_san@reddit
Flash is something that's supposed to be fast, and calling a 300B-class model "flash" is like calling a truck "a supersonic fighter jet". Flash is like, I dunno, Qwen 3.6 35B A3B or Gemma 26B A4B - that's a perfect model to call "flash". It really is maybe 1% of the larger model at 100x its speed.
Karyo_Ten@reddit
10B / 13B active parameters is fast. Please read up on how MoE models work.
Case in point:
- MiMo-V2-Flash is 309B-A15B
- Step-3.5-Flash is 196B-A11B
Long_comment_san@reddit
GLM 4.7 Flash begs to differ.
petuman@reddit
Gemini 1.5 Flash-8B, back when Google themselves advertised the size.
Karyo_Ten@reddit
Not even third, there are others that are flash like Longcat Flash (560B) or Ring Flash (100B).
It's just the third one in the ~200-300B size class.
petuman@reddit
Longcat Flash seems like a misnomer with its 27B activation. Ring Flash indeed fits
Karyo_Ten@reddit
Ah fair point
Ok_Warning2146@reddit
You can "easily" run it on AMD Turin CPU that has 24*64GB=1536GB plus a RTX 6000 Pro. Since 1M context is now only 5GB VRAM, offloading experts to RAM should should give fairly good performance.
silenceimpaired@reddit
Huggingface seems to list the wrong parameters for it...at first I was like alright! Now I'm like alright... How to run this.
dobkeratops@reddit
High end prosumer tier with a 256GB Mac Studio or 2x DGX Spark could run the 284B version? Might that even squeeze into one DGX Spark at Q3?
Long_comment_san@reddit
I obviously meant the pro model, flash can probably squeeze in 24+256
sagotchy@reddit
I just asked DeepSeek V4 whether the DeepSeek V4 model has engram built in. It searched online and then hallucinated that the answer is yes. Not a very good first impression LOL. I checked the sources and they're talking about anticipated features, not actually implemented features.
No-Fig-8614@reddit
Is it fair to say that since B300s really excel at 4-bit compute, this will be amazing for them?
KPaleiro@reddit
Where is engram? I was excited to see this novel transformer architecture in v4... maybe they are holding it for the definitive version of deepseek v4, since this is a preview...
DerDave@reddit
It's unlikely the jump from preview to the final V4 will include a huge architectural change like Engram; it would need training from scratch. I'm afraid we'll only see Engram in V5 or another model.
amozu16@reddit
You're probably right but I hated reading that so boo boooooooo /j
insanemal@reddit
Engram is in V4.
Did you not read the announcement?
https://deepseek.ai/deepseek-v4
torytyler@reddit
this site isn't affiliated with deepseek. it says in multiple places "Deepseek.ai is an independent website and is not affiliated with, sponsored by, or endorsed by Hangzhou DeepSeek Artificial Intelligence Co., Ltd."
-dysangel-@reddit
From https://api-docs.deepseek.com/news/news260424
Token-wise compression does sound like it could be engram, or at least related to engram? I think it would actually be way more flexible/useful to build dynamic engrams per conversation, rather than just be stuck with a fixed list of engrams, so if that's what they're doing this is going to be amazing.
oxygen_addiction@reddit
Longcat Flash-Lite uses ngram and there is still no support in llama.cpp
https://github.com/ggml-org/llama.cpp/pull/19167
Aaaaaaaaaeeeee@reddit
(I might have tried to answer you or someone else who asked about ngram, and given a wrong answer about Engram instead.) But basically Engram and Longcat's ngram are different: only DeepSeek's Engram does disk inference. Longcat's ngram is a large vocabulary layer, which comes with improved training speed and less compute for inference.
But it doesn't have the RAM savings and disk-inference characteristics.
Taenk@reddit
Engram not making it into v4 would surprise me, but IIRC they do sometimes publish previews that aren't architecturally final relative to the "non-preview" published version.
Infrared12@reddit
Apologies, but what's engram exactly folks?
pmttyji@reddit
https://github.com/deepseek-ai/Engram
Infrared12@reddit
Thanks!
Gleethos@reddit
Yes! I was also hoping for it to have engram!
insanemal@reddit
engram is there
https://deepseek.ai/deepseek-v4
jesus_fucking_marry@reddit
This is not their official website.
insanemal@reddit
Well fuck. My apologies
pmttyji@reddit
Saw this
marhalt@reddit
Why wouldn't we be able to run it locally? It's still a mixture of experts, no? A ton of RAM + a few RTX 6000 Pros, and it's doable if slow, no? Or 2 M3 Ultras.
Mass2018@reddit
Should we normalize spending as much on our home servers as people spend on their toy sports cars that rarely leave the garage?
"My mortgage is $3500, my car payment is $1000, and my DGX H100 payment is $2850."
boutell@reddit
Hmm. Business idea: "the latest bragworthy home AI server as a service." Pick your tier ("oh cool, oh wow, or oh HOLY SHIT") and pay $500, $1,000 or $1,500 per month. Periodically they ship you the latest one. You sync some keys between them and ship the previous one back to be sold on to somebody in a lesser tier. You don't fuss with installing and configuring different models because it's all been pregamed for you to deliver at, of course, a currently bragworthy level. You just keep hitting that same API over tailscale and it keeps delivering at whatever is currently a "oh wow dude, you have this in your house? I mean it's not Claude Opus but" level
Raredisarray@reddit
lol love this!!!
LetterRip@reddit
Move to a cool climate and use it as a heater.
LoveMind_AI@reddit
...honestly, maybe.
Veearrsix@reddit
I’m completely convinced we’ll see advancements in models that will let them run on local hardware better, but that logical part of my brain is definitely giving way to the urge to spend money on insane hardware.
DeviantPlayeer@reddit
Considering how fast it's moving, it's one of two options:
1. Build your own server using decommissioned GPUs
2. Rent a server and run your own models
boutell@reddit
Is anybody renting GPUs at anything approaching a reasonable price?
FullOf_Bad_Ideas@reddit
yes I am renting a bunch of RTX 4090s, 5090s and RTX 6000 Pros from Vast.ai at reasonable prices, even when I have local 8x 3090 ti setup. Sometimes I want to do 2 things at once or I OOM on 3090 Ti.
boutell@reddit
Interesting. So this should be acting as a limiting factor on the cloud price of any model that could feasibly be hosted at home or by renting a whole GPU.
This_Maintenance_834@reddit
rent cost is about the same as 2-year financing. is that reasonable?
boutell@reddit
Sure that sounds reasonable. I had seen numbers that made my hair stand on end, but GPUs are, in fact, expensive...
Ryoonya@reddit
There are plenty of people like that, compared to racing/rally and other expensive hobbies, this isn't that bad.
The average barely scraping by individual was never able to participate in those. This is an enthusiast hobby after all.
-p-e-w-@reddit
Racing? Hell, tennis can easily cost 5000 bucks per year. Skiing can cost twice that. And I'm talking about higher-level amateurs, not semi-pros or pros.
buecker02@reddit
Really should not be comparing sports cars to computer equipment. The computer equipment is a depreciating asset. Classic sports cars are literally the opposite.
Of course, there is an argument to make with toys but generally, the computer equipment will always be the depreciating asset.
MDSExpro@reddit
Yet. The alternative is spending more on a cloud-based service that offers less while owning your data.
ikkiho@reddit
The FP4 QAT at frontier scale is the part I keep coming back to. Doubling effective compute vs FP8 on Blackwell only matters if loss curves stay clean, which usually means stochastic rounding plus per-tensor scales tuned during training, not just at conversion. If the report actually shows no quality gap vs an FP8 baseline, that's a bigger story long-term than the attention changes.
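For reference, a rough sketch of what fake-quant FP4 QAT with stochastic rounding and a per-tensor scale could look like; this is a generic illustration, not DeepSeek's recipe, and the grid/scaling choices are mine:

```python
import torch

# Representable values of an E2M1-style FP4 format, mirrored for negatives.
FP4_GRID = torch.tensor([-6., -4., -3., -2., -1.5, -1., -0.5, 0.,
                          0.5, 1., 1.5, 2., 3., 4., 6.])

def fake_quant_fp4(w: torch.Tensor) -> torch.Tensor:
    # Per-tensor scale so the largest weight lands on the largest FP4 value.
    scale = w.abs().max().clamp_min(1e-12) / 6.0
    x = (w / scale).clamp(-6.0, 6.0)

    # Find the two neighbouring grid points around each value.
    idx_hi = torch.searchsorted(FP4_GRID, x).clamp(1, len(FP4_GRID) - 1)
    lo, hi = FP4_GRID[idx_hi - 1], FP4_GRID[idx_hi]

    # Stochastic rounding: round up with probability proportional to proximity,
    # so the quantization error is unbiased in expectation.
    p_up = (x - lo) / (hi - lo)
    q = torch.where(torch.rand_like(x) < p_up, hi, lo)

    # Straight-through estimator: forward pass sees quantized weights,
    # backward pass sees the identity, so the FP master weights keep training.
    return w + (q * scale - w).detach()
```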
rulerofthehell@reddit
Wish it had engram enabled
Mochila-Mochila@reddit
What's your opinion on the technique they used to dramatically reduce the context's memory footprint?
Long_comment_san@reddit
Yeah, they're saying a 10x reduction over DeepSeek V3; sounds like some variant of their own turbo/rotorquant, unless they made some sort of internal discovery.
That's actually one of the reasons this model is mind-blowing, if it's not a rotor/turboquant variant. And I doubt it is; DS V4 must have been in training long before we got these new techniques.
benja0x40@reddit (OP)
Completely different from TurboQuant. The savings come from the attention mechanisms themselves, an architectural improvement rather than a quantisation technique.
TurboQuant operates on the numerical values of KV entries, and the good news is it can be applied on top of V4's architecture!
Mochila-Mochila@reddit
It'd be very interesting to compare the efficiency and trade-offs of both approaches. Also, whether they could be combined to some extent.
benja0x40@reddit (OP)
My understanding of the attention section (5 pages long in the tech report, much of it requiring deeper maths background than I currently have):
V4 uses two new attention types, CSA and HCA interleaved across layers. Both share a common skeleton with queries attending to (a) sliding-window KV entries for local dependencies at full resolution, and (b) a compressed global KV set (low resolution long range associations). The final operation is a shared Key-Value Multi-Query Attention over that combined KV set, including an attention sink mechanism.
CSA compresses every 𝑚 tokens into one KV entry, then uses an indexer + top-k selector (DeepSeek Sparse Attention) to pick which compressed entries the query actually attends to.
HCA uses similar compression logic but with a much larger ratio (𝑚′ ≫ 𝑚, e.g. 128 vs 4 in V4-Pro) and drops sparse selection entirely. Since the compressed sequence is already short, queries can attend to all compressed entries directly.
With this architecture, attention cost scales with the compressed sequence length rather than the raw one. The reduction is moderate in CSA layers, though further cut down by top-k selection, and drastic in HCA layers.
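To make that concrete, here's a runnable toy of my reading of the shared skeleton (single decoding step, one shared KV head, mean pooling as a stand-in for the learned compressor, attention sink omitted; all helper names are mine):

```python
import torch
import torch.nn.functional as F

def compress_kv(k, v, ratio):
    # Pool every `ratio` tokens into one coarse KV entry (mean pooling here;
    # the real compressor is presumably learned).
    t = (k.shape[0] // ratio) * ratio
    k_c = k[:t].reshape(-1, ratio, k.shape[-1]).mean(dim=1)
    v_c = v[:t].reshape(-1, ratio, v.shape[-1]).mean(dim=1)
    return k_c, v_c

def hybrid_attn_step(q, k, v, layer_type, window=128, m=4, m_prime=128, k_top=1024):
    # q: [d], k/v: [seq, d] -- one decoding step, one shared KV head (MQA-style).
    k_local, v_local = k[-window:], v[-window:]        # (a) full-resolution local branch

    ratio = m if layer_type == "CSA" else m_prime      # (b) compressed global branch
    k_c, v_c = compress_kv(k, v, ratio)
    if layer_type == "CSA" and k_c.shape[0] > k_top:   # sparse top-k selection (DSA-style)
        idx = (k_c @ q).topk(k_top).indices
        k_c, v_c = k_c[idx], v_c[idx]

    k_all = torch.cat([k_c, k_local])                  # attend jointly over both branches
    v_all = torch.cat([v_c, v_local])
    scores = (k_all @ q) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=0) @ v_all

# e.g. in an HCA layer a 1M-token sequence shrinks to ~8k compressed entries:
# hybrid_attn_step(torch.randn(64), torch.randn(1_000_000, 64),
#                  torch.randn(1_000_000, 64), "HCA")
```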
How this approach stands against competing hybrid architectures (SSM, Gated DeltaNet, local-global splits) remains to be evaluated.
What's certain is that these are not variations of existing attention kernels, so inference backends like vLLM and llama.cpp will need several new implementations to support DeepSeek V4.
I think this report shows how much design and engineering work went into making V4 trainable and production-ready. And the authors themselves mention architectural simplification as future work.
Mochila-Mochila@reddit
Thank you, so it appears that the devs have implemented several stages of compression in order to achieve this result.
benja0x40@reddit (OP)
Yes. A few more details.
For both V4-Flash and V4-Pro, CSA layers use 1/4 compression plus query-dependent top-1024 KV selection. HCA layers use 1/128 compression, meaning a 1M token sequence is reduced to under 10k KV entries.
The sliding window attention operates on 128 local KV entries (~10 medium-length sentences), so each layer's sliding window branch performs full-resolution attention over paragraph-sized chunks.
The final stage performs a shared Key-Value Multi-Query Attention (!) over the concatenation of the sliding window KV entries, the CSA/HCA KV entries, and the query stream.
Here the "shared Key-Value" part means K and V projections are shared across query heads (MQA), which keeps the KV cache manageable at long contexts.
Thanks to the concatenation occurring before this final attention stage, in each layer the query stream attends jointly over full-resolution local KV entries and the compressed global KV set (CSA or HCA, alternating across layers).
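Back-of-the-envelope counts for a 1M-token context using the ratios above (my own arithmetic, just to make the scale obvious):

```python
seq = 1_000_000
window = 128                  # full-resolution local entries per query

csa_entries = seq // 4        # 250,000 compressed entries in a CSA layer...
csa_attended = 1024           # ...of which only the top-1024 are attended per query

hca_entries = seq // 128      # 7,812 compressed entries in an HCA layer, all attended

print(csa_entries, csa_attended, hca_entries)
```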
AnomalyNexus@reddit
Seems to work well as a second-opinion model for coding too.
I.e. if you have something made by another model, having DS4 Pro look at it as well seems to yield improvements.
Few_Water_1457@reddit
above all, we need to see when we'll get GGUFs, considering the number of changes that need to be made to implement the new attention
pmttyji@reddit
I don't see any PRs on llama.cpp as of now.
Looks like both vLLM & SGLang are already able to run these models.
ResidentPositive4122@reddit
Interesting, I checked openrouter just now and there are no other providers listed (not even for flash), other than ds themselves. I was curious what would be the price-point where 3rd party providers find it profitable to serve.
pmttyji@reddit
Ticket created on llama.cpp
https://github.com/ggml-org/llama.cpp/issues/22319
redpandafire@reddit
I mean I found this post useful. I don’t always have the time to read the full paper while getting ready for work. But I’ll read it later. Surprised (not surprised?) to see the comment section is just a big brawl of people fighting each other.
adeadfetus@reddit
It’s Reddit
ggone20@reddit
All the testing I’ve seen so far show this is garbage. Just benchmaxxed to get press/hype. Literally Qwen 27b trounces it
segmond@reddit
what do I think? low quality post. anyone who can understand your post will read the paper, and if in a rush, throw it into a model and get a better summary than you gave us.
with that said, we will run DeepSeek V4 locally. If they can run it in the cloud, we can run it locally. Nothing will stop us. I remember when folks thought running 70B models locally was impossible. ... and for a moment, it kinda was and felt like that.
leonbollerup@reddit
I feel stupid now.. so.. ya.. thanx for that :D