Takeaways & discussion about the DeepSeek V4 architecture
Posted by benja0x40@reddit | LocalLLaMA | View on Reddit | 84 comments
Spent the morning looking at the V4 tech report. The benchmarks are getting deserved attention, but I think the architecture is also worth digging into.
Quick thoughts below to encourage feedback and discussions.
TL;DR
- Significant novelties compared to DeepSeek V3
- Hybrid attention: CSA (compressed sparse) + HCA (heavily compressed), instead of going pure MLA or involving SSM / Gated DeltaNet like Qwen3.5+, Mamba, etc.
- Manifold-Constrained Hyper-Connections replacing standard residuals (original mHC paper)
- FP4 QAT training at frontier scale
Hybrid attention
The CSA + HCA approach is interesting because it does not replace quadratic attention layers with linear ones. Instead, it performs attention on compressed (coarser grain) token streams, concatenated with sliding window attention tokens. This means that all layers remain attention-based, which is a novel direction compared to existing hybrid architectures.
Residual streams
Standard residual connections have been a largely untouched part of transformers. V4 uses manifold-constrained hyper-connections, which redesign how information flows between blocks. As far as I know DeepSeek is the only lab that has solved the training stability issues and is shipping this in production (happy to be corrected).
Realistically, almost nobody here will be able to run DeepSeek V4 locally. For that you'd need at least a cluster of the recently discontinued M3 Ultra 512GB, or an even more expensive NVIDIA setup.
V4-Flash and community distillations are where this release will probably get more interesting and accessible for local inference.
Would love to know what you think.
mineyevfan@reddit
Deterministic output as well, I don't think anyone else has done that in production.
Looz-Ashae@reddit
umh, you can achieve that with any LLM. Same seed, maximized/minimized top k, top p, same prompt = same outputs. LLMs are deterministic by design.
mineyevfan@reddit
From section 3.3
mineyevfan@reddit
No, floating point differences from accumulation order etc. cause nondeterminism
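A quick illustration of the accumulation-order point (plain Python, nothing model-specific):

```python
# Floating point addition is not associative, so the same values summed in a
# different order (e.g. different kernel tiling or batch size) can give
# slightly different results.
a, b, c = 0.1, 1e16, -1e16

left = (a + b) + c    # 0.1 is absorbed by the huge value and lost
right = a + (b + c)   # the huge values cancel first, so 0.1 survives

print(left)   # 0.0
print(right)  # 0.1
```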
mineyevfan@reddit
The paper linked in the OP contents goes into it, it's worth a read.
ThePixelHunter@reddit
As in, guaranteed reproducible output given the same inputs? I'm curious if this is accomplished at the model level, or in the infra/serving stack.
mineyevfan@reddit
benja0x40@reddit (OP)
Good catch!
dark-light92@reddit
The graph seems to indicate that they can fit 1M context in about 5GB. That's the biggest takeaway.
CryptoUsher@reddit
the v4 architecture seems to be introducing some significant changes, especially with the hybrid attention mechanism. what's the potential impact of manifold-constrained hyper-connections on the model's ability to generalize to unseen data, tbh?
benja0x40@reddit (OP)
Nice question, non-obvious answer.
The original mHC paper showed modest benchmark gains. The V4 report describes it as a third scaling axis alongside depth and width, decoupling the residual width from the actual hidden size, with minimal computational overhead. But their main point is that mHC enables stable training for configurations that would otherwise diverge.
Ironically, this is a circular argument since mHC solves instability introduced by hyper-connections themselves, whereas standard skip connections are stable by design.
I think the real promise of mHC is richer representation routing, with multiple residual streams carrying different "views" of each token through the layer stack. But the report doesn't provide evidence about associated gains.
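For intuition, here's a toy sketch of the general hyper-connections idea, based on my reading of the original paper rather than V4's exact mHC formulation (the class and parameter names are mine):

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Toy hyper-connection wrapper: n parallel residual streams per token."""
    def __init__(self, n_streams: int = 4):
        super().__init__()
        # How the n streams are combined into the block's input (uniform init).
        self.read = nn.Parameter(torch.ones(n_streams) / n_streams)
        # How the block output is written back into each stream.
        self.write = nn.Parameter(torch.ones(n_streams))
        # Stream-to-stream mixing; identity init reproduces plain residuals.
        self.mix = nn.Parameter(torch.eye(n_streams))

    def forward(self, streams, block):
        # streams: [n_streams, batch, seq, hidden]
        x = torch.einsum("n,nbsh->bsh", self.read, streams)   # read a single input
        y = block(x)                                           # ordinary attention/FFN block
        mixed = torch.einsum("nm,mbsh->nbsh", self.mix, streams)
        return mixed + self.write[:, None, None, None] * y    # distribute output back

# Usage: expand the residual to n copies after the embedding, replace every
# `x = x + block(x)` with `streams = hc(streams, block)`, and average the
# streams before the final norm. The "manifold-constrained" part would
# additionally restrict `mix`/`write` (e.g. to norm-preserving maps) so the
# extra routing doesn't destabilise training.
```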
CryptoUsher@reddit
tbh, that stability angle might be the unsung hero for tougher datasets
Lazy-Syllabub-3571@reddit
Ok_Warning2146@reddit
Wow. That's huge for a 1.6T model. In contrast, Kimi Linear, which is a 48B-A3B hybrid MLA transformer/delta-net architecture, uses 7.785GB for 1M. This advance basically makes the KV cache cost nothing. Hopefully, Kimi and Zhipu will steal this approach and make it 10M context.
This_Maintenance_834@reddit
now we are talking about RTX PRO 6000. two of them is 192GB. the model takes ~180GB. that leaves us ~12GB for 2 concurrent queries at 1M context length for KV cache plus CUDA graph consumption. this is actually local friendly
SignalCompetitive582@reddit
This is indeed the biggest takeaway! It now means that hosting any LLM is compute bound and no longer memory bound.
So, in theory, we should see way more AI Coding Plans that offer very generous subscription limits compared to what we’re used to.
The moment Zhipu introduces this novel approach into a GLM-6 for instance, it instantly becomes one of the best open source LLMs available.
It means that it is now economically viable to offer good prices to a lot of customers.
Long_comment_san@reddit
Well, it's a Kimi-class model, no shit nobody can run it at home!
"flash" (HILARIOUS naming) is the most interesting one to be completely honest.
Karyo_Ten@reddit
Why hilarious?
Flash models in the past have always been around 10~15B active params
Long_comment_san@reddit
Flash is something that's supposed to be fast, and calling a 300B-class model "flash" is like calling a truck "a supersonic fighter jet". Flash is like, I dunno, Qwen 3.6 35B A3B or Gemma 26B A4B - that's a perfect model to call "flash". It really is maybe 1% of the larger model at 100x its speed.
Karyo_Ten@reddit
10B / 13B active parameters is fast. Please read up on how MoE models work.
Case in point:
- MiMo-V2-Flash is 309B-A15B
- Step-3.5-Flash is 196B-A11B
Long_comment_san@reddit
GLM 4.7 Flash begs to differ.
petuman@reddit
Gemini 1.5 Flash-8B, back when Google themselves advertised the size.
Karyo_Ten@reddit
Not even third, there are others that are flash like Longcat Flash (560B) or Ring Flash (100B).
It's just the third one in the ~200-300B size class.
petuman@reddit
Longcat Flash seems like a misnomer with its 27B activation. Ring Flash indeed fits
Karyo_Ten@reddit
Ah fair point
Ok_Warning2146@reddit
You can "easily" run it on AMD Turin CPU that has 24*64GB=1536GB plus a RTX 6000 Pro. Since 1M context is now only 5GB VRAM, offloading experts to RAM should should give fairly good performance.
silenceimpaired@reddit
Huggingface seems to list the wrong parameters for it...at first I was like alright! Now I'm like alright... How to run this.
dobkeratops@reddit
High end prosumer tier with a 256GB Mac Studio or 2x DGX Spark could run the 284B version? Might that even squeeze into one DGX Spark at Q3?
Long_comment_san@reddit
I obviously meant the pro model, flash can probably squeeze in 24+256
sagotchy@reddit
I just asked DeepSeek V4 whether the DeepSeek V4 model has engram built in. It searched online and then hallucinated that the answer is yes. Not a very good first impression LOL. I checked the sources and they're talking about anticipated features, not actually implemented features.
No-Fig-8614@reddit
Is it fair to say that since B300s really excel at 4-bit compute, this will be amazing for them?
KPaleiro@reddit
Where is engram? I was excited to see this novel transformer architecture in v4... maybe they are holding it for the definitive version of deepseek v4, since this is a preview...
DerDave@reddit
It's unlikely the jump from preview to the final V4 will include a huge architectural change like Engram; it would need training from scratch. I'm afraid we'll only see Engram in V5 or another model.
amozu16@reddit
You're probably right but I hated reading that so boo boooooooo /j
insanemal@reddit
Engram is in V4.
Did you not read the announcement?
https://deepseek.ai/deepseek-v4
torytyler@reddit
this site isn't affiliated with deepseek. it says in multiple places "Deepseek.ai is an independent website and is not affiliated with, sponsored by, or endorsed by Hangzhou DeepSeek Artificial Intelligence Co., Ltd."
-dysangel-@reddit
From https://api-docs.deepseek.com/news/news260424
Token-wise compression does sound like it could be engram, or at least related to engram? I think it would actually be way more flexible/useful to build dynamic engrams per conversation, rather than just be stuck with a fixed list of engrams, so if that's what they're doing this is going to be amazing.
oxygen_addiction@reddit
Longcat Flash-Lite uses ngram and there is still no support in llama.cpp
https://github.com/ggml-org/llama.cpp/pull/19167
Aaaaaaaaaeeeee@reddit
(I might have tried to answer you or someone else who asked about ngram, and given a wrong answer about Engram instead.) But basically Engram and Longcat's ngram are different: only DeepSeek's Engram does disk inference. Longcat's ngram is a large vocabulary layer, which comes with improved training speed and less compute for inference.
But it doesn't have the RAM savings and disk-inference characteristics.
Taenk@reddit
Engram not making it into v4 would surprise me, but IIRC they do sometimes publish previews that aren't architecturally final relative to the "non-preview" published version.
Infrared12@reddit
Apologies, but what's engram exactly folks?
pmttyji@reddit
https://github.com/deepseek-ai/Engram
Infrared12@reddit
Thanks!
Gleethos@reddit
Yes! I was also hoping for it to have engram!
insanemal@reddit
engram is there
https://deepseek.ai/deepseek-v4
jesus_fucking_marry@reddit
This is not their official website.
insanemal@reddit
Well fuck. My apologies
pmttyji@reddit
Saw this
marhalt@reddit
Why wouldn't we be able to run it locally? It's still a mixture of experts, no? A ton of RAM + a few RTX 6000 Pros, and it's doable if slow, no? Or 2 M3 Ultras.
Mass2018@reddit
Should we normalize spending as much on our home servers as people spend on their toy sports cars that rarely leave the garage?
"My mortgage is $3500, my car payment is $1000, and my DGX H100 payment is $2850."
boutell@reddit
Hmm. Business idea: "the latest bragworthy home AI server as a service." Pick your tier ("oh cool, oh wow, or oh HOLY SHIT") and pay $500, $1,000 or $1,500 per month. Periodically they ship you the latest one. You sync some keys between them and ship the previous one back to be sold on to somebody in a lesser tier. You don't fuss with installing and configuring different models because it's all been pregamed for you to deliver at, of course, a currently bragworthy level. You just keep hitting that same API over tailscale and it keeps delivering at whatever is currently a "oh wow dude, you have this in your house? I mean it's not Claude Opus but" level
Raredisarray@reddit
lol love this!!!
LetterRip@reddit
Move to a cool climate and use it as a heater.
LoveMind_AI@reddit
...honestly, maybe.
Veearrsix@reddit
I’m completely convinced we’ll see advancements in models that will let them run on local hardware better, but that logical part of my brain is definitely giving way to the urge to spend money on insane hardware.
DeviantPlayeer@reddit
Considering how fast it's moving, it's one of two options:
1. Build your own server using decommissioned GPUs
2. Rent a server and run your own models
boutell@reddit
Is anybody renting GPUs at anything approaching a reasonable price?
FullOf_Bad_Ideas@reddit
yes I am renting a bunch of RTX 4090s, 5090s and RTX 6000 Pros from Vast.ai at reasonable prices, even when I have local 8x 3090 ti setup. Sometimes I want to do 2 things at once or I OOM on 3090 Ti.
boutell@reddit
Interesting. So this should be acting as a limiting factor on the cloud price of any model that could feasibly be hosted at home or by renting a whole GPU.
This_Maintenance_834@reddit
rent cost is about the same as 2-year financing. is that reasonable?
boutell@reddit
Sure that sounds reasonable. I had seen numbers that made my hair stand on end, but GPUs are, in fact, expensive...
Ryoonya@reddit
There are plenty of people like that, compared to racing/rally and other expensive hobbies, this isn't that bad.
The average barely scraping by individual was never able to participate in those. This is an enthusiast hobby after all.
-p-e-w-@reddit
Racing? Hell, tennis can easily cost 5000 bucks per year. Skiing can cost twice that. And I'm talking about higher-level amateurs, not semi-pros or pros.
buecker02@reddit
Really should not be comparing sports cars to computer equipment. The computer equipment is a depreciating asset. Classic sports cars are literally the opposite.
Of course, there is an argument to make with toys but generally, the computer equipment will always be the depreciating asset.
MDSExpro@reddit
Yet. The alternative is spending more on a cloud-based service that offers less while owning your data.
ikkiho@reddit
The FP4 QAT at frontier scale is the part I keep coming back to. Doubling effective compute vs FP8 on Blackwell only matters if loss curves stay clean, which usually means stochastic rounding plus per-tensor scales tuned during training, not just at conversion. If the report actually shows no quality gap vs an FP8 baseline, that's a bigger story long-term than the attention changes.
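For reference, a rough sketch of what fake-quant FP4 QAT with stochastic rounding and a per-tensor scale could look like; this is a generic illustration, not DeepSeek's recipe, and the grid/scaling choices are mine:

```python
import torch

# Representable values of an E2M1-style FP4 format, mirrored for negatives.
FP4_GRID = torch.tensor([-6., -4., -3., -2., -1.5, -1., -0.5, 0.,
                          0.5, 1., 1.5, 2., 3., 4., 6.])

def fake_quant_fp4(w: torch.Tensor) -> torch.Tensor:
    # Per-tensor scale so the largest weight lands on the largest FP4 value.
    scale = w.abs().max().clamp_min(1e-12) / 6.0
    x = (w / scale).clamp(-6.0, 6.0)

    # Find the two neighbouring grid points around each value.
    idx_hi = torch.searchsorted(FP4_GRID, x).clamp(1, len(FP4_GRID) - 1)
    lo, hi = FP4_GRID[idx_hi - 1], FP4_GRID[idx_hi]

    # Stochastic rounding: round up with probability proportional to proximity,
    # so the quantization error is unbiased in expectation.
    p_up = (x - lo) / (hi - lo)
    q = torch.where(torch.rand_like(x) < p_up, hi, lo)

    # Straight-through estimator: forward pass sees quantized weights,
    # backward pass sees the identity, so the FP master weights keep training.
    return w + (q * scale - w).detach()
```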
rulerofthehell@reddit
Wish it had engram enabled
Mochila-Mochila@reddit
What's your opinion on the technique they used to dramatically reduce the context's memory footprint?
Long_comment_san@reddit
Yeah, they're saying a 10x reduction over DeepSeek V3; sounds like some variant of their own turbo/rotorquant, unless they made some sort of internal discovery.
That's actually one of the reasons this model is mind-blowing, if it's not a rotor/turboquant variant. And I doubt it is; DS V4 must have been in training long before we got these new techniques.
benja0x40@reddit (OP)
Completely different from TurboQuant. The savings come from the attention mechanisms themselves, an architectural improvement rather than a quantisation technique.
TurboQuant operates on the numerical values of KV entries, and the good news is it can be applied on top of V4's architecture!
Mochila-Mochila@reddit
It'd be very interesting to compare the efficiency and trade-offs of both approaches. Also, whether they could be combined to some extent.
benja0x40@reddit (OP)
My understanding of the attention section (5 pages long in the tech report, much of it requiring deeper maths background than I currently have):
V4 uses two new attention types, CSA and HCA interleaved across layers. Both share a common skeleton with queries attending to (a) sliding-window KV entries for local dependencies at full resolution, and (b) a compressed global KV set (low resolution long range associations). The final operation is a shared Key-Value Multi-Query Attention over that combined KV set, including an attention sink mechanism.
CSA compresses every 𝑚 tokens into one KV entry, then uses an indexer + top-k selector (DeepSeek Sparse Attention) to pick which compressed entries the query actually attends to.
HCA uses similar compression logic but with a much larger ratio (𝑚′ ≫ 𝑚, e.g. 128 vs 4 in V4-Pro) and drops sparse selection entirely. Since the compressed sequence is already short, queries can attend to all compressed entries directly.
With this architecture, attention cost scales with the compressed sequence length rather than the raw one. The reduction is moderate in CSA layers, though further cut down by top-k selection, and drastic in HCA layers.
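To make that concrete, here's a runnable toy of my reading of the shared skeleton (single decoding step, one shared KV head, mean pooling as a stand-in for the learned compressor, attention sink omitted; all helper names are mine):

```python
import torch
import torch.nn.functional as F

def compress_kv(k, v, ratio):
    # Pool every `ratio` tokens into one coarse KV entry (mean pooling here;
    # the real compressor is presumably learned).
    t = (k.shape[0] // ratio) * ratio
    k_c = k[:t].reshape(-1, ratio, k.shape[-1]).mean(dim=1)
    v_c = v[:t].reshape(-1, ratio, v.shape[-1]).mean(dim=1)
    return k_c, v_c

def hybrid_attn_step(q, k, v, layer_type, window=128, m=4, m_prime=128, k_top=1024):
    # q: [d], k/v: [seq, d] -- one decoding step, one shared KV head (MQA-style).
    k_local, v_local = k[-window:], v[-window:]        # (a) full-resolution local branch

    ratio = m if layer_type == "CSA" else m_prime      # (b) compressed global branch
    k_c, v_c = compress_kv(k, v, ratio)
    if layer_type == "CSA" and k_c.shape[0] > k_top:   # sparse top-k selection (DSA-style)
        idx = (k_c @ q).topk(k_top).indices
        k_c, v_c = k_c[idx], v_c[idx]

    k_all = torch.cat([k_c, k_local])                  # attend jointly over both branches
    v_all = torch.cat([v_c, v_local])
    scores = (k_all @ q) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=0) @ v_all

# e.g. in an HCA layer a 1M-token sequence shrinks to ~8k compressed entries:
# hybrid_attn_step(torch.randn(64), torch.randn(1_000_000, 64),
#                  torch.randn(1_000_000, 64), "HCA")
```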
How this approach stands against competing hybrid architectures (SSM, Gated DeltaNet, local-global splits) remains to be evaluated.
What's certain is that these are not variations of existing attention kernels, so inference backends like vLLM and llama.cpp will need several new implementations to support DeepSeek V4.
I think this report shows how much design and engineering work went into making V4 trainable and production-ready. And the authors themselves mention architectural simplification as future work.
Mochila-Mochila@reddit
Thank you, so it appears that the devs have implemented several stages of compression in order to achieve this result.
benja0x40@reddit (OP)
Yes. A few more details.
For both V4-Flash and V4-Pro, CSA layers use 1/4 compression plus query-dependent top-1024 KV selection. HCA layers use 1/128 compression, meaning a 1M token sequence is reduced to under 10k KV entries.
The sliding window attention operates on 128 local KV entries (~10 medium-length sentences), so each layer's sliding window branch performs full-resolution attention over paragraph-sized chunks.
The final stage performs a shared Key-Value Multi-Query Attention (!) over the concatenation of the sliding window KV entries, the CSA/HCA KV entries, and the query stream.
Here the "shared Key-Value" part means K and V projections are shared across query heads (MQA), which keeps the KV cache manageable at long contexts.
Thanks to the concatenation occurring before this final attention stage, in each layer the query stream attends jointly over full-resolution local KV entries and the compressed global KV set (CSA or HCA, alternating across layers).
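Back-of-the-envelope counts for a 1M-token context using the ratios above (my own arithmetic, just to make the scale obvious):

```python
seq = 1_000_000
window = 128                  # full-resolution local entries per query

csa_entries = seq // 4        # 250,000 compressed entries in a CSA layer...
csa_attended = 1024           # ...of which only the top-1024 are attended per query

hca_entries = seq // 128      # 7,812 compressed entries in an HCA layer, all attended

print(csa_entries, csa_attended, hca_entries)
```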
AnomalyNexus@reddit
Seems to work well as a second-opinion model for coding too.
I.e. if you have something made by another model, having DS4 Pro look at it as well seems to yield improvements.
Few_Water_1457@reddit
above all, we need to see when we'll get GGUFs, considering the number of changes that need to be made to implement the new attention
pmttyji@reddit
I don't see any PRs on llama.cpp as of now.
Looks like both vLLM & SGLang are already able to run these models.
ResidentPositive4122@reddit
Interesting, I checked openrouter just now and there are no other providers listed (not even for flash), other than ds themselves. I was curious what would be the price-point where 3rd party providers find it profitable to serve.
pmttyji@reddit
Ticket created on llama.cpp
https://github.com/ggml-org/llama.cpp/issues/22319
redpandafire@reddit
I mean I found this post useful. I don’t always have the time to read the full paper while getting ready for work. But I’ll read it later. Surprised (not surprised?) to see the comment section is just a big brawl of people fighting each other.
adeadfetus@reddit
It’s Reddit
ggone20@reddit
All the testing I’ve seen so far show this is garbage. Just benchmaxxed to get press/hype. Literally Qwen 27b trounces it
segmond@reddit
what do I think? low quality post. anyone who can understand your post will read the paper, and if in a rush, throw it into a model and get a better summary than you gave us.
with that said, we will run DeepSeek V4 locally. If they can run it in the cloud, we can run it locally. Nothing will stop us. I remember when folks thought running 70B models locally was impossible. ... and for a moment, it kinda was and felt like that.
leonbollerup@reddit
I feel stupid now.. so.. ya.. thanx for that :D