DS4: a DeepSeek 4 flash specific inference engine for 128gb MacBooks
Posted by antirez@reddit | LocalLLaMA | View on Reddit | 60 comments
lakySK@reddit
Nice! This looks awesome, downloading right now.
2 questions:
- Do we still need to set the sampling parameters (temperature, top_p, etc.), or is that handled?
- What is the cache situation like? I've recently been using oMLX and am wondering how this compares in that respect.
Thanks for what looks like great work and a step in the right direction to help handle the many moving parts of running local LLMs in a reliable way!
lakySK@reddit
Seems to work like a charm so far. I've loaded it and started using it in the pi.dev coding agent.
Saw an error when it tried to edit a file, but it succeeded on a second attempt. The speed is quite decent and the quality seems very coherent. Running on an M4 Max 128GB, and the usability vibe-check so far seems to be passing! I tried Qwen 3.6 35b and 27b before this and felt like 35b is fast but not that smart, while 27b is decent but too slow to be usable (especially when it decides to contemplate the meaning of life for a long time before answering...).
Will definitely try to use this further and see how it goes!
DaniDubin@reddit
I am using Unsloth MiniMax-M2.7 UD-IQ3_S and it works great, with 128k context. I'm using it via Hermes-Agent. I did try different MLX versions, but unfortunately none was agentically usable; the MLX equivalent quants are noticeably worse than Unsloth (for the same GB size).
lakySK@reddit
Ok, need to try this one then as I liked the tone of Minimax so far. It just couldn’t get the paths right on MLX even though it tried so hard…
DaniDubin@reddit
I like Minimax because it's good at agentic tool use and coding, and it's much less verbose than Qwen's models. But because of its regular quadratic attention, the KV cache is huge and decode speed decays rapidly with context.
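For a sense of scale, here's a rough back-of-the-envelope for the cache side; the dimensions are made-up placeholders for a big dense-attention model, not MiniMax's actual config:
# Back-of-the-envelope KV cache size for plain (non-hybrid) attention.
# Layer/head dimensions are illustrative placeholders, not a real model's config.
def kv_cache_mib(n_tokens, n_layers=60, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # 2x for keys and values, one row per token per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens / 2**20
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> ~{kv_cache_mib(ctx):,.0f} MiB of KV cache")
The cache grows linearly with context, so with these made-up numbers you're already around 30 GB of cache at 128k tokens, on top of the weights.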
Really hoping DS4 will prove good for 128gb Mac users…
lakySK@reddit
Curious to hear your experience. It has not quite hit the spot for me yet
DaniDubin@reddit
Actually I’ve been using DS4-flash with antirez server/engine for a couple of days with Hermes Agent, and it’s super good! Smart model, quite fast (prefill 200-350, decode 24-27 tps) and generation speed drops very little due to hybrid attention. KV-cache is tiny, and the model is consistent across long contexts (reached around 120k and still very much usable).
lakySK@reddit
That’s the q2 model? Did you try coding?
DaniDubin@reddit
Yes, it's a special recipe of dynamic quantization that aggressively quantizes only the MoE experts and keeps the other layers untouched. Keep in mind that DS4 was originally trained with its MoE expert parameters in FP4 precision, so it "hurts" less to degrade them to q2.
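Conceptually the recipe looks something like this (a rough sketch, not the actual quantization code; the tensor-name patterns and quant types are just for illustration):
# Sketch of a "dynamic" quant recipe: squeeze the MoE expert tensors hard,
# keep everything else (attention, embeddings, norms) at higher precision.
# Name patterns and quant types are illustrative, not the real recipe.
def pick_quant(tensor_name: str) -> str:
    if "exps" in tensor_name:      # MoE expert weights: bulk of the params, trained near FP4
        return "Q2_K"
    if any(k in tensor_name for k in ("embed", "norm", "attn")):
        return "Q8_0"              # small but sensitive: keep near full precision
    return "Q4_K"                  # everything else at a middle ground
for name in ("blk.10.ffn_gate_exps.weight", "blk.10.attn_q.weight", "output_norm.weight"):
    print(name, "->", pick_quant(name))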
I did Python coding and it was OK, on par with MiniMax M2.7, but I didn't try heavy or complex coding with it yet.
lakySK@reddit
Ok, it seems like the recent version of this repo + the new imatrix quant is actually working better (I haven't seen tool call errors yet). Will test more!
DaniDubin@reddit
Share your experiences!
ottboy@reddit
I've gotten the same error as you: "TOOLS DSML_START DSML_END finish=error error="invalid tool call" and I learned from https://github.com/vllm-project/vllm/issues/40801 that the issue might be due to using MTP. I removed "--mtp MTP.gguf --mtp-draft 2" from ds4-server and I no longer get that error.
goat_on_boat@reddit
This is unreal. Performance is insane and the model seems to be a cut above the Qwen3.6/Gemma4 I've been playing with. Running an M5 Max 128GB, getting ~35 tk/s generation at 300 tk/s prefill, with a context window of 100,000.
Need to do more tests but this is a big step.
coder543@reddit
Mimo-V2.5 seems better, and it’s supported on standard llama.cpp.
TomLucidor@reddit
Turn on high thinking for DeepSeek-V4-Flash and compare against it; see if it gets better still.
coder543@reddit
Mimo-V2.5 has a higher aggregate score across a wide range of benchmarks, while using a fraction of the tokens of Max mode:
If you have better benchmarks to show, then by all means... but AA already aggregates a lot of the best benchmarks after running them independently against these models.
TomLucidor@reddit
Where did you get this panel? Mimo 2.5 has Flash vs Pro there and I can't find the model there for some reason! Also if you can, please use latency/speed as a measure based on similar hardware.
coder543@reddit
Keep scrolling down on the link, past the first several charts. That link takes you to the specific models/configurations in my screenshot. This was the chart shown for intelligence vs token use even farther down the page.
Towards the bottom, they also show the output speed that they recorded, which was 95 tokens per second for Mimo-V2.5, and about 65 tokens per second for DS4 Flash.
It’s very hard to have an apples-to-apples comparison on performance when they’re hosted by different providers, and I have no good way of running DS4 Flash on my DGX Spark to make my own local comparison… but nothing I’ve seen indicates DS4 Flash should actually be that much faster. Maybe at super deep contexts? Again, if you have numbers to share, I’d love to see them.
DeepSeek has indicated that this is maybe just a preview release. In the future, it could be a better model than Mimo-V2.5, but for now… I encourage you to try Mimo-V2.5 and judge it for yourself.
Shoddy_Bed3240@reddit
Mimo-V2.5 Flash has a bad llama.cpp implementation at the moment. It hallucinates really badly.
goat_on_boat@reddit
Update - Tool calling is improving in later releases.
The only negative thing I'd say is that the model is very keen to "do work", especially when asked just to comment or summarise. I'd suggest restricting tool access in your harness if you know you only need read-only access.
ohgoditsdoddy@reddit
This is on 128 GB RAM?
cityshade@reddit
This is SO good, thank you u/antirez
silenceimpaired@reddit
But I have a PC and uh Linux. Lol
S1M0N38@reddit
I think that this project is an important milestone.
DS4-Flash is not Mythos, Opus, or GPT-5.x, but it's a competent model that can be useful as a coding agent, far ahead of models that, two years ago, would have required a datacenter to host. Thanks to DS's crazy arch optimization, we have something that we can work with (not just run) on a ~5k MBP or some cheaper HW, with impressive ctx length (one of the bottlenecks in local LLMs). antirez/ds4, building on the shoulders of giants (llama.cpp), delivers very decent speed by leveraging the full HW stack (GPU, CPU, memory, SSD). Apple rethought (maybe by accident) the HW with the M-chip series, offering high memory capacity. This bet on narrow vertical optimization is a glimpse of what local LLMs will look like in the future.
An alternative path to real consumer-HW local LLM inference could be something like what taalas is trying to achieve (burning LLM weights directly into chips).
DaniDubin@reddit
Thank you! I really hope it will perform nicely!
I tried the MLX versions, 2-bit and 2-bit-M-DQ, and they were completely useless: reasoning loops even with easy prompts, unable to do simple calculations, and even on the Car Wash question it took 10 minutes to think before almost giving an incorrect answer. I used the official sampling params.
bobby-chan@reddit
Since safetensors is a well-defined file format, you can sometimes make quants way before proper support is there.
https://github.com/ml-explore/mlx-lm/pull/1189
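For example, just listing what's in a shard needs nothing model-specific, only the safetensors package (the file name below is a placeholder):
# Inspect a safetensors shard without any model-specific support.
# pip install safetensors numpy; the path is a placeholder.
from safetensors import safe_open
with safe_open("model.safetensors", framework="np") as f:
    for name in list(f.keys())[:10]:
        t = f.get_tensor(name)
        print(name, t.shape, t.dtype)
Once you can see the tensor names, shapes and dtypes, you can start mapping them onto a quant layout before the framework officially supports the architecture.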
fairydreaming@reddit
I wonder why only Flash? Is the Pro one so much different (aside from memory requirements)? I noticed that other frameworks advertise Flash as the supported model (example: ktransformers) and say not a word about Pro. I'm not trying to be mean, just genuinely curious.
coder543@reddit
No one has enough memory to run DS4 Pro.
fairydreaming@reddit
A few beers later:
fairydreaming@reddit
Now I feel a strange itch in my brain. Hold my beer...
segmond@reddit
Probably just what he can run; seems he taps out at 128GB.
brickout@reddit
Too big
No-Mountain3817@reddit
./ds4 -p "Explain Redis streams in one paragraph."
ds4: context buffers 1061.71 MiB (ctx=32768, backend=metal, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=8194)
2026-05-08 22:52:07.177 ds4[26139:7348522] Metal API Validation Enabled
ds4: requesting Metal residency (may take tens of seconds)... done
ds4: warming Metal model views... done
ds4: Metal model views created in 2.166 ms, residency requested in 653.869 ms, warmup 5.136 ms (mapped 82697.67 MiB from offset 5.08 MiB)
ds4: Metal mapped mmaped model as 2 overlapping shared buffers
ds4: Metal backend initialized for graph diagnostics
validateComputeFunctionArguments:1066: failed assertion `Compute Function(kernel_set_rows_f32_i32): Read-only bytes are being bound at index 2 to a shader argument with write access enabled (did you mean to use const or constant in the shader?).'
zsh: abort ./ds4 -p "Explain Redis streams in one paragraph."
Professional-Bear857@reddit
Do you know how to get this to work with Open WebUI? I can't seem to get it to return text, although the model is found via the connection http://127.0.0.1:8072/v1 (I have set the port to 8072). When running ds4 it returns:
ds4 % ./ds4-server --ctx 81920 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192 --port 8072
ds4: requesting Metal residency (may take tens of seconds)... done
ds4: warming Metal model views... done
ds4: Metal model views created in 4.853 ms, residency requested in 1068.506 ms, warmup 6.194 ms (mapped 157001.67 MiB from offset 5.08 MiB)
ds4: Metal mapped mmaped model as 1 overlapping shared buffers
ds4: Metal backend initialized for graph diagnostics
0508 14:37:12 ds4-server: context buffers 2363.71 MiB (ctx=81920, backend=metal, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=20482)
0508 14:37:12 ds4-server: KV disk cache /tmp/ds4-kv (budget=8192 MiB, cross-quant=accept, min=512, cold_max=30000, continued=10000, trim=32, align=2048)
0508 14:37:12 ds4-server: listening on http://127.0.0.1:8072
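To narrow down whether it's the server or Open WebUI, I'll try hitting the endpoint directly with something like this (assuming it speaks the usual OpenAI-style /v1/chat/completions route, which is what the Open WebUI connection implies; the model name is a placeholder):
# Minimal sanity check against the local server, bypassing Open WebUI.
# Assumes an OpenAI-compatible /v1/chat/completions endpoint on port 8072;
# the model name is a placeholder (check /v1/models for the real one).
import json, urllib.request
payload = {
    "model": "ds4",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}
req = urllib.request.Request(
    "http://127.0.0.1:8072/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])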
t00052e@reddit
It works for me normally without any problems.
Southern_Sun_2106@reddit
Thank you so much, it's running great!
weddingperson@reddit
Amazing to see the guy who gave us Redis now giving us this.
Zeeplankton@reddit
Neat. Any reasonable chance of getting it running on a 96GB M3? Not sure if even smaller quants exist.
-dysangel-@reddit
I think even if you managed to squeeze it in you'd get basically no context. Have you tried Minimax M2.7? IQ2_XXS UD is 65GB and pretty good. It's what I'm maining atm on my M3 Ultra. It doesn't have hybrid attention so setting up good caching is a must.
stormy1one@reddit
Interesting - what are you using it for? Have you compared Minimax 2.7 IQ2_XXS to Qwen3.6-27B at native 8/16-bit? I may take a second look at running 2.7 with a smaller quant as you suggested.
-dysangel-@reddit
I haven't used it for my day job yet, but it's done a nice job working on my game engine in my tests so far.
I was having issues with caching and tool calling with Qwen, plus Minimax is 2-3x faster on decode than Qwen 27B, and feels less overthink-y. I'll probably try to sort out 27B again later but Minimax is my priority for now.
I'm doing prefill on my DGX Spark, and decode on the Mac, and vibe coding up some aggressive caching using my M3 Ultra's RAM which is otherwise going to waste (512GB). It's nice to have instantaneous response from cached agentic system prompts, and 700-900tps prefill on short sequences on the Spark, dropping to about 450 at 16k.
stormy1one@reddit
Disaggregated prefill via vLLM on the DGX? How do you have it hooked in on the OSX side?
-dysangel-@reddit
just vibe coded up a scaffold with Claude. Started integrating the cache server I built a few months ago too, so that's now the orchestrator deciding where prefill/decode requests go. The Spark is faster for prefill and the Mac for decode, but once you start having a few sub agents running you might as well be batching on both machines.
rm-rf-rm@reddit
I was very excited to see this on HN. I was surprised it wasn't here yet, so I posted it last night and got downvoted.
Glad that the author himself posted it and it's getting the upvotes and attention it deserves!
jwestra@reddit
Nice. How does it compare in speed to running a similar quantization on llama.cpp?
antirez@reddit (OP)
The only llama.cpp DS4 implementation I'm aware of that works reliably is the one I published as a fork. DS4 is faster. When the official llama.cpp implementation is released, we'll have to benchmark it. I hope they will just steal my kernels.
LagOps91@reddit
do you know if there is progress being made on llama.cpp to support DS4? haven't heard of anything in a while...
segmond@reddit
None. There are about 10 forks and every single one is vibe-coded hell. Performance is abysmal for all of them; I have tried them all.
abkibaarnsit@reddit
Quick question
Why are you going all-in on DeepSeek V4 Flash? Did your tests reveal it to be better than others on your tasks?
I'm curious because this seems less adaptable. Do you think the V4 architecture will be copied by others as was the case with V3/R1 ?
PS. Love your streams.. Picking up a bit of Italian 😂
DistanceSolar1449@reddit
Yeah, Deepseek V4 Flash on 128gb seems questionable. Especially since it’s a Mac, so you need 8GB for the OS.
That model is a bit too big to fit well on 120GB without quantization brain damage.
Better option would probably be Minimax M2.7
segmond@reddit
Stop speculating when you can try it yourself. The q2 mixed weight is 81GB; it doesn't need that much memory. If you're going to talk it down, at least try it; some of us have been running this locally for about 2 weeks now.
Electrical-Pay-5119@reddit
You really need to try it to understand. I'm using it on my M5 Max 128GB; it fits perfectly with good context and tool calling. It is just a completely different beast in terms of knowledge base and creative output than the other models I've used, and it's my desert-island model at this point.
segmond@reddit
A lot of performance is being left on the table. With pure CPU I'm getting 12pp/5tg; with CUDA I'm getting around 30pp/15tg. Of course, all the branches I have run so far are vibe-coded to hell and it's amazing they work at all, but this tells us how much performance is being lost to the current state of things. I'm running on 3090s/3080s, which are much faster than an M3; if this ever gets figured out we should see 1000pp.
With that said, DSv4 is the SOTA you can run locally under 128GB. NOTHING comes close. Not qwen3.5-122b, not minimax2.7. NOTHING. You really do get 1 million context (I have run it that far), and it is super duper coherent with the "dynamic" q2 quant mix even at very long context. It's my favorite local model; mimov2.5 is not too bad but often goes into a loop. Perhaps that's my luck, but I have never had dsv4 go into a loop, not even once.
sp4_dayz@reddit
This is amazing; the weights loading into RAM almost instantly on the M5 Max is a chef's kiss.
johnnyApplePRNG@reddit
Godspeed, bro. You have my eyes! Will be watching...
chafey@reddit
Any chance this would work on Mac Pro 2019 7,1 with Dual w6800x Duo GPUs?
coder543@reddit
That does not even remotely support modern Metal APIs, so there is no chance.
marutichintan@reddit
I tried DS4 on my Mac Studio 512 and it's working fine. It has some issues with chat templates in Open WebUI.
foldl-li@reddit
This is really great. Ollama should learn something from this.
SmartCustard9944@reddit
Thanks to your work I was able to use DeepSeek V4 Flash (cloud) to port your llama.cpp fork to both Vulkan and ROCm for Strix Halo. DeepSeek V4 Flash (unquantized) is surprisingly good!
Vulkan backend runs at around 15 tok/s, but prompt processing is not amazing.
I have tried the q2 version and, as expected, it does not produce the same level of intelligence as the full version. I used ChatGPT 5.5 to generate some prompts that I tested against the quantized and non-quantized versions; the quantized one gets confused more easily.
Have you tried the quantized model in real use cases?