Tutorial: How to Run DeepSeek-R1 (671B) 1.58bit on Open WebUI

Posted by yoracale@reddit | LocalLLaMA | View on Reddit | 41 comments

Hey guys! Daniel & I (Mike) at Unsloth collabed with Tim from Open WebUI to bring you this step-by-step on how to run the non-distilled DeepSeek-R1 Dynamic 1.58-bit model locally!

This guide is summarized so I highly recommend you read the full guide (with pics) here: https://docs.openwebui.com/tutorials/integrations/deepseekr1-dynamic/

To Run DeepSeek-R1:

1. Install Llama.cpp

Download prebuilt binaries or build from source following this guide.

2. Download the Model (1.58-bit, 131GB) from Unsloth

Get the model from Hugging Face.
Use Python to download it programmatically:

from huggingface_hub import snapshot_download

# Download only the 1.58-bit (UD-IQ1_S) shards from the Unsloth repo
snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",
    local_dir="DeepSeek-R1-GGUF",
    allow_patterns=["*UD-IQ1_S*"],
)

Once the download completes, you’ll find the model files in a directory structure like this:

DeepSeek-R1-GGUF/
├── DeepSeek-R1-UD-IQ1_S/
│   ├── DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf
│   ├── DeepSeek-R1-UD-IQ1_S-00002-of-00003.gguf
│   ├── DeepSeek-R1-UD-IQ1_S-00003-of-00003.gguf

Ensure you know the path where the files are stored.
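One way to pin down that path is a small helper that searches the download folder for the first shard and checks that every part is present. This is only a minimal sketch: the name `find_first_shard` is hypothetical, and it assumes the `-0000N-of-0000M.gguf` naming shown in the listing above and that only one quantization was downloaded into the folder.

```python
import re
from pathlib import Path

# Matches the shard suffix shown above, e.g. "-00001-of-00003.gguf".
SHARD_RE = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")


def find_first_shard(model_dir):
    """Return the absolute path of the first GGUF shard under model_dir.

    Raises FileNotFoundError if no shard is found or if any part is missing.
    Assumes a single quantization lives under model_dir (illustrative helper).
    """
    shards = {}
    total = None
    for path in Path(model_dir).rglob("*.gguf"):
        match = SHARD_RE.search(path.name)
        if match is None:
            continue  # not a split GGUF file
        index, count = int(match.group(1)), int(match.group(2))
        shards[index] = path
        total = count
    if total is None:
        raise FileNotFoundError(f"no split GGUF files found under {model_dir}")
    missing = [i for i in range(1, total + 1) if i not in shards]
    if missing:
        raise FileNotFoundError(f"missing shard(s) {missing} of {total}")
    return shards[1].resolve()
```

Calling `find_first_shard("DeepSeek-R1-GGUF")` should return the path to use when starting the server in step 4.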

3. Install and Run Open WebUI

If you don’t already have it installed, no worries! It’s a simple setup. Just follow the Open WebUI docs here: https://docs.openwebui.com/
Once installed, start the application - we’ll connect it in a later step to interact with the DeepSeek-R1 model.

4. Start the Model Server with Llama.cpp

Now that the model is downloaded, the next step is to run it using Llama.cpp’s server mode.

🛠️Before You Begin:

Locate the llama-server Binary
If you built Llama.cpp from source, the llama-server executable is located in:
llama.cpp/build/bin
Navigate to this directory using:
cd [path-to-llama-cpp]/llama.cpp/build/bin
Replace [path-to-llama-cpp] with your actual Llama.cpp directory. For example:
cd ~/Documents/workspace/llama.cpp/build/bin
Point to Your Model Folder
Use the full path to the downloaded GGUF files. When starting the server, specify the first part of the split GGUF files (e.g., DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf).

🚀Start the Server

Run the following command:

./llama-server \
    --model /[your-directory]/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --port 10000 \
    --ctx-size 1024 \
    --n-gpu-layers 40

Example (If Your Model is in /Users/tim/Documents/workspace):

./llama-server \
    --model /Users/tim/Documents/workspace/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --port 10000 \
    --ctx-size 1024 \
    --n-gpu-layers 40
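If you would rather launch the server from a script, the same command can be assembled in Python. This is a minimal sketch that uses only the flags shown above; the helper names `build_server_command` and `start_server` and their defaults are illustrative and not part of Llama.cpp.

```python
import subprocess


def build_server_command(model_path, binary="./llama-server",
                         port=10000, ctx_size=1024, n_gpu_layers=40):
    """Build the llama-server argument list used in this guide."""
    return [
        binary,
        "--model", str(model_path),
        "--port", str(port),
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", str(n_gpu_layers),
    ]


def start_server(model_path, **options):
    """Start llama-server in the background and return the process handle."""
    return subprocess.Popen(build_server_command(model_path, **options))
```

Keeping the values as keyword arguments makes it easy to try a different context size or number of offloaded layers without retyping the whole command.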

✅ Once running, the server will be available at:

http://127.0.0.1:10000

🖥️ Llama.cpp Server Running

After running the command, you should see a message confirming the server is active and listening on port 10000.
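Loading a model this large can take a while, so it may help to poll the server instead of watching the terminal. The sketch below assumes your llama-server build exposes the `/health` endpoint (recent builds do, as far as I know); the function names, retry count, and delay are hypothetical choices.

```python
import time
import urllib.error
import urllib.request


def health_url(base_url):
    """Return the health-check URL for a llama-server base URL."""
    return base_url.rstrip("/") + "/health"


def server_is_ready(base_url="http://127.0.0.1:10000", timeout=5.0):
    """Return True if the server answers /health with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status (e.g. still loading).
        return False


def wait_until_ready(base_url="http://127.0.0.1:10000", attempts=60, delay=10.0):
    """Poll the server until it is ready or the attempts run out."""
    for _ in range(attempts):
        if server_is_ready(base_url):
            return True
        time.sleep(delay)
    return False
```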

5. Connect Llama.cpp to Open WebUI

Open Admin Settings in Open WebUI.
Go to Connections > OpenAI Connections.
Add the following details:
URL → http://127.0.0.1:10000/v1
API Key → none

Adding Connection in Open WebUI
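To confirm the connection details outside Open WebUI, you can send a request to the same OpenAI-compatible URL directly. This is a rough sketch using only the standard library: the helper names are hypothetical, and the `model` value is a placeholder label (an assumption; llama-server serves whichever model it was started with).

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://127.0.0.1:10000/v1", api_key="none"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        # Placeholder label (assumption): the server uses the model it loaded.
        "model": "DeepSeek-R1-UD-IQ1_S",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def ask(prompt, **options):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **options)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

If `ask("Hello")` returns text, the URL and key entered in Open WebUI should work as well. Expect the reply to take a while at a few tokens per second.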

Notes

You don't need a GPU to run this model, but one will make it faster, especially if you have at least 24GB of VRAM.
Try to have RAM + VRAM add up to 120GB or more to get decent tokens/s.
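The memory rule of thumb is simple arithmetic, sketched here for convenience. The function names are made up for illustration, and the 120GB figure is the guideline from the note above, not a hard requirement: the model can still run with less, just more slowly.

```python
def combined_memory_gb(ram_gb, vram_gb=0):
    """Total memory available to the model: system RAM plus GPU VRAM."""
    return ram_gb + vram_gb


def meets_memory_guideline(ram_gb, vram_gb=0, guideline_gb=120):
    """Return True if RAM + VRAM reaches the suggested 120GB."""
    return combined_memory_gb(ram_gb, vram_gb) >= guideline_gb
```

For example, 128GB of RAM with a 24GB GPU gives 152GB and clears the guideline, while 64GB of RAM with a 24GB GPU gives 88GB and falls short.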

If you have any questions please let us know and also - any suggestions are also welcome! Happy running folks! :)

[-]

useful@reddit

I ran this with a 9900K, a 3090, 128GB of DDR4 RAM, and an NVMe drive

35 minutes for flappy bird

[-]

np-n@reddit

I am also facing a similar issue. Did you identify any solution for this?

[-]

yoracale@reddit (OP)

That's definitely not right. Did you enable kv cache, offloading and mmap?

Someone from LocalLLaMA ran it at 2 tokens/s WITHOUT a GPU, with only 96GB of RAM: https://www.reddit.com/r/LocalLLaMA/comments/1idseqb/deepseek_r1_671b_over_2_toksec_without_gpu_on/

[-]

zzComtra@reddit

May I ask the estimated tok/s for 48 GB VRAM (2x3090) + 64 GB RAM?

Also, I wonder if anyone has been able to test whether the 1.58-bit Dynamic quant is on par overall with a 3.xx-bit or 4.xx-bit quant in answer accuracy?

[-]

yoracale@reddit (OP)

48GB VRAM? That's pretty darn good. I'd say like 3-6 tokens/s but you must offload into your GPU! :)

[-]

zzComtra@reddit

Thank you!

just wondering tho will adding a 5090/32 GB more VRAM help push the numbers up?

[-]

yoracale@reddit (OP)

Yes, but in this case RAM might be more important because you have a lot of VRAM already

[-]

zzComtra@reddit

Ahh. Looks like I get 8 seconds per word... is that because I have a Gen 3 SSD?

Should I get 64 GB of RAM or a Gen 5 SSD? TIL Gen 3 vs Gen 5 SSD... didn't think I would attempt LLMs beyond the seventies-B Q4

[-]

ibstudios@reddit

How do I save? if I ran from the c prompt I could just do a /save modelname?

[-]

yoracale@reddit (OP)

Wait I'm confused, you don't need to save it - it's already saved on the computer when you download it

[-]

ibstudios@reddit

If you run ollama + deepseek there is a "/save XYZ" command that takes whatever you trained in chat and saves it.

[-]

Following the guidance above, I just set up DeepSeek R1 671B 1.58-bit on my M2 Studio (24-core CPU, 60-core GPU, 192GB RAM). Ran with the suggested ctx-size of 1024 and n-gpu-layers of 40. I get between 8 and 9 t/s at inference. Very happy with how it runs in Open WebUI. Running htop reports 134GB RAM used during inferencing.

Points of interest,

1) When closing the llama server, 90GB of memory is still labelled as in-use although no process is listed as using it.

2) The model reports a max ctx-size as 163840.

```llama_init_from_model: n_ctx_per_seq (1024) < n_ctx_train (163840) -- the full capacity of the model will not be utilized```

I'll see how far I can get, I guess.

[-]

yoracale@reddit (OP)

Amazing you're so lucky ahaha. I'm only getting 2 tokens/s with my potato 64 ram setup

[-]

FluffyGoatNerder@reddit

So at 50 layers, a ctx-size of:

2048 uses avg 128GB of VRAM and generates avg of 8.34 T/s
4098 uses avg 136GB of VRAM and generates avg of 5.80 T/s
8192 uses avg 151GB of VRAM and generates avg of 0.78 T/s

Sliding context window error occurs during long thinking generations. Seems the sliding window is disabled here.

[-]

FluffyGoatNerder@reddit

At 62 layers (the max, I understand), a ctx-size of:

1024 uses avg 150GB of VRAM and generates avg of 13.34 T/s
2048 uses avg 156GB of VRAM and generates avg of 12.44 T/s
4098 KV cache error
8192 KV cache error

All these are avg of 3 runs, asking "Tell me about yourself and what you can do."

[-]

GortKlaatu_@reddit

What kind of a performance hit does this have on benchmarks and has anyone tried this on a 128GB Macbook Pro?

[-]

BrilliantArmadillo64@reddit

I tried it on my M4 Max 128GB and got about 0.2tk/s...
I gave it more memory and launched it like this:

sudo sysctl iogpu.wired_limit_mb=122880
./llama.cpp/build/bin/llama-cli --model ~/.cache/lm-studio/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --cache-type-k q4_0 --threads 16 --prio 2 --temp 0.6 --ctx-size 8192 --seed 3407 --n-gpu-layers 45 -no-cnv --prompt "<｜User｜>Create a Flappy Bird game in Python.<｜Assistant｜>"

If somebody finds better parameters I'd be interested!

[-]

Trans-amers@reddit

I ran this with Macmon, and noticed that it is using CPU instead of GPU. I thought llama.cpp has metal by default and runs on gpu?

[-]

Trans-amers@reddit

Running the same line even failed for me:
`src/llama.cpp:5326: GGML_ASSERT(hparams.n_expert <= LLAMA_MAX_EXPERTS) failed`

[-]

rafyyy@reddit

Update llama.cpp, you have an older version where LLAMA_MAX_EXPERTS was smaller than the model's max experts.

[-]

Trans-amers@reddit

Thanks, after rm and pulling the repository again it worked!

[-]

yoracale@reddit (OP)

Yep, the docs were actually written using a 128GB Mac :)

[-]

EntertainmentBroad43@reddit

What is a “moderate” speed?

[-]

yoracale@reddit (OP)

Moderate is like 2 tokens/s

[-]

EntertainmentBroad43@reddit

Thank you for the guide! Won’t speculative decoding speed this up quite a bit?

[-]

yoracale@reddit (OP)

Yes absolutely - that's what you're supposed to use :)

[-]

ethertype@reddit

What is useful as a draft model for speculative decoding with DS-R1?

[-]

useful_tool30@reddit

Hey, I tried loading this up and set GPU offload layers to 7 for my 4090. 64GB of system ram. Everything else was left default.

When loading and then prompting I see nothing loaded into the GPU memory only maxing system RAM. Model runs at what looks like 1 word per sec. I used llama-b4608-bin-win-cuda-cu12.4-x64.zip with the partner CUDA DLLs.

[-]

yoracale@reddit (OP)

Oh weird, it should definitely be much faster.

Someone from LocalLLaMA ran it at 2 tokens/s WITHOUT a GPU, with only 96GB of RAM: https://www.reddit.com/r/LocalLLaMA/comments/1idseqb/deepseek_r1_671b_over_2_toksec_without_gpu_on/

You could possibly follow along with what they did

[-]

Rob-bits@reddit

Once the model is downloaded, can I use LM Studio to run the model? Or is it working only with llama.cpp?

[-]

yoracale@reddit (OP)

No you can't unfortunately, you will need to merge it manually

[-]

Trans-amers@reddit

Agreed, LM Studio requires manual merging to then be able to see the file. It ran out of RAM when I ran it in LM Studio, so I have to clean up my system and run it again tonight

[-]

yoracale@reddit (OP)

Good luck! LM studio is pretty good!

[-]

Rob-bits@reddit

I mean, if I have the model merged, what stops LM Studio from running it?

[-]

yoracale@reddit (OP)

Honestly unsure but I think so?

[-]

Goldandsilverape99@reddit

If you have an updated LM Studio, you can run the unsloth DeepSeek-R1-GGUF version. You can even download it using LM Studio, or find the folder where your other LM Studio files are and place it there. You don't need to merge the GGUF files if they are split like this (00001-of-00004.gguf) and sit next to each other, though you can, depending on your filesystem and how large a file it allows. A note about llama.cpp: it also includes a basic web UI that you can use instead.

[-]

Rob-bits@reddit

Ohh nice, thanks for the info. Will give it a try :)

[-]

Fun_Spread_1802@reddit

Thank you

[-]

yoracale@reddit (OP)

Thanks a lot for reading appreciate it! 🙏

[-]

12v12ccc@reddit

What llama.cpp binary should i use with AMD CPU and GPU?

[-]

yoracale@reddit (OP)

I think it shouldn't matter which GPU/CPU you use since AMD is largely supported for running models in llama.cpp, but you should confirm via their GitHub