Can anyone share their qwen 2.5 setup for a 4090 please?
Posted by firemeaway@reddit | LocalLLaMA | 35 comments
Hi folks,
Totally get there are multiple 4090-related questions, but I’ve been struggling to set up Qwen2.5 using the oobabooga text-generation webui.
With the 32B model I get extremely slow responses, even at 4-bit quantisation.
Anyone willing to share their config that performs best?
Thanks 🙏
TrashPandaSavior@reddit
On my workstation with a 4090, I use LM Studio to load qwen2.5-coder-32b-instruct q4_k_m. I set it to use flash attention and an 8192-token context window. With a simple programming question, leaving the prompt kinda small, I get 31.15 T/s.
Educational_Fix5968@reddit
Have a single 4090 on the same setup and I only get about 18 T/s. I just set up LM Studio though, so I might be missing something on my setup.
TrashPandaSavior@reddit
Some of those things I mentioned, like the context window and flash attention, have to be configured manually or they won't get picked up. Also, it might not force all layers to be offloaded by default, so that might have to be manually set. I also have "keep model loaded" and "use mmap" enabled.
Otherwise, I don't have much out of the ordinary for those numbers: Win11, 96 GB RAM, but the whole model is offloaded to VRAM. CUDA 12.4. Just double-checked my simple two-sentence prompt in an otherwise empty context and got 32.23 T/s.
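If it helps to cross-check outside LM Studio, a rough llama.cpp equivalent of those settings would be something like this; the GGUF path is just a placeholder and the flags are from memory, so check llama-server --help:

```
# Roughly the same settings via llama-server: flash attention, 8k context,
# all layers offloaded, model locked in memory (mmap is on by default).
# The model path is a placeholder.
llama-server -m ./qwen2.5-coder-32b-instruct-q4_k_m.gguf \
  -c 8192 -ngl 99 -fa --mlock
```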
Educational_Fix5968@reddit
Hmm, yeah, I got similar results, but only after switching to Windows from openSUSE. Running CUDA 12.4 on both, but I get a ~50% bump when I run LM Studio on Win 11.
~18-20 T/s on Linux
~30-33 T/s on Win 11
Just getting into local LLMs and didn't think there would be this much of a performance difference swapping the OS.
TrashPandaSavior@reddit
I rebooted into my Debian stable partition a little bit ago and got ~36 T/s. I made sure to use the same Qwen/Qwen2.5-Coder-32B-Instruct Q4_K_M and other settings with LM Studio 0.3.5. I did notice this time, though, that LM Studio defaults to only offloading 49 layers of 64, and that gave me a speed of ~6 T/s.
TrashPandaSavior@reddit
I'll reboot into Debian stable in a few hours, run a speed test with LM Studio there, and drop a reply with what I get.
ParaboloidalCrest@reddit
Indeed, 8k context is the max for 24 GB of VRAM (before offloading to RAM).
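Rough back-of-envelope for why (assuming Qwen2.5-32B's 64 layers, 8 KV heads and 128 head dim, so treat the numbers as approximate): the Q4_K_M weights are already ~19-20 GB, and an fp16 KV cache costs about 256 KB per token, so 8k of context adds roughly 2 GB before compute buffers:

```
# Approximate fp16 KV-cache size for 8192 tokens of context:
# (K+V) x layers x kv_heads x head_dim x bytes x tokens
echo $((2 * 64 * 8 * 128 * 2 * 8192 / 1024 / 1024)) MiB   # prints: 2048 MiB
```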
SniperDuty@reddit
Ah, it's always the VRAM I forget about sometimes. And yet the upcoming 5090 will only come with a marginal increase.
TrashPandaSavior@reddit
There's still a little room left. I use Sunshine/Moonlight as a sort of VNC, and with 8k context, only 22 GB is allocated. Probably can squeeze a little more if motivated. I'm usually fine with 8k, personally.
Hungry_Instance9764@reddit
2x3090, 32B exl2 6.5bpw quant with the 1.5B as a draft model, using this backend: https://test.pypi.org/project/gallama/ I get 40-45 t/s at 128k context. The 48 GB is fully loaded.
johakine@reddit
Do you have all PCIe slots at x16 speed? Or x16/x4? x8/x8?
Hungry_Instance9764@reddit
Both through x4 via OCuLink.
johakine@reddit
Please, tell me your prompt speed (prompt eval time).
And... what is your hardware? These devices are unknown to me.
Hungry_Instance9764@reddit
https://aoostar.com/products/aoostar-ag02-egpu-dock-with-oculink-port-built-in-huntkey-550w-power-supply-not-supports-hot-swapcompatible-with-egpus-up-to-300w?srsltid=AfmBOopJLHVyAX68lSQJJoYfBrwsiMnhBY-ho9jUHp7k8RDWcZF-y33R
Here is the toaster.
Such_Advantage_6949@reddit
You won't need to; x4 is fine, unless you're using tensor parallel, which is only applicable for bigger models. Currently only Qwen 72B, Mistral Large, and Llama 3.1 70B are supported for tensor parallel in exllama.
Medium_Chemist_4032@reddit
Can you post the command I could use to replicate that on a fresh install? Also a 2x3090 setup. Thanks!
Such_Advantage_6949@reddit
For gallama, there is the accompanying gallamaUI where you can load models via the UI. It saved me the hassle of remembering the command until I forgot how the command actually goes.
Hungry_Instance9764@reddit
==== .bat file
REM Navigate to the project directory
cd C:\Users\Eugene\Documents\git\gallama
REM Run the PowerShell script to activate the virtual environment
powershell -ExecutionPolicy Bypass -Command ".\.venv\Scripts\Activate.ps1; gallama run -id 'model_id=coder30b_6_5 draft_model_id=coder1.5b'; deactivate"
====
My model_config is:
coder30b_6_5:
  backend: exllama
  cache_quant: Q6
  gpus: auto
  max_seq_len: 128000
  model_id: C:\Users\Eugene\gallama\models\Qwen2.5-Coder-32B-Instruct-exl2
  prompt_template: Qwen2
  quant: 6.5
coder1.5b:
  backend: exllama
  cache_quant: Q6
  gpus: auto
  max_seq_len: 32000
  model_id: C:\Users\Eugene\gallama\models\Qwen_Qwen2.5-Coder-1.5B-Instruct-exl2
  prompt_template: Qwen2
  quant: Q8
MachineZer0@reddit
Thanks for posting this. Will check out gallama
necrogay@reddit
Single 4090, exl2 4.8bpw, 32k context, 32-36 tok/s
HikaruZA@reddit
TabbyAPI/exllamav2, 4.0 bpw with FP16 cache and 20k+ context, plus a 4.0 bpw exl2 0.5B-coder draft model. Getting up to 75 tokens/s, typically in the 60s. I prefer running neutral samplers with just 0.01 min_p.
appakaradi@reddit
vLLM with 4-bit AWQ quantization.
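If anyone wants to try that route, a minimal launch would look roughly like this; the repo name and context length are assumptions on my part, not a tested config:

```
# Hypothetical vLLM launch for the 4-bit AWQ quant on a single 24 GB card.
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
  --quantization awq --max-model-len 8192 --gpu-memory-utilization 0.95
```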
SniperDuty@reddit
Yeah, I'm getting 2.84 tok/sec on my 4090 and a hell of a lot more than that on my M4 Max. I will save this post and go through the advice on here in the morning. Thanks for posting, OP.
infiniteContrast@reddit
You must use the exl2 quant.
For a single 24 GB VRAM card you need the 4.5bpw exl2 quant, and you can maybe fit more than 16k context in that card with 4-bit cache.
rerri@reddit
This is not true.
I run IQ4_XS quant of Qwen 32B, n_ctx 32768, cache_8bit. No issues and even some headroom for more context.
Q4_K_M or Q4_K_L are fine too with shorter context length.
14B Q6_K with ~100k context fits.
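If you'd rather run the same thing through llama.cpp directly instead of ooba, the equivalent knobs would be roughly this (filename is a placeholder):

```
# IQ4_XS weights, 32k context, full offload, flash attention, 8-bit K/V cache.
llama-server -m ./Qwen2.5-Coder-32B-Instruct-IQ4_XS.gguf \
  -c 32768 -ngl 99 -fa -ctk q8_0 -ctv q8_0
```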
superfsm@reddit
Cline doesn't work with this setup, right?
I am tempted, but I've read about bad results, so I am afraid.
rerri@reddit
I've never used Cline, but ooba does have an OpenAI-compatible API, and Cline seems to support that. So maybe you could connect them that way. Definitely not sure, I only just googled what Cline is.
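For what it's worth, the usual way to wire that up (untested with Cline on my end) would be to start ooba with the API enabled and point Cline's OpenAI-compatible provider at it:

```
# Start text-generation-webui with its OpenAI-compatible API (default port 5000),
# then set Cline's base URL to http://127.0.0.1:5000/v1 with any dummy API key.
python server.py --api --listen
```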
gamesntech@reddit
Is the GPU actually being used? That seems like the most likely problem
Prudence-0@reddit
Use llama.cpp (or LM Studio) with 36 layers on the GPU. Works fine with GGUF Q8.
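Something like this, presumably; the filename is a placeholder, and since a Q8_0 of the 32B is ~35 GB, only part of it fits in 24 GB, hence the partial offload:

```
# Partial offload: ~36 of 64 layers on the GPU, the rest in system RAM.
llama-server -m ./qwen2.5-coder-32b-instruct-q8_0.gguf -ngl 36 -c 8192
```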
Downtown-Case-1755@reddit
Use TabbyAPI, Q6 cache, and a 4.0-4.5bpw exl2 quantization. Voilà!
Ooba is less than ideal because its tokenizer and some other bits are really slow at long context, and you might also be using a suboptimal quantization for your hardware. It also has no option for Q6 cache, and some other missing bits.
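A minimal sketch of what that looks like in TabbyAPI's config.yml, going from memory, so the key names and the folder name are assumptions; double-check against the config_sample.yml that ships with TabbyAPI:

```
# Hypothetical TabbyAPI config snippet for a 4.0-4.5bpw exl2 quant with Q6 cache.
model:
  model_dir: models
  model_name: Qwen2.5-Coder-32B-Instruct-exl2-4.5bpw   # placeholder folder name
  max_seq_len: 32768
  cache_mode: Q6
```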
KL_GPU@reddit
Check your context length; your GPU should be able to support up to ~20k tokens. Also, your webui uses llama-cpp-python, which in my testing with a Tesla P40 delivers about 2/3 of the possible performance.
rerri@reddit
This is my guess too. OP probably did not touch n_ctx, which defaults to 128k. Too much.
Xyzzymoon@reddit
If you are using quantization anyway, try koboldcpp first; no setup required, really.
oobabooga is not easy to troubleshoot.
firemeaway@reddit (OP)
Thanks. I tried koboldcpp but it says it only takes the GGUF format, whereas the model is a bunch of safetensors files, so I can’t launch it.
Xyzzymoon@reddit
Just download a GGUF version. You are quantizing down to 4-bit anyway, so you are not getting much benefit from not using GGUF. Those are much easier to run.
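Once you have a GGUF (e.g. a Q4_K_M of Qwen2.5-Coder-32B-Instruct from one of the HF repos), launching it is basically one line; flag names below are from memory, so check koboldcpp's --help for your version:

```
# Hypothetical koboldcpp launch for a Q4_K_M GGUF on a 4090.
python koboldcpp.py --model Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
  --usecublas --gpulayers 99 --contextsize 8192 --flashattention
```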