Can anyone share their qwen 2.5 setup for a 4090 please?
Posted by firemeaway@reddit | LocalLLaMA | 35 comments
Hi folks,
Totally get there are multiple 4090-related questions, but I’ve been struggling to set up Qwen2.5 using the oobabooga text-generation webui.
With the 32B model I get extremely slow responses, even at 4-bit quantisation.
Anyone willing to share their config that performs best?
Thanks 🙏
TrashPandaSavior@reddit
On my workstation with a 4090, I use LM Studio to load qwen2.5-coder-32b-instruct q4_k_m. I set it to use flash attention and an 8192-token context window. With a simple programming question, leaving the prompt kinda small, I get 31.15 T/s.
Educational_Fix5968@reddit
Have a single 4090 on the same setup and I only get about 18 T/s. I just set up LM Studio though, so I might be missing something on my setup.
TrashPandaSavior@reddit
Some of those things I mentioned, like the context window and flash attention, have to be configured manually or they won't get picked up. Also, it might not force all layers to be offloaded by default, so that might have to be manually set. I also have "keep model loaded" and "use mmap" enabled.
Otherwise, I don't have much out of the ordinary for those numbers: Win11, 96 GB RAM, but the whole model is offloaded to VRAM. CUDA 12.4. Just double-checked my simple two-sentence prompt in an otherwise empty context and got 32.23 T/s.
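If it helps to cross-check outside LM Studio, a rough llama.cpp equivalent of those settings would be something like this; the GGUF path is just a placeholder and the flags are from memory, so check llama-server --help:

```
# Roughly the same settings via llama-server: flash attention, 8k context,
# all layers offloaded, model locked in memory (mmap is on by default).
# The model path is a placeholder.
llama-server -m ./qwen2.5-coder-32b-instruct-q4_k_m.gguf \
  -c 8192 -ngl 99 -fa --mlock
```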
Educational_Fix5968@reddit
Hmm, yeah, I got similar results, but only after switching to Windows from openSUSE. Running CUDA 12.4 on both, but I get a ~50% bump when I run LM Studio on Win 11.
~18-20 T/s on Linux
~30-33 T/s on Win 11
Just getting into local LLMs and didn't think there would be this much of a performance difference swapping the OS.
TrashPandaSavior@reddit
I rebooted into my Debian stable partition a little bit ago and got ~36 T/s. I made sure to use the same Qwen/Qwen2.5-Coder-32B-Instruct Q4_K_M and other settings with LM Studio 0.3.5. I did notice this time, though, that LM Studio defaults to only offloading 49 layers of 64, and that gave me a speed of ~6 T/s.
TrashPandaSavior@reddit
I'll reboot into Debian stable in a few hours, run a speed test with LM Studio there, and drop a reply with what I get.
ParaboloidalCrest@reddit
Indeed, 8k context is the max for 24 GB of VRAM (before offloading to RAM).
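Rough back-of-envelope for why (assuming Qwen2.5-32B's 64 layers, 8 KV heads and 128 head dim, so treat the numbers as approximate): the Q4_K_M weights are already ~19-20 GB, and an fp16 KV cache costs about 256 KB per token, so 8k of context adds roughly 2 GB before compute buffers:

```
# Approximate fp16 KV-cache size for 8192 tokens of context:
# (K+V) x layers x kv_heads x head_dim x bytes x tokens
echo $((2 * 64 * 8 * 128 * 2 * 8192 / 1024 / 1024)) MiB   # prints: 2048 MiB
```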
SniperDuty@reddit
Ah, it's always the VRAM I forget about sometimes. And yet the upcoming 5090 will only come with a marginal increase.
TrashPandaSavior@reddit
There's still a little room left. I use Sunshine/Moonlight as a sort of VNC, and with 8k context, only 22 GB is allocated. Probably can squeeze a little more if motivated. I'm usually fine with 8k, personally.
Hungry_Instance9764@reddit
2x3090, 32B exl2 6.5bpw quant with the 1.5B as a draft model, using this backend: https://test.pypi.org/project/gallama/ I get 40-45 t/s at 128k context. The 48 GB is fully loaded.
johakine@reddit
Do you have all PCIe slots at x16 speed? Or x16/x4? x8/x8?
Hungry_Instance9764@reddit
Both through x4 via OCuLink.
johakine@reddit
Please, tell me your prompt speed (prompt eval time).
And... what is your hardware? These devices are unknown to me.
Hungry_Instance9764@reddit
https://aoostar.com/products/aoostar-ag02-egpu-dock-with-oculink-port-built-in-huntkey-550w-power-supply-not-supports-hot-swapcompatible-with-egpus-up-to-300w?srsltid=AfmBOopJLHVyAX68lSQJJoYfBrwsiMnhBY-ho9jUHp7k8RDWcZF-y33R
Here is the toaster.
Such_Advantage_6949@reddit
You won't need to; x4 is fine, unless you're using tensor parallel, which is only applicable for bigger models. Currently only Qwen 72B, Mistral Large, and Llama 3.1 70B are supported for tensor parallel in exllama.
Medium_Chemist_4032@reddit
Can you post the command I could use to replicate that on a fresh install? Also a 2x3090 setup. Thanks!
Such_Advantage_6949@reddit
For gallama, there is the accompanying gallamaUI where you can load models via the UI. It saved me the hassle of remembering the command until I forgot how the command actually goes.
Hungry_Instance9764@reddit
==== .bat file
REM Navigate to the project directory
cd C:\Users\Eugene\Documents\git\gallama
REM Run the PowerShell script to activate the virtual environment
powershell -ExecutionPolicy Bypass -Command ".\.venv\Scripts\Activate.ps1; gallama run -id 'model_id=coder30b_6_5 draft_model_id=coder1.5b'; deactivate"
====
My model_config is:
coder30b_6_5:
  backend: exllama
  cache_quant: Q6
  gpus: auto
  max_seq_len: 128000
  model_id: C:\Users\Eugene\gallama\models\Qwen2.5-Coder-32B-Instruct-exl2
  prompt_template: Qwen2
  quant: 6.5
coder1.5b:
  backend: exllama
  cache_quant: Q6
  gpus: auto
  max_seq_len: 32000
  model_id: C:\Users\Eugene\gallama\models\Qwen_Qwen2.5-Coder-1.5B-Instruct-exl2
  prompt_template: Qwen2
  quant: Q8
MachineZer0@reddit
Thanks for posting this. Will check out gallama
necrogay@reddit
Single 4090, exl2 4.8bpw, 32k context, 32-36 tok/s
HikaruZA@reddit
TabbyAPI/exllamav2, 4.0 bpw with FP16 cache and 20k+ context, plus a 4.0 bpw exl2 0.5B-coder draft model. Getting up to 75 tokens/s, typically in the 60s. I prefer running neutral samplers with just 0.01 min_p.
appakaradi@reddit
vLLM with 4-bit AWQ quantization.
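If anyone wants to try that route, a minimal launch would look roughly like this; the repo name and context length are assumptions on my part, not a tested config:

```
# Hypothetical vLLM launch for the 4-bit AWQ quant on a single 24 GB card.
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
  --quantization awq --max-model-len 8192 --gpu-memory-utilization 0.95
```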
SniperDuty@reddit
Yeah, I'm getting 2.84 tok/sec on my 4090 and a hell of a lot more than that on my M4 Max. I will save this post and go through the advice on here in the morning. Thanks for posting, OP.
infiniteContrast@reddit
You must use the exl2 quant.
For a single 24 GB VRAM card you need the 4.5bpw exl2 quant, and you can maybe fit more than 16k context in that card with 4-bit cache.
rerri@reddit
This is not true.
I run IQ4_XS quant of Qwen 32B, n_ctx 32768, cache_8bit. No issues and even some headroom for more context.
Q4_K_M or Q4_K_L are fine too with shorter context length.
14B Q6_K with ~100k context fits.
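If you'd rather run the same thing through llama.cpp directly instead of ooba, the equivalent knobs would be roughly this (filename is a placeholder):

```
# IQ4_XS weights, 32k context, full offload, flash attention, 8-bit K/V cache.
llama-server -m ./Qwen2.5-Coder-32B-Instruct-IQ4_XS.gguf \
  -c 32768 -ngl 99 -fa -ctk q8_0 -ctv q8_0
```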
superfsm@reddit
Cline doesn't work with this setup, right?
I am tempted, but I've read about bad results, so I am afraid.
rerri@reddit
I've never used Cline, but ooba does have an OpenAI-compatible API, and Cline seems to support that. So maybe you could connect them that way. Definitely not sure, I only just googled what Cline is.
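For what it's worth, the usual way to wire that up (untested with Cline on my end) would be to start ooba with the API enabled and point Cline's OpenAI-compatible provider at it:

```
# Start text-generation-webui with its OpenAI-compatible API (default port 5000),
# then set Cline's base URL to http://127.0.0.1:5000/v1 with any dummy API key.
python server.py --api --listen
```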
gamesntech@reddit
Is the GPU actually being used? That seems like the most likely problem
Prudence-0@reddit
Use llama.cpp (or LM Studio) with 36 layers on the GPU. Works fine with GGUF Q8.
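Something like this, presumably; the filename is a placeholder, and since a Q8_0 of the 32B is ~35 GB, only part of it fits in 24 GB, hence the partial offload:

```
# Partial offload: ~36 of 64 layers on the GPU, the rest in system RAM.
llama-server -m ./qwen2.5-coder-32b-instruct-q8_0.gguf -ngl 36 -c 8192
```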
Downtown-Case-1755@reddit
Use TabbyAPI, Q6 cache, and a 4.0-4.5bpw exl2 quantization. Voilà!
Ooba is less than ideal because its tokenizer and some other bits are really slow at long context, and you might also be using a suboptimal quantization for your hardware. It also has no option for Q6 cache, and some other missing bits.
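A minimal sketch of what that looks like in TabbyAPI's config.yml, going from memory, so the key names and the folder name are assumptions; double-check against the config_sample.yml that ships with TabbyAPI:

```
# Hypothetical TabbyAPI config snippet for a 4.0-4.5bpw exl2 quant with Q6 cache.
model:
  model_dir: models
  model_name: Qwen2.5-Coder-32B-Instruct-exl2-4.5bpw   # placeholder folder name
  max_seq_len: 32768
  cache_mode: Q6
```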
KL_GPU@reddit
Check your context length; your GPU should be able to support up to ~20k tokens. Also, your webui uses llama-cpp-python, which in my testing with a Tesla P40 delivers about 2/3 of the possible performance.
rerri@reddit
This is my guess too. OP probably did not touch n_ctx, which defaults to 128k. Too much.
Xyzzymoon@reddit
If you are using quantization anyway, try koboldcpp first; no setup required, really.
oobabooga is not easy to troubleshoot.
firemeaway@reddit (OP)
Thanks. I tried koboldcpp but it says it only takes the GGUF format, whereas the model is a bunch of safetensors files, so I can’t launch it.
Xyzzymoon@reddit
Just download a GGUF version. You are quantizing down to 4-bit anyway, so you are not getting much benefit from not using GGUF. Those are much easier to run.
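Once you have a GGUF (e.g. a Q4_K_M of Qwen2.5-Coder-32B-Instruct from one of the HF repos), launching it is basically one line; flag names below are from memory, so check koboldcpp's --help for your version:

```
# Hypothetical koboldcpp launch for a Q4_K_M GGUF on a 4090.
python koboldcpp.py --model Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
  --usecublas --gpulayers 99 --contextsize 8192 --flashattention
```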