Running Mimo 2.5 q4_k_m on single rtx5090 need recommendations
Posted by BlackBeardAI@reddit | LocalLLaMA | 2 comments
Getting 10.3 tps using this command:
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=8 GOMP_CPU_AFFINITY="0 2 4 6 8 10 12 14" ./build-mimo-5090-3090/bin/llama-server -m "$MIMO" -ngl 999 --n-cpu-moe 43 --no-mmap -c 100000 -ctk q8_0 -ctv q8_0 -fa on --main-gpu 0 -t 8 --prio 3 --host 0.0.0.0 --port 8083
CPU: 9950X3D (using the iGPU for display) | RAM: 256 GB DDR5-5600 | GPU: single RTX 5090 | OS: Linux Mint 22.xx
Is 10.3 tps the absolute limit for token generation? I guess a turbo quant is the only way to move forward from here. Or is there anything else I can do to squeeze out 1-2 more tps?
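With experts offloaded via `--n-cpu-moe`, decode speed is usually bound by system RAM bandwidth rather than the GPU, so a quick back-of-envelope check shows roughly how much headroom is left. A minimal sketch; the 5 GB active-expert figure below is a placeholder assumption, not MiMo's real number, and the efficiency factor is a rough guess:

```python
# Rough sanity check: CPU-offloaded MoE decode streams the active expert
# weights from RAM on every token, so peak RAM bandwidth caps tokens/s.

def ddr5_bandwidth_gbs(mt_s: int, channels: int = 2, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s for a dual-channel DDR5 setup."""
    return mt_s * channels * bus_bytes / 1000

def tps_ceiling(active_gb_on_cpu: float, bw_gbs: float, efficiency: float = 0.6) -> float:
    """Upper bound on tokens/s if each token streams active_gb_on_cpu from RAM."""
    return bw_gbs * efficiency / active_gb_on_cpu

if __name__ == "__main__":
    for mt in (5600, 6400):
        bw = ddr5_bandwidth_gbs(mt)
        # 5.0 GB of CPU-resident active weights per token is an assumed placeholder
        print(f"DDR5-{mt}: {bw:.1f} GB/s peak, ~{tps_ceiling(5.0, bw):.1f} tps ceiling")
```

The ceiling scales linearly with memory speed, which is consistent with faster RAM (or a smaller quant, which shrinks the bytes streamed per token) being the main levers once the GPU layers are maxed out.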
Expert-Dig-1768@reddit
That's insane. Where did you get 256 GB of RAM??
Shoddy_Bed3240@reddit
I was able to get 13 t/s with UD-Q4_K_XL, and I'm running 6400 MT/s memory. That's probably about the ceiling for now, at least until llama.cpp adds MTP decoding support for that model.