Dual Xeon E5-2696v4 + 512GB RAM + RTX 3090 Ti local LLM for ISP sysadmin work — benchmarks + questions
Posted by OkBase5453@reddit | LocalLLaMA | View on Reddit | 19 comments
Hi all! After two months of reading, asking, testing... headaches, and a living-room noise level of over 90 dB (wife threatening to leave at one point), I am finally posting my setup. I work as a sysadmin/DevOps engineer, and I've been building a local AI inference rig for both professional and personal use out of some old company hardware. I've been benchmarking ik_llama.cpp (because it was better at CPU-only inference than llama.cpp) and would love community input on models and configuration tips/tricks!
Hardware
- CPU: 2× Intel Xeon E5-2696v4 (44c/88t total)
- RAM: 512GB DDR4 2400 ECC LR-DIMM
- Motherboard: Supermicro X10DRi-LN4+ (Dual Socket 2011) PCI-E 3.0 x16
- GPU: MSI RTX 3090 Ti 24GB
- NVMe: 2× Intel SSD DC P3700 400GB for faster model loading (I think; haven't tested it yet)
- Runtime: ik_llama.cpp & llama.cpp in Debian 12 LXC on Proxmox Baremetal
Benchmarks (ik_llama.cpp build 4400 / llama.cpp build 8739, numactl --interleave=all, --mmap 0)
| Model | Quant | Size | Backend | Config | pp1024 t/s | tg128 t/s |
|---|---|---|---|---|---|---|
| Qwen3.5-27B | Q4_K_M | 15.4 GiB | ik_llama.cpp CUDA | ngl=999, t=78 | 1535 | 46.2 |
| Qwen3.5-27B | Q4_K_M | 15.4 GiB | llama.cpp BLAS+CUDA | ngl=99, t=78 | 1521 | 44.5 |
| Qwen3.5-27B Distilled (Claude 4.6 reasoning) | i1-Q4_K_M | 15.4 GiB | CUDA ngl=99 | t=78 | 1514 | 44.4 |
| Gemma 4 31B | Q4_K_M | 17.8 GiB | ik_llama.cpp CUDA | ngl=999, t=78 | 1518 | 42.9 |
| Gemma 4 31B | Q4_K_M | 17.1 GiB | llama.cpp BLAS+CUDA | ngl=99, t=78 | 1441 | 40.8 |
| Qwen3.5-27B | Q4_K_M | 15.4 GiB | CPU only | t=80 | 51 | 5.4 |
| Qwen3.5-35B MoE A3B | Q4_K_M | 20.5 GiB | CPU only | t=42 | 264 | 23.2 |
| Qwen3-Coder-Next 80B A3B | Q4_K_XL | 46.2 GiB | CUDA ngl=20 + CPU | t=65 | 427 | 23.7 |
| Qwen3-Coder-Next 80B A3B | Q4_K_S | 42.4 GiB | CPU only | t=78 | 209 | 21.9 |
| Qwen3.5-122B MoE A10B | Q4_K_M | 71.3 GiB | CPU only | t=78 | 105 | 9.3 |
Notable: Gemma 4 31B on CUDA (1518 pp / 42.9 tg) is nearly identical to Qwen3.5-27B (1535 pp / 46.2 tg) despite being larger. ik_llama.cpp consistently outperforms llama.cpp by ~1–5% on both models. I have a problem with partially offloading Qwen3.5-122B to the CPU/RAM, so I could not test it further.
root@llama-cpp:~# time numactl --interleave=all /opt/ik_llama.cpp/build/bin/llama-bench -m /mnt/models/Qwen3.5/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf -ngl 14 -t 79 -p 1024 -n 128 --mmap 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
| model | size | params | backend | ngl | threads | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ---: | ------------: | ---------------: |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA | 14 | 79 | 0 | pp1024 | 218.72 ± 6.34 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA | 14 | 79 | 0 | tg128 | 10.87 ± 0.08 |
build: 13d7178d (4400)
real    2m22.338s
user    98m2.039s
sys     1m5.814s
root@llama-cpp:~# numactl --interleave=all /opt/ik_llama.cpp/build/bin/llama-bench -m /mnt/models/Qwen3.5/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf -ngl 18 -t 79 -p 1024 -n 128 --mmap 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
| model | size | params | backend | ngl | threads | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ---: | ------------: | ---------------: |
main: error: failed to load model '/mnt/models/Qwen3.5/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf'
My Use Cases
- scripting & automation — Bash/Python scripts for network ops
- Server deployment — Proxmox/LXC planning, application installation, migrations, full deployment workflows
- MCP + vendor docs — proprietary vendor PDFs of 1000+ pages; the model should read them, then help write configs and installation plans ← main use case
- Side project — iOS/Android board game developing
The MCP server use case is the critical one here... I want the model to ingest large vendor manuals via MCP file-system tools and then answer questions, write configs, and create step-by-step installation plans. Context length and instruction-following quality matter a lot here.
Questions
- Best model for long vendor doc → installation/migration/upgrade plan workflows? Currently on Qwen3-Coder-Next 80B (ngl=20). Is Qwen3.5 27B or Gemma 4 31B better for long-context instruction following? Or any other better options?
- Optimal ngl and other helpful configuration for the RTX 3090's 24GB VRAM? At ngl=20: 427 pp / 23.7 tg for Qwen3-Coder-Next. Anyone found a better split? Is there a formula for MoE layer-to-VRAM mapping? Why can't I go higher than ngl=20?
- Qwen3.5-122B at 9 t/s tg — usable for interactive chat? I have 512GB RAM, so it fits. Any tricks to squeeze out more speed?
- HAVE_FANCY_SIMD is NOT defined on Broadwell-EP (AVX2, no AVX-512) — expected, or am I missing a compile flag in ik_llama.cpp/llama.cpp?
- Gemma 4 31B real-world impressions? It fits in my VRAM. Anyone comparing it to Qwen3.5-27/32B for agentic/technical tasks?
Happy to share raw bench logs. Thanks! 🙏
P.S. my first reddit post(be gentle) :)
MelodicRecognition7@reddit
you could swap fans with quieter models and/or control fan speed manually with ipmitool
OkBase5453@reddit (OP)
yes, I do use ipmitool to control them, but at, let's say, speed 10 the CPU gets pretty hot during GPU+CPU inference :)
jacek2023@reddit
strange build, single 3090 and 512GB RAM - why?
focus on --n-cpu-moe not -ngl
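For a MoE model, that advice can be sketched roughly like this (paths and the expert-layer count are illustrative, not a tested recipe):

```shell
# Offload everything to GPU first (-ngl 999), then push the expert
# tensors of the first 40 layers back to CPU RAM (--n-cpu-moe 40).
# Attention + dense weights stay on the GPU, where they matter most.
/opt/llama.cpp/build/bin/llama-server \
  -m /mnt/models/Qwen3.5/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf \
  -ngl 999 --n-cpu-moe 40 \
  -t 22 -c 32768
```

Decrease --n-cpu-moe step by step while watching nvidia-smi; the best value is the smallest one that doesn't OOM.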
OkBase5453@reddit (OP)
RAM was there with the server, i just got the RTX 3090 to play around...
jacek2023@reddit
try using two
OkBase5453@reddit (OP)
That is the future plan. I can't find an NVLink for the 30xx here in Germany :)
jacek2023@reddit
I don't use nvlink
OkBase5453@reddit (OP)
I think I have to, since PCIe 3 :/
ttkciar@reddit
My HPC servers are very similar to yours, and PCIe 3.0 does surprisingly well. Give it a shot without NVLink. I think you will find it serviceable, and you will be able to use larger models while watching eBay for NVLink.
OkBase5453@reddit (OP)
yes, and it is a home lab server, so not only inference... do you use one GPU or many? Interested to see your llama.cpp config
Decent-Occasion-2720@reddit
I have a similar setup, but with only a 3060 12GB.
For me the 122B is the fastest and most capable, roughly equivalent to the 27B. But I don't have the VRAM to load the whole 27B...
For the 122B, load the 10B active parameters on the GPU and all or some of the experts on the CPU (--n-cpu-moe, --cpu-moe). Try to fit the KV cache in VRAM, depending on the context size you want.
Test quants for the KV cache; personally I use q5_1 for K and q4_0 for V and I'm satisfied with that.
I find the Qwen 3.5 models tolerate quantization quite well, and for me UD-Q2_K_XL is still very decent for my usage (opencode and 120k-token context).
9 tok/s for the 122B Q4 on CPU is better than my Xeon 6242s (2× 16 cores + AVX-512) with DDR4-2933.
I reserved one CPU node plus 8GB of RAM for the system and dedicated the other node to llama.cpp; I pin the 16 dedicated cores with numactl -a -C 0,1,2,3,4... -m 0 ...
In my case I observed that:
- running llama.cpp across both CPU nodes degrades performance
- hyper-threading brings nothing; set threads = number of physical cores
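The KV-cache quantization suggested above maps to these llama.cpp flags (model path and context size are illustrative; note that a quantized V cache requires flash attention, and the flag syntax, bare `-fa` vs. `-fa on`, varies by build):

```shell
# q5_1 keys / q4_0 values roughly halve KV-cache VRAM vs. f16,
# freeing room for a longer context on a small GPU.
/opt/llama.cpp/build/bin/llama-server \
  -m /mnt/models/Qwen3.5/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf \
  -ngl 999 --n-cpu-moe 40 \
  -fa on \
  --cache-type-k q5_1 --cache-type-v q4_0 \
  -c 120000
```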
OkBase5453@reddit (OP)
root@llama-cpp:~# numactl -C 0-21,22-44 -m 0 /opt/ik_llama.cpp/build/bin/llama-bench -m /mnt/models/Qwen3.5/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf -ngl 14 --n-cpu-moe 5,10,15 -t 22,44 -p 1024 -n 128 --mmap 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
| model | size | params | backend | ngl | threads | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ---: | ------------: | ---------------: |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA | 14 | 22 | 0 | pp1024 | 142.40 ± 3.03 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA | 14 | 22 | 0 | tg128 | 6.82 ± 0.02 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA | 14 | 44 | 0 | pp1024 | 121.60 ± 1.17 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA | 14 | 44 | 0 | tg128 | 3.98 ± 0.00 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA | 14 | 22 | 1 | pp1024 | 143.57 ± 1.08 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA | 14 | 22 | 1 | tg128 | 6.82 ± 0.03 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA | 14 | 44 | 1 | pp1024 | 121.26 ± 1.64 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA | 14 | 44 | 1 | tg128 | 3.98 ± 0.00 |
build: 13d7178d (4400)
root@llama-cpp:~# numactl -C 0-43 -m 0 /opt/ik_llama.cpp/build/bin/llama-bench -m /mnt/models/Qwen3.5/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf -ngl 14 --n-cpu-moe 10 -t 78 -p 1024 -n 128 --mmap 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
| model | size | params | backend | ngl | threads | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ---: | ------------: | ---------------: |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA | 14 | 78 | 0 | pp1024 | 121.13 ± 1.79 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA | 14 | 78 | 0 | tg128 | 3.56 ± 0.00 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA | 14 | 78 | 1 | pp1024 | 123.41 ± 1.39 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA | 14 | 78 | 1 | tg128 | 3.55 ± 0.00 |
build: 13d7178d (4400)
root@llama-cpp:~# numactl --interleave=all /opt/ik_llama.cpp/build/bin/llama-bench -m /mnt/models/Qwen3.5/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf -ngl 14 --n-cpu-moe 10 -t 22,78 -p 1024 -n 128 --mmap 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
| model | size | params | backend | ngl | threads | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ---: | ------------: | ---------------: |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA | 14 | 22 | 0 | pp1024 | 146.81 ± 1.96 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA | 14 | 22 | 0 | tg128 | 10.51 ± 0.01 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA | 14 | 78 | 0 | pp1024 | 219.52 ± 2.50 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA | 14 | 78 | 0 | tg128 | 10.79 ± 0.09 |
build: 13d7178d (4400)
root@llama-cpp:~# numactl --interleave=all /opt/llama.cpp/build/bin/llama-bench -m /mnt/models/Qwen3.5/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf -ngl 14 --n-cpu-moe 10 -t 22,78 -p 1024 -n 128 --mmap 0
load_backend: loaded BLAS backend from /opt/llama.cpp/build/bin/libggml-blas.so
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24112 MiB):
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
load_backend: loaded CUDA backend from /opt/llama.cpp/build/bin/libggml-cuda.so
load_backend: loaded CPU backend from /opt/llama.cpp/build/bin/libggml-cpu-haswell.so
| model | size | params | backend | n_cpu_moe | threads | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ------: | ---: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | BLAS,CUDA | 10 | 22 | 0 | pp1024 | 50.71 ± 0.58 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | BLAS,CUDA | 10 | 22 | 0 | tg128 | 6.48 ± 0.02 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | BLAS,CUDA | 10 | 78 | 0 | pp1024 | 30.63 ± 0.12 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | BLAS,CUDA | 10 | 78 | 0 | tg128 | 6.40 ± 0.05 |
build: d132f22fc (8739)
root@llama-cpp:~# numactl -C 0-21 -m 0 /opt/llama.cpp/build/bin/llama-bench -m /mnt/models/Qwen3.5/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf -ngl 14 --n-cpu-moe 10 -t 22 -p 1024 -n 128 --mmap 0
load_backend: loaded BLAS backend from /opt/llama.cpp/build/bin/libggml-blas.so
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24112 MiB):
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
load_backend: loaded CUDA backend from /opt/llama.cpp/build/bin/libggml-cuda.so
load_backend: loaded CPU backend from /opt/llama.cpp/build/bin/libggml-cpu-haswell.so
| model | size | params | backend | n_cpu_moe | threads | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ------: | ---: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | BLAS,CUDA | 10 | 22 | 0 | pp1024 | 28.35 ± 0.26 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | BLAS,CUDA | 10 | 22 | 0 | tg128 | 4.25 ± 0.00 |
build: d132f22fc (8739)
Well, now I got this, and llama.cpp is a disaster on my setup.
ai_guy_nerd@reddit
That's a serious rig—90dB wife-complaints earned the data. A couple quick observations from running local inference at scale:
Your pp1024 numbers on Qwen are solid. For sysadmin workloads, consider mixed quantization: run most tasks on the 27B, but keep a 70B loaded on GPU for the occasional deep-dive query that needs more reasoning. You've got VRAM headroom and CPU cores to spare. The real win: 44 cores mean you can parallelize inference + log analysis without touching the GPU, which keeps your model latency predictable under load.
One thing to test: disable numactl interleave and pin inference threads to socket 0 (CPU-to-GPU proximity matters more than even distribution on Xeon boards). You should see a 3-5% throughput lift. Also log your clock speeds under load; if the CPUs throttle below 2.5GHz, airflow or thermals are limiting you more than raw performance.
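A sketch of that pinning, assuming socket 0 holds the GPU's PCIe root complex (confirm with `nvidia-smi topo -m` before relying on it):

```shell
# Bind threads and memory allocation to NUMA node 0 instead of interleaving.
numactl --cpunodebind=0 --membind=0 \
  /opt/ik_llama.cpp/build/bin/llama-bench \
  -m /mnt/models/Qwen3.5/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf \
  -ngl 14 -t 22 -p 1024 -n 128 --mmap 0
```

Worth A/B-testing against `--interleave=all`: the 71 GiB model fits in one node's memory, but access patterns can still favor interleaving on some workloads.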
Status_Record_1839@reddit
For your MCP + vendor docs use case, Qwen3.5-27B at ngl=99 is probably the sweet spot - better instruction following than the 80B MoE at ngl=20 and much faster. The 122B at 9 t/s is usable if you can tolerate the wait. HAVE_FANCY_SIMD not defined on Broadwell-EP is expected since it needs AVX-512, nothing broken there.
Status_Record_1839@reddit
For your MCP + vendor doc use case, Qwen3-Coder-Next 80B at ngl=20 is actually solid. For long context instruction following it edges out Qwen3.5-27B, especially on structured technical docs. Gemma 4 31B is competitive but tends to hallucinate more on config generation in my experience.
On the ngl limit at 20: with 24GB VRAM and a 46GB model you're fitting roughly 24/46 = ~52% of layers on GPU. The rest goes to system RAM over PCIe 3.0 which creates a bottleneck. You likely can't go higher without OOM. The formula is roughly (VRAM_GB / model_size_GB) * total_layers.
Qwen3.5-122B at 9 t/s is borderline for interactive chat but workable if you're not in a hurry. With 512GB RAM it fits fine. HAVE_FANCY_SIMD not defined on Broadwell-EP is expected - AVX-512 is required for that path and E5-2696v4 doesn't have it.
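That rule of thumb can be sketched in shell arithmetic (the layer count and KV headroom here are assumptions; read the real layer count from llama.cpp's model-load log):

```shell
# ngl estimate: usable VRAM / model size * total layers,
# after reserving headroom for KV cache + CUDA compute buffers.
VRAM_GIB=24
MODEL_GIB=46
TOTAL_LAYERS=48        # assumed; check the load log for the real value
KV_HEADROOM_GIB=4      # assumed reserve for KV cache and buffers

NGL=$(( (VRAM_GIB - KV_HEADROOM_GIB) * TOTAL_LAYERS / MODEL_GIB ))
echo "suggested -ngl: $NGL"
```

With these numbers the estimate lands right around the ngl=20 the OP found empirically.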
a_beautiful_rhind@reddit
I had a system like that as a GPU host. The ram speeds aren't that great but better than consumer DDR4.
Enjoy your time in the numa waiting room.
Impossible_Art9151@reddit
nice setup! some thoughts
ik_llama.cpp consistently outperforms llama.cpp by ~1–5%
I wouldn't focus on tweaking here. 1–5% is negligible, not worth the effort.
If speed matters, get more VRAM.
With 24GB you are hitting the CPU penalty: speed drops off "exponentially" toward CPU speed with every GB that doesn't fit into VRAM.
Qwen3.5-122B at 9 t/s tg — usable for interactive chat?
9 t/s would be close to "usable" if it weren't for the thinking delays. => more VRAM! Turning thinking off might help, but lowers quality.
Your setup is great; keep in mind you can run GLM 5.1 at a reasonable quant, but at awfully low speeds.
If I were you: Keep this setup for testing, developing, private use, use cases where speed does not matter, batched processing...
Instead of pimping this setup with expensive cards like an RTX 6000 Pro, consider a purchase like an NVIDIA DGX, where you can run Qwen 3.5 122B at higher speeds. Your system is great at what it is. Other use cases => other hardware.
I am not expert enough for your use cases. They sound ambitious ....
Status_Record_1839@reddit
For your MCP + vendor docs use case, Qwen3.5-27B at 46 tg/s will serve you better than Qwen3-Coder-Next 80B at 23 tg/s. The speed difference is very noticeable when iterating over long manuals. Gemma 4 31B is comparable in quality to Qwen3.5-27B for instruction following but I'd stick with Qwen for structured config generation - it tends to be more reliable with strict output formats. For the HAVE_FANCY_SIMD issue on Broadwell-EP that's expected, AVX-512 is required for it and your E5-2696v4 only has AVX2.
Monad_Maya@reddit
Run a larger model if speed is not the topmost priority. MiniMax 2.5 or Stepfun 3.5 (?)
https://np.reddit.com/r/LocalLLaMA/comments/1mngl7i/how_does_ncpumoe_and_cpumoe_params_help_over/
I wasn't impressed by Qwen 122B's perf, so I moved to MiniMax m2.5. Slightly slower but better at coding. Usable for chat but not much more. The generation speed is largely a non-issue; it's the prompt processing speed that kills the overall "vibe".
No clue honestly; my 5900X is using some haswell_cpu backend when I look at the llama.cpp logs.
Qwen 27B is roughly equal to Gemma 31B. Gemma has better world knowledge, but that might not be an issue for you. I'd prefer Qwen 3.5 27B + more context for programming needs.
It does overthink a bit but less issues with tool call failures or weird repetition bugs.