Dual Xeon E5-2696v4 + 512GB RAM + RTX 3090 Ti local LLM for ISP sysadmin work — benchmarks + questions

Posted by OkBase5453@reddit | LocalLLaMA | 19 comments

Hi all! Finally, after 2 months of reading, asking, testing... plus headaches and a living-room noise level of over 90 dB (wife threatening to leave at one point), I am posting my setup. I work as a sysadmin/DevOps engineer, and I've been building a local AI inference rig for both professional and personal use out of some old company hardware. I've been benchmarking ik_llama.cpp (because it was better at CPU-only inference than llama.cpp) and would love community input on models and configuration tips/tricks!


Hardware

- CPU: 2× Intel Xeon E5-2696 v4 (Broadwell-EP: AVX2, no AVX-512)
- RAM: 512 GB
- GPU: NVIDIA GeForce RTX 3090 Ti (24 GB VRAM)

Benchmarks (ik_llama.cpp build 4400 / llama.cpp build 8739, numactl --interleave=all, --mmap 0)

| Model | Quant | Size | Backend | Config | pp1024 t/s | tg128 t/s |
| --- | --- | --- | --- | --- | ---: | ---: |
| Qwen3.5-27B | Q4_K_M | 15.4 GiB | ik_llama.cpp CUDA | ngl=999, t=78 | 1535 | 46.2 |
| Qwen3.5-27B | Q4_K_M | 15.4 GiB | llama.cpp BLAS+CUDA | ngl=99, t=78 | 1521 | 44.5 |
| Qwen3.5-27B Distilled (Claude 4.6 reasoning) | i1-Q4_K_M | 15.4 GiB | CUDA | ngl=99, t=78 | 1514 | 44.4 |
| Gemma 4 31B | Q4_K_M | 17.8 GiB | ik_llama.cpp CUDA | ngl=999, t=78 | 1518 | 42.9 |
| Gemma 4 31B | Q4_K_M | 17.1 GiB | llama.cpp BLAS+CUDA | ngl=99, t=78 | 1441 | 40.8 |
| Qwen3.5-27B | Q4_K_M | 15.4 GiB | CPU only | t=80 | 51 | 5.4 |
| Qwen3.5-35B MoE A3B | Q4_K_M | 20.5 GiB | CPU only | t=42 | 264 | 23.2 |
| Qwen3-Coder-Next 80B A3B | Q4_K_XL | 46.2 GiB | CUDA + CPU | ngl=20, t=65 | 427 | 23.7 |
| Qwen3-Coder-Next 80B A3B | Q4_K_S | 42.4 GiB | CPU only | t=78 | 209 | 21.9 |
| Qwen3.5-122B MoE A10B | Q4_K_M | 71.3 GiB | CPU only | t=78 | 105 | 9.3 |

Notable: Gemma 4 31B on CUDA (1518 pp / 42.9 tg) is nearly identical to Qwen3.5-27B (1535 pp / 46.2 tg) despite being the larger model. ik_llama.cpp consistently outperforms llama.cpp by ~1–5% on both models. I have a problem with partially offloading Qwen3.5-122B to the GPU while keeping the rest in CPU/RAM, so I could not test it further.

```
root@llama-cpp:~# time numactl --interleave=all /opt/ik_llama.cpp/build/bin/llama-bench -m /mnt/models/Qwen3.5/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf -ngl 14 -t 79 -p 1024 -n 128 --mmap 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
| model                             |      size |   params | backend | ngl | threads | mmap |   test |           t/s |
| --------------------------------- | --------: | -------: | ------- | --: | ------: | ---: | -----: | ------------: |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA    |  14 |      79 |    0 | pp1024 | 218.72 ± 6.34 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.27 GiB | 122.11 B | CUDA    |  14 |      79 |    0 |  tg128 |  10.87 ± 0.08 |

build: 13d7178d (4400)

real    2m22.338s
user    98m2.039s
sys     1m5.814s

root@llama-cpp:~# numactl --interleave=all /opt/ik_llama.cpp/build/bin/llama-bench -m /mnt/models/Qwen3.5/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf -ngl 18 -t 79 -p 1024 -n 128 --mmap 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
main: error: failed to load model '/mnt/models/Qwen3.5/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf'
```
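One thing I still want to try for the 122B MoE offload problem (a sketch, not tested here): instead of raising `-ngl` until the load fails, offload all layers and use `--override-tensor` (`-ot`) to pin the large MoE expert tensors to CPU, keeping attention and shared weights on the GPU. The tensor-name regex below follows common GGUF naming for MoE experts and is an assumption; check your model's actual tensor names first.

```shell
# Build (and just print) a candidate command: -ngl 999 offloads everything,
# then -ot forces the per-expert FFN tensors back onto CPU RAM. Paths, thread
# count, and the regex are carried over from the bench runs above.
CMD="numactl --interleave=all /opt/ik_llama.cpp/build/bin/llama-bench \
  -m /mnt/models/Qwen3.5/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf \
  -ngl 999 -t 79 -p 1024 -n 128 --mmap 0 \
  -ot 'blk\..*\.ffn_.*_exps\.=CPU'"
echo "$CMD"
```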

My Use Cases

  1. Scripting & automation — Bash/Python scripts for network ops
  2. Server deployment — Proxmox/LXC planning, application installation, migrations, full deployment workflows
  3. MCP + vendor docs — proprietary vendor PDFs of >1000 pages; the model should read them and then help write configs and installation plans ← main use case
  4. Side project — iOS/Android board game development

The MCP server use case is the critical one here... I want the model to ingest large vendor manuals via MCP file-system tools and then answer questions, write configs, and create step-by-step installation plans. Context length and instruction-following quality matter a lot here.
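My current prep idea for that workflow (a sketch under assumptions, not a settled pipeline): the MCP filesystem server can only hand the model what is on disk, so the manuals get text-extracted first (e.g. with poppler's pdftotext) and then split into pieces small enough to read into context one at a time. The 100 KB stand-in file and 64 KB piece size below are illustrative, not tuned values.

```shell
# Stand-in for text already extracted from a vendor PDF:
head -c 100000 /dev/zero | tr '\0' 'x' > manual.txt

# Split into context-sized pieces the MCP file-system tools can serve one by one
# (GNU coreutils split; 64 KB per piece is an illustrative assumption):
split -b 64K --numeric-suffixes=1 --additional-suffix=.txt manual.txt manual.part
ls manual.part*.txt
```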


Questions

  1. Best model for long vendor doc → installation/migration/upgrade plan workflows? Currently on Qwen3-Coder-Next 80B (ngl=20). Is Qwen3.5-27B or Gemma 4 31B better for long-context instruction following? Or are there better options?

  2. Optimal ngl and other helpful configuration for these models on an RTX 3090 Ti with 24 GB VRAM? At ngl=20 I get 427 pp / 23.7 tg for Qwen3-Coder-Next. Has anyone found a better split? Is there a formula for mapping MoE layers to VRAM? And why can't I go above ngl=20?

  3. Qwen3.5-122B at 9 t/s tg — is that usable for interactive chat? It fits, since I have 512 GB RAM. Any tricks to squeeze out more speed?

  4. HAVE_FANCY_SIMD is NOT defined on Broadwell-EP (AVX2, no AVX-512) — is that expected, or am I missing a compile flag in ik_llama.cpp/llama.cpp?

  5. Gemma 4 31B real-world impressions? It fits in my VRAM. Has anyone compared it to Qwen3.5-27/32B for agentic/technical tasks?
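On the layer-to-VRAM question above, here is the back-of-the-envelope rule I've been using (a sketch, not a real formula): approximate per-layer cost as GGUF file size divided by layer count, reserve headroom for KV cache and CUDA/compute buffers, and divide what's left. The 48-layer count and 4 GiB overhead below are illustrative assumptions, not measured values.

```shell
# Rough max-ngl estimate: (VRAM - overhead) / (file size / layer count).
file_gib=46.2     # Qwen3-Coder-Next Q4_K_XL size from the table above
n_layers=48       # hypothetical layer count, not checked against the GGUF
vram_gib=24
overhead_gib=4    # assumed KV cache + CUDA buffer headroom
awk -v s="$file_gib" -v n="$n_layers" -v v="$vram_gib" -v o="$overhead_gib" \
    'BEGIN { printf "max ngl ~ %d\n", (v - o) / (s / n) }'
```

With these assumed numbers it lands around 20, which at least matches what I'm seeing, but the real per-layer cost of a MoE model is uneven, so treat it as a starting point only.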


Happy to share raw bench logs. Thanks! 🙏

P.S. My first reddit post (be gentle) :)