Update on 12x32gb sxm v100 cluster / local AI for legal drafting

Posted by TumbleweedNew6515@reddit | LocalLLaMA | 76 comments

Update from the lawyer with the V100 server. A few of you asked what I actually ended up running once the dust settled, so here it is. Still just a lawyer, still driving the whole thing through Claude Code, still not fully sure what I'm doing — but it works now, which is more than I could say last time.

First, the hardware caught up to the plan. The last two V100s are in, so the "final form" I promised is real: twelve V100-SXM2 cards on the Threadripper Pro, all 32GB except for one 16GB card. The layout is Board A on GPUs {4,5,8,9}, Board B on {6,7,10,11}, an NVLink pair on {0,1}, and a mixed pair on {2,3} that holds the 16GB card. Split a model across two different NVLink boards and throughput falls off a cliff (the cross-board hop is PCIe/NUMA, not NVLink), so I keep every model inside one board. Learned that one the expensive way.
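
A minimal sketch of the "one model, one board" rule, assuming the GPU indices above are the ones CUDA actually reports (that depends on device ordering, so treat it as an assumption). The names `BOARDS`, `board_for`, and `visible_devices` are made up for illustration and are not part of any tool.

```python
# Board layout as described in the post. Index numbering is assumed to match
# what CUDA reports on this machine.
BOARDS = {
    "board_a": {4, 5, 8, 9},
    "board_b": {6, 7, 10, 11},
    "pair_01": {0, 1},
    "pair_23": {2, 3},  # mixed pair, one card is 16GB
}


def board_for(gpus):
    """Return the name of the single board that contains every GPU in `gpus`."""
    wanted = set(gpus)
    for name, members in BOARDS.items():
        if wanted and wanted <= members:
            return name
    raise ValueError(f"GPUs {sorted(wanted)} do not sit on one board")


def visible_devices(gpus):
    """Build a CUDA_VISIBLE_DEVICES value, refusing cross-board splits."""
    board_for(gpus)  # raises if the set spans boards
    return ",".join(str(g) for g in sorted(gpus))
```

Calling `visible_devices([4, 5, 8, 9])` gives a string that can be exported before launching a server, while a mixed set such as `[4, 6]` fails loudly instead of silently crossing the PCIe/NUMA hop.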

And yeah, I caved and built the second box. EPYC 7302P, 512GB RAM, 4x RTX 3090 + 2x V100-PCIe. The mid-life crisis remains on schedule.

The bigger change: I gave up on vLLM for the local models. Not because vLLM is bad — because the models I actually want are MoE GGUFs, and vLLM on Volta is a dead end for those (FP8/AWQ/Marlin all want SM75+, the GPTQ kernels are broken on 7.0). I moved the whole thing to llama.cpp (mainline — a recent build finally fixed a Gemma chat-parser bug that had been mangling my long prompts).

Here's the part that's the opposite of what my first post implied: on V100, dense models are a trap. Only MoE clears a usable speed. Rough decode numbers — Q8 GGUF, Q4 KV cache, flash-attn on, one 4-card board, on real drafting prompts (several thousand tokens of context, not a 5-token "hello"):

| Model | Type | Decode (tok/s) |
|---|---|---|
| Gemma-4-26B-A4B | MoE | \~113 |
| Qwen3.6-35B-A3B | MoE | \~82 |
| Qwen3.5-122B-A10B | MoE | \~50 |
| any dense 27-32B | dense | \~20-28 (under my 40 floor, not worth it) |
| dense \~128B | dense | \~9 (forget it) |

So a 122B/10B-active reasoning model runs at \~50 tok/s on four V100s — faster than the dense 32B managed on vLLM in my first post — and it holds that at long context (I've pushed Gemma past 25k tokens without it falling apart, where the dense models choked). That reframed everything: I stopped chasing big dense weights and built the system around MoE.
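
A rough back-of-envelope for why the MoE numbers look the way they do. This is a sketch under stated assumptions, not a measurement: decode is treated as memory-bandwidth bound, every active weight is read once per token, Q8_0 costs about 8.5 bits per weight, and with a layer split only one card works at a time, so the ceiling is a single V100's 900 GB/s. It ignores KV-cache reads, attention, and expert-routing overhead, so real numbers land well below it.

```python
V100_BANDWIDTH_GBPS = 900.0    # V100 SXM2 HBM2 peak bandwidth, per card
Q8_BYTES_PER_PARAM = 8.5 / 8   # Q8_0: 32 int8 weights + one fp16 scale per block


def decode_ceiling(active_params_b, bandwidth_gbps=V100_BANDWIDTH_GBPS):
    """Upper bound on tok/s if each token reads every active weight once.

    `active_params_b` is the active parameter count in billions.
    """
    gb_per_token = active_params_b * Q8_BYTES_PER_PARAM
    return bandwidth_gbps / gb_per_token


# Active parameter counts read off the model names (A3B = ~3B active, etc.).
for name, active_b in [
    ("Qwen3.6-35B-A3B", 3),
    ("Gemma-4-26B-A4B", 4),
    ("Qwen3.5-122B-A10B", 10),
    ("dense 27B", 27),
]:
    print(f"{name}: ceiling about {decode_ceiling(active_b):.0f} tok/s")
```

Under those assumptions the ceiling is roughly 31 tok/s for a dense 27B and roughly 85 tok/s for 10B active, which lines up with the dense models sitting at 20-28 and the 122B-A10B at about 50. The small-active MoEs have far more headroom than they use, which suggests something other than weight bandwidth limits them.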

What's actually running (the stack you asked for):
It isn't one model answering chat; it's an orchestrator that routes a legal task across several local models, each pinned to its own board so they don't fight over GPUs. When it runs the heaviest job (a full affidavit or motion, intake-to-document), it lights up 16 GPUs across both boxes (the twelve V100s plus the four 3090s):

- Workhorse drafting — Qwen3.6-35B-A3B on Board A {4,5,8,9}
- Heavy reasoning + high-stakes drafting — Qwen3.5-122B-A10B on Board B {6,7,10,11}
- A small "does this even have grounds" gate model on the {0,1} pair
- An adversarial reviewer whose entire job is to attack my own draft, on the {2,3} pair
- Gemma-4-26B for financial/extraction + a small Qwen as the router, on the 3090s on the second box via Ollama

It's a sequential pipeline so they don't all hammer at once, but all 16 stay resident. Lighter work uses far less — combining and Bates-stamping exhibits is pure CPU (PyMuPDF + Tesseract, no GPU at all); a plain summary mostly just hits Gemma and the router.
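
For anyone wanting to reproduce the V100 side, here is a hedged sketch of how the placements above might turn into `llama-server` launches with the settings mentioned earlier (flash-attn on, Q4 KV cache, one board per model). Model paths, ports, and the 32768 context size are hypothetical. `--flash-attn on` assumes a recent llama.cpp build where the flag takes a value; older builds used a bare `-fa`.

```python
import os

# Placements from the list above. Paths and ports are made up for illustration.
PLACEMENTS = {
    "workhorse": {"gpus": [4, 5, 8, 9], "model": "models/drafting.gguf", "port": 8001},
    "reasoning": {"gpus": [6, 7, 10, 11], "model": "models/reasoning.gguf", "port": 8002},
    "gate": {"gpus": [0, 1], "model": "models/gate.gguf", "port": 8003},
    "reviewer": {"gpus": [2, 3], "model": "models/reviewer.gguf", "port": 8004},
}


def server_command(model, port, ctx=32768):
    """Argument list for one llama-server instance with a Q4 KV cache."""
    return [
        "llama-server",
        "-m", model,
        "--port", str(port),
        "-c", str(ctx),            # context size, an assumed value
        "-ngl", "99",              # offload all layers to the GPUs
        "--flash-attn", "on",
        "--cache-type-k", "q4_0",
        "--cache-type-v", "q4_0",
    ]


def server_env(gpus, base=None):
    """Environment that pins the process to one board's GPUs."""
    env = dict(os.environ if base is None else base)
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # keep indices stable across runs
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    return env
```

Each pair can then be handed to `subprocess.Popen(server_command(...), env=server_env(...))`, one process per board, so every model stays resident while the orchestrator calls them in sequence.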

The honest part, since this sub kept me honest last time:
- The local models hallucinate citations and dates. Confidently. I had to build a verifier that checks every cite, date, and Bates number in a draft against the actual source material and blocks anything it can't ground, on top of the adversarial reviewer. Local drafting is bimodal — sometimes it correctly refuses to invent, sometimes it fabricates a whole dated chronology and swears in the same breath that it invented nothing. It does not touch a final document without that gate and without me.
- The dumbest bug I found: my own pipeline was \~79% poisoned. The thing that builds the evidence bundle was scooping up its OWN prior outputs as if they were client evidence, so the models were "grounding" on slop they'd written earlier — at one point it cited an RTX 3060 as a Bates number, which, fair. Fixed the builder to stop eating its own tail and scrubbed it out. If you run any RAG/agent pipeline, go look at what's literally in your context window — mine was a hall of mirrors and I had no idea.
- I also made it refuse to quietly fall back to a cloud model when I tell it to run local-only. If it can't do a step locally it says so, by name, instead of phoning Anthropic behind my back.
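
The verifier gate from the first bullet can be sketched roughly like this. It is a toy version under loud assumptions: the Bates format (a 2-6 letter prefix plus six digits), the two date formats, and the exact-substring matching are all hypothetical stand-ins, and the names `ungrounded_claims` and `gate` are invented for illustration. A real gate would need normalization and citation parsing on top.

```python
import re

# Hypothetical formats: Bates numbers like SMITH000123, dates like
# 2021-03-03 or "March 3, 2021".
BATES_RE = re.compile(r"\b[A-Z]{2,6}-?\d{6}\b")
DATE_RE = re.compile(
    r"\b(?:\d{4}-\d{2}-\d{2}"
    r"|(?:January|February|March|April|May|June|July|August|September"
    r"|October|November|December) \d{1,2}, \d{4})\b"
)


def ungrounded_claims(draft, sources):
    """Return Bates numbers and dates in `draft` that appear in no source text."""
    corpus = "\n".join(sources)
    claims = set(BATES_RE.findall(draft)) | set(DATE_RE.findall(draft))
    return sorted(c for c in claims if c not in corpus)


def gate(draft, sources):
    """Pass the draft through only if every extracted claim is grounded."""
    missing = ungrounded_claims(draft, sources)
    if missing:
        raise ValueError(f"blocked, cannot ground: {missing}")
    return draft
```

The point of raising instead of warning is that an ungrounded date or Bates number stops the pipeline before a human ever sees a clean-looking draft.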

Still want the exact thing I wanted in the first post — a model that writes like me and handles the boring form-filling and pattern stuff. I'm closer: the system now captures my edits as correction data, which is the start of a real fine-tune set. Haven't pulled the QLoRA trigger yet. So the same questions stand, and I'd genuinely take advice:
- For QLoRA on this hardware (V100, no bf16, no FA2): do you reach for a 35B-A3B MoE base, or am I smarter to fine-tune a dense \~14B I can actually train and keep the MoE for the heavy serving?
- Anyone serving MoE on Volta found anything faster than llama.cpp — ik_llama, something else? And is there a better long-context KV story than Q4?
- Am I an idiot keeping 122B-A10B around at 50 tok/s when I could just run the 35B for everything?

Tell me what I'm doing wrong.

Unlikely_Ad_8060@reddit

The pipeline eating its own output is the bug that doesn't look like a bug until something absurd surfaces. In my harness I hit a version of this where the decision log (append-only file agents write to) was also in the search path for "prior context." So the agent started citing its own previous decisions as evidence for new decisions. It never errored, it just got progressively more confident about things that were circular.

The fix that stuck: hard separation between read-only substrate and write-only state. Source material goes in one directory that agents can read but never write to. Agent outputs go in a separate append-only layer that nothing reads unless you explicitly pipe it back through a verification gate. The moment you let generated output live in the same namespace as source documents, you've built a confidence amplifier with no ground truth check.

Your RTX-3060-as-Bates-number story is the perfect illustration. The model wasn't hallucinating in the traditional sense. It was grounding on real text that happened to be its own prior output. That's actually harder to catch than a pure hallucination because the grounding step "succeeds."
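
A minimal sketch of that separation, assuming a plain directory layout. The function names and the two-root design are illustrative only: evidence is collected from the source tree alone, anything that resolves into the generated tree is skipped, and outputs are written with exclusive-create so nothing is silently overwritten.

```python
from pathlib import Path


def collect_evidence(source_root, generated_root):
    """List files under `source_root`, never anything from `generated_root`."""
    source_root = Path(source_root).resolve()
    generated_root = Path(generated_root).resolve()
    if (
        source_root == generated_root
        or source_root in generated_root.parents
        or generated_root in source_root.parents
    ):
        raise ValueError("source and generated trees must not overlap")
    files = []
    for path in source_root.rglob("*"):
        if not path.is_file():
            continue
        real = path.resolve()
        if generated_root in real.parents:
            continue  # symlink that points back into generated output
        files.append(real)
    return sorted(files)


def write_output(generated_root, name, text):
    """Write a new output file; mode "x" fails if the file already exists."""
    path = Path(generated_root) / name
    with open(path, "x", encoding="utf-8") as handle:
        handle.write(text)
    return path
```

Refusing overlapping roots up front is the cheap part of the fix; the symlink check covers the case where generated output gets linked back into the evidence folder by accident.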

pile-of-V100s@reddit

Try -sm tensor with dense models on the quad nvlink board. 27B dense goes quite a lot faster than 20-28t/s with 4 nvlink'd V100s - guessing you're only testing -sm layer given those speeds.

In_der_Tat@reddit

If you had to start from scratch now, including hardware sourcing, what would you do differently?
