Update on 12x32gb sxm v100 cluster / local AI for legal drafting
Posted by TumbleweedNew6515@reddit | LocalLLaMA | 76 comments
Update from the lawyer with the V100 server. A few of you asked what I actually ended up running once the dust settled, so here it is. Still just a lawyer, still driving the whole thing through Claude Code, still not fully sure what I'm doing — but it works now, which is more than I could say last time.
First, the hardware caught up to the plan. The last two V100s are in, so the "final form" I promised is real: twelve V100-SXM2 32GB on the Threadripper Pro. It's Board A on GPUs {4,5,8,9}, Board B on {6,7,10,11}, an NVLink pair on {0,1}, and a mixed pair on {2,3} where one card is a 16GB. Split a model across two different NVLink boards and throughput falls off a cliff (the cross-board hop is PCIe/NUMA, not NVLink), so I keep every model inside one board. Learned that one the expensive way.
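The "keep every model inside one board" rule can be checked before anything launches. A minimal sketch, assuming the GPU indices in the post match what CUDA enumerates on the box; the board names and both helper functions are made up for illustration and are not from the OP's repo.

```python
# Board layout as described in the post; the dictionary keys are hypothetical labels.
BOARDS = {
    "board_a": {4, 5, 8, 9},
    "board_b": {6, 7, 10, 11},
    "pair_01": {0, 1},
    "pair_23": {2, 3},  # mixed pair: one of these cards is a 16GB
}


def board_for(gpus):
    """Return the name of the single board that contains every GPU in `gpus`, else None."""
    wanted = set(gpus)
    if not wanted:
        return None
    for name, members in BOARDS.items():
        if wanted <= members:
            return name
    return None


def launch_env(gpus):
    """Build a CUDA_VISIBLE_DEVICES setting, refusing any split across boards."""
    if board_for(gpus) is None:
        raise ValueError(f"GPUs {sorted(gpus)} are not all on one known board")
    return {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in sorted(gpus))}
```

Calling `launch_env({6, 7, 10, 11})` gives the environment for Board B, while `launch_env({5, 6})` raises instead of quietly sending traffic over the PCIe hop.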
And yeah, I caved and built the second box. EPYC 7302P, 512gb RAM, 4x RTX 3090 + 2x V100-PCIe. The mid-life crisis remains on schedule.
The bigger change: I gave up on vLLM for the local models. Not because vLLM is bad — because the models I actually want are MoE GGUFs, and vLLM on Volta is a dead end for those (FP8/AWQ/Marlin all want SM75+, the GPTQ kernels are broken on 7.0). I moved the whole thing to llama.cpp (mainline — a recent build finally fixed a Gemma chat-parser bug that had been mangling my long prompts).
Here's the part that's the opposite of what my first post implied: on V100, dense models are a trap. Only MoE clears a usable speed. Rough decode numbers — Q8 GGUF, Q4 KV cache, flash-attn on, one 4-card board, on real drafting prompts (several thousand tokens of context, not a 5-token "hello"):
| Model | Type | tok/s (decode) |
|---|---|---|
| Gemma-4-26B-A4B | MoE | \~113 |
| Qwen3.6-35B-A3B | MoE | \~82 |
| Qwen3.5-122B-A10B | MoE | \~50 |
| any dense 27-32B | dense | \~20-28 (under my 40 floor, not worth it) |
| dense \~128B | dense | \~9 (forget it) |
So a 122B/10B-active reasoning model runs at \~50 tok/s on four V100s — faster than the dense 32B managed on vLLM in my first post — and it holds that at long context (I've pushed Gemma past 25k tokens without it falling apart, where the dense models choked). That reframed everything: I stopped chasing big dense weights and built the system around MoE.
What's actually running (the stack you asked for):
It isn't one model answering chat — it's an orchestrator that routes a legal task across several local models, each pinned to its own board so they don't fight over GPUs. When it runs the heaviest job (a full affidavit or motion, intake-to-document), it lights up 16 GPUs across both boxes:
- Workhorse drafting — Qwen3.6-35B-A3B on Board A {4,5,8,9}
- Heavy reasoning + high-stakes drafting — Qwen3.5-122B-A10B on Board B {6,7,10,11}
- A small "does this even have grounds" gate model on the {0,1} pair
- An adversarial reviewer whose entire job is to attack my own draft, on the {2,3} pair
- Gemma-4-26B for financial/extraction + a small Qwen as the router, on the 3090s on the second box via Ollama
It's a sequential pipeline so they don't all hammer at once, but all 16 stay resident. Lighter work uses far less — combining and Bates-stamping exhibits is pure CPU (PyMuPDF + Tesseract, no GPU at all); a plain summary mostly just hits Gemma and the router.
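A sequential, pinned pipeline like the one described can be sketched as follows. This is only an illustration under assumptions: the stage functions are stand-ins that do not call any model, and the `Stage` type, `run_pipeline`, and the `halt` flag are hypothetical names, not the OP's orchestrator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Stage:
    name: str                    # role in the pipeline
    placement: str               # which board or box the model is pinned to
    run: Callable[[Dict], Dict]  # takes and returns the shared task state


def run_pipeline(stages: List[Stage], task: Dict) -> Tuple[Dict, List[str]]:
    """Run stages one at a time; stop early if a stage sets state['halt']."""
    state = dict(task)
    trace = []
    for stage in stages:
        state = stage.run(state)
        trace.append(f"{stage.name}@{stage.placement}")
        if state.get("halt"):
            break
    return state, trace


# Stand-in stage functions. Real ones would call a local model server.
def gate(state):
    return {**state, "halt": not state.get("facts")}


def draft(state):
    return {**state, "draft": f"DRAFT built from {len(state['facts'])} facts"}


def review(state):
    return {**state, "review_notes": ["placeholder objection"]}


STAGES = [
    Stage("gate", "pair{0,1}", gate),
    Stage("draft", "boardA{4,5,8,9}", draft),
    Stage("review", "pair{2,3}", review),
]
```

Because each stage runs only after the previous one returns, the models stay resident but never compete for the same step, and the trace records which placement handled what.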
The honest part, since this sub kept me honest last time:
- The local models hallucinate citations and dates. Confidently. I had to build a verifier that checks every cite, date, and Bates number in a draft against the actual source material and blocks anything it can't ground, on top of the adversarial reviewer. Local drafting is bimodal — sometimes it correctly refuses to invent, sometimes it fabricates a whole dated chronology and swears in the same breath that it invented nothing. It does not touch a final document without that gate and without me.
- The dumbest bug I found: my own pipeline was \~79% poisoned. The thing that builds the evidence bundle was scooping up its OWN prior outputs as if they were client evidence, so the models were "grounding" on slop they'd written earlier — at one point it cited an RTX 3060 as a Bates number, which, fair. Fixed the builder to stop eating its own tail and scrubbed it out. If you run any RAG/agent pipeline, go look at what's literally in your context window — mine was a hall of mirrors and I had no idea.
- I also made it refuse to quietly fall back to a cloud model when I tell it to run local-only. If it can't do a step locally it says so, by name, instead of phoning Anthropic behind my back.
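The last bullet, refusing a silent cloud fallback, comes down to failing loudly at the point where a backend is chosen. A minimal sketch, assuming backends are looked up by step name; `LocalOnlyError`, `pick_backend`, and the dictionary shapes are hypothetical.

```python
class LocalOnlyError(RuntimeError):
    """Raised when a step cannot run locally and cloud fallback is forbidden."""


def pick_backend(step, local_backends, cloud_backends, local_only):
    """Return the backend for `step`, never falling back to cloud when local_only is set."""
    if step in local_backends:
        return local_backends[step]
    if local_only:
        # Name the step that failed instead of quietly routing it elsewhere.
        raise LocalOnlyError(f"step '{step}' has no local backend; refusing cloud fallback")
    if step in cloud_backends:
        return cloud_backends[step]
    raise KeyError(f"no backend registered for step '{step}'")
```

The point of the design is that the refusal is an exception carrying the step name, so the caller cannot mistake it for a normal result.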
Still want the exact thing I wanted in the first post — a model that writes like me and handles the boring form-filling and pattern stuff. I'm closer: the system now captures my edits as correction data, which is the start of a real fine-tune set. Haven't pulled the QLoRA trigger yet. So the same questions stand, and I'd genuinely take advice:
- For QLoRA on this hardware (V100, no bf16, no FA2): do you reach for a 35B-A3B MoE base, or am I smarter to fine-tune a dense \~14B I can actually train and keep the MoE for the heavy serving?
- Anyone serving MoE on Volta found anything faster than llama.cpp — ik_llama, something else? And is there a better long-context KV story than Q4?
- Am I an idiot keeping 122B-A10B around at 50 tok/s when I could just run the 35B for everything?
Tell me what I'm doing wrong.
Unlikely_Ad_8060@reddit
The pipeline eating its own output is the bug that doesn't look like a bug until something absurd surfaces. In my harness I hit a version of this where the decision log (append-only file agents write to) was also in the search path for "prior context." So the agent started citing its own previous decisions as evidence for new decisions. It never errored, it just got progressively more confident about things that were circular.
The fix that stuck: hard separation between read-only substrate and write-only state. Source material goes in one directory that agents can read but never write to. Agent outputs go in a separate append-only layer that nothing reads unless you explicitly pipe it back through a verification gate. The moment you let generated output live in the same namespace as source documents, you've built a confidence amplifier with no ground truth check.
Your RTX-3060-as-Bates-number story is the perfect illustration. The model wasn't hallucinating in the traditional sense. It was grounding on real text that happened to be its own prior output. That's actually harder to catch than a pure hallucination because the grounding step "succeeds."
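The namespace separation described above can be enforced when the evidence bundle is assembled. A rough sketch, assuming POSIX-style paths and hypothetical directory names; it compares path strings only, so real code should resolve symlinks and `..` segments first.

```python
from pathlib import PurePosixPath


def is_under(path, root):
    """True if `path` is `root` itself or sits somewhere beneath it."""
    path, root = PurePosixPath(path), PurePosixPath(root)
    return path == root or root in path.parents


def select_evidence(candidates, source_root, output_root):
    """Keep only files under the read-only source tree; report everything else."""
    if is_under(source_root, output_root) or is_under(output_root, source_root):
        raise ValueError("source and output trees must not overlap")
    kept, rejected = [], []
    for candidate in candidates:
        (kept if is_under(candidate, source_root) else rejected).append(candidate)
    return kept, rejected
```

Returning the rejected list as well as the kept list makes it visible when generated output was about to be treated as evidence.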
pile-of-V100s@reddit
Try -sm tensor with dense models on the quad nvlink board. 27B dense goes quite a lot faster than 20-28t/s with 4 nvlink'd V100s - guessing you're only testing -sm layer given those speeds.
In_der_Tat@reddit
If you had to start from scratch now, including hardware sourcing, what would you do differently?
TumbleweedNew6515@reddit (OP)
I built and rebuilt like 3-4 windows pcs because Linux was too intimidating initially. Wasted some time there.
I was fixated on fine tuning, and was able to fine tune, but it didn’t appear to have resulted in actual better local models. Maybe I didn’t spend enough time on that, but it wound up not being as necessary as I thought. Haven’t given up but not going to manually curate a huge training set. No time for that.
I would also not have bought the 6 p40s that I never used I suppose. But tbh I may still wind up using those on a third cheap box so who knows.
Every now and then I see a full inspur server setup on eBay, which includes the board and 8 sxm v100s for like 5k. Those listings don’t last long, but having 8 actually nv linked instead of 4 and 4 could make a difference in terms of model size or fine-tuning.
Otherwise, these work really well for my use case, or at least they seem to. I am fine with my 3090s as well, but they feel mostly the same as the v100s when it comes to inference, maybe faster, maybe easier to plug and play. The cost of 3090 nv link bridges has been too much for me, or doesn’t feel justified right now because right now this feels “fast enough.”
All of this hardware keeps getting more expensive, except the very very oldest stuff. If I was starting over I’d consider just getting one big newer card. But what I have so far works, and it was fun learning and building it.
In_der_Tat@reddit
Thank you for your answer. I read that your monthly electricity bill premium for running your system is just $80. How can that be when 8x250W alone comes to 2kW? Are you blessed with an ultra-cheap contract, or do you turn on your GPU cluster for only a few hours a day? Power draw over and above obsolescence is what makes me - an energy-starved fellow European - a little sceptical of the 8xV100 approach, even though it remains attractive.
A single GPU with 256GB VRAM in my book means the AMD Instinct MI325X which is, well, non-team Green and not exactly cheap given the ~€30K price-tag. A Mac Studio M3 Ultra with 512GB unified memory would have been nice, but Chronos snatched it from us after a fleeting window of opportunity.
Taken together, would you still go for the 8xSXM V100-cum-NVLink approach?
TumbleweedNew6515@reddit (OP)
They don’t all run at 250w at once. You can limit clock speed and max power, but if you have a 4 card board running at max, only one at a time is at 250w, and that assumes 100% usage. On Linux they idle at 35w. So if I’m hammering the system, running all 4 boards at max, it’s probably 300w each. I am able to run 12 of them on 2 regular office level 120V circuits with breakout boards. Have never had a circuit flip. They manage power much better than advertised. They still are not efficient. Compared to Mac they are hogs. But every single LLM statement about their power consumption was wrong/exaggerated.
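The gap between "8x250W equals 2kW" and the observed bill is mostly a duty-cycle question, and the arithmetic is easy to sketch. Only the 35w idle figure and the 250w peak come from this thread; the hours under load and the electricity price below are made-up inputs, and the function counts GPUs only, not the host machines.

```python
def monthly_gpu_cost(n_cards, idle_w, load_w, load_hours_per_day, price_per_kwh, days=30):
    """Estimate the monthly energy cost of the GPUs alone from a simple two-state duty cycle."""
    idle_hours = 24 - load_hours_per_day
    wh_per_card_per_day = load_w * load_hours_per_day + idle_w * idle_hours
    kwh_per_month = n_cards * wh_per_card_per_day * days / 1000.0
    return kwh_per_month * price_per_kwh


# Hypothetical scenario: 12 cards, 2 hours/day at 250 W, otherwise idle at 35 W, $0.15 per kWh.
typical = monthly_gpu_cost(12, idle_w=35, load_w=250, load_hours_per_day=2, price_per_kwh=0.15)

# Worst case from the question: 8 cards pinned at 250 W around the clock, same price.
always_on = monthly_gpu_cost(8, idle_w=35, load_w=250, load_hours_per_day=24, price_per_kwh=0.15)
```

With those assumed inputs the first scenario costs far less than the second, which is the shape of the OP's answer: the cards spend most of the day near idle.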
McStonkyRex@reddit
It is a constant spiral. I know...
McStonkyRex@reddit
I'm jealous. I'm trying to do something similar right now. Not drafting, but using APIs with courtlistener and dawsom via skills/MCP to try to create knowledge graphs on subject matters/issues I'm interested in. I'm genuinely jealous.
TumbleweedNew6515@reddit (OP)
You can just tell Claude to use google scholar and eventually it will do it for you. I have a fairly limited body of statute and caselaw and so i have all my legal authority in like 3 databases, and then past written work citing to those cases as a 4th database called applied authority. Still trying to perfect qdrant but tbh, I don’t use it to write actual legal briefs. I just have it pull my writing from the last 10 years that applies. It’s like recycling my dead labor.
MRGWONK@reddit
I have the full courtlistener database and I created a "loop" that verifies citations and quotations, re-processes the output, and edits out the "fake cases" and hallucinated cases (40Gig VRAM, Gemma 4:26b (Main) and Gemma4:4b (Voice+Text+Small Tasks)).
Asthenia5@reddit
At a high level, how exactly does the verifier you built work? How does it integrate into the workflow?
TumbleweedNew6515@reddit (OP)
You’re asking me to ask this to the LLM, which I will do in a minute. But my understanding based on what I told it to do in English, is to verify the truth of everything based on specific sources and evidence, and then when it comes to writing, to compare output to style guides and a set of rules that have been created from my past work, and then also based on 5 months of prompt history.
dbenc@reddit
I think if you look into the concept of a "defeasibility graph" you could integrate that into your workflow. have your cluster prepare each source document with its facts and assertions, and then your synthesis step will pull from all of those to build an actual legal reasoning graph that can be checked programmatically. this will also help with the rag relevance issue since rag isn't great for pulling stuff that is too semantically different. claude should be able to build it just from this description.
NortySpock@reddit
As a software developer who is not a lawyer, what does a defeasibility (sp?) graph look like? (Perhaps defensibility?)
I'm familiar with directed acyclic graph (has dependencies, but no loops - common in job sequences or software dependency chains) and I wonder if it's a similar concept or very far away from what you are talking about...
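One common way to make "defeasible" concrete is an argument graph where edges mean "attacks", evaluated so that a claim stands only if everything attacking it has itself been defeated. The sketch below uses that reading (a standard grounded labelling from abstract argumentation), which is an assumption about what dbenc has in mind; the example arguments are invented. Unlike a DAG, this kind of graph may contain cycles, and arguments caught in an unresolved cycle simply stay undecided.

```python
def grounded_labelling(arguments, attacks):
    """Label arguments accepted/rejected; anything left over is undecided.

    `attacks` is a list of (attacker, target) pairs over `arguments`.
    """
    attackers = {a: set() for a in arguments}
    for src, dst in attacks:
        attackers[dst].add(src)

    accepted, rejected = set(), set()
    changed = True
    while changed:
        changed = False
        for a in arguments:
            if a in accepted or a in rejected:
                continue
            if attackers[a] <= rejected:      # every attacker is already defeated
                accepted.add(a)
                changed = True
            elif attackers[a] & accepted:     # some accepted argument attacks it
                rejected.add(a)
                changed = True
    return accepted, rejected


# Invented example: a claim, a defeater, and a rebuttal of the defeater.
ARGS = ["service_was_proper", "affidavit_unsigned", "signed_copy_in_exhibit"]
ATTACKS = [
    ("affidavit_unsigned", "service_was_proper"),
    ("signed_copy_in_exhibit", "affidavit_unsigned"),
]
```

Running `grounded_labelling(ARGS, ATTACKS)` accepts the claim because its only defeater is rebutted; remove the rebuttal and the claim is rejected instead.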
TumbleweedNew6515@reddit (OP)
Claude had an orgasm when I asked it about a defeasibility graph. It told me that I had all the elements just sitting there, waiting for their defensibility to be graphed, and it tells me this fits the seams of my existing repo perfectly.
This was very helpful feedback. It’s amazing how far you can go asking the right questions, telling an LLM to plan/research and building from there.
Thank you!!!
dbenc@reddit
if you need more I do this stuff all day and need a side gig 😅😅😅
TumbleweedNew6515@reddit (OP)
Ok per the LLM: How it works, high level. After the model writes a draft, the verifier pulls every checkable assertion back out of it — every date, every case/statute/rule citation, every "Bates"/exhibit number, every party name — and checks each one against the source bundle the draft was supposed to be built from (the actual records I fed it). The question it's asking is just "is this grounded in the source, or did you make it up":
- A date, Bates number, or fact in the draft that isn't in the source → flagged as fabrication, and it blocks. Sharpest case: if the source records carry no dates at all and the draft confidently lays out a dated chronology, every one of those dates is fabrication.
- Citations get checked against my legal-authority database — a cite it can't resolve comes back unverifiable / possibly fabricated.
- It catches the tell where the draft writes "no facts or authorities were invented" while containing ungrounded ones — that contradiction is itself a flag.
- There's also a second local model whose only job is to be an adversarial reviewer — attack the draft, write up what's weak or unsupported.
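The checks in the list above can be sketched in a few lines. This is a toy version under loud assumptions: the date and Bates patterns are guesses at common formats, "grounded" here means a literal substring match against the source text, and citation lookup against an authority database is left out entirely. It is not the OP's verifier.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_RE = re.compile(rf"\b(?:\d{{1,2}}/\d{{1,2}}/\d{{4}}|(?:{MONTHS}) \d{{1,2}}, \d{{4}})\b")
BATES_RE = re.compile(r"\b[A-Z]{2,6}\d{6}\b")  # assumed format, e.g. SMITH000123
DISCLAIMER_RE = re.compile(r"no facts or authorities were invented", re.IGNORECASE)


def verify_draft(draft, source_text):
    """Flag dates and Bates numbers in the draft that never appear in the source."""
    assertions = DATE_RE.findall(draft) + BATES_RE.findall(draft)
    ungrounded = []
    for item in assertions:
        if item not in source_text and item not in ungrounded:
            ungrounded.append(item)
    claims_clean = bool(DISCLAIMER_RE.search(draft))
    return {
        "ungrounded": ungrounded,
        # The draft swears it invented nothing while containing ungrounded items.
        "self_contradiction": claims_clean and bool(ungrounded),
        "blocked": bool(ungrounded),
    }
```

A literal match is deliberately strict: a date written in a different format than the source will be flagged, which errs toward blocking rather than passing.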
How it integrates. It's not an optional lint pass, it's a mandatory gate on the high-stakes drafts (affidavits, motions). Two things matter:
Plane-Marionberry380@reddit
On the fine-tune question, I would probably not start by QLoRA'ing the MoE on V100. The highest-value dataset you have is not "write legal stuff", it is your correction trail: where the draft got too confident, where citations drifted, where your voice had to be put back.
So I would train the boring dense 14B first as a style/patch model and keep the MoE stack for generation, critique, and hard cases. If the small model can reliably turn "almost right but cursed" into "matches my house style and does not smuggle in a fake cite", that is already a win and much easier to iterate.
I would keep the 122B around, but give it a narrow job title. Not every draft needs the expensive adult in the room, but high-stakes reasoning, adversarial review, and "does this argument actually survive contact with the exhibits" feel like exactly where the slower model earns rent.
The self-eating evidence bundle bug is horrifying and useful. I would add a context receipt to every run: source file ids, prior-output ids excluded, verifier pass/fail counts, and any cloud/local fallback status. The model is only half the risk. The context window is the crime scene.
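A context receipt along those lines could be as small as one record per run. The sketch below is one possible shape; the class, the field names, and the `"local-only"` status string are invented for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ContextReceipt:
    run_id: str
    source_file_ids: List[str]            # what actually went into the context window
    excluded_prior_output_ids: List[str]  # generated files that were deliberately left out
    verifier_passed: int
    verifier_failed: int
    backend_status: str                   # e.g. "local-only" or "cloud-fallback"

    def is_clean(self) -> bool:
        """A run is clean only if nothing failed verification and nothing left the box."""
        return self.verifier_failed == 0 and self.backend_status == "local-only"

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```

Writing the JSON next to each draft gives a record of what the model was shown, which is the thing that was invisible when the bundle was poisoned.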
Korici@reddit
For Legal analysis, I personally prefer the larger models over Qwen/Gemma.
\~
I would try Unsloth's Q4_XL of MiniMax M2.7: https://huggingface.co/unsloth/MiniMax-M2.7-GGUF
Which is a 230B-A10B MoE model - should be compatible with both llama.cpp & vLLM
SillyLilBear@reddit
This is a lot of hardware for the models you are running. How much did you spend? I feel two RTX 6000 Pros would have been a much better choice.
JaredsBored@reddit
R.e. the 122B-A10B vs using 3.6 35B for everything - you're not dumb for continuing to use 122B. I use my LLM box mostly for proofreading documents and find 3.6 35B just ain't nearly as "smart" as 122B for the purpose. And my workflow isn't sophisticated.
TumbleweedNew6515@reddit (OP)
This will never fully replace my written workflow, because there’s a point of diminishing returns as a lawyer. I can’t ever ever ever submit hallucinated or fake facts or law, and I have to review everything. For routine low level form work, which is maybe 40% of the practice, I can usually do that directly from memory, or using the bundled highlighted exhibits that are served to me.
One thing I have learned the more I’ve done this is to stop using the LLM once it’s beyond its capability. There are things that I will always need to know inside and out, which means that having an LLM summarize the stuff is of no help. I don’t need to know a summary. Same with specific application of fact to law. No part of this orchestrator does any real legal research. It regurgitates my own writing (which I recognize) when it applies to a new set of facts.
mnight75@reddit
have you considered having other models check the work being produced? one model may hallucinate... but several models doing double checks should reduce actual hallucinating to ... a very negligible amount.
TumbleweedNew6515@reddit (OP)
I have two separate Gemma helpers on separate llama servers do that at the end and it works well.
No-Refrigerator-1672@reddit
Did you consider automating fact-checking? After generating the draft, the same model can receive the draft and the task of extracting all referenced laws and legal cases; then it gets connected to a "legal database" and is tasked with finding a reference link for each extracted entry. Entries that have no reference get highlighted. You can install OpenCode and ask it to write you a separate agentic app to do this automatically. I feel like it can speed up your all-manual fact check by not requiring you to find references manually.
TumbleweedNew6515@reddit (OP)
Lawyers always have to fact check everything manually at the end of the day. It’s why llm utility is ultimately limited in law.
kermitt81@reddit
Much of this manual fact checking can be mitigated with well thought out interfaces combined with a properly thought out generation process.
For our own absolutely-no-hallucinations-allowed system, our final doc review interface allows the user to easily click through every single input source that went into that document’s construction. A click opens a dropdown menu listing the source doc/input, and another click opens it side by side with the draft document. Minor changes can be made directly in the editable draft, while a “regenerate” button allows the user to reprocess from scratch based on whatever sources were added, removed, or changed.
Still very much a work in progress, as all software is, but we’ve been working on it for a few years (since 2019, long before OpenAI first made AI broadly available to the public), and the secret sauce, for us, was to only allow AI to handle those things that *cannot* be handled deterministically.
With Claude Code (or Cursor, Codex, etc) you can now relatively easily build such a semi-deterministic pipeline for your particular legal application.
kermitt81@reddit
We had to solve a similar problem in medical-legal (personal injury) workflows, where we absolutely *cannot* hallucinate evidence.
We handled this by restructuring the entire process into a more human-like system of steps.
For example, we have one step that reads a categorized “input document” (hospital records, mri reports, specialist reports, etc), each of which goes through a near-deterministic AI pipeline to pull the true verbatim data we need. Each step has its own prompts (or prompt chains) that it calls based on a combination of the category of document and the final intended use.
The final outputs are placed into a pre-structured document via section variables, and we can then manipulate that document however we want. Since there’s no need to have the LLM conduct “research” at the final writing stage, there’s no way for it to hallucinate sources or evidence.
We manage this through a combination of straightforward deterministic code with AI/LLM processing happening in smaller chunks where actually needed, but not driving the process from the top down. Very little room for error when the AI has narrowly constrained, discrete tasks it can easily complete successfully rather than large tasks it “solves” by reward-hacking.
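The "section variables" idea can be illustrated with a plain template that deterministic code fills in. This is a sketch, not kermitt81's system: the template fields are invented, and the verbatim check assumes each extracted value must appear word for word in the source document.

```python
from string import Template

# Hypothetical pre-structured section; real templates would be per document type.
REPORT = Template(
    "Patient: $patient_name\n"
    "Date of MRI: $mri_date\n"
    "Findings: $mri_findings\n"
)


def fill_report(fields, source_text):
    """Fill the template, rejecting any value that is not verbatim in the source."""
    for key, value in fields.items():
        if value not in source_text:
            raise ValueError(f"field '{key}' is not verbatim in the source")
    # substitute() raises KeyError if a template variable has no value.
    return REPORT.substitute(fields)
```

The model's only job in this arrangement is the narrow extraction that produces `fields`; assembly and the grounding check are ordinary code.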
You may need to restructure your process from the ground up, but thankfully Claude Code (mostly) makes that pretty easy. And if you have a specialization in a particular area of law (PI, employment, torts, whatever), that makes it even easier.
Happy to get into greater detail if you DM me.
mmazing@reddit
This is the way.
shansoft@reddit
Qwen 3.5 122B is still the best Qwen open weight so far. Qwen 3.6 27B can't keep up with it on complex problems.
Tiny_Arugula_5648@reddit
I'm a pro AI/ML Engineer over 15 years in the job..
You need more than a model that writes like you. As you know, legal text has very specific meaning, and that can change by locale. A general LLM doesn't understand that and it will make mistakes. You really need to fine-tune the model on a sizable corpus of legal documents so that it can learn what those are. Otherwise you'll need to spend a LOT of time proofreading everything it writes, all the assumptions it makes, etc. This is common for any domain-specific use case like legal, healthcare, etc.
You'll still need to do a lot of critical proof reading but the error rate will be much lower..
Don't trust a general model to do industry-specific work. You will get burned badly; it's only a matter of time.
TumbleweedNew6515@reddit (OP)
I have been using variations of this model/system, with Opus at the center of it for the most part, for abt 5 months now, and it has gotten better and better over time in terms of error rate. I don't use it for legal research or to write legal briefs. I have a robust rag database, and when I have it draft things, its main role is to find my past language that applies based on my directions.
Finetuning was my original plan, and it's not totally off the table, but I didn't necessarily see a big difference in terms of the performance of the Qwen and Gemma GGUFs and adapters I was creating. My system is also not quite big enough to meaningfully finetune a 122b model.
I have a limited set of tasks that I know that it will be good at, and I try to just use it for that. It does work, in that it saves me time on maybe 40% of recurring work tasks. And I will always have to review all output regardless, due to the rules of professional responsibility.
lupodevelop@reddit
woah. that's huge!
WorthBathroom3268@reddit
The “pipeline eating its own tail” bug is probably the most useful warning in this whole post. In legal drafting the dangerous failure isn't just a fake cite — it's fake evidence becoming part of the next “grounded” context and looking more legitimate each run.
For your QLoRA question, I'd be tempted to keep the MoE as serving/review muscle and fine-tune a smaller dense model only on the boring style/form-filling layer. The edit-capture dataset sounds more valuable there than trying to teach the 122B new law.
jdchmiel@reddit
I did not see any details on your llama config. Over the weekend I found that it is totally worth getting:
-sm tensor \
--spec-type draft-mtp --spec-draft-n-max 2 \
--spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 64 \
That is: split tensor (you have to disable fit, -fit off), MTP heads on the recent unsloth qwen gguf uploads, as well as ngram, i.e. using TWO speculative decoding methods at the same time. This took qwen3.5 122b up to around 90-100 t/s tg on dual r9700s for me. PP stayed at 1300. Sadly, I only have context space for 20k in 64g vram for that file (Qwen3.5-122B-A10B-IQ4_XS).
I tend to stick to Qwen3.6 27b at around 70 t/s with same configs and plenty of headroom for full context at 8bit.
If you are at 50/s without mtp or tensor split I bet you get a 2x increase with the combo. tensor split helped PP a lot for me vs layer or row even without any speculative config.
Hannibalj2ca@reddit
Have you tried Qwen 3.5 397B and GLM 4.7 355B? Other options
David-Gallium@reddit
I've got a 4x V100 32gb NVLink setup. I see 80t/s token decode but can achieve \~600 aggregate on 122B. You can absolutely get more out of this hardware. This was key to my setup: https://github.com/1CatAI/1Cat-vLLM
TumbleweedNew6515@reddit (OP)
Just found out about this vLLM fork. Literally implementing now. This is why I post these posts!
No_Afternoon_4260@reddit
Please 🙏 buy 4 more ! So you can tp 16 (tensor parallelism) on all 16 cards. I promise the fact that you have multiple nvlink baseboard on pcie isn't a bottleneck for this (Pcie 4.0 x16 correct?) You'll be able to run somewhere around deepseek v4 flash I guess
Ok-Internal9317@reddit
Who said there’s no perfect crime?
Brilliant-Resort-530@reddit
The hallucination problem is real. RAG on your case documents + a verification pass is the only workflow that actually holds up
DepartedQuantity@reddit
Are you documenting any of this in a GitHub repo? I'm not a lawyer however I am curious as to how well models can reference cases, go through legal documents and just overall do legal work. This isn't for a business, just more curiosity. Do you have a custom harness and/or a series of custom prompts that you built? Would love to see your entire pipeline. Great and awesome work.
TumbleweedNew6515@reddit (OP)
Started with skills, then style guides and forms and writing guides, now a formal ‘harness’ with 14 total GPUs dedicated to the most complex task to fire in sequence. Lots of database stuff and fact/style verification. Trying to make big parts of it truly systematic instead of ‘was opus in a weird mood that day.’
DepartedQuantity@reddit
Cool, thanks for the response. Looking forward to any documentation or information in future updates. Great job!
TumbleweedNew6515@reddit (OP)
If you want me to ask my LLM to give you some kind of architectural description I’m fine doing so. Don’t want to push anything publicly to GitHub just because god knows what’s in the code base.
DepartedQuantity@reddit
Fair enough about the code base. Yeah any insight is always appreciated!
kmouratidis@reddit
This is gonna sound weird, but...
Do you want to present your project to a team inside Thomson Reuters? I don't think we have to make a big deal about it, and I would rather avoid sales or product involvement, just a bunch of engineers and scientists (not sure how many would be interested, but I know at least a couple that would be—they are sub members). If that sounds interesting, happy to pitch it to my manager. Hopefully it won't die a slow corporate death waiting for approval 😅
TumbleweedNew6515@reddit (OP)
1) I am also Greek, 2) yes, dm me, I’m in line to give a cle on some of this stuff, but could share a whole lot more with your engineers because they can see how a lawyer with just html took his Claude code journey
OnkelBB@reddit
If that's possible I'd like to participate, very much curious about your whole pipeline architecture.
noctrex@reddit
You're one of ours, my little bird? :)
Quantum_Daedalus@reddit
Stick with 122b over 35b since you have the vram for it. Have you tried with MTP? I have a similar workflow and found it significantly improved performance.
ryfromoz@reddit
I love this! I get a kick out of seeing similar frankenrigs to my own actually being useful.
Question though, how much would you say it all ended up costing?
firiana_Control@reddit
Power recs are insane
imnotzuckerberg@reddit
Did you tinker with model params (like temperature)?
Check ktransformers (from the kimi team, I think it's part of sglang now). I am not sure if they have volta support.
Sisaroth@reddit
There is no substitute for stupid, if you are not happy with 35B's intelligence then stick with 122B.
However, I think there is still a lot of performance left on the table. If I can run 35B at 30 tps on an RX 7800 XT, then I would think your monster setup should be able to get speeds much higher than 80 tps. Unless I'm overestimating the V100.
Napster3301@reddit
fine-tune the 35b-a3b moe, only the shared expert layers. dense lora on 14b sounds easier but you'll burn the v100 memory on optimizer state and the rank you can afford gets so low style adaptation looks like noise. shared expert lora on 35b keeps trainable params under 1% of model size, fits your existing footprint, and routing weights stay frozen so you don't blow up moe behavior teaching it your tone. I've seen clean style transfer at rank 32-64 on similar volta setups. on the 122b question, don't keep it for throughput. keep it because moe routing for long legal text activates completely different experts than the 35b, you lose nuance even when it's slower. measure side by side on one full motion before you cut it.
curious what your correction-capture format is, jsonl pairs or full diffs with rationale?
zeferrum@reddit
I wonder how deepseek v4 flash would run on this and if it would help with hallucinations
TumbleweedNew6515@reddit (OP)
V100s are very powerful but are an old architecture. I tried ds4 flash on my other box that has 4 rtx 3090s and 512 gb of ecc ram. It scored very highly at drafting, but I only got to 7 tok/s. The main issue on this setup is the 4 card boards. If I had an inspur with 8 v100s, I could probably run deepseek flash at q4 or a reap of it fast enough. But these are live workflows and have to hit at least 40tok/s.
zeferrum@reddit
Speaking of Q4 are you aware of this special build ? https://github.com/1CatAI/1Cat-vLLM ? Do you have details on the sxm part of your build ?
TumbleweedNew6515@reddit (OP)
I have 2 of the 1catai 4 card boards. Or at least I was under the impression that they are the only ones that make the 4 card boards.
zeferrum@reddit
Something like this ? https://ebay.us/m/9YhvAk
TumbleweedNew6515@reddit (OP)
The link you just sent me is for a 2 card board rather than a 4 card board. I have 3 of those 2 card boards. Of the two 4 card boards, one I got from alibaba. The other I do think I found on eBay, but it was more expensive.
TumbleweedNew6515@reddit (OP)
This is incredible. Amazing share thank you. Will improve my performance big time.
Those are the ppl that make the boards. The amount of hbm on these cards makes me wonder if there will be more upgrades like this, if the ram on them will have physical recycling value.
patchy319@reddit
I wish the software support for V100 was better. I just got 2x V100 32GB pcie for $600 each and have mainly been using them for LLM with llama.cpp now, compared to the 2x 3090 I also have. But vllm support would've been really nice. I tried the Chinese fork and couldn't get it going either. I don't think deepseekv4-flash can be run locally on v100 with the software issues in mind? But I'd love to know the performance if someone does it.
TumbleweedNew6515@reddit (OP)
I made the PCIe mistake. Have 2 of them. Fine to run small models on but the sxm2 cards are cheaper and faster. They sell the 2 card nvlink boards on eBay (cheaper w buying agent). Those are nvlink 6, 64gb. That actually outperforms 2 3090s. Can run some bigger models. And costs half as much as one.
patchy319@reddit
I was considering upgrading my AI server chassis for more slots. I currently have an Asus ESC4000 G4 and was looking at a Gigabyte G481-HA0. I am personally leaning toward more 3090s or newer cards if I add GPUs, because vllm would give me better parallelism and kvcache management than llama.cpp. llama.cpp has recently fixed a lot of the issues with the new qwen3.6 models and mtp helps a lot with performance, but parallelism is still limited. I use litellm to balance between the two currently. Also, by going pcie I would have more flexibility in the future to use a different GPU or a new server, apart from size limitations - sxm2 is v100 forever.
TumbleweedNew6515@reddit (OP)
Every use case is different and you’re absolutely right about sxm2 vs PCIe
McStonkyRex@reddit
Context/compaction is nuking me too.
Available_Hornet3538@reddit
Damn not bad
gfe86@reddit
How about data retrieval: do you use BM25 or BGE-M3? I'm working on something for a law firm, not in English, on an R9700, and I haven't gotten the results I'm looking for. Do you have any tips on reasoning over the client's questions so it retrieves only the correct codes and laws?
TumbleweedNew6515@reddit (OP)
Honest answer up front: retrieval is the part of my stack I'm least confident in, so salt this heavily — but here's what I run and what I've actually learned.
I'm on Qdrant with dense embeddings (Qwen3-embedding, which is multilingual, on the V100s) — not BGE-M3. For your non-English case, though, I'd look hard at BGE-M3: it's strong multilingual and gives you dense + sparse + ColBERT-style late interaction out of one model, which is convenient. I'm NVIDIA/CUDA and you're on an R9700 (ROCm), so none of my serving notes will transfer — but the retrieval design below is hardware-agnostic.
BM25 vs dense — for law, don't choose, go hybrid. Dense alone quietly fails you on the thing that matters most: exact statute/code numbers and citations. "§ 20-3-160" isn't a semantic concept, it's a string, and a pure vector search will hand you something "about" custody when you needed that exact subsection. So BM25/sparse for the exact code/cite/numbered-rule lookups, dense for the "what is this actually about" semantic match, then fuse the two. Qdrant does hybrid natively; BGE-M3 gives you both halves in one model, which is part of why I'd consider it for your setup.
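For the "fuse the two" step, here is a minimal sketch using reciprocal rank fusion (RRF), which is one common way to merge a lexical ranking with a dense one. It is plain Python and not my production code; the two hit lists and every section ID except "20-3-160" are made-up placeholders, and `k=60` is just the conventional default.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for a query that names "§ 20-3-160".
# BM25 nails the exact section; dense search ranks a topical neighbour first.
bm25_hits = ["sec_20-3-160", "sec_20-3-150", "sec_63-15-240"]
dense_hits = ["sec_63-15-240", "sec_20-3-160", "sec_63-15-220"]

fused = rrf_fuse([bm25_hits, dense_hits])
print(fused)
```

The exact section wins because it ranks high in both lists, while documents that only one retriever liked fall to the bottom. RRF only needs ranks, so you don't have to normalise BM25 scores against cosine similarities.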
Structure matters more than the embedding model, honestly. I split authority into separate collections by TYPE — statutes/codes, rules, cases, and a fourth one that's the real unlock: an "applied" collection. That one isn't the law itself, it's how authorities got APPLIED to fact patterns, mined out of real briefs/orders. Searching raw statute text gets you statutes that sound related; searching "given THESE facts, the authority actually used was X" is how you land the correct code instead of a plausible-sounding wrong one.
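To make the "applied" collection concrete, here is a rough sketch of what one record could look like and how a facts-first lookup differs from searching statute text. The field names, collection names, and sample records are all hypothetical, and the token-overlap scoring is a crude stand-in for real embeddings.

```python
from dataclasses import dataclass

# Hypothetical collection names, one per authority type.
COLLECTIONS = ("statutes", "rules", "cases", "applied")

@dataclass(frozen=True)
class AppliedRecord:
    facts: str      # short fact pattern mined from a brief or order
    authority: str  # the authority that was actually applied to those facts
    source: str     # document the record was mined from

def tokens(text):
    return set(text.lower().split())

def search_applied(records, query_facts, top_k=3):
    """Rank records by Jaccard overlap between fact patterns (embedding stand-in)."""
    q = tokens(query_facts)
    scored = []
    for rec in records:
        r = tokens(rec.facts)
        union = q | r
        score = len(q & r) / len(union) if union else 0.0
        scored.append((score, rec))
    scored.sort(key=lambda pair: -pair[0])
    return [rec for score, rec in scored[:top_k] if score > 0]

# Placeholder records; the authority labels are not real citations.
records = [
    AppliedRecord("parent relocates out of state with minor child", "AUTH-A", "order_001"),
    AppliedRecord("tenant withholds rent after landlord fails repairs", "AUTH-B", "brief_017"),
]

matches = search_applied(records, "mother relocates out of state with child")
print([m.authority for m in matches])
```

The point of the design is that the query is matched against facts, and the authority comes along as payload, rather than matching the query against the text of the law itself.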
Now your actual question — reasoning over the client question so it retrieves only the right law. Biggest lever for me: don't embed the raw client question. It's messy, it mixes three issues, and it's full of facts that drag the retriever sideways. Put a reasoning step IN FRONT of retrieval:
1. A small local model does issue-spotting first — turn the rambling client question into the discrete legal issues, the area of law, and any explicit code/section references it names.
2. Retrieve per-issue, not per-question. Pull exact codes by lexical/BM25 match on any section the issue-spotter found; go semantic for the rest.
3. Re-rank, then GATE: only return codes/cases the system can actually resolve to a real authority. Anything it can't ground, it drops or flags — same fail-closed idea as the drafting verifier. That's what stops it from confidently "retrieving" garbage.
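The three steps above can be sketched roughly like this. It is an illustration under loud assumptions: a regex plus sentence splitting stands in for the small issue-spotting model, the re-rank step is left out, `fake_semantic_search` is a dummy retriever, and every section number other than "§ 20-3-160" is a made-up placeholder.

```python
import re

# Hypothetical set of authorities that actually exist in the index.
KNOWN_AUTHORITIES = {"§ 20-3-160", "§ 20-3-130"}

SECTION_RE = re.compile(r"§\s*\d+(?:-\d+)+")

def spot_issues(question):
    """Stand-in for the small-model issue spotter: one issue per sentence."""
    issues = []
    for sentence in re.split(r"(?<=[.?!])\s+", question.strip()):
        if sentence:
            # Normalise spacing so "§20-3-160" and "§ 20-3-160" compare equal.
            cites = [re.sub(r"§\s*", "§ ", c) for c in SECTION_RE.findall(sentence)]
            issues.append({"text": sentence, "cites": cites})
    return issues

def retrieve(issue, semantic_search):
    """Exact lookup for sections the issue names, semantic search otherwise."""
    if issue["cites"]:
        return list(issue["cites"])
    return semantic_search(issue["text"])

def gate(candidates):
    """Fail closed: keep only candidates that resolve to a known authority."""
    kept = [c for c in candidates if c in KNOWN_AUTHORITIES]
    flagged = [c for c in candidates if c not in KNOWN_AUTHORITIES]
    return kept, flagged

def answer_sources(question, semantic_search):
    kept, flagged = [], []
    for issue in spot_issues(question):  # per-issue, not per-question
        k, f = gate(retrieve(issue, semantic_search))
        kept += [c for c in k if c not in kept]
        flagged += [c for c in f if c not in flagged]
    return kept, flagged

def fake_semantic_search(text):
    # Dummy retriever: returns one resolvable hit and one ungrounded hit.
    return ["§ 20-3-130", "§ 99-9-999"]

question = "Does § 20-3-160 apply to my custody case? Can I also get alimony?"
kept, flagged = answer_sources(question, fake_semantic_search)
print(kept, flagged)
```

The ungrounded hit ends up in `flagged` instead of in front of the drafting model, which is the whole point of the gate.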
Two hard-won caveats. One: your corpus is everything. It has to be the actual codes/caselaw in your language, clean, and chunked sanely. I just found MY pipeline was ~79% feeding the model its own prior output as "source," so validate what's literally in your index before blaming the retriever. Two: for non-English, evaluate retrieval IN-LANGUAGE with a small gold set ("this question → these correct sections"); don't trust English-centric benchmarks or vibes.
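Both caveats are cheap to check in code. A minimal sketch, assuming a gold set shaped as question → set of correct section IDs and an index where each chunk carries an `origin` field; that field name, the IDs, and the toy retriever are all hypothetical, and your real gold questions would be in your own language.

```python
def recall_at_k(gold, retrieve, k=5):
    """Mean fraction of gold sections found in the top-k results per question."""
    total = 0.0
    for question, correct in gold.items():
        hits = set(retrieve(question)[:k])
        total += len(hits & correct) / len(correct)
    return total / len(gold)

def generated_fraction(index_docs):
    """Share of indexed chunks whose provenance is the model's own output."""
    flagged = sum(1 for d in index_docs if d["origin"] == "model_output")
    return flagged / len(index_docs)

# Toy gold set: question -> the sections a correct retrieval must include.
gold = {"q1": {"A", "B"}, "q2": {"C"}}

# Toy retriever backed by canned rankings.
canned = {"q1": ["A", "X", "B"], "q2": ["Y", "Z"]}
def toy_retrieve(question):
    return canned[question]

print(recall_at_k(gold, toy_retrieve, k=2))
print(recall_at_k(gold, toy_retrieve, k=3))

# Toy index with provenance tags.
index_docs = [
    {"id": 1, "origin": "statute"},
    {"id": 2, "origin": "model_output"},
    {"id": 3, "origin": "model_output"},
    {"id": 4, "origin": "model_output"},
]
print(generated_fraction(index_docs))
```

Run the provenance check first. If most of the index is the model's own prior output, recall numbers against it don't mean much.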
TumbleweedNew6515@reddit (OP)
I’m just letting you use my max effort tokens to answer questions about my orchestrator. I must be a saint.
farkinga@reddit
I love your updates; the project is unhinged, you are fully self-aware, and you're getting real results from a technically difficult build. Nice work.
weldawadyathink@reddit
It appears that the front heat sinks are perpendicular to the air flow direction. May want to reorient those.
TumbleweedNew6515@reddit (OP)
I have a vertical metal tube air mover thing that just blows down into there. Have not had any overheating issues on those cards. The setup is not especially pretty but it works. Also, claims about power costs seem to be greatly exaggerated. Maybe an extra $80 per month.