Locally running LLMs on DGX Spark as an attorney?
Posted by Viaprato@reddit | LocalLLaMA | View on Reddit | 126 comments
I'm an attorney and under our applicable professional rules (non US), I'm not allowed to upload client data to LLM servers to maintain absolute confidentiality.
Is it a good idea to get the Lenovo DGX Spark and run Llama 3.1 70B or Qwen 2.5 72B on it, for example, to review large amounts of documents (e.g. 1000 contracts) for specific clauses or to summarize e.g. purchase prices mentioned in these documents?
Context windows on the device are small (~130,000 tokens, which is about 200 pages), but with "RAG" using Open WebUI it seems to still be possible to analyze much larger amounts of data.
I am a heavy user of AI consumer models, but I have never used Linux, I can't code, and I don't have much time to set things up.
I am also concerned about performance, since GPT has become much better with GPT-5, and in particular Perplexity, seemingly using Claude Sonnet 4.5, is mostly superior to GPT-5. I can't use these newest models but would have to use Llama 3.1 or Qwen 3.2.
What do you think, will this work well?
SillyLilBear@reddit
I would recommend 1+ RTX 6000 Pro, you will be much happier, unless speed isn't a concern.
Viaprato@reddit (OP)
Speed is a concern. Thanks.
Things are evolving so fast. How long will such a 10-grand investment really stay fast enough? If I buy a MacBook today, it's still fine in 4 years, but with AI graphics cards I'm not so sure.
SillyLilBear@reddit
It depends, but it should last quite a while. I would recommend against LLama models, they suck. If you are not technically inclined, then Strix Halo would be your best bet, but it will be dog shit slow.
Viaprato@reddit (OP)
Ok so both are terrible :) what to do then?
Kooky-Somewhere-2883@reddit
rtx 6000 pro. enough vram
worth it since this is your job
Viaprato@reddit (OP)
Thanks will look into this
Blindax@reddit
I have been experimenting with local LLMs and law for some time now.
I have the impression that the Spark is better at running one or several small models rather than big ones. The memory is high but the compute power is limited compared to a dedicated GPU. I would look into something else. You can get very good results (close to what GPT can do) with larger local LLMs (Qwen 235B, GLM Air, etc.) when using large context.
I have tried RAG a few times, but the results I got were much worse than full context injection. I think the documents need to be well prepared for it to be useful.
There are a few models that can manage more than 132k context. The latest Qwen models go to 264k. Granite models do 1M.
You don't strictly need Linux. You could use LM Studio and Open WebUI with Docker (you need to install WSL, but there is no real need to use the CLI).
Viaprato@reddit (OP)
Thanks! Which hardware do you use
Working-Magician-823@reddit
Before you buy anything, you can do the following
Option 2
cosimoiaia@reddit
The most stupid way to break the law and lose your practice. Don't upload any legal document anywhere unless it is a proven and identified server that you own and that runs on your trusted network.
Working-Magician-823@reddit
No one said to use your real client cases; use publicly available sample cases.
cosimoiaia@reddit
Not in EU, Data protection is serious here.
Viaprato@reddit (OP)
Google is really the dark lord GDPR-wise. They store everything forever, don't delete anything, transfer everything to the US
Working-Magician-823@reddit
Google and Microsoft have cloud data centers in the EU, you can choose an EU Region, so should be fine
For Option 2: I am not sure about OpenRouter.ai, but it is a large business; you can check their website before using it with real data.
As for eworker (E-Worker Inc., Canada), it runs on your machine, but I will review the EU guidelines as well - why not make it EU compatible.
Docker (for local AI installs) is a massive business; it surely has something for the EU.
Ollama? Not sure. I know they are on GitHub and it is used worldwide, so there is something - who knows.
cosimoiaia@reddit
Software that you download and can run offline is definitely fine. I have used Docker extensively since it came out.
I can't endorse Ollama because of their shady practices around licensing and models, but if you don't have an alternative, it's fine.
Online services (OpenRouter and similar) have tricky user agreements; I've read them for work and we ruled them all out.
Also, it's not enough to have data centers in Europe; companies that use Google and Microsoft services professionally have specific SLAs and other duties to comply with GDPR.
You can rent a GPU on Hetzner and still be in violation of regulations because you send data across a network that is not necessarily protected; at the very least you must have a data protection officer who is responsible for anything regarding user data and privacy.
It might seem complex from the outside but we have established practices for over a decade now, we know what's necessary and we barely think about it.
The best AI provider in terms of data security is Mistral, they are fully compliant with EU regulations, also they are very good in everything else.
Working-Magician-823@reddit
Thank you so much for the info, a question about:
"You can rent a GPU on Hetzner and still be in violation of regulations because you send data across a network that is not necessarily protected"
Is HTTPS considered protected communication in the EU, or do you mean something else by the above?
Viaprato@reddit (OP)
I would upload a "cleaned" version with no names, no amounts, and no meaning, just to test it. That's fine. But it's not efficient for everyday use.
Viaprato@reddit (OP)
That's a great suggestion. Before spending thousands of dollars, this is what we should do. However, we are one step before this..
Viaprato@reddit (OP)
Still a great instruction, really thankful for this! Will do it
FullOf_Bad_Ideas@reddit
I think it's a relatively good idea if you can afford to see it as an experiment and return the Lenovo DGX Spark if it doesn't work out for you. As long as you can let it do its thing for a long time, and maybe if you have access to someone from IT who could help you set it up.
You'd need to watch out for quality here - I am not sure if models are up to the task, especially if those documents aren't in English. And there will be some caveats. You might need to review their work carefully and it's likely that you won't be satisfied with their quality, but idk, it could work. I am not an attorney.
You can load up Seed OSS 36B Instruct with a context window of about 512K tokens - I think it might fit.
RAG won't capture details - those will get lost, as only a part of the context will be retrieved. For anything compliance related, I think it probably won't perform well enough.
Ideally, if you can set up a few specific tasks, you can have GPT-5 or a local model write you a script that will ingest documents and run them through a particular pipeline (rough sketch below). It might or might not be worth the effort. Spark should have decent compute power to analyze those documents relatively quickly - prefill isn't instant like with cloud models, but it should be in the realm of 300-2000 t/s on models like Seed OSS 36B Instruct.
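For illustration, a rough sketch of such a pipeline script, assuming a local OpenAI-compatible server (LM Studio, llama.cpp's llama-server, vLLM, etc.) is already running; the endpoint, model name, folder, and prompt are placeholders, not a tested setup:

```python
# Hypothetical clause-review pass over a folder of contracts that have already
# been converted to plain text. Assumes an OpenAI-compatible server is serving
# a local model at the URL below; adjust endpoint, model name, and prompt.
import json
from pathlib import Path

import requests

API_URL = "http://localhost:1234/v1/chat/completions"  # placeholder endpoint
MODEL = "local-model"                                   # placeholder model name

INSTRUCTIONS = (
    "You are reviewing a contract. List every change-of-control clause and "
    "every purchase price mentioned, each with a short quote and the section "
    "where it appears. Answer 'none found' if there are none."
)

def review(text: str) -> str:
    """Send one contract to the local model and return its answer."""
    resp = requests.post(
        API_URL,
        json={
            "model": MODEL,
            "messages": [
                {"role": "system", "content": INSTRUCTIONS},
                {"role": "user", "content": text},
            ],
            "temperature": 0.1,  # keep extraction as deterministic as possible
        },
        timeout=3600,            # long documents prefill slowly on local hardware
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

results = {}
for doc in sorted(Path("contracts_txt").glob("*.txt")):
    results[doc.name] = review(doc.read_text(encoding="utf-8"))

Path("review_results.json").write_text(json.dumps(results, indent=2, ensure_ascii=False))
```

Left running overnight, a loop like this is the "let it work in the background" pattern described above.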
Viaprato@reddit (OP)
Thanks, very helpful, especially about the model with the larger context windows
I figured Qwen because it performs similarly well to GPT-5 on benchmarks.
But I agree, RAG will not be very useful, since I can't ask questions then - I can only give "search and find" orders but not interact with the model on the entire content
Not sure about speed though - you say 300 t/s, others say llama will only produce 3 t/s which obviously is a 100x difference
FullOf_Bad_Ideas@reddit
My "300-2000 t/s" number is for prefill - when you send text to the LLM it does prefill on the text, so it ingests it, and then once it finishes ingestion, it outputs tokens. Prefill is typically as fast as your compute power is, and decode is down to how fast your memory is. DGX Spark has a reasonably good compute power, I think it's higher than AMD 395+, and it's roughly in the ballpark of RTX 3090. But it has memory speed 4x slower than RTX 3090. Since the tasks you've mentioned are all long context with relatively short answers, prefill number is important since slow prefill like 10 t/s would make it take a few hours to process a single 100k token document. Cloud-based LLMs have optimized their pipelines to make prefill fast, so you often don't notice it's there and think that LLMs can generate text immediately - that's not true and it requires compute, it's just hidden away.
For generating tokens, DGX Spark will not be fast - I imagine around 5-10 t/s for Seed OSS 36B Instruct 4bpw at high context length. But if your task isn't latency dependent and you can make it run in the background, it still might make sense and might make your work easier, even if it's not seamless. If you run multiple sessions at once, it'll also often be faster. For example, a single 3090 can support a single-session Llama 3 8B model at 50 t/s decoding speed, but if you send in 200 requests at once, it'll decode at 2000 t/s - it has the compute but can't use it fully on a single request due to the autoregressive nature of LLMs. This works very well for short context tasks where you can squeeze in the KV cache of all users easily. With long context tasks you can't parallelize as much, since you would need more VRAM to make it happen, but you can probably do 3-5 parallel generations on Spark if you need it to process hundreds of documents faster.
A dedicated full-tower desktop computer with 2x RTX 3090 / RTX 5090 would be much faster, but I think you are attracted to the DGX Spark largely because of its small form factor - you can place it on pretty much any desk quite easily, while a custom PC that draws 1000W from the plug and weighs 30 kg would look out of place in most workplaces.
Alternatively, you could try using MoE models instead of dense models; there are some with 256K context length. For example Jamba Mini 1.7 - https://huggingface.co/ai21labs/AI21-Jamba-Mini-1.7
Viaprato@reddit (OP)
Thanks that's helping me understand what the relevant numbers are. According to most here this can't really be done by lawyers without starting a larger project with external consultants
cosimoiaia@reddit
I've done this a few times with a few law firms in EU so let me give you the good, the bad and the ugly.
The good:
- You're on the right track on the idea and the privacy side; churning through a ton of documents is a good job for AI.
- Your intuition is correct: context size is quite critical, and using a system to summarize, retrieve and extract information is a crucial point. You can't just send all the documents into a model, it will just get 'drunk'.
- If done correctly the results are amazing: you'll have an army of interns producing perfect documents for exactly what you need, and all the information will be perfectly stored, organized and retrievable in the way that best suits you at that particular moment.

The bad:
- You won't be able to run it on a single desktop machine; you need a bit more than that. Not hundreds of euros, but not your all-in-one MediaMarkt system either. You spend money on the backup systems required by law - consider this part of it.
- It's more complicated than just 'RAG' and Open WebUI (which I highly suggest you don't even touch, since it's full of bloatware and a security risk). You'll need a system with two major parts: one that builds a knowledge base from your docs and extracts organized, correlated information, and another that retrieves that information and writes whatever you need it to write. This is how an AI studies your docs.

The ugly:
- You can't do it yourself, and it won't be ready by yesterday. You need somebody who's done something kinda similar before.
- It's gonna take trial and error before it gets to the point where it is exactly how you want it. It's a fairly big project that can have a big positive impact on your business; you want it to be right and not missing details.
- It won't be one and done. Documents change, and so do needs. If you want to be able to use it consistently, some maintenance is required.

Of course, everything depends on the scale. If the information is in 10-20 docs, one decent PC might do the job and the setup can be done pretty easily. From there it gets more complex according to the amount of docs and information, just as if you were to ask an intern to do the job.
Viaprato@reddit (OP)
I wouldn't post on Reddit if we had the resources to hire a company to advise us on this and set it all up and maintain it. Unfortunately. But it will be OK. I agree that the effort needed should not be underestimated!
What you are suggesting is kind of a knowledge base that I can consult on everything I have available - e.g. all e-mails that I have ever sent, all contracts that I have ever drafted. It would be nice to drag and drop a client request into the LLM and ask: "I have had something similar before. Please dig out all the drafts and e-mails that we sent in that other matter and provide a summary with a table of everything that we had sent, including our statement of fees (from another directory)."
But would this not require the LLM to have ALL the information INSIDE the context window? That's pretty much impossible, as I have understood it.
What should be possible is using it as an agent that works through my laptop each time with a search function, somehow pre-indexed, right? But that's not really AI functionality, except for the agent steering the search and analyzing its outcomes.
BTW, many times it's just one document, especially when drafting or reviewing. But if I set this up, it would be good if it also works for analyzing larger amounts of data.
cosimoiaia@reddit
If you want to set it up by yourself, be ready to learn a lot of different things and fail a lot of times. Again, put a value on your time and see if it's worth it to you.
It's like saying I want to sue a few megacorps for a bunch of different things, but I don't wanna be bothered hiring a lawyer - ChatGPT can do it for me, so I can also do it locally, right? And then asking in a lawyer forum how I should set up my lawsuits.
Viaprato@reddit (OP)
My question was rather innocent. I acknowledge what you're saying, and you're probably right that running an AI locally is currently still a massive investment - not the $20/month for Pro, not the $5,000 for the DGX, but rather a project in the high 100,000s - and then you get something as capable as cloud models were 18 months ago. That's what most people here are saying.
Viaprato@reddit (OP)
Needless to say, I was thinking about a 5k project, I do not have the use case for a $ 500,000 project. Cleaning proprietary data and asking isolated questions is much cheaper
Serprotease@reddit
Not all the information will be passed to the context window.
This is what happens: your query + client request -> the query is rephrased into a few sub-queries -> the sub-queries are embedded -> these embeddings are checked against all your documents -> the top-x document chunks matching the queries are returned and sent to the LLM context -> the LLM spits out an answer based on your initial input and the retrieved relevant documents (so maybe 8-10k total context).
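A minimal sketch of that chain, assuming the sentence-transformers package for embeddings and any OpenAI-compatible local server for the final answer; the chunk texts, model names, and endpoint are placeholders:

```python
# Minimal RAG retrieval sketch: embed the query, score it against pre-embedded
# chunks, and build a small prompt from the top matches.
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

chunks = [
    "Clause 7.2: The purchase price shall be EUR 1,200,000 ...",  # placeholder chunk
    "Clause 12.1: This agreement is governed by ...",             # placeholder chunk
    # ... in practice, hundreds or thousands of chunks from your documents
]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 5) -> list[str]:
    """Return the k chunks most similar to the query (cosine similarity)."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q                  # dot product == cosine on normalized vectors
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

query = "What purchase price was agreed?"
context = "\n\n".join(retrieve(query))
prompt = f"Answer using only the excerpts below.\n\n{context}\n\nQuestion: {query}"

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",  # any OpenAI-compatible local server
    json={"model": "local-model",
          "messages": [{"role": "user", "content": prompt}]},
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```

Real pipelines add the sub-query rephrasing step and keep the vectors in a proper database, but the retrieve-then-answer shape is the same.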
Viaprato@reddit (OP)
Thanks, that's helpful
cosimoiaia@reddit
Yes, that's a basic RAG. But it doesn't really work for law workflows - too much loss and misinformation.
Viaprato@reddit (OP)
Yes, that's correct
mcAlt009@reddit
You don't need a full company. You need a contractor to spend at least a month setting this up and helping you and your team to understand exactly what an LLM can do.
The tech economy is in shambles, it won't be hard to find someone with experience who seriously needs work.
This isn't something you can typically get right if you don't have a significant tech background.
Now, if you're going to just ignore everything I said and try to set it up anyway, go ahead and build a normal desktop with a 5090. Install Pop!_OS; it comes with the Nvidia drivers.
https://system76.com
Then use Chat GPT to write a Python script to process the documents against Ollama.
This is all hypothetical and probably isn't a great idea.
Seriously, find a way to hire help. I reckon you can get someone decent for $50 an hour. No one is hiring in tech right now and most companies don't hire from November until mid January.
cosimoiaia@reddit
If you have a solid experience on this, you don't seriously need work 😉
mustafar0111@reddit
I'd be very dubious about trusting AI to accurately summarize legal documents and clauses given the consequences of mistakes.
Viaprato@reddit (OP)
I use the large cloud models a lot (in particular gpt-5 and claude sonnet 4.5 via perplexity) and I'm well aware of what they can do and what they cannot do.
LLMs have gotten very good at extracting and computing information; they are certainly not yet perfect at understanding legal information. Reviewing stuff yourself is crucial, but LLMs can save a lot of time.
And sometimes, if you can't review stuff at all due to capacity constraints, it can be better to do a quick and dirty review (and not disclose it) than to do nothing.
Also, I would like to have the LLMs proofread the contracts that we draft, but as I said, privacy concerns prevent us from uploading the entire document (including purchase prices etc.).
samplebitch@reddit
I can say that for work I did a proof of concept where we scanned fairly long PDFs which were scans of B2B contracts (so each page in the PDF was really just an image, as opposed to text that you could highlight/copy, like in a Word document). The goal was to extract certain data points - who the two parties were, key delivery dates, contact names, office locations, and any terms that fell outside of typical contract terms - and for each datapoint, a reference to the location in the document where each datapoint was (Page 5, paragraph 3, etc).
The point was to reduce time (and costs) involved for the legal department. Essentially a first pass to find the data, this would then be passed to someone in legal who would be able to quickly locate the information and confirm that it was accurate. So, this wasn't to replace someone but to save time and money.
I work from home and have an RTX 4090 on my home PC (until recently, the top of the line video card - still quite capable). I was able to run a model on that machine and using LM Studio you can have it act as a server. I could then load a PDF on my work laptop, connect over local network to the server, pass it the PDF and prompt for extracting the desired content, get back results using structured response (forces the output into a specific format without all the usual LLM chat fluff), save to excel.
It worked surprisingly well. It was not 100% accurate due to the quality of some of the PDFs (copy of a copy of a fax, etc.), but considering the goal was to reduce (not eliminate) human involvement, it was considered a success. And this was back in March/April, so some of the newer vision models are likely even more accurate now (Qwen3 VL, DeepSeek OCR, etc.). I actually ran each document through multiple times using different vision models, prompts and temperatures to compare outputs. The more often each data point's extracted info matched across passes, the more confidence I had that the information was accurate. If memory serves, I think Qwen 2.5 VL was the most accurate - it was the latest model available at the time.
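For reference, a hedged sketch of that structured-response step against a local OpenAI-compatible server (LM Studio exposes one); the endpoint, model name, and schema fields are placeholders, and the OCR step that produces the contract text is assumed to have happened already:

```python
# Hedged sketch: ask a local model to return contract datapoints as JSON that
# matches a schema, so the output can go straight into a spreadsheet.
# Recent LM Studio / vLLM builds accept OpenAI-style "json_schema" response_format;
# endpoint, model name, and fields below are placeholders.
import json
import requests

schema = {
    "type": "object",
    "properties": {
        "party_a": {"type": "string"},
        "party_b": {"type": "string"},
        "purchase_price": {"type": "string"},
        "key_dates": {"type": "array", "items": {"type": "string"}},
        "source_location": {"type": "string"},  # e.g. "page 5, paragraph 3"
    },
    "required": ["party_a", "party_b", "purchase_price"],
}

contract_text = open("contract_0001.txt", encoding="utf-8").read()  # OCR'd beforehand

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "local-model",
        "messages": [
            {"role": "system", "content": "Extract the requested datapoints. Cite where each was found."},
            {"role": "user", "content": contract_text},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "contract_fields", "schema": schema},
        },
    },
    timeout=1800,
)
fields = json.loads(resp.json()["choices"][0]["message"]["content"])
print(fields)
```

Running the same document through a couple of different models or temperatures and keeping only the fields that agree, as described above, is then just a loop around this call.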
I know of the DGX Spark but haven't looked into it too much - I've heard they're a bit overhyped and not as powerful as they claim them to be. They should definitely be capable of doing what you're looking to do, but you could definitely get away with something cheaper - a regular computer with a high end video card should be enough. You also don't really need RAG for your use case - you just want to extract information from a document, unless you want to search across documents ('find me all contracts involving client XYZ', etc).
Alternatively you could use something like Azure. Yes, it's cloud based but it's private infrastructure you're provisioning, so it's basically 'your cloud', running whatever models you want, not like you're sending a sensitive document to ChatGPT or Claude where you can't be entirely sure the chats are 100% secure and not being data mined by the host company. A setup like this would allow you and anyone in your company to run the tool rather than having to keep a dedicated device on site.
Serprotease@reddit
If you’re not comfortable with Linux, maybe an Amd 365 AI 128gb based computer will work? They can run windows.
For the model, oss 120b is your best bet.
The main hurdle in my opinion is the rag pipeline.
You need to split your docs into chunks, make sure that tables and images are handled correctly, and save the embeddings to a local Qdrant/FAISS DB.
You can probably vibe-code something here (using IBM Docling and Qdrant/FAISS). But still, it's not plug and play.
The good thing is that you can later use citations in the response to see exactly where in the document the answer is.
The LLM part is the easy part. The tricky thing for good RAG is the document management. You can just throw a big document at your LLM, but remember: garbage in, garbage out.
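A sketch of that prep side, assuming Docling's basic converter API, sentence-transformers for embeddings, and FAISS for storage; chunk size, overlap, paths, and model names are illustrative only:

```python
# Hedged sketch of the document-prep side: convert a PDF with docling, split
# the text into overlapping chunks, embed them, and store the vectors in FAISS.
import faiss
import numpy as np
from docling.document_converter import DocumentConverter
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 1500, overlap: int = 200) -> list[str]:
    """Naive fixed-size character chunks with overlap; real pipelines split on headings/clauses."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# 1. PDF -> markdown-ish text (docling also extracts tables)
result = DocumentConverter().convert("contracts/contract_0001.pdf")
text = result.document.export_to_markdown()

# 2. Chunk and embed
chunks = chunk(text)
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vecs = embedder.encode(chunks, normalize_embeddings=True).astype("float32")

# 3. Store in a FAISS index (inner product == cosine on normalized vectors)
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(vecs)
faiss.write_index(index, "contracts.faiss")
# Keep the chunk texts alongside the index so hits can be mapped back and cited.
```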
profcuck@reddit
Just because OP is new to this area, I wanted to correct a small typo. I am pretty sure you meant AMD 395+.
Serprotease@reddit
Fixed, thanks!
btdeviant@reddit
A couple things - I’ve done some work for lawyers and law offices and if you’re trying to use a Spark for what you are accustomed to using hosted providers then, respectfully, you should hire someone. Legal has somewhat novel demands vs just plain generative text depending on what you’re trying to use it for, and the models that would run on a Spark would shit the bed on that front.
Also, and I say this in the most respectful, gentle way I possibly can, but Perplexity is a dogshit abomination and I can only hope you’re not using it in a professional setting.
agoodepaddlin@reddit
I think you're confusing the model's inbuilt parametric knowledge with document memory.
Review and summary of document memory is highly accurate and probably more trustworthy than a human.
They're not the same thing.
New-Significance6497@reddit
OK, so DGX Spark would not work. But how about if we build a 2x 5090 or 4x 5090 system? I am an attorney, but I could build something like that with no issues. Would that work better for the same use case?
Baldur-Norddahl@reddit
It is not easy to build something good with multiple consumer cards. Yes a lot of people here like the approach, but that is because they like the challenge. The cards can't be beside each other, because one will pull in hot air from the other. You won't find a motherboard that can fit more than max two anyway. So you will need PCI riser cables and cards laying all over the place etc. Not something that will look like a professional job.
Instead just get a 6000 Pro. Or even multiple of those. The max q version is designed to be next to another card, because it blows out of the back.
New-Significance6497@reddit
I can do the hardware. I used to build eth miners as a hobby. I know how to handle all problems. The question is whether we can use it for what this thread intends :))
neoscript_ai@reddit
I basically have the same setup for clinics and therapy offices, but with the Ryzen AI Max 395. It works pretty well with Qwen3 30B A3B.
sirfitzwilliamdarcy@reddit
I would get a maxed out Mac instead. It’s faster and you can use it for other things.
Viaprato@reddit (OP)
I am not an apple user. But I guess I can replace "mac" with any tower PC right
Baldur-Norddahl@reddit
No, Macs are special at this task. In particular you want the "M4 Max" or "M3 Ultra". The new M5 has not been released in this spec yet. The M4 Max has double the memory speed of the Spark and the M3 Ultra has three times more. That matters a lot. Compared to a tower PC it is more like 5 to 10 times faster.
There is also the M3 Ultra Mac Studio with 512 GB of unified memory. This thing is special because it can run the largest models. It is the only box you can buy where, without doing any assembling, you just turn it on and run Kimi K2 Thinking at q3 - models that actually surpass many of the commercial ones. Use it as a server, so you won't have to become a Mac user for this.
Loud_Communication68@reddit
Just get a macbook. Your clients will like you more
Viaprato@reddit (OP)
Why that? Pls explain
Safe_Leadership_4781@reddit
The tip toward the Mac is a good one. Either get a Mac M3 Ultra with 512GB of unified memory now, or wait for the M5 Ultra with hopefully 1TB of unified memory next year. It is slower than the Nvidia GPUs, but it is the best all-in-one solution. The local models keep getting better, but also keep getting bigger. The main challenge, however, will be ensuring cybersecurity, unless the machine is completely isolated, which already becomes difficult with updates. My setup: Langflow workflows with LM Studio as the server and MLX models, but so far only for testing, not in production.
Safe_Leadership_4781@reddit
Isn't this compliant with your professional rules: https://www.noxtua.com/de/? It is also available for Austria.
Loud_Communication68@reddit
They get like twice the memory bandwidth. If you're doing inference only, then it's like a faster version of the Spark for the same price.
A MacBook gets like 128 GB of RAM and you'll get laid more.
Lixa8@reddit
Hire someone who can. Let them do the research with cost analysis and whatever, instead of posting here. But from the simple fact that you posted this, you don't sound like someone who values IT.
Optimalutopic@reddit
You might be looking for this as well: https://github.com/SPThole/CoexistAI --> it can connect to the web, local files, docs, images, etc., all in local mode; you can create knowledge bases and do RAG directly.
Igot1forya@reddit
I have a Spark and set up Portainer for Docker. I then created a "stack" consisting of Ollama and OpenWebUI. Ollama serves up models for OpenWebUI, which acts as your friendly traditional browser LLM. It works shockingly well. I simply type in the model I want to try and it pulls the model down from Ollama's online model repository for local use.
Viaprato@reddit (OP)
Sounds great. Why are some people here then telling me I need to spend 100 hours and 400k, and then get the quality of legacy models?
Eugr@reddit
Just a few observations as Spark owner:
- It is a great little low-power device, but it's very constrained by its memory bandwidth. Its sweet spot is MoE models with a relatively small number of active parameters; GPT-OSS-120B fits that well. Dense models, even 32B ones, will be slow.
- You can get very similar gpt-oss-120b performance from a regular PC with an RTX 5090, and that PC will run dense models much faster.
- It has some quirks - it's not x86, so while most Linux stuff works on it, there will be a few things that won't. You probably won't have this issue, but it's something to be aware of. Also, the current kernel is not very well optimized. Model loading performance is terrible. I was able to improve it significantly (we are talking 20 seconds vs. a minute for gpt-oss-120b!) by compiling a new Linux kernel from the NVidia repo, but it requires some extra tinkering. Things get even trickier if you need something like vllm/sglang.
- For 2x price of Spark you can get RTX 6000 Pro that will be 5x faster and won't have any weird quirks of DGX Spark.
And a few things as just a general advice:
- Never trust LLM output, even the cloud ones, especially as a lawyer. Always verify. But you probably know that already.
- LLM performance starts degrading at large context, both in terms of speed and quality of output. So keep your context as small as possible, IOW, limit to a single contract - this will also help to avoid context poisoning to some extent.
- Make sure you set the context size properly, especially if you are using Ollama (but better don't, it's just not worth it in late 2025). Ollama used to set context to 2048 by default, then they raised it to 4096, and it's a rolling window, so you would never know that you exceeded it, but your model would start producing weird results.
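If you do end up on Ollama anyway, the context window can be set explicitly per request; a minimal sketch (the model tag and the 32768 value are placeholders to size to your hardware and task):

```python
# Hedged example: explicitly setting the context window (num_ctx) on a request
# to Ollama's local API instead of relying on the small default.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5:72b",                 # placeholder model tag
        "messages": [{"role": "user", "content": "Summarize the attached contract ..."}],
        "options": {"num_ctx": 32768},          # without this, the default window is tiny
        "stream": False,
    },
    timeout=3600,
)
print(resp.json()["message"]["content"])
```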
Igot1forya@reddit
Biggest reason is that the Spark is a grower and not a shower. 128GB of unified memory will get you roughly a 200B-parameter model. Double that for 2 Sparks. The GPT-OSS 120B model or DeepSeek R1 70B, for example, work decently enough for me. I have no pressure of time. I attach a document and walk away. It can take several minutes (like 6) on average for it to even finish thinking before it spits out text. I'm working on finding better models that strike a balance between speed and accuracy.
I'm also aware that Nvidia is still making updates to drivers and libraries. There's a number of known bugs right now and it's still early days.
These folks who recommend other options are not invalid in their sticking points, the Spark is amazing, but if you want an experience identical to the big players, you need to drop a ton of money on hardware. Of course, that depends on your goals and expectations.
I spent about 4 hours of my time setting up Portainer and the Ollama/OpenWebUI stack. It could be deployed in less than 15 minutes if I knew what I was doing when I started. I was setting it up in my homelab so I can deploy a similar setup at my work. But my work has several 100K in servers I plan to leverage which have several orders of magnitude more RAM so I can run Kimi K2 (1T parameters model) if I wanted.
The point is, the Spark CAN run your workload, just at a smaller scale. It's great for PoC but don't expect it to do it quickly or using the highest precision possible. For document summary, it should be fine, though others point out, it's best to use in connection with a pair of human eyes or at minimum a human double checking the output. It's totally an "upload in the morning and check results an hour later" kind of device.
Tyme4Trouble@reddit
I’ve got a Spark and tell you it would work and you have a use case where it might actually make sense.
You are talking about a workload that requires fast prompt processing, which the Spark can do if configured correctly.
You will need to use a terminal to get things running, but Nvidia has documented just about everything necessary to make it work.
Using GPT-OSS-120B you can expect about 6000 tok/s prompt processing using TensorRT LLM. That’s roughly 200 double spaced pages in 10 seconds. Faster with rag but no guarantees of success there.
Where you need to be careful is text generation. The Spark has pretty limited memory bandwidth which means text generation will be slow. But if you craft your prompt carefully you can have it generate a list of pages flagged for having the clauses you’re describing.
As a legal office I’m sure you’ve seen how dangerous trusting LLMs without confirming can be.
Some other warnings. You definitely do not want to use Ollama for this. Llama.cpp can work, but I recommend TensorRT LLM and MXFP4 / NVFP4 quantized models for the Spark. vLLM / SGLang could be viable but I haven’t tested them since Spark launched and at the time they didn’t support FP4 activations on gpt-oss. If you’re not using OpenAI’s open models then they should be fine.
You will get a lot of folks directing you toward Ryzen AI Max 395+ systems which are cheaper, but as I understand it they are much slower for this kind of workload.
If you can afford to spend 10-20K on a workstation, an RTX Pro 6000 Blackwell would be a better buy, but I wanted to answer your question first before pointing this out.
Viaprato@reddit (OP)
Thanks for the inputs, I will check them out.
Do you think the models mentioned will keep up quality-wise with the newest cloud models? I think GPT-5 is much, much better than GPT-4o, and the same goes for Perplexity, which has improved vastly (in the "research" mode). I get the idea that all of what I will achieve here will rather be in the GPT-4o world, if at all, even if some sources I found say that Qwen will perform almost as well as GPT-5, or better in some instances.
choikwa@reddit
Cloud-based offerings are in the trillions of parameters vs. billions for open models locally hosted on consumer-feasible hardware… in addition, you should be wary of hallucinations and lack of expertise in subjects the models weren't trained for.
GalaxYRapid@reddit
I’ll put it this way all open source and local models will be a step or more behind the cutting edge from anthropic or open AI. Your use case however makes it so you have no choice if you want to use LLMs to help speed up your work. Your best bet is set it up with the best model you can today, verify that it’s up to your standards, then come back in 6 months to see what’s new and how well that model will work. Currently gpt-oss-120B is a very solid build and is competitive with gpt-4o as far as output but that’s just from what I’ve heard my hardware isn’t capable of running it unfortunately. I’ve also had some luck with qwen3 but that’s mostly with programming so your mileage will vary there.
1H4rsh@reddit
I wouldn’t say Ryzen AI Max+ 395 is much slower. As far as I have researched (and I’ve done quite a bit since I just bought the Ryzen AI Max+ 395), the best thing that the DGX Spark has going for itself is much better support. Nvidia has made sure that everything is optimized whereas on the AMD side you have the community trying to patch things. In terms of actual hardware, they’re pretty much the same, with similar RAM and memory bandwidth, the two largest factors affecting LLM performance.
Since OP doesn’t have a tech background, the DGX Spark would still be a better buy but for the money, you’re still getting much more value with the Ryzen AI Max+ 395, which with some builds is almost half as cheap!
DataGOGO@reddit
No, that is not what the spark is for, build a server.
Viaprato@reddit (OP)
OK it's not the primary purpose, but would it work? Thx
abnormal_human@reddit
Could you technically run it? Yes. Would it run at a useful speed? No.
You're a freaking law office, don't cheap out. Get some RTX 6000 Blackwells and be done.
Also, getting performance out of RAG systems is not plug+play. You will need someone with MLE expertise to tune and customize the RAG for your use case, as well as established quality metrics, compute to run the experiments to validate, etc. You cannot just "i don't have much time to set this up" this problem--this stuff is young and not plug'n'play yet.
In general even for long context models (130k is large, not small for these things. Small is 4k), the performance is not super consistent in all parts of the context window. Especially when using small-mid-sized models like your 70Bs. I would absolutely not try to put 130k tokens of critical material into a model and expect any reliability, especially not with smaller models like these.
In your plan, you are collecting too many compromises that harm either performance (speed) or performance (quality), and doing so when there are real consequences for your clients. I realize that AI is not your expertise, but if you go down this road carelessly it will not end well.
Viaprato@reddit (OP)
Thanks for taking the time! I understand from your post that the output will not be in any way comparable to what I get when I upload to gpt-5 or perplexity research. Correct?
abnormal_human@reddit
Correct.
Viaprato@reddit (OP)
I guess it doesn't make much sense to spend $5,000 to get something like GPT-4o, which is already 18 months old now.
abnormal_human@reddit
There’s no model that runs on that box in a quant that will even get you to 4o, and the closest you can get on there will slow as dogshit. I think you are not computing what that product is for—namely ML engineers that deploy on NVIDIAs larger similar architecture but actually fast hardware.
On 4xRTX 6000 Blackwell workstation you can run stuff that is 4o level in 8bit. Thats more like a $50k machine and more typical of what I’d expect a law firm to start with if they were taking this idea seriously. You still need people/expertise but at least you’d be starting with suitable equipment.
Viaprato@reddit (OP)
Ok thx that's helpful (and too much money compared to what this will save us)
DataGOGO@reddit
$5k? Nope
You are looking at about $50k on the entry level to do it local.
DataGOGO@reddit
GPT-5 and other such services suck at what you are describing, as that is not what they are designed for. You can get much better results doing it locally.
starkruzr@reddit
yeah, something like an MGX spec machine is a lot better for this than a Spark. (feels obvious but here we are I guess)
Viaprato@reddit (OP)
Then it's not for us - I can do 2 hours of setup, but I cannot hire staff for this. I guess that's more on the "plug and play" side.
illathon@reddit
Or even better, pay someone who knows what they are doing to build it for you.
DataGOGO@reddit
Yes, but slow as hell .
Viaprato@reddit (OP)
Thanks for the feedback, then it will not be useful. I upload e.g. 50 documents, and need the output some hours later, and I would also need to be able to then ask questions and get immediate outputs
DataGOGO@reddit
Yeah, that is far more complex than you think.
No matter if you use RAG, an MCP, or a database, all of that just chews your context.
The right way to do this is one of the following:
1.) Take an open-weight model and fine-tune it on a dataset of your collective documents; they then become part of the model, not just loaded into context.
2.) Use a document management model, fine-tune it on your document formats, ingest all of your docs, and use a search service (likely presented via an MCP) with tool calls.
For number 2, Azure Document Intelligence is 100% the way to go, and meets your privacy and compliance needs.
Viaprato@reddit (OP)
That's all way out of our budget. Then it's cheaper to hand-clean our docs and upload them to the cloud on a $20 ChatGPT Pro license.
DataGOGO@reddit
Yep, it isn’t cheap.
A project like this, using our professional services would be at least 400k
Viaprato@reddit (OP)
😅 Jensen sells 5k machines. Do you happen to be selling these services?
DataGOGO@reddit
No, I own a consulting firm, we are not a hardware reseller.
We mainly service mid-size / enterprise customers.
To buy Nvidia hardware to do this would be at least 50k for say 5-10 users. To service 100 users, multiply that by 10.
The 400k is just the hourly labor to build and train the models.
Viaprato@reddit (OP)
I did not really understand why I need to train models. Models are out there and they are good
DataGOGO@reddit
Custom training is how you build knowledge into a model, without using context ($$$).
You either pay a little upfront for custom training, or you pay more over time for tokens in context.
Icy-Swordfish7784@reddit
According to lmsys.org the DGX spark decodes Llama 3.1 70B at 2.7 tokens/second. It will load but it will be very slow.
By comparison the M4 Max 128GB model does around 12 tokens/second.
Llama 3.3 70B Model Achieves 10 Tokens Per Second on 64GB M3 Max, 12 Tokens Per Second on 128GB M4 Max | DeepNewz AI Modeling
NVIDIA DGX Spark In-Depth Review: A New Standard for Local AI Inference | LMSYS Org
Viaprato@reddit (OP)
So with 50,000-token documents that's 308 minutes = 5 hours. Can I ask questions then with immediate answers once the model knows the data? It's still a long time to wait, not really usable.
Serprotease@reddit
No, the previous reply was not very clear here. He was using token generation speed (how fast the words are displayed in the answer).
To see how long your 50,000 tokens will need to be processed before the answer starts to be displayed, you need to look at the prefill speed. IIRC, for a 70B it's about 400 tk/s; for oss 120b, about 2,000 tk/s.
Prefill is something you don't really see when using an API, and this is the reason why companies buy H100/A6000 Pro and not Mac Studios for their multi-user local needs.
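A back-of-envelope version of that correction, using the rough speeds quoted above rather than measured numbers:

```python
# Rough wall-clock estimate for one 50,000-token document on a Spark-class box,
# using the approximate speeds quoted above (not benchmarks).
prompt_tokens = 50_000
output_tokens = 1_000        # a page or two of summary

prefill_speed = 400          # tok/s quoted for a dense 70B; ~2000 for gpt-oss-120b
decode_speed = 2.7           # tok/s quoted for Llama 3.1 70B decoding

prefill_s = prompt_tokens / prefill_speed   # time to read the document
decode_s = output_tokens / decode_speed     # time to write the answer
print(f"prefill ~{prefill_s/60:.1f} min, decode ~{decode_s/60:.1f} min")
# => roughly 2 minutes to ingest plus ~6 minutes to answer, not 5 hours;
#    the 5-hour figure came from applying the decode speed to the whole input.
```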
Viaprato@reddit (OP)
That's helpful thx
Icy-Swordfish7784@reddit
There's no cheap way to run large dense models like the llama 3 70B. All in one boxes mostly work with mixture of expert models that have small active parameters.
Viaprato@reddit (OP)
and that's something I cannot replicate locally right?
Icy-Swordfish7784@reddit
Realistically, you would either need a server with multiple 24GB GPUs, a very pricey workstation with 3x Nvidia A100 80GB (assuming you're running a q5 or q6 quantized model), or 6x Nvidia A40 48GB. This is needed to include overhead for a 130k-token context window (that isn't a small context).
Llama 3.3 70B: Specifications and GPU VRAM Requirements
coding_workflow@reddit
Avoid it. Better to get an RTX 6000 Pro and plug it into a Windows system rather than deal with the pain of the DGX system, which is Linux on ARM.
Disastrous_Look_1745@reddit
I've been dealing with a similar confidentiality challenge at Nanonets where we process sensitive docs for financial services clients. The DGX Spark route could work but there's a massive learning curve if you're not already comfortable with linux and model deployment. You're looking at probably 40-60 hours just to get everything running smoothly, and that's assuming nothing goes wrong. The performance gap between local models and GPT-5/Claude is real - we've tested Llama 3.1 extensively and while it's good, it's maybe 70-80% as capable on complex document understanding tasks.
For contract review specifically, the context window limitation is going to hurt more than you think. RAG helps but it's not magic - when you're looking for specific clauses across 1000 contracts, you need the model to understand relationships between different sections, cross-references, definitions that appear 50 pages earlier. We tried running similar workflows locally and kept hitting walls where the model would miss critical context. Plus setting up a proper RAG pipeline that actually works reliably is its own beast - you need vector databases, chunking strategies, retrieval optimization.
Have you considered a middle ground approach? Some firms I know use air-gapped cloud instances or work with vendors who can provide on-premise deployments with proper security certifications. There are also managed solutions where the compute happens in your controlled environment but someone else handles all the setup and maintenance. The time you'd spend wrestling with linux and debugging CUDA errors could probably pay for a proper enterprise solution. If you absolutely need local, maybe start with something simpler like LM Studio on a high-end workstation before jumping into DGX territory - at least you'll know if the workflow even makes sense for your use case.
Viaprato@reddit (OP)
That's helpful. Summary: a lot of work, a lot of money, and then the context doesn't suffice and LLM quality is what the cloud did 18 months ago.
Street_Smart_Phone@reddit
I would give AWS a good shake and see if they can fulfill your needs. They have AWS Bedrock, where they run models on their own internal servers, no logs are taken, and it costs the same as the API prices of Claude, ChatGPT, etc.
https://aws.amazon.com/bedrock/faqs/#security:~:text=Amazon%20Bedrock%20support%3F-,Amazon%20Bedrock%20offers%20several%20capabilities%20to%20support%20security%20and%20privacy%20requirements,Amazon%20Bedrock%2C%20without%20having%20to%20expose%20your%20data%20to%20internet%20traffic.,-Will%20AWS%20and
Ryanmonroe82@reddit
Check out Kiln AI and their RAG setup. You will need a Qwen embedding model and a Qwen VL model running at the same time; use PowerShell and "ollama serve". Open 2 new PowerShell windows and run the Qwen VL model (find them on Hugging Face) and then the embedding model. The embedding model can't be used with "ollama run"; type "ollama list", make sure your model is there, and then test it with "curl.exe http://localhost:11434/api/embed -d '{\"model\": \"qwen3-embedding:0.6b\", \"input\": \"test\"}'". Load all your documents into the place it tells you, customize (or don't) the chunking strategy, and voila. Very intuitive.
Viaprato@reddit (OP)
Sounds like something I cannot do :)
Also, from what I have read above, RAG is not what I need - right?
Ambitious-Profit855@reddit
I would say RAG is exactly what you need. Use RAG to retrieve the interesting text chunks and use your brain to analyze the retrieved text areas. Major advantage: you don't hallucinate when the search doesn't retrieve the correct text chunk; the AI often does. For testing you can of course also use AI on the retrieved text chunks, but as others said, you shouldn't trust the AI output alone.
For the DGX Spark: I wouldn't go for that. I would get an AI Max+ 395 with 128GB RAM. Use the remaining money to pay someone to set up the RAG solution. Test it, and when you decide it makes you money / saves you time, go for a (much faster) real solution using GPUs.
jacek2023@reddit
There are newer models, faster, smaller, wiser.
Viaprato@reddit (OP)
Which ones would you recommend for contract analysis? And would this hardware setup work?
weird_gollem@reddit
You cannot put information on the cloud, but having a server on premise doesn't have a limitation as far as I know. You can buy the hardware to get a good GPU, memory, SSD and everything you need, have that server contain everything (including the models), and isolate it so it cannot be accessed from outside the local network (meaning no internet access). Since it's on premise (not in the cloud), you're not sending your client information outside your organization (or your practice). As far as I understand, for all intents and purposes it's the same as having the information printed in folders in a drawer.
With that, you can have a model and use RAG/MCP or whatever you need. There you can see which model is better for what you need (it will be a local model, like many mentioned). Hope this info is useful for you, good luck!
ArchdukeofHyperbole@reddit
Why are you aiming to run dense models? There are larger-parameter MoE models that I bet would work as well as Llama 70B but generate much faster. For one, if you wanna go with Llama, there's Llama Maverick, which is an MoE model. OpenAI's gpt-oss 120B, Qwen Next 80B, and that new Kimi Linear 48B model should work for basic summaries and such, and they're all MoE models. Kimi Linear and Qwen Next both have linear attention, which means generation speeds aren't reduced as much with longer context, and they already generate faster than dense models of the same size. Anyway, I'd run MoE models on the Spark if I had a spare 5k to buy one.
KrugerDunn@reddit
This would not work for this purpose.
Viaprato@reddit (OP)
Why wouldnt it? Thanks.
KrugerDunn@reddit
It’s not to that spec. Why can’t a Honda Civic drive 200mph? Cuz it’s not a race car
ScienceEconomy2441@reddit
Get a MacBook or a Mac mini. Apple's hardware is great for your use case, and Apple is making running LLMs locally a part of their AI strategy. Just do a search for "running LLMs on Apple Silicon" and it should get you sorted out.
Viaprato@reddit (OP)
Please explain: How can a macbook do this, if a ryzen 9 tower is too slow?
ScienceEconomy2441@reddit
Sure read through this article: https://scalastic.io/en/apple-silicon-vs-nvidia-cuda-ai-2025/
Specifically, look at section 2.2 Inference, which covers possible tokens per second on Mac silicon with 70b model.
The use case described in the OP is specifically for inference, which as explained in that article, Apple silicon is optimal for this. Let me know if you need more info, happy to help.
Ryanmonroe82@reddit
Transformer Labs is another good one.
nmrk@reddit
Check with your local Bar Assn. to ask their opinion about using AI for legal work. Be sure to give them your Bar number.
Viaprato@reddit (OP)
Thank you - hosting it locally to double or triple check my work is just fine
DerFreudster@reddit
I'm beginning to feel like this sub needs a banner that says: No, DGX Spark, No. The hype around that thing is insane.
Viaprato@reddit (OP)
Somehow the personalized ads worked :)
new-to-VUE@reddit
Could you run a script that only does a small percentage of the docs at a time? If all docs and checks are independent of each other, you wouldn't need to load everything into context.
Viaprato@reddit (OP)
If we decide to spend some serious money (instead of cleaning docs for the cloud), it would be better if we could upload the docs and then run a dialogue on what's in there, to really understand it.
false79@reddit
I think these all in one boxes are great if you're not into making your own multi-GPU inference server/station.
However, I wouldn't use it as the only device in your infrastructure. Imagine you had all those confidential files on a single offline device with no backup, and disaster strikes. You would be at a loss trying to replace what you've lost.
Viaprato@reddit (OP)
We certainly have ALL files safely stored in multiple locations. We're just not permitted to upload them to US-based LLMs.
So your first sentence means that what I described would work? Thx.
false79@reddit
Yeah, I think it would. It wouldn't be fast, but it should be faster than your reading speed, provided you have your RAG system set up and you're not afraid of Linux.
If linux is not your thing, there is the cheaper and in some cases faster 128GB Strix Halo machines that have Windows 11.
Viaprato@reddit (OP)
Thanks, I'll look into this
Fun_Smoke4792@reddit
You can test on your own data and establish a standard for yourself. When there are new, better models, try them on your test. Don't use an LLM as the judge.
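One low-tech way to do that is a small hand-made answer key that every candidate model gets scored against; a sketch, where the CSV files and column names are placeholders you would adapt to your own test set:

```python
# Hedged sketch of a tiny evaluation harness: compare the fields a model
# extracted (dumped to CSV by whatever pipeline you test) against a hand-made
# answer key, so each new model is scored on the same private test set.
import csv

def load(path: str) -> dict[tuple[str, str], str]:
    """Read rows of (doc_id, field, value) into a lookup table."""
    with open(path, newline="", encoding="utf-8") as f:
        return {(r["doc_id"], r["field"]): r["value"].strip().lower()
                for r in csv.DictReader(f)}

expected = load("answer_key.csv")     # built by hand, once
predicted = load("model_output.csv")  # produced by the pipeline under test

hits = 0
for key, truth in expected.items():
    got = predicted.get(key, "")
    if got == truth:
        hits += 1
    else:
        print(f"MISMATCH {key}: got {got!r}, expected {truth!r}")

print(f"exact-match accuracy: {hits}/{len(expected)}")
```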
Prestigious_Thing797@reddit
If you don't want to use Linux, you're gonna have a bad time with it.
I would look at the AMD Ryzen AI Max 395+ with 128GB of combined memory.
You can put Windows on it and run LM Studio or something pretty easily.
It won't be as fast as setups with dedicated GPUs or proper hosting through vLLM or SGLang, but it'll work and be easier to set up.
MitsotakiShogun@reddit
Can't you get a subscription to a legal services company? They probably handle your data as carefully as you do, or maybe more so. E.g. some services I've heard people use (in alphabetical order):
* https://www.harvey.ai/solutions/transactional
* https://www.lexisnexis.com/en-us/products/lexis-plus/document-analysis/agreement-analysis.page
* https://legal.thomsonreuters.com/en/products/highq/contract-analysis
Not sure untuned open models are the best. I remember there being one or more open finetunes for legal data, but they're likely outdated at this point. Also RAG can be deceptively easy at first, but you will quickly see that it's not as easy as "just drop it into a vector DB and let WebUI handle it".
And Spark (or Strix Halo, or Mac) is not great for batch processing. If you only analyze a few contracts a day and these 1000 are over the course of multiple hours or days, it will likely be fine though, but you likely have better options regardless. And most importantly, don't get a Spark if you're not a developer, the Arm CPUs paired with the Blackwell GPU are not going to do you any favors. See if you can get a couple 5090s for roughly the same amount. But again, in your situation I'd first look into your cloud options.