Requesting advice on local AI setup for academic use
Posted by The_Paradoxy@reddit | LocalLLaMA | View on Reddit | 15 comments
I'm about to do a clean install of Ubuntu 26.04 on a desktop that has a 5060ti 16gb and a 4060ti 16gb. Can you help me work out the best local AI setup for my use cases? All advice no matter how minimal is greatly appreciated, 🙏 thank you!
My most immediate question is vLLM vs llama.cpp and with what settings? But I'm also trying to figure out what sort of agent workflow makes sense for me. I am concerned about security if that makes a difference between llama.cpp and vLLM or between all of the different agent harnesses. I've heard that I should disable thinking for Hermes, but would that also make sense for open code? Is it possible to do multiagent orchestration on my hardware or do I need to dream a little smaller? If I want to be able to remotely ssh into my desktop to use agents, what are best practices for security?
Full specs
GPU 1: 5060ti 16gb on pcie gen 5 x16
GPU 2: 4060ti 16gb on pcie gen 4 x4
CPU: 7950x3d
Motherboard: B650 aorus pro
USE CASES:
Code documentation and generation:
- I do research using computational game theoretic models. My code makes heavy use of numpy, numba jit compiling, and is written for performance (parallelizing as many independent computations as possible) and is not written for easy readability/interpretability. My understanding is that, if I want actually useful code assistance, the first thing I need to do is generate clear documentation what my code is doing, and how it is implementing a model as described in a paper.
- Once I've gotten the code reasonably documented I'm hoping I can get decent assistance at extending my models without butchering all of the optimizations I've put into my code. Any advice on agentic workflow for coding complex dynamical systems, or any context in which you make relatively abstract use of array operations, is much appreciated.
Research writing assistance:
- I am hoping that I can use an agent to search the Internet for relevant background literature and to compile summaries of what it finds.
--- however I am concerned about security for this. How much is an issue is prompt injection for local AI? Are there any best practices for using an agent for broad web search?
--- I'm also wondering in anyone had advice on prompting for this long is work. I'm my experience LLMs tent to focus more on key word similarities rather than a paper's actual content. This is a big issue for me since I do interdisciplinary research where the most relevant terms on a topic differ between researchers who are trained as economist, anthropologists, cognitive scientists, etc. . I'd really appreciate any advice on how to get a model to pay attention to the bigger picture, what conclusions are being drawn, and to not over index on key words or what happens to be said in the first couple pages of a paper
(Possible use case) Question answering for students:
- I teach an intro data science class and often spend time responding to student emails with simply telling them where to look in the lecture notes or giving them Socratic questions to help them think through their problem. I'd love to be able to set up an email address that the students can use to ask an AI questions where the AI has access to lecture notes and has learned to not just give students the answers but instead to help them think through the problem. I only have about 100 students a semester, so I'm not too concerned about heavy traffic. My biggest concerns are:
--- All of the local models I can run will have a bias towards just giving students the answers rather than helping them think no matter how much I try to prompt them to reply to emails in a particular way.
--- This feels like it will be asking for trouble from students who are just trying to cause problems. If I give an agent access to an email address, are students going to be able to prompt it to change the password for the email address?
onyxlabyrinth1979@reddit
For your setup I’d lean llama.cpp for control and isolation, especially if you’re worried about security. vLLM is great for throughput, but it’s more of a serving layer. For agents, keep it simple first, one loop, tight tools, no open browsing. Prompt injection is real even locally if you pipe in web data, so treat external text as untrusted.
PuzzleheadedMind874@reddit
With your dual 16GB setup, vLLM is great for high-throughput serving, but llama.cpp might feel more responsive for iterative coding tasks where you need to quickly swap models or quantizations. You could try managing these complex agentic flows through a modular, node-based system (I'm building Heym for this, https://github.com/heymrun/heym). It's probably safer to avoid exposing your agent host directly to the internet, so sticking to a WireGuard or Tailscale tunnel is a sensible way to handle your remote SSH needs. Running internet-connected agents carries inherent risks regarding prompt injection, so you should isolate the agent's environment from your primary credentials and system configuration files. Separating your research workflows from your student-facing RAG pipeline into distinct sandboxed containers will give you much more control over how models handle sensitive tasks like email interactions.
natermer@reddit
I don't know which is better, but I install llama through Linux brew and it works fine for me.
The settings are going to depend on the specific model. Llama.cpp depends on "GGUF" model format and because of that I tend to use usloth models. On the popular models they have documentation and blog posts on good settings for the models they offer.
I use llama-fit-params script to help figure out how to size/allocate the LLMs to my GPUs. You'll have to set the context size to the workflow you use and the specific model. I use llama-server to then provide API access to the LLM for the agents and my editor.
Agents are a security risk as they can make autonomous changes on your system. Popular agents like Claude Code have guardrails in place and a permission model were you approve changes before they are made... but I feel it is giving people a false sense of security. A good agent has the ability to do most anything you can, including downloading things and making arbitrary changes or reading secrets stored on the file system, etc.
Llama.cpp or vLLM just run the models. They don't do anything besides that. They can't make changes or interact with the outside world on their own. They can't read files or go out on the internet.. they just run the models. They can use what you feed them, but that is about it. It is the software that you hook up to use through them that can provide more then just chat interface.
For security for Agents the ideal setup is to have a system dedicated for them. That way you can control what they have access to.
Otherwise you can containerize them or run them in VMs.
Most people just run agents directly on their workstations, but you have to be careful.
Bootes-sphere@reddit
Dual 16GB cards is a solid setup. You've got \~32GB vRAM to work with, which opens real options.For academic work, I'd split it differently than most would: keep one card dedicated to inference (Llama 3.1 70B fits clean on a single 16GB), let the other handle batched processing or fine-tuning jobs. Avoids context switching and gives you reproducible performance for papers.
Ubuntu 26.04 + CUDA 12.x + vLLM or Ollama on the inference side. If you're doing any training, throw your data at the second card with LoRA—way cleaner than OOM hunting.
What's your actual workload? Are you running experiments, building datasets, or just needing a capable local backbone for research? That changes the stack pretty significantly.
bluelobsterai@reddit
My hot take r/Proxmox then install 24.04 and pass GPU’s through.
The_Paradoxy@reddit (OP)
I've been having trouble figuring out what the benefit of proxmox over simple docker containers is. Do you mind elaborating?
bluelobsterai@reddit
For me, it's about ease of default installation and then having a lot of tools like ZFS and Ceph at my fingertips. I have 3 hosts in my homelab - not just one dev server. If I did have just one dev server, I might run eight or twelve virtual machines, on those virtual machines I might run containers. All this can be done via 24.04 but it’s less easy.
For instance, let's say you wanted to mess with Kubernetes. I would take three virtual machines, assign them each two virtual cores and 4 GB of RAM. I would have that be my control plane. Then I would create three more virtual machines with 8 GB of RAM each and four virtual cores. Those would be my compute nodes, and I would have my control plane operate my compute nodes. Learning Kubernetes using one system with, say, 64 GB of RAM would be an amazing developer platform.
Advantages to me would be: - ZFS - Proxmox Backup Server - simple installation of a root mirror - simple web access - headless UI/UX - hardware-compatible with my ConnectX 5 and ConnectX 6 cards
So many reasons. I really, really, really like Proxmox as a hypervisor only.
But for you, the most important would be ease of GPU pass-through. If you use Ubuntu 24.04 with GPU pass-through, you're going to have to set up a lot of things manually. Whereas Proxmox 9 is going to make that pretty trivial, and the pass-through is just going to work.
You can create some SSH keys and let your clanker do most of the work from your desktop computer.
bluelobsterai@reddit
Oh, you can do this all with Terraform, so your scripts are portable to the cloud. When you spin up your EKS cluster in Amazon, having this Terraform Ansible experience in your home lab is priceless.
The_Paradoxy@reddit (OP)
Okay thanks I hadn't thought much about GPU pass through. Won't most agent harnesses have built in support for docker containers and pass through? I thought that was standard on Open Code
Ha_Deal_5079@reddit
llamacpp is solid at 32gb. qwen2.5-coder on ollama does the job for coding agents - and skillsgate on github handles the config mess between tools if you start using a bunch of em
The_Paradoxy@reddit (OP)
Thanks! I wasn't aware of skillsgate
Ok-Lobster-919@reddit
You can run a good local model for orchestration, tool calling, parsing and basic analysis probably.
But I have tried so many models, really nothing beats a call to an expensive model like Opus on openrouter.
I guess you could try a local model for coding, and have the agent orchestrator detect when it is failing, and then send a request through openrouter to correct the local model. You can get a lot of code out of opus for a few bucks. It does add up fast though.
The_Paradoxy@reddit (OP)
Any suggestions on what to use for orchestration? Any opinion on Turnstone?
Desperate-Body-5462@reddit
For your GPU combo, llama.cpp with tensor split is probably the easier starting point vLLM's multi-GPU support works better when both cards are identical, and the 5060ti/4060ti mismatch can cause uneven load. You can still get good throughput with llama.cpp splitting layers across both
For the coding use case with numba/numpy-heavy code, I'd lean toward a model like Qwen2.5-Coder or Devstral they tend to handle array-heavy and performance-oriented code better than general models. Disabling thinking for Hermes makes sense for most tool-calling workflows, it reduces latency and repetition without hurting much on structured tasks
On the student email agent the prompt injection concern is real. Worth sandboxing it so the agent only has read access to lecture notes and can only reply, not take any account actions. Giving it a very tight system prompt with explicit you cannot change settings or passwords" instructions helps, but I wouldn't give it write access to anything sensitive regardless
For research literature search, the keyword-over-content problem is tricky from what I've seen, chunking papers at the section level and including abstracts plus conclusions separately in your RAG pipeline helps the model reason about actual content rather than just matching terms
FullstackSensei@reddit
Nothing screams bot more than referencing a model from 2024q