jaMMint

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer

Posted by One_Slip1455@reddit | LocalLLaMA | View on Reddit | 226 comments

jaMMint@reddit

Let me know how it goes, thanks for trying it out!

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer

Posted by One_Slip1455@reddit | LocalLLaMA | View on Reddit | 226 comments

For folks using Blackwell cards (e.g. 5090 or RTX 6000 Pro), here is a guide I wrote to reach up to 120 t/s for the dense 27B model, and up to 200 t/s for the 35B MoE Qwen 3.6: https://github.com/lastloop-ai/vllm-blackwell-guide

local vibe coding

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 146 comments

jaMMint@reddit

Try vanilla opencode and tell it to add the Playwright MCP server to opencode. Once that is active you are halfway there; closing the feedback loop turns a meh coder into a great one.

Getting slow speeds with RTX 5090 and 64gb ram. Am I doing something wrong?

Posted by Virtual-Listen4507@reddit | LocalLLaMA | View on Reddit | 37 comments

jaMMint@reddit

Your RTX 5090 has 32GB of VRAM, so try to stay well under that with the model itself (so that the context also fits into VRAM). The moment you spill over into system RAM, your speeds drop quite a bit.
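A rough way to sanity-check this before downloading anything is to add the size of the weight file to an estimate of the KV cache. The sketch below assumes a standard transformer with grouped-query attention and an fp16 cache; the layer count, KV head count, head dimension, and the 2 GiB headroom are made-up placeholder values, so read the real ones from the model's config.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate KV cache size: keys + values for every layer and token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1024**3


def fits_in_vram(weights_gib, kv_gib, vram_gib=32.0, headroom_gib=2.0):
    """True if weights + KV cache still leave headroom for activations and the desktop."""
    return weights_gib + kv_gib + headroom_gib <= vram_gib


# Hypothetical model: 48 layers, 8 KV heads, head_dim 128, 32k tokens of context.
kv = kv_cache_gib(n_layers=48, n_kv_heads=8, head_dim=128, context_tokens=32768)
print(f"KV cache: {kv:.1f} GiB")
print("18 GiB quant fits:", fits_in_vram(18.0, kv))
print("28 GiB quant fits:", fits_in_vram(28.0, kv))
```

With these example numbers the smaller quant fits with room to spare, while the larger one would push part of the model or cache out of VRAM.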

Local programming vs cloud

Posted by Photo_Sad@reddit | LocalLLaMA | View on Reddit | 59 comments

jaMMint@reddit

I use a q3, works very nicely with around 90k context.

Local programming vs cloud

Posted by Photo_Sad@reddit | LocalLLaMA | View on Reddit | 59 comments

jaMMint@reddit

there is also GLM-4.7 for 192GB, otherwise good assessment.

Just got an RTX Pro 6000 - need recommendations for processing a massive dataset with instruction following

Posted by Sensitive_Sweet_1850@reddit | LocalLLaMA | View on Reddit | 39 comments

jaMMint@reddit

You should probably just prepare a couple of test cases from your data and then try out some models. E.g. gpt-oss-120B is very performant on the RTX 6000 Pro and could be a good start. Obviously, if you can get away with smaller and even faster models, use them.
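A minimal sketch of what such a comparison could look like, assuming exact-match scoring is good enough for your task. The two "models" here are plain Python functions standing in for real inference calls (e.g. a request to a local server), and all names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

TestCase = Tuple[str, str]  # (input text, expected output)


def score_model(model: Callable[[str], str], cases: List[TestCase]) -> float:
    """Fraction of test cases where the output matches exactly (after strip)."""
    hits = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return hits / len(cases)


def rank_models(models: Dict[str, Callable[[str], str]], cases: List[TestCase]):
    """Return (name, accuracy) pairs, best first."""
    scores = [(name, score_model(fn, cases)) for name, fn in models.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Stand-ins for real models: one always follows the instruction, one only sometimes.
def big_model(prompt: str) -> str:
    return prompt.upper()


def small_model(prompt: str) -> str:
    return prompt.upper() if len(prompt) < 6 else prompt


cases = [("hello", "HELLO"), ("instruction", "INSTRUCTION")]
ranking = rank_models({"big": big_model, "small": small_model}, cases)
print(ranking)
```

If the smallest model already scores well enough on your own cases, there is no reason to pay for the bigger one in throughput.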

I made Soprano-80M: Stream ultra-realistic TTS in <15ms, up to 2000x realtime, and <1 GB VRAM, released under Apache 2.0!

Posted by eugenekwek@reddit | LocalLLaMA | View on Reddit | 108 comments

jaMMint@reddit

Also interested

Gertrude, a 94-year-old widow, was heartbroken after her husband Harold died. She decided to end it all with his old Army pistol.

Posted by dinosaurer@reddit | Jokes | View on Reddit | 75 comments

jaMMint@reddit

That pun was a low hanging fruit.

I can't be the only one annoyed that AI agents never actually improve in production

Posted by GloomyEquipment2120@reddit | LocalLLaMA | View on Reddit | 12 comments

jaMMint@reddit

And it is indeed shilling for a startup...

Most Economical Way to Run GPT-OSS-120B for ~10 Users

Posted by theSavviestTechDude@reddit | LocalLLaMA | View on Reddit | 44 comments

jaMMint@reddit

You can just limit the power draw of the RTX Pro in software; same thing, but better really.
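On NVIDIA cards the cap can be set with `nvidia-smi -pl`. A small sketch that builds and (optionally) runs the command; the 300 W value is an arbitrary example, not a recommendation, and the allowed range for a given card can be checked with `nvidia-smi -q -d POWER`.

```python
import subprocess


def power_limit_command(watts: int, gpu_index: int = 0) -> list:
    """Build the nvidia-smi call that caps one GPU's power draw."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def apply_power_limit(watts: int, gpu_index: int = 0) -> None:
    """Run the command; this needs admin/root rights and a supported driver."""
    subprocess.run(power_limit_command(watts, gpu_index), check=True)


# Only print the command here; call apply_power_limit(300) to actually set it.
print(" ".join(power_limit_command(300)))
```

As far as I know the limit usually has to be reapplied after a reboot, so it is worth putting into a startup script.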

[Looking for model suggestion] <=32GB reasoning model but strong with tool-calling?

Posted by ForsookComparison@reddit | LocalLLaMA | View on Reddit | 20 comments

jaMMint@reddit

You can try a human-like method for not forgetting a step in the sequence, similar to the "method of loci", or method of places. You are to complete one journey through the house; there are 6 rooms you have to go through in the correct order. In each of these rooms you MUST complete a task (call a tool) in order to be able to proceed. 1) You stand on the porch and open the front door. Toolcall 1 ... 2) You enter and stand in ... You could also use landmarks/landscapes or anything really that anchors the thought process in three-dimensional space. In humans that works well because of our very sequential planning and executing, together with our continuous experience of 3D space. It could work well for LLMs too.
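A minimal sketch of how the journey could be generated from a list of steps, plus a small check that the tool calls really happen in order. The room and tool names are invented for illustration, and whether this actually helps a given model is something you would have to test.

```python
def build_journey_prompt(steps):
    """Turn (room, tool) pairs into a numbered walk-through the model must follow in order."""
    lines = [
        f"You complete one journey through the house. There are {len(steps)} rooms, "
        "visited in the order below. In each room you MUST call the listed tool "
        "before you may move on."
    ]
    for number, (room, tool) in enumerate(steps, start=1):
        lines.append(f"{number}) Room: {room}. Tool call: {tool}")
    return "\n".join(lines)


def next_expected_tool(steps, calls_so_far):
    """Return the tool that must be called next, or None when the journey is done."""
    for (_, tool), called in zip(steps, calls_so_far):
        if tool != called:
            raise ValueError(f"out of order: expected {tool}, got {called}")
    return steps[len(calls_so_far)][1] if len(calls_so_far) < len(steps) else None


# Hypothetical three-step sequence.
steps = [("porch", "fetch_ticket"), ("hallway", "search_docs"), ("kitchen", "write_reply")]
print(build_journey_prompt(steps))
print(next_expected_tool(steps, ["fetch_ticket"]))
```

The check on the calling side means a skipped room is caught immediately instead of surfacing as a wrong final answer.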

Local models handle tools way better when you give them a code sandbox instead of individual tools

Posted by juanviera23@reddit | LocalLLaMA | View on Reddit | 43 comments

jaMMint@reddit

Look at https://github.com/gradion-ai/freeact, it's similar to what you want to achieve. Code runs in a container and the agent can add working code as new tools to its tool-calling list.

How you get over 200 tok/s on full Kimi K2 Thinking (or any other big MoE Model) on cheapish hardware - llama.cpp dev pitch

Posted by _serby_@reddit | LocalLLaMA | View on Reddit | 35 comments

jaMMint@reddit

I propose we improve all LLM WebUIs to include a "Submit output straight to reddit" button. I guess that makes your work easier.

Dynamic LLM generated UI

Posted by ItzCrazyKns@reddit | LocalLLaMA | View on Reddit | 7 comments

jaMMint@reddit

So how exactly do you give it access to your component library?

How much VRAM needed for Qwen3-VL-235B-A22B

Posted by Ok_Television_9000@reddit | LocalLLaMA | View on Reddit | 11 comments

jaMMint@reddit

thx

How much VRAM needed for Qwen3-VL-235B-A22B

Posted by Ok_Television_9000@reddit | LocalLLaMA | View on Reddit | 11 comments

jaMMint@reddit

what speed do you get with this setup?

Finishing touches on dual RTX 6000 build

Posted by ikkiyikki@reddit | LocalLLaMA | View on Reddit | 165 comments

jaMMint@reddit

reddit posts

Finishing touches on dual RTX 6000 build

Posted by ikkiyikki@reddit | LocalLLaMA | View on Reddit | 165 comments

jaMMint@reddit

Yeah, I think 128 cuts it too close for loading into VRAM if they use models larger than that.

Top small LLM as of September '25

Posted by _-inside-_@reddit | LocalLLaMA | View on Reddit | 40 comments

jaMMint@reddit

phi-mini reasoning 2.5GB, 3.8B model

GPT-OSS 120B is unexpectedly fast on Strix Halo. Why?

Posted by RaltarGOTSP@reddit | LocalLLaMA | View on Reddit | 65 comments

jaMMint@reddit

I meant that the other way round. If you already have an RTX 6000 Pro, this model is fantastic. Not that it's the best hardware for it.

GPT-OSS 120B is unexpectedly fast on Strix Halo. Why?

Posted by RaltarGOTSP@reddit | LocalLLaMA | View on Reddit | 65 comments

jaMMint@reddit

Best model for the RTX 6000 Pro with 96GB VRAM. This thing screams at 156 tok/sec. It's by far the best quality for the speed provided.

Apple M3 Ultra 512GB vs NVIDIA RTX 3090 LLM Benchmark

Posted by ifioravanti@reddit | LocalLLaMA | View on Reddit | 57 comments

jaMMint@reddit

It's a good comparison, but do you really limit the M3 with 512GB (!) unified RAM to models that fit in RTX 3090 VRAM in the real world? How would that setup compare using gpt-oss-120B, qwen-235B, GLM 4.5, Kimi-2?

How close can non big tech people get to ChatGPT and Claude speed locally? If you had $10k, how would you build infrastructure?

Posted by EducationalText9221@reddit | LocalLLaMA | View on Reddit | 156 comments

jaMMint@reddit

You can run gpt-oss-120B at 150+ tok/sec on an RTX 6000 Pro.

Building a RAG-based Bot with a large knowledge base.

Posted by champ_undisputed@reddit | LocalLLaMA | View on Reddit | 9 comments

jaMMint@reddit

Maybe in your case you can obtain better results if you give the LLM a function to query your database (e.g. one that builds SQL/NoSQL queries); results involving dates and counting in particular will be much easier to get and more reliable. For more involved cases (that go beyond a single RAG retrieval), you could use a map-reduce style approach, with the RAG results as the basis for a SELECT. E.g. "give me all projects where John was responsible for marketing in 2024" could run a RAG retrieval on the descriptions and, with the project ids collected, run a subsequent query to fetch them, order them and add the date cutoff.
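A small sketch of that two-step flow using an in-memory SQLite table. The schema and rows are invented, and `retrieve_ids` is only a word-overlap stand-in for a real vector search; the point is that the fuzzy part picks candidate ids while the date cutoff and ordering are done exactly in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT, start_date TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO projects VALUES (?, ?, ?, ?)",
    [
        (1, "Spring launch", "2024-03-01", "John was responsible for marketing"),
        (2, "Old campaign", "2023-05-10", "John was responsible for marketing"),
        (3, "Backend rewrite", "2024-06-15", "Maria led the engineering work"),
    ],
)


def retrieve_ids(query, top_k=5):
    """Stand-in for a vector search: ranks rows by words shared with the query."""
    words = set(query.lower().split())
    rows = conn.execute("SELECT id, description FROM projects").fetchall()
    scored = [(len(words & set(desc.lower().split())), pid) for pid, desc in rows]
    return [pid for score, pid in sorted(scored, reverse=True)[:top_k] if score > 0]


def projects_in_year(ids, year):
    """Exact filtering and ordering happen in SQL, not in the LLM."""
    placeholders = ",".join("?" for _ in ids)
    sql = (
        f"SELECT name, start_date FROM projects WHERE id IN ({placeholders}) "
        "AND start_date >= ? AND start_date < ? ORDER BY start_date"
    )
    return conn.execute(sql, [*ids, f"{year}-01-01", f"{year + 1}-01-01"]).fetchall()


ids = retrieve_ids("john responsible marketing")
print(projects_in_year(ids, 2024))
```

The retrieval step returns both of John's marketing projects; only the SQL step knows that one of them is from 2023 and drops it.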

New code benchmark puts Qwen 3 Coder at the top of the open models

Posted by mr_riptano@reddit | LocalLLaMA | View on Reddit | 103 comments

jaMMint@reddit

You get steak alright, but it's overcooked by 10 minutes.

Testing qwen3-30b-a3b-q8_0 with my RTX Pro 6000 Blackwell MaxQ. Significant speed improvement. Around 120 t/s.

Posted by swagonflyyyy@reddit | LocalLLaMA | View on Reddit | 50 comments

jaMMint@reddit

Just one data point, but I get 153 tok/sec on this model in LM Studio under Windows on the RTX 6000 Pro.

From 4090 to 5090 to RTX PRO 6000… in record time

Posted by Fabix84@reddit | LocalLLaMA | View on Reddit | 259 comments

jaMMint@reddit

You could just use a riser cable to give it more space in between, if that's a problem.

OpenAI gpt-oss-20b & 120 model performance on the RTX Pro 6000 Blackwell vs RTX 5090M

Posted by traderjay_toronto@reddit | LocalLLaMA | View on Reddit | 43 comments

jaMMint@reddit

what tok/sec did you get?

OpenAI gpt-oss-20b & 120 model performance on the RTX Pro 6000 Blackwell vs RTX 5090M

Posted by traderjay_toronto@reddit | LocalLLaMA | View on Reddit | 43 comments

jaMMint@reddit

I run the TQ1 UD quant from unsloth on the RTX 6000 Pro completely in VRAM at ~45 tok/sec

OpenAI gpt-oss-20b & 120 model performance on the RTX Pro 6000 Blackwell vs RTX 5090M

Posted by traderjay_toronto@reddit | LocalLLaMA | View on Reddit | 43 comments

jaMMint@reddit

I run the IQ4_XS from unsloth on the RTX 6000 Pro at 96 tok/sec. The 3_K_M version from DevQuasar runs at 90 tok/sec. There are small differences depending on how many tokens are generated.

Local LLM Deployment for 50 Users

Posted by NoobLLMDev@reddit | LocalLLaMA | View on Reddit | 56 comments

jaMMint@reddit

for 50 users, really?

Looking for help with terrible vLLM performance

Posted by Render_Arcana@reddit | LocalLLaMA | View on Reddit | 32 comments

jaMMint@reddit

Sorry no, I have not yet managed to have vllm play nice with the RTX 6000 pro and the correct pytorch version. I wouldn't try running concurrent requests on anything llama.cpp based.

Looking for help with terrible vLLM performance

Posted by Render_Arcana@reddit | LocalLLaMA | View on Reddit | 32 comments

jaMMint@reddit

For what it's worth, I tried this model (mistralai/devstral-small-2507, Q4_K_M quant @ 14.33GB size) on an RTX 6000 Pro, DDR5 6400, and Ultra 9 285K in vanilla LM Studio and got 80 t/s on a simple prompt and small context.

Is it just me or is Qwen3-235B is bad at coding ?

Posted by maayon@reddit | LocalLLaMA | View on Reddit | 18 comments

jaMMint@reddit

I honestly think that inference is not wholly correct for qwen3 on ollama. I see weird behavior, bad quality and repetition as well.

How we used NVIDIA TensorRT-LLM with Blackwell B200 to achieve 303 output tokens per second on DeepSeek R1

Posted by avianio@reddit | LocalLLaMA | View on Reddit | 16 comments

jaMMint@reddit

There must be some sort of efficiency gain in concurrency, right?

Most people are worried about LLM's executing code. Then theres me...... 😂

Posted by DataScientist305@reddit | LocalLLaMA | View on Reddit | 43 comments

jaMMint@reddit

Try out https://github.com/gradion-ai/freeact/tree/main, it's a great library for letting the LLM generate code it can execute and follow some agentic goal you set.

I am considering buying a Mac Studio for running local LLMs. Going for maximum RAM but does the GPU core count make a difference that justifies the extra $1k?

Posted by mehyay76@reddit | LocalLLaMA | View on Reddit | 358 comments

jaMMint@reddit

That would rather be the "M4 Extreme", one can dream

I am considering buying a Mac Studio for running local LLMs. Going for maximum RAM but does the GPU core count make a difference that justifies the extra $1k?

Posted by mehyay76@reddit | LocalLLaMA | View on Reddit | 358 comments

jaMMint@reddit

Better to buy one used (even an M1 Ultra one as the bandwidth is also 800GB/s) and then switch it for the M4 Ultra when it comes out in the near future. It will be considerably faster, 30-40% in inference speeds and 50%+ in preprocessing if you max out GPU cores of the M4 Ultra. As others have said, don't buy the current studio at full price if you don't have cash to burn.
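For context on why the bandwidth figure matters: single-user token generation is mostly limited by how fast the weights can be read from memory, so a crude ceiling is memory bandwidth divided by the size of the weights read per token. A minimal sketch, where the 40 GB model size is an assumed example value and the result is an upper bound rather than a benchmark:

```python
def decode_tokens_per_second(bandwidth_gb_s, weights_read_gb):
    """Rough upper bound for single-stream decoding: each token reads the active weights once."""
    return bandwidth_gb_s / weights_read_gb


# 800 GB/s is the bandwidth mentioned above; 40 GB is a hypothetical dense quantized model.
print(decode_tokens_per_second(800, 40))
```

Real numbers land below this because of compute, KV cache reads and overhead, and for MoE models only the active experts count towards the weights read per token.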

I haven't seen many quad GPU setups so here is one

Posted by dazzou5ouh@reddit | LocalLLaMA | View on Reddit | 124 comments

jaMMint@reddit

Maybe that just is what your setup/model/inference engine demands. Have a look at https://www.reddit.com/r/LocalLLaMA/comments/1d8kcc6/psa_multi_gpu_tensor_parallel_require_at_least/, maybe that helps you compare.

I haven't seen many quad GPU setups so here is one

Posted by dazzou5ouh@reddit | LocalLLaMA | View on Reddit | 124 comments

jaMMint@reddit

No idea, but why would you expect PCI utilisation to fluctuate during generation?

I haven't seen many quad GPU setups so here is one

Posted by dazzou5ouh@reddit | LocalLLaMA | View on Reddit | 124 comments

jaMMint@reddit

It should not be, as the communication needs between GPUs are much lower than those between GPU compute and VRAM.

o3-mini is now the SOTA coding model. It is truly something to behold. Procedural clouds in one-shot.

Posted by LocoMod@reddit | LocalLLaMA | View on Reddit | 228 comments

jaMMint@reddit

I think it helps if you prompt it with a reference style. "Write a ... in the style of Philip K. Dick". I got some super interesting and creative results.

What's the Best Current Setup for Multi Document (10k+) Retrieval-Augmented Generation (RAG)? Need Accuracy and Citations

Posted by United-Rush4073@reddit | LocalLLaMA | View on Reddit | 13 comments

jaMMint@reddit

This is just an idea, as I have only toyed with RAG solutions so far. Everybody expects the vector store to pull answer documents out of its hat when given a question. You might want to explore preprocessing these documents: have an LLM generate questions (like in a FAQ) for each one, questions that would be answered by the content of the document. Then these questions, along with their origin document metadata, are additionally stored in the vector DB. Thus we have added vectors pre-aligned with questions that real users could ask and that definitely have an answer in the documents.
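A toy sketch of the idea, with loud caveats: `embed` is a bag-of-words stand-in for a real embedding model, `generate_questions` returns canned strings where the LLM call would go, and the documents are invented. It only shows the indexing shape, i.e. each generated question is stored as its own vector pointing back at its source document.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real setup would use an embedding model."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def generate_questions(doc_text):
    """Placeholder for the LLM call that writes FAQ-style questions per document."""
    canned = {
        "Refunds are issued within 14 days of purchase.": ["How long do I have to get a refund?"],
        "Support is open Monday to Friday, 9 to 5.": ["When can I reach support?"],
    }
    return canned[doc_text]


def build_index(docs):
    """Store the document itself plus each generated question, all pointing at the doc id."""
    index = []
    for doc_id, text in docs.items():
        index.append((embed(text), doc_id, "document"))
        for question in generate_questions(text):
            index.append((embed(question), doc_id, "question"))
    return index


def search(index, query):
    """Return (doc_id, kind) of the closest entry."""
    vector = embed(query)
    best = max(index, key=lambda entry: cosine(vector, entry[0]))
    return best[1], best[2]


docs = {
    "refunds": "Refunds are issued within 14 days of purchase.",
    "support": "Support is open Monday to Friday, 9 to 5.",
}
index = build_index(docs)
print(search(index, "how long to get a refund"))
```

In this example the user query shares no words with the refund document itself, but it matches the generated question, which leads back to the right document.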

Running Deepseek on a CPU only cluster of machines?

Posted by jaMMint@reddit | LocalLLaMA | View on Reddit | 28 comments

jaMMint@reddit (OP)

0.1 t/s, ouch. But even at 0.5 t/s that's just not worth the effort.

Running Deepseek on a CPU only cluster of machines?

Posted by jaMMint@reddit | LocalLLaMA | View on Reddit | 28 comments

jaMMint@reddit (OP)

I feared so, next to gigantic power draw..

Running Deepseek on a CPU only cluster of machines?

Posted by jaMMint@reddit | LocalLLaMA | View on Reddit | 28 comments

jaMMint@reddit (OP)

CPU single-thread performance is nothing to brag about and I expect that the bottleneck will lie somewhere in the communication speed between all the subsystems, whether for single-user or batched inference. I've never tried to split up a MoE model, let alone across multiple networked servers, so I hope someone from this community already has.

Running Deepseek on a CPU only cluster of machines?

Posted by jaMMint@reddit | LocalLLaMA | View on Reddit | 28 comments

jaMMint@reddit (OP)

Not yet. I have the possibility to get them all for $3.5k, and I am wondering if it's worth it.

Mac Pro 2019 with DeepSeek R1

Posted by skipfish@reddit | LocalLLaMA | View on Reddit | 12 comments

jaMMint@reddit

You should be able to load it alright. You may get around 1-3 t/s? Also longer context will destroy performance.

Berkley AI research team claims to reproduce DeepSeek core technologies for $30

Posted by Slasher1738@reddit | LocalLLaMA | View on Reddit | 254 comments

jaMMint@reddit

Yeah, unfortunately you need to build it in order to know if people are going to pay for it.. But it could be really fun, with a wall of donors, some message and leader board and a bit of gamified progress status of the model and trained hours.. Of course you'd need to automatically run a selection of benchmarks each day and show the model's progress in nice charts. Could be great and you could even take a couple percent for administration and running the site. That surely would be acceptable..