zelkovamoon

Have you tried this -> 2x Modded 2080 ti 22GB with Nvlink

Posted by zelkovamoon@reddit | LocalLLaMA | View on Reddit | 20 comments

zelkovamoon@reddit (OP)

I appreciate the help. I *think* it should be as simple as appending those commands, probably don't need to change much else about your configuration - but I guess I'm not 100% sure

zelkovamoon@reddit (OP)

Update to the side note in my previous comment:

--reasoning-budget 1536 \
--reasoning-budget-message ". Okay, enough thinking. Let's answer now." \

This actually works. Looks like meat's back on the menu, boys.
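For anyone who launches the server from a script, here is a minimal sketch of appending those two flags to an existing command. The flag names and values are exactly the ones quoted above; whether they exist depends on your llama.cpp build. The base command and the `model.gguf` path are placeholders I made up for illustration.

```python
import shlex


def reasoning_budget_args(budget_tokens, message):
    """Build the two extra llama-server arguments quoted in the comment above."""
    return [
        "--reasoning-budget", str(budget_tokens),
        "--reasoning-budget-message", message,
    ]


# Hypothetical base command; the model path is a placeholder.
base_cmd = ["llama-server", "-m", "model.gguf"]
extra = reasoning_budget_args(1536, ". Okay, enough thinking. Let's answer now.")

# shlex.join quotes the message so the line can be pasted into a shell.
print(shlex.join(base_cmd + extra))
```

Keeping the arguments as a list rather than one long string avoids quoting mistakes with the apostrophe in the message.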

zelkovamoon@reddit (OP)

I had been running on an Octominer X12, and it was surprisingly good. If I could use NVLink to bridge the cards, it might be a big unlock, and per snapo84's comments it looks like it is possible. The Octominer is going to be retired for a newer platform soon, but anyway, yeah, as long as the cards work these might be the second-best 'budget' option, number one being SXM2 + V100s.

zelkovamoon@reddit (OP)

This is actually very useful - thank you. I took your initial comment to mean that you literally didn't have NVLink, not that you just felt it was unnecessary, so that's on me.

Looking at your setup - have you tried running with '--tensor-split' and '--split-mode row' to see how performance changes? It looks like you're probably still running in pipeline mode. I'd be curious to know what difference in tps you'd see.

Side note: *apparently* there are new controls for reasoning budget in llama.cpp that I was not aware of - see '--reasoning-budget' at [https://manpages.debian.org/unstable/llama.cpp-tools/llama-server.1.en.html](https://manpages.debian.org/unstable/llama.cpp-tools/llama-server.1.en.html)

I'm literally about to try it. I had reasoning disabled like you do, but if I can limit thinking to a reasonable number of tokens I would be interested in doing that. We'll see if it works!
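To make the suggestion concrete, here is a rough sketch of building the two split flags for a multi-GPU run. '--split-mode' and '--tensor-split' are the flags named above. The accepted modes listed here (none, layer, row) and the comma-separated proportion format are what I know from llama.cpp's help text, so check them against your build. The helper name and the even-split default are my own assumptions.

```python
def split_args(num_gpus, mode="row", weights=None):
    """Return llama.cpp split flags for num_gpus cards.

    weights are relative proportions per GPU; an even split is assumed
    when none are given (a hypothetical default, not a llama.cpp rule).
    """
    if mode not in ("none", "layer", "row"):
        raise ValueError(f"unknown split mode: {mode}")
    if weights is None:
        weights = [1] * num_gpus
    if len(weights) != num_gpus:
        raise ValueError("need exactly one weight per GPU")
    return [
        "--split-mode", mode,
        "--tensor-split", ",".join(str(w) for w in weights),
    ]


# Two identical cards, split evenly by rows.
print(" ".join(split_args(2)))
```

Running the same prompt once with "layer" and once with "row" and comparing the reported tps is the comparison being asked for here.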

zelkovamoon@reddit (OP)

Yeah, on one of my servers I tried using a 2070 super and even that handles small model inference like a boss. How long have you had the cards? Do they seem well built, reliable? The nvlink angle is specifically for tensor parallelism, which would be relevant to what I want to do - so I still need to know if it would work, but I'll take your experience under advisement

zelkovamoon@reddit (OP)

Current pricing says I can get these cards at under $500, so for the same money you could ostensibly get 44 GB instead of 24 GB - and at this point, the extra memory is more valuable to me than the extra speed. A single 3090 can run Qwen 3.5 35b heavily quantized, but you're making a lot of concessions that you definitely wouldn't have to make if you had more memory.
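A back-of-the-envelope sketch of why the extra memory matters: weight size is roughly parameter count times bits per weight, divided by 8. The 35B parameter count and the 24 GB / 44 GB totals come from the comment above. The bits-per-weight values (about 4.5 for a Q4_0-style quant, about 8.5 for Q8_0-style) are approximations, and the 2 GB allowance for KV cache and buffers is a placeholder assumption. It also treats the two 22 GB cards as one 44 GB pool, which ignores per-card overhead.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, vram_gb, overhead_gb=2.0):
    """Rough check: do the weights plus an assumed overhead fit in vram_gb?"""
    return weight_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


for label, bpw in [("~4.5 bpw", 4.5), ("~8.5 bpw", 8.5)]:
    size = weight_gb(35, bpw)
    print(f"{label}: {size:.1f} GB weights, "
          f"fits 24 GB: {fits(35, bpw, 24)}, fits 44 GB: {fits(35, bpw, 44)}")
```

Under these assumptions the 4-bit-class quant squeezes into 24 GB with little room for context, while the 8-bit-class quant only fits in the 44 GB configuration.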

Qwen3.5-27B Q4 Quantization Comparison

Posted by TitwitMuffbiscuit@reddit | LocalLLaMA | View on Reddit | 116 comments

Liquid AI releases LFM2-2.6B-Transcript, an incredibly fast open-weight meeting transcribing AI model on-par with closed-source giants.

Posted by KaroYadgar@reddit | LocalLLaMA | View on Reddit | 31 comments

Liquid Ai released LFM2.5, family of tiny on-device foundation models.

Posted by Difficult-Cap-7527@reddit | LocalLLaMA | View on Reddit | 59 comments

zelkovamoon@reddit

LFM2 was pretty good, so I'm excited to try this. Really hoping tool calling is better with these models; that was basically my biggest complaint.

llama.cpp performance breakthrough for multi-GPU setups

Posted by Holiday-Injury-9397@reddit | LocalLLaMA | View on Reddit | 205 comments

zelkovamoon@reddit

OK, so two questions:

1. Does ik_llama broadly support the same models as llama.cpp but with optimizations, or is it a subset?
2. Are these improvements going to apply broadly to any type of model?

Senator in Tennessee introduces bill to felonize making AI "act as a companion" or "mirror human interactions"

Posted by CanineAssBandit@reddit | LocalLLaMA | View on Reddit | 218 comments

Best Local LLMs - 2025

Posted by rm-rf-rm@reddit | LocalLLaMA | View on Reddit | 219 comments

zelkovamoon@reddit

Seconding LFM2-8B A1B. Seems like a MoE model class that should be explored more deeply in the future. The model itself is pretty great in my testing; tool calling can be challenging, but that's probably a skill issue on my part. It's not my favorite model, or the best model, but it is certainly good. Add a hybrid Mamba arch and some native tool calling on this bad boy and we might be in business.

Stop wasting your MCP context window. LTP (Lazy Tool Protocol) reduces tool-calling overhead by up to 93 percent.

Posted by song-junhyeong@reddit | LocalLLaMA | View on Reddit | 43 comments

llama.cpp - useful flags - share your thoughts please

Posted by mossy_troll_84@reddit | LocalLLaMA | View on Reddit | 34 comments

Without a connection to a live data source, an LLM faces critical limitations: Hallucinations and Trust

Posted by balianone@reddit | LocalLLaMA | View on Reddit | 5 comments

Key Highlights of Google's New Open Model, FunctionGemma

Posted by Dear-Success-1441@reddit | LocalLLaMA | View on Reddit | 12 comments

8x Radeon 7900 XTX Build for Longer Context Local Inference - Performance Results & Build Details

Posted by Beautiful_Trust_8151@reddit | LocalLLaMA | View on Reddit | 231 comments

zelkovamoon@reddit

Have a look here https://www.reddit.com/r/LocalAIServers/s/TeikNe9MuB If you write that post and remember, please DM it to me. I'm still looking for good ways to build a high performance server. I gotta be honest, very surprised to see that level of performance without an Infinity Fabric coupler on your MI50s, and that's also giving me encouragement to buy if we get this bulk order off the ground.

zelkovamoon@reddit

One of the most useful series of build posts I've seen in a while; hardware, well described, performance, everything. Linked this to the bulk mi50 thread that's been floating around.

I have 4 V100s. What do I do?

Posted by MackThax@reddit | LocalLLaMA | View on Reddit | 18 comments

zelkovamoon@reddit

You should get a used server with SXM2 connections and known NVLink support. The benefit: inter-GPU bandwidth will be much, much higher than PCIe. Additional RAM is fine, but with four V100s I would try to run models that fit within VRAM. CPU probably isn't a big factor. The focus is really VRAM and interconnect speed; other details matter, but only secondarily. I am waiting for prices to drop on 8x V100 servers. We'll see.

zai-org/GLM-4.6V-Flash (9B) is here

Posted by Cute-Sprinkles4911@reddit | LocalLLaMA | View on Reddit | 68 comments

Function calling Finetuners?

Posted by zelkovamoon@reddit | LocalLLaMA | View on Reddit | 11 comments

zelkovamoon@reddit (OP)

My guess is they probably have a TA or intern or something like this actually run and update the leaderboard, and it's not a focus right now. This has led me to think that what we really need is a wiki style database of benchmarks, and we'll just have individuals upload benchmark results - because we can just run BFCL on our own. But until that's created, getting good cross comparable info is difficult.

zelkovamoon@reddit (OP)

The issue with BFCL is that their leaderboard is incomplete, it seems. Maybe I'm just looking at the wrong thing - I've been going here -> https://gorilla.cs.berkeley.edu/leaderboard.html and if you type 'oss' in the search, nothing. Now, I'm aware that information on gpt-oss and its tool calling is available - but for being the main leaderboard for this, why wouldn't they have that, or have at least run the benchmark? In isolation this issue would be fine if model builders always ran benchmarks and published the info, but Hugging Face is always woefully lacking in information.

Intel Arc Pro B60 Battlematrix Preview: 192GB of VRAM for On-Premise AI

Posted by reps_up@reddit | LocalLLaMA | View on Reddit | 42 comments

Ministral-3 has been released

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 61 comments

A list of models released or udpated last week on this sub, in case you missed any (3rd Oct)

Posted by aifeed-fyi@reddit | LocalLLaMA | View on Reddit | 30 comments

Junyang Lin is drinking tea

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 30 comments

Why I do not like to see AI tools are implemented to distros!

Posted by BlokZNCR@reddit | linux | View on Reddit | 238 comments

Rise of the linux desktop will be driven by developing economies

Posted by KanonBalls@reddit | linux | View on Reddit | 71 comments

Linus Torvalds used to speak to engineers in 2012 the way I speak to LLMs now.

Posted by underbillion@reddit | linux | View on Reddit | 875 comments

support for the upcoming ERNIE 4.5 0.3B model has been merged into llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 17 comments

NVIDIA’s Highly Anticipated “Mini-Supercomputer,” the DGX Spark, Launches This Month — Bringing Immense AI Power to Your Hands — up to 4000$

Posted by _SYSTEM_ADMIN_MOD_@reddit | LocalLLaMA | View on Reddit | 290 comments

zelkovamoon@reddit

The metrics are different, and we won't really know until we get hands-on with the hardware, you're right - but that's not the overwhelming sentiment in this thread! It's 90% people shitting on the DGX and I'm like, this don't make no sense!

zelkovamoon@reddit

A100s and H100s benefit from NVLink, which specifically addresses the issue of high-speed memory access across GPUs, and is not something most PCIe GPUs these days leverage.

zelkovamoon@reddit

The M3 Ultra cannot compete with the GB10 on compute, which is a relevant factor. Your option 2 gives you a maximum theoretical memory bandwidth of 32 GB/s, which is *much lower* than the DGX Spark. While a 3090 individually may have faster on-card memory, if you at any point need to share that model across cards, your effective bandwidth is the bandwidth of your PCIe bus.

zelkovamoon@reddit

The memory bandwidth, compute, and price point are all relevant. The fact is, you will not find a consumer product that offers 128 GB of memory and 273 GB/s of bandwidth with comparable compute.

zelkovamoon@reddit

The DGX Spark has a memory bandwidth of 273 GB/s. A 5090 has much higher memory bandwidth if you're just using VRAM, BUT it has a maximum *slot* bandwidth of 63 GB/s. It's not *really* a competitor here unless you're running a model entirely on the card, in which case you're limited to 32 GB, not 128 GB.

Strix Halo has a peak bandwidth of around 212 GB/s, so again, the DGX Spark wins. And if that weren't enough, it's also native CUDA and outperforms Strix Halo on FP4 by a factor of 14x.

ASUS is selling a GX10 model for 3K, which isn't much more expensive than Strix Halo, and about what you'd have to pay for the 5090. You people acting like this is *clearly* a bad product are insane. HP, Dell, and Lenovo are also making SKUs if you don't want it straight from Nvidia.

Give me one actually good reason why this is a bad product compared to the other options available on the market? A *good* reason, not an *I don't like Nvidia* reason.
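A rough way to see why those bandwidth numbers matter: during token generation a dense model has to read roughly all of its weights once per token, so bandwidth divided by weight size gives a loose upper bound on tokens per second. The three bandwidth figures are the ones quoted above. The 20 GB model size is a made-up example, and the bound ignores compute, KV cache reads, and MoE sparsity. The 63 GB/s row only applies when weights have to stream over the slot on every token.

```python
def decode_ceiling_tps(bandwidth_gb_s, weights_gb):
    """Loose upper bound on tokens/s: one full pass over the weights per token."""
    return bandwidth_gb_s / weights_gb


# Bandwidth figures as quoted in the comment; model size is hypothetical.
WEIGHTS_GB = 20.0
paths = {
    "DGX Spark unified memory": 273.0,
    "Strix Halo unified memory": 212.0,
    "5090 slot (weights off-card)": 63.0,
}

for name, bandwidth in paths.items():
    print(f"{name}: at most {decode_ceiling_tps(bandwidth, WEIGHTS_GB):.2f} tok/s")
```

Real throughput will land below these ceilings; the point is only the relative ordering between the memory paths.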

zelkovamoon@reddit

Nope, wrong. AMD Strix Halo is cheaper, yes - but you can get a GB10-based system from Asus for 3k, so the price difference isn't that big, and you get more compute, memory bandwidth, and CUDA. This thread is just "let's make insane claims" Tuesday.

llama-4-scout-17B-16E GGUF running on Strix Halo (Ryzen AI MAX 395 + 128GB) (13s prompt processing edited out)

Posted by jfowers_amd@reddit | LocalLLaMA | View on Reddit | 47 comments

zelkovamoon@reddit

Pretty decent tps on that. Glad to see AMD is doing stuff - I'll be honest though, you know what would make me *really* consider AMD? I have a DGX GB10 coming. It was a genuinely good idea on Nvidia and friends' part to offer what is basically a datacenter-grade Blackwell variant at AI workstation scale instead of rack scale - I couldn't justify an AI Max setup given that at that price, you might as well spend a little more and get *way more* compute. If you guys could offer a product that competes equally in terms of VRAM, TOPS, and FP4 support but maybe with a lower price, or more VRAM, that would be a real contender. If you toss a product together with double the VRAM and comparable compute, that would do gangbusters.

Current best options to convert to FP4

Posted by zelkovamoon@reddit | LocalLLaMA | View on Reddit | 9 comments

zelkovamoon@reddit (OP)

People on Reddit are insufferable; I really don't get it. I dread asking questions these days, no matter how legitimate, and I gotta be honest, as soon as there is a better platform I'm jumping ship.
