Formal-Exam-8767

The Future of Free & Local Models: Training Co-Ops? Professional Orgs? Churches?

Posted by liftheavyscheisse@reddit | LocalLLaMA | View on Reddit | 18 comments

[-]

Formal-Exam-8767@reddit

Maybe they mean a church based around the model, like "Church of Qwen" or "Church of Gemma"?

Macbook M5 Pro 24GB or 48GB

Posted by Resident_Bell_4457@reddit | LocalLLaMA | View on Reddit | 69 comments

[-]

Formal-Exam-8767@reddit

I am not talking about whether it technically runs, but whether it runs in a practically useful way. You can technically run 400B models off an SSD with 8GB of RAM, but is that usable for anything?

Macbook M5 Pro 24GB or 48GB

Posted by Resident_Bell_4457@reddit | LocalLLaMA | View on Reddit | 69 comments

[-]

Formal-Exam-8767@reddit

With 24GB you can forget about running LLMs.

I burned a weekend making the models "remember" me. The fix had nothing to do with trying to run bigger models locally

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 20 comments

[-]

Would you consider getting an NVIDIA RTX Spark laptop?

Posted by gamblingapocalypse@reddit | LocalLLaMA | View on Reddit | 167 comments

[-]

Formal-Exam-8767@reddit

No, for two reasons:

1. Too expensive for what it provides
2. ARM-based CPU

Direct 100.0 t/s on Strix Halo with Qwen3 30B-A3B. Can anyone reproduce or beat this?

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 22 comments

[-]

Formal-Exam-8767@reddit

Can you do a Q8 bench? Quality-wise, Q4 is not really usable for anything.

Are GPUs getting cheaper?

Posted by iMakeSense@reddit | LocalLLaMA | View on Reddit | 11 comments

[-]

Formal-Exam-8767@reddit

> if prices are going down dramatically

You don't have to worry about that. Prices definitely won't go down dramatically, or even go down at all.

Another shout out to llama.cpp build b9455 2x3090

Posted by Fabulous_Fact_606@reddit | LocalLLaMA | View on Reddit | 43 comments

[-]

Formal-Exam-8767@reddit

Are they connected via NVLink? I've been reading here that tensor-split will be slow without NVLink.

For those of you running vllm locally for inference what quantifications do you use

Posted by Limp_Classroom_2645@reddit | LocalLLaMA | View on Reddit | 4 comments

[-]

Formal-Exam-8767@reddit

https://docs.vllm.ai/en/stable/features/quantization/#supported-hardware

Qwen 3.6-35B-A3B with 977 tk/s prompt processing and 262k context window on Intel Arc B70 Pro

Posted by Atomynos_Atom@reddit | LocalLLaMA | View on Reddit | 44 comments

[-]

Formal-Exam-8767@reddit

I dug up some 3090 results for comparison: /r/LocalLLaMA/comments/1rqljv4/benchmarked_all_unsloth_qwen3535ba3b_q4_models_on/

Intel Arc Pro B70 llama.cpp benchmarks posted

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 48 comments

[-]

Formal-Exam-8767@reddit

Yes, there appears to be lots of room for improvement.

Intel Arc Pro B70 llama.cpp benchmarks posted

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 48 comments

[-]

Formal-Exam-8767@reddit

The B70 has a theoretical memory bandwidth of 608.0 GB/s, and this does not even reach 150.0 GB/s, if my math is correct?

RTX Spark does not have 600GB/s Bandwith

Posted by rpiguy9907@reddit | LocalLLaMA | View on Reddit | 194 comments

[-]

Formal-Exam-8767@reddit

How would a slightly modified DGX Spark using LPDDR5X have 600GB/s memory bandwidth?

Stop asking what model to run. There are literally only two.

Posted by Wrong_Mushroom_7350@reddit | LocalLLaMA | View on Reddit | 571 comments

[-]

Cheap V100 32gb

Posted by MachineZer0@reddit | LocalLLaMA | View on Reddit | 28 comments

[-]

Someone out there likely needs this: TP vs PP for 2 identical GPUs

Posted by xspider2000@reddit | LocalLLaMA | View on Reddit | 33 comments

[-]

Formal-Exam-8767@reddit

Where is this misinformation coming from? How much data is actually exchanged between two cards that it would require a fast interconnect?
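A rough back-of-the-envelope sketch of that question, not a measurement. It assumes that with a layer split only one hidden-state vector per token crosses each GPU boundary, and that a tensor split synchronises about twice per layer with roughly one hidden-state vector each time. The hidden size, layer count, activation width and token rate below are hypothetical, and the function names are made up for illustration.

```python
def layer_split_bytes_per_second(hidden_size, bytes_per_value, tokens_per_second,
                                 boundaries=1):
    """Traffic when whole layers live on different cards (pipeline-style split).

    Assumption: one hidden-state vector per token crosses each boundary.
    """
    bytes_per_token = hidden_size * bytes_per_value * boundaries
    return bytes_per_token * tokens_per_second


def tensor_split_bytes_per_second(hidden_size, bytes_per_value, tokens_per_second,
                                  n_layers, syncs_per_layer=2):
    """Traffic when every layer is sliced across cards (tensor-style split).

    Assumption: each sync moves roughly one hidden-state vector per token.
    """
    bytes_per_token = hidden_size * bytes_per_value * n_layers * syncs_per_layer
    return bytes_per_token * tokens_per_second


# Hypothetical model: hidden size 5120, 64 layers, fp16 activations, 30 t/s decode.
layer_traffic = layer_split_bytes_per_second(5120, 2, 30)
tensor_traffic = tensor_split_bytes_per_second(5120, 2, 30, n_layers=64)

print(f"layer split:  {layer_traffic / 1e6:.2f} MB/s")
print(f"tensor split: {tensor_traffic / 1e6:.2f} MB/s")
```

Under these assumptions both figures are far below what PCIe offers during token generation. The estimate ignores latency per synchronisation and prompt processing, where many tokens are moved at once, so it says nothing about those cases.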

NVIDIA RTX Spark — Slim Laptops & Small Desktops

Posted by zxyzyxz@reddit | LocalLLaMA | View on Reddit | 56 comments

[-]

Formal-Exam-8767@reddit

So they plan to sell a device with 20% performance degradation out-of-the-box?

NVIDIA RTX Spark — Slim Laptops & Small Desktops

Posted by zxyzyxz@reddit | LocalLLaMA | View on Reddit | 56 comments

[-]

Formal-Exam-8767@reddit

I thought being ARM-based limits its usability for anything outside AI. Do current Windows games support ARM?

Why is there no community project for training your own LLM from scratch on consumer hardware?

Posted by tevlon@reddit | LocalLLaMA | View on Reddit | 69 comments

[-]

Formal-Exam-8767@reddit

So, how many years are you willing to wait for an LLM to finish training on a single 8GB VRAM card?

Models still being vulnerable to Prompt Injection is actually a huge architectural red flag...

Posted by Comrade_Mugabe@reddit | LocalLLaMA | View on Reddit | 39 comments

[-]

Formal-Exam-8767@reddit

That is just how auto-complete works: it completes what came before. Don't attribute intelligence to a system without any.

Custom 4x RTX PRO 6000 Blackwell server vs Dell GB300 for ~30 fine-tuned production pipelines — looking for honest input on direction

Posted by Consistent_Wash_276@reddit | LocalLLaMA | View on Reddit | 72 comments

[-]

Formal-Exam-8767@reddit

> Not looking for "buy a 5090 instead"

Buy 16x 3090 instead.

RTX5080 vs RTX 3090 ?

Posted by DarkAndrei@reddit | LocalLLaMA | View on Reddit | 48 comments

[-]

Formal-Exam-8767@reddit

No amount of compute can help in a memory-bandwidth-bottlenecked use case.

* RTX 5080 - 16GB @ 960.0 GB/s
* RTX 3090 - 24GB @ 936.2 GB/s
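To put numbers on that, here is a minimal sketch of the usual rule of thumb: during token generation every weight is read once per token, so bandwidth divided by model size gives a ceiling on tokens per second. The 14 GB model size is a hypothetical value picked to fit both cards, and the function name is made up; real speeds land below this ceiling.

```python
def decode_ceiling_tps(bandwidth_gb_s, model_size_gb):
    """Upper bound on tokens/s if every weight byte is read once per token."""
    return bandwidth_gb_s / model_size_gb


MODEL_SIZE_GB = 14.0  # hypothetical quantized model that fits in 16GB

cards = {
    "RTX 5080": 960.0,  # GB/s, from the comment above
    "RTX 3090": 936.2,  # GB/s, from the comment above
}

for name, bandwidth in cards.items():
    print(f"{name}: at most {decode_ceiling_tps(bandwidth, MODEL_SIZE_GB):.1f} t/s")

# Relative advantage of the faster card, independent of model size.
advantage = cards["RTX 5080"] / cards["RTX 3090"] - 1.0
print(f"bandwidth advantage: {advantage:.1%}")
```

The two ceilings differ by only a few percent, while the 3090 can hold a model half again as large.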

Not sure if this was posted. But I think it's highly relevant to us.

Posted by Paradigmind@reddit | LocalLLaMA | View on Reddit | 194 comments

[-]

Formal-Exam-8767@reddit

And they would love to pass the electricity bill on to the customer.

Poor performance on RX 9070 XT

Posted by WhatererBlah555@reddit | LocalLLaMA | View on Reddit | 25 comments

[-]

Formal-Exam-8767@reddit

Yeah. I would not underestimate data center cards even if they are old.

Poor performance on RX 9070 XT

Posted by WhatererBlah555@reddit | LocalLLaMA | View on Reddit | 25 comments

[-]

Formal-Exam-8767@reddit

* MI50 Bandwidth 1.02 TB/s
* RX 9070 XT Bandwidth 644.6 GB/s

Stop pretending self-hosting is cheaper. It's not. We do it for different reasons and we should say so.

Posted by Napster3301@reddit | LocalLLaMA | View on Reddit | 88 comments

[-]

Formal-Exam-8767@reddit

> "local is cheaper"

I never saw this argument here.

llama.cpp oom issue

Posted by TheTerrasque@reddit | LocalLLaMA | View on Reddit | 23 comments

[-]

Formal-Exam-8767@reddit

How much RAM do you have? I don't think it pre-allocates context on start.

server: fix checkpoints creation by jacekpoplawski · Pull Request #22929 · ggml-org/llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 40 comments

[-]

Formal-Exam-8767@reddit

Can the KV cache be spliced? What if you kept the question's KV-cache, spliced out the reasoning part, and glued the rest of the response's KV-cache onto the end of the question's KV-cache?

New Release of ROCm based MLX LLM Engine - lemon-mlx-engine

Posted by GeramyL@reddit | LocalLLaMA | View on Reddit | 35 comments

[-]

Formal-Exam-8767@reddit

So AMD GPU cards work with Apple Silicon?

Gemma 4 thinks I'm gaslighting it when I talk about Gemma 4 line of models

Posted by Jorlen@reddit | LocalLLaMA | View on Reddit | 14 comments

[-]

Tell me about the features of "Gamma 8 43B A5B" LLM model.

> The model string "Gamma 8 43B A5B" represents a specific technical configuration of an open-weights Mixture-of-Experts (MoE) Large Language Model (LLM).
>
> While the naming convention breaks down into standard machine learning nomenclature, the precise string reflects a model from the extended ecosystem of open-weight architectures (often iteratively released by community developers or institutions expanding on foundational lines like Google's Gamma/Gemma or similar MoE structures). [1]
>
> The key features and architecture can be accurately parsed from its naming blueprint:
>
> ## 🛠️ Structural & Technical Specifications
>
> * Total Parameters (43B): The model contains a total of 43 billion parameters within its complete architecture. This positions it firmly as a medium-to-large localized model, capable of high-tier reasoning, complex coding tasks, and strong contextual synthesis while remaining deployable on localized enterprise hardware or multi-GPU home setups.
>
> * Active Parameters (A5B): The "A5B" designation stands for 5 Billion Active Parameters. As a Mixture-of-Experts (MoE) model, only a sub-network (expert layers) totaling roughly 5 billion parameters is activated for any single token processing step.
>
> * "Gamma 8" Baseline: This points to the version or generation cycle (Iteration 8) of the underlying architecture baseline, building upon the optimization patterns of highly efficient open-weight models. [2, 3, 4, 5, 6]
>
> ## ⚡ Key Architectural Advantages
>
> * Blazing Inference Speed: Because it only computes 5 billion parameters per token instead of the full 43 billion, the model delivers the execution speed and low latency of a much smaller 5B model.
>
> * High-Capacity Intelligence: Despite its fast processing speed, it retains the knowledge depth, reasoning capabilities, and broad vocabulary of a massive 43B parameter model, because the remaining specialized parameters are dynamically routed as needed.
>
> * Resource Efficiency: It drastically lowers hardware serving costs (VRAM compute cycles during generation) compared to dense, non-MoE 40B+ models, making it ideal for cost-efficient enterprise or local offline deployments. [2, 7, 8, 9]
>
> ## 🎯 Primary Use Cases and Capabilities
>
> Given this specific 43B/A5B MoE design footprint, the model features high proficiency in:
>
> * Agentic Workflows: Rapid token throughput allows it to execute multi-turn agentic loops (thought-action-observation cycles) without choking system speed.
>
> * Coding & Logic Synthesizing: The deep 43B parameter pool enables rich syntactic understanding of dense, multi-file codebases and mathematical constraints.
>
> * Highly Scalable Local Deployments: It serves as an optimal solution for organizations requiring strong data sovereignty and offline operations without the hardware budget for massive dense clusters. [2, 10, 11, 12, 13]
>
> To help tailor this, are you looking to deploy this model locally, or are you comparing it against a dense model configuration (like a standard 8B or 70B model) for a specific task?

Gemma 4 thinks I'm gaslighting it when I talk about Gemma 4 line of models

Posted by Jorlen@reddit | LocalLLaMA | View on Reddit | 14 comments

[-]

Formal-Exam-8767@reddit

Ask it about Gemma 5, 6, 7, 8, 9.

AMD Powers Next-Generation Agent Computers with New Ryzen AI Halo Developer Platform and Ryzen AI Max PRO 400 Series Processors

Posted by Baumpaladin@reddit | LocalLLaMA | View on Reddit | 66 comments

[-]

Formal-Exam-8767@reddit

> make their own hardware?

Sometimes that is not the right move. Don't forget what happened to 3Dfx...

AMD BC-250 and the search for Cheap Compute

Posted by dugganmania@reddit | LocalLLaMA | View on Reddit | 41 comments

[-]

Formal-Exam-8767@reddit

If you put 12 of these BC-250 boards together in a 4U rackmount server chassis (like the ASRock 4U12G), that would be 192GB and no issues with cooling, right?

[WIP] Gemma 4 MTP

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 50 comments

[-]

Formal-Exam-8767@reddit

MTP is new, yes. But using smaller models as draft models for larger ones is not. Support for that landed in llama.cpp in 2024: https://github.com/ggml-org/llama.cpp/pull/10455

[WIP] Gemma 4 MTP

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 50 comments

[-]

Formal-Exam-8767@reddit

Speculative decoding is nothing new, and there is always a desire to overcome the memory bandwidth bottleneck.
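For readers who have not seen the idea, here is a toy sketch of greedy draft-and-verify speculative decoding. It is not how llama.cpp implements it: the two "models" are plain Python functions invented for this example, and the verification loop stands in for what a real engine does in a single batched forward pass of the large model. The point is that the weights of the large model are read once per pass rather than once per token.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: one target-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, number_of_target_passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))

        # 2. The target model checks the proposal (one batched pass in practice).
        target_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # target's correction replaces the miss
                break
        else:
            # Every draft token matched, so the same pass yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], target_passes


# Toy stand-ins for real models (hypothetical): count upwards modulo 10.
def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    # The draft agrees with the target except right after a 2, where it guesses wrong.
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


baseline = greedy_decode(target_next, [0], 8)
output, passes = speculative_decode(target_next, draft_next, [0], 8, k=4)
print(output == baseline, passes)
```

With greedy verification the output matches plain greedy decoding of the target model; only the number of target passes changes, and the gain depends on how often the draft guesses right.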

Intel's Crescent Island PCB Leaks, Showing a Massive Xe3P GPU, 16-Pin Connector, 160GB LPDDR5X as Intel Sidesteps the HBM Shortage

Posted by FullstackSensei@reddit | LocalLLaMA | View on Reddit | 89 comments

[-]

Formal-Exam-8767@reddit

I see, so that is not the reason why Intel and AMD don't want to put more than 2 memory channels in consumer CPU/MBOs?

Intel's Crescent Island PCB Leaks, Showing a Massive Xe3P GPU, 16-Pin Connector, 160GB LPDDR5X as Intel Sidesteps the HBM Shortage

Posted by FullstackSensei@reddit | LocalLLaMA | View on Reddit | 89 comments

[-]

Formal-Exam-8767@reddit

How many layers does it have? I've been told that to achieve more than 2 memory channels you need multiple PCB layers because of the traces and that's what makes multi-channel MBOs expensive.

What is the point of MoE models, beyond being faster?

Posted by ihatebeinganonymous@reddit | LocalLLaMA | View on Reddit | 135 comments

[-]

Formal-Exam-8767@reddit

The main issue with running LLMs is not the amount of memory or even compute, but memory bandwidth. You can take the amount of memory as a given: either you have enough, or you can't run the model at all, period. MoE models have fewer active parameters, so they make better use of memory bandwidth and require less compute, since they work with less data per token than dense models.
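A small sketch of that distinction, under simplifying assumptions: the memory footprint scales with total parameters, while the data read per generated token scales with active parameters. The model sizes, the 1 byte per parameter (roughly 8-bit weights) and the 500 GB/s bandwidth are hypothetical, and the helper names are made up for illustration.

```python
def footprint_gb(total_params_b, bytes_per_param):
    """Memory needed just to hold the weights (params given in billions)."""
    return total_params_b * bytes_per_param


def read_per_token_gb(active_params_b, bytes_per_param):
    """Weight data touched to generate one token."""
    return active_params_b * bytes_per_param


def decode_ceiling_tps(active_params_b, bytes_per_param, bandwidth_gb_s):
    """Bandwidth-bound upper limit on tokens/s."""
    return bandwidth_gb_s / read_per_token_gb(active_params_b, bytes_per_param)


BYTES_PER_PARAM = 1.0   # hypothetical: roughly 8-bit weights
BANDWIDTH_GB_S = 500.0  # hypothetical device

# (name, total params in billions, active params in billions) - both hypothetical
models = [
    ("dense 30B", 30.0, 30.0),
    ("MoE 30B-A3B", 30.0, 3.0),
]

for name, total_b, active_b in models:
    size = footprint_gb(total_b, BYTES_PER_PARAM)
    per_token = read_per_token_gb(active_b, BYTES_PER_PARAM)
    ceiling = decode_ceiling_tps(active_b, BYTES_PER_PARAM, BANDWIDTH_GB_S)
    print(f"{name}: {size:.0f} GB to load, {per_token:.0f} GB read per token, "
          f"ceiling {ceiling:.1f} t/s")
```

Both models need the same memory to load, but the MoE reads a tenth of the data per token, so its bandwidth-bound ceiling is ten times higher in this sketch.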

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm)

Posted by VolandBerlioz@reddit | LocalLLaMA | View on Reddit | 130 comments

[-]

Formal-Exam-8767@reddit

Am I missing something here, is "IQ4_KS" not supported on stock llama.cpp?

Reliable Open Source LLM as a Service

Posted by pravictor@reddit | LocalLLaMA | View on Reddit | 10 comments

[-]

Formal-Exam-8767@reddit

On-prem, depending on your budget:

* NVIDIA GB200 NVL72
* NVIDIA HGX B300
* NVIDIA DGX H100

Different gpu mixed node

Posted by Force88@reddit | LocalLLaMA | View on Reddit | 10 comments

[-]

Formal-Exam-8767@reddit

If the choice is between a 5060ti 16GB and a 5070 12GB, it makes no sense to go with less VRAM, since you won't really see any of the speed benefits that the 5070 has over the 5060ti. To me, it only makes sense to consider a 5060ti 16GB or a 5070ti 16GB.

The "the future is fictional" problem of many local LLMs

Posted by PromptInjection_@reddit | LocalLLaMA | View on Reddit | 54 comments

[-]

Formal-Exam-8767@reddit

But can the model even respond differently? If it has a system prompt with a specified knowledge cutoff date (to the model this probably means all knowledge, both baked-in and from context), anything beyond that date is fiction.

Turboquant+MTP for ROCm(Llama CPP)

Posted by DrBearJ3w@reddit | LocalLLaMA | View on Reddit | 16 comments

[-]

Formal-Exam-8767@reddit

Thanks for sharing. How does it compare to Vulkan?

I've seen a lot of folks ask "can local LLMs actually do anything useful?"

Posted by NoWorking8412@reddit | LocalLLaMA | View on Reddit | 121 comments

[-]

Formal-Exam-8767@reddit

Define useful. Considering that people are mostly using chatbots the wrong way (as an **authoritative** source of truth), I am not surprised they don't find local LLMs useful.

Estimate inference speed of local Qwen3.6-35B on Mac M5...

Posted by Altruistic-Dust-2565@reddit | LocalLLaMA | View on Reddit | 17 comments

[-]

Formal-Exam-8767@reddit

An estimate is just an estimate. With MoE it is even harder to estimate precisely. And PP is close to impossible to estimate, since it depends on the actual architecture and ops, not just size.

DIY market declining amid high RAM prices

Posted by Terminator857@reddit | LocalLLaMA | View on Reddit | 114 comments

[-]

Formal-Exam-8767@reddit

Why only DIY? Wouldn't prebuilts and brand names be even more expensive now?

how you justify your spending time and resource for Local LLM to your love one ?

Posted by Merchant_Lawrence@reddit | LocalLLaMA | View on Reddit | 6 comments

[-]

Formal-Exam-8767@reddit

Say it's an electric heater. You can demonstrate that it blows hot air out.

Strix Halo Clustering (Hardware Setup Discussion)

Posted by Thanks-Suitable@reddit | LocalLLaMA | View on Reddit | 20 comments

[-]

Formal-Exam-8767@reddit

What is the minimum interconnect bandwidth required so that it does not affect token generation speed?