Formal-Exam-8767

The Future of Free & Local Models: Training Co-Ops? Professional Orgs? Churches?

Posted by liftheavyscheisse@reddit | LocalLLaMA | View on Reddit | 18 comments

Macbook M5 Pro 24GB or 48GB

Posted by Resident_Bell_4457@reddit | LocalLLaMA | View on Reddit | 69 comments

Formal-Exam-8767@reddit

I am not talking about whether it technically runs, but whether it runs in a practically useful way. You can technically run 400B models off an SSD with 8GB RAM, but is that usable for anything?

I burned a weekend making the models "remember" me. The fix had nothing to do with trying to run bigger models locally

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 20 comments

Would you consider getting an NVIDIA RTX Spark laptop?

Posted by gamblingapocalypse@reddit | LocalLLaMA | View on Reddit | 167 comments

Direct 100.0 t/s on Strix Halo with Qwen3 30B-A3B. Can anyone reproduce or beat this?

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 22 comments

Are GPUs getting cheaper?

Posted by iMakeSense@reddit | LocalLLaMA | View on Reddit | 11 comments

Another shout out to llama.cpp build b9455 2x3090

Posted by Fabulous_Fact_606@reddit | LocalLLaMA | View on Reddit | 43 comments

For those of you running vllm locally for inference what quantifications do you use

Posted by Limp_Classroom_2645@reddit | LocalLLaMA | View on Reddit | 4 comments

Qwen 3.6-35B-A3B with 977 tk/s prompt processing and 262k context window on Intel Arc B70 Pro

Posted by Atomynos_Atom@reddit | LocalLLaMA | View on Reddit | 44 comments

Intel Arc Pro B70 llama.cpp benchmarks posted

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 48 comments

RTX Spark does not have 600GB/s Bandwith

Posted by rpiguy9907@reddit | LocalLLaMA | View on Reddit | 194 comments

Stop asking what model to run. There are literally only two.

Posted by Wrong_Mushroom_7350@reddit | LocalLLaMA | View on Reddit | 571 comments

Cheap V100 32gb

Posted by MachineZer0@reddit | LocalLLaMA | View on Reddit | 28 comments

Someone out there likely needs this: TP vs PP for 2 identical GPUs

Posted by xspider2000@reddit | LocalLLaMA | View on Reddit | 33 comments

NVIDIA RTX Spark — Slim Laptops & Small Desktops

Posted by zxyzyxz@reddit | LocalLLaMA | View on Reddit | 56 comments

Why is there no community project for training your own LLM from scratch on consumer hardware?

Posted by tevlon@reddit | LocalLLaMA | View on Reddit | 69 comments

Models still being vulnerable to Prompt Injection is actually a huge architectural red flag...

Posted by Comrade_Mugabe@reddit | LocalLLaMA | View on Reddit | 39 comments

Custom 4x RTX PRO 6000 Blackwell server vs Dell GB300 for ~30 fine-tuned production pipelines — looking for honest input on direction

Posted by Consistent_Wash_276@reddit | LocalLLaMA | View on Reddit | 72 comments

RTX5080 vs RTX 3090 ?

Posted by DarkAndrei@reddit | LocalLLaMA | View on Reddit | 48 comments

Not sure if this was posted. But I think it's highly relevant to us.

Posted by Paradigmind@reddit | LocalLLaMA | View on Reddit | 194 comments

Poor performance on RX 9070 XT

Posted by WhatererBlah555@reddit | LocalLLaMA | View on Reddit | 25 comments

Stop pretending self-hosting is cheaper. It's not. We do it for different reasons and we should say so.

Posted by Napster3301@reddit | LocalLLaMA | View on Reddit | 88 comments

llama.cpp oom issue

Posted by TheTerrasque@reddit | LocalLLaMA | View on Reddit | 23 comments

server: fix checkpoints creation by jacekpoplawski · Pull Request #22929 · ggml-org/llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 40 comments

New Release of ROCm based MLX LLM Engine - lemon-mlx-engine

Posted by GeramyL@reddit | LocalLLaMA | View on Reddit | 35 comments

Gemma 4 thinks I'm gaslighting it when I talk about Gemma 4 line of models

Posted by Jorlen@reddit | LocalLLaMA | View on Reddit | 14 comments

Formal-Exam-8767@reddit

Tell me about the features of "Gamma 8 43B A5B" LLM model.

> The model string "Gamma 8 43B A5B" represents a specific technical configuration of an open-weights Mixture-of-Experts (MoE) Large Language Model (LLM).
>
> While the naming convention breaks down into standard machine learning nomenclature, the precise string reflects a model from the extended ecosystem of open-weight architectures (often iteratively released by community developers or institutions expanding on foundational lines like Google's Gamma/Gemma or similar MoE structures). [1]
>
> The key features and architecture can be accurately parsed from its naming blueprint:
>
> ## 🛠️ Structural & Technical Specifications
>
> * Total Parameters (43B): The model contains a total of 43 billion parameters within its complete architecture. This positions it firmly as a medium-to-large localized model, capable of high-tier reasoning, complex coding tasks, and strong contextual synthesis while remaining deployable on localized enterprise hardware or multi-GPU home setups.
>
> * Active Parameters (A5B): The "A5B" designation stands for 5 Billion Active Parameters. As a Mixture-of-Experts (MoE) model, only a sub-network (expert layers) totaling roughly 5 billion parameters is activated for any single token processing step.
>
> * "Gamma 8" Baseline: This points to the version or generation cycle (Iteration 8) of the underlying architecture baseline, building upon the optimization patterns of highly efficient open-weight models. [2, 3, 4, 5, 6]
>
> ## ⚡ Key Architectural Advantages
>
> * Blazing Inference Speed: Because it only computes 5 billion parameters per token instead of the full 43 billion, the model delivers the execution speed and low latency of a much smaller 5B model.
>
> * High-Capacity Intelligence: Despite its fast processing speed, it retains the knowledge depth, reasoning capabilities, and broad vocabulary of a massive 43B parameter model, because the remaining specialized parameters are dynamically routed as needed.
>
> * Resource Efficiency: It drastically lowers hardware serving costs (VRAM compute cycles during generation) compared to dense, non-MoE 40B+ models, making it ideal for cost-efficient enterprise or local offline deployments. [2, 7, 8, 9]
>
> ## 🎯 Primary Use Cases and Capabilities
>
> Given this specific 43B/A5B MoE design footprint, the model features high proficiency in:
>
> * Agentic Workflows: Rapid token throughput allows it to execute multi-turn agentic loops (thought-action-observation cycles) without choking system speed.
>
> * Coding & Logic Synthesizing: The deep 43B parameter pool enables rich syntactic understanding of dense, multi-file codebases and mathematical constraints.
>
> * Highly Scalable Local Deployments: It serves as an optimal solution for organizations requiring strong data sovereignty and offline operations without the hardware budget for massive dense clusters. [2, 10, 11, 12, 13]
>
> To help tailor this, are you looking to deploy this model locally, or are you comparing it against a dense model configuration (like a standard 8B or 70B model) for a specific task?

AMD Powers Next-Generation Agent Computers with New Ryzen AI Halo Developer Platform and Ryzen AI Max PRO 400 Series Processors

Posted by Baumpaladin@reddit | LocalLLaMA | View on Reddit | 66 comments

AMD BC-250 and the search for Cheap Compute

Posted by dugganmania@reddit | LocalLLaMA | View on Reddit | 41 comments

[WIP] Gemma 4 MTP

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 50 comments

Formal-Exam-8767@reddit

MTP is new, yes. But using smaller models as draft models for larger ones is not. Support for that landed in llama.cpp in 2024: https://github.com/ggml-org/llama.cpp/pull/10455
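
For anyone unfamiliar with the draft-model idea, here is a minimal greedy sketch of it. This is not llama.cpp's implementation: `toy_target`, `toy_draft`, `speculative_step` and `generate` are made-up names, the "models" are toy functions over integer tokens, and a real engine would verify all drafted tokens in one batched forward pass of the large model rather than calling it once per token. Sampling-based variants also need an acceptance rule that this greedy version skips.

```python
from typing import Callable, List

# A "model" here is just a function: token context -> next token (greedy).
NextToken = Callable[[List[int]], int]


def toy_target(context: List[int]) -> int:
    """Hypothetical stand-in for the large model."""
    return (context[-1] * 3 + 1) % 7


def toy_draft(context: List[int]) -> int:
    """Hypothetical stand-in for the small draft model: agrees with the
    target except after token 5, where it guesses wrong."""
    return 0 if context[-1] == 5 else toy_target(context)


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Draft k tokens, keep the prefix the target agrees with, then add
    one token chosen by the target itself."""
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    accepted: List[int] = []
    for token in proposed:
        expected = target(context + accepted)
        if token != expected:
            # First disagreement: take the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # Every drafted token was accepted, so the target adds a bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens after a non-empty prompt using speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[len(prompt):len(prompt) + n_tokens]


def greedy_decode(target: NextToken, prompt: List[int], n_tokens: int) -> List[int]:
    """Plain one-token-at-a-time decoding with the target only."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out[len(prompt):]


print(generate(toy_target, toy_draft, [1], 10))
print(greedy_decode(toy_target, [1], 10))
```

With greedy decoding the two printed sequences are identical. The draft model only changes how many target passes are needed, not what gets generated.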

Intel's Crescent Island PCB Leaks, Showing a Massive Xe3P GPU, 16-Pin Connector, 160GB LPDDR5X as Intel Sidesteps the HBM Shortage

Posted by FullstackSensei@reddit | LocalLLaMA | View on Reddit | 89 comments

Formal-Exam-8767@reddit

How many layers does it have? I've been told that going beyond 2 memory channels requires more PCB layers to route the traces, and that's what makes multi-channel motherboards expensive.

What is the point of MoE models, beyond being faster?

Posted by ihatebeinganonymous@reddit | LocalLLaMA | View on Reddit | 135 comments

Formal-Exam-8767@reddit

The main bottleneck when running LLMs is not the amount of memory or even compute, but memory bandwidth. Memory capacity can be taken as a given: either you have enough to hold the model or you can't run it, period. MoE models have fewer active parameters, so they make better use of memory bandwidth and need less compute, since they touch less data per token than dense models.
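
A back-of-the-envelope sketch of that argument, assuming token generation is purely bandwidth-bound and every active weight is read once per token. The bandwidth figure, the quantization and the two model sizes are illustrative assumptions, not measurements, and the function name is made up. It ignores KV cache reads, routing overhead and compute limits, so treat the result as a rough ceiling rather than a prediction.

```python
def decode_tokens_per_second(bandwidth_gb_s: float,
                             active_params_b: float,
                             bytes_per_param: float) -> float:
    """Rough ceiling on generation speed when memory bandwidth is the limit.

    bandwidth_gb_s   memory bandwidth in GB/s
    active_params_b  parameters read per token, in billions
    bytes_per_param  storage per weight (0.5 is roughly a 4-bit quant)
    """
    gb_read_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# Assumed values, for illustration only.
BANDWIDTH_GB_S = 256.0
BYTES_PER_PARAM = 0.5

dense_tps = decode_tokens_per_second(BANDWIDTH_GB_S, 27.0, BYTES_PER_PARAM)  # dense 27B
moe_tps = decode_tokens_per_second(BANDWIDTH_GB_S, 3.0, BYTES_PER_PARAM)     # MoE, 3B active

print(f"dense 27B ceiling:     {dense_tps:.1f} t/s")
print(f"MoE 3B-active ceiling: {moe_tps:.1f} t/s")
print(f"ratio: {moe_tps / dense_tps:.1f}x")
```

Under these assumptions the ratio is just dense parameters divided by active parameters. Total parameter count never enters the speed estimate, only the question of whether the model fits in memory at all.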

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm)

Posted by VolandBerlioz@reddit | LocalLLaMA | View on Reddit | 130 comments

Reliable Open Source LLM as a Service

Posted by pravictor@reddit | LocalLLaMA | View on Reddit | 10 comments

Different gpu mixed node

Posted by Force88@reddit | LocalLLaMA | View on Reddit | 10 comments

Formal-Exam-8767@reddit

If the choice is between 5060ti 16GB and 5070 12GB, it makes no sense to go with less VRAM, since you won't really see any of the speed benefits that the 5070 has over the 5060ti. To me, it only makes sense to consider 5060ti 16GB or 5070ti 16GB.

The "the future is fictional" problem of many local LLMs

Posted by PromptInjection_@reddit | LocalLLaMA | View on Reddit | 54 comments

Formal-Exam-8767@reddit

But can the model even respond differently? If its system prompt specifies a knowledge cutoff date (which to the model probably means all knowledge, both baked-in and from context), anything beyond that date is fiction.

Turboquant+MTP for ROCm(Llama CPP)

Posted by DrBearJ3w@reddit | LocalLLaMA | View on Reddit | 16 comments

I've seen a lot of folks ask "can local LLMs actually do anything useful?"

Posted by NoWorking8412@reddit | LocalLLaMA | View on Reddit | 121 comments

Formal-Exam-8767@reddit

Define useful. Considering that people mostly use chatbots the wrong way (as an **authoritative** source of truth), I am not surprised they don't find local LLMs useful.

Estimate inference speed of local Qwen3.6-35B on Mac M5...

Posted by Altruistic-Dust-2565@reddit | LocalLLaMA | View on Reddit | 17 comments

Formal-Exam-8767@reddit

An estimate is just an estimate. With MoE it is even harder to estimate precisely. And PP (prompt processing) is close to impossible to estimate, since it depends on the actual architecture and ops, not just size.

DIY market declining amid high RAM prices

Posted by Terminator857@reddit | LocalLLaMA | View on Reddit | 114 comments

how you justify your spending time and resource for Local LLM to your love one ?

Posted by Merchant_Lawrence@reddit | LocalLLaMA | View on Reddit | 6 comments

Strix Halo Clustering (Hardware Setup Discussion)

Posted by Thanks-Suitable@reddit | LocalLLaMA | View on Reddit | 20 comments