OcelotOk8071

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Posted by johnnyApplePRNG@reddit | LocalLLaMA | View on Reddit | 93 comments

Inferencing at 10.33 t/s on Qwen 3.5 35B on a $300 laptop

Posted by OcelotOk8071@reddit | LocalLLaMA | View on Reddit | 20 comments

OcelotOk8071@reddit (OP)

Q4 is the golden rule: it's the absolute minimum. Even on this hardware, Q4 was the bare minimum for the model to be functional. A3B means that only about 3 billion parameters are active for any given token, so you only need enough VRAM for those active parameters and can keep the full set of weights in RAM.
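
For a rough sense of the numbers, here's a back-of-the-envelope sketch. It assumes exactly 4 bits per weight (real Q4 quants usually run a bit higher), takes the 35B total and 3B active figures from the model name, and ignores the KV cache and runtime overhead. `weight_gib` is a made-up helper for this example, not part of any library.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a parameter count and bit width."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Assumed figures: 35B total parameters, 3B active per token, 4 bits per weight.
total = weight_gib(35, 4.0)
active = weight_gib(3, 4.0)

print(f"all weights at Q4:    {total:.1f} GiB")
print(f"active weights at Q4: {active:.1f} GiB")
```

The gap between the two numbers is the whole appeal of a small-active-parameter MoE on cheap hardware: the big figure has to fit somewhere (RAM), while the small one is what gets touched per token.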

OcelotOk8071@reddit (OP)

Anecdotal reports and my own experience suggest that 3.5 is better for general writing tasks, whereas 3.6 is better for agentic tasks.

OcelotOk8071@reddit (OP)

Qwen 3.5/3.6 35B A3B would definitely work well in your setup. You can load the full set of weights into RAM and stream the active params into VRAM.

Google AI Edge Gallery v1.0.13 & v1.0.14 updates: Gemma 4 Multi-Token Prediction, Pixel TPU support, experimental MCP, new skills, now saves chat history

Posted by AnticitizenPrime@reddit | LocalLLaMA | View on Reddit | 38 comments

PSA: If you haven’t updated Llama.cpp for a couple of days and find MTP to not be performing well, update llamacpp.

Posted by Borkato@reddit | LocalLLaMA | View on Reddit | 34 comments

I hope that someday we will have a 124B Gemma.

Posted by cgs019283@reddit | LocalLLaMA | View on Reddit | 77 comments

[Paper on Hummingbird+: low-cost FPGAs for LLM inference] Qwen3-30B-A3B Q4 at 18 t/s token-gen, 24GB, expected $150 mass production cost

Posted by ayake_ayake@reddit | LocalLLaMA | View on Reddit | 56 comments

Kimi K2.6 Released (huggingface)

Posted by BiggestBau5@reddit | LocalLLaMA | View on Reddit | 277 comments

GPT Image 2 finally killed the 'yellow filter'—everyday Chinese scenes are usable now

Posted by TroyHarry6677@reddit | LocalLLaMA | View on Reddit | 5 comments

I made the "Mafia" Party game with leading LLMs to see how good these SOTA models are at social deduction and manipulating

Posted by Cyrax21_@reddit | LocalLLaMA | View on Reddit | 4 comments

How are you handling output inconsistency in local LLM setups?

Posted by nipundwivedi@reddit | LocalLLaMA | View on Reddit | 1 comments

Released Qwen3.6-35B-A3B

Posted by NewEconomy55@reddit | LocalLLaMA | View on Reddit | 93 comments

Me right now

Posted by -dysangel-@reddit | LocalLLaMA | View on Reddit | 10 comments

What it took to launch Google DeepMind's Gemma 4

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 136 comments

OcelotOk8071@reddit

But it's not just vibes. If you look at the well-known creative writing models, nearly all of them are at least 24B. Look at Cydonia, for example.

OcelotOk8071@reddit

Small MoEs are always good for "dry" intelligence (STEM, reasoning, coding, math), and large dense models (like 31B) are better at soft intelligence (seeing the overall picture, understanding nuance, creative writing, etc.). It's very likely that the MoE had HIGH intelligence in both types of skills. It's probably pretty legendary.

Are we currently in a "Golden Time" for low VRAM/1 GPU users with Qwen 27b?

Posted by inthesearchof@reddit | LocalLLaMA | View on Reddit | 117 comments

OcelotOk8071@reddit

I have been skeptical of MoEs for quite some time, especially since there are no reputable MoE creative writing models. There's no such thing as cheating physics. We can't just gain performance from nothing.

Am I doing something wrong? Or is Qwen 3.5VL only capable of writing dialogue like it's trying to imitate some kind of medieval knight?

Posted by Parogarr@reddit | LocalLLaMA | View on Reddit | 24 comments

llama.cpp build b8338 adds OpenVINO backend + NPU support for prefill + kvcache

Posted by stormy1one@reddit | LocalLLaMA | View on Reddit | 13 comments

Qwen 3 32B on M2 Max 32GB — my honest 3-week assessment

Posted by Budulai343@reddit | LocalLLaMA | View on Reddit | 16 comments

WTF? Was Qwen3.5 9B trained with Google?

Posted by powerade-trader@reddit | LocalLLaMA | View on Reddit | 11 comments

Qwen3.5 2B giving weird answers

Posted by Dean_Thomas426@reddit | LocalLLaMA | View on Reddit | 10 comments

70B llm on 4gb android phone !

Posted by Vast_Lingonberry7259@reddit | LocalLLaMA | View on Reddit | 25 comments

Any hope for Gemma 4 release?

Posted by gamblingapocalypse@reddit | LocalLLaMA | View on Reddit | 38 comments

Speculation: new Gemma, Granite, Arcee Trinity models when?

Posted by RobotRobotWhatDoUSee@reddit | LocalLLaMA | View on Reddit | 7 comments

4x4090 build running gpt-oss:20b locally - full specs

Posted by RentEquivalent1671@reddit | LocalLLaMA | View on Reddit | 96 comments

My LLM trained from scratch on only 1800s London texts brings up a real protest from 1834

Posted by Remarkable-Trick-177@reddit | LocalLLaMA | View on Reddit | 174 comments

My rice emergency supply has been destroyed by bugs.

Posted by WoodgladeRiver@reddit | preppers | View on Reddit | 172 comments

OpenAI teases to open-source model(s) soon

Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 113 comments

Do any of you have a "hidden gem" LLM that you use daily?

Posted by ForsookComparison@reddit | LocalLLaMA | View on Reddit | 53 comments

How are people using models smaller than 5b parameters?

Posted by Vegetable_Sun_9225@reddit | LocalLLaMA | View on Reddit | 130 comments

Copyright protection method that has the "key" on both the distribution service server, and the actual copyright holder themselves, and only requires verification every 7 days or 30 days

Posted by dickcheney600@reddit | CrazyIdeas | View on Reddit | 19 comments

Deepseek bitnet

Posted by Thistleknot@reddit | LocalLLaMA | View on Reddit | 52 comments

LLMs like ChatGPT should nerf the output when people are being rude to it.

Posted by pastafarian24@reddit | CrazyIdeas | View on Reddit | 43 comments

OcelotOk8071@reddit

Idea rejected, because it's already a thing. Polite prompts give better results, whereas impolite prompts lead to worse generations. https://arxiv.org/abs/2402.14531

Who will release a new model in 2025 firstly?

Posted by foldl-li@reddit | LocalLLaMA | View on Reddit | 44 comments

PoSe

Posted by Sufficient-Smile-481@reddit | LocalLLaMA | View on Reddit | 3 comments

Restaurant called "The Birds Nest" where you order your food, the waiter eats it, then regurgitates it into your mouth.

Posted by n_thomas74@reddit | CrazyIdeas | View on Reddit | 16 comments

Meta's Byte Latent Transformer (BLT) paper looks like the real-deal. Outperforming tokenization models even up to their tested 8B param model size. 2025 may be the year we say goodbye to tokenization.

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 190 comments

OcelotOk8071@reddit

But couldn't we represent machine code as letters? In fact, since the model is optimized for language, wouldn't this approach make it better at it?
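
As a small illustration of what "machine code as letters" could look like, here's a sketch that writes a few x86-64 instruction bytes as a hex string and back. Hex is just one possible encoding picked for this example; it isn't taken from the BLT paper.

```python
# Two x86-64 instructions as raw bytes: push rbp; mov rbp, rsp
machine_code = bytes([0x55, 0x48, 0x89, 0xE5])

# Encode every byte as two hex characters, so the program becomes plain text.
as_letters = machine_code.hex()

# Decoding the text gives back exactly the same bytes.
round_trip = bytes.fromhex(as_letters)

print(as_letters)
print(round_trip == machine_code)
```

Note that the text form is twice as long as the raw bytes, so a tokenizer-based model would still have to spend tokens on it.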