OcelotOk8071

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Posted by johnnyApplePRNG@reddit | LocalLLaMA | View on Reddit | 93 comments

OcelotOk8071@reddit

AGI

Inferencing at 10.33 t/s on Qwen 3.5 35B on a $300 laptop

Posted by OcelotOk8071@reddit | LocalLLaMA | View on Reddit | 20 comments

OcelotOk8071@reddit (OP)

Q4 is the golden rule. Treat it as the absolute minimum: even with this hardware, Q4 was the bare minimum for functionality. A3B means that only about 3 billion parameters are active for any given token, so VRAM only needs to accommodate the weights in use, while the full set of params can be offloaded to RAM.
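
For anyone who wants to sanity-check the sizing, here is a rough back-of-envelope sketch. The 4.5 bits per weight and the 40 GB/s RAM bandwidth are assumptions picked for illustration, not measurements from this laptop, and the function names are made up for the example. It only estimates weight size and a memory-bandwidth ceiling; real throughput will land below that.

```python
# Back-of-envelope sizing for a quantized MoE model.
# All constants below are illustrative assumptions, not measurements.

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in bytes."""
    return n_params * bits_per_weight / 8


def max_tokens_per_second(active_params: float, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every active weight is read once per token."""
    bytes_per_token = weight_bytes(active_params, bits_per_weight)
    return bandwidth_gb_s * 1e9 / bytes_per_token


TOTAL_PARAMS = 35e9       # "35B" total parameters
ACTIVE_PARAMS = 3e9       # "A3B": roughly 3B active per token
BITS_PER_WEIGHT = 4.5     # assumed average for a Q4-style quant
RAM_BANDWIDTH_GB_S = 40   # assumed dual-channel DDR4 laptop figure

total_gb = weight_bytes(TOTAL_PARAMS, BITS_PER_WEIGHT) / 1e9
active_gb = weight_bytes(ACTIVE_PARAMS, BITS_PER_WEIGHT) / 1e9
ceiling = max_tokens_per_second(ACTIVE_PARAMS, BITS_PER_WEIGHT, RAM_BANDWIDTH_GB_S)

print(f"all weights:       ~{total_gb:.1f} GB")
print(f"read per token:    ~{active_gb:.2f} GB")
print(f"bandwidth ceiling: ~{ceiling:.0f} tokens/s")
```

Under these assumptions the whole model needs to fit in RAM, but each token only touches a small slice of it, which is why a MoE like this stays usable on modest hardware.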

Inferencing at 10.33 t/s on Qwen 3.5 35B on a $300 laptop

Posted by OcelotOk8071@reddit | LocalLLaMA | View on Reddit | 20 comments

OcelotOk8071@reddit (OP)

Anecdotal reports plus my own experience suggest that 3.5 is better for general writing tasks, whereas 3.6 is better for agentic tasks.

Inferencing at 10.33 t/s on Qwen 3.5 35B on a $300 laptop

Posted by OcelotOk8071@reddit | LocalLLaMA | View on Reddit | 20 comments

OcelotOk8071@reddit (OP)

Qwen 3.5/3.6 35B A3B would definitely work well in your setup. You can keep the total params in RAM and use the VRAM for the active params.

Inferencing at 10.33 t/s on Qwen 3.5 35B on a $300 laptop

Posted by OcelotOk8071@reddit | LocalLLaMA | View on Reddit | 20 comments

OcelotOk8071@reddit (OP)

will try this.

Inferencing at 10.33 t/s on Qwen 3.5 35B on a $300 laptop

Posted by OcelotOk8071@reddit | LocalLLaMA | View on Reddit | 20 comments

OcelotOk8071@reddit (OP)

That's interesting. Perhaps running iGPU inference on this rig would be considerably better.

Inferencing at 10.33 t/s on Qwen 3.5 35B on a $300 laptop

Posted by OcelotOk8071@reddit | LocalLLaMA | View on Reddit | 20 comments

OcelotOk8071@reddit (OP)

Slow, yeah. You probably can, but cost efficiency isn't the point.

Google AI Edge Gallery v1.0.13 & v1.0.14 updates: Gemma 4 Multi-Token Prediction, Pixel TPU support, experimental MCP, new skills, now saves chat history

Posted by AnticitizenPrime@reddit | LocalLLaMA | View on Reddit | 38 comments

OcelotOk8071@reddit

This is great, but watch out: this is a corporate takeover.

PSA: If you haven’t updated Llama.cpp for a couple of days and find MTP to not be performing well, update llamacpp.

Posted by Borkato@reddit | LocalLLaMA | View on Reddit | 34 comments

OcelotOk8071@reddit

let us c pp

I hope that someday we will have a 124B Gemma.

Posted by cgs019283@reddit | LocalLLaMA | View on Reddit | 77 comments

OcelotOk8071@reddit

I'm not sure if they covered it last Google I/O

[Paper on Hummingbird+: low-cost FPGAs for LLM inference] Qwen3-30B-A3B Q4 at 18 t/s token-gen, 24GB, expected $150 mass production cost

Posted by ayake_ayake@reddit | LocalLLaMA | View on Reddit | 56 comments

OcelotOk8071@reddit

Definitely a game changer. I'd buy it.

Kimi K2.6 Released (huggingface)

Posted by BiggestBau5@reddit | LocalLLaMA | View on Reddit | 277 comments

OcelotOk8071@reddit

that's not a foregone conclusion.

GPT Image 2 finally killed the 'yellow filter'—everyday Chinese scenes are usable now

Posted by TroyHarry6677@reddit | LocalLLaMA | View on Reddit | 5 comments

OcelotOk8071@reddit

https://preview.redd.it/e9rh20mk49wg1.jpeg?width=1024&format=pjpg&auto=webp&s=9817d24541a0dc7a72c71e69a366b232d16a0f63

I made the "Mafia" Party game with leading LLMs to see how good these SOTA models are at social deduction and manipulating

Posted by Cyrax21_@reddit | LocalLLaMA | View on Reddit | 4 comments

OcelotOk8071@reddit

These are not local models.

How are you handling output inconsistency in local LLM setups?

Posted by nipundwivedi@reddit | LocalLLaMA | View on Reddit | 1 comments

OcelotOk8071@reddit

Adjust the temperature down
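
To show why that helps, here is a minimal sketch of temperature-scaled softmax. The logits are made-up scores for three candidate tokens, and the function name is just for this example rather than any particular library's API. Lower temperature pushes more probability onto the top token, so repeated runs diverge less.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 (use argmax for fully greedy decoding)")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Temperature alone won't make output fully deterministic unless sampling is effectively greedy, so fixing the seed is worth trying alongside it if your runtime exposes one.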

Released Qwen3.6-35B-A3B

Posted by NewEconomy55@reddit | LocalLLaMA | View on Reddit | 93 comments

OcelotOk8071@reddit

why is that?

Me right now

Posted by -dysangel-@reddit | LocalLLaMA | View on Reddit | 10 comments

OcelotOk8071@reddit

Can someone familiar with minimax say what this model is about? What is it typically used for? Why not use qwen?

What it took to launch Google DeepMind's Gemma 4

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 136 comments

OcelotOk8071@reddit

But it's not just vibes. If you look at the well-known creative writing models, nearly all of them are at least 24B. Look at Cydonia, for example.

What it took to launch Google DeepMind's Gemma 4

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 136 comments

OcelotOk8071@reddit

okay, who says claude isn't a large MoE then?

What it took to launch Google DeepMind's Gemma 4

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 136 comments

OcelotOk8071@reddit

Small MoEs are generally good at "dry" intelligence (STEM, reasoning, coding, math), while large dense models (like 31B) are better at soft intelligence (seeing the overall picture, understanding nuance, creative writing, etc.). It's very likely that the MoE had high intelligence in both types of skills. It's probably pretty legendary.

Are we currently in a "Golden Time" for low VRAM/1 GPU users with Qwen 27b?

Posted by inthesearchof@reddit | LocalLLaMA | View on Reddit | 117 comments

OcelotOk8071@reddit

I have 32 GB of DDR4 laptop RAM. No VRAM.

Are we currently in a "Golden Time" for low VRAM/1 GPU users with Qwen 27b?

Posted by inthesearchof@reddit | LocalLLaMA | View on Reddit | 117 comments

OcelotOk8071@reddit

I have been skeptical of MoEs for quite some time, especially given that there are no reputable creative writing models built on MoEs. There's no such thing as cheating physics. We can't just gain performance from nothing.

Are we currently in a "Golden Time" for low VRAM/1 GPU users with Qwen 27b?

Posted by inthesearchof@reddit | LocalLLaMA | View on Reddit | 117 comments

OcelotOk8071@reddit

Question: what model kicked off this new interest in dense models? Qwen 3.5 27B?

Am I doing something wrong? Or is Qwen 3.5VL only capable of writing dialogue like it's trying to imitate some kind of medieval knight?

Posted by Parogarr@reddit | LocalLLaMA | View on Reddit | 24 comments

OcelotOk8071@reddit

🤣

Am I doing something wrong? Or is Qwen 3.5VL only capable of writing dialogue like it's trying to imitate some kind of medieval knight?

Posted by Parogarr@reddit | LocalLLaMA | View on Reddit | 24 comments

OcelotOk8071@reddit

full prompt?

llama.cpp build b8338 adds OpenVINO backend + NPU support for prefill + kvcache

Posted by stormy1one@reddit | LocalLLaMA | View on Reddit | 13 comments

OcelotOk8071@reddit

Would this allow for acceleration on Intel CPU/RAM setups?

Qwen 3 32B on M2 Max 32GB — my honest 3-week assessment

Posted by Budulai343@reddit | LocalLLaMA | View on Reddit | 16 comments

OcelotOk8071@reddit

Can you elaborate on what you're using agents for?

WTF? Was Qwen3.5 9B trained with Google?

Posted by powerade-trader@reddit | LocalLLaMA | View on Reddit | 11 comments

OcelotOk8071@reddit

Models don't know about themselves.

Qwen3.5 2B giving weird answers

Posted by Dean_Thomas426@reddit | LocalLLaMA | View on Reddit | 10 comments

OcelotOk8071@reddit

Q3 on a 2B is quite aggressive.
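
As a rough illustration of what dropping bits costs, here is a toy round-to-nearest quantizer. It is a deliberately simplified sketch: real GGUF quants use per-block scales and smarter schemes, the random weights are a stand-in for an actual layer, and the function names are hypothetical. It shows only that rounding error grows quickly as the bit width shrinks, not why smaller models tolerate it worse.

```python
import random


def quantize_roundtrip(weights, bits):
    """Symmetric round-to-nearest quantization to `bits` bits, then dequantize."""
    levels = 2 ** (bits - 1) - 1  # e.g. 7 positive levels at 4 bits, 3 at 3 bits
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and round-tripped weights."""
    restored = quantize_roundtrip(weights, bits)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]  # stand-in for real weights
for bits in (8, 4, 3):
    print(bits, "bits -> mean abs error", round(mean_abs_error(weights, bits), 4))
```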

70B llm on 4gb android phone !

Posted by Vast_Lingonberry7259@reddit | LocalLLaMA | View on Reddit | 25 comments

OcelotOk8071@reddit

AI wrote this

Any hope for Gemma 4 release?

Posted by gamblingapocalypse@reddit | LocalLLaMA | View on Reddit | 38 comments

OcelotOk8071@reddit

do you have a link for that post?

Speculation: new Gemma, Granite, Arcee Trinity models when?

Posted by RobotRobotWhatDoUSee@reddit | LocalLLaMA | View on Reddit | 7 comments

OcelotOk8071@reddit

Gemma models are the best base models for their size. Lots of knowledge, and cooked for a long time.

4x4090 build running gpt-oss:20b locally - full specs

Posted by RentEquivalent1671@reddit | LocalLLaMA | View on Reddit | 96 comments

OcelotOk8071@reddit

Taylor Swift when she wants to run gpt oss 20b locally:

My LLM trained from scratch on only 1800s London texts brings up a real protest from 1834

Posted by Remarkable-Trick-177@reddit | LocalLLaMA | View on Reddit | 174 comments

OcelotOk8071@reddit

"two halfpennies" LOL

My rice emergency supply has been destroyed by bugs.

Posted by WoodgladeRiver@reddit | preppers | View on Reddit | 172 comments

OcelotOk8071@reddit

Please, please learn what capitalization is.

OpenAI teases to open-source model(s) soon

Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 113 comments

OcelotOk8071@reddit

You should already be doing that. They popularized it.

Do any of you have a "hidden gem" LLM that you use daily?

Posted by ForsookComparison@reddit | LocalLLaMA | View on Reddit | 53 comments

OcelotOk8071@reddit

ChatterUI shouldn't require an internet connection. It has options to connect to a local or external API.

How are people using models smaller than 5b parameters?

Posted by Vegetable_Sun_9225@reddit | LocalLLaMA | View on Reddit | 130 comments

OcelotOk8071@reddit

Speculative decoding, but some tasks like creative writing are also workable with small models.
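
For context on the speculative decoding part, here is a minimal sketch of the greedy variant, where a small draft model guesses a few tokens ahead and the large target model only keeps the guesses it agrees with. The two "models" below are toy functions invented for the example, and a real engine would verify all guesses in one batched forward pass instead of a Python loop.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: output matches what `target` alone would produce."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model guesses k tokens ahead.
        guess: List[int] = []
        for _ in range(k):
            guess.append(draft(out + guess))
        # 2. The target checks each guessed position in order.
        accepted: List[int] = []
        for g in guess:
            t = target(out + accepted)
            accepted.append(t)   # the target's own token is always kept
            if t != g:           # first disagreement ends this round
                break
        else:
            # Every guess matched, so the target contributes one bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:len(prompt) + max_new]


def greedy(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used here as the reference."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


# Toy "models": the target counts up mod 10; the draft agrees except right after a 6.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 6 else (seq[-1] + 1) % 10


result = speculative_decode(toy_target, toy_draft, [1], max_new=12)
print(result)
print(result == greedy(toy_target, [1], 12))
```

The speedup comes from the rounds where the draft guesses right: the target confirms several tokens at once instead of generating them one by one, and the output stays the same as target-only greedy decoding.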

Copyright protection method that has the "key" on both the distribution service server, and the actual copyright holder themselves, and only requires verification every 7 days or 30 days

Posted by dickcheney600@reddit | CrazyIdeas | View on Reddit | 19 comments

OcelotOk8071@reddit

We wouldn't need GOG if we didn't have scammy practices like DRM, would we?

Copyright protection method that has the "key" on both the distribution service server, and the actual copyright holder themselves, and only requires verification every 7 days or 30 days

Posted by dickcheney600@reddit | CrazyIdeas | View on Reddit | 19 comments

OcelotOk8071@reddit

Screw DRM.

Deepseek bitnet

Posted by Thistleknot@reddit | LocalLLaMA | View on Reddit | 52 comments

OcelotOk8071@reddit

RemindMe! 1 day

Deepseek bitnet

Posted by Thistleknot@reddit | LocalLLaMA | View on Reddit | 52 comments

OcelotOk8071@reddit

!remindme 1d

Deepseek bitnet

Posted by Thistleknot@reddit | LocalLLaMA | View on Reddit | 52 comments

OcelotOk8071@reddit

Replying for interest

LLMs like ChatGPT should nerf the output when people are being rude to it.

Posted by pastafarian24@reddit | CrazyIdeas | View on Reddit | 43 comments

OcelotOk8071@reddit

Precisely

LLMs like ChatGPT should nerf the output when people are being rude to it.

Posted by pastafarian24@reddit | CrazyIdeas | View on Reddit | 43 comments

OcelotOk8071@reddit

Idk if that's been tested before, so idk.

LLMs like ChatGPT should nerf the output when people are being rude to it.

Posted by pastafarian24@reddit | CrazyIdeas | View on Reddit | 43 comments

OcelotOk8071@reddit

Idea rejected, because it's already a thing. Nicer prompts lead to better results, whereas impolite prompts lead to worse generations. https://arxiv.org/abs/2402.14531

Who will release a new model in 2025 firstly?

Posted by foldl-li@reddit | LocalLLaMA | View on Reddit | 44 comments

OcelotOk8071@reddit

Impossible. How could they give reasoning abilities in a 14b?

PoSe

Posted by Sufficient-Smile-481@reddit | LocalLLaMA | View on Reddit | 3 comments

OcelotOk8071@reddit

How does memory scale using this method?

Restaurant called "The Birds Nest" where you order your food, the waiter eats it, then regurgitates it into your mouth.

Posted by n_thomas74@reddit | CrazyIdeas | View on Reddit | 16 comments

OcelotOk8071@reddit

Great minds think alike

Meta's Byte Latent Transformer (BLT) paper looks like the real-deal. Outperforming tokenization models even up to their tested 8B param model size. 2025 may be the year we say goodbye to tokenization.

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 190 comments

OcelotOk8071@reddit

But couldn't we represent machine code as letters? In fact, since the model is optimized for language, wouldn't this approach work better for it?
