Down_The_Rabbithole

Per-Layer Embeddings: A simple explanation of the magic behind the small Gemma 4 models

Posted by -p-e-w-@reddit | LocalLLaMA | View on Reddit | 68 comments

Down_The_Rabbithole@reddit

The question becomes why isn't this applied to bigger models? Does it stop scaling after a certain point? Why isn't Gemma 4 31B "E18B" instead?

A new paper demonstrates that LLMs could "think" in latent space, effectively decoupling internal reasoning from visible context tokens. This breakthrough suggests that even smaller models can achieve remarkable performance without relying on extensive context windows.

Posted by tehbangere@reddit | LocalLLaMA | View on Reddit | 305 comments

Hard disagree. The human brain just has a higher sample efficiency than the Transformer architecture in current LLMs. That's it. Humans, for example, don't have a specific brain region for reading. We only invented it 6000 years ago, and look how naturally you are writing messages to me and reading all of this text as if it is a native ability in our brains. Kids learn to read in about the same amount of training hours as they learn to speak. I genuinely consider Chomsky to just be dead wrong and hope he gets removed from (at the very least) the computer science curriculum, but perhaps even linguistics. The brain learning to read and write at a similar rate as speaking, together with LLMs learning the ability as well, highly suggests there isn't some "specific circuitry" in brains/LLMs that does this and instead it's a more general skill. Ergo language is an invention and a human innovation and not some innate evolved ability in the brain. The vocal cords necessary to produce sound *are* evolutionary but our ability to comprehend and engage in language is highly likely to not have been specialized when it came into existence.

Bullshit Benchmark - A benchmark for testing whether models identify and push back on nonsensical prompts instead of confidently answering them

Posted by bot_exe@reddit | LocalLLaMA | View on Reddit | 33 comments

Down_The_Rabbithole@reddit

Yeah what surprises me most is that Chinese models clearly trained/semi-distilled on Claude output, like GLM-5, don't seem to have inherited this anti-sycophancy.

Minimum viable LLM

Posted by Down_The_Rabbithole@reddit | LocalLLaMA | View on Reddit | 30 comments

Down_The_Rabbithole@reddit (OP)

Wow, that's very impressive, thanks for sharing.

Artificial Analysis: South Korea 🇰🇷 is now the clear #3 nation in AI — powered by the Korean National Sovereign AI Initiative there are now multiple Korean AI labs with near frontier intelligence.

Posted by self-fix@reddit | LocalLLaMA | View on Reddit | 59 comments

Down_The_Rabbithole@reddit

Yeah this is ridiculous and I suspect the account to be either a South Korean propaganda account or a South Korean nationalist posting this. What a bizarre post.

ASUS Rumored To Enter DRAM Market Next Year

Posted by Highwaytothebeach@reddit | LocalLLaMA | View on Reddit | 37 comments

Down_The_Rabbithole@reddit

It would. ASUS is targeting the consumer market, so if their bid allows them to buy chips that would otherwise have gone to non-consumer systems, there is now more supply for regular consumer chips, which puts downward pressure on DIY RAM prices.

Nvidia DGX Station GB300 784GB available now! 95,000 USD / 80,000 EUR

Posted by GPTshop@reddit | LocalLLaMA | View on Reddit | 331 comments

Down_The_Rabbithole@reddit

Exactly 288GB of HBM3e (VRAM) and the rest is regular DRAM that isn't equivalent in bandwidth to the (actual) VRAM.

Nvidia DGX Station GB300 784GB available now! 95,000 USD / 80,000 EUR

Posted by GPTshop@reddit | LocalLLaMA | View on Reddit | 331 comments

Down_The_Rabbithole@reddit

This is only 288GB of VRAM.... Not worth the price of admission unless you're power constrained.

WTF are these AI companies doing where they supposedly are the cause of the ram price spike?

Posted by Red_Redditor_Reddit@reddit | LocalLLaMA | View on Reddit | 430 comments

Down_The_Rabbithole@reddit

Then they came for the local electricity, and there was no one left to speak for me.

GigaChat3-702B-A36B-preview is now available on Hugging Face

Posted by Any-Ship9886@reddit | LocalLLaMA | View on Reddit | 89 comments

Down_The_Rabbithole@reddit

The Russian language seems to be the most effective at prompting LLMs because of the information density of the language. Japanese is the worst for prompting. As a Japanese AI expert I understand exactly why. The Japanese language was purposefully designed to be as ambiguous as possible to always have plausible deniability and to leave most meaning and implications unspoken. But that's the absolute worst thing you want in a language for LLMs. Rigid languages that communicate more concepts through their inherent grammar are superior in the age of LLMs. English is also trending the wrong way. It is making word usage more ambiguous, and the removal of he/her genders as well as the lower information density of "ebonics" style English means that it's slowly losing its information density and looking more and more like Japanese. I wonder, if we stick with the current LLM paradigm and there is no breakthrough that changes things, whether languages will again change to become more explicit and rigid over time purely because people realize it helps their performance when communicating with AI systems.

US Cloud Giants to Spend ~8.16× What China Does in 2025–27 — $1.7 Trillion vs $210 Billion, Will it translate to stronger US AI dominance?

Posted by abdouhlili@reddit | LocalLLaMA | View on Reddit | 169 comments

Down_The_Rabbithole@reddit

As someone that actually worked in China, people don't realize that the bureaucracy and layers of middlemen and committee decisions *are way worse in China than in the west*. People don't know this but over a certain size of a company there needs to be a member of the CCP on the board of directors that has to sign off on any major decisions. They try to make this a technical person but their knowledge usually ends up short, which means they have to bring it back to the regional government office for approval and it takes a ton of time and sometimes multiple back and forths before you can get something done. Ironically enough, China used to be more flexible when it was more corrupt because you could cut through red tape with bribes. Under Xi Jinping the regulatory screws got tightened to an insane degree, barely giving any tech company breathing room. It's why foreign talent like me left the country. This is also why I think the west will win the tech race. When it comes down to it the west has the least amount of bureaucracy and red tape to bring a vision into reality and the people complaining have no frame of reference to how bad it is in other places, including China.

Meta chief AI scientist Yann LeCun plans to exit to launch startup, FT reports

Posted by brown2green@reddit | LocalLLaMA | View on Reddit | 43 comments

Down_The_Rabbithole@reddit

Very ignorant thing to claim

Server DRAM prices surge up to 50% as AI-induced memory shortage hits hyperscaler supply — U.S. and Chinese customers only getting 70% order fulfillment

Posted by IonizedRay@reddit | LocalLLaMA | View on Reddit | 62 comments

Down_The_Rabbithole@reddit

I meant datacenters, but actually I feel like that name is outdated as most servers are not built to process data anymore but rather to train AI models.

Server DRAM prices surge up to 50% as AI-induced memory shortage hits hyperscaler supply — U.S. and Chinese customers only getting 70% order fulfillment

Posted by IonizedRay@reddit | LocalLLaMA | View on Reddit | 62 comments

Down_The_Rabbithole@reddit

It's because of the insane amount of databases being built right now. There literally isn't enough capacity to supply all of it at the same time so prices go up. There are planned (and paid for) database plans up to the middle of the 2030s so we will see high prices for years.

Server DRAM prices surge up to 50% as AI-induced memory shortage hits hyperscaler supply — U.S. and Chinese customers only getting 70% order fulfillment

Posted by IonizedRay@reddit | LocalLLaMA | View on Reddit | 62 comments

Down_The_Rabbithole@reddit

It's a perfect storm. DRAM manufacturers are slowly gearing up for DDR6 so DDR5 production isn't expanding as fast as it naturally would have. DDR4 production is completely stopped starting this year. DDR5 capacity has been bought out for the foreseeable future, but there is a heightened demand as well because of MoE inference, so every company and hobbyist doing anything interesting with LLMs needs to stock up on RAM. This will percolate to SSD NAND prices relatively soon.

If You Want to Understand Why Llama Models Flopped, Zuck is the Cause!

Posted by Iory1998@reddit | LocalLLaMA | View on Reddit | 213 comments

Down_The_Rabbithole@reddit

At the peak of Blackberry (2012) 21% of Americans owned a blackberry phone. To give you some indication, the highest percentage Apple has ever gotten was 30% in 2015, not significantly more than Blackberry. And there were only more iphones sold than blackberries globally in 2013. I'm not a blackberry fan or an iphone hater. I just hate this artificial mythos that has been created around iphones being some special innovation or technological revolution. The launch of the iphone is statistically insignificant on the general trend of smartphone adoption, it's just that gradually the iphone form factor seems to have won out and (slowly!) replaced the traditional "keypad" form factor of smartphones.

If You Want to Understand Why Llama Models Flopped, Zuck is the Cause!

Posted by Iory1998@reddit | LocalLLaMA | View on Reddit | 213 comments

Down_The_Rabbithole@reddit

>How was google able to go from bard to Gemini while meta went from llama great to llama crap

Mostly because Bard and Gemini were made by two completely different teams. Bard was done by Google Brain while Gemini was done by Google DeepMind. Yeah.... Google used to have 2 completely separate AI divisions, and treated Google Brain better because it was San Francisco based while DeepMind was London based and treated more hands-off. When Google Brain fumbled with Bard Google pulled the plug on them, they got merged into DeepMind and DeepMind used their superior AI talent to rapidly make a good product. So from an outsider perspective it looks like Google had a rapid improvement from a shit product to a cutting edge product. But in reality it was just a bad team making a bad product and then a separate good team making a good product, there was never any improvement or iteration going on under the hood.

If You Want to Understand Why Llama Models Flopped, Zuck is the Cause!

Posted by Iory1998@reddit | LocalLLaMA | View on Reddit | 213 comments

Down_The_Rabbithole@reddit

This "iphone moment" never happened and is a myth. Which is very perplexing to me as it's not long ago so most people here would have personally experienced it. Palmtops/PDAs already were smartphones in the 1990s with internet browsers, email services and internet based messaging on them. They were indeed clunky but they were feature complete and not very different from modern smartphones aside from processing power. But even if you dismiss those, the entire first world was already using blackberry smartphones for close to 5 years before iphones were introduced. Everyone I knew owned those and used them to go to myspace and early facebook, watch and send videos to each other and do most things people do nowadays, and they were almost as popular as iphones. Apple must have insane marketing to be able to make people forget about the 5 years of smartphone usage before the iphone was introduced. To bring it back to the VR discussion: there won't be an "iphone moment" because those don't exist. There is a gradual adoption curve of the technology by the general public like every other technology in history and it's not spiky.

If You Want to Understand Why Llama Models Flopped, Zuck is the Cause!

Posted by Iory1998@reddit | LocalLLaMA | View on Reddit | 213 comments

Down_The_Rabbithole@reddit

People do this all the time, they are called glasses, no one seems to care. It will just take time for hardware to be small enough to fit a normal pair of glasses with negligible weight. Just like it took miniaturization before smartwatches took off.

Gemma 4

Posted by Brave-Hold-9389@reddit | LocalLLaMA | View on Reddit | 73 comments

Down_The_Rabbithole@reddit

Gemma models are extremely good for real time translation on portable local devices, something that is impractical for bigger models to do. A big usecase of gemma specifically that not a lot of people talk about is real time translation between people with unreliable internet connection.

Stanford just dropped 5.5hrs worth of lectures on foundational LLM knowledge

Posted by igorwarzocha@reddit | LocalLLaMA | View on Reddit | 74 comments

Down_The_Rabbithole@reddit

Disagree with MLA being a thing only Deepseek does. Slightly modified techniques which are essentially MLA are being done by almost all compute constrained labs, which essentially means all chinese labs as well as some smaller players like Mistral. Google has a proprietary in-house approach to kv-cache which is so secret most engineers don't even know about it as it's what gives Google their monopoly on consistency on very long context sizes. My hypothesis is that this is a superior version of essentially MLA.
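
To make concrete why compute-constrained labs care about this family of techniques, here is a rough back-of-the-envelope sketch of per-token KV-cache size with ordinary multi-head attention versus an MLA-style scheme that caches one compressed latent per layer plus a small positional key. The layer count and dimensions are illustrative assumptions, not the configuration of any particular model, and this says nothing about what Google does in-house.

```python
def full_kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Standard attention: a key AND a value vector per KV head, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def latent_kv_cache_bytes_per_token(n_layers, latent_dim, rope_dim, bytes_per_value=2):
    # MLA-style: one shared compressed latent plus a small positional key per layer.
    # Keys and values are re-projected from the latent at attention time.
    return n_layers * (latent_dim + rope_dim) * bytes_per_value


# Hypothetical model shape, 16-bit cache values.
full = full_kv_cache_bytes_per_token(n_layers=60, n_kv_heads=128, head_dim=128)
latent = latent_kv_cache_bytes_per_token(n_layers=60, latent_dim=512, rope_dim=64)
print(f"full: {full} B/token, latent: {latent} B/token, ratio: {full / latent:.1f}x")
```

The point is only the scaling: the cached state per token stops growing with the number of heads, which is what makes long contexts cheaper to hold in memory.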

Good ol gpu heat

Posted by animal_hoarder@reddit | LocalLLaMA | View on Reddit | 38 comments

Down_The_Rabbithole@reddit

Make sure to adjust the 3090 voltage curve. You can underclock the GPU core while overclocking the memory for a nice gain in LLM performance. You can usually get a 20-30% power (and heat) reduction by just adjusting the voltage curve. It's a free lunch.

Intel Arc Pro B60 24GB professional GPU listed at $599, in stock and shipping

Posted by PhantomWolf83@reddit | LocalLLaMA | View on Reddit | 171 comments

Down_The_Rabbithole@reddit

I bought a 3090 2nd hand for less than $300 a couple of years ago. They are more expensive now but it's bizarre that a 5 year old GPU still beats modern flagships.

Is this real? 14b coder.

Posted by Relative_Ad_9881@reddit | LocalLLaMA | View on Reddit | 43 comments

Down_The_Rabbithole@reddit

Ollama is downstream from llama.cpp. They just badly copy the llama.cpp implementations, somehow manage to screw up implementations and default settings and then call it a day. llama.cpp has more functionality, better stability and is easier to use. It's just that Ollama was founded by ex-google employees that used a big bag of cash and SF connections to try and promote themselves more. No one serious should use it.

Is this real? 14b coder.

Posted by Relative_Ad_9881@reddit | LocalLLaMA | View on Reddit | 43 comments

Down_The_Rabbithole@reddit

Just delete Ollama and install Llamacpp already. Ridiculously bad application that no one should use.

4x 3090 local ai workstation

Posted by monoidconcat@reddit | LocalLLaMA | View on Reddit | 247 comments

Down_The_Rabbithole@reddit

Not only the power limit but adjusting the voltage curve as well. Most 3090s can work with lower voltages while maintaining performance, lowering power draw, heat and sound production.

PSA for Ollama Users: Your Context Length Might Be Lower Than You Think

Posted by gpt872323@reddit | LocalLLaMA | View on Reddit | 55 comments

Down_The_Rabbithole@reddit

I'd even go as far as this sub having to have a stickied thread on the front page urging people to never use Ollama and switch to Llamacpp. Ollama is a bad-faith project that uses a lot of behind-the-scenes politicking and paid things to try and push themselves. They copy Llamacpp code without understanding how it works and implement settings and features in a wrong way, which causes an insane number of bugs and a terrible user experience. The only reason I'm not asking for an outright ban on Ollama discussion at all is because it goes against the Open Source ethos to do so. But they are absolutely a malicious entity with no upsides to the wider community and should be avoided on principle alone.

Renting GPUs is hilariously cheap

Posted by -p-e-w-@reddit | LocalLLaMA | View on Reddit | 398 comments

Down_The_Rabbithole@reddit

Hell, it's *cheaper* to run on API than it is to run *on my own hardware* purely because the electricity costs of running the machine are higher than the API costs. Economies of scale, lower electricity costs and inference batching tricks mean that using your own hardware is usually more expensive.
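
A rough way to check this for your own setup is to turn power draw and generation speed into an electricity cost per million tokens and compare that with an API price. The sketch below is only that: the wattage, throughput, electricity price and API price are made-up placeholder values, not measurements, and it ignores hardware purchase cost entirely.

```python
def electricity_cost_per_million_tokens(watts, tokens_per_second, price_per_kwh):
    """Electricity cost (same currency as price_per_kwh) to generate 1M tokens."""
    seconds = 1_000_000 / tokens_per_second
    kwh = watts * seconds / 3_600_000  # watt-seconds -> kilowatt-hours
    return kwh * price_per_kwh


# Hypothetical single-user local rig: 600 W at the wall, 20 tokens/s, 0.30 per kWh.
local_cost = electricity_cost_per_million_tokens(
    watts=600, tokens_per_second=20, price_per_kwh=0.30
)
api_price_per_million = 1.00  # placeholder API output price per 1M tokens

print(f"local electricity: {local_cost:.2f} per 1M tokens")
print(f"API:               {api_price_per_million:.2f} per 1M tokens")
```

Batching is what changes the picture for providers: the same power draw is spread over many concurrent requests, so their effective tokens per second per watt is far higher than a single local user gets.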

Can 2 RTX 6000 Pros (2X98GB vram) rival Sonnet 4 or Opus 4?

Posted by devshore@reddit | LocalLLaMA | View on Reddit | 222 comments

Down_The_Rabbithole@reddit

Q4 with QAT would potentially come close within the VRAM requirement.
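
For a sense of scale, weight memory is roughly parameter count times bits per weight. The sketch below assumes exactly 4 bits per weight and reserves a guessed 20% of VRAM for KV cache and runtime overhead; real Q4 formats use somewhat more than 4 bits on average and the overhead depends heavily on context length, so treat the numbers as ballpark only.

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def max_params_billion(vram_gb, bits_per_weight, overhead_fraction=0.2):
    """Largest model (in billions of parameters) whose weights fit in the usable VRAM."""
    usable_gb = vram_gb * (1 - overhead_fraction)
    return usable_gb * 8 / bits_per_weight


total_vram_gb = 2 * 98  # the two cards from the thread title
print(f"70B at 4-bit:  ~{weight_footprint_gb(70, 4):.0f} GB of weights")
print(f"fits at 4-bit: ~{max_params_billion(total_vram_gb, 4):.0f}B parameters")
```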

Deepseek changes their API price again

Posted by Pro-editor-1105@reddit | LocalLLaMA | View on Reddit | 37 comments

Down_The_Rabbithole@reddit

I think the naming of the subreddit doesn't actually align with how it's used. It's more about open weight models rather than local. It's about the ability to run it locally if needed or wanted, not about actually running it locally. Like how open source software is still open source even if you run it on some cloud server.

When will low-cost Chinese GPUs hit the market?

Posted by noellarkin@reddit | LocalLLaMA | View on Reddit | 97 comments

Down_The_Rabbithole@reddit

EUV *is* a completely new and separate technology from DUV, which is why other DUV companies have had issues pivoting towards it, including GlobalFoundries, which refused to buy EUV machines, and Intel, which after multiple attempts cancelled their own EUV efforts and is now buying ASML High-NA EUV machines instead. A metaphor would be the vacuum tube production line versus the transistor production line in the 1950s. Similar usage, completely different technology, underlying principles and required expertise.

There are at least 15 open source models I could find that can be run on a consumer GPU and which are better than Grok 2 (according to Artificial Analysis)

Posted by obvithrowaway34434@reddit | LocalLLaMA | View on Reddit | 117 comments

Down_The_Rabbithole@reddit

I hate most comments I read on reddit. Bunch of pedantic, whiny assholes. The funny thing is that this probably includes me as well for other redditors. Most arguments are typically also bad faith, malicious or intentionally bad takes for engagement, which defeats the purpose.

grok 2 weights

Posted by HatEducational9965@reddit | LocalLLaMA | View on Reddit | 201 comments

Down_The_Rabbithole@reddit

He means speculative decoding when he says multiple token prediction.

When will low-cost Chinese GPUs hit the market?

Posted by noellarkin@reddit | LocalLLaMA | View on Reddit | 97 comments

Down_The_Rabbithole@reddit

Yes I *do* believe if ASML were to stop existing the technology would die out due to institutional knowledge being lost. It's the most advanced logistical integration of over a thousand different companies that specialize in very specific parts just for the EUV lithography. Every EUV machine needs around a hundred specialists trained for a decade to maintain it every day. There has never been any other machine in the same order of complexity. The building of ITER or CERN looks like child's play compared to the technological marvel of the ASML EUV machines. It's honestly kind of bizarre we succeeded even once. Which is why high-NA EUV is still up for grabs. Everyone pretends we are certain ASML will deliver it as it's now in demo mode, but the machine is so insanely complex that honestly nothing like it ever existed in history. There is a genuine chance it will not work at all. China has more ASML technicians on their payroll than ASML themselves. They have for the last 10 years or so. The Chinese head of technology acquisition was an old head of ASML in fact. In spite of all of that and almost half a trillion in spending by China over the last 15 years to try and crack EUV they have failed to do so. China has had fully working ASML machines, plus some ex-employees; they disassembled the machines, tried to reverse engineer them, and failed. It's such a complex machine that most experts working on it themselves have no idea of the bigger picture as it's thousands of state of the art technologies combined in a single device. If anything I think the EU government should have some Manhattan Project-style plan to officially document how it works because I am firmly of the belief that even ASML themselves don't have the full picture written down in a single place. They just defer to the specific experts for specific parts of the system. But if enough people leave or die the technology is at genuine risk of being lost forever.

When will low-cost Chinese GPUs hit the market?

Posted by noellarkin@reddit | LocalLLaMA | View on Reddit | 97 comments

Down_The_Rabbithole@reddit

China has spent hundreds of billions and has been hiring ASML executives and engineers for 15 years now trying to get EUV. They haven't succeeded so far. Even the US has tried to get independent EUV and failed. EUV machines are the most complex machines humanity has ever built and no other entity is capable of reproducing them. Not even the US government. China isn't getting EUV.

When will low-cost Chinese GPUs hit the market?

Posted by noellarkin@reddit | LocalLLaMA | View on Reddit | 97 comments

Down_The_Rabbithole@reddit

China is *not* piloting 5nm. Their "5nm" node is a branch off of their existing 7nm. China doesn't have a grasp on EUV technology and they will not have it within the next decade. Their lithographic stack can't go much further than the current 7nm node they have so they can only optimize around the same mark they are at now. Meanwhile ASML is right now rolling out High-NA EUV which is the next generation. We will see the gap between China and the west widening more, not closing.

Wow anthropic and Google losing coding share bc of qwen 3 coder

Posted by Independent-Wind4462@reddit | LocalLLaMA | View on Reddit | 128 comments

Down_The_Rabbithole@reddit

Sonnet is actually better for coding. It's about equivalent in output but significantly faster so you can iterate quicker on whatever your workload is.

Wow anthropic and Google losing coding share bc of qwen 3 coder

Posted by Independent-Wind4462@reddit | LocalLLaMA | View on Reddit | 128 comments

Down_The_Rabbithole@reddit

This is true for me. I use claude at work through the official API while I experiment with OpenRouter at home to test new models for a while.

LocalLLaMA is the last sane place to discuss LLMs on this site, I swear

Posted by ForsookComparison@reddit | LocalLLaMA | View on Reddit | 213 comments

Down_The_Rabbithole@reddit

I don't even know what usecases remain after those.

ollama

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 340 comments

Down_The_Rabbithole@reddit

Ollama does a lot of shady stuff on the AI model trainer side as well. As part of the Google contest for finetuning Gemma 3n on Kaggle, Ollama would pay out an extra $10,000 if you packaged their inference stack into whatever solution you would win the prize with. They are throwing money at adoption and that's why everyone you hear talking about it online mentions Ollama (because they get shady deals or are paid to do so). It's literally just a llama.cpp fork that is buggier and doesn't work properly most of the time. It's also less convenient to use if you ask me. They just have money behind them to push it everywhere.

I'm disappointed with GPT-5

Posted by Dr_Karminski@reddit | LocalLLaMA | View on Reddit | 128 comments

Down_The_Rabbithole@reddit

Anthropic is compute constrained. They have repeated numerous times that they aren't interested in gaining market share or even people using their models. They put an arbitrarily high price on their tokens to try and limit demand as much as possible so that they have more compute for their training runs. Amodei has repeatedly said that he thinks Claude models should be accessible to the general public, which is why he provides a public API, but that he prefers no one uses Claude so that they can focus on training the next generation of models. Anthropic and OpenAI aren't playing the same game.

Elon Musk says that xAI will make Grok 2 open source next week

Posted by Nunki08@reddit | LocalLLaMA | View on Reddit | 214 comments

Down_The_Rabbithole@reddit

I'm not fine with *anyone* not open sourcing their models. There are tons of different ways to organize your business to be profitable while still open sourcing all your models as soon as possible.

Elon Musk says that xAI will make Grok 2 open source next week

Posted by Nunki08@reddit | LocalLLaMA | View on Reddit | 214 comments

Down_The_Rabbithole@reddit

Should try Claude 4 Opus for a change then.

Gemini 2.5 Deep Think mode benchmarks!

Posted by Beautiful-Essay1945@reddit | LocalLLaMA | View on Reddit | 72 comments

Down_The_Rabbithole@reddit

Claude

4B models are consistently overlooked. Runs Locally and Crushes It. Reasoning for UI, Mobile, Software and Frontend design.

Posted by smirkishere@reddit | LocalLLaMA | View on Reddit | 80 comments

Down_The_Rabbithole@reddit

The issue I have with smaller models like this is why ever use it? Just run the larger model slowly if you care for best possible output (which you should for professional usecases like generating UI)

One year’s benchmark progress: comparing Sonnet 3.5 with open weight 2025 non-thinking models

Posted by nomorebuttsplz@reddit | LocalLLaMA | View on Reddit | 36 comments

Down_The_Rabbithole@reddit

New Opus is superior to old Opus in creative writing, understanding nuance and understanding your inherent intent behind whatever your prompt is.

Introducing the world's most powerful model

Posted by eastwindtoday@reddit | LocalLLaMA | View on Reddit | 199 comments

Down_The_Rabbithole@reddit

It used to be coding, roleplaying *and* philosophical discussions. 4 seems to only be good at coding.

Llama 4 Benchmarks

Posted by Ravencloud007@reddit | LocalLLaMA | View on Reddit | 139 comments

Down_The_Rabbithole@reddit

Not a local model

Is it worth spending so much time and money on small LLMs?

Posted by ML-Future@reddit | LocalLLaMA | View on Reddit | 79 comments

Down_The_Rabbithole@reddit

Ironically the main reason LLMs are useful for fact retrieval is because search engines have gotten unusable over the years and google in particular is so bad that I completely stopped using it years ago. If Google was still as good as in the 2000s the usecase for LLM information retrieval would be gone. However now LLMs are absolutely the best way to get information, with search engines merely being the best way to double-check the source correctness.

Are there any LLMs with less than 1m parameters?

Posted by UselessSoftware@reddit | LocalLLaMA | View on Reddit | 73 comments

Down_The_Rabbithole@reddit

Really wonder what the absolute tiniest size is where models are still coherent as in sentences are at least tangentially related to each other. It's not this 260K model. What about 1M? 5M? 10M?
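
To get a feel for what those sizes look like as actual model shapes, a common rule of thumb for a GPT-style transformer is roughly 12 * n_layers * d_model^2 parameters in the blocks plus vocab_size * d_model for a tied embedding table. The sketch below uses that approximation (it ignores biases, norms and positional parameters), and the three configurations are hypothetical examples picked to land near the sizes mentioned, not existing models.

```python
def estimate_params(n_layers, d_model, vocab_size):
    """Rough parameter count for a GPT-style transformer with tied embeddings."""
    block_params = 12 * n_layers * d_model * d_model  # attention (4*d^2) + 4x-wide MLP (8*d^2)
    embedding_params = vocab_size * d_model
    return block_params + embedding_params


# Hypothetical tiny configurations: (n_layers, d_model, vocab_size)
configs = [
    (4, 128, 4096),
    (6, 224, 8192),
    (8, 288, 8192),
]

for n_layers, d_model, vocab_size in configs:
    total = estimate_params(n_layers, d_model, vocab_size)
    print(f"layers={n_layers} d_model={d_model} vocab={vocab_size}: ~{total / 1e6:.1f}M params")
```

At these sizes the embedding table is a large share of the total, so vocabulary size matters about as much as depth or width.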