What a time to be alive: from 1 tk/sec to 20-100 tk/sec for huge models
Posted by segmond@reddit | LocalLLaMA | View on Reddit | 72 comments
https://www.reddit.com/r/LocalLLaMA/comments/1eb6to7/llama_405b_q4_k_m_quantization_running_locally/
https://www.reddit.com/r/LocalLLaMA/comments/1ebbgkr/llama_31_405b_q5_k_m_running_on_amd_epyc_9374f/
Llama405b q4 at 1.2 tk/sec 2 years ago was something to be excited about.
That same hardware will now run HUGE state-of-the-art models (kimik2.6, deepseekv4flash, minimax2.7, step3.5flash, qwen3.5-397b) at 30-100 tk/sec while crushing llama405b. :-/
I recall folks asking why anyone would want to run Llama405b at 1.2 tk/sec, etc. My answer when folks asked me was that I wanted to be ready for when AGI arrived: if it meant being able to run my own super AI at 1 tk/sec, I wanted that option. It turned out better than I could have ever imagined; we do have super AGI, and we can run them cheap and fast.
Putting aside the huge models, for a few hundred $ you could run qwen3.6-36b at 50tk/sec at home. So to my fellow local llama nuts, stay crazy, keep experimenting, ignore the naysayers, all the "stupid", "waste of time" experiments are paying off.
Eyelbee@reddit
Wasn't that a dense model? The others are MoE, that's why you're able to run them fast. 405B would be just as slow today. If you mean the capability-wise jump, yeah, that's true.
rm-rf-rm@reddit
Yeah, it was. OP probably doesn't understand the difference between MOE and dense.
grumd@reddit
OP also said "we have super AGI" so uuuh
segmond@reddit (OP)
yup, I did, and what's your point? We can have this debate if you wish. These models are smarter than 99.9% of people on reddit; if that's not AGI, I don't know what is. I definitely call it super because I remember when the papers were evaluating cloud models as being around 110-120 IQ points. These local models are 150+ easily. Somehow AGI has now turned into autonomous AI and super has turned into singularity. Give me a break, folks.
splice42@reddit
I can confirm you don't know what AGI is.
segmond@reddit (OP)
I can confirm your stupidity.
cristoper@reddit
That's got to be the lowest bar for "AGI" I've ever seen
Nefilim314@reddit
I literally read someone say “bike lanes make cities more dangerous” yesterday and there’s no way you can convince me this wasn’t a human trained on slop.
cristoper@reddit
This video is too long, but it gives some insight into the source of a lot of that kind of thinking:
https://www.youtube.com/watch?v=pRPduRHBhHI
CorpusculantCortex@reddit
Claude Opus and GPT 5.5 (both objectively better than any open-source SOTA) both struggle with relatively simple tasks that require more than baseline attention, which means they operate below most people in day-to-day tasks. And that is with significant bespoke infrastructure supporting those operations, beyond just prompting or agentic tools.
The most basic consideration of AGI is the ability to autonomously manage tasks comparable to humans over time. The “general” part is not just breadth of knowledge, but the ability to plan, adapt, and self manage across many different kinds of work.
They are nowhere near consideration for AGI; there is absolutely no debate. They are very useful tools with diverse domain knowledge and decent reasoning for some short-form tasks. They are nowhere near autonomous in any trustworthy sense.
Eyelbee@reddit
What tasks are those?
Dabalam@reddit
I think it's a reasonable argument to say that current AI is very impressive, but it doesn't really meet the standard of artificial general intelligence. It's very useful for some tasks and is absolutely better than humans in a lot of areas. But its generalisability is still limited. Its planning usually still requires pretty tight guidance outside narrow contexts, and it isn't able to maintain abilities when executing tasks over long horizons.
That isn't to say that these issues are insurmountable. Advancements in embodiment, memory, agents, and reasoning may resolve many of them. But the reality is that current AI is a very helpful tool when applied to small-scale, well-structured tasks, usually with a competent human working alongside it. For large-scope, open-ended, and/or creative work without a human guide, you kind of get what most AI skeptics expect.
That is the profile of a helpful tool, not a general intelligence. I happen to think that even if we don't get to "general intelligence" it could still be massively useful for a lot of areas once we understand well the limitations and risks.
etaoin314@reddit
This feels a bit like goalpost moving (not you specifically, but the larger "we are not at AGI" camp). The original definition was to be convincing to most people in a conversation, and when AI easily passes the Turing test there is a lot of guffawing about narrow scope. When I talk to Claude Opus it gets the job done with much less context than an actual person would need.
ImpressiveSuperfluit@reddit
So does a calculator. That's where the 'G' comes in.
HiddenoO@reddit
A bunch of companies' lifelines depend on hyping up AI - do you believe they wouldn't have claimed AGI if they had gotten even close to it?
JazzlikeLeave5530@reddit
IQ isn't even a good measurement so you're already starting off with bad reasoning lol
threevi@reddit
These models are more knowledgeable than most people. That's a huge achievement, knowledge is very important and lets you perform all kinds of useful tasks. But they're not smarter than people. They lack common sense and make the simplest mistakes that a child could catch a lot of the time. Intelligence isn't a simple metric, there are many different components to it, and the whole point of the 'G' in AGI is to signify an AI has matched humans in all those metrics rather than just a few. That hasn't happened, and while new models keep getting closer at a very fast pace, as of right now, they're not that close yet.
banecroft@reddit
“If that’s not AGI, I don’t know what is”
Yes that much is evident.
Snoo_28140@reddit
"if that's not AGI, I don't know what it is." - oh boy... We got super AGI - fails to read simple diagrams. We got super AGI - can't mass replace workers. We got super AGI - can hardly even fold clothes.
"I don't know what it is." - indeed. Are you going to bring narrow examples and prove that even more?
Dany0@reddit
You think the models are smarter than 99% of reddit because they're smarter than you and you lack the cognitive capability to comprehend how smart 99% of reddit is
SIMMORSAL@reddit
No he didn't!
Low-Boysenberry1173@reddit
It doesn't matter which technology itself is the reason. The attention mechanism also evolved a lot in that time, but that doesn't matter either. OP is just amazed by the progression itself.
segmond@reddit (OP)
duh. My point is that in 2 years we have gone from 1 tk/sec to orders of magnitude more on the same hardware. Obviously the model architecture has changed for that to happen; it being a dense model is completely irrelevant. The point, again, is that with the same hardware you can infer much faster and with much better quality. I don't see the need to spell this out.
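The back-of-envelope arithmetic behind this point can be sketched as follows. All numbers here (quant width, bandwidth, active parameter counts) are illustrative assumptions, not benchmarks:

```python
# Back-of-envelope decode speed: token generation is memory-bandwidth
# bound, so tokens/sec is roughly bandwidth / bytes read per token.

def tokens_per_sec(active_params_b: float, bits_per_weight: float,
                   bandwidth_gbs: float) -> float:
    """Upper bound assuming every active weight is read once per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gbs * 1e9 / bytes_per_token

# Dense 405B at a ~4.5-bit quant on a ~250 GB/s memory system:
print(f"dense: {tokens_per_sec(405, 4.5, 250):.1f} tok/s")   # ~1 tok/s

# MoE with ~30B active parameters, same quant, same hardware:
print(f"moe:   {tokens_per_sec(30, 4.5, 250):.1f} tok/s")    # ~15 tok/s
```

Same memory bandwidth, roughly an order of magnitude fewer bytes touched per token: that is the whole trick behind "same hardware, way faster".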
Automatic-Arm8153@reddit
Fair point honestly, I think you just communicated it badly.
But you do have a point, completely agree with you.
Not sure what you got downvoted for here though
HiddenoO@reddit
Because he's moving goal posts at this point. Nobody denies that models have improved significantly, but that doesn't mean his claims in the OP are accurate.
segmond@reddit (OP)
lol, it's localllama, votes are given by vibes not substance.
emprahsFury@reddit
You're acting like the pivot to MoE hasn't been a crucial development between Llama 3.1 and Kimi 2.6 today. The models of today are faster and smarter than the models of yesterday. Yeah, things happened in between, thanks Sherlock.
HiddenoO@reddit
The point is that OP is comparing apples to oranges. MoE models are generally significantly less capable than non-MoE models of the same total size.
10minOfNamingMyAcc@reddit
Running the MoE on DDR4 would still be 1tok/s lol
Yes_but_I_think@reddit
If another tech comes on top of MoE, we are living in truly magical times.
yaosio@reddit
Imagine MOE but you don't need to load the entire model into VRAM.
123vovochen@reddit
There also is MTP now...
IrisColt@reddit
I will say only one thing: Llama 3.1 405B was soooo knowledgeable, and still relevant.
segmond@reddit (OP)
absolutely, the huge dense models are great knowledge wise. I just said the new models are great at a fraction of the compute. 😃
IrisColt@reddit
I agree with you too.
Wwavinghello@reddit
“Putting aside the huge models, for a few hundred $ you could run qwen3.6-36b at 50tk/sec at home.”
-what hw would this be? Seems like a 3090 is running $900-$1000 used these days and less than 24GB won’t cut it. Am I missing something?
Numerous-Annual420@reddit
The 5060 ti with 16 gb handles qwen3.6-35b nicely with turbo3 or 4 in the cache. Tends to be around 30t/s. I paid $550 for it in January.
Daemonentreiber@reddit
What context size?
30t/s seems a bit low.
Prestigious-Chair282@reddit
Offloading to cpu maybe?
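A rough size estimate shows why offloading comes up here. The ~4.5 bits/weight figure and the ~1.1 overhead factor (embeddings and output layers kept at higher precision) are assumptions, not measured numbers; real GGUF files vary by quant mix:

```python
# Approximate in-memory size of quantized model weights.

def weights_gb(params_b: float, bits_per_weight: float,
               overhead: float = 1.1) -> float:
    return params_b * bits_per_weight / 8 * overhead

# A 36B dense model at a ~4.5-bit (Q4_K_M-class) quant:
size = weights_gb(36, 4.5)
print(f"{size:.1f} GB")  # more than 16 GB, before any KV cache
```

That overflow is why a 16 GB card ends up offloading some layers to CPU, and why token speed lands below what the GPU alone could deliver.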
Potential-Gold5298@reddit
*Shakes your hand* Gemma 4 31B Q5_K_M - 0.9 t/s. My home AGI. An hour to get an answer. You know this pain.
tvetus@reddit
Is your rig a cell phone?
Potential-Gold5298@reddit
15 year old PC.
Borkato@reddit
0.9T/s? It runs at 25 T/s eval and 880 T/s pp on my 3090 + 2060
Borkato@reddit
Omg no I’m not trying to brag, I’m just confused, I thought you were saying you had a similar setup
Potential-Gold5298@reddit
Everything's fine. No – I run models on CPU + DDR3 (without GPU), so MoE is essentially the only viable option for me. Dense models like 31B I keep for especially complex tasks.
Potential-Gold5298@reddit
Good for you)
Snoo_28140@reddit
Yeah, I wanted to compare the quality vs 35b, with the actual quant I would use, on my own workflows. Reasoning on. Omfg... did it take the entire afternoon for some 30k tokens output!
Anduin1357@reddit
It's so dumb to future-proof for AGI in advance. Like, Tesla went through several hardware revisions for FSD, and each time they thought they finally had the hardware to reach full autonomy.
You can't run AGI on current machines. If it happens, you can only run AGI on period-appropriate hardware that might only be available AFTER AGI is finally achieved but is impractical to run.
Think about it. If current cloud hardware hosting B200s from Nvidia - with distributed computing - isn't running AGI, nothing you can buy as a consumer will.
IrisColt@reddit
Exactly, add to that that the advent of AGI will redefine hardware in itself.
FullOf_Bad_Ideas@reddit
I run llama 405b at around 90 t/s PP and 11 t/s TG
Qwen 3.5 397B runs at 600 t/s PP and 30 t/s TG on the same rig.
No MTP or draft model on any configuration.
The gap in speed is big, but maybe not as big as I'd have expected.
Qwen is better for coding but has way worse Polish language proficiency than llama 405B IMO. It's definitely not a better model in all dimensions, only in some like software engineering and agentic tasks. I think the focus on agentic tasks alone makes it easier to say that new 27B models are better than old 405B models, otherwise you'd see that the improvement isn't quite as drastic. Older big models were simply trained for different tasks, and they did those tasks better than new small models. For knowledge retrieval or multilinguality, old dense models can be better since they weren't overtrained on agentic coding traces as much, so the knowledge in them didn't erode the same way.
IrisColt@reddit
Absolutely this.
Synor@reddit
7x4090
fallingdowndizzyvr@reddit
Dense versus moe. Apples versus oranges.
Silver-Champion-4846@reddit
Apples vs. oranges: still fruits. Dense vs. MoE: still LLMs.
fallingdowndizzyvr@reddit
Pedal car vs Ferrari: still cars.
Silver-Champion-4846@reddit
Sure sure. Of course! Naturally!
philmarcracken@reddit
grandpa models: back in my day we used to walk to get the car washed!
Silver-Champion-4846@reddit
The carwash test is stupid, just like the strawberry test. Just benchmaxed or not
Ardalok@reddit
ARC-AGI-3 be like: I'm about to end this man's whole career.
LeftHandedToe@reddit
Uhh...
Mundane_Ad8936@reddit
It's Doom running on a toaster... nothing to get excited about. A proof of concept, but it won't hold up to the most basic usage.
segmond@reddit (OP)
"basic usage" - everything doesn't have to be agentic. There are lots of useful ways to get more out of an LLM without agents, and even if one has to use an agentic loop, the current popular methods are wildly inefficient and not far off from brute forcing.
Silver-Champion-4846@reddit
What do you suggest?
droning-on@reddit
Uhm.
The "from" in your scenario is different for those a little older than 3.
From: cordless land lines being a huge invention, to what we have now. My phone can plan a vacation.
:)
UncleRedz@reddit
The shift from dense models to MoE isn't the only huge boost for self-hosting; architecture changes to attention are making a big difference as well: hybrid Mamba, DSA, and once DeepSeek V4's architecture innovations trickle into other labs, even better. On my rig I was mostly capped at 24-32K context length; past that, things got way too slow for practical use, if it was possible to run at all. With Qwen 3.5/3.6 and Nemotron 3 Nano 30B, and to some extent Gemma 4 as well, that has changed to 64K-128K of usable context. That makes a huge difference in how you run things locally. I know Mamba has been worked on for many years, but it's still incredible to see how fast models are evolving each year.
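The context-length ceiling mostly comes down to KV-cache growth, which standard attention pays linearly per token and which hybrid/linear-attention layers largely avoid. A sketch with a hypothetical model shape (the layer and head counts below are assumptions, not any real model's config):

```python
# KV-cache size for standard attention: one K and one V tensor per
# layer, at every cached position, fp16 elements assumed.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    # factor of 2 = keys plus values
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

# e.g. 48 layers, 8 GQA kv-heads of dim 128:
for ctx in (32_000, 128_000):
    print(f"{ctx:>7} ctx -> {kv_cache_gb(48, 8, 128, ctx):.1f} GB")
```

Under these assumptions the cache at 128K context outgrows a 24 GB card by itself, which lines up with being capped around 24-32K before these architecture changes.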
segmond@reddit (OP)
yeah. I remember the first time I did a 1 million token context with qwen2.5-14b: it fit, but was garbage in terms of coherence past 32k. Now I run deepseekV4flash with the full 1 million context and still have space to go, performance on point, etc.
popiazaza@reddit
If you ignore all the negatives then sure.
pj-frey@reddit
And llama was less than TWO years ago!
WillingMost7@reddit
Awesome! Very inspiring.
GsxrGuy80s@reddit
It is a day of days!