1000 tokens/s, it's blazing fast!!!
Posted by Anxious_Basil8446@reddit | LocalLLaMA | View on Reddit | 66 comments
Working-Leader-2532@reddit
Let's see if this is actually a better model or not.
I will be testing out on a speech-to-text model, and the LLM enhancement.
Well, speed-wise, it is quite fast. I mean, absolutely fast, but I want to know whether it can follow a properly constructed LLM prompt to optimize and improve the speech-to-text dictation transcript.
Ambitious-Art-604@reddit
could be a new mercury model?
khudgins@reddit
If it's a state-space model (or mostly so; most models using that tech have mixed layers), its attention calculation is linear, not quadratic, so you get huge performance gains for inference.
LiquidAI's 24B MoE model runs locally for me at over 200 tokens/sec on a Mac Studio with vLLM. On production-grade hardware, it wouldn't surprise me if a really efficient model using a state-space architecture gets that fast.
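To make the linear-vs-quadratic point concrete, here is a toy pure-Python sketch (hypothetical shapes, not any real model's code): attention-style decoding does work proportional to the length of the KV cache for every new token, while a state-space-style update touches only a fixed-size state.

```python
import math

d = 4  # toy hidden/state size

def attention_step(q, k_cache, v_cache):
    # Dot-product attention over the whole cache:
    # O(len(cache) * d) work per generated token,
    # so total decode cost grows quadratically.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_cache]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [wi / z for wi in w]
    return [sum(wi * v[j] for wi, v in zip(w, v_cache)) for j in range(d)]

def ssm_step(x, state, a, b):
    # Recurrent state update: the whole history is compressed into
    # a fixed-size state, so per-token work is O(d), independent of
    # how many tokens came before.
    return [a * s + b * xi for s, xi in zip(state, x)]
```

With a cache of one entry, `attention_step` reduces to returning that entry's value vector; the point is only that its cost grows with the cache while `ssm_step`'s does not.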
repair_and_privacy@reddit
But does this work reliably??? On the quality side
khudgins@reddit
It depends on the model's training and size. Nemotron is also state space and is quite good. (If you see Mamba in a model card, it's state space.) Transformers are only one way to build a model. It wouldn't surprise me if Meta's new model is similarly architected.
Fluxing_Capacitor@reddit
Nemotron is actually a hybrid: attention and state space. I think the Mamba team has released some proof-of-concept models, but AFAIK there's no pure state-space LLM.
jsonmona@reddit
Not state space, but there is a pure RNN LLM called RWKV.
Tall-Ad-7742@reddit
Well Intelligence and Speed are not the same thing...
DelKarasique@reddit
Proven by nemotron
prestodigitarium@reddit
How is nemotron in practice? I’ve been meaning to try out the nvfp4 120b.
philmarcracken@reddit
It's like my half sister: incredibly dumb at high speed.
prestodigitarium@reddit
Ha damn, I was optimistic about running nemoclaw with it, seemed to bench well.
oodelay@reddit
TAKES ONE TO KNOW ONE
still got it
Alternative_You3585@reddit
Rarely, speed might lead to better results: the model can try so many more things that it simply brute-forces a task that a more intelligent model might need more time for.
Existing-Wallaby-444@reddit
"Hmm. Deleting this table didn't work. Let me try this one."
Alternative_You3585@reddit
If you approve a model's bash command, that's on you. Moreover, letting any model loose on real infrastructure is incredibly stupid.
rpkarma@reddit
True but my coworkers are and my EM is on my case about not going fast enough :D
skrshawk@reddit
I facepalm at least once a week when someone asks me to validate an LLM-generated script. Especially when I know it won't work because the command syntax was hallucinated.
seamonn@reddit
Instructions unclear. Yeeted Production Database and all Backups.
Certain_Pick3278@reddit
The user is already the bottleneck when it comes to speed, except now I have to cancel many more stupid requests... So no, I'd rather wait 5 more minutes and get a reasonable approach I can approve than have to decline hundreds.
Also, having a lot of bad ideas erodes trust, meaning now I have to check the AI output even more and on a deeper level.
Existing-Wallaby-444@reddit
You think so?
akza07@reddit
If it's a reasoning model, then maybe...
seamonn@reddit
Good thing my models are bad at both. hahaha...
*Cries in GPU Poor*
shaolinmaru@reddit
The Captain America math meme
BalorNG@reddit
Well, for a model that needs to operate in the real world and react in real time, a bit dumber but much faster model will be much preferable... like our "System 1" intelligence... or not just preferable, but outright viable vs. unviable, if it's just good enough. That's a pretty big if, however!
Tall-Ad-7742@reddit
That is true and I never doubted it. My comment was just a statement.
letsgoiowa@reddit
In my quick and dirty testing it was way worse than Qwen 3.5 4B. Fast, yes, but really useless. Couldn't web search, couldn't figure out how to make a 3x5 table.
zenmagnets@reddit
One of the dumbest models y'all tricked me into testing this month
letsgoiowa@reddit
For real it can't even figure out how to make a simple 3x5 table lol.
kyr0x0@reddit
But it's fast!!!1!! 🤣😆
Nicking0413@reddit
Can somebody explain to me how it’s supposedly faster than other 100B models? Or is it just a marketing thing
Alternative_You3585@reddit
A company once reached 17k tokens/s by specifically engineering a chip designed just for inference on a Llama 8B model. I'd assume something similar here.
Nicking0413@reddit
Oh wait, so hardware acceleration? So is OpenRouter designing chips, or was the model built around one of the existing chips? Or is it something else?
WhatTheFlukz@reddit
You can look at it on chatjimmy.ai, it's made by Taalas.
CallumCarmicheal@reddit
Holy shit, that is fast.
WhatTheFlukz@reddit
lol yeah it's pretty cool. It's completely inflexible though: once the weights are baked in, they're baked in. I think it would be interesting to see if they could come up with a hybrid approach with a small layer on top, with a little bit of VRAM, so you can tune/steer it.
ForeverInYou@reddit
This is not an OpenRouter model; OpenRouter only, well, routes LLMs.
Nicking0413@reddit
Oh. So the company/server they routed to had special chips to run the model? And they’re posting it because it’s like a collaboration thing?
ForeverInYou@reddit
Yeah, the company is developing a new model and needs people to test it to collect data, so they put it on OpenRouter for easy access for a lot of people.
Nicking0413@reddit
Ah ok thanks
z_latent@reddit
Yes, Taalas is the company. But this model isn't like that. If they had actually made such a chip, you can be sure they'd let it run at 10k or so tokens/s, for marketing.
A Groq or Cerebras chip would be more reasonable at these speeds. I personally still think it's architectural though (like diffusion or MTP)
-dysangel-@reddit
It might be a diffusion model, or quantised to 1 bit (bonsai?)
DeepV@reddit
1) The model's architecture matters as much as the number of params.
2) It will depend on the hardware, and on how specialized the hardware needs to be to get that speed.
3) We don’t know the benchmarks yet
MagicZhang@reddit
There’s a chance fair it’s a diffusion model, similar to what Google Deepmind have been doing (their experimental Gemini Diffusion can reach 200+ tokens/sec)
tracagnotto@reddit
Spoiler: It's shit.
If you give it a minion-level short task it can be decent. Anything else and it gets lost in hundreds of loops and comes out with nothing.
Don't know what this crap is, but they can keep it for themselves.
I couldn't get it to clone and install a simple repo.
ongrabbits@reddit
Gave it the car wash test.
Response
tracagnotto@reddit
Tests like this prove nothing except that the current architecture of every LLM is weak. My tests were asking it to clone and install useful AI repos, configure them on its own, and leave me with the whole thing up and running. But it got tangled in its own loops, hitting my generous limit of 120 actions per task every time.
Auralore@reddit
So you're saying it's Opus level?
llkj11@reddit
It had better be, for a non-reasoner.
holygawdinheaven@reddit
Might be a diffusion LLM at that speed. LLaDA is that fast.
federico_84@reddit
Or it could be served from Cerebras or Grok; SRAM-only inference servers are blazing fast.
redballooon@reddit
Groq has slowed down lately to between 350 and 500 t/s depending on the model.
z_latent@reddit
Groq*
(and they came before xAI btw)
federico_84@reddit
Indeed, thanks for catching that 😉
Traditional-Gap-3313@reddit
On my classification benchmark the model is the worst of all models tested. I ran Qwen 3.5 9B just to be able to compare this model to it. 65% false positive rate.
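For reference, the false-positive rate quoted above is just FP / (FP + TN). A minimal sketch on made-up labels (not the commenter's actual benchmark data):

```python
def false_positive_rate(y_true, y_pred):
    # FP: true label is negative (0) but the model predicted positive (1).
    # TN: true label is negative and the model correctly predicted negative.
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn)
```

A 65% FPR means the model flags roughly two out of every three genuinely negative examples as positive, which is why it ranks last on that benchmark.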
grencez@reddit
It's pretty dumb in my tests of simple game logic (e.g., tic-tac-toe), but it doesn't make formatting mistakes like I'd expect from a model with such a logic deficiency. So, like the model description says, maybe it's good at simple but token-heavy tasks?
dergachoff@reddit
Tried it with Russian and it's hideous, one of the worst models I've tested with this language (worse than 9B/4B models).
eXl5eQ@reddit
Where does the number come from? The OpenRouter model page shows ~100 t/s throughput.
madezaen@reddit
There's zero difference between 100 and 1000 😉
yarikfanarik@reddit
100B? Feels like less.
coder543@reddit
Why do people keep saying 1000 tokens per second? It is 63 tokens per second: https://openrouter.ai/openrouter/elephant-alpha
VonDenBerg@reddit
OR always throttles speeds
coder543@reddit
Why? That is a very confusing thing to do.
ELPascalito@reddit
They are not throttling; the load is simply too big, so it needs to be balanced. The model dropped yesterday yet it has already processed more than 50 billion tokens lol
VonDenBerg@reddit
Buy low, sell high.
Alternative_You3585@reddit
Try it in chat, I reached 700 t/s