1000 tokens/s, it's blazing fast!!!
Posted by Anxious_Basil8446@reddit | LocalLLaMA | View on Reddit | 66 comments
Working-Leader-2532@reddit
Let's see if this is actually a better model or not.
I will be testing out on a speech-to-text model, and the LLM enhancement.
Well, speed-wise, it is quite fast. I mean, absolutely fast, but I want to know whether it can follow a properly constructed LLM prompt to optimize and improve the speech-to-text dictation transcript.
Ambitious-Art-604@reddit
could be a new mercury model?
khudgins@reddit
If it's a state-space model (or mostly so; most models using that tech have mixed layers), its attention calculation is linear, not quadratic, so you get huge performance gains for inference.
LiquidAI's 24B MoE model runs locally for me at over 200 tokens/sec on a Mac Studio with vLLM. On production-grade hardware, it wouldn't surprise me if a really efficient model using a state-space architecture gets that fast.
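To make the linear-vs-quadratic point concrete, here is a toy pure-Python sketch (hypothetical shapes, not any real model's code): attention-style decoding does work proportional to the length of the KV cache for every new token, while a state-space-style update touches only a fixed-size state.

```python
import math

d = 4  # toy hidden/state size

def attention_step(q, k_cache, v_cache):
    # Dot-product attention over the whole cache:
    # O(len(cache) * d) work per generated token,
    # so total decode cost grows quadratically.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_cache]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [wi / z for wi in w]
    return [sum(wi * v[j] for wi, v in zip(w, v_cache)) for j in range(d)]

def ssm_step(x, state, a, b):
    # Recurrent state update: the whole history is compressed into
    # a fixed-size state, so per-token work is O(d), independent of
    # how many tokens came before.
    return [a * s + b * xi for s, xi in zip(state, x)]
```

With a cache of one entry, `attention_step` reduces to returning that entry's value vector; the point is only that its cost grows with the cache while `ssm_step`'s does not.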
repair_and_privacy@reddit
But does this work reliably??? On the quality side
khudgins@reddit
It depends on the model's training and size. Nemotron is also state space and is quite good. (If you see Mamba in a model card, it's state space.) Transformers are only one way to build a model. It wouldn't surprise me if Meta's new model is similarly architected.
Fluxing_Capacitor@reddit
Nemotron is actually a hybrid: attention and state space. I think the Mamba team has released some proof-of-concept models, but AFAIK there's no pure state-space LLM.
jsonmona@reddit
Not state space, but there is a pure RNN LLM called RWKV.
Tall-Ad-7742@reddit
Well Intelligence and Speed are not the same thing...
DelKarasique@reddit
Proven by nemotron
prestodigitarium@reddit
How is nemotron in practice? I’ve been meaning to try out the nvfp4 120b.
philmarcracken@reddit
It's like my half sister: incredibly dumb at high speed.
prestodigitarium@reddit
Ha damn, I was optimistic about running nemoclaw with it, seemed to bench well.
oodelay@reddit
TAKES ONE TO KNOW ONE
still got it
Alternative_You3585@reddit
Rarely, speed might lead to better results: the model can try so many more things that it simply brute-forces a task that a more intelligent model might need more time for.
Existing-Wallaby-444@reddit
"Hmm. Deleting this table didn't work. Let me try this one."
Alternative_You3585@reddit
If you approve a model's bash command, that's on you. Moreover, letting any model loose on real infrastructure is incredibly stupid.
rpkarma@reddit
True but my coworkers are and my EM is on my case about not going fast enough :D
skrshawk@reddit
I facepalm at least once a week when someone asks me to validate an LLM-generated script. Especially when I know it won't work because the command syntax was hallucinated.
seamonn@reddit
Instructions unclear. Yeeted Production Database and all Backups.
Certain_Pick3278@reddit
The user is already the bottleneck when it comes to speed, except now I have to cancel many more stupid requests... So no, I'd rather wait 5 more minutes and get a reasonable approach I can approve than have to decline hundreds.
Also, having a lot of bad ideas erodes trust, meaning now I have to check the AI output even more and on a deeper level.
Existing-Wallaby-444@reddit
You think so?
akza07@reddit
If it's a reasoning model, then maybe...
seamonn@reddit
Good thing my models are bad at both. hahaha...
*Cries in GPU Poor*
shaolinmaru@reddit
The Captain America math meme
BalorNG@reddit
Well, for a model that needs to operate in the real world and react in real time, a bit dumber but much faster model will be much preferable... like our "System 1" intelligence... or not just preferable, but outright viable vs. unviable, if it's just good enough. That's a pretty big if, however!
Tall-Ad-7742@reddit
That is true and I never doubted it. My comment was just a statement.
letsgoiowa@reddit
In my quick and dirty testing it was way worse than Qwen 3.5 4B. Fast, yes, but really useless. Couldn't web search, couldn't figure out how to make a 3x5 table.
zenmagnets@reddit
One of the dumbest models y'all tricked me into testing this month
letsgoiowa@reddit
For real it can't even figure out how to make a simple 3x5 table lol.
kyr0x0@reddit
But it's fast!!!1!! 🤣😆
Nicking0413@reddit
Can somebody explain to me how it’s supposedly faster than other 100B models? Or is it just a marketing thing
Alternative_You3585@reddit
A company once reached 17k tokens/s by specifically engineering a chip designed just for inference on a Llama 8B model. I'd assume something similar here.
Nicking0413@reddit
Oh wait, so hardware acceleration? So is OpenRouter designing chips, or was the model built around one of the existing chips? Or is it something else?
WhatTheFlukz@reddit
You can look at it on chatjimmy.ai, it's made by Taalas.
CallumCarmicheal@reddit
Holy shit, that is fast.
WhatTheFlukz@reddit
lol yeah it's pretty cool. It's completely inflexible though: once the weights are baked in, they're baked in. I think it would be interesting to see if they could come up with a hybrid approach with a small layer on top, with a little bit of VRAM, so you can tune/steer it.
ForeverInYou@reddit
This is not an OpenRouter model; OpenRouter only, well, routes LLMs.
Nicking0413@reddit
Oh. So the company/server they routed to had special chips to run the model? And they’re posting it because it’s like a collaboration thing?
ForeverInYou@reddit
Yeah, the company is developing a new model and needs people to test it to collect data, so they put it on OpenRouter for easy access for a lot of people.
Nicking0413@reddit
Ah ok thanks
z_latent@reddit
Yes, Taalas is the company. But this model isn't like that. If they had actually made such a chip, you can be sure they'd let it run at 10k or so tokens/s, for marketing.
A Groq or Cerebras chip would be more reasonable at these speeds. I personally still think it's architectural though (like diffusion or MTP)
-dysangel-@reddit
It might be a diffusion model, or quantised to 1 bit (bonsai?)
DeepV@reddit
1) The model's architecture matters as much as the number of params.
2) It will depend on the hardware, and on how specialized the hardware needs to be to get that speed.
3) We don’t know the benchmarks yet
MagicZhang@reddit
There’s a chance fair it’s a diffusion model, similar to what Google Deepmind have been doing (their experimental Gemini Diffusion can reach 200+ tokens/sec)
tracagnotto@reddit
Spoiler: It's shit.
If you give it a minion-level short task it can be decent. Anything else and it gets lost in hundreds of loops and comes out with nothing.
Don't know what this crap is, but they can keep it for themselves.
I couldn't get it to clone and install a simple repo.
ongrabbits@reddit
Gave it the car wash test.
Response
tracagnotto@reddit
Tests like this prove nothing except that the current architecture of every LLM is weak. My tests were asking it to clone and install useful AI repos, configure them on its own, and leave me with the whole thing up and running. But it got tangled in its own loops, hitting my generous limit of 120 actions per task every time.
Auralore@reddit
So you're saying it's Opus level?
llkj11@reddit
It had better be, for a non-reasoner.
holygawdinheaven@reddit
Might be a diffusion LLM at that speed. LLaDA is that fast.
federico_84@reddit
Or it could be served from Cerebras or Grok; SRAM-only inference servers are blazing fast.
redballooon@reddit
Groq has slowed down lately to between 350 and 500 t/s depending on the model.
z_latent@reddit
Groq*
(and they came before xAI btw)
federico_84@reddit
Indeed, thanks for catching that 😉
Traditional-Gap-3313@reddit
On my classification benchmark the model is the worst of all models tested. I ran Qwen 3.5 9B just to be able to compare this model to it. 65% false positive rate.
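For reference, the false-positive rate quoted above is just FP / (FP + TN). A minimal sketch on made-up labels (not the commenter's actual benchmark data):

```python
def false_positive_rate(y_true, y_pred):
    # FP: true label is negative (0) but the model predicted positive (1).
    # TN: true label is negative and the model correctly predicted negative.
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn)
```

A 65% FPR means the model flags roughly two out of every three genuinely negative examples as positive, which is why it ranks last on that benchmark.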
grencez@reddit
It's pretty dumb in my tests of simple game logic (e.g., tic-tac-toe), but it doesn't make formatting mistakes like I'd expect from a model with such a logic deficiency. So, like the model description says, maybe it's good at simple but token-heavy tasks?
dergachoff@reddit
Tried it with Russian and it's hideous, one of the worst models I've tested with this language (worse than 9B/4B models).
eXl5eQ@reddit
Where does the number come from? The OpenRouter model page shows ~100 t/s throughput.
madezaen@reddit
There's zero difference between 100 and 1000 😉
yarikfanarik@reddit
100B? Feels like less.
coder543@reddit
Why do people keep saying 1000 tokens per second? It is 63 tokens per second: https://openrouter.ai/openrouter/elephant-alpha
VonDenBerg@reddit
OR always throttles speeds
coder543@reddit
Why? That is a very confusing thing to do.
ELPascalito@reddit
They are not throttling; the load is simply too big, so it needs to be balanced. The model dropped yesterday yet it has already processed more than 50 billion tokens lol
VonDenBerg@reddit
Buy low, sell high.
Alternative_You3585@reddit
Try it in chat, I reached 700 t/s