Spectral-AI - a project that uses Nvidia RT cores to dramatically speed up MoE inference on Nvidia GPUs (Crazy Fast!)
Posted by Thrumpwart@reddit | LocalLLaMA | View on Reddit | 25 comments
indigos661@reddit
looks like another innovation by claude
Thrumpwart@reddit (OP)
Probably. Is that bad though? If it works, I twerks.
qfox337@reddit
The whole point is that it doesn't work though. As mentioned in the other reply, someone's random research model just wasn't sensitive to massive dimensionality reduction in the expert router, which usually isn't the case for real models.
If you're gonna use AI to explore, cool whatever, just like actually check the results before wasting everyone's time.
(Of course, very generally, using other cores does sound interesting, but I'd also be curious if TDP throttling makes it a wash)
Thrumpwart@reddit (OP)
I don’t know that it doesn’t work. The GitHub author got it working. They provided a repo, some metrics, and explained how they did it. If you’re unhappy that it didn’t work for you, I’m just going to have to learn to live with that.
It’s an innovative technique that could be really beneficial to the local LLM community. If it needs additional work, you can say that. I don’t quite get why this is being met with hostility though.
I suspect part of it is resentment from purists who study this stuff and develop it academically or professionally. There’s a boatload of people like the GitHub author and myself who have no idea what we’re doing, but want to try new things.
I like to share new things that seem interesting. This is new and seems interesting. I really don’t care if the dude is using an LLM to translate or draft his posts - it’s not about them. It’s about the tech and how it could help a bunch of people access these incredible tools.
I’m sorry if you think I wasted your time. I hope by posting this I inspired a few hundred people to check out the repo and maybe someone will fix it, or adapt it, or improve upon it.
I hadn’t considered TDP limits but that does seem like a worthwhile consideration. I’ve never played with RT cores for gaming or anything else, so I’m curious now too. I hope 100 more people are curious about it now too.
DerDave@reddit
Cool idea to use unused hardware. I have some feedback and a question:
1) This seems to accelerate the MoE expert routing but has no influence on the speed or memory usage of the actual inference within the experts. So your memory savings and speed improvements only refer to a small part of the actual processing time + memory needs of the entire model. Would be less misleading to show the full picture.
2) You seem to be a solo researcher and I respect that but why do you always say "We"? I find it pretty odd when people refer to themselves + their AI, like they are a group of researchers. That also has slightly misleading vibes.
3) Lastly about the hierarchy and dimensions - why is it not truly hierarchical? With four layers and three hardware-accelerated dimensions you could have 3x3x3x3=81 dimensions instead of just 3+3+3+3=12. I think you would need 1x3x3x3=27 precomputed PCAs but that effort should be worth the gained higher dimensionality and expressiveness. In theory each token would have to go through 27 BVH traversals but given how fast they are, that shouldn't hurt, right? You could even add another level and gain a dimensionality of 243. As a further optimization you could selectively only continue tokens into later-stage BVH traversal with a high value and find a cutoff to spare the other less promising branches. Or did I completely misunderstand something here?
Critical-Chef9211@reddit
Good feedback, taking it point by point:
You’re right — routing is ~3% of total inference today. The memory savings (731×) apply to the router, not the expert weights. The README has a table showing the full inference breakdown (MLPs 63%, attention 20%, routing 2.8%). The argument for why it matters is scaling: at 1K–10K experts the O(N) gate dominates; BVH stays flat. But fair to say that’s speculative at current model sizes.
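To make the scaling argument above concrete, here is a rough back-of-envelope cost model in Python. This is a sketch, not code from the repo: the dimension sizes (`d_model`, `d_proj`) and branching factor are illustrative assumptions, and it only counts dot-product work, ignoring softmax and memory traffic.

```python
import math

# Dense softmax gating: one dot product of length d_model per expert,
# per token, so cost grows linearly in the number of experts.
def dense_gate_flops(n_experts: int, d_model: int = 4096) -> int:
    return n_experts * d_model

# A balanced tree/BVH-style router: compare against `branching` children
# at each of ~log_b(N) levels, each comparison in a small projected space.
def tree_gate_flops(n_experts: int, branching: int = 3, d_proj: int = 3) -> int:
    depth = math.ceil(math.log(n_experts, branching))
    return depth * branching * d_proj

for n in (64, 1024, 10_000):
    print(n, dense_gate_flops(n), tree_gate_flops(n))
```

The gap widens with expert count, which is the heart of the "O(N) gate dominates at 1K–10K experts" argument; whether it matters in practice depends on how small routing already is relative to the expert MLPs.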
Fair point — “we” is just an academic writing habit, not a reference to AI. Solo researcher, no co-authors. I’ll switch to “I”.
This is actually a genuinely better design than what I implemented. You’re right that 3^4 = 81 effective dimensions is achievable with branch-specific PCAs (27 precomputed projections at the leaf level), versus the single global PCA I used. The pruning idea — only continuing high-confidence branches — is essentially what the confidence-gated routing does, but your version would make it truly hierarchical. The main open question is whether branch-specific PCAs trained on enough data per region stay stable, since lower branches see fewer tokens. But the idea is sound and would be a real improvement. Worth trying.
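The branch-specific-PCA idea being discussed can be sketched in a few lines of Python. This is a toy illustration under assumed shapes, not the repo's implementation: the projections here are random matrices standing in for trained PCAs, the centroids are random stand-ins for learned region centers, and `d_model`, `branching`, and `depth` are hypothetical parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, branching, depth = 64, 3, 4  # 3^4 = 81 leaves (experts)
d_proj = 3                            # hardware-accelerated dims per level

# One projection + one set of child centroids per internal node,
# keyed by the branch path taken so far (this is what makes the
# PCAs branch-specific rather than one global PCA).
projections = {}  # path tuple -> (d_model, d_proj) matrix
centroids = {}    # path tuple -> (branching, d_proj) child centers

def build(path=()):
    if len(path) == depth:
        return
    projections[path] = rng.standard_normal((d_model, d_proj))
    centroids[path] = rng.standard_normal((branching, d_proj))
    for b in range(branching):
        build(path + (b,))

def route(token):
    """Walk the tree, projecting with the PCA of the branch taken so far."""
    path = ()
    for _ in range(depth):
        z = token @ projections[path]
        child = int(np.argmin(np.linalg.norm(centroids[path] - z, axis=1)))
        path = path + (child,)
    # Leaf index in [0, branching**depth) is the expert id.
    return sum(b * branching**i for i, b in enumerate(reversed(path)))

build()
expert = route(rng.standard_normal(d_model))
```

Note that counting every internal node gives 1 + 3 + 9 + 27 = 40 projections in total, of which 27 sit at the last level; the stability concern raised above shows up here as the deepest `centroids` entries each seeing only a small fraction of the training tokens.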
cunasmoker69420@reddit
That's pretty cool but what are your thoughts on banana pudding? Do you have a recipe you could share?
Thrumpwart@reddit (OP)
Banana pudding sounds good.
Critical-Chef9211@reddit
There’s a human here, yes. My English is pretty bad so I use AI to help write clearly — that’s probably why it reads that way. The research and code are mine though. And re: banana pudding — no recipe, but now I want some.
cunasmoker69420@reddit
whatever you say robot man
DerDave@reddit
Thanks for taking the feedback professionally.
Cool you're trying it out - looking forward to seeing what results you'll have.
I wonder if the direction is really thousands of experts. There seems to be a trend of experts staying in the 10B-20B range while models get bigger and bigger, so the direction is there. It will take a while to reach 10k experts though 😅
jkflying@reddit
FYI, you're responding to an LLM
DerDave@reddit
I'm aware the text was produced by an LLM, but I trust there is a human behind it, reading my comment.
cunasmoker69420@reddit
nah man these are just automated bots
maboesanman@reddit
Using “we” in that way is called “academic voice” and is generally considered appropriate for papers
Mashic@reddit
I think "we" is still the polite form to use instead of "I" in this kind of content. Or even better, speak from the point of view of the software: it does x and y.
Hytht@reddit
There is misinformation in the README, and it doesn't make LLMs much faster overall, as this issue explains: https://github.com/JordiSilvestre/Spectral-AI/issues/2
Thrumpwart@reddit (OP)
I thought the author's response to that was good.
Infninfn@reddit
So good it has not one but four emdashes.
Thrumpwart@reddit (OP)
Oh no, is something from localllama using AI for something?
Critical-Chef9211@reddit
Already fixed in the latest commit — the O(N²) claim and the “12 dimensions” overclaim are both corrected. Fair catches.
datbackup@reddit
Quickly went and searched and found that 3090 has 82 RT cores. 4090 has 128. 5090 has 170.
AvidCyclist250@reddit
4080 has 76, in case anyone was wondering.
smirk79@reddit
The claims in this post are so amazing, I'm over here renting an AWS instance to try and verify them after digging into the idea and code with my buddy Claude...
Critical-Chef9211@reddit
Genuinely curious what you find. All the profiling scripts are in the repo if you want to run them directly instead of just reading the code.