Spectral-AI - a project that uses Nvidia RT cores to dramatically speed up MoE inference on Nvidia GPUs (Crazy Fast!)
Posted by Thrumpwart@reddit | LocalLLaMA | View on Reddit | 25 comments
indigos661@reddit
looks like another innovation by claude
Thrumpwart@reddit (OP)
Probably. Is that bad though? If it works, I twerks.
qfox337@reddit
The whole point is that it doesn't work though. As mentioned in the other reply, someone's random research model just wasn't sensitive to massive dimensionality reduction in the expert router, which usually isn't the case for real models.
If you're gonna use AI to explore, cool whatever, just like actually check the results before wasting everyone's time.
(Of course, very generally, using other cores does sound interesting, but I'd also be curious if TDP throttling makes it a wash)
Thrumpwart@reddit (OP)
I don’t know that it doesn’t work. The GitHub author got it working. They provided a repo, some metrics, and explained how they did it. If you’re unhappy that it didn’t work for you, I’m just going to have to learn to live with that.
It’s an innovative technique that could be really beneficial to the local LLM community. If it needs additional work, you can say that. I don’t quite get why this is being met with hostility though.
I suspect part of it is resentment from purists who study this stuff and develop it academically or professionally. There’s a boatload of people like the GitHub author and myself who have no idea what we’re doing, but want to try new things.
I like to share new things that seem interesting. This is new and seems interesting. I really don’t care if the dude is using an LLM to translate or draft his posts - it’s not about them. It’s about the tech and how it could help a bunch of people access these incredible tools.
I’m sorry if you think I wasted your time. I hope by posting this I inspired a few hundred people to check out the repo and maybe someone will fix it, or adapt it, or improve upon it.
I hadn’t considered TDP limits but that does seem like a worthwhile consideration. I’ve never played with RT cores for gaming or anything else, so I’m curious now too. I hope 100 more people are curious about it now too.
DerDave@reddit
Cool idea to use unused hardware. I have some feedback and a question:
1) This seems to accelerate the MoE expert routing but has no influence on the speed or memory usage of the actual inference within the experts. So your memory savings and speed improvements only refer to a small part of the actual processing time + memory needs of the entire model. Would be less misleading to show the full picture.
2) You seem to be a solo researcher and I respect that but why do you always say "We"? I find it pretty odd when people refer to themselves + their AI, like they are a group of researchers. That also has slightly misleading vibes.
3) Lastly about the hierarchy and dimensions - why is it not truly hierarchical? With four layers and three hardware-accelerated dimensions you could have 3x3x3x3=81 dimensions instead of just 3+3+3+3=12. I think you would need 1x3x3x3=27 precomputed PCAs but that effort should be worth the gained higher dimensionality and expressiveness. In theory each token would have to go through 27 BVH traversals but given how fast they are, that shouldn't hurt, right? You could even add another level and gain a dimensionality of 243. As a further optimization you could selectively only continue tokens into later-stage BVH traversal with a high value and find a cutoff to spare the other less promising branches. Or did I completely misunderstand something here?
Critical-Chef9211@reddit
Good feedback, taking it point by point:
You’re right — routing is ~3% of total inference today. The memory savings (731×) apply to the router, not the expert weights. The README has a table showing the full inference breakdown (MLPs 63%, attention 20%, routing 2.8%). The argument for why it matters is scaling: at 1K–10K experts the O(N) gate dominates; BVH stays flat. But fair to say that’s speculative at current model sizes.
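To make the scaling argument above concrete, here is a rough back-of-envelope cost model in Python. This is a sketch, not code from the repo: the dimension sizes (`d_model`, `d_proj`) and branching factor are illustrative assumptions, and it only counts dot-product work, ignoring softmax and memory traffic.

```python
import math

# Dense softmax gating: one dot product of length d_model per expert,
# per token, so cost grows linearly in the number of experts.
def dense_gate_flops(n_experts: int, d_model: int = 4096) -> int:
    return n_experts * d_model

# A balanced tree/BVH-style router: compare against `branching` children
# at each of ~log_b(N) levels, each comparison in a small projected space.
def tree_gate_flops(n_experts: int, branching: int = 3, d_proj: int = 3) -> int:
    depth = math.ceil(math.log(n_experts, branching))
    return depth * branching * d_proj

for n in (64, 1024, 10_000):
    print(n, dense_gate_flops(n), tree_gate_flops(n))
```

The gap widens with expert count, which is the heart of the "O(N) gate dominates at 1K–10K experts" argument; whether it matters in practice depends on how small routing already is relative to the expert MLPs.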
Fair point — “we” is just an academic writing habit, not a reference to AI. Solo researcher, no co-authors. I’ll switch to “I”.
This is actually a genuinely better design than what I implemented. You’re right that 3^4 = 81 effective dimensions is achievable with branch-specific PCAs (27 precomputed projections at the leaf level), versus the single global PCA I used. The pruning idea — only continuing high-confidence branches — is essentially what the confidence-gated routing does, but your version would make it truly hierarchical. The main open question is whether branch-specific PCAs trained on enough data per region stay stable, since lower branches see fewer tokens. But the idea is sound and would be a real improvement. Worth trying.
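The branch-specific-PCA idea being discussed can be sketched in a few lines of Python. This is a toy illustration under assumed shapes, not the repo's implementation: the projections here are random matrices standing in for trained PCAs, the centroids are random stand-ins for learned region centers, and `d_model`, `branching`, and `depth` are hypothetical parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, branching, depth = 64, 3, 4  # 3^4 = 81 leaves (experts)
d_proj = 3                            # hardware-accelerated dims per level

# One projection + one set of child centroids per internal node,
# keyed by the branch path taken so far (this is what makes the
# PCAs branch-specific rather than one global PCA).
projections = {}  # path tuple -> (d_model, d_proj) matrix
centroids = {}    # path tuple -> (branching, d_proj) child centers

def build(path=()):
    if len(path) == depth:
        return
    projections[path] = rng.standard_normal((d_model, d_proj))
    centroids[path] = rng.standard_normal((branching, d_proj))
    for b in range(branching):
        build(path + (b,))

def route(token):
    """Walk the tree, projecting with the PCA of the branch taken so far."""
    path = ()
    for _ in range(depth):
        z = token @ projections[path]
        child = int(np.argmin(np.linalg.norm(centroids[path] - z, axis=1)))
        path = path + (child,)
    # Leaf index in [0, branching**depth) is the expert id.
    return sum(b * branching**i for i, b in enumerate(reversed(path)))

build()
expert = route(rng.standard_normal(d_model))
```

Note that counting every internal node gives 1 + 3 + 9 + 27 = 40 projections in total, of which 27 sit at the last level; the stability concern raised above shows up here as the deepest `centroids` entries each seeing only a small fraction of the training tokens.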
cunasmoker69420@reddit
That's pretty cool but what are your thoughts on banana pudding? Do you have a recipe you could share?
Thrumpwart@reddit (OP)
Banana pudding sounds good.
Critical-Chef9211@reddit
There’s a human here, yes. My English is pretty bad so I use AI to help write clearly — that’s probably why it reads that way. The research and code are mine though. And re: banana pudding — no recipe, but now I want some.
cunasmoker69420@reddit
whatever you say robot man
DerDave@reddit
Thanks for taking the feedback professionally.
Cool you're trying it out - looking forward to seeing what results you'll have.
I wonder if the direction is really thousands of experts. There seems to be a trend of experts staying in the 10B-20B range while models get bigger and bigger, so the direction is there. It will take a while to reach 10k experts though 😅
jkflying@reddit
FYI, you're responding to an LLM
DerDave@reddit
I'm aware the text was produced by an LLM, but I trust there is a human behind it, reading my comment.
cunasmoker69420@reddit
nah man these are just automated bots
maboesanman@reddit
Using “we” in that way is called “academic voice” and is generally considered appropriate for papers
Mashic@reddit
I think "we" is still the polite form to use instead of "I" in this kind of content. Or even better, speak from the point of view of the software: it does x and y.
Hytht@reddit
There is misinformation in the README, and it doesn't make LLMs much faster overall, as this issue explains: https://github.com/JordiSilvestre/Spectral-AI/issues/2
Thrumpwart@reddit (OP)
I thought the author's response to that was good.
Infninfn@reddit
So good it has not one but four emdashes.
Thrumpwart@reddit (OP)
Oh no, is something from localllama using AI for something?
Critical-Chef9211@reddit
Already fixed in the latest commit — the O(N²) claim and the “12 dimensions” overclaim are both corrected. Fair catches.
datbackup@reddit
Quickly went and searched and found that 3090 has 82 RT cores. 4090 has 128. 5090 has 170.
AvidCyclist250@reddit
4080 has 76, in case anyone was wondering.
smirk79@reddit
The claims in this post are so amazing, I'm over here renting an AWS instance to try and verify them after digging into the idea and code with my buddy Claude...
Critical-Chef9211@reddit
Genuinely curious what you find. All the profiling scripts are in the repo if you want to run them directly instead of just reading the code.