[-]

grumd@reddit

they all specialize by syntactic type (content words vs function words vs punctuation). The "science expert" is a myth.

If this is true then it makes sense why REAP models never worked for me

[-]

Exactly! The widespread assumption that we can control or edit MoEs by patching or routing to specific "concept experts" usually fails because you aren't isolating a topic—you're just accidentally amplifying or suppressing verbs, transition words, or punctuation.

In the second paper (Expert Specialization), I did a deep dive into this exact phenomenon. The absolute "best" topic-specialized expert across all 3 models only activated 2.3x above a uniform random baseline, which is negligible. Meanwhile, the syntactic clustering was incredibly sharp and consistent layer by layer.

It completely shifts the paradigm on how we should approach Representation Engineering or expert-level patching!

[-]

grumd@reddit

Pretty sure you're writing both the post and comments using AI, can you tell me why?

[-]

BenignAmerican@reddit

Why does anyone do that? It’s laziness lol

[-]

Critical-Chef9211@reddit (OP)

I've spent the last several months writing bare-metal CUDA C++, optimizing OptiX BVH shaders, running 180+ tests, and documenting 3 full research papers from scratch in my spare time.

If saving a few minutes on English grammar translation on Reddit is "laziness", then I guess I'm guilty! 😅

[-]

BenignAmerican@reddit

Using AI probably

[-]

Critical-Chef9211@reddit (OP)

air catch! 😅 English is not my native language (I'm an independent researcher from Spain).

Because the questions here are highly technical, I've been feeding my Spanish thoughts into an LLM to help me translate and structure my replies. I want to make sure I'm explaining the paper findings clearly without messing up the grammar.

I wrote all the CUDA/OptiX kernels and did the research myself, but I definitely rely on AI as my "English PR assistant" today. Apologies if it sounded a bit too robotic! The code in the repo is 100% real and human-made.

[-]

grumd@reddit

Alright, that's fair :) Just making sure! Keep up the good work

[-]

Mguyen@reddit

The repo was at least in part or in whole, written by Claude code. Anyone who has used it can tell.

[-]

grumd@reddit

Yeah, looks like it, and seems like OP is just lying.

[-]

Alarming_Positive_59@reddit

If it works who cares if Claude wrote the code

[-]

Mguyen@reddit

Because OP is lying about it. It calls the rest of the work into question.

[-]

grumd@reddit

His post, comments and README contain multiple inaccuracies and contradictions. I'd assume the same goes about his code. That's the problem. Writing code with Claude is fine but only as long as you read it and understand what it wrote. If OP can't even write a comment on reddit without contradicting himself, why should we trust his code?

[-]

perkia@reddit

This comment is wild. Would you trust OP's code if a human had written it? I wouldn't, and AI usage doesn't change anything to the security posture.

Have your AI check it with your own criteria, install it on your own isolated env, test it yourself.

[-]

grumd@reddit

It's not even about security. If OP can't write a comment without a ton of inaccuracies and not understanding what he writes, his code most likely is the same way, with a ton of bugs and doesn't work as it's supposed to work. Yes I will trust code written by a human who knows what they're talking about more than this. Why's it so hard to believe?

And why would I waste my time setting up environments and testing every single vibecoded buggy mess someone spawned in their free time though? There's plenty of interesting tools made by knowledgeable people that actually work and bring value to the community.

Look, if OP knew what they built, how it works and why it works, then I'd trust it way more, regardless of it being written using Claude or by hand. I don't think it's a controversial take.

[-]

perkia@reddit

If OP can't write a comment without a ton of inaccuracies and not understanding what he writes, his code most likely is the same way, with a ton of bugs and doesn't work as it's supposed to work.

Have you ever conversed in English with non-fluent foreigners, for example non-bilingual Chinese people? The sentence construction and idea order is so different it generally makes their prose read like gibberish. That doesn't mean they are thoughtless, just that you can't sync easily.

And why would I waste my time setting up environments and testing every single vibecoded buggy mess someone spawned in their free time though?

Ask your AI to do it for you, let it cook and come back later?

Yes I will trust code written by a human who knows what they're talking about more than this. Why's it so hard to believe?

You shouldn't, though I agree it's not particularly hard to believe.

[-]

ParaboloidalCrest@reddit

Ugh, people like you are the most deterring thing in this sub. Just absolutely insufferable.

[-]

ParaboloidalCrest@reddit

You may think you sound like Sherloke Holmes, but you're just absolutely insufferable.

[-]

beavis9k@reddit

This is the best way I've ever seen someone handle this. You're a much kinder person than I. Bravo!

[-]

a_beautiful_rhind@reddit

You still can man.. some of those experts will be on glyphs or other languages, markdown, etc. Pretty sure the REAP models reaped multilingual hard.

[-]

RemarkableAntelope80@reddit

Lmao yeah, pretty sure I saw someone on here prune Qwen's low-use experts, where people said it basically forgot how to use a thinking token. I don't suppose its always so obvious what is being thrown away, but infrequent clearly doesn't mean unimportant.

[-]

Monkey_1505@reddit

REAP just tests the activations on a particular series of data (person doing the reaping can choose anything but usually math/code type tests by default), and then preserves that as it trims experts by layer.

It's not predicated on their being any form of specialization for experts. It just tests what still works for a customizable set of data.

[-]

EffectiveCeilingFan@reddit

If I understand correctly, this achieves the speedup by just not calculating attention and replacing it with something completely different. This will, obviously, cause significant degradation. I see you didn’t do any testing beyond HellaSwag, I recommend you test a benchmark that requires long context understanding.

Also, why’d you have your AI that wrote this entire thing make up a bunch of your comparison numbers? GPT-4 is not public, all your numbers regarding it are completely hallucinated.

Not to mention, I see you exclusively tested on models that are ancient. I’m assuming that’s because those were all the ones ChatGPT knew about? Like cmon man, Qwen1.5? Be serious.

[-]

Critical-Chef9211@reddit (OP)

I think there is a major misunderstanding of the post and the paper.

I am not replacing the Attention mechanism. I am replacing the linear projection Router (the Gate) inside the MoE (Mixture-of-Experts) layers. The attention mechanism remains 100% untouched.
I never mentioned or benchmarked GPT-4 anywhere. All numbers (+1.5% perplexity hit, HellaSwag drops) are strictly comparing the standard PyTorch linear routing gate versus my RT-Core BVH gate running on the exact same open-weight models.
I tested OLMoE-1B-7B, DeepSeek-MoE, and Qwen1.5-MoE because they are state-of-the-art architectures with high expert counts (64 experts), which is exactly what you need to test the O(log N) scaling curve of a BVH router.

As I mentioned in another comment, I'm using an LLM to help me translate my Spanish into proper English so I don't make grammar mistakes here on Reddit. But the C++ CUDA routing logic, the OptiX shaders, and the benchmarks on Zenodo are entirely my own work. Feel free to check the repo!

[-]

EffectiveCeilingFan@reddit

You’re lying through your teeth. You did not use an LLM to “translate”. Don’t kid yourself. You had it write pretty much the entire thing. It’s really easy to tell when something was merely translated, because then it doesn’t have all the classic AI tells. Your slop has all the classic AI tells.

It’s obvious because you clearly haven’t even read what it wrote.

[-]

Critical-Chef9211@reddit (OP)

"You are completely right about the git history. I gave the LLM my raw notes, and it generated the entire README. It originally hallucinated 'Attention' instead of 'MoE Routing', and when I realized the mistake thanks to the community, I corrected the repo.

I will be 100% honest: English is not my native language, this is my first big open-source project, and I leaned way too hard on AI to write the documentation because I didn't know how to format it properly. I fully own that mistake.

But please, stop obsessing over my bad prompt engineering and look at the actual repository. The C++, CUDA, and OptiX code is right there.

If you don't trust me, use your LLM to scan the source code for malware. Sandbox it. But please, compile it and run the benchmark. If the code doesn't work or the math is fake, feel free to roast me all you want. But test the code before judging the entire architecture based on my terrible documentation skills."

[-]

EffectiveCeilingFan@reddit

Sorry. I have been a bit critical. There’s just so much slop that gets posted every single day. And, well, when you initially made your post, it was definitely slop. In the future, you will receive much more positive attention if you explain your project in your own words, using something like Google Translate if needed. I promise you that people don’t care about bad grammar, because at least it was written by a human.

Humans like talking to humans, nothing is more insulting than looking into a project, reading the whole Readme, writing a comment pointing out all the issues with it, and then just receiving a purely LLM-generated response that is completely wrong.

[-]

Crinkez@reddit

Use a dedicated translator like Google translate. Do not use an LLM for translation.

[-]

grumd@reddit

Your README in the repo mentions GPT-4

Also you should test your architecture with Qwen 3.5 MoE if you want state-of-the-art

[-]

NarutoDragon732@reddit

Uh oh he's hallucinating

[-]

WPBaka@reddit

As I mentioned in another comment, I'm using an LLM to help me translate my Spanish into proper English so I don't make grammar mistakes here on Reddit. But the C++ CUDA routing logic, the OptiX shaders, and the benchmarks on Zenodo are entirely my own work. Feel free to check the repo!

come on bro

[-]

j_osb@reddit

Lmao.

[-]

svankirk@reddit

I think this is a fascinating approach and it deserves some attention. I don't really understand why anybody would take any time to rip into you because it was written by an AI because your English isn't so good. To be perfectly honest, most of the people on Reddit have exceptionally poor English skills. Anyways fascinating stuff. Keep up the good work! These sorts of experiments even failed. Ones are what will enable us poor folk to at least partake in some of the singularity

[-]

EffectiveCeilingFan@reddit

Hey mate, please look at their comment https://www.reddit.com/r/LocalLLaMA/s/qZxwV2NNLn and compare it to the contents of the README. They don’t even know what’s in the README that they supposedly wrote, and they call LLMs that came out January 2024 “state of the art”. None of this is their own work. It’s AI slop all the way down.

[-]

Awkward-Boat1922@reddit

It doesn't matter whether they typed every line out themselves or not.

They conceived the idea and had claude slop it together and that's fine.

This is some novel shit and we're getting threads like this deleted?

wtf is wrong with us?

[-]

EffectiveCeilingFan@reddit

Dude. This isn’t a case of “not typing every single line themselves”. The README was completely wrong. Like, the most fundamental aspect of the project, the explanation of what it does/how it works, was incorrect. And not just slightly. The original README stated that the purpose of the project was to replace attention with something else that can use RT cores. Completely wrong, it replaces the calculations done during expert routing, at least according to the updated version, which could still be wrong. LocalLLaMa explicitly prohibits low effort posts. The poster couldn’t be bothered to read their own README. Ergo, low effort. Presumably that is why it was removed. Then, when confronted about how wrong the README was, the poster doubles down and states that the problems simply do not exist… before actually taking the time to read their own README and correct the errors.

If an idea is worth sharing, it’s worth sharing yourself. There are hundreds of AI slop posts a day, who would ever want to be lumped in with them?

[-]

Critical-Chef9211@reddit (OP)

"Let me be 100% transparent with everyone here: Yes, I used AI to write and format the README. > I am the one who came up with the architecture, the math, and the idea of using RT Cores to bypass MatMul. But when it came to writing the documentation, formatting the markdown tables, and organizing the text, I heavily relied on LLMs to translate my raw ideas into a structured README. That’s why there was a hallucinated mention of GPT-4 earlier—I saw the mistake thanks to the community, and I fixed it.

To the people defending the core idea (like Awkward-Boat1922), thank you. You get it.

To the skeptics: Don't trust the README. Trust the code. The repository is open source. If you think this is just 'AI slop' or even malware, please, read the source code. Look at the CUDA/OptiX implementations. Sandbox it if you need to. But most importantly, run it.

"If you have any RTX 40-series or 50-series GPU, just run the benchmark. The O(\log N) traversal and the latency drops are real, and the code is right there for anyone to execute and verify."

[-]

RemarkableAntelope80@reddit

They literally had a message on _here_ with someone, where they said they were gonna fix that lol. Sure they definitely let the AI do the whole README, but like, shouldn't we judge the content instead? Like, chill lmao, the pitchfork will still be there tomorrow

[-]

Critical-Chef9211@reddit (OP)

Thank you so much for the kind words, it really means a lot!

The strength of the open-source community is exactly why I wanted to run these experiments and share everything. If we can figure out how to squeeze massive performance out of the consumer GPUs we already have sitting in our computers, everyone wins. Let's keep pushing! 🚀

[-]

a_beautiful_rhind@reddit

they all specialize by syntactic type (content words vs function words vs punctuation). The "science expert" is a myth.

exactly. this is a common MoE trope that has been debunked over and over.

[-]

Awkward-Boat1922@reddit

MoP might be a more apt description then?

Mixture of Partitions

[-]

a_beautiful_rhind@reddit

they're still "experts" on pieces of language. it is a language model after all.

[-]

Critical-Chef9211@reddit (OP)

100% agreed! Yet it's still surprisingly pervasive in most tutorials, blog posts, and even some AI marketing materials.

I felt it was absolutely necessary to explicitly quantify this exact phenomena across three modern but distinct architectures (OLMoE, Qwen, DeepSeek). I had to be absolutely sure that the syntactic clustering was universally predictable before attempting to map that clustered manifold into the RT Cores' geometric space.

Once I proved the "topic trope" was fundamentally false layer-by-layer, it gave me the green light to build the BVH!

[-]

a_beautiful_rhind@reddit

Do run more tests because I see they were shitting on you. Is all of this specific to blackwell or can go back to any RTX card?

[-]

Critical-Chef9211@reddit (OP)

Will definitely keep running more tests and scales!

And to answer your question: No, it is absolutely not specific to Blackwell! Because the kernels were built using the generic OptiX 9.0 API, this approach is backward compatible with ANY NVIDIA card that has dedicated RT cores.

You can clone this repo and accelerate routing on a Turing RTX 2060, Ampere 30-series, Ada 40-series, or the new 50-series cards. As long as the silicon has hardware Ray Tracing, the O(log N) geometric routing math holds up perfectly.

[-]

Awkward-Boat1922@reddit

Sounds very exciting but how come there's a perplexity hit? It should be identical, no?

Just faster?

[-]

Critical-Chef9211@reddit (OP)

Great question! It is not identical because the underlying math fundamentally changes.

A standard MoE gate uses dot(X, W) (dot product maximization). The RT Cores, however, use geometric bounding boxes and Euclidean/L2 distances through a Bounding Volume Hierarchy (BVH) to find the nearest centroid.

Because this is a geometric approximation of expert affinity rather than an exact dot-product, about 4.1% of tokens end up taking slightly different geographical paths and are routed to different experts. That slight token deviation is what causes the tiny +1.5% perplexity penalty.

[-]

Awkward-Boat1922@reddit

Thank you for the explanation.

No idea wtf this thread got removed.

[-]

Mister_bruhmoment@reddit

If, say, your work picks up speed and gets expanded and refined, would it be possible to run models (dense or moe) with standard cuda and RT cores in normal applications like LM studio or Ollama?

[-]

bebackground471@reddit

!RemindMe 4 days

[-]

RemindMeBot@reddit

I will be messaging you in 4 days on 2026-04-13 16:55:53 UTC to remind you of this link

CLICK THIS LINK to send a PM to also be reminded and to reduce spam.

^(Parent commenter can ) ^(delete this message to hide from others.)

^(Info)	^(Custom)	^(Your Reminders)	^(Feedback)

[-]

RoamingOmen@reddit

I found this out too. The , pronouns etc

[-]

BrilliantDirt2833@reddit

how is it even possible can someone explain

[-]

Artistic-Falcon-8304@reddit

be OP make something nice for the community get roasted for using AI to write cause English is not your language

lol must be first for a lot of y'all knowing other country exist other than the great US of A 🤣

[-]

FullOf_Bad_Ideas@reddit

what does

expert routing in MoE models.

have to do with

SpectralAI is a research prototype that replaces the O(N^2) Transformer attention mechanism with O(N log N) ray tracing operations, using the RT Cores already present in consumer NVIDIA GPUs (RTX 4090, RTX 5070 Ti).

attention is dense even in MoEs. I am not sure about the network that expert router uses but it's a marginal compute anyway.

[-]

Critical-Chef9211@reddit (OP)

Oh wow, phenomenally good catch! That is a massive typo in my GitHub README intro paragraph.

You are 100% correct: SpectralAI replaces the MoE Linear Routing Gate, NOT the dense Attention mechanism. My brain must have short-circuited when writing the repo description. I am pushing a commit to fix the README right now. Thank you so much for flagging that!

[-]

EffectiveCeilingFan@reddit

Lmao. AI slop.

[-]

NewtMurky@reddit

ChatGPT evaluation:

I’ll evaluate the idea behind Spectral-AI not just descriptively, but in terms of mathematical soundness, alignment with MoE theory, and practical viability.

1) What the approach is (conceptually)

Although the repo itself is lightweight, it aligns with a broader class of methods often called:

spectral routing
prototype / clustering-based routing
manifold-aware MoE

These approaches replace:

dot-product gating (linear projection + softmax)

with:

distance / spectral / clustering-based routing in latent space

This is consistent with known research directions:

Routing via distance to prototypes
Using spectral structure of embeddings
Enforcing manifold separation between experts

2) Mathematical core: is it reasonable?

✔ Yes — it is mathematically legitimate

The key idea:

w_i \propto \exp(-\|x - \mu_i\|\^2)

This is:

A Radial Basis Function (RBF) network
Equivalent to:
soft clustering
kernel density weighting
Gaussian mixture models (without covariance)

👉 This is well-established math, not experimental.

Equivalent interpretations

The routing becomes:
1. Kernel method

Experts = kernel centers
Routing = kernel weights

2. Soft clustering

Experts = clusters
Routing = cluster membership probabilities

3. Energy-based model

Lower distance = lower energy = higher probability

3) Where it improves over standard MoE

This approach addresses real, known problems:

A. Expert collapse (major issue)

Standard MoE:

Uses W x
Biases toward dominant directions
Leads to:
few experts overused
others unused

Spectral / distance routing:

Forces geometric separation
Encourages local specialization

This is explicitly supported in literature:

Spectral methods reduce overlap in expert representations

B. Interpretability

Instead of:

opaque linear logits

You get:

“this token is close to prototype X”

That’s geometrically interpretable.

C. Stability

Spectral constraints (if implemented):

bound Lipschitz constant
prevent sharp routing changes

This is actually a serious advantage in training.

4) Where the approach breaks (important)

This is where most “spectral MoE” ideas fail in practice.

❌ 1. Curse of dimensionality

Embedding space:

2K–16K dimensions

Problem:

Euclidean distance becomes uninformative
Distances concentrate:
all points look equally far

👉 This kills naive prototype routing.

❌ 2. Scaling issue

Standard MoE:

O(N) logits → top-k

Prototype routing:

O(N \cdot d) distance computations

This is:

more expensive
worse memory access pattern

❌ 3. No learned projection

Classic router:

z = W x

This:

learns task-specific routing subspace

Prototype approach:

relies on raw embedding geometry

👉 That geometry is NOT optimized for routing.

❌ 4. Training instability (if naïve)

Without constraints:

prototypes collapse
clusters overlap
routing degenerates to uniform

This is why serious work adds:

spectral regularization
orthogonality constraints
temperature tuning

❌ 5. Hardware inefficiency

Distance-based routing:

harder to fuse
worse for tensor cores
less optimized than GEMM

5) The “spectral” part — is it meaningful?

Depends on implementation.

There are two very different interpretations:

Weak version (likely in the repo)

“Spectral” = just:

distances
maybe normalization

👉 This is mostly rebranding of clustering

Strong version (research-grade)

Uses:

eigenvalues / singular values
spectral norm constraints
subspace decomposition

These are actually powerful:

enforce diversity
prevent expert overlap

👉 If the repo does NOT include:

spectral norm penalties
eigenspace reasoning

then “spectral” is mostly superficial.

6) Comparison to other modern routing ideas

Method	Strength	Weakness
Linear + softmax (standard)	Fast, optimized	Collapse
Top-k gating	Sparse, scalable	Non-smooth
Hash routing	Very fast	Less adaptive
RL routing	Optimal	complex
Spectral / prototype (this)	Structured, interpretable	scaling issues

7) Verdict

✔ Reasonable as a research direction

grounded in kernel methods and clustering
aligns with current MoE research trends
addresses real problems (collapse, specialization)

⚠️ But not production-ready (as-is)

Main blockers:

High-dimensional geometry issues
Worse hardware efficiency
Requires careful regularization
Needs hybridization with learned projections

8) What would make it actually good

If you were to evolve this idea, the winning version would be:

Hybrid router:

z = W x \quad \text{then cluster in projected space}

Add:

learned projection (critical)
low-dimensional routing space (e.g. 64–256)
spectral regularization
top-k sparsity

👉 This combines:

efficiency of standard MoE
structure of spectral routing

Final assessment

The idea behind Spectral-AI is mathematically sound and aligned with active MoE research, but in its naive form it is incomplete and unlikely to outperform standard routing without additional constraints and architectural improvements.

[-]

repolevedd@reddit

Hi. The idea of using RT cores for compute is actually pretty cool.

But to be honest, the repo has a real "AI-generated" feel to it, and your arguments, especially that wall of text with all the emojis, aren't helping. I use LLMs to help with my English too, and I'm using one to polish this message, but I can still see my own voice in what I write. When I look at your posts, it just looks like straight GPT-4 output.

I checked out the repo, and it's weird that there's plenty of info on speedups and GPU costs but zero data on actual LLM accuracy benchmarks. You only mentioned comparing CUDA vs. SpectralAI results in specific cases, which feels a bit cherry-picked to me. I'm not an ML expert, but even from my very basic experience messing around with local LLMs, I've learned that just getting a model to run is one thing, and getting it to actually produce coherent results is another. I totally believe you got these models running and that the RT cores are handling the MoE routing, but there's no way to tell if the output quality is actually any good.

[-]

Tiny_Arugula_5648@reddit

Did you fact check this?

[-]

Tiny_Arugula_5648@reddit

Do you have the expertise to fact check this analysis or are you copy and pasting something that could be filled with hallucinations?

[-]

Beginning-Window-115@reddit

ai is wrong most of the time when it comes to new stuff

[-]

Critical-Chef9211@reddit (OP)

That's actually a fascinating read! The AI got the mathematical premise (clustering, geometric separation, avoiding expert collapse) entirely correct and well-aligned with the theory.

However, it's hilarious because the AI's "Criticism #2 (Scaling Issue)" and "#5 (Hardware Inefficiency)" clearly show it hallucinated and completely missed the core innovation of my project!

ChatGPT assumes I'm running these geometric distances on standard Tensor Cores via PyTorch, which would indeed cause O(N) memory bottlenecks and horrible efficiency. BUT that is precisely the problem I solved. By projecting the tokens and mapping the subspace into an OptiX BVH (Bounding Volume Hierarchy) tree, I completely bypassed the Tensor Cores and offloaded the routing geometry strictly to the NVIDIA RT Cores (hardware ray tracing). Those dedicated silicon cores naturally traverse the BVH in O(log N) time, completely shattering the "O(N) scaling issue" and resulting in the 218x speedup over PyTorch.

It’s a great example of why we can't let AIs review novel hardware co-design yet—they assume we are bound by standard GEMM constraints!

[-]

Herr_Drosselmeyer@reddit

I also found that MoE experts don't actually specialize by topic.

How did you not know that already?

[-]

Critical-Chef9211@reddit (OP)

You're right that it's somewhat known or hinted at in literature (like the Mixtral paper), but the "topic specialist" analogy (the math expert, the coding expert) is still incredibly pervasive in tutorials, blogs, and even some AI marketing.

My goal wasn't just to point out the myth, but to strictly quantify it across three very different architectures (OLMoE, Qwen, DeepSeek). I needed to prove if this syntactic clustering (content words vs function words vs punctuation) held universally despite completely different model sizes (7B to 16B) and routing strategies (top-4 vs top-8).

Understanding exactly how they group syntactically (and confirming the U-shaped selectivity curve across layers) was a mandatory step before determining if I could efficiently route them geometrically using the RT Cores.

[-]

valdev@reddit

Seeing your response start with "You're right" triggers something deep inside of me. It's like a sort of LLM trauma response I didn't even know I have been forming.

[-]

shing3232@reddit

he use model to translate his through so yee

[-]

Calm-Start-5945@reddit

It sends shivers down my spine, too.

[-]

Silver-Champion-4846@reddit

And they were perfectly good English, too, before llms colonized them and they became signposts that say "llm speak, beware"

[-]

Critical-Chef9211@reddit (OP)

Hahaha, I don't blame you! My Spanish-to-English translation AI definitely showed its true colors right there. If I start my next sentence with "Certainly!" or "As an AI language model...", please unplug my router. 🤣

[-]

Serious-Log7550@reddit

Should we wait for pull request to LLAMA CPP?

[-]

Critical-Chef9211@reddit (OP)

A direct PR to llama.cpp might be tricky right now. The routing speedup relies heavily on the NVIDIA OptiX libraries to access the hardware RT Cores, and llama.cpp/ggml uses its own custom backends. However, the kernels are fully open source, so I'd love to see someone try to adapt the logic!

[-]

shing3232@reddit

you can access RT hardware via vulkan backend so that would be a more general solution

[-]

Farmadupe@reddit

https://github.com/JordiSilvestre/Spectral-AI/commit/812a47ae97c99e0ef80edec4aeedd6a499a6ad75

I like the look of the patent application diagrams that you've published, can you talk us through some of them?

[-]

Critical-Chef9211@reddit (OP)

Thanks! I tried to design them so the geometric intuition is as clear as possible. The diagrams mainly cover two core architectural shifts:

The BVH / Inception Engine Space: Normally, experts are a flat 1D array requiring dense dot-products. The diagrams show how tokens are instead evaluated in 3D coordinate space. The BVH (Bounding Volume Hierarchy) partitions this space into domains (Level 1), subdomains (Level 2), and specific concept experts (Level 3). The ray traversal naturally discovers the closest expert in O(log N) geometric intersection tests.
Spectral Routing (Optical Polysemy): The prism/refraction diagrams illustrate how we handle words with multiple meanings (polysemy, like the word "bank"). Rather than duplicating parameters, rays carry a context vector as a "color frequency." Using Snell’s law in the 3D space, incompatible contexts trigger Total Internal Reflection—causing the token to literally "bounce" off the wrong semantic domain and route to the correct one natively.

Let me know if there's a specific figure you want me to do a deeper dive into!

[-]

Farmadupe@reddit

Can you tell use the mechanism that you're using to map between N-dimensional vector space and the 3-dimensional spaces that the RT cores must be using? You must be able to do this bidirectionally without significant loss?

[-]

Tiny_Arugula_5648@reddit

This is very impressive idea.. Can you give us a explanation of real world impact? 7

31x less VRAM sounds impressive but how much space does that actually free up for a 7B MoE?

218x speed up on routing is also a large number what does that mean for real world speed up, how does it impact TPS for example?

Running this on a 16GB GPU was the model quantized, if so does that have an impact on these numbers?

[-]

Monkey_1505@reddit

Wait, anyone believes that MoE experts specialize on topic?

[-]

Maleficent-Ad5999@reddit

I used to.. before joining this sub

[-]

Monkey_1505@reddit

That was debunked and abandoned way back in the Mistral days when MoE became a thing. Surprising.

They just create the arch, and train it. You get things like specialized components with intentional human or genetic design, whereas just training a thing is sort of more chaotic.

[-]

Critical-Chef9211@reddit (OP)

Haha, there you go! As Maleficent mentioned, it's still an incredibly pervasive misconception because it's the #1 analogy used in 101 tutorials, blogs, and mainstream AI journalism ("the model acts like a company with a math expert and an astronomy expert").

While serious practitioners know it's a myth, I still had to rigorously map and quantify the exact limits of that syntactic clustering across multiple architectures. Proving it statistically was a mandatory step before I could confidently build a predictive, 3D geometric routing manifold to take advantage of the RT Cores.

[-]

WPBaka@reddit

This is really cool, mad scientist type stuff, especially the Nested Instance Acceleration Structure as a way of forcing RT cores to do higher dimensional math (my brain hurts). I'm going to dive in and hopefully understand it more after work.

Thanks for sharing!

[-]

Critical-Chef9211@reddit (OP)

Thank you! Honestly, forcing N-dimensional math into 3D nested OptiX acceleration instances definitely felt like a "mad scientist" moment of desperation. Thanks for taking the time to read through the repo, let me know if you run into any issues running it!

[-]

Nice_Cellist_7595@reddit

Neat work - not sure why everyone is downvoting you without trying it out!

[-]

Critical-Chef9211@reddit (OP)

Thank you! It's just the classic Reddit mob mentality. But as long as people who actually read the papers and code find it useful, it's totally worth it. Cheers!

[-]

No-Dot-6573@reddit

Nice project! To get those numbers into perspective: How much VRAM is at all consumed for routing? How much time is spent on routing? Those numbers are very high, but do they really affect real world use cases?

[-]

Critical-Chef9211@reddit (OP)

Great question, and you're spot on to ask about the absolute numbers!

You are correct that in a standard model with 64 experts like OLMoE-1B-7B, the absolute time spent on the routing gate is relatively small compared to the massive MLPs. The standard PyTorch gate takes a few milliseconds per batch. The RT Core BVH brings this down to a blistering 19 µs/batch. So for a 64-expert model today, the end-to-end generating latency isn't drastically transformed.

HOWEVER, the real-world impact is about unblocking the future:

O(log N) Scaling: The horizon for MoEs involves scaling to thousands or even millions of micro-experts. A standard O(N) linear gate completely bottlenecks at that scale (both in compute and VRAM). A hardware BVH can traverse a million experts in just \~20 intersection steps instead of a massive 1,000,000-way matrix multiply.
Context-Dependent Routing: By moving the routing out of linear algebra and into Ray Tracing geometry, we enabled "Spectral Routing" (Paper 3). We can natively use optical limits (Total Internal Reflection) to naturally solve word polysemy (e.g., routing the word "bank" differently based on context)—something standard linear routers fail at horribly.

So the goal isn't just to squeeze a few extra tokens/sec out of yesterday's 64-expert models, but to establish a zero-friction, hardware-native routing architecture that allows MoEs to scale to infinity while routing smarter.

[-]

PrashantRanjan69@reddit

Interesting project!

[-]

Critical-Chef9211@reddit (OP)

Thanks a lot!

[-]

martincerven@reddit

Pretty cool 🚀

[-]

Critical-Chef9211@reddit (OP)

hanks! Really appreciate it. Let me know if you dive into the repo and have any questions about the implementation. 🚀