Used ray tracing cores on my RTX 5070 Ti for LLM routing — 218x speedup, runs entirely on 1 consumer GPU

Posted by Critical-Chef9211@reddit | LocalLLaMA | View on Reddit | 91 comments

Quick summary: I found a way to use the RT Cores (normally used for ray tracing in games) to handle expert routing in MoE models. Those cores sit completely idle during LLM inference, so why not put them to work?

What it does:

Numbers (OLMoE-1B-7B, RTX 5070 Ti 16GB):

Unexpected discovery: I also found that MoE experts don't actually specialize by topic. Tested across 3 different models (OLMoE, Qwen-MoE, DeepSeek-MoE) — they all specialize by syntactic type (content words vs function words vs punctuation). The "science expert" is a myth.

Code repo: https://github.com/JordiSilvestre/Spectral-AI All papers are open access on Zenodo with full data and reproduction instructions: https://doi.org/10.5281/zenodo.19457288