New "major breakthrough?" architecture SubQ
Posted by Daemontatox@reddit | LocalLLaMA | View on Reddit | 10 comments
While reading through papers and news today I came across this post/blog claiming a major architectural breakthrough: a 12M-token context window, better than Opus, Gemini, and other models, at less than 5% of the cost, and token processing 52x faster than FlashAttention. Yep, you read that number right, fifty-two times. At this point I instantly called BS and was ready to move on tbh. There is zero code, paper, API, or anything to test it out or reproduce it.
So I was thinking maybe there's a slight chance I'm a complete idiot and somehow this is the next "Attention Is All You Need" thing. What do you guys think? I'm calling BS tbh.
DeltaSqueezer@reddit
I hope it is real and someone manages to reverse engineer what they've done and release an open weight model with it so we can test and use it.
xadiant@reddit
Anything "x times faster/better" is immediate cap for me. We are currently sifting through tons of dirt to find a couple of gold nuggets in terms of optimizations. If there were such a gain in the transformer architecture, it would be obvious to the existing credible AI labs.
ffgg333@reddit
Wtf? Too good to be true. Can someone do some research on them?
Few_Painter_5588@reddit
Llama Reflection vibes
leonbollerup@reddit
Sounds too good to be true... and if it sounds too good to be true, it usually is. Even in AI.
Dany0@reddit
Some details are here
A quick skim tells me this is something that has been tried before; iirc it improves single-needle-in-a-haystack a lot but starts to fail at N needles much faster than regular attention.
The vibe-coded (Claude) site and slop-blog style don't give much confidence either.
autisticit@reddit
I'm not experienced enough with LLMs to judge the actual breakthrough, but it doesn't look fake at first glance (and I'm very experienced at spotting fake things).
FormerIYI@reddit
Likely 90% of startup hype.
- There have been sparse attention systems before, such as Google's BigBird (not a generative LLM, more like a sparse-attention BERT). Somewhat better, but not enough to become the industry standard. Also, current LLMs have positional embeddings that strongly prioritize nearby tokens.
- The most expensive calculation in attention is the vector projections, which are O(N) in sequence length. Computing all the query-key dot products before the softmax is O(N^2), but that is ultimately not so expensive since the matrices are not large. The problem, of course, comes with decoding and KV caches, since you need to store all these projections; for the input context it matters far less.
- Therefore, sparse attention seems like a decent idea, but not a genius solution, and it comes with tradeoffs.
- The real problem is not reaching a 12M context, but making abstractive reasoning work reliably at even ~50k context (https://arxiv.org/abs/2502.05167), and making LLMs not break randomly when you feed them lots of irrelevant details (https://machinelearning.apple.com/research/illusion-of-thinking).
- In general, do not believe startups until they show reproducible results. In my space of interest (GUI agents) there are many startups demoing solutions that obviously don't and won't work well (Claude or GPT with a few agentic prompts), yet they show off benchmark scores like 90% accuracy on very complex tasks.
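To make the tradeoff in the bullets above concrete, here is a toy numpy sketch of dense attention versus a simple top-k sparse variant (each query attends only to its k highest-scoring keys). This is purely illustrative, assuming nothing about the startup's actual method; `topk_sparse_attention` is a hypothetical name, and real systems like BigBird use structured block/random patterns rather than per-query top-k.

```python
import numpy as np

def dense_attention(Q, K, V):
    # Full N x N score matrix: O(N^2 * d) time and O(N^2) memory.
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

def topk_sparse_attention(Q, K, V, k):
    # Keep only the k largest scores per query; mask the rest to -inf
    # so they get zero weight after the softmax.
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    idx = np.argpartition(scores, -k, axis=1)[:, -k:]  # top-k key indices per query
    masked = np.full_like(scores, -np.inf)
    np.put_along_axis(masked, idx, np.take_along_axis(scores, idx, axis=1), axis=1)
    weights = np.exp(masked - masked.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
N, d = 64, 16
Q, K, V = rng.normal(size=(3, N, d))
dense = dense_attention(Q, K, V)
sparse = topk_sparse_attention(Q, K, V, k=N)  # with k=N it reduces to dense
print(np.allclose(dense, sparse))  # True
```

The sparse variant still computes the full score matrix here (a real implementation would avoid that), but it shows where the saving comes from and also where information is lost: any key outside a query's top-k contributes nothing, which is one plausible reason such schemes degrade on multi-needle retrieval.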
chrd5273@reddit
Probably bullshit. Also not open.
GrapefruitMammoth626@reddit
Details are extremely sparse. I can barely find a paper trail on the founders. Looks a little fishy.