Waves are all you need
Posted by ethereel1@reddit | LocalLLaMA | View on Reddit | 42 comments
A revolutionary new paper introducing the Wave Network: An Ultra-Small Language Model.
ABSTRACT:
We propose an innovative token representation and update method in a new ultra-small language model: the Wave network. Specifically, we use a complex vector to represent each token, encoding both global and local semantics of the input text. A complex vector consists of two components: a magnitude vector representing the global semantics of the input text, and a phase vector capturing the relationships between individual tokens and global semantics. Experiments on the AG News text classification task demonstrate that, when generating complex vectors from randomly initialized token embeddings, our single-layer Wave Network achieves 90.91% accuracy with wave interference and 91.66% with wave modulation, outperforming a single Transformer layer using BERT pre-trained embeddings by 19.23% and 19.98%, respectively, and approaching the accuracy of the pre-trained and fine-tuned BERT base model (94.64%). Additionally, compared to BERT base, the Wave Network reduces video memory usage and training time by 77.34% and 85.62% during wave modulation. In summary, we used a 2.4-million-parameter small language model to achieve accuracy comparable to a 100-million-parameter BERT model in text classification.
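For a concrete picture of what the abstract is describing, here is a minimal NumPy sketch of a complex-vector token representation. The exact normalization, the phase formula, and the combination operators below are my guesses, not taken from the paper:

```python
import numpy as np

def wave_encode(token_embeddings):
    """Toy sketch of the complex-vector token representation described above.

    token_embeddings: (seq_len, dim) real-valued embeddings.
    Returns a (seq_len, dim) complex array combining a shared global
    magnitude with a per-token phase.
    """
    # Global semantics: one magnitude vector for the whole input text
    # (assumed here to be an L2 norm over the sequence axis).
    global_mag = np.linalg.norm(token_embeddings, axis=0)        # (dim,)

    # Local semantics: a per-token phase relative to the global vector
    # (an arctan-style mapping is assumed; the paper's formula may differ).
    phase = np.arctan2(token_embeddings, global_mag + 1e-8)      # (seq_len, dim)

    # Complex representation: magnitude * e^{i * phase}.
    return global_mag * np.exp(1j * phase)

def wave_interference(a, b):
    # Placeholder for the paper's "wave interference": complex addition.
    return a + b

def wave_modulation(a, b):
    # Placeholder for the paper's "wave modulation": element-wise complex product.
    return a * b
```

Interference and modulation here are just the two obvious ways to combine complex vectors, standing in for the two variants whose accuracy the paper reports.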
AIPornCollector@reddit
If there's anything I've learned about AI models, it's that 99% of them are absolutely useless no matter how much the devs talk them up. It's really hard to judge quality off a random paper with no open model or Hugging Face space.
boxed_gorilla_meat@reddit
No, you haven't learned that anywhere. You had the thought when you wrote this and felt that it sounded erudite, but you most certainly didn't experience or learn that anywhere.
AIPornCollector@reddit
Every week on the image generation side of AI a new model comes out claiming it's groundbreaking, but when you try it, it's worse than models released two years ago (SD1.5) that run faster on weaker hardware. The only real game changers we've had were Flux, SDXL, SD1.5, and maybe, just maybe, PixArt Sigma. The other hundreds of models that flew by, from Chinese government-funded mega models like Hunyuan-DiT to models a few dudes cooked up in their basement, might as well not have existed.
Dayder111@reddit
I guess it's mostly because these researchers and small teams have nowhere near the accumulated, thoroughly sorted labeled data, nowhere near comparable computing power, and are usually testing just one idea, while production models from large companies may combine many approaches at a larger scale.
It's all about resources: you need compute, funding and more people to expand your ideas and test them at larger scale, and in many cases securing funding (and compute, and people) is itself the longer-term goal. And of course you will present your work in the best light you can muster, since everyone else is already biased towards assuming your ideas are somehow silly and inferior, to save their own time and resources and to protect the work and approaches they have invested lots of time into and that give them their pride, passion and funding.
There are so many individual human/societal limitations that it's no wonder what the large companies want most is AI that can research and experiment autonomously, plus lots of compute to give it for that.
visarga@reddit
I must have read 100 papers trying to invent new approaches to replace transformers, none of them panned out.
GrapefruitMammoth626@reddit
Mamba? It would probably achieve great outcomes if everyone jumped on it thinking it's the next paradigm, but it sounds like transformers have the critical mass, just like PyTorch has a bigger user community than TensorFlow?
Maykey@reddit
I've certainly experienced that RetNet is not a "successor to transformers". The only 7B model was not even reuploaded.
Maykey@reddit
(X) doubt
Orolol@reddit
2.4 M Parameters.
Maykey@reddit
Who forbade them from adding more layers, so it doesn't look like a model from around 2017 landing only in the top 40 on IMDB? 4M parameters isn't impressive? BERT has 110M yet it doesn't run ~45 times slower.
Orolol@reddit
GPUs aren't free. It's research; not every research project aims to produce a SOTA model straight out of the paper.
vathodo68@reddit
The number of people on reddit claiming to have found the revolution or AGI is insane.
People come up with their own ideas and think they've found the holy grail lol.
Not trying to diminish the effort, but a lot of it is just hype marketing with innovative-sounding buzz terms.
schlammsuhler@reddit
This is the opposite of Nvidia's proposal to ditch the magnitude entirely by constraining vectors to a hypersphere.
Jazzlike_Tooth929@reddit
why do you believe it's revolutionary?
kevofasho@reddit
A 50-fold reduction in size means GPT-4-sized models could run on enthusiast-level hardware, which effectively democratizes AI
mrjackspade@reddit
For what, the six months before OpenAI trains their own model on this architecture at GPT4 sizes, effectively obsoleting open source again?
That doesn't "effectively democratize" anything; at best it's a brief period of catch-up.
gavff64@reddit
I’d imagine it would just plateau at some point. At that scale it’s more about efficiency than stronger scores.
dogcomplex@reddit
It also means 50x the models can be run at once, or 1/50th the energy costs.... this is a very big achievement if it checks out
noprompt@reddit
“All You Need Considered Harmful”
ethereel1@reddit (OP)
I agree! I roll my eyes whenever I see papers using "all you need" in title. But every rule is proven by its exceptions, and this is one. This paper is of the same caliber as the 2015 paper that introduced Attention and the 2017 paper that introduced Transformers.
Robonglious@reddit
I've been trying to do something like this for a month, though I was trying to convert BERT embeddings rather than create my own.
I'm excited to read this paper once I get back to my computer.
noprompt@reddit
Global/Local semantic transforms have always made sense to me. Sarcasm is the first thing that comes to mind. The meaning of tokens depends on their global transform in semantic space and their local transforms with respect to each other. Kinda like how objects are parented in 3d modeling, etc.
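As a toy illustration of that parenting analogy (nothing to do with the paper's actual math), a token's in-context meaning can be thought of as its local vector composed with a sentence-level global transform, like a child object's world-space position under its parent:

```python
import numpy as np

# Sentence-level context as a "global" transform, e.g. sarcasm rotating meaning.
global_transform = np.array([[0.0, -1.0],
                             [1.0,  0.0]])
# The token's meaning in isolation (its "local" vector).
local_meaning = np.array([1.0, 0.0])

# Compose global and local, like parenting in a scene graph.
meaning_in_context = global_transform @ local_meaning
print(meaning_in_context)  # [0. 1.] -> same token, different meaning under a different global transform
```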
ivankrasin@reddit
"Releasing code is all you need to get attention".
drooolingidiot@reddit
Have they released any code to go along with the paper?
FesseJerguson@reddit
not that i could find
msbeaute00000001@reddit
The idea of using complex numbers has been around since 2020 already. How is the Wave Network different? Does it scale well? An advantage of Transformers is that they scale really well.
dogcomplex@reddit
Alright, the other comments are pretty rude but if these findings check out this is pretty huge and deserves the "All You Need" nod.
How does this compare to ternary bit representations? Is that on your radar (Addition Is All You Need)?
roger_ducky@reddit
Wait wait wait. Actually using waves, if truly successful, would mean this can be sped up drastically with quantum computers once they work at scale. Near-real-time inference would then be possible. This could be really big.
stddealer@reddit
?
roger_ducky@reddit
Quantum computing has to do with creating wave models and combining them to do computations. Essentially, you join waves with opposing amplitudes and they cancel out, leaving you with fewer waves to look at. Or, if the waves have similar amplitudes, they become bigger. The "wave computation" in quantum computers, like DSPs for Fourier transforms, happens "quickly." So, stuff using waves as its base model should get a tremendous calculation speed-up on quantum computers.
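If you want to see the interference idea in plain code (ordinary NumPy, no quantum hardware involved), two complex "waves" with opposite phase cancel while two in phase reinforce:

```python
import numpy as np

# A "wave" as a complex amplitude: magnitude * e^{i * phase}.
wave_a = 1.0 * np.exp(1j * 0.0)      # phase 0
wave_b = 1.0 * np.exp(1j * 0.0)      # phase 0 (in phase with wave_a)
wave_c = 1.0 * np.exp(1j * np.pi)    # phase pi (opposite to wave_a)

print(abs(wave_a + wave_b))  # ~2.0 -> constructive interference
print(abs(wave_a + wave_c))  # ~0.0 -> destructive interference
```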
stddealer@reddit
Oh okay, that's neat. I thought quantum computers were just good to factor primes and such.
roger_ducky@reddit
That’s just because the quantum registers they could reliably create and use are still extremely tiny compared to normal computers. So the “size” of the problems they can handle in tech demos can’t be too big.
eggs-benedryl@reddit
thanks for helping develop LLM, so that I can understand a fraction of what you just said
off to chatgpt i go to get a dummy's version of what you just said
:)
JohnnyLovesData@reddit
Post GPT-TLDR here
AutomaticDriver5882@reddit
Alright, imagine you have a big box of LEGOs. Each LEGO block represents a word in a story you’re trying to build. Normally, to understand the whole story, you need a ton of LEGO blocks and a lot of time to put them together just right. This is like using a giant model like BERT, which needs a lot of blocks and takes forever to build.
Now, let’s say we found a super cool way to use fewer LEGOs but still build the same story. This is what the Wave Network does—it uses a trick with “waves.”
Instead of needing a whole pile of blocks, the Wave Network only needs a few special ones that have two parts:
• Part 1 (like the color of the block) tells you what the whole story is about. It’s like saying, “This is a story about dinosaurs.”
• Part 2 (like a little bump on the block) helps the blocks talk to each other to figure out where each one should go.
When we put these special blocks together, they “wave” and “wiggle” in a way that makes the story clear, even though there aren’t many blocks. And guess what? It does this faster and takes up less space than the usual way of building.
In a test, this Wave Network was really good at building stories almost as well as the huge box of LEGOs, even though it used a lot fewer blocks and took less time!
emteedub@reddit
How about a NotebookLM podcast to listen to (only up for 24 hrs):
https://jmp.sh/s/IRPo1PfhCaQH1Z4VYFtT
Lissanro@reddit
This podcast is quite good! I think it deserves to exist for more than 24 hours, so I hosted it here: https://dragon.studio/2024/11/Wave_Network.wav (in case it ever goes down, here is a backup link at the web archive).
Radiant_Dog1937@reddit
Is it slower? I'm not an expert by any stretch but I'd imagine that process is slower for the added accuracy.
roger_ducky@reddit
It’s probably extremely slow on existing hardware. But this would be super fast on quantum computers.
Radiant_Dog1937@reddit
I don't think they mean waves in the literal sense. It looks like they are using semantic vectors, which define the relationships between tokens, as coordinates to define a global semantic vector representing all of the input tokens' vectors. They then use some trig to calculate a phase vector from the global semantic vector. Those are combined into a complex vector, which can be represented as a wave.
When they run inference, the next token is selected based on how constructive or destructive the next token's complex vector is with respect to the wave. At least, that's what the paper seems to be saying.
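If that reading is right, the selection step might look roughly like this (my own guess at the mechanics, not code from the paper): score each candidate token by how constructively its complex vector combines with the current wave.

```python
import numpy as np

def interference_score(context_wave, candidate_wave):
    """How constructively a candidate's complex vector adds to the context wave.

    Both arguments are complex arrays of the same shape; a larger combined
    magnitude means the phases are more aligned (constructive interference).
    """
    return np.abs(context_wave + candidate_wave).sum()

# Toy usage: pick the candidate whose wave reinforces the context the most.
rng = np.random.default_rng(0)
dim = 8
context = rng.normal(size=dim) * np.exp(1j * rng.uniform(0, 2 * np.pi, dim))
candidates = {
    "token_a": context,                        # perfectly in phase with the context
    "token_b": context * np.exp(1j * np.pi),   # exactly out of phase
}
best = max(candidates, key=lambda t: interference_score(context, candidates[t]))
print(best)  # "token_a" -> most constructive
```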
FullOf_Bad_Ideas@reddit
The charts are wacky. The modulation accuracy chart stays flat as the model is trained - it's at 90% after just 10 steps and doesn't improve later. And the Transformer gets worse and worse accuracy as training progresses.
Samathura@reddit
I would love to understand this concept in more detail. I will go read the paper, but if you are open to it I would love to discuss how common additions like LoRAs and context cards are impacted.