Waves are all you need
Posted by ethereel1@reddit | LocalLLaMA | View on Reddit | 94 comments
A revolutionary new paper introducing the Wave Network: An Ultra-Small Language Model.
ABSTRACT:
We propose an innovative token representation and update method in a new ultra-small language model: the Wave network. Specifically, we use a complex vector to represent each token, encoding both global and local semantics of the input text. A complex vector consists of two components: a magnitude vector representing the global semantics of the input text, and a phase vector capturing the relationships between individual tokens and global semantics. Experiments on the AG News text classification task demonstrate that, when generating complex vectors from randomly initialized token embeddings, our single-layer Wave Network achieves 90.91% accuracy with wave interference and 91.66% with wave modulation—outperforming a single Transformer layer using BERT pre-trained embeddings by 19.23% and 19.98%, respectively, and approaching the accuracy of the pre-trained and fine-tuned BERT base model (94.64%). Additionally, compared to BERT base, the Wave Network reduces video memory usage and training time by 77.34% and 85.62% during wave modulation. In summary, we used a 2.4-million-parameter small language model to achieve accuracy comparable to a 100-million-parameter BERT model in text classification.
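For readers who want something concrete before the comments: below is a minimal NumPy sketch of one plausible reading of that abstract, where the magnitude vector is a column-wise norm over the token embeddings (the global semantics), each token's phase encodes its relation to that global magnitude, interference is complex addition, and modulation is complex multiplication. The function names and exact formulas are illustrative assumptions, not the authors' implementation; see the paper and the repos linked in the thread for the real thing.

```python
import numpy as np

def complex_token_representation(embeddings: np.ndarray) -> np.ndarray:
    """Sketch of a wave-style token representation (illustrative, not the paper's exact math).

    embeddings: (num_tokens, dim) real-valued token embeddings.
    Returns (num_tokens, dim) complex vectors whose shared magnitude encodes
    global semantics and whose phase encodes each token's relation to it.
    """
    # Global semantics: column-wise L2 norm over all tokens in the input.
    global_magnitude = np.sqrt((embeddings ** 2).sum(axis=0))          # (dim,)
    # Phase: how each token's components sit relative to the global magnitude.
    residual = np.sqrt(np.maximum(global_magnitude ** 2 - embeddings ** 2, 0.0))
    phase = np.arctan2(embeddings, residual)                           # (num_tokens, dim)
    # Complex vector: magnitude * exp(i * phase), broadcast over tokens.
    return global_magnitude * np.exp(1j * phase)

def wave_interference(z1: np.ndarray, z2: np.ndarray) -> np.ndarray:
    """Interference as complex addition (illustrative)."""
    return z1 + z2

def wave_modulation(z1: np.ndarray, z2: np.ndarray) -> np.ndarray:
    """Modulation as complex multiplication (illustrative)."""
    return z1 * z2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(8, 16))            # 8 tokens, 16-dim random embeddings
    z = complex_token_representation(tokens)
    print(wave_interference(z[0], z[1]).shape)   # (16,)
    print(wave_modulation(z[0], z[1]).shape)     # (16,)
```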
Inevitable-Start-653@reddit
I know nobody is going to believe me, but I've been working on almost this exact same idea. Encoding the token information in complex vectors using phase and magnitude.
The distances between vectors take on a predictable characteristic too.
I came to this idea of what I called "complex wave tokenization" by studying how the brain actually works. Not as a collection of neurons, but instead as a complex electromagnetic waveform.
I'm actually glad someone with more academic street cred is doing this too, my hypothesis is that this is the path to actual "understanding" and "cognition" unlike the current transformers architecture.
cnmoro@reddit
Very Nice! Would you share the code ?
Inevitable-Start-653@reddit
I'll post my original code, and might explore my original ideas more, but here is my attempt to use the authors' ideas. I'm unsure of the quality of my original implementation compared to the authors', but I am working on both their implementation and mine. The main difference, it looks like, is the definition of the global semantics vector.
https://github.com/RandomInternetPreson/WaveAI
I'm just trying new things, right now I'd like to be able to have even a basic LLM type experience with this wave idea, I've gotten a model to say:
he future of artificial intelligence will the ,cop caught the Lawsonerno dove , addressedapproximatelymembers trained the balance sem , the efficacy 244 of the Mania freeway eventually Cecil Voters his emulate952 the CLS , sulfPokemon571 tfcatentryicker remorse oil levels Warsathing the revel diminishingsequent Tradable Kosovo , Volunte , the Madden multiply VotesPak electoral Paradise Maz the exce 257 estimating locals cabbage closely clipped sleptQuality the TT simplyobosDs the neoliberal scoring670 eats Sylv Terrenterren Archbishop Michaels timelines expectation NHrom ,thinBron , the the blinking dismant
so that's something
SquareHistorical6425@reddit
Guys, the detailed analysis on the wave network is coming, and I have extended it to Token2Wave: https://arxiv.org/abs/2411.06989
ivankrasin@reddit
"Releasing code is all you need to get attention".
Maykey@reddit
It actually doesn't always work. Tokenformer released code and their results are very promising, but it barely got noticed in this subreddit.
Coresce@reddit
Attention is all you need to get funding
1-bit_llm@reddit
An implementation based on the paper: https://github.com/kevinraymond/wave-network
It seems promising.
askchris@reddit
Awesome work!
It's cool that your "results closely match the paper's claims! The Wave Network achieved 91.64% accuracy compared to BERT's 94.53%, which is remarkably close considering it uses only 2.4M parameters versus BERT's 100M parameters"
How did you put this together so fast? Are you open to doing any collaborations?
1-bit_llm@reddit
I'm a stubborn, old, grouchy technologist that someone let use AI and LLMs.
I don't understand all the stuff deeply, but I have a desire to succeed - everyone should get to benefit from this stuff. I'm still working out differences, and now throwing some stuff against Claude about the math. If I don't know, I will learn enough.
Being able to run local, good-at-single-task models without 80 GB VRAM is a goal. And I have two 4090s sitting here begging for work.
Happy to converse about anything, for sure.
ethereel1@reddit (OP)
I'm impressed and thankful that you did this. And your goal of local single-task models is spot on.
How did you manage to figure out the coding implementation from the math in the paper? Do you have any tips, tricks, heuristics you could share? You did it so quickly you're obviously good at this.
1-bit_llm@reddit
I've tried to read or skim interesting research, which is already tough because there's so much of it! You might be surprised to hear I've never implemented a paper before this one.
I was serious about the math - it's not my strong suit. I generally understand the concepts of this stuff, and I let LLMs (Claude, in this case) help with the stuff I don't know. Knowing what to ask to properly guide a conversation, and how to ask, makes these tools very effective (as most anyone in here probably knows).
Understanding the relationships between system components, putting things together so they work properly, and just being able to "get it done" come from decades of technology work across all aspects.
I'll think about how I can document the process I use internally - maybe that's useful to share with others!
ethereel1@reddit (OP)
Did Claude explain the math in the paper? If so, how did you present the paper to it? OCR of math expressions is a problem to solve, and I wonder how you did it. Even going the most sure-fire route, converting the LaTeX source of arXiv papers, that still leaves the question of which math encoding is best for LLMs to work with. I was going to run tests on this issue, but maybe you already have a good solution.
1-bit_llm@reddit
I don't have a solution per se; it's a challenge, as you mention, so I just experiment with it. Working on this one in particular, I wanted to see how well the latest Claude update worked.
Feeding in the paper up front, I didn't even bother asking for an explanation or giving much direction at all. It was essentially "Hey, check out this paper - it's interesting! Let's implement it" and off we went.
Unsurprisingly, it didn't work on the first try, but the errors were simple Python things to fix up. After that, and now surprisingly, it worked. I started with AG News and immediately got almost the same results as the paper. Not bad!
It's quick enough to run through four epochs that I grabbed the other datasets and ran them as well, with similar results on DBpedia but only ~80% on IMDB. I'm sure this is a parameter thing.
After it initially worked, I mentioned it to a colleague of mine, and he sent me the link here. I figured I had to share it at that point.
Now I go through and experiment, learn, fail, repeat ... just like everything else in life. Claude is okay at explaining this stuff, but it's limited still, of course. Not knowing the math very well hurts when comparing against the paper, but I can compare results all day. Hopefully I'll plug some of this math into my old brain along the way!
The only other part is that I generally understand the signal-based approach and can help steer the conversations. It's important to limit things to small iterations, try/fail, try/fail ... eventually try/succeed happens.
SquareHistorical6425@reddit
I have to say, good work!
1-bit_llm@reddit
Thank you very much - As one of the authors, y'all deserve the credit for making this stuff up! I just read it, pretend to understand stuff like the plus symbol, or an equal sign, and then run off into the wild trying to make it work.
I'm not sure I've got it exactly correct yet, but learning is part of the adventure. Your paper is an inspiration about what is possible.
SquareHistorical6425@reddit
Thank you, I do have a follow-up article analyzing the Wave Network; I will tell you after I post it on arXiv.
Maykey@reddit
(X) doubt
Orolol@reddit
2.4 M Parameters.
Maykey@reddit
Who forbade them from adding more layers, so it doesn't look like a model from around 2017 that lands only in the top 40 on IMDB? Is 4M parameters not impressive? BERT has 110M, yet it doesn't run ~45 times slower.
Orolol@reddit
GPUs aren't free. It's research; not every research project aims to build a SOTA model straight out of the paper.
StyMaar@reddit
A single dude trained TinyLlama with his own money last year: 1.1B params trained on 3T tokens, by renting 16 A100-40s for 90 days.
If you are doing research and don't do even a tenth of that, maybe you aren't doing the revolutionary stuff you think you are.
Orolol@reddit
New architectures often require thousands of training runs, because we need to tweak values, structure and hyperparameters.
But if you have some millions to spare, I'm sure those researchers would be more than happy.
StyMaar@reddit
Thousands of $20 training runs aren't going to cost millions though.
Orolol@reddit
With grid/random search, you can easily need thousands of training runs.
But anyway, if you think this is really cheap, go on and use your money.
StyMaar@reddit
I'm not claiming I'm writing “revolutionary papers”…
Orolol@reddit
Of course.
Maykey@reddit
Their datasets are orders of magnitude smaller: IMDB is around 15M tokens, AG News is 7M, and DBpedia is ~40M (train and test combined).
The paper runs four epochs for two variants of the Wave Network, BERT, and a one-layer Transformer. So all models with four epochs combined need to chew through around 1B tokens (~62M tokens * 4 models * 4 epochs).
The needed budget is probably closer to a trip to McDonald's than to pretraining: they don't pretrain the network as an MLM, and hardware has gotten so much better since BERT came out that in 2021 pretraining it was discussed in the $50-100 range.
noprompt@reddit
For reference, a single 40GB A100 runs about $2200 per month on GCP. So the cost to replicate that project should be in the neighborhood of $105,000, give or take. So, yeah, GPUs aren't free.
> If you are doing research and don't do even a tens of that, maybe you aren't doing the revolutionary stuff you think you are.
This assumes a massive GPU budget is required to do meaningful research which, first of all, isn't true and, second, doesn't make sense as a starting point because the whole point here is to do more with less.
StyMaar@reddit
And of course you take the most expensive provider out there… If you rent in a non-hyperscaler gpu cloud provider, the replication cost is going to be around $50k.
But then again, I said "don't do even a tenth of that" (which means 10x fewer params and 10x fewer tokens), which is going to cost 20-50 times less! (So we're talking about a budget ceiling of around $3k.)
So yeah, GPUs aren't free, but you don't need that much compute to train a 100M parameter model anyway (Karpathy reproduced GPT-2 124M for $20; please don't tell me the researchers don't even have a $20 budget!).
So again, unless your model doesn't scale at all, there's no good reason to release a 2.4M-parameter model in 2024.
noprompt@reddit
I didn’t cherry pick the provider, I quoted the one I’m familiar with because the company I work for uses GCP. But even at half the price $50K is not fucking pocket change for most “single dudes.”
I agree with the rest of what you’re saying though. At 1.1B, you can do a lot with a little… If the task is small.
Still, all the whining about model size and the usual petty gripes with this paper are unnecessary. There are more constructive ways to say those things without being a dickhead.
Maykey@reddit
This one does. They talk about how it has the potential to revolutionize the field. The talk of revolution didn't come from the OP alone; it was part of the paper.
Orolol@reddit
Being revolutionary doesn't mean they release a big model. That's consumer behaviour, and this is a research paper.
Maykey@reddit
24M is not a big model.
Orolol@reddit
Nobody said that, but it can still be quite expensive to train it multiple times with various changes and tweaks to the architecture and hyperparameters. Again, GPUs aren't free. Beggars can't be choosers.
WTF are you talking about? Did you even understand a single bit of this paper?
Maykey@reddit
Then don't talk about a big model in "Being revolutionary doesn't mean they release a big model" if the topic is just a multi-layer model, not a big one.
Even if we're talking about a 12-layer model (the same number of layers as the BERT they fine-tune), that's only 2.4M*12 ≈ 29M parameters. Given the dataset sizes, in transformer terms that's nothing to worry about price-wise (especially if flash attention is used).
Does their model scale so badly that they can't train even 2-3 layers without running out of budget, the way others manage to?
That's literally part of research, as can be seen in any normal new-architecture paper. Tweaks are often even put in their own section or appendix called "Ablations".
I'm talking about their conclusion.
Orolol@reddit
Ok you seem confused or you haven't read the paper at all. I don't see any point continuing this discussion.
SquareHistorical6425@reddit
well, I suppose it is just because there is no free lunch maybe.
vathodo68@reddit
The number of people on Reddit claiming to have found the revolution or AGI is insane.
People come up with their own ideas and think they've found the holy grail lol
Not trying to diminish the efforts, but a lot of it is just hype marketing and innovative-sounding buzz terms.
noprompt@reddit
“All You Need Considered Harmful”
ColorlessCrowfeet@reddit
And this paper is titled "Wave Network: An Ultra-Small Language Model".
SquareHistorical6425@reddit
Yes, 'Wave is All You Need' was the title I used when I initially submitted it to AAAI, but I withdrew it, as the content has since been significantly modified. But truly, I do not know who put it here; it is weird.
noprompt@reddit
Well, the new title is much better! I made my original comment as a joke because the "All You Need" meme has become as tired as "Considered Harmful".
Also, I think this paper is great. It put a huge smile on my face when I came across it last night. Just a couple weeks ago I was talking with one of my colleagues about the idea of global/local semantic transforms in language modeling, which was mostly inspired by playing around in Houdini (3d software). The paper is also completely approachable.
What's neat about this way of looking at things is that, if it works, I think it could give us better control over model generation by allowing us to move tokens around in semantic space (sarcasm was an example I mentioned previously). Do you see this sort of potential too?
SquareHistorical6425@reddit
Oh, please don't worry about your jokes. And yes, I suppose we can get insight into the contribution of a token to the overall semantics as it moves through feature space, which is a very brilliant idea.
ethereel1@reddit (OP)
Great minds think alike! :) I can see why you'd be spooked by how this went, but it's no more than synchronicity; we all have "is all you need" on our minds, it's become an archetype. I look for these sorts of groundbreaking papers on arXiv daily and posted yours here to alert the community. As you can see, it's paid off quickly; we already have a working implementation. Thank you for this remarkable contribution to the field!
SquareHistorical6425@reddit
Yes, I agree with you, and thanks for posting it. I actually felt pretty astonished when I saw my article being discussed. I thought it might be ignored like other papers. Your post really encouraged me, and I will put together a more detailed analysis of the Wave Network as soon as possible.
ethereel1@reddit (OP)
I agree! I roll my eyes whenever I see papers using "all you need" in the title. But every rule is proven by its exceptions, and this is one. This paper is of the same caliber as the 2015 paper that introduced Attention and the 2017 paper that introduced Transformers.
SquareHistorical6425@reddit
Hello everyone, I am the first author of this paper, and I am not sure who posted it here. When I came up with this idea three months ago, I was considering that the attention mechanism might not directly account for global semantics. And the key point is that the Wave Network can use randomly initialized embeddings to achieve this result.
askchris@reddit
Congrats on your work! It's quite creative, how did you come up with this?
SquareHistorical6425@reddit
Thanks. Well, I was trying to find a more natural method to update information, and then I thought about fields: a wave is a disturbance of a field, and waves can exchange information when they meet. Then I started to think about how I could construct a wave representation.
askchris@reddit
Great insight, so from physics? I'm working on compressing information further as a way to maximize intelligence on consumer hardware (ideally AGI on a laptop) and find this to be an inspiring direction so thank you for your work! 🙌
SquareHistorical6425@reddit
Yes, from the perspective of physical wave and signal processing, and I am glad to help.
BobFloss@reddit
The standard transformer architecture does nearly the exact same math as IQ modulation for positional encoding
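One way to see the analogy: the original Transformer's sinusoidal positional encoding fills each pair of dimensions with the sin/cos of the same angle, i.e. the Q and I components of a carrier at a fixed frequency. Here is a short sketch of that standard encoding (this is the Transformer formula, not anything from the Wave Network paper):

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions: int, dim: int) -> np.ndarray:
    """Standard sinusoidal positional encoding, viewed as I/Q (cos/sin) pairs.

    Each dimension pair (2k, 2k+1) holds the imaginary and real parts of
    exp(i * pos * omega_k), i.e. a carrier at frequency omega_k.
    """
    assert dim % 2 == 0, "dim must be even"
    positions = np.arange(num_positions)[:, None]              # (P, 1)
    omegas = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))     # (dim/2,)
    angles = positions * omegas                                # (P, dim/2)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)   # "Q" component
    pe[:, 1::2] = np.cos(angles)   # "I" component
    return pe

print(sinusoidal_positional_encoding(4, 8).shape)  # (4, 8)
```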
Radiant_Dog1937@reddit
Is it slower? I'm not an expert by any stretch but I'd imagine that process is slower for the added accuracy.
roger_ducky@reddit
It’s probably extremely slow on existing hardware. But this would be super fast on quantum computers.
Radiant_Dog1937@reddit
I don't think they mean waves in the literal sense. It looks like they are using semantic vectors, which encode the relationships between tokens, as coordinates to build a global semantic vector that represents all of the input tokens' vectors. They then use some trig to calculate a phase vector from the global semantic vector. Those are combined to make a complex vector, which can be represented as a wave.
When they run inference, the next token is selected based on how constructively or destructively the next token's complex vector interferes with the wave. At least that's what it seems like the paper is saying.
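That reading is the commenter's interpretation (the paper itself evaluates text classification, not generation), but as a toy illustration of "score a candidate by how constructively its complex vector interferes with the running wave", a hypothetical scoring rule like the one below captures the idea; it is a stand-in, not the paper's method.

```python
import numpy as np

def interference_score(sequence_wave: np.ndarray, candidate_wave: np.ndarray) -> float:
    """Cross term of |seq + cand|^2: positive when the candidate interferes mostly
    constructively with the sequence wave, negative when mostly destructively."""
    cross = np.vdot(sequence_wave, candidate_wave)   # conjugate inner product
    return float(2.0 * cross.real)

rng = np.random.default_rng(0)
seq = rng.normal(size=8) * np.exp(1j * rng.uniform(0, 2 * np.pi, size=8))
in_phase = 0.5 * seq        # same phases as the sequence wave
out_of_phase = -0.5 * seq   # opposite phases
print(interference_score(seq, in_phase) > 0 > interference_score(seq, out_of_phase))  # True
```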
SquareHistorical6425@reddit
Actually, quantum waves share the same complex vector representation as the representation in this article.
roger_ducky@reddit
Wait wait wait. Actually doing waves, if truly successful, will mean this can be sped up drastically using quantum computers once they work at scale. Near real time speed inference would then be possible. This could be really big.
Inevitable-Start-653@reddit
Likely, I was thinking this would be good for photonic chips. Using light instead of transistors to do the inferencing.
wasatthebeach@reddit
Ha, that should then be called "Interferencing"
Inevitable-Start-653@reddit
That's pretty good 😂
stddealer@reddit
?
roger_ducky@reddit
Quantum computing has to do with creating wave models and combining them to do computations. Essentially, you join waves with opposing amplitudes and they cancel out, leaving you with fewer waves to look at. Or, if the waves had similar amplitudes, they become bigger. The "wave computation" in quantum computers, like DSPs for Fourier transforms, happens "quickly." So, stuff using waves as their base models should get a tremendous calculation speed-up on quantum computers.
stddealer@reddit
Oh okay, that's neat. I thought quantum computers were just good to factor primes and such.
roger_ducky@reddit
That’s just because the quantum registers they could reliably create and use are still extremely tiny compared to normal computers. So the “size” of the problems they can handle in tech demos can’t be too big.
AIPornCollector@reddit
If there's anything I've learned about AI models, it's that 99% of them are absolutely useless no matter how much the devs talk them up. It's really hard to judge quality off of a random paper without an open model or Hugging Face space.
boxed_gorilla_meat@reddit
No, you haven't learned that anywhere. You had the thought when you wrote this and felt that it sounded erudite, but you most certainly didn't experience or learn that anywhere.
sleepthesunaway@reddit
you're calling him a LLM
AIPornCollector@reddit
Every week on the image generation side of AI a new model comes out claiming it's groundbreaking, but when you try it, it's worse than models released two years ago (SD1.5) that run faster on weaker hardware. The only real game changers we've had were Flux, SDXL, SD1.5, and maybe, just maybe, PixArt Sigma. The other hundreds of models that flew by, from Chinese government-funded mega models like Hunyuan-DiT to models a few dudes cooked up in their basement, might as well not have existed.
Dayder111@reddit
I guess it's mostly because these researchers and small teams have nowhere near the accumulated, thoroughly sorted labeled data, nowhere near comparable computing power, and are usually testing just one idea, while production models from large companies may combine many approaches at a larger scale.
It's all about the resources: you need compute, funding and more people to expand your ideas and test them at larger scale, and in many cases funding (and compute, and people) is your longer-term goal. And of course you will present your work in the best light you can muster, since everyone is already biased towards your ideas being somehow silly and inferior, in order to save time and resources and to secure the work and approaches you have invested lots of time into and that give you your pride, passion and funding.
There are so many individual human/societal limitations that it's no wonder what the large companies want most is AI that can research and experiment autonomously, plus lots of compute to give it for that.
visarga@reddit
I must have read 100 papers trying to invent new approaches to replace transformers, none of them panned out.
GrapefruitMammoth626@reddit
Mamba? They’d probably achieve great outcomes if everyone jump on them thinking it’s the next paradigm but it sounds like transformers have the mass, just PyTorch has a bigger user community over Tensorflow?
Maykey@reddit
I've certainly experienced that RetNet is not a "successor to transformers". The only 7B model was not even reuploaded.
eggs-benedryl@reddit
thanks for helping develop LLMs, so that I can understand a fraction of what you just said
off to chatgpt i go to get a dummy's version of what you just said
:)
JohnnyLovesData@reddit
Post GPT-TLDR here
AutomaticDriver5882@reddit
Alright, imagine you have a big box of LEGOs. Each LEGO block represents a word in a story you’re trying to build. Normally, to understand the whole story, you need a ton of LEGO blocks and a lot of time to put them together just right. This is like using a giant model like BERT, which needs a lot of blocks and takes forever to build.
Now, let’s say we found a super cool way to use fewer LEGOs but still build the same story. This is what the Wave Network does—it uses a trick with “waves.”
Instead of needing a whole pile of blocks, the Wave Network only needs a few special ones that have two parts:
• Part 1 (like the color of the block) tells you what the whole story is about. It's like saying, "This is a story about dinosaurs."
• Part 2 (like a little bump on the block) helps the blocks talk to each other to figure out where each one should go.
When we put these special blocks together, they “wave” and “wiggle” in a way that makes the story clear, even though there aren’t many blocks. And guess what? It does this faster and takes up less space than the usual way of building.
In a test, this Wave Network was really good at building stories almost as well as the huge box of LEGOs, even though it used a lot fewer blocks and took less time!
meulsie@reddit
Damn, this was a good ELI5
emteedub@reddit
How about a NotebookLM podcast to listen to (only up for 24hrs):
https://jmp.sh/s/IRPo1PfhCaQH1Z4VYFtT
Lissanro@reddit
This podcast is quite good! I think it deserves to exist for more than 24 hours, so I hosted it here: https://dragon.studio/2024/11/Wave_Network.wav (in case it ever goes down, here is a backup link at the web archive).
schlammsuhler@reddit
This is the opposite of Nvidia's proposal to even ditch the magnitude by limiting vectors to a hypersphere.
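A tiny sketch of the two opposite moves being compared: projecting vectors onto the unit hypersphere throws the magnitude away, while the wave-style representation keeps magnitude and phase as separate channels. This is purely illustrative and assumes nothing about either paper's actual code.

```python
import numpy as np

def on_hypersphere(v: np.ndarray) -> np.ndarray:
    """Keep only direction: unit-norm vector, magnitude information is discarded."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def magnitude_and_phase(z: np.ndarray):
    """Keep both: split a complex-valued representation into magnitude and phase."""
    return np.abs(z), np.angle(z)

v = np.array([3.0, 4.0])
print(on_hypersphere(v))                             # [0.6 0.8] -- unit length, magnitude gone
print(magnitude_and_phase(np.array([3.0 + 4.0j])))   # (array([5.]), array([0.927...]))
```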
Jazzlike_Tooth929@reddit
Why do you believe it's revolutionary?
kevofasho@reddit
A 50-fold reduction in size means GPT-4-size models can run on enthusiast-level hardware, which effectively democratizes AI
mrjackspade@reddit
For what, the six months before OpenAI trains their own model on this architecture at GPT-4 sizes, effectively obsoleting open source again?
That doesn't "effectively democratize" anything; at best it's a brief period of catch-up.
gavff64@reddit
I’d imagine it would just plateau at some point. At that scale it’s more about efficiency than stronger scores.
dogcomplex@reddit
It also means 50x as many models can be run at once, or 1/50th the energy cost... this is a very big achievement if it checks out
Robonglious@reddit
I've been trying to do something like this for a month. I was trying to convert bert embeddings rather than create my own though.
I'm excited to read this paper once I get back to my computer.
noprompt@reddit
Global/Local semantic transforms have always made sense to me. Sarcasm is the first thing that comes to mind. The meaning of tokens depends on their global transform in semantic space and their local transforms with respect to each other. Kinda like how objects are parented in 3d modeling, etc.
drooolingidiot@reddit
Have they released any code to go along with the paper?
FesseJerguson@reddit
not that i could find
msbeaute00000001@reddit
The idea of complex-number representations has been around since the early 2020s already. How is the Wave Network different? Does it scale well? An advantage of Transformers is that they scale really well.
dogcomplex@reddit
Alright, the other comments are pretty rude but if these findings check out this is pretty huge and deserves the "All You Need" nod.
How does this compare to ternary bit representations? Is that on your radar (Addition Is All You Need)?
FullOf_Bad_Ideas@reddit
The charts are wacky. The modulation accuracy chart stays flat as the model is being trained - it hits 90% after just 10 steps and doesn't improve later. And the transformer gets worse and worse accuracy as training progresses.
Samathura@reddit
I would love to understand this concept in more detail. I will go read the paper, but if you are open to it I would love to discuss how common additions like LoRAs and context cards are impacted.