Instead of saving all n layer activations for backpropagation, you save only every sqrt(n)-th one and recompute the missing ones as you need them.
TL;DR: activation memory drops from n to about 2 sqrt(n), at the cost of roughly one extra forward pass (so forward compute about doubles).
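If it helps to see the idea in code, here is a minimal sketch in plain Python. It is a toy, not how any framework actually implements it: the "layers" are scalar functions paired with their derivatives, and the name `checkpointed_grad` plus the way activations and forward calls are counted are my own assumptions for illustration.

```python
import math

def checkpointed_grad(layers, x):
    """Toy sqrt(n) checkpointing over a chain of scalar layers.

    layers: list of (f, df) pairs, where df is the derivative of f.
    Returns (output, d_output/d_x, peak stored activations, forward calls).
    """
    n = len(layers)
    k = max(1, math.isqrt(n))  # segment length, about sqrt(n)

    # Forward pass: keep only the input to every k-th layer.
    checkpoints = {}
    forward_calls = 0
    a = x
    for i, (f, _) in enumerate(layers):
        if i % k == 0:
            checkpoints[i] = a
        a = f(a)
        forward_calls += 1
    out = a

    # Backward pass: go through segments last to first, recompute the
    # activations inside one segment, then apply the chain rule to it.
    grad = 1.0
    peak = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + k, n)
        acts = [checkpoints[start]]  # acts[j] is the input to layer start + j
        for i in range(start, end - 1):
            acts.append(layers[i][0](acts[-1]))
            forward_calls += 1
        peak = max(peak, len(checkpoints) + len(acts))
        for i in range(end - 1, start - 1, -1):
            grad *= layers[i][1](acts[i - start])
        # acts is dropped here, so only one segment is ever live at a time.
    return out, grad, peak, forward_calls

# Example: 16 tanh layers.
tanh_layer = (math.tanh, lambda a: 1.0 - math.tanh(a) ** 2)
out, grad, peak, calls = checkpointed_grad([tanh_layer] * 16, 0.5)
```

With 16 layers this keeps 4 checkpoints plus at most 4 recomputed activations at once (8 instead of 16), and does 28 layer evaluations instead of 16, which stays under the 2x bound.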
The yellow/brownish/orange tint of it all reeks of ChatGPT 4o image generator.
I don't mind AI-generated images, but I hate it when they have a quirk that I can't unsee once I've noticed it...
Hey guys, I haven't posted here in a while, I've been a lot more active over on X, especially since the LLM research scene is much more alive there. Just wanted to cross-post this here as well. I'm the original author of this on [X](https://x.com/TheAhmadOsman/status/1922336545719107759).
> the scariest thing in llms/ai isn't the models or the math
> it's the names
>
> > kv cache prefill strategy
> > multi-head attention with rotary position embeddings
> > fused CUDA kernel for dynamic tensor rematerialization
> > nucleus sampling with temperature scaling and repetition penalty
> > flash attention v2 with block-sparse operations, causal masking, and warp-level primitives
>
> bro they sound like boss fights frfr
Guys, I don't understand the downvotes. I literally copied the entire tweet over here so nobody has to click on anything.
I am also sharing my finding that there is an active research community over there, so that people know to keep their eyes open. I am not advocating for a platform, just sharing something I think is genuinely helpful to the collective knowledge of this community.