
Turns out, Google Colab isn’t as GPU Poor!

Posted by vaibhavs10@reddit | LocalLLaMA | View on Reddit | 33 comments

An obligatory thanks to Google Colab for providing free GPUs to the community!


33 Comments

FullOf_Bad_Ideas@reddit

I didn't know T4 was over 50% faster than RTX 3090.

ben_g0@reddit

Not really. What is compared here is the peak FP16 performance, as listed. However, what this number means seems to differ between cards intended for the professional market and those intended for the gaming market:

**Tesla T4**

* Peak FP16 TFLOPS (non-tensor): 16.2
* **Peak *tensor* FP16 TFLOPS with FP32 accumulate: 65** (this is the listed "FP16 TFLOPS")

**RTX 3090**

* **Peak FP16 TFLOPS (non-tensor): 35.58** (this is the listed "FP16 TFLOPS")
* Peak *tensor* FP16 TFLOPS with FP32 accumulate: 70

So the 3090 actually has slightly more FP16 compute performance when using the tensor cores, and double the FP16 compute performance when not using the tensor cores. The memory bandwidth of the 3090 is also almost 3x higher than that of the T4 (936GB/s vs 320GB/s), so it is probably able to get closer to that theoretical performance more often. Combine that with the 50% extra memory capacity the 3090 has (24GB on the 3090 vs 16GB on the T4, though on Colab that last gigabyte is system-reserved), and the 3090 is clearly the superior card.
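The comparison above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the spec-sheet figures quoted in this comment; the dictionary layout and the `ratio` helper are illustrative names, not part of any tool mentioned in the thread.

```python
# Spec-sheet figures as quoted in the comment (TFLOPS, GB/s, GB).
specs = {
    "Tesla T4": {
        "fp16_non_tensor": 16.2,
        "fp16_tensor": 65.0,
        "bandwidth_gbs": 320.0,
        "vram_gb": 16.0,
    },
    "RTX 3090": {
        "fp16_non_tensor": 35.58,
        "fp16_tensor": 70.0,
        "bandwidth_gbs": 936.0,
        "vram_gb": 24.0,
    },
}


def ratio(metric: str) -> float:
    """Return the RTX 3090 figure divided by the Tesla T4 figure."""
    return specs["RTX 3090"][metric] / specs["Tesla T4"][metric]


# Like-for-like comparisons: the same definition on both cards.
for metric in specs["Tesla T4"]:
    print(f"{metric}: 3090 is {ratio(metric):.2f}x the T4")

# The mixed comparison: T4 tensor figure against the 3090 non-tensor figure.
mixed = specs["Tesla T4"]["fp16_tensor"] / specs["RTX 3090"]["fp16_non_tensor"]
print(f"mixed definitions: T4 appears {mixed:.2f}x the 3090")
```

With like-for-like definitions every ratio favors the 3090; only the mixed comparison makes the T4 look faster.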

FullOf_Bad_Ideas@reddit

Thanks for explaining this, I easily get lost in the TFLOPS and TOPS. Do you think it would make sense for the Hugging Face team to adjust the values they use for those GPUs so that they all follow one definition?

ben_g0@reddit

It would be nice if they followed a consistent definition, but it doesn't really matter all that much. Even with a consistent definition, actual performance can still vary a whole lot based on the memory bandwidth, cache, which quantization you use, which driver you have installed, and a lot of other factors. Performance can't really be expressed objectively in a single number anyway, so this comparison is mostly just for fun or to get a very rough idea of how powerful your setup is compared to the average. And for that, I don't blame them for just using whichever number the manufacturer lists in the specs.
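One way to see why a single TFLOPS figure is only a rough guide: for single-stream token generation, a common rule of thumb is that every generated token requires reading all model weights from VRAM once, so memory bandwidth sets a ceiling on tokens per second regardless of compute. The sketch below applies that rule of thumb to the bandwidth figures quoted earlier in this thread. The 8B parameter count and 4-bit quantization are assumptions chosen for illustration, the function name is hypothetical, and real throughput will be lower than this ceiling.

```python
def decode_ceiling_tokens_per_s(
    bandwidth_gb_s: float, params_billion: float, bits_per_param: float
) -> float:
    """Upper bound on tokens/s if each token reads every weight once.

    Ignores KV cache traffic, activations, and kernel overhead, so it is
    a ceiling rather than a prediction.
    """
    # 1e9 params * (bits / 8) bytes per param = size in GB (decimal).
    weight_gb = params_billion * bits_per_param / 8
    return bandwidth_gb_s / weight_gb


# Assumed workload: an 8B-parameter model quantized to 4 bits per weight.
for name, bandwidth in [("Tesla T4", 320.0), ("RTX 3090", 936.0)]:
    ceiling = decode_ceiling_tokens_per_s(bandwidth, params_billion=8, bits_per_param=4)
    print(f"{name}: at most ~{ceiling:.0f} tokens/s")
```

Under these assumptions the gap between the two cards tracks the bandwidth ratio, not the listed FP16 TFLOPS.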

danielhanchen@reddit

Absolutely love Colab!! T4s are 65 TFLOPs and they provide some free hours. Did you guys know Kaggle has 30 free hours of Tesla T4s?? Also TPUs are free on Kaggle?!

Even better, by using [Unsloth](https://github.com/unslothai/unsloth), a 65 TFLOP card converts "magically" to a 130 TFLOP card, since Unsloth makes **finetuning 2x faster** and uses 70% less VRAM with no degradation in accuracy :) Also 4x longer context lengths are possible!

I attached all our free 2x faster finetuning Colabs below:

* [Colab for Llama-3](https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing)
* [Colab for Llama-3 Instruct ShareGPT style](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing)
* [Colab for Mistral v3](https://colab.research.google.com/drive/1_yNCks4BTD5zOnjozppphh5GzMFaMKq_?usp=sharing)
* [Colab for Mistral v3 ChatML ShareGPT style](https://colab.research.google.com/drive/15F1xyn8497_dUbxZP4zWmPZ3PJx1Oymv?usp=sharing)
* [Colab for Phi-3 Mini](https://colab.research.google.com/drive/1NvkBmkHfucGO3Ve9s1NKZvMNlw5p83ym?usp=sharing)
* [Colab for ORPO](https://colab.research.google.com/drive/11t4njE3c4Lxl-07OD8lJSMKkfyJml3Tn?usp=sharing)
* [Colab for DPO](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
* [Colab for Gemma](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing)
* [Text Completion notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
* [Pure 2x faster inference notebook](https://colab.research.google.com/drive/1aqlNQi7MMJbynFDyOQteD2t0yVfjb9Zh?usp=sharing)

saved_you_some_time@reddit

This is really great. Is there a catch?

danielhanchen@reddit

No catch at all! Unsloth does not do any approximations, so there is 0% accuracy degradation. So you get 2x faster finetuning and use 70% less VRAM with no accuracy degradation!

saved_you_some_time@reddit

This sounds really too good to be true. Is there documentation of the methods/techniques they are using?

danielhanchen@reddit

Oh we have a Hugging Face blog https://huggingface.co/blog/unsloth-trl which might be helpful. We also did a PyTorch presentation in one of their meetings here: https://x.com/danielhanchen/status/1787103453011185977 if you're interested :)

saved_you_some_time@reddit

Awesome work! Will check them out

danielhanchen@reddit

Thanks!!

Satyam7166@reddit

Hey Daniel, congrats on the great work done on Unsloth. Was wondering if you have plans to bring Unsloth to Apple devices? Using the cloud to finetune goes against my company's policy and we don't have a Windows device xD

vaibhavs10@reddit (OP)

Sorry if this is not in-policy for the subreddit - I found the results quite amusing, hence wanted to share. You can try different configurations here: https://huggingface.co/settings/local-apps

MrObsidian_@reddit

mfw my 1660 Ti isn't there

Caffdy@reddit

How does it work? The "add button" is locked

kristaller486@reddit

Lol 3060 isn't even on the list💀

severo_bo@reddit

just added: [https://github.com/huggingface/huggingface.js/pull/695](https://github.com/huggingface/huggingface.js/pull/695)

Impossible_Belt_7757@reddit

Don’t you have to account for it being in a virtualized environment? Pretty sure this is why my 3060 is still faster than a Google Colab, but I might be dumb lol

Open_Channel_8626@reddit

I liked the phrase VRAMlet for GPU poor but I don't know what the opposite of VRAMlet is

MrVodnik@reddit

vRAMbo

__some__guy@reddit

There isn't really an opposite term for manlet. The VRAMchad suggestion seems alright.

Atom_101@reddit

VRAMchad

Open_Channel_8626@reddit

Maybe this is the one

nero10578@reddit

Why is it scaling only based on FLOPS lol, we are more in a VRAM famine than a TFLOPS shortage.
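The capacity point can be made concrete with weights-only arithmetic: parameters times bytes per parameter. This is a rough sketch under stated assumptions. The model sizes (8B and 70B) and the helper names are illustrative, the 15 GB figure is the usable T4 memory on Colab mentioned earlier in the thread, and the estimate ignores KV cache, activations, and framework overhead, so real requirements are higher.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Weights-only footprint in decimal GB (1e9 params * bytes per param / 1e9)."""
    return params_billion * BYTES_PER_PARAM[dtype]


def fits(params_billion: float, dtype: str, vram_gb: float) -> bool:
    """True if the weights alone fit; a necessary condition, not a sufficient one."""
    return weight_gb(params_billion, dtype) < vram_gb


# Usable VRAM per card, per the figures quoted in this thread.
cards = {"Tesla T4 (Colab)": 15.0, "RTX 3090": 24.0}

for params in (8, 70):  # assumed model sizes, in billions of parameters
    for dtype in BYTES_PER_PARAM:
        size = weight_gb(params, dtype)
        verdicts = ", ".join(
            f"{card}: {'fits' if fits(params, dtype, vram) else 'too big'}"
            for card, vram in cards.items()
        )
        print(f"{params}B {dtype}: {size:.1f} GB -> {verdicts}")
```

No amount of extra TFLOPS helps when the weights do not fit, which is the point of the comment above.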

MrVodnik@reddit

Yep, if we only could run all of these great models, I am sure they'd answer super fast to all the questions we have!

mpasila@reddit

So it's just mid.

Atupis@reddit

GPU middle class.

vaibhavs10@reddit (OP)

GPU Mid, yes!

Everlier@reddit

I'm most certain Google Colab is more GPU Rich than any one of us non-Googlers can imagine

vaibhavs10@reddit (OP)

Google Colab literally taught me ML :p Chris Perry and the team are literally 🐐

tutu-kueh@reddit

Colab is good enough for simple prototyping. Definitely can't train or fine-tune an 8B LLM, for sure

vaibhavs10@reddit (OP)

You can try with Unsloth - they have some 7B/8B LLM finetuning Colabs that work quite well.

nikitastaf1996@reddit

I couldn't run Llama 3 on it. It's barely adequate now.