Turns out, Google Colab isn’t as GPU Poor!

FullOf_Bad_Ideas@reddit

I didn't know T4 was over 50% faster than RTX 3090.

ben_g0@reddit

Not really. What is compared here is the peak FP16 performance, as listed. However, what this number means seems to differ between cards intended for the professional market and those intended for the gaming market:

**Tesla T4**

* Peak FP16 TFLOPS (non-tensor): 16.2
* **Peak *tensor* FP16 TFLOPS with FP32 accumulate: 65** (this is the listed "FP16 TFLOPS")

**RTX 3090**

* **Peak FP16 TFLOPS (non-tensor): 35.58** (this is the listed "FP16 TFLOPS")
* Peak *tensor* FP16 TFLOPS with FP32 accumulate: 70

So the 3090 actually has slightly more FP16 compute performance when using the tensor cores, and about double the FP16 compute performance when not using the tensor cores.

The memory bandwidth of the 3090 is also almost 3x higher than that of the T4 (936GB/s vs 320GB/s), so it is probably able to get closer to that theoretical performance more often. Combine that with the 50% extra memory capacity the 3090 has (24GB on the 3090 vs 16GB on the T4, though on Colab that last gigabyte is system-reserved), and the 3090 is clearly the superior card.
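To make the comparison concrete, here is a small sketch that recomputes the ratios from the figures quoted above. The dictionary layout and the function names are just illustrative choices; the numbers are the ones in this comment, not independently measured values.

```python
# Peak figures as quoted in the comment (TFLOPS, GB/s, GB).
SPECS = {
    "Tesla T4": {
        "fp16_non_tensor": 16.2,
        "fp16_tensor": 65.0,
        "bandwidth_gb_s": 320.0,
        "vram_gb": 16.0,
        "listed_fp16": 65.0,   # the listing uses the tensor-core figure
    },
    "RTX 3090": {
        "fp16_non_tensor": 35.58,
        "fp16_tensor": 70.0,
        "bandwidth_gb_s": 936.0,
        "vram_gb": 24.0,
        "listed_fp16": 35.58,  # the listing uses the non-tensor figure
    },
}


def ratio(metric, a="RTX 3090", b="Tesla T4"):
    """Return how many times larger `metric` is on card `a` than on card `b`."""
    return SPECS[a][metric] / SPECS[b][metric]


if __name__ == "__main__":
    # Like-for-like comparisons: the 3090 is ahead on every metric.
    for metric in ("fp16_non_tensor", "fp16_tensor", "bandwidth_gb_s", "vram_gb"):
        print(f"{metric:>16}: 3090 / T4 = {ratio(metric):.2f}x")
    # Mixed-definition comparison: this is where the T4 appears to "win".
    print(f"listed FP16: T4 / 3090 = {ratio('listed_fp16', 'Tesla T4', 'RTX 3090'):.2f}x")
```

The only comparison in which the T4 comes out ahead is the last one, where a tensor-core figure is set against a non-tensor figure.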

FullOf_Bad_Ideas@reddit

Thanks for explaining this, I easily get lost in the TFLOPS and TOPS. Do you think it would make sense for the Hugging Face team to adjust the values they use for those GPUs so that they follow one definition uniformly?

ben_g0@reddit

It would be nice if they followed a consistent definition, but it doesn't really matter all that much. Even with a consistent definition, actual performance can still vary a whole lot based on the memory bandwidth, cache, which quantization you use, which driver you have installed, and a lot of other factors. Performance can't really be expressed objectively in a single number anyway, so this comparison is mostly just for fun or to get a very rough idea of how powerful your setup is compared to the average. And for that, I don't blame them for just using whichever number the manufacturer lists in the specs.
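As one illustration of why a single TFLOPS number says little, here is a rough sketch of a bandwidth-only bound on text generation speed. It rests on a common simplifying assumption that is not from this thread: during single-stream decoding, every weight is read from VRAM once per generated token, so speed is capped at bandwidth divided by model size. The function name and the 8B / 4-bit example are hypothetical, and real throughput will be lower once the KV cache, kernel efficiency, and compute are taken into account.

```python
def decode_tokens_per_second_bound(bandwidth_gb_s, n_params, bits_per_weight):
    """Upper bound on tokens/s if each weight is read once per token.

    Assumption: decoding is purely memory-bandwidth bound; KV cache reads,
    activations, and compute time are ignored.
    """
    weight_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


if __name__ == "__main__":
    # Bandwidth figures quoted earlier in the thread.
    for card, bandwidth in (("Tesla T4", 320.0), ("RTX 3090", 936.0)):
        bound = decode_tokens_per_second_bound(bandwidth, 8e9, 4)
        print(f"{card}: at most ~{bound:.0f} tokens/s for an 8B model at 4 bits")
```

Under this assumption the gap between the two cards follows the roughly 3x bandwidth ratio, regardless of which FP16 TFLOPS figure is listed.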

danielhanchen@reddit

Absolutely love Colab!! T4s are 65 TFLOPs and they provide some free hours. Did you guys know Kaggle has 30 hours of free Tesla T4s?? TPUs are also free on Kaggle?!

Even better, by using [Unsloth](https://github.com/unslothai/unsloth), a 65 TFLOP card converts "magically" to a 130 TFLOP card, since Unsloth makes **finetuning 2x faster** and uses 70% less VRAM with no degradation in accuracy :) Also 4x longer context lengths are possible!

I attached all our free 2x faster finetuning Colabs below:

* [Colab for Llama-3](https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing)
* [Colab for Llama-3 Instruct ShareGPT style](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing)
* [Colab for Mistral v3](https://colab.research.google.com/drive/1_yNCks4BTD5zOnjozppphh5GzMFaMKq_?usp=sharing)
* [Colab for Mistral v3 ChatML ShareGPT style](https://colab.research.google.com/drive/15F1xyn8497_dUbxZP4zWmPZ3PJx1Oymv?usp=sharing)
* [Colab for Phi-3 Mini](https://colab.research.google.com/drive/1NvkBmkHfucGO3Ve9s1NKZvMNlw5p83ym?usp=sharing)
* [Colab for ORPO](https://colab.research.google.com/drive/11t4njE3c4Lxl-07OD8lJSMKkfyJml3Tn?usp=sharing)
* [Colab for DPO](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
* [Colab for Gemma](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing)
* [Text Completion notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
* [Pure 2x faster inference notebook](https://colab.research.google.com/drive/1aqlNQi7MMJbynFDyOQteD2t0yVfjb9Zh?usp=sharing)

saved_you_some_time@reddit

This is really great. Is there a catch?

danielhanchen@reddit

No catch at all! Unsloth does not do any approximations, so there is 0% accuracy degradation. So you get 2x faster finetuning and 70% less VRAM use with no loss in accuracy!

saved_you_some_time@reddit

This sounds really too good to be true. Is there any documentation of the methods/techniques they are using?

danielhanchen@reddit

Oh we have a Hugging Face blog https://huggingface.co/blog/unsloth-trl which might be helpful. We also did a PyTorch presentation in one of their meetings here: https://x.com/danielhanchen/status/1787103453011185977 if you're interested :)

saved_you_some_time@reddit

Awesome work! Will check them out

danielhanchen@reddit

Thanks!!

Satyam7166@reddit

Hey Daniel, congrats on the great work done on Unsloth. Was wondering if you have plans to bring Unsloth to Apple devices? Using the cloud to finetune goes against my company's policy and we don't have a Windows device xD

vaibhavs10@reddit (OP)

Sorry if this is not in-policy for the sub-reddit - I found the results quite amusing hence wanted to share. You can try different configurations here: https://huggingface.co/settings/local-apps

MrObsidian_@reddit

mfw my 1660 Ti isn't there

Caffdy@reddit

how does it work? the "add button" is locked

kristaller486@reddit

Lol 3060 isn't even on the list💀

severo_bo@reddit

just added: [https://github.com/huggingface/huggingface.js/pull/695](https://github.com/huggingface/huggingface.js/pull/695)

Impossible_Belt_7757@reddit

Don’t you have to account for it being in a virtualized environment? Pretty sure this is why my 3060 is still faster than a Google Colab, but I might be dumb lol

Open_Channel_8626@reddit

I liked the phrase VRAMlet for GPU poor but I don't know what the opposite of VRAMlet is

MrVodnik@reddit

vRAMbo

someguy@reddit

There isn't really an opposite term for manlet. The VRAMchad suggestion seems alright.

Atom_101@reddit

VRAMchad

Open_Channel_8626@reddit

Maybe this is the one

nero10578@reddit

Why is it scaling only based on FLOPS lol, we are more in a VRAM famine than a TFLOPS shortage.
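A back-of-the-envelope sketch of the VRAM side: weight memory is just parameter count times bits per weight. The function names and the 2 GB headroom default are hypothetical placeholders for KV cache, activations, and framework overhead, which vary a lot in practice; the 16GB and 24GB capacities are the T4 and 3090 figures quoted earlier in the thread.

```python
def weight_gb(n_params, bits_per_weight):
    """Memory needed for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_in_vram(n_params, bits_per_weight, vram_gb, headroom_gb=2.0):
    """Rough check: weights plus an assumed fixed headroom must fit in VRAM.

    `headroom_gb` is an illustrative guess, not a measured overhead.
    """
    return weight_gb(n_params, bits_per_weight) + headroom_gb <= vram_gb


if __name__ == "__main__":
    for n_params, bits in ((8e9, 16), (8e9, 4), (70e9, 4)):
        size = weight_gb(n_params, bits)
        t4 = fits_in_vram(n_params, bits, 16.0)
        rtx3090 = fits_in_vram(n_params, bits, 24.0)
        print(f"{n_params / 1e9:.0f}B @ {bits}-bit: {size:.0f} GB weights, "
              f"T4: {t4}, 3090: {rtx3090}")
```

By this estimate an 8B model in FP16 already fills a 16GB card with weights alone, and a 70B model at 4 bits fits on neither card, however many TFLOPS are available.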

MrVodnik@reddit

Yep, if we only could run all of these great models, I am sure they'd answer super fast to all the questions we have!

mpasila@reddit

So it's just mid.

Atupis@reddit

GPU middle class.

vaibhavs10@reddit (OP)

GPU Mid, yes!

Everlier@reddit

I'm most certain Google Colab is more GPU Rich than any one of us non-Googlers can imagine

vaibhavs10@reddit (OP)

Google Colab literally taught me ML :p Chris Perry and the team are literally 🐐

tutu-kueh@reddit

Colab is good enough for simple prototyping. Definitely can't train or fine-tune an 8B LLM, for sure

vaibhavs10@reddit (OP)

You can try with Unsloth - they have some 7/8B LLM finetuning colabs that work quite well.
