CUDA out of memory when fine-tuning a downloaded Llama 3.1 model on an instruction dataset
Posted by Reasonable-Phase1881@reddit | LocalLLaMA | 6 comments
Hi guys, I have an NVIDIA RTX 4090 with 24 GB of VRAM, plus an integrated Intel GPU and 64 GB of memory.
When I run the downloaded Llama 3.1 8B model on a Linux system for fine-tuning on my instruction dataset, I get a CUDA out-of-memory error. Earlier I was getting the error with float32, and then with float16 as well.
I don't want it quantized to 4-bit or 8-bit. To get it running I also tried a 128 GB GPU, and again hit the same CUDA out-of-memory problem.
Should I use vLLM? Is there any code or documentation for fine-tuning without PEFT/LoRA, since I have enough compute memory? I think I need to dig into CUDA memory semantics.
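For context, here is a rough back-of-the-envelope estimate (my own sketch, not from the thread) of why full fine-tuning an 8B model with AdamW overflows 24 GB, and can even crowd 128 GB once activations are added on top:

```python
# Rough memory estimate for full fine-tuning with AdamW (sketch only).
# Counts weights, gradients, and the two AdamW moment buffers per parameter;
# activations, the CUDA context, and fragmentation come on top of this.

params = 8e9  # Llama 3.1 8B

def full_ft_gb(weight_bytes, grad_bytes, optim_bytes):
    return params * (weight_bytes + grad_bytes + optim_bytes) / 1e9

# fp32 everything: 4 B weights + 4 B grads + 8 B AdamW states = 16 B/param
print(f"fp32 full fine-tune: ~{full_ft_gb(4, 4, 8):.0f} GB before activations")

# bf16 weights/grads with fp32 AdamW states (a common mixed-precision setup)
print(f"bf16 mixed precision: ~{full_ft_gb(2, 2, 8):.0f} GB before activations")
```

That works out to roughly 128 GB in fp32 and 96 GB in bf16 before a single activation is stored, which lines up with the OOMs on both the 24 GB and the 128 GB cards.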
whyn0t___@reddit
In torchtune we have many optimization flags if you are running out of memory: https://github.com/pytorch/torchtune#optimization-flags
MugosMM@reddit
Weird. I thought that with a 128 GB GPU you should be able to comfortably full fine-tune this model. Try torchtune; they have recipes you can just run to test.
mpasila@reddit
Idk, it might need a bit more. Someone was training Gemma 2 9B with 4 H100s, so that's 320 GB of memory just to do full fine-tuning, specifically the SimPO variant. (They said they used mixed 32- and 16-bit precision, so that might explain the amount needed.)
Also, I needed to use a single A100 just to fine-tune Gemma 2 2B. I tried it on a 48 GB GPU and it still ran out of memory. Full fine-tuning needs a lot of memory, so it's better to just do LoRA training.
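For reference, a minimal LoRA setup with Hugging Face PEFT looks roughly like the sketch below; the model ID and LoRA hyperparameters are placeholders I've assumed, not something from this thread:

```python
# Minimal LoRA fine-tuning setup (sketch). Only the small adapter matrices are
# trained, so gradients and optimizer states shrink from tens of gigabytes to
# a few hundred megabytes.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",          # assumes access to the gated repo
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_cfg = LoraConfig(
    r=16,                                # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()       # typically well under 1% of the 8B weights
```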
Reasonable-Phase1881@reddit (OP)
https://pytorch.org/torchtune/main/tutorials/chat.html
Should I follow this?
MugosMM@reddit
You can't fine-tune an 8B model in 24 GB without memory-saving tricks. I guess you don't want to use PEFT/QLoRA. Here's a reply I got on another forum: for full fine-tuning of an 8B model with 24 GB, you need:
- gradient checkpointing
- FlashAttention
- bfloat16
- paged AdamW 8-bit
- batch size of 1 (or 2)
- short sequence length (less than 1024, maybe 512 or 256)

So yes, it's possible, but it won't perform well on tasks that process long sequences (see the sketch below for how these settings fit together).
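Wired up in Hugging Face Transformers, those settings could look roughly like this; the model ID, the stand-in dataset, and the hyperparameters are my assumptions, not a tested recipe from this thread:

```python
# Full fine-tuning sketch with the memory-saving settings listed above.
# Assumes transformers, datasets, flash-attn, and bitsandbytes are installed.
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_id = "meta-llama/Llama-3.1-8B"   # assumes access to the gated repo

tok = AutoTokenizer.from_pretrained(model_id)
tok.pad_token = tok.eos_token          # Llama tokenizers ship without a pad token

# Tiny stand-in for the instruction dataset, truncated to a short sequence length.
texts = ["Instruction: say hello.\nResponse: Hello!"]
enc = tok(texts, truncation=True, max_length=512, padding="max_length")
train_dataset = Dataset.from_dict({"input_ids": enc["input_ids"],
                                   "attention_mask": enc["attention_mask"],
                                   "labels": enc["input_ids"]})

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,                  # bfloat16 weights
    attn_implementation="flash_attention_2",     # FlashAttention kernels
)

args = TrainingArguments(
    output_dir="llama31-8b-full-ft",
    per_device_train_batch_size=1,               # batch size of 1
    gradient_accumulation_steps=16,              # recover a usable effective batch
    bf16=True,
    gradient_checkpointing=True,                 # trade compute for activation memory
    optim="paged_adamw_8bit",                    # paged 8-bit AdamW via bitsandbytes
    learning_rate=1e-5,
    num_train_epochs=1,
    logging_steps=10,
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()
```

Even with all of this enabled, whether it actually fits in 24 GB depends on library versions and the activation footprint at your chosen sequence length.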
Reasonable-Phase1881@reddit (OP)
Thanks. I also have another server with a 128 GB GPU, and I am getting the same errors on that as well. I will try your tips and let you know.