You can now do FP8 reinforcement learning locally! (<5GB VRAM)
Posted by danielhanchen@reddit | LocalLLaMA | View on Reddit | 61 comments
Hey r/LocalLlama! We're getting close to our last release of 2025! Thanks so much for all the support this year. The DeepSeek team back in Jan showcased how powerful FP8 RL can be with GRPO. Well, you can now try it on your local hardware using only 5GB VRAM! RTX 50x, 40x series all work!
Why should you do FP8 training?
NVIDIA's research finds FP8 training can match BF16 accuracy whilst getting 1.6x faster inference time. We collabed with TorchAO from PyTorch to introduce FP8 RL training, making FP8 GRPO possible on home GPUs with no accuracy loss!
- Qwen3-4B FP8 GRPO works on just 6GB VRAM. Qwen3-1.7B on 5GB
- 1.4× faster RL training and 2× longer context vs BF16/FP16
- 60% less VRAM and 10× longer context than other FP8 RL implementations
- Unsloth is the only framework that makes FP8 RL LoRA work on consumer GPUs (e.g. NVIDIA RTX 40 & 50 Series). Also runs on H100, H200, B200.
- You may notice Unsloth now uses much less VRAM than before, enabling even longer context. We’re also implementing faster training soon. Blog coming soon
- Our notebooks use 24GB L4s which fit Qwen3-14B as Tesla T4s don’t support FP8.
- Our FP8 RL incorporates Unsloth’s weight sharing, Standby, Flex Attention + more.
- Works on any NVIDIA RTX 40, 50 series and H100, B200 etc. GPUs
- Use load_in_fp8 = True within FastLanguageModel to enable FP8 RL.
You can read our blogpost for our findings and more: https://docs.unsloth.ai/new/fp8-reinforcement-learning
Llama 3.2 1B FP8 Colab Notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama_FP8_GRPO.ipynb
In the notebook, you can plug in any of our previous reward functions or RL environment examples, including our auto kernel creation and our 2048 game notebooks. To enable fp8:
import os; os.environ['UNSLOTH_VLLM_STANDBY'] = "1" # Saves 30% VRAM
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/Qwen3-8B",
max_seq_length = 2048,
load_in_4bit = False, # False for LoRA 16bit
fast_inference = True, # Enable vLLM fast inference
max_lora_rank = 32,
load_in_fp8 = True, # Float8 RL / GRPO!
)
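From here the wiring is the same as in the other Unsloth GRPO notebooks: attach a LoRA adapter, define one or more reward functions, and pass everything to trl's GRPOTrainer. A minimal sketch - the reward logic, hyperparameters, and the dataset variable below are placeholders, and it assumes the standard (non-chat) dataset format where completions arrive as plain strings:
from trl import GRPOConfig, GRPOTrainer

# Attach a LoRA adapter - Unsloth's FP8 RL is LoRA-based
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,                                   # LoRA rank, <= max_lora_rank above
    lora_alpha = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)

# Toy reward: prefer shorter completions (swap in any reward function you like)
def short_answer_reward(completions, **kwargs):
    return [-len(c) / 1000.0 for c in completions]

trainer = GRPOTrainer(
    model = model,
    reward_funcs = [short_answer_reward],
    args = GRPOConfig(
        max_prompt_length = 512,
        max_completion_length = 512,
        num_generations = 4,                  # completions sampled per prompt
        per_device_train_batch_size = 4,      # effective batch size should divide evenly by num_generations
        learning_rate = 5e-6,
        max_steps = 50,
        output_dir = "outputs",
    ),
    train_dataset = dataset,                  # placeholder: needs a "prompt" column
)
trainer.train()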
Hope you all have a lovely Thanksgiving, a lovely rest of the week and I'll be here to answer any and all questions! =)
SykenZy@reddit
This is awesome, will check it as soon as docker image is out....
Is there a plan to support diffusion models? Flux 2.0 is out but FP16 is like 64 GB, FP4 with unsloth performance improvements might be awesome
exaknight21@reddit
As someone who is a complete fan of unsloth, qwen3:4b (specifically), AND a proud owner of my twins (2x3060 @ 12 GB)... I am looking forward to playing with this and actually contributing to the community. I have about 200 GB of construction data that I plan on using for fine-tuning with the LIMA approach.
markovianmind@reddit
mind sharing what sort of data it is and what you plan on doing with it?
exaknight21@reddit
All in due time friend. It’s raw and confidential data, but construction related. Immediate project is a complete self hosted RAG app, currently in a non-LocalLLaMA state.
ANR2ME@reddit
I wonder why they didn't mention the RTX 30 series 🤔
exaknight21@reddit
It’s the slow bandwidth and lack of NVLink. AI Researchers are not going to use a 3060 to work with the insane requirements of a model. 3090s have higher bandwidth and NVLink hence why it is worth a shot for them to at least try, which I think is what daniel is trying to do.
In any event. Probably a cause for povertyAI.
I was able to fine-tune a model (I don't remember which) as a test on my 3060 setup - so it is not "impossible" - I used QLoRA from unsloth at the time. I'm working on a RAG app right now, but as I learn, I will share my approaches. It was a dataset of 10 "items", and it took about 2 mins to fine-tune... not to scale, my memory is essentially Q1.5 right now.
ItsAMeUsernamio@reddit
Hardware FP8 support was added with the 40 series.
Similarly Hardware FP4 was added with the 50 series and will give a big leap in performance with Pytorch 2.10.
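(As a quick sanity check, hardware FP8 corresponds to CUDA compute capability 8.9 and up - Ada / RTX 40 series, Hopper, Blackwell - while Ampere RTX 30 cards are 8.6. A small illustrative snippet:)
import torch

major, minor = torch.cuda.get_device_capability()
# FP8 tensor cores arrived with compute capability 8.9 (Ada / RTX 40 series);
# Ampere (8.6, RTX 30 series) and older lack the hardware FP8 path.
supports_fp8 = (major, minor) >= (8, 9)
print(torch.cuda.get_device_name(), f"sm_{major}{minor}",
      "FP8 supported" if supports_fp8 else "no hardware FP8")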
danielhanchen@reddit (OP)
OO that's a lot of data!! Hope it works well! Although sadly FP8 won't work that well on a 3060 :( - I actually should launch a 3090 and check - I might be able to still make FP8 work :)
mister2d@reddit
That would be great to know for my dual 3060 rig as well. But I'm not suggesting you go out of your way.
yoracale@reddit
Amazing to hear! Even if not contributing just showing your support is enough! 🥰♥️
MrRandom04@reddit
Holy moly, an RL-finetuned 4B Qwen could actually be useful for real tasks. Being able to do that on my lowly laptop GPU would be amazing.
SlowFail2433@reddit
Ye, there's already models on Hugging Face like that, e.g. the Jan AI ones
Training_Pudding9338@reddit
for example?
shapic@reddit
Are vllm also supported?
AbaGuy17@reddit
Will this work on any training? I tried training a GPT-2 model on Game Boy byte music and it worked in principle - using this I could train in FP8, right?
solomars3@reddit
I think the problem is that it's limited to only a few models. Unsloth doesn't support all model architectures - last time I tried, I was forced to use one of the templates for the supported models.
tifa_cloud0@reddit
awesome fr. by less than or equal to 5gb vram, do you mean it can also work on gtx 16 series cards which have 4gb vram ?
danielhanchen@reddit (OP)
It'll only work on GPUs that support FP8 unfortunately so any GPU after RTX 40 series BUT, if you want to do normal GRPO, it will work yes. Read more: https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide
tifa_cloud0@reddit
true fr
swashed-up-01@reddit
Guys, how well would a finetuned 4B model perform on custom datasets given enough data? Better than out-of-the-box LLMs like GPT-5, and would it match reasoning models?
DaniyarQQQ@reddit
That looks amazing. I'm sorry but I don't quite follow the development of your lib. I know that it is used for training. Can it be used as a backend to launch these models?
danielhanchen@reddit (OP)
Oh no worries! Unsloth https://github.com/unslothai/unsloth makes finetuning & training 2x faster and uses 60% less memory - we also support reinforcement learning, which is also faster and uses less VRAM
You can technically strip the inference part out of Unsloth - I do plan to make it portable so you can use it simply as an inference server in the near future if that helps!
Ofacon@reddit
Sincerely, thank you for supporting and engaging with community so much. It’s a gift.
Scy73@reddit
This work is amazing, thank you for sharing with us.
danielhanchen@reddit (OP)
Thanks for the support and for reading
Famous-Appointment-8@reddit
MLX Support?
yoracale@reddit
Not at the moment but we hope to support it early next year! We still haven't officially announced AMD or Intel support yet so hopefully we get that done first 🙏
Insipidity@reddit
So MacBook users are unable to run RL ? How about other features in Unsloth like fine-tuning?
danielhanchen@reddit (OP)
Sadly we don't support Mac at this moment - we're working on it though - best to check out MLX in the meantime sorry!
bhupesh-g@reddit
Is there any timeline for Mac support? Most devs use a Mac for day-to-day work, and enabling them to use unsloth for training and finetuning would be so cooool. BTW I love unsloth 😍
Famous-Appointment-8@reddit
Awesome thanks for all your effort!
_VirtualCosmos_@reddit
Does it work with llama.cpp for the inference part too, or is vLLM required? Would be cool to use the layer/expert offloading of llama.cpp to train big models with little VRAM, and gguf models.
danielhanchen@reddit (OP)
The goal was to make a llama.cpp backend, but in the meantime currently no sorry :(
_VirtualCosmos_@reddit
Thanks for the reply! So you tried to do that but discovered it would be too hard and thus changed to vLLM? Or something like that? Are you planning to still try it again?
AIMadeSimple@reddit
This is huge for democratizing AI. When RL training drops from enterprise H100s to consumer RTX 40x series, you fundamentally shift who can innovate. The gap between "AI researcher" and "person with a gaming PC" just collapsed. FP8 at <5GB VRAM means experimentation becomes accessible, not just deployment. This is how open source catches up to closed models.
IrisColt@reddit
Thanks!!!!
Barachiel80@reddit
any chance you have plans for ROCM support?
yoracale@reddit
Should already work we just haven't officially announced it: https://docs.unsloth.ai/get-started/install-and-update/amd
_VirtualCosmos_@reddit
damn great news, my Strix Halo is close to getting in my home.
danielhanchen@reddit (OP)
Oh nice!
peroperoname@reddit
Have you moved to DAPO loss in your implementation?
danielhanchen@reddit (OP)
Yes! You can set loss_type = "dapo" which will use DAPO! See https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide/advanced-rl-documentation - we also support GSPO and more. GSPO vision notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_VL_(8B)-Vision-GRPO.ipynb
peroperoname@reddit
Noice.
danielhanchen@reddit (OP)
:)
ElekDn@reddit
Looks really cool!! Can we expect 30 series support?
danielhanchen@reddit (OP)
I'll check 30x today and get back to you!
No_Lime_5130@reddit
"<6 GB VRAM" ... at what context length? 128? 512? 8096?
danielhanchen@reddit (OP)
Oh, 1024 context with batch size 2 should work since we offload everything. Longer contexts also work - we're going to release something next week or this week on even longer context support with less memory usage!!
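(For reference, a rough idea of where those numbers plug in - parameter names are the standard Unsloth / trl ones, while the model pick and remaining values are just illustrative:)
from unsloth import FastLanguageModel
from trl import GRPOConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-1.7B",   # illustrative pick for the ~5GB figure
    max_seq_length = 1024,               # the 1024-token context mentioned above
    fast_inference = True,
    load_in_fp8 = True,
)

training_args = GRPOConfig(
    per_device_train_batch_size = 2,     # the batch size 2 mentioned above
    num_generations = 2,                 # keep this dividing the effective batch size evenly
    max_prompt_length = 512,
    max_completion_length = 512,
    output_dir = "outputs",
)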
larrytheevilbunnie@reddit
This will be available in the docker image right?
danielhanchen@reddit (OP)
Yes!! Tonight!
thekalki@reddit
I was exploring a few libraries for full fine-tuning and ended up using TorchTune. Is there a reason why I should switch to Unsloth? At this point I primarily do some continuous pretraining, SFT, and am exploring RL, but how flexible is your framework for running RL in my own loop?
danielhanchen@reddit (OP)
Unfortunately TorchTune is deprecated, so it hasn't been updated in 4 months I think :(
Yes we support continued pretraining, SFT and RL! We have notebooks for all these at https://docs.unsloth.ai/get-started/unsloth-notebooks
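(On the SFT side, the notebooks generally follow the standard trl pattern; a loose sketch assuming an already-loaded Unsloth model with a LoRA adapter and a prepared dataset with a "text" field:)
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,            # newer trl versions call this processing_class
    train_dataset = dataset,          # placeholder: your formatted dataset
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 60,
        learning_rate = 2e-4,
        output_dir = "outputs",
    ),
)
trainer.train()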
Kappalonia@reddit
But wasn't Blackwell the only architecture that supports native fp8? Why use L4s?
Conscious_Chef_3233@reddit
that's fp4.
danielhanchen@reddit (OP)
We plan to support FP4 as well!
yoracale@reddit
Nope, any Nvidia GPU after 40 series supports FP8
Sea-Rope-31@reddit
That's amazing! You're amazing! Thank you, guys!
danielhanchen@reddit (OP)
Thank you!
Educational_Rent1059@reddit
Great work!
yoracale@reddit
Thank you 🙏