You can now do FP8 reinforcement learning locally! (<5GB VRAM)
Posted by danielhanchen@reddit | LocalLLaMA | View on Reddit | 61 comments
Hey r/LocalLlama! We're getting close to our last release of 2025! Thanks so much for all the support this year. The DeepSeek team back in Jan showcased how powerful FP8 RL can be with GRPO. Well, you can now try it on your local hardware using only 5GB VRAM! RTX 50x, 40x series all work!
Why should you do FP8 training?
NVIDIA's research finds FP8 training can match BF16 accuracy whilst getting 1.6x faster inference time. We collabed with TorchAO from PyTorch to introduce FP8 RL training, making FP8 GRPO possible on home GPUs with no accuracy loss!
- Qwen3-4B FP8 GRPO works on just 6GB VRAM. Qwen3-1.7B on 5GB
- 1.4× faster RL training and 2× longer context vs BF16/FP16
- 60% less VRAM and 10× longer context than other FP8 RL implementations
- Unsloth is the only framework that makes FP8 RL LoRA work on consumer GPUs (e.g. NVIDIA RTX 40 & 50 Series). Also runs on H100, H200, B200.
- You may notice Unsloth now uses much less VRAM than before, enabling even longer context. We’re also implementing faster training soon. Blog coming soon
- Our notebooks use 24GB L4s which fit Qwen3-14B as Tesla T4s don’t support FP8.
- Our FP8 RL incorporates Unsloth’s weight sharing, Standby, Flex Attention + more.
- Works on any NVIDIA RTX 40, 50 series and H100, B200 etc. GPUs
- Use load_in_fp8 = True within FastLanguageModel to enable FP8 RL.
You can read our blogpost for our findings and more: https://docs.unsloth.ai/new/fp8-reinforcement-learning
Llama 3.2 1B FP8 Colab Notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama_FP8_GRPO.ipynb
In the notebook, you can plug in any of our previous reward functions or RL environment examples, including our auto kernel creation and our 2048 game notebooks. To enable fp8:
import os; os.environ['UNSLOTH_VLLM_STANDBY'] = "1" # Saves 30% VRAM
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/Qwen3-8B",
max_seq_length = 2048,
load_in_4bit = False, # False for LoRA 16bit
fast_inference = True, # Enable vLLM fast inference
max_lora_rank = 32,
load_in_fp8 = True, # Float8 RL / GRPO!
)
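From here the wiring is the same as in the other Unsloth GRPO notebooks: attach a LoRA adapter, define one or more reward functions, and pass everything to trl's GRPOTrainer. A minimal sketch - the reward logic, hyperparameters, and the dataset variable below are placeholders, and it assumes the standard (non-chat) dataset format where completions arrive as plain strings:
from trl import GRPOConfig, GRPOTrainer

# Attach a LoRA adapter - Unsloth's FP8 RL is LoRA-based
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,                                   # LoRA rank, <= max_lora_rank above
    lora_alpha = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)

# Toy reward: prefer shorter completions (swap in any reward function you like)
def short_answer_reward(completions, **kwargs):
    return [-len(c) / 1000.0 for c in completions]

trainer = GRPOTrainer(
    model = model,
    reward_funcs = [short_answer_reward],
    args = GRPOConfig(
        max_prompt_length = 512,
        max_completion_length = 512,
        num_generations = 4,                  # completions sampled per prompt
        per_device_train_batch_size = 4,      # effective batch size should divide evenly by num_generations
        learning_rate = 5e-6,
        max_steps = 50,
        output_dir = "outputs",
    ),
    train_dataset = dataset,                  # placeholder: needs a "prompt" column
)
trainer.train()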
Hope you all have a lovely Thanksgiving, a lovely rest of the week and I'll be here to answer any and all questions! =)
SykenZy@reddit
This is awesome, will check it as soon as docker image is out....
Is there a plan to support diffusion models? Flux 2.0 is out but FP16 is like 64 GB, FP4 with unsloth performance improvements might be awesome
exaknight21@reddit
As someone who is a complete fan of unsloth, qwen3:4b (specifically), AND a proud owner of my twins (2x3060 @ 12 GB)... I am looking forward to playing with this and actually contributing to the community. I have about 200 GB of construction data that I plan on using for fine-tuning with the LIMA approach.
markovianmind@reddit
mind sharing what sort of data it is and what you plan on doing with it?
exaknight21@reddit
All in due time friend. It’s raw and confidential data, but construction related. Immediate project is a complete self hosted RAG app, currently in a non-LocalLLaMA state.
ANR2ME@reddit
I wonder why they didn't mention the RTX 30 series 🤔
exaknight21@reddit
It’s the slow bandwidth and lack of NVLink. AI Researchers are not going to use a 3060 to work with the insane requirements of a model. 3090s have higher bandwidth and NVLink hence why it is worth a shot for them to at least try, which I think is what daniel is trying to do.
In any event. Probably a cause for povertyAI.
I was able to fine-tune a model (I don't remember which) as a test on my 3060 setup - so it is not "impossible" - I used QLoRA from unsloth at the time. I'm working on a RAG app right now, but as I learn, I will share my approaches. It was a dataset of 10 "items", and it took about 2 mins to fine-tune... not to scale, my memory is essentially Q1.5 right now.
ItsAMeUsernamio@reddit
Hardware FP8 support was added with the 40 series.
Similarly Hardware FP4 was added with the 50 series and will give a big leap in performance with Pytorch 2.10.
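(As a quick sanity check, hardware FP8 corresponds to CUDA compute capability 8.9 and up - Ada / RTX 40 series, Hopper, Blackwell - while Ampere RTX 30 cards are 8.6. A small illustrative snippet:)
import torch

major, minor = torch.cuda.get_device_capability()
# FP8 tensor cores arrived with compute capability 8.9 (Ada / RTX 40 series);
# Ampere (8.6, RTX 30 series) and older lack the hardware FP8 path.
supports_fp8 = (major, minor) >= (8, 9)
print(torch.cuda.get_device_name(), f"sm_{major}{minor}",
      "FP8 supported" if supports_fp8 else "no hardware FP8")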
danielhanchen@reddit (OP)
OO that's a lot of data!! Hope it works well! Although sadly FP8 won't work that well on a 3060 :( - I actually should launch a 3090 and check - I might be able to still make FP8 work :)
mister2d@reddit
That would be great to know for my dual 3060 rig as well. But I'm not suggesting you go out of your way.
yoracale@reddit
Amazing to hear! Even if not contributing just showing your support is enough! 🥰♥️
MrRandom04@reddit
Holy moly, an RL-finetuned 4B Qwen could actually be useful for real tasks. Being able to do that on my lowly laptop GPU would be amazing.
SlowFail2433@reddit
Ye, there's already models on Hugging Face like that, e.g. the Jan AI ones
Training_Pudding9338@reddit
for example?
shapic@reddit
Are vllm also supported?
AbaGuy17@reddit
Will this work on any training? I tried training a GPT-2 model on Game Boy byte music and it worked in principle - using this I could train in FP8, right?
solomars3@reddit
I think the problem is that it's limited to only a few models. Unsloth doesn't support all model architectures - last time I tried, I was forced to use one of the templates for the supported models.
tifa_cloud0@reddit
awesome fr. by less than or equal to 5gb vram, do you mean it can also work on gtx 16 series cards which have 4gb vram ?
danielhanchen@reddit (OP)
It'll only work on GPUs that support FP8 unfortunately so any GPU after RTX 40 series BUT, if you want to do normal GRPO, it will work yes. Read more: https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide
tifa_cloud0@reddit
true fr
swashed-up-01@reddit
Guys, how well would a finetuned 4B model perform on custom datasets given enough data? Better than out-of-the-box LLMs like GPT-5, and would it match reasoning models?
DaniyarQQQ@reddit
That looks amazing. I'm sorry but I don't quite follow the development of your lib. I know that it is used for training. Can it be used as a backend to launch these models?
danielhanchen@reddit (OP)
Oh no worries! Unsloth https://github.com/unslothai/unsloth makes finetuning & training 2x faster and uses 60% less memory - we also support reinforcement learning, which is also faster and uses less VRAM
You can technically strip the inference part out of Unsloth - I do plan to make it portable so you can use it simply as an inference server in the near future if that helps!
Ofacon@reddit
Sincerely, thank you for supporting and engaging with community so much. It’s a gift.
Scy73@reddit
This work is amazing, thank you for sharing with us.
danielhanchen@reddit (OP)
Thanks for the support and for reading
Famous-Appointment-8@reddit
MLX Support?
yoracale@reddit
Not at the moment but we hope to support it early next year! We still haven't officially announced AMD or Intel support yet so hopefully we get that done first 🙏
Insipidity@reddit
So MacBook users are unable to run RL ? How about other features in Unsloth like fine-tuning?
danielhanchen@reddit (OP)
Sadly we don't support Mac at this moment - we're working on it though - best to check out MLX in the meantime sorry!
bhupesh-g@reddit
Is there any timeline for Mac support? Most devs use a Mac for day-to-day work, and enabling them to use unsloth for training and finetuning would be so cooool. BTW I love unsloth 😍
Famous-Appointment-8@reddit
Awesome thanks for all your effort!
_VirtualCosmos_@reddit
Does it work with llama.cpp for the inference part too, or is vLLM required? Would be cool to use the layer/expert offloading of llama.cpp to train big models with little VRAM, and gguf models.
danielhanchen@reddit (OP)
The goal was to make a llama.cpp backend, but in the meantime currently no sorry :(
_VirtualCosmos_@reddit
Thanks for the reply! So you tried to do that but discovered it would be too hard and thus changed to vLLM? Or something like that? Are you planning to still try it again?
AIMadeSimple@reddit
This is huge for democratizing AI. When RL training drops from enterprise H100s to consumer RTX 40x series, you fundamentally shift who can innovate. The gap between "AI researcher" and "person with a gaming PC" just collapsed. FP8 at <5GB VRAM means experimentation becomes accessible, not just deployment. This is how open source catches up to closed models.
IrisColt@reddit
Thanks!!!!
Barachiel80@reddit
any chance you have plans for ROCM support?
yoracale@reddit
Should already work we just haven't officially announced it: https://docs.unsloth.ai/get-started/install-and-update/amd
_VirtualCosmos_@reddit
damn great news, my Strix Halo is close to getting in my home.
danielhanchen@reddit (OP)
Oh nice!
peroperoname@reddit
Have you moved to DAPO loss in your implementation?
danielhanchen@reddit (OP)
Yes! You can set loss_type = "dapo" which will use DAPO! See https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide/advanced-rl-documentation - we also support GSPO and more. GSPO vision notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_VL_(8B)-Vision-GRPO.ipynb
peroperoname@reddit
Noice.
danielhanchen@reddit (OP)
:)
ElekDn@reddit
Looks really cool!! Can we expect 30 series support?
danielhanchen@reddit (OP)
I'll check 30x today and get back to you!
No_Lime_5130@reddit
"<6 GB VRAM" ... at what context length? 128? 512? 8096?
danielhanchen@reddit (OP)
Oh, 1024 context with batch size 2 should work since we offload everything. Longer contexts also work - we're going to release something next week or this week on even longer context support with less memory usage!!
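(For reference, a rough idea of where those numbers plug in - parameter names are the standard Unsloth / trl ones, while the model pick and remaining values are just illustrative:)
from unsloth import FastLanguageModel
from trl import GRPOConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-1.7B",   # illustrative pick for the ~5GB figure
    max_seq_length = 1024,               # the 1024-token context mentioned above
    fast_inference = True,
    load_in_fp8 = True,
)

training_args = GRPOConfig(
    per_device_train_batch_size = 2,     # the batch size 2 mentioned above
    num_generations = 2,                 # keep this dividing the effective batch size evenly
    max_prompt_length = 512,
    max_completion_length = 512,
    output_dir = "outputs",
)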
larrytheevilbunnie@reddit
This will be available in the docker image right?
danielhanchen@reddit (OP)
Yes!! Tonight!
thekalki@reddit
I was exploring a few libraries for full fine-tuning and ended up using TorchTune. Is there a reason why I should switch to Unsloth? At this point I primarily do some continuous pretraining, SFT, and am exploring RL, but how flexible is your framework for running RL in my own loop?
danielhanchen@reddit (OP)
Unfortunately TorchTune is deprecated, so it hasn't been updated in 4 months I think :(
Yes we support continued pretraining, SFT and RL! We have notebooks for all these at https://docs.unsloth.ai/get-started/unsloth-notebooks
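(On the SFT side, the notebooks generally follow the standard trl pattern; a loose sketch assuming an already-loaded Unsloth model with a LoRA adapter and a prepared dataset with a "text" field:)
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,            # newer trl versions call this processing_class
    train_dataset = dataset,          # placeholder: your formatted dataset
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 60,
        learning_rate = 2e-4,
        output_dir = "outputs",
    ),
)
trainer.train()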
Kappalonia@reddit
But wasn't Blackwell the only architecture that supports native fp8? Why use L4s?
Conscious_Chef_3233@reddit
that's fp4.
danielhanchen@reddit (OP)
We plan to support FP4 as well!
yoracale@reddit
Nope, any Nvidia GPU after 40 series supports FP8
Sea-Rope-31@reddit
That's amazing! You're amazing! Thank you, guys!
danielhanchen@reddit (OP)
Thank you!
Educational_Rent1059@reddit
Great work!
yoracale@reddit
Thank you 🙏