UMbreLLa: Llama3.3-70B INT4 on RTX 4070Ti Achieving up to 9.6 Tokens/s!
Posted by Otherwise_Respect_22@reddit | LocalLLaMA | View on Reddit | 83 comments
UMbreLLa: Unlocking Llama3.3-70B Performance on Consumer GPUs
Have you ever imagined running 70B models on a consumer GPU at blazing-fast speeds? With UMbreLLa, it's now a reality! Here's what it delivers:
Inference Speeds:
- RTX 4070 Ti: Up to 9.7 tokens/sec
- RTX 4090: Up to 11.4 tokens/sec
What makes it possible?
UMbreLLa combines offloading, speculative decoding, and quantization, perfectly tailored for single-user LLM deployment scenarios.
Why does it matter?
- Run 70B models on affordable hardware with near-human responsiveness.
- Expertly optimized for coding tasks and beyond.
- Consumer GPUs finally punching above their weight for high-end LLM inference!
Whether you're a developer, researcher, or just an AI enthusiast, this tech transforms how we think about personal AI deployment.
What do you think? Could UMbreLLa be the game-changer we've been waiting for? Let me know your thoughts!
Github: https://github.com/Infini-AI-Lab/UMbreLLa
#AI #LLM #RTX4070Ti #RTX4090 #TechInnovation
Secure_Reflection409@reddit
We need more people to try this. It's kind of a big deal if it works.
Professional-Bear857@reddit
Does this support Windows? Can it run a 70B on an RTX 3090 with 32 GB of system RAM?
coderman4@reddit
I haven't run it on Windows directly yet, but I used WSL to bridge the gap.
I know there's some overhead involved doing it this way, but at least for me it seemed to work well.
My RAM utilization was quite high even with 96 GB of system RAM, so I think 32 GB will be cutting it a bit close, unfortunately.
Secure_Reflection409@reddit
Win10?
It doesn't work for me.
Otherwise_Respect_22@reddit (OP)
32GB might be risky. I will solve the problem soon.
XForceForbidden@reddit
Hope it will expand to Qwen Coder 32B and my 4070 laptop (8 GB VRAM + 32 GB RAM).
AppearanceHeavy6724@reddit
Speculative decoding is not for everyone. At temperatures below 0.2, many models become barely usable.
Otherwise_Respect_22@reddit (OP)
Our chat configuration uses T=0.6
AppearanceHeavy6724@reddit
AFAIK speculative decoding requires t=0
Mushoz@reddit
It does not. But higher temperatures will lead to more draft rejections (i.e. less speedup, or sometimes even a slowdown), so lower temperatures are better purely for speed.
AppearanceHeavy6724@reddit
Well, that's what I'm trying to figure out: how they manage to run speculative decoding at a 0.6 temp. That's quite a high temperature if you ask me.
Otherwise_Respect_22@reddit (OP)
You're welcome to check out our codebase!
sammcj@reddit
It works best with temperature set to 0, but then I think most LLMs do, unless you truly want to inject pretty dumb randomness into the start of the prediction algorithm for some reason; in that case, use min_p instead.
Puzzleheaded-Drama-8@reddit
That sounds amazing! Would this allow me to run Qwen 32B with Qwen 0.5B on 3060 12GB with similar speed?
Otherwise_Respect_22@reddit (OP)
I will add support for Qwen in 6-9 days.
waydown21@reddit
Will this work with RTX 4080?
coderman4@reddit
As a fellow 4080 user, I can say that, at least on my system, it is working great so far.
I used the 16 GB chat config and didn't need to change anything to have things working well right off the bat.
Otherwise_Respect_22@reddit (OP)
Thank you for trying this!
Otherwise_Respect_22@reddit (OP)
Yes. I have configurations for the 4080 SUPER (which might differ from the 4080). You can check our repo. (We got the benchmark results with PCIe 4.0, with ~30 GB/s GPU-CPU bandwidth.)
Ok_Warning2146@reddit
Can you also support Nemotron 51B? It will be even faster.
Otherwise_Respect_22@reddit (OP)
Yeah. Let me put it in my plan.
itsnottme@reddit
What's the catch? There must be one.
Otherwise_Respect_22@reddit (OP)
We use speculative decoding at a very large scale: by speculating 256 or even more tokens, we can generate 13-15 tokens per forward pass. On coding tasks (where LLMs are more confident), this number is more than 20.
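As a rough back-of-the-envelope sketch of why those numbers add up (illustrative figures only, not a benchmark), assuming each offloaded forward pass is dominated by streaming the INT4 weights over PCIe at the ~30 GB/s quoted elsewhere in this thread:

```python
# Rough back-of-the-envelope sketch (illustrative numbers, not a benchmark):
# with offloading, every forward pass streams the quantized weights over PCIe,
# so the pass latency is transfer-bound, and speculation amortizes that cost
# over all the tokens accepted in that single pass.

weights_gb = 70e9 * 0.5 / 1e9                  # ~35 GB for 70B params at 4 bits each
pcie_gb_per_s = 30                             # GPU-CPU bandwidth quoted in the thread
pass_latency_s = weights_gb / pcie_gb_per_s    # ~1.2 s per pass, transfer time only

accepted_per_pass = 14                         # 13-15 accepted tokens per pass, as above
print(f"~{accepted_per_pass / pass_latency_s:.0f} tokens/s")   # ~12 tokens/s
# Measured pass latency is a bit higher (~1.4-1.6 s including compute), which
# lands right around the advertised 9.7-11.4 tokens/s.
```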
ForsookComparison@reddit
What's the catch? (Dumb it down for me if you could: are these free performance gains, or is something lost?)
c110j378@reddit
The catch is that you probably won't get that many tokens/s (or you may even get worse-than-baseline performance) outside of coding tasks.
Otherwise_Respect_22@reddit (OP)
On chat tasks (I used MT-Bench to measure), we still get 5 tokens/sec, which is still 7-8 times faster than plain CPU offloading. We provide examples in our codebase.
Otherwise_Respect_22@reddit (OP)
Model performance is theoretically proven to be preserved, according to the theory of speculative decoding. This is a free performance gain.
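For reference, that guarantee comes from the standard speculative-sampling acceptance rule (Leviathan et al. / Chen et al.): accept a draft token with probability min(1, p_target/p_draft), otherwise resample from the residual distribution. A minimal sketch of that rule, not UMbreLLa's actual implementation:

```python
import numpy as np

def speculative_verify(draft_tokens, p_draft, p_target, rng=np.random.default_rng()):
    """Standard speculative-sampling verification step.

    draft_tokens: token ids proposed by the draft model.
    p_draft[i], p_target[i]: the draft / target probability vectors over the
    vocabulary at position i.
    Accepted tokens are distributed exactly as if the target model had sampled
    them itself, which is why output quality is preserved.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        # Accept the draft token with probability min(1, p_target / p_draft).
        if rng.random() < min(1.0, p_target[i][tok] / p_draft[i][tok]):
            accepted.append(tok)
        else:
            # On rejection, resample from the normalized residual (p_target - p_draft)+.
            residual = np.maximum(p_target[i] - p_draft[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break
    # (The full algorithm also samples one bonus token from the target model
    # when every draft token is accepted; omitted here for brevity.)
    return accepted
```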
ForsookComparison@reddit
Does it scale with VRAM? Could I expect a significant performance boost with multiple 4090's?
Otherwise_Respect_22@reddit (OP)
Yes. But the point of this project is to host a large model with a small GPU. Multiple GPUs can of course improve the performance of UMbreLLa, but if the VRAM is large enough to host the entire model, I would recommend a more standard framework for large-scale serving, like vLLM, SGLang, etc.
Secure_Reflection409@reddit
cuda error: out of memory - running the 16GB chat config on a 4080S.
What am I missing?
Otherwise_Respect_22@reddit (OP)
I used roughly 14-15 GB when running the Gradio chat, but my machine runs Ubuntu. My command line is:
python gradio_chat.py --configuration ../configs/chat_config_16gb.json
If you confirm that this can lead to OOM with WSL, you're welcome to submit an issue.
Otherwise_Respect_22@reddit (OP)
This is my memory usage when launching gradio_chat on a 4080.
DragonfruitIll660@reddit
Getting the same error on a 3080 16 GB mobile, trying both the 16 GB chat config and the 12 GB chat config, with 64 GB of regular RAM, also using WSL.
Otherwise_Respect_22@reddit (OP)
I don't run into this error. What are you running?
Secure_Reflection409@reddit
win10/wsl
brown2green@reddit
What does this do that llama.cpp doesn't already?
Otherwise_Respect_22@reddit (OP)
UMbreLLa applies speculative decoding at a very large scale. We speculate 256 or more tokens and generate >10 tokens per iteration. Existing frameworks only speculate <20 tokens and generate 3-4 tokens. This makes UMbreLLa extremely suitable for a single user (without batching) on a small GPU.
brown2green@reddit
You can configure llama.cpp to speculate as many or as few tokens as you desire per iteration. There are various command-line settings for this, and the defaults are by no means optimal for all use cases.
Otherwise_Respect_22@reddit (OP)
But we apply a different speculative decoding algorithm. The one implemented in llama.cpp won't be so helpful when you set N=256 or more.
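To make the difference concrete, here is a rough toy model (illustrative only, not the repo's algorithm): with a single linear draft chain, the expected number of accepted tokens is capped by a geometric series in the per-token acceptance rate, so pushing N to 256 buys almost nothing; getting 10+ accepted tokens per pass requires a different drafting structure, e.g. a branching draft tree.

```python
# Illustrative only: expected accepted tokens from a single linear chain of N
# draft tokens, assuming an idealized independent per-token acceptance rate p.
def expected_accepted(p: float, n: int) -> float:
    return sum(p ** k for k in range(1, n + 1))   # p + p^2 + ... + p^N

for n in (16, 64, 256):
    print(n, round(expected_accepted(0.8, n), 2))
# 16  -> 3.89
# 64  -> 4.0
# 256 -> 4.0
# With p = 0.8, a chain saturates near p/(1-p) = 4 tokens no matter how large N
# gets, so naively speculating 256 tokens in a chain is wasted effort; a
# branching (tree-style) draft can spend that budget on many alternative
# continuations instead and keep more of them per pass.
```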
caetydid@reddit
Might this be integrated into Ollama and/or LocalAI?
BenefitOfTheDoubt_01@reddit
Hey folks, not trying to hijack the conversation but I don't have enough karma here to create my own topic.
I am new to AI and interested in getting my own model running with an off-the-shelf GPU, but... I don't yet understand the verbiage. Is there a series of books or something anyone can recommend to catch me up to speed?
I could just ask an LLM the difference between an image generator and a chat-based system, how they function, how tokens work, etc., but I figured instead of asking a million questions there may be published works or YouTube instruction available. I prefer video as a medium, but beggars can't be choosers, right?
Some sources I've found assume prior AI knowledge; I'm looking for something from the ground up, preferably a series.
Thanks all!
AdWeekly9892@reddit
Will inference work on finetunes of supported models, or must the model match exactly?
Otherwise_Respect_22@reddit (OP)
Currently, it does not, but there is no technical challenge. It can be expected in 7-10 days.
Secure_Reflection409@reddit
This should be pinned to the top tbh.
Realistic-Mix-7913@reddit
I'll see if this works on my Titan RTX this weekend, looks quite promising
Secure_Reflection409@reddit
I've got a 4080 Super which appears to be the prime target for this?
Have you tried it with 70b qwen / 1.5b qwen?
Could be even bigger gains...?
Otherwise_Respect_22@reddit (OP)
I have not added Qwen support yet. It can be expected in 7-10 days. Thank you!
FullOf_Bad_Ideas@reddit
That sounds like a game changer indeed. Wow.
Otherwise_Respect_22@reddit (OP)
Could you test this (in ./examples)? It reflects the CPU-GPU bandwidth of your computer (by running model offloading without our techniques). Mine (4070 Ti) returns 1.4-1.6 s per token.
FullOf_Bad_Ideas@reddit
I guess that's 4.43s per token for me if I read this right.
Otherwise_Respect_22@reddit (OP)
This is what I got.
Otherwise_Respect_22@reddit (OP)
Yes. So your generation speed will be roughly 4.43/1.5 ≈ 3 times slower than mine. I think this mainly comes from the PCIe setup.
Otherwise_Respect_22@reddit (OP)
This depends on the PCIe bandwidth. Our numbers come from PCIe 4.0. Maybe the 3090 Ti you are testing uses PCIe 3.0? You can raise an issue on GitHub.
FullOf_Bad_Ideas@reddit
It's PCIe 4.0 x16, so it should be fine. If my math is right, I should be able to get around the same performance as you get on a 4070 Ti with my 3090 Ti, if not better.
I'll test it on cloud gpu tomorrow to see if it works the same way there to eliminate issues with my setup, before making a github issue.
kryptkpr@reddit
Seems to be some device-specific magic in the configs.
a_beautiful_rhind@reddit
My guess is that the Ada optimizations are why this goes fast at all: brute-forcing it with the extra compute.
FullOf_Bad_Ideas@reddit
The 3090 Ti has the same FP16 FLOPS as the 4070 Ti though (and INT4 too, but I don't think AWQ supports INT4 inference), so I am not sure where it's coming from. It's not FP8 inference. It also has 2x the bandwidth.
a_beautiful_rhind@reddit
Hopefully someone with that hardware verifies the benchmarks.
coderman4@reddit
At least for me, using a 4080 with 16 GB of VRAM, I'm able to get at least 10 t/s with the 16 GB chat configuration using Llama3.3-70B.
It's early days, but this looks like a promising advance so far, especially when you compare it to the 0.5 t/s I was getting before with GGUF.
Bonus points in my book if/when we can get an OpenAI-compatible API for this, so it can be hooked into more things.
Thanks for making this available to the open-source community.
space_man_2@reddit
What do you have configured for a large BAR? In Windows, if BIOS support is enabled, it's usually half of your system's memory.
antey3074@reddit
Can I use this with Aider? What quantization can my 3090 support with a 70B model?
Otherwise_Respect_22@reddit (OP)
We have not integrated with Aider. You can run full precision (16-bit) with an RTX 3090. However, the inference speed will be about 1/4, since the model is 4 times larger. For quantization, we currently only support AWQ Q4.
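The 1/4 factor is just the offloading bottleneck: each forward pass streams the weights over PCIe, and 16-bit weights are four times the bytes of AWQ INT4. A rough sketch with illustrative numbers (assuming a transfer-bound pass, as discussed above):

```python
# Rough illustration of the 1/4 factor: under offloading, the per-pass time is
# dominated by streaming the weights over PCIe, and FP16 weights are 4x the
# bytes of AWQ INT4.
params = 70e9
pcie_gb_per_s = 30                                   # PCIe 4.0 figure quoted earlier

int4_pass_s = (params * 0.5 / 1e9) / pcie_gb_per_s   # ~35 GB  -> ~1.2 s per pass
fp16_pass_s = (params * 2.0 / 1e9) / pcie_gb_per_s   # ~140 GB -> ~4.7 s per pass
print(f"INT4 {int4_pass_s:.1f} s/pass, FP16 {fp16_pass_s:.1f} s/pass, "
      f"{fp16_pass_s / int4_pass_s:.0f}x slower")
```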
Whiplashorus@reddit
Omg, this seems nice. Do you think I can use it on my 7800 XT? Is there a Qwen2.5-72B version planned?
Otherwise_Respect_22@reddit (OP)
We don't support AMD currently. Qwen is planned.
Whiplashorus@reddit
Am Intel arc ?
Otherwise_Respect_22@reddit (OP)
I think the 7800 XT is an AMD GPU?
Whiplashorus@reddit
Sorry, I meant "And", not "Am". Let me ask it again:
You don't support AMD GPUs, you support NVIDIA GPUs. But do you support Intel Arc GPUs?
Otherwise_Respect_22@reddit (OP)
Sorry. We only support NVIDIA GPU. Thank you for your interest!
Whiplashorus@reddit
Okay, I see. Is support for any other GPU brand planned, or is it out of scope?
Otherwise_Respect_22@reddit (OP)
I plan to extend to AMD in the future.
Whiplashorus@reddit
Nice, I'm saving the repo. Thanks for your time.
phovos@reddit
Cool, there is a need for this. Is there any particular reason you didn't extend this fantastic idea down to the plebs? Why not support X gigs and RTX xx70+, or (arbitrarily) why not a 1080 Ti with 6 GB?
Otherwise_Respect_22@reddit (OP)
We plan to support more GPU types in the future. 6 GB should be able to run the program (I have not tested it myself), but it may not be that fast.
tengo_harambe@reddit
Why is it named so sarcastically tho
Otherwise_Respect_22@reddit (OP)
Why is it sarcastic?
ApatheticWrath@reddit
What quant, on what exact hardware, are these speeds from? A 70B doesn't fit on one 4090, does it? If it's Q4 on two 4090s, I think exllama is faster. Maybe vLLM too? I'm less certain about their numbers.
Otherwise_Respect_22@reddit (OP)
One 4070Ti or one 4090. We use parameter offloading.
Otherwise_Respect_22@reddit (OP)
It only requires one GPU and ~35 GB of CPU RAM to run.
antey3074@reddit
If I have 32 GB of RAM and 24 GB of video memory, is that not enough to work well with the 70B model?
Otherwise_Respect_22@reddit (OP)
Currently, I load the entire model into RAM and then conduct offloading. I think you raise a very good question. Let me solve this this week; I can make it more flexible.
Otherwise_Respect_22@reddit (OP)
We use AWQ INT4
reddit_kwr@reddit
What's the max context length this supports on 24 GB?
Otherwise_Respect_22@reddit (OP)
A 32K context will take about 21 GB (I think at most you can serve 36K-40K currently). This would require changing the engine configuration. We will add support for KV-cache offloading and long-context techniques.
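For a rough sense of where that budget goes (an estimate based on the standard Llama-3.3-70B architecture, not on UMbreLLa's internals), the FP16 KV cache alone is about 320 KB per token, so roughly 10-11 GB at 32K; the remainder of the ~21 GB presumably covers resident weight buffers, the draft model, and activations:

```python
# Rough KV-cache estimate for Llama-3.3-70B (GQA: 80 layers, 8 KV heads,
# head dim 128), assuming an FP16 KV cache kept on the GPU.
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # K and V, 2 bytes each
ctx = 32 * 1024
print(f"{bytes_per_token / 1024:.0f} KB/token, "
      f"{bytes_per_token * ctx / 1e9:.1f} GB at {ctx} tokens")
# ~320 KB/token and ~10.7 GB at 32K; the rest of the reported ~21 GB presumably
# goes to resident weights, the draft model, and activation buffers.
```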