You can now check if your Laptop/ Rig can run a GGUF directly from Hugging Face! 🤗
Posted by vaibhavs10@reddit | LocalLLaMA | 64 comments
tilmx@reddit
Hey u/vaibhavs10 - great feature! Small piece of feedback: I'm sure you know, but many of the popular models will have more GGUF variants than can be displayed on the sidebar:
Clicking on the "+2 variants" takes you to the "files and versions" tab, which no longer includes compatibility info (unless I'm missing something?) Do you have any plans to add it there? Alternatively, you could have the Hardware compatibility section expand in place.
vaibhavs10@reddit (OP)
Hey hey, I'm VB, GPU Poor @ Hugging Face. Starting today, you can check hardware compatibility directly from the Hugging Face page of any GGUF. All you need to do is update your hardware specifications here: https://huggingface.co/settings/local-apps and then, for any GGUF across quant types, it should tell you whether you can run it or not.
Take it out for a spin and let us know what you think!
10minOfNamingMyAcc@reddit
It doesn't account for multiple GPUs
10minOfNamingMyAcc@reddit
Text Generation
Browse compatible models
So for Koboldcpp, should I select llama.cpp?
Frankie_T9000@reddit
Going to look tonight, but quick question: does it support multiple PCs, or do you need to change the config on the fly?
vaibhavs10@reddit (OP)
Yes, it should.
Frankie_T9000@reddit
Thanks, had a look. To confirm: it's hardware-instruction based for the CPU, not memory based?
Liringlass@reddit
Can it instead give my machine the VRAM it desperately needs? :D
DegenerativePoop@reddit
Any plans on updating to include the newest GPUs such as the 9070/9070xt?
MegaBytesMe@reddit
Very cool! Missing my Nvidia Quadro RTX 3000 though (in my Surface Book 3), and ARM-based processors (Snapdragon X Elite etc.).
vaibhavs10@reddit (OP)
aha! do you mind opening a PR here: https://github.com/huggingface/huggingface.js/blob/1aa1c3f4d2081b270517219c49c95c1d8d7fc682/packages/tasks/src/hardware.ts and tagging me on the PR `Vaibhavs10` 🙏
MegaBytesMe@reddit
Sure thing!
abitrolly@reddit
I feel like pasting `fastfetch` output would be the best UI. I am not sure which i5 generation this is.
CPU: Intel(R) Core(TM) i5-4300U (4) @ 2.90 GHz
LagOps91@reddit
If I might suggest something: would it be possible to use this information during model search? For instance, I'd want to look for models where I can run at least Q4, and I'm not interested in whether I can run Q8 or larger (since those typically get outperformed by larger models at lower quants).
vaibhavs10@reddit (OP)
good feedback - will iterate on this with the team
Devonance@reddit
Love this. Making the starting bar lower for hobbyists.
Any chance of getting AWQ to be added?
Also, a "default" option for the graphics card (or CPU) that shows up first when calculating? It pulls my RTX A4500 before my 2x 4090s, so I have to adjust it every time. (I did just rearrange them in the hardware settings; it uses whichever was added first.)
Also, maybe in the far future, adding in the PCIe lane count and then giving an estimate of tokens/sec (a rough estimate, since other factors would affect this).
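For a sense of what such an estimate could look like: decode speed on a dense model is roughly memory-bandwidth bound, since each generated token streams the whole model through memory once. A back-of-the-envelope Python sketch (the specific numbers are illustrative assumptions, not measurements):

```python
def rough_tokens_per_sec(model_size_gb: float, mem_bandwidth_gbps: float) -> float:
    """Upper-bound decode speed for a dense model: each token reads all
    weights once, so generation is approximately memory-bandwidth bound.
    Ignores KV cache traffic, PCIe transfers, batching, and MoE sparsity."""
    return mem_bandwidth_gbps / model_size_gb

# e.g. a ~13 GB Q4 quant on a GPU with ~1000 GB/s of memory bandwidth
print(round(rough_tokens_per_sec(13, 1000), 1))  # 76.9 tok/s, as an upper bound
```

Real throughput sits below this bound, and PCIe lane count mostly matters once layers spill across devices or into system RAM.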
vaibhavs10@reddit (OP)
We'll increase the coverage of models supported, yes, but a bit slowly with wherever the community wants us to go next - keep the feedback coming!
strategos@reddit
Key question though - Are you really GPU poor? :)
vaibhavs10@reddit (OP)
haha, you can look at my GPU stack here: https://huggingface.co/reach-vb
Euphoric-Bullfrog525@reddit
Hi, I'm a beginner starting out. I made an account on hugging face and added my hardware specs, but I'm not seeing the GGUF interface on the model page here:
_harias_@reddit
There is a separate page for the GGUF versions: https://huggingface.co/Qwen/QwQ-32B-GGUF
Euphoric-Bullfrog525@reddit
Thanks!
noneabove1182@reddit
Such an awesome QoL upgrade. Great at-a-glance info; even if it doesn't give the full story, it'll make life so much easier for a lot of people!
AstroEmanuele@reddit
Idk if it's a bug, but somehow a Ryzen 9 7000 series with 16 GB of RAM is only 0.56 TFLOPS, while my Ryzen 7 5000 series with the same amount of RAM is almost three times higher at 1.33 TFLOPS. Is that actually correct?
AstroEmanuele@reddit
u/vaibhavs10
xqoe@reddit
I need a way to calculate BPS and FLOPS per token per second for a chosen model, so I can check whether my hardware can run it.
carvengar@reddit
What does it mean that you can 'run' the LLM?
Because 1 token a second isn't usable, but it still 'runs'.
Delicious_One_7887@reddit
I have a total of 2.60 TFLOPS of computing power according to this
ParaboloidalCrest@reddit
Damn! That's a shot at LMStudio!
vaibhavs10@reddit (OP)
Eh, not really! We love LM Studio and chat with them almost every day. In fact, you can open a GGUF directly from the model page in LM Studio as well!
ParaboloidalCrest@reddit
I was kidding. Thanks for the feature.
sunpazed@reddit
This is really great, well done to you and the team!
AlphaPrime90@reddit
The CPU and GPU buttons don't work in Firefox. Will try again tomorrow.
drink_with_me_to_day@reddit
There's a bug, my 750 ti isn't in the dropdown
vaibhavs10@reddit (OP)
we should fix this - do you mind opening a PR here: https://github.com/huggingface/huggingface.js/blob/1aa1c3f4d2081b270517219c49c95c1d8d7fc682/packages/tasks/src/hardware.ts and tagging me on the PR `Vaibhavs10` 🙏
drink_with_me_to_day@reddit
I was joking, as I think the 750 Ti is too old to run anything.
MagicaItux@reddit
I think you have bigger issues
LA_rent_Aficionado@reddit
Would love for this to be able to tell you the max context you could run
vaibhavs10@reddit (OP)
good feedback
LA_rent_Aficionado@reddit
Happy to help! And maybe even an estimated GPU layer offload count… I feel like LM Studio and KoboldCpp really drop the ball when calculating these for multi-GPU setups.
Keep up the good work!
das_rdsm@reddit
Maybe add MLX as well? GGUF is not great for Apple Silicon.
vaibhavs10@reddit (OP)
on the list, yes!
DerfK@reddit
Pretty neat. If it expanded beyond GGUF quants, it could become really useful guidance, e.g. if I have multiple video cards, recommending quants and software that implement tensor parallelism to use the hardware to the fullest. Of course, then someone would have to keep track of all the feature compatibilities, but at least it wouldn't be me :D
vaibhavs10@reddit (OP)
yeah! definitely want to increase to more model types soon
fuutott@reddit
Please add rtx pro 6000
vaibhavs10@reddit (OP)
Sure, do you mind opening a PR here: https://github.com/huggingface/huggingface.js/blob/1aa1c3f4d2081b270517219c49c95c1d8d7fc682/packages/tasks/src/hardware.ts and tagging me on the PR `Vaibhavs10` 🙏
panchovix@reddit
This is great! But for multi-GPU, it doesn't seem to sum up? I have multiple different GPUs, but it only lets me choose one of them to evaluate.
vaibhavs10@reddit (OP)
looking into it with the team
puncia@reddit
A very good addition to this would be a suggested number of gpu layers to offload when using cpu + gpu inference, as I'm sure many of us do
vaibhavs10@reddit (OP)
yes! this is a first edition - will iterate a bit in the future :D
draetheus@reddit
Interesting idea, although I'd say this is highly variable depending on how much context you run with. I usually keep my KV cache in RAM as well (rather than VRAM), so I can maximize the quant quality within 12 GB of VRAM.
For instance, I can run Mistral Small 3.1 at IQ4_XS in 12 GB of VRAM, although this tool says Q3_K_S is the limit.
ParaboloidalCrest@reddit
How do you do that?
draetheus@reddit
If you're directly using llama.cpp the flag is `-nkvo`. Not sure what it is for the various llama.cpp wrappers or other frameworks.
ParaboloidalCrest@reddit
I badly wanted this to work. While it does keep the KV cache in system memory, it slows down token generation significantly, not only prompt processing.
draetheus@reddit
Huh, I don't get nearly that level of slowdown, but then again my usage involves relatively small contexts and prompts.
ParaboloidalCrest@reddit
Same here. I just tested with "Hello". Perhaps it has to do with the backend (Vulkan).
draetheus@reddit
Possibly, I'm using CUDA, I've never tested Vulkan.
Marksta@reddit
Do you still get good performance running like that?
draetheus@reddit
IMO yes, token generation is always the bottleneck so having slightly slower prompt processing to fit better quants is worth it.
popiazaza@reddit
Awesome.
I spent way too much time early in my local LLM journey learning how to calculate memory usage for each GGUF model.
It's pretty straightforward, but there isn't much information floating around. I got confused when I searched for something like "llama 3.1 8b vram requirement".
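The calculation in question is roughly: the weights take params × bits-per-weight / 8 bytes, plus some headroom for the KV cache and compute buffers. A minimal Python sketch, where the bits-per-weight and overhead values are assumptions rather than exact figures:

```python
def gguf_memory_gb(params_b: float, bits_per_weight: float, overhead_gb: float = 1.5) -> float:
    """Rule-of-thumb memory footprint for a quantized model.

    params_b is the parameter count in billions; weights take
    params * bits / 8 bytes, plus a flat fudge factor for the KV cache
    and buffers (in reality this part grows with context length).
    """
    weights_gb = params_b * bits_per_weight / 8
    return weights_gb + overhead_gb

# e.g. an 8B model at Q4_K_M (~4.8 bits/weight is an assumed average)
print(round(gguf_memory_gb(8.0, 4.8), 1))  # 6.3
```

That puts a Q4-ish 8B quant at roughly 6-7 GB, which matches the common advice that it fits comfortably on an 8 GB card at modest context lengths.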
Emport1@reddit
Really nice thanks
ThiccStorms@reddit
Amazing
-Cubie-@reddit
I love it, very nice!
Stepfunction@reddit
This is so cool! Great work!