TheaterFire

Official statement from meta

Posted by Independent-Wind4462@reddit | LocalLLaMA | View on Reddit | 57 comments

57 Comments

mikael110@reddit

>We believe the Llama 4 models are a significant advancement and we're looking forward to working with the community to unlock their value.

If this is a true sentiment then show it by actually working with community projects. For instance, why were there 0 people from Meta helping out or even just directly contributing code to llama.cpp to add proper, stable support for Llama 4, both for text and images? Google did offer assistance, which is why Gemma 3 was supported on day one. This shouldn't be an afterthought, it should be part of the original launch plans.

It's a bit tiring to see great models launch with extremely flawed inference implementations that end up holding back the success and reputation of the model. Especially when it is often a self-inflicted wound caused by the creator of the model making zero effort to actually support the model post release.

I don't know if Llama 4's issues are truly due to bad implementation, though I certainly hope they are, as it would be great if it turned out these really are great models. But it's hard to say either way when so little support is offered.
View on Reddit #53172015

mczarnek@reddit

Yeah, idk why everyone is doing this.. games too, release them half baked instead of getting them ready to show off first and having a beta release
View on Reddit #53400076

jeremy_oumi@reddit

You'd definitely think they'd be providing actual support to community projects, especially for a company/team of their size right?
View on Reddit #53247411

Ok_Warning2146@reddit

Well, Google didn't add iSWA (interleaved sliding-window attention) support to llama.cpp for Gemma 3, so Gemma 3 is effectively useless at long context there.
View on Reddit #53235610

IrisColt@reddit

>If this is a true sentiment then he should show it by actually...

...using it... you know... eating your own dog food.
View on Reddit #53203028

complains_constantly@reddit

They contributed PRs to transformers, which is exactly what you're suggesting. Also, there are quite a few engines out there. Just because you use llama.cpp doesn't mean everyone else does. In our production environments we mostly use vLLM, for example. For home setups I use exllamav2. And there are quite a few more.
View on Reddit #53182161

segmond@reddit

I don't think it's due to bad inference. Reading the llama.cpp PR, the author implemented it independently and is getting the same quality of results the cloud models are giving.
View on Reddit #53179025

Expensive-Apricot-25@reddit

tbf, they literally did just finish training it. They wouldn't have had time to do this since they released it much earlier than they expected.
View on Reddit #53177874

xanduonc@reddit

And why can't someone write code for community implementations while the model is training? Or write a post with recommended settings based on their prior experiments? Look, Qwen3 already has pull requests to llama.cpp and it's not released yet.
View on Reddit #53178631

lemon07r@reddit

At least part of it is. But I've seen models that were hurt on release by implementation and bugs.. sure they were better once fixed but the difference was never so big that it could explain why llama 4 is so bad.
View on Reddit #53173023

LosingReligions523@reddit

Doubt. First of all, **their own benchmark** compares their Scout model, which has 105B parameters, to models with WAAAAY fewer parameters, like 22B or 25B. They claim victory, but if you look at the benchmark it barely beats them. And naturally they don't compare to QwQ-32B, because QwQ-32B would annihilate Scout.

---

A 105B model can't even be used by the wide public, as it needs at least an H100, a $40k GPU, to run, or 4x3090/4090, which is less expensive but actually hard to put together for commoners.
View on Reddit #53201011

jubilantcoffin@reddit

The unsloth quant runs on MacBook Pro at pretty good speeds.
View on Reddit #53386826

realechelon@reddit

I'm running Scout at Q6\_K on my MacBook Pro (M4 Max 128GB). I get 20 T/s. You do not need a $40k GPU to run this model. You need 128GB of fast RAM, which is $200-300, or DIGITS, which will be $3k, or an M4 Max 128GB, which is about $5k.
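
For anyone who wants to sanity-check the memory claim, here is a back-of-the-envelope sketch. It assumes the 105B parameter count quoted upthread and approximate bits-per-weight figures for llama.cpp quant types (real GGUF files mix tensor types, and the KV cache and activations need extra room on top), so treat the output as a rough estimate rather than a measurement. `weight_memory_gb` is a hypothetical helper, not a llama.cpp function.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# Assumption: approximate average bits per weight for each format.
QUANT_BITS = {"BF16": 16.0, "Q8_0": 8.5, "Q6_K": 6.5625, "Q4_K": 4.5}

# Assumption: 105B total parameters, the figure used earlier in this thread.
PARAMS_BILLION = 105

for name, bits in QUANT_BITS.items():
    print(f"{name}: ~{weight_memory_gb(PARAMS_BILLION, bits):.0f} GB of weights")
```

Under those assumptions Q6\_K comes out well below 128GB, which is consistent with it fitting on a 128GB machine with some headroom left for context.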
View on Reddit #53302231

robberviet@reddit

Then provide a correct way for users to use it. Either by supporting tools like llama.cpp or by providing free limited access like Google AI Studio. This statement is just a cover-up.
View on Reddit #53189249

RMCPhoto@reddit

I doubt it, wouldn't be very clever to release a statement like this if it will so easily be disproven in a week or two. I hope they're right and there will be improvements soon.
View on Reddit #53384668

robberviet@reddit

I know that this is a trillion dollar company we are talking about. However it's dumb to say it and there is no way to prove it.
View on Reddit #53385238

RMCPhoto@reddit

Either his team is completely misleading him or they know there's a lot of performance being left on the table. If you skim through the release docs, Llama 4 has a lot of new features, and the dynamic int4 loading etc. can easily lead to problems if not properly implemented. This is a completely different architecture than Llama 3, and unlike Google with Gemma, Meta didn't work with llama.cpp etc. to prep in the same way. No doubt it was a rocky release, but I wouldn't be surprised if there are some bugs to iron out. It's easy to forget that a LOT of LLM launches have been messy.
View on Reddit #53385655

Lazy-Chick-4215@reddit

What I'm stoked for is being able to run a pretty big model on a combo of a lot of RAM and a much smaller amount of VRAM.
View on Reddit #53320564

Potential_Chip4708@reddit

While criticizing is good, we have to cut Meta some slack, since they are one of the main reasons we are seeing a lot of open-source LLMs.
View on Reddit #53207658

Background_Stress_7@reddit

no
View on Reddit #53284005

ButterscotchSlight86@reddit

https://preview.redd.it/2n8jpvbhkote1.jpeg?width=1536&format=pjpg&auto=webp&s=809786135900a1bb0ab72f2c2303ad6c13558448
View on Reddit #53264431

Future_Might_8194@reddit

Tbf, almost no one was aware of the extra role and base function calling capabilities of Llama 3.1+ https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/#-special-tokens- Hermes 3 Llama 3.1 8B is actually trained on 2 different function calling sets (Llama and Hermes) and can really lock in on XML tags and instructions. There's a LOT of functionality most people haven't uncovered yet.
View on Reddit #53245654

rorowhat@reddit

"stabilize implementation" what does that mean?
View on Reddit #53174040

iKy1e@reddit

It means llama.cpp handles this new feature slightly wrong, vLLM handles this other part of the new design slightly wrong, etc. So none produces quite as good results as expected, and each implementation of the model's features gives different results from the others. But as they all fix bugs and implement the new features, the performance should improve and converge to be roughly the same. Whether or not that’s true, or explains all of the differences, 🤷🏻‍♂️.
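
A rough way to check whether two backends have actually converged is to feed both the same prompt and compare their next-token distributions. The sketch below is only an illustration: the logit values are made up, and `kl_divergence` is a hypothetical helper, not part of llama.cpp or vLLM.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) between the two next-token distributions, in nats."""
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Made-up logits for the same prompt from a reference run and two backends.
reference = [3.0, 1.0, 0.2, -1.0]
backend_a = [3.0, 1.0, 0.2, -1.0]  # matches the reference exactly
backend_b = [2.4, 1.5, 0.2, -1.0]  # slightly off, as an implementation bug might cause

print(kl_divergence(reference, backend_a))  # zero for identical logits
print(kl_divergence(reference, backend_b))  # positive: the distributions differ
```

If the number shrinks toward zero across releases of a backend, that would support the "implementations are converging" story; if it stays put while quality is still poor, the problem is probably elsewhere.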
View on Reddit #53175580

KrazyKirby99999@reddit

How do they test pre-release before the features are implemented? Do model producers such as Meta have internal alternatives to llama.cpp?
View on Reddit #53176400

bigzyg33k@reddit

What do you mean? You can run the models just using PyTorch, you don’t need llama.cpp at all, particularly if you’re meta and have practically unlimited compute
View on Reddit #53182397

KrazyKirby99999@reddit

How is LLM inference done without something like llama.cpp? Does Meta have an internal inference system?
View on Reddit #53182594

Drited@reddit

I tested llama 3 locally when it came out by following the meta docs and output was in terminal. llama.cpp wasn't involved. 
View on Reddit #53226120

Rainbows4Blood@reddit

Big corporations often use their own proprietary implementation for internal use.
View on Reddit #53204310

bigzyg33k@reddit

I mean, you could arguably just use PyTorch if you wanted to, no? But yes, meta has several inference engines afaik
View on Reddit #53182749

sluuuurp@reddit

They probably test inference with PyTorch. It would be nice if they just released that, maybe it has some proprietary secret training code they’d have to hide?
View on Reddit #53187375

rorowhat@reddit

Interesting. I thought that was all done pre-training. I didn't realize your back end could affect the quality of the response.
View on Reddit #53176089

CheatCodesOfLife@reddit

Oh yeah, the backend and quant formats make a HUGE difference! It gets really nuanced / tricky if you dive in too. We've got among other things:

- Different sampler parameters supported
- Different order in which the samplers are processed
- Different KV cache implementations
- Cache quantization
- Different techniques to split tensors across GPUs

Even using CUDA vs METAL etc can have an impact. And it doesn't help the HF releases are often an afterthought, so you get models released with the wrong chat template, etc.

Here's a perplexity chart of the SOTA (exllamav3) vs various other quants: https://cdn-uploads.huggingface.co/production/uploads/6383dc174c48969dcf1b4fce/QDkkQZZEWzCCUtZq0KEq3.png
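
To make the sampler-order point concrete, here is a small self-contained sketch. The logits are invented and the function names are hypothetical; it is not how any particular engine implements its sampler chain, just a demonstration that temperature followed by top-p and top-p followed by temperature can keep different candidate tokens from the same logits.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_keep(probs, p):
    """Indices of the smallest set of most-likely tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return kept

def temperature_then_top_p(logits, temperature, p):
    """Scale logits by temperature first, then apply the top-p cutoff."""
    probs = softmax([x / temperature for x in logits])
    kept = top_p_keep(probs, p)
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def top_p_then_temperature(logits, temperature, p):
    """Apply the top-p cutoff at temperature 1, then rescale the survivors."""
    kept = top_p_keep(softmax(logits), p)
    probs = softmax([logits[i] / temperature for i in kept])
    return dict(zip(kept, probs))

# Invented logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.0]
print(temperature_then_top_p(logits, temperature=2.0, p=0.75))  # keeps tokens 0, 1, 2
print(top_p_then_temperature(logits, temperature=2.0, p=0.75))  # keeps tokens 0, 1
```

Same weights, same settings, different set of tokens that can ever be sampled. Multiply that by every other item on the list and it is easy to see how two engines drift apart.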
View on Reddit #53197787

rorowhat@reddit

Crazy to think that an older model could get better with some other backend tuning.
View on Reddit #53201252

CheatCodesOfLife@reddit

Maybe an analogy could be like DVD releases.

Original full precision version at the studio.

PAL release has a lower framerate but higher resolution (GGUF)

NTSC release has a higher framerate but lower resolution (ExllamaV2)

Years later we get a bluray release in much higher quality (but it can't exceed the original masters)
View on Reddit #53201562

rorowhat@reddit

Not sure, I mean the content is the same (the movie) just the eye candy is lowered. In this case it looks like a whole other movie is playing till they fix it.
View on Reddit #53225239

Thomas-Lore@reddit

With MoE, for example, how the experts are mixed is separate from the model.
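
A toy illustration of why that matters, with made-up router logits and scalar "expert outputs". Both functions below pick the same top-k experts from the same router weights but normalize the gate values differently; which convention any given model (including Llama 4) expects is not something this sketch claims to know, and the function names are hypothetical.

```python
import math

def softmax(xs):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_softmax_then_topk(router_logits, k):
    """Softmax over all experts, then keep the top k (gates sum to less than 1)."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return {i: probs[i] for i in top}

def route_topk_then_softmax(router_logits, k):
    """Keep the top k experts, then softmax over just those (gates sum to 1)."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    gates = softmax([router_logits[i] for i in top])
    return dict(zip(top, gates))

def mix(expert_outputs, gates):
    """Weighted sum of the selected experts' outputs."""
    return sum(g * expert_outputs[i] for i, g in gates.items())

# Made-up values: four experts, each producing a single number for one token.
router_logits = [2.0, 1.0, 0.0, -1.0]
expert_outputs = [1.0, 2.0, 3.0, 4.0]

gates_a = route_softmax_then_topk(router_logits, k=2)
gates_b = route_topk_then_softmax(router_logits, k=2)
print(mix(expert_outputs, gates_a))
print(mix(expert_outputs, gates_b))
```

Both variants select experts 0 and 1, yet the mixed output differs. If a backend picks the wrong convention, nothing crashes; the results are just quietly worse.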
View on Reddit #53181648

ShengrenR@reddit

Think of it as model weights + code = blue-print, but the back end actually has to go through and put the thing together correctly - where architectures are common and you can more or less build it with off the shelf parts, you're good; pipe a goes here. But if it's a new architecture, some translation may be needed to make it work with how outside frameworks typically try to build things.. does that thing exist in llama.cpp, or huggingface transformers, or just pytorch? That said, it's awfully silly for an org the size of meta to let something like that go un-checked - I don't know the story of why it was released when it was, but one would ideally have liked to kick a few more tires and verify that 'partners' were able to get the same base-line results as a sanity check.
View on Reddit #53177882

imDaGoatnocap@reddit

The 2nd paragraph
View on Reddit #53174497

rorowhat@reddit

Doesn't help
View on Reddit #53175582

imDaGoatnocap@reddit

It means fixing implementation bugs on the various providers that are hosting the model, which cannot be run locally without 20k GPUs. Hope this helps.
View on Reddit #53176575

burnqubic@reddit

weights are weights, system prompt is system prompt. temperature and other factors stay the same across the board. so what are you trying to dial in? he has written too many words without saying anything.
View on Reddit #53189453

the320x200@reddit

Running models is a hell of a lot more complicated than just setting a prompt and a few knobs... If you don't know the details, it's because you're only using platforms that do all the work for you.
View on Reddit #53191185

TheHippoGuy69@reddit

Just go look at their special tokens and see if you have the same thoughts again.
View on Reddit #53216647

burnqubic@reddit

except i have worked on llama.cpp and know what it takes to translate layers. my question is, how do you release a model to businesses to run with no standards to follow?
View on Reddit #53193810

RipleyVanDalen@reddit

Your comment would be more convincing with examples.
View on Reddit #53192024

terminoid_@reddit

if you really need examples for this go look at any of the open source inference engines
View on Reddit #53192648

sid_276@reddit

There are a lot of things you need to figure out. And btw, expecting the same quality across inference frameworks is wrong. Each has quirks and performance/quality trade-offs. Some things that you need to tune:

- interleaved attention
- decoding sampling (Top P, beam, nucleus)
- repetition penalty
- mixed FP8/bf16 inference
- MoE routing
- …

To be clear, this is the first MoE Llama w/o RoPE and native multimodal projections. If that means anything to you at all. Quite a few.
View on Reddit #53208490

LaguePesikin@reddit

not true… see how hard both vLLM and SGLang tried to implement DeepSeek R1 inference
View on Reddit #53193965

GFrings@reddit

"it will take several days for the public implementations to get dialed in" Lol what does that mean? We're supposed to allow a rest period after cooking the models?
View on Reddit #53184249

the320x200@reddit

Bugs. They are referring to serving platform bugs.
View on Reddit #53191396

fkenned1@reddit

Can I just say, it's so incredible to see all these people, like in this community for example, who seem to know so much about a technology that we as humans barely understand. Like, there's so much knowledge out there on how to implement these tools, from a technical standpoint, all while I'm barely keeping up with tech announcements. It's impressive. Kudos to all of you more tech savvy individuals, really diving deep into these tools!
View on Reddit #53186645

YouDontSeemRight@reddit

Nice, these things can take time. Looking forward to testing it myself, but waiting for support to roll out. The issue was their initial comparisons though... I think they were probably pretty honest, so can't expect more than that. Hoping they can dial it into a 43B equivalent model and then figure out how to push it to the maximum, whatever that might be. Even a 32B equivalent model would be a good step. Good job nonetheless getting it out the door. It's all in the training data though.
View on Reddit #53181932

Exelcsior64@reddit

Give it a week, and we're going to see how test sets "accidentally" got into the training data.
View on Reddit #53173826

Healthy-Nebula-3603@reddit

Great results?? Lol
View on Reddit #53173395

Federal-Effective879@reddit

Dupe of [https://www.reddit.com/r/LocalLLaMA/comments/1jts2hq/were\_also\_hearing\_some\_reports\_of\_mixed\_quality/](https://www.reddit.com/r/LocalLLaMA/comments/1jts2hq/were_also_hearing_some_reports_of_mixed_quality/)
View on Reddit #53166589

davewolfs@reddit

Holy fuck Zuck
View on Reddit #53166556