TheaterFire

Updated benchmarks from Artificial Analysis using Reflection Llama 3.1 70B. Long post with good insight into the gains

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 144 comments

Reply to Post

144 Comments

Environmental-Car267@reddit

Haters gonna hate. "All that being said: if applying reflection fine-tuning drives a similar jump in eval performance on Llama 3.1 405B, we expect Reflection 405B to achieve near SOTA results across the board."
View on Reddit #35154774

Many_SuchCases@reddit

I don't hate the guy though. But when someone makes us all look like fools it is not appreciated. And he still hasn't provided us with anything. You're acting like this tweet somehow solves the fact that he still hasn't provided anything to us. Look at what it actually says: > These were tested on a private API version and not an open-weights version. So how do we know this isn't a model with a custom prompt? Why not just open the directory and make a torrent? It would solve everything in like 5 minutes.
View on Reddit #35156461

alongated@reddit

I think its fair to say that people here over reacted, both about how good this was, and how bad this was.
View on Reddit #35157402

RandoRedditGui@reddit

Not really. The "how bad this was" are still easily winning considering we have seen 0 open weights and are provided some ambiguous results from an API that we have no clue the validity of. Open weights or GTFO.
View on Reddit #35169072

alongated@reddit

It has been fucking 6 hours since he trained the model, give the man a fucking break, and guess what he released the weights? Normally I don't get this angry but holy shit you people are fucking insane.
View on Reddit #35175577

showdontkvell@reddit

Matt, that you? lol
View on Reddit #35200448

alongated@reddit

I just went off on a guy for calling someone Matt. I'm not Matt, but doxxing isn't funny. Like you might be right that this is all just bullshit/scam. But you are attacking people for reserving their judgement. That is disgusting mob mentality.
View on Reddit #35208379

showdontkvell@reddit

lol k
View on Reddit #35410768

StartledWatermelon@reddit

Fair concerns. For all we know, under the hood this API could redirect queries to Claude-3.5 Sonnet with a specific system prompt, or another SotA proprietary model.
View on Reddit #35157285

dalkef@reddit

Now that you mention it, it gave me very similar answers to sonnet on the initial demo chat. This could explain the performance drop
View on Reddit #35157953

Hatter_The_Mad@reddit

You are not alone in this feeling
View on Reddit #35166945

jd_3d@reddit (OP)

I don't understand why people need to pile on the hate so quick. Is it really that hard to just reserve judgment for a few weeks and see what comes of it? This avenue of applying more test time compute is a very promising direction to me and could be a great way for open source models to exceed closed models that don't want to spend the $$$ on each request.
View on Reddit #35155113

kryptkpr@reddit

We all tried it, it's performance on real world tasks is terrible despite the high benchmarks. Maybe the model is still broken in some way like they've been claiming and really is good but I don't see it.
View on Reddit #35155415

alongated@reddit

Gemma also had problems, they took weeks to resolve. This team is much smaller.
View on Reddit #35158734

kryptkpr@reddit

I disagree here. Gemma was a novel architecture with an attention mechanism that wasn't well supported. This is a fine-tune of Llama. There is nothing to resolve, they're playing us for fools.
View on Reddit #35159667

alongated@reddit

And his team didn't have billion dollars.
View on Reddit #35175197

Evening_Ad6637@reddit

Bro, seriously? Man, aside from the supposedly broken model and the super-duper-secret private api shit: this guy didn't know the difference between llama 3 and llama 3.1 I mean, by now even my grandmother should know the difference. The model he posted on huggingface was absolutely not broken, it was simply a llama-3, as you would expect from a llama-3. There is zero evidence that there was anything wrong with him. I'm slowly coming to the conclusion that this guy is either stupid and narcissistic enough to believe he can fool the world in such a simple way - or, another possible explanation could be: he himself has been the victim of a scam. Perhaps he doesn't have direct access to the backend of this ominous private API himself. Perhaps he still hasn't realized that he has been misused as a puppet and ruined economically and in terms of marketing with this action. It wouldn't be the first time that someone had economic enemies and fell into a trap. The whole thing is highly suspicious and whether he is a victim himself or not, whether he is stupid or not: he clearly also seems to want to lie and hide things! So there are neither excuses nor pity for this egomaniacal behavior.
View on Reddit #35173367

alongated@reddit

Holy shit you are arrogant. Maybe he made a typo maybe his computer malfunctioned idno just stfu and wait for a response before spouting bullshit? Talk about narcissism reread your comment and reflect.
View on Reddit #35175003

muxxington@reddit

I made some quick tests with this model yesterday and actually it performed not that bad. But I can't compare it with Claude or OpenAI, I don't use them. [https://huggingface.co/bartowski/Reflection-Llama-3.1-70B-GGUF/blob/main/Reflection-Llama-3.1-70B-Q4\_K\_M.gguf](https://huggingface.co/bartowski/Reflection-Llama-3.1-70B-GGUF/blob/main/Reflection-Llama-3.1-70B-Q4_K_M.gguf)
View on Reddit #35160563

Sadman782@reddit

Don't you guys get that the model is broken, he said? The tests were based on private API, now yeah, you might not trust that the model, maybe model behind this is different, then it's okay, but as per Matt, the model everyone downloaded from HF is broken, and yeah, I tried too, it is far worse than LLaMA 3.1 70b.
View on Reddit #35155600

nero10578@reddit

I don’t understand how you can fuck up uploading to HF lol
View on Reddit #35159656

kryptkpr@reddit

I mean it's a diabolical plan: Release an "open" model that crushes benchmarks, but then don't actually release working weights and instead just point to your "private API" that produces those results. I can't test his "private API" can I? The whole thing smells bad, as far as I'm concerned this is a publicity stunt to advertise his LLM service.
View on Reddit #35155799

Environmental-Car267@reddit

He offered on twitter the api model to people who want to benchmark it. soon it will be updated on HF etc
View on Reddit #35156081

kryptkpr@reddit

I'm an open source leaderboard maintainer, without weights any test results are just a free ad for his service. I do benchmark the big APIs for reference but no interest in starting to do it for every tom dick and harry, when weights are fixed I'll try again.
View on Reddit #35156417

alongated@reddit

Gemma had similar problems that took weeks to resolve. This team is much smaller.
View on Reddit #35158683

ilangge@reddit

No need to guess, it is now publicly accessible on hf; [Reflection 70B llama.cpp (Correct Weights) - a Hugging Face Space by gokaygokay](https://huggingface.co/spaces/gokaygokay/Reflection-70B-llamacpp)
View on Reddit #35263167

reevnez@reddit

How do we know that "privately hosted version of the model" is not actually Claude?
View on Reddit #35155066

Wiskkey@reddit

Perhaps somebody with an X account could request a prompt inquiring the model about its identity at [this X post from a user with \~180,000 X followers ](https://x.com/DotCSV/status/1832702433329389839)who purportedly [has been](https://x.com/mattshumer_/status/1832566622801780865) given API access to the good model by Matt Shumer.That account has posted a number of purported responses to various prompts by the good model.
View on Reddit #35156231

dotcsv@reddit

[https://x.com/DotCSV/status/1832904408188805429](https://x.com/DotCSV/status/1832904408188805429)
View on Reddit #35177222

Sm0g3R@reddit

lmao you can't be serious. It literally told it's taking this info from a system prompt.
View on Reddit #35204441

StevenSamAI@reddit

What would the point be? I get that they want to declare they have a great model based on using their platform to generate data, and everyone is just saying it's a scam or trick, but think it through. No one will just believe it until others third parties have independently verified it, which several will. And if everyone disproves it, then it will massively harm the valuation and growth of the company they are trying to promote. I'm not saying I automatically think the model is amazing, although the concept is built on strong donations and has been around for a while, I'm just saying it would be a really bad publicity stunt and a huge reputational risk.
View on Reddit #35155920

waxroy-finerayfool@reddit

why would someone lie and scam?? what could they possibly have to gain?? lol
View on Reddit #35179796

StevenSamAI@reddit

I fully understand why someone would like and scam... But lying about something to everyone at once in a community that tests and communicates within hours of a release, about something where the claims can be disproven and widely reported... Seems like a scam that does nothing apart from having a negative effect on reputation.
View on Reddit #35187417

cuyler72@reddit

You underestimate how gullible people are and how long he can draw this out.
View on Reddit #35158266

StevenSamAI@reddit

Cool... I should have mentioned my latest fine tune gets 101% on all benchmarks, and also created its own benchmark... If you want me to tell you the HF model name just send me a bitcoin
View on Reddit #35187082

htrowslledot@reddit

I take back what I said earlier looks like you were right
View on Reddit #35185520

htrowslledot@reddit

It's a very odd con considering he committed to release it a couple of days
View on Reddit #35155523

Thomas-Lore@reddit

With how scam works in a few days he will say he almost got it working but there are still issues and he needs addition two weeks and so on and so on. It works for that Italian guy who peddles cold fusion for decades now.
View on Reddit #35164255

extopico@reddit

Tesla FSD, for example.
View on Reddit #35177309

htrowslledot@reddit

He quietly released this https://huggingface.co/mattshumer/ref_70_e3, he said he was going to run the benchmarks again before re-announcing it. If he's aiming to drag things out he's doing a pretty shit job.
View on Reddit #35167155

ivykoko1@reddit

This is exactly what this guy and many other AI bros are doing
View on Reddit #35165013

htrowslledot@reddit

He quietly released this https://huggingface.co/mattshumer/ref_70_e3, he said he was going to run the benchmarks again before re-announcing it.
View on Reddit #35167095

Inevitable-Start-653@reddit

I'm downloading their epoch 3 version and can run it locally without quantization, there will be a lot of people like me probing and testing.
View on Reddit #35172337

ozzeruk82@reddit

I was thinking this earlier! It would be a clever con. I was thinking maybe it’s using the OpenAI fine tuning service. Until we get weights that equal what they have in their benchmarks I guess it’s a possibility.
View on Reddit #35170593

Waste-Button-5103@reddit

Because it’s unlikely he’d risk his entire reputation along with glaive on something easily disproven
View on Reddit #35169450

TGSCrust@reddit

The official playground (when it was up) personally felt like it was Claude. Just a gut feeling though, I could be totally wrong.
View on Reddit #35156796

meister2983@reddit

Really? To me, it felt way too dumb to be Claude. It pretty much was llama 3.1 70b in behavior - I struggled to find any obvious real world question performance above it. 
View on Reddit #35164543

TGSCrust@reddit

I didn't say it was necessarily smarter, the response style was very similar to Claude though. It's probably a bad system prompt.
View on Reddit #35164921

mikael110@reddit

This conversations reminds me that [somebody](https://www.reddit.com/r/LocalLLaMA/comments/1f9um6s/comment/llpdj5x/) noticed that the demo made calls to an endpoint called "openai\_proxy" while I was one of the people explaining why that might not be as suspicious as it sound on the surface, I'm now starting to seriously think it might actually have been exactly what it sounded like. The fact that he has decided to retrain the model instead of just uploading the working model he is hosting privately is just not logical at all unless he literally cannot upload the private model. Which would be the case if he is just proxying another model.
View on Reddit #35160788

PraxisOG@reddit

Giving them the benefit of the doubt, what if the training data is Claude generated, influencing how the model sounds?
View on Reddit #35159997

TGSCrust@reddit

He claims there isn't any Anthropic data. https://x.com/mattshumer_/status/1832203011059257756#m ( if I had more time on the playground, I could've confirmed whether it was Claude or not :\ )
View on Reddit #35160401

mikael110@reddit

This reminds me that [somebody](https://www.reddit.com/r/LocalLLaMA/comments/1f9um6s/comment/llpdj5x/) noticed that the demo made calls to an endpoint called "openai\_proxy" while I was one of the people explaining why that might not be as suspicious as it sound on the surface, I'm now starting to seriously think it might actually have been exactly what it sounded like. The fact that he has decided to retrain the model instead of just uploading the working model he is hosting privately is just not logical at all unless he literally cannot upload the private model. Which would be the case if he is just proxying another model.
View on Reddit #35159577

Sadman782@reddit

MMLU is 84% on standard prompt === llama 3.1 70B vs 88% claude 3.5 sonnet? So?
View on Reddit #35155177

h666777@reddit

Different prompt, temperature, etc. The simple fact is that they haven't released the "good" version of their model and have no reason to. This should be a 30 minute fix on the HuggingFace repo, no reason for it to not be available already. Also this isn't a full replications of their results, on the original post they claimed it beat other models on almost everything and we see it isn't quite like that. Until the open weights perform just as well as this suspiciously private, researcher only API we are better off staying skeptical. Still looks like a scam to me.
View on Reddit #35156014

Sadman782@reddit

It almost replicated except MMLU (2% behind), "MMLU: 87% (in line with Llama 405B), GPQA: 54%, Math: 73%." Quite close to Sonnet and other SOTA. But it is okay, there is something he's definitely hiding, but I kinda feel this is really achieved by them with reflection. Let's wait and see.
View on Reddit #35156233

Significant-Nose-353@reddit

It seems to me that with a thorough benchmark they could have spotted something like this, the current models leak their cues and promts very easily
View on Reddit #35155875

cuyler72@reddit

The model he posted to hugging face has been shown to be llama-3 with a Lora applied, he's claiming it's definitely llama-3.1 and he doesn't know what a Lora is. The "private API" could be any model and I see no reason to trust these apparently 3rd party benchmarkers. This is 100% a scam
View on Reddit #35158501

Educational_Rent1059@reddit

Guy with 0 background, no idea what LORA is, "wrong" weights uploaded, "wrong" model name promoted, "my cat ate my model i'll release the real one next week", does not disclose he has ownership in the company he promotes, the model outputs garbage with 4x more tokens generation, sounds legit to me. :)
View on Reddit #35166029

Waste-Button-5103@reddit

He knows what it is check his post history. He didn’t understand “LORAing” in the context used. He stated his ownership in the company is a $1000 investment lol.
View on Reddit #35169704

Educational_Rent1059@reddit

[https://www.reddit.com/r/LocalLLaMA/comments/1fc7avd/reflection\_api\_is\_a\_sonnet\_35\_wrapper\_with\_prompt/?share\_id=wVk5-zyZjs5cLftSnEI0c](https://www.reddit.com/r/LocalLLaMA/comments/1fc7avd/reflection_api_is_a_sonnet_35_wrapper_with_prompt/?share_id=wVk5-zyZjs5cLftSnEI0c) Yeah sure, perfectly legit
View on Reddit #35174344

Waste-Button-5103@reddit

Yeah soo likely that it’s a wrapper and two guys with reputation and multiple companies are going to lie about it and ruin their lives for literally zero reason. Surely you can see that it is way more likely they used a dataset generated from claude to create the reflection template.
View on Reddit #35177889

Evening_Ad6637@reddit

The wrapper had exactly the same tokenizer as Claude sonnet 3.5 and at same time it was shown that it nothing in common with Lama's tokenizer
View on Reddit #35200105

gibs@reddit

> He stated his ownership in the company is a $1000 investment Well since he stated it, it must be true
View on Reddit #35188292

ArtyfacialIntelagent@reddit

And thinks rules don't apply to big shot entrepreneurs like him. "I Couldn't Play By The Rules, So I Became an Entrepreneur", by Matt Schumer. https://www.newsweek.com/i-couldnt-play-rules-so-i-became-entrepreneur-opinion-1448286
View on Reddit #35171791

Inevitable-Start-653@reddit

Where does he say he doesn't know what a lora is?
View on Reddit #35172408

cuyler72@reddit

"https://x.com/mattshumer_/status/1832558298509275440" "4. Not sure what LORAing is "
View on Reddit #35174149

Inevitable-Start-653@reddit

I've made many loras myself and I don't know what loraing is either
View on Reddit #35174375

mckirkus@reddit

LLMK-99
View on Reddit #35171612

cuyler72@reddit

Reflect-99.
View on Reddit #35174168

physalisx@reddit

Yep, sure sounds like a scam.
View on Reddit #35172340

Waste-Button-5103@reddit

He knows what a lora is and you can check his history to see him using them. He was talking specifically about the term “LORAing” in the context. 0% chance its a scam it wouldn’t make sense to risk his reputation on something easily disproven
View on Reddit #35169550

Inevitable-Start-653@reddit

I'm ready to download and test!!
View on Reddit #35162258

jd_3d@reddit (OP)

Let us know what you think: [https://huggingface.co/mattshumer/ref\_70\_e3](https://huggingface.co/mattshumer/ref_70_e3)
View on Reddit #35169428

Inevitable-Start-653@reddit

https://www.reddit.com/r/LocalLLaMA/comments/1fcerck/reflection_ref_70_e3_refuses_to_output_meta_tag/ I've been doing some testing, but the community does not seem interested in objective facts. My post keeps getting downvoted, I'm sure it will be off the front page soon.
View on Reddit #35191834

4hometnumberonefan@reddit

This is giving me a roller coaster of emotions.
View on Reddit #35156822

hleszek@reddit

Reminds me of the LK99 potential room-temperature superconductor. We're so back!
View on Reddit #35157693

TouristDelicious8351@reddit

we're so cooked :')
View on Reddit #35184566

KillerX629@reddit

Wasn't that disproved?
View on Reddit #35160298

JamesAQuintero@reddit

Yeah that's the point, there was the initial announcement of it, then some researchers were like "We are somewhat able to replicate the results", but then it was eventually proven to not work
View on Reddit #35166642

OXKSA1@reddit

sorry, care to elaborate?
View on Reddit #35159813

Cantflyneedhelp@reddit

Two years ago(?) there was a paper / video of a supposed room temperature superconductor (they had a sweet floating rock too). And everyone was like "Yeah that's bullshit." But then some hobby chemists were like "Actually I managed to recreate a small port of it from their paper, and it floats too." and this started a race to recreate it by a lot of laboratories around the world. At the end it was not a room temperature superconductor but they managed to find some new stuff.
View on Reddit #35161486

RandoRedditGui@reddit

It shouldn't. It's still as B.S. as yesterday until it's not just the API.
View on Reddit #35159385

Ivo_ChainNET@reddit

The reddit hivemind always goes too hard one way or the other
View on Reddit #35158486

synn89@reddit

> does not suffer from the issues with the version publicly released on Hugging Face It's not rocket science to upload a model to Hugging Face. It's very suss that they can't seem to upload a BF16 or GGUF of a fine tuned Llama to Hugging Face that can be properly tested.
View on Reddit #35181938

celsowm@reddit

is there any place to test it online?
View on Reddit #35166856

Wiskkey@reddit

And here: https://x.com/OpenRouterAI/status/1832880567437729881.
View on Reddit #35172574

celsowm@reddit

I live in the dictatorship of Brazil and x/twitter is blocked by our dictator
View on Reddit #35181674

Wiskkey@reddit

Yes supposedly [here](https://x.com/Yuchenj_UW/status/1832865464827204065).
View on Reddit #35170112

ambient_temp_xeno@reddit

It supposedly the new one but it's as crap as the one I downloaded....
View on Reddit #35172003

strubenuff1202@reddit

I think it was just uploaded to huggingface
View on Reddit #35167642

gamedevgrunt@reddit

Try it out for yourself here [https://x.com/OpenRouterAI/status/1832880567437729881](https://x.com/OpenRouterAI/status/1832880567437729881)
View on Reddit #35176559

ihaag@reddit

I see no difference for deepseek 2.5 the current best model for open source.
View on Reddit #35174972

AnomalyNexus@reddit

meh...so it beats other comparable models when comparison is set up as apples to oranges conditions...
View on Reddit #35174610

ispeakdatruf@reddit

> The model seems to be achieving these results through forcing an output ‘reflection’ response where the model always generates scaffolding of <thinking>, <reflection>, and <output>. In doing this it generates more tokens than other models do on our eval suite with our standard ‘think step by step’ prompting. > For example, it appears that Reflection 70B is not capable of ‘just responding with the answer’ in response to an instruction to classify something and only respond with a one word category. One can always add a postprocesssor on top of Reflection to filter out everything before \<output\>, problem solved. I don't like this nitpicking. Who cares if a model outputs 10 tokens or 100? Is the answer correct or not??
View on Reddit #35170772

athirdpath@reddit

> Who cares if a model outputs 10 tokens or 100? Folks who care if inference takes 5 seconds or 50.
View on Reddit #35173868

nihalani@reddit

For real time inference is the real issues, your time to first token jumps by a huge margin if you have to wait for 2000 tokens to be generated of the model reflecting. Might also explain why the cloud providers haven’t adopted it yet.
View on Reddit #35173678

Sadman782@reddit

[https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B-ep2-working](https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B-ep2-working) It seems an Epoch 2 model was released a few hours ago silently: "Epoch 2, still finishing up Epoch 3. This should be slightly less powerful, but still pretty close." [https://x.com/DavidFSWD/status/1832826469590130955](https://x.com/DavidFSWD/status/1832826469590130955) someone confirmed it is working perfectly
View on Reddit #35159766

physalisx@reddit

Why is anything getting retrained? Where is the model that he allegedly already had?
View on Reddit #35171985

Sadman782@reddit

nope bro, stop hating now, it is released now, open router: [https://x.com/mattshumer\_/status/1832881734230192442](https://x.com/mattshumer_/status/1832881734230192442) , [https://huggingface.co/mattshumer/ref\_70\_e3](https://huggingface.co/mattshumer/ref_70_e3) hugging face
View on Reddit #35172653

Inevitable-Start-653@reddit

Downloading this now: https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B-ep2-working can run locally with my own setup, and am interested in testing it out!
View on Reddit #35162501

jd_3d@reddit (OP)

Newer version is out (Epoch 3): [https://huggingface.co/mattshumer/ref\_70\_e3](https://huggingface.co/mattshumer/ref_70_e3)
View on Reddit #35169499

Inevitable-Start-653@reddit

Thanks!! Will download this one now ☺️ for all the downloading this still isn't anywhere near as bad as llama405b...that sucker was a multi day download and I needed to download it twice after they updated their repo too.
View on Reddit #35171863

Sadman782@reddit

https://huggingface.co/mattshumer/ref_70_e3 epoch 3 released now, maybe he will announce it soon
View on Reddit #35165815

ivykoko1@reddit

Don't waste your time
View on Reddit #35164323

Deathmax@reddit

I wouldn't bother, it doesn't even try to output the tags the "broken" model outputs, so no idea what they mean by "working". https://preview.redd.it/wgvki1etpmnd1.png?width=1792&format=png&auto=webp&s=85ac2c1c112810a61f861807a9f3ee0310b10cc5
View on Reddit #35163600

Inevitable-Start-653@reddit

Hmm 🤔 ...this whole saga is so strange. Download will finish in a lil bit, I've got to try it out, I got the very first upload to work but only could get one response out.
View on Reddit #35163783

mantafloppy@reddit

You would not push false hype again? Why are ppl upvoting this again. Are "The boy who cry wolf" that obscure of a story, or do you also have bot to upvote yourself? https://www.reddit.com/r/LocalLLaMA/comments/1fa4y7q/first_independent_benchmark_prollm_stackunseen_of/?utm_source=share&utm_medium=mweb3x&utm_name=mweb3xcss&utm_term=1&utm_content=share_button
View on Reddit #35170505

Jordangnr@reddit

Given the impressive translation capabilities of LLMs, wouldn't it be possible to utilize a new, highly compressed language for the reasoning process and then switch to natural language for the final answer?
View on Reddit #35168176

Thistleknot@reddit

For all of what was said, is it not possible to train on the <input> and <output> as if it was an answer and skip all the inbetween?
View on Reddit #35164441

ivykoko1@reddit

This is all still BS I cannot believe y'all are falling for it again
View on Reddit #35164120

khubebk@reddit

It's brilliant at zero shot reasoning tasks , would definitely try out more difficult mcq questions . I don't know how but the model I used via open router was fantastic and could not relate to its criticism. It gave me better responses than other top models , I only tried out zero shot reasoning tasks though No story telling,coding etc
View on Reddit #35163093

ambient_temp_xeno@reddit

Don't care; release weights or go away.
View on Reddit #35156098

julioques@reddit

What do you mean? Isn't it already downloadable??? Why do you have so many upvotes
View on Reddit #35158117

ambient_temp_xeno@reddit

We're waiting on the super-secret good weights. Seriously.
View on Reddit #35158280

julioques@reddit

Isn't it the same weights?
View on Reddit #35158491

ambient_temp_xeno@reddit

https://preview.redd.it/j3bt4ccbdmnd1.png?width=712&format=png&auto=webp&s=b8bc30ae79c73cd90bf81f0c3e84ccd5a81166bc
View on Reddit #35159077

julioques@reddit

I thought the difference was from the different prompt method. Why didn't they just use the released version with the refection's default system prompt like they used now?
View on Reddit #35161407

ambient_temp_xeno@reddit

The whole thing is very weird and annoying. They supposedly uploaded the model to HF incorrectly, so naturally the solution was to completely redo the finetune? I have no idea.
View on Reddit #35162477

mikael110@reddit

There are public weights, but they are subpar. And this test was not performed on them: >Since our first testing of the public release version of Reflection Llama 3.1 70B, [u/mattshumer\_](https://x.com/mattshumer_) has shared access to a privately hosted version of the model that does not suffer from the issues with the version publicly released on Hugging Face. Now why Matt can't just upload the privately hosted version to HF is a mystery. But he has instead opted to retrain the model and is currently posting new checkpoints to HF.
View on Reddit #35158779

Sadman782@reddit

[https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B-ep2-working](https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B-ep2-working) It seems an Epoch 2 model was released a few hours ago silently: "Epoch 2, still finishing up Epoch 3. This should be slightly less powerful, but still pretty close." [https://x.com/DavidFSWD/status/1832826469590130955](https://x.com/DavidFSWD/status/1832826469590130955) someone confirmed it is working perfectly
View on Reddit #35159780

alongated@reddit

Gemma had problems that lasted for weeks. This team is much much smaller.
View on Reddit #35158551

1889023okdoesitwork@reddit

Epoch 2 seems already uploaded on his huggingface
View on Reddit #35158008

Kathane37@reddit

I only care about this for the possibility too generate better synthetic data with step by step reasoning Other than that there is no point in making the token consumption exponential
View on Reddit #35162294

This_Organization382@reddit

"Oh, the benchmark didn't work? Let's see what tests you used..." *Scrambles to train the model on the test data* "Here you go, try the private API version"
View on Reddit #35159676

Sadman782@reddit

"When using Reflection’s default system prompt and extracting answers only from within Reflection’s <output> tags, results show substantial improvement: MMLU: 87% (in-line with Llama 405B), GPQA: 54%, Math: 73%." Actually, what is presented on the chart is based on their standard system prompt. It achieves performance close to Claude 3.5's sonnet-like output with the Reflect system prompt. If Groq hosts it, latency will not be an issue. We're just waiting for the actual weights to be released
View on Reddit #35155107

a_beautiful_rhind@reddit

What about testing the untuned model with a similar COT system prompt?
View on Reddit #35158385

Sadman782@reddit

Will not match with it for sure. I tried many different system prompts, verbose thinking output + "step by step" at the prompt, but it couldn't pass any of my expert-level coding tests from Edabit, even the 405B failed one; GPT4 too. But the model (when the demo was live) in their demo nailed all of them.
View on Reddit #35158702

a_beautiful_rhind@reddit

I only got one or two replies off the demo before it got "overloaded" and turned off. It seemed *alright*. The demo on hyperbolic was absolute garbage and the model forgot about its COT tags within a few messages. All in all.. it seems like this dude has been stringing everyone else along whether there is some model or not. Even if you had slow internet, the excuses and the "retraining" now doesn't make sense. Everything is maximum hype and delay.
View on Reddit #35159270

Sadman782@reddit

This is the reason I am so positive about it, and defending lol, it hurts me when people say it's far worse due to a broken HF model. But yeah, we don't know for sure if the model behind the API is actually reflection 70b or not
View on Reddit #35158867

ILikeCutePuppies@reddit

Or celebras, which is 2x as fast as groq.
View on Reddit #35158721

vert1s@reddit

In other words vapour ware. He could be running an agent that hits multiple backends. The inability to actually publish the weights speaks volumes.
View on Reddit #35155611

Many_SuchCases@reddit

I love how people downvoted you without any explanation as for why he hasn't just uploaded the weights. HF messed it up? Sure buddy, then upload it to literally anywhere else or make a torrent.
View on Reddit #35156671

vert1s@reddit

It's baffling. People want to believe despite all the effort to the contrary.
View on Reddit #35157594

ArtyfacialIntelagent@reddit

https://i.redd.it/t5k5om55cmnd1.gif
View on Reddit #35158674

redjojovic@reddit

The chart below is based on our standard methodology and system prompt. When using Reflection’s default system prompt and extracting answers only from within Reflection’s <output> tags, results show substantial improvement: MMLU: 87% (in-line with Llama 405B), GPQA: 54%, Math: 73%.
View on Reddit #35157853

redjojovic@reddit

The chart below is based on our standard methodology and system prompt. When using Reflection’s default system prompt and extracting answers only from within Reflection’s <output> tags, results show substantial improvement: MMLU: 87% (in-line with Llama 405B), GPQA: 54%, Math: 73%.
View on Reddit #35157811

Significant-Nose-353@reddit

I think few people under this post, also zealously admit that they were a bit hasty with their toxic reaction
View on Reddit #35155126

htrowslledot@reddit

Got so much downvotes yesterday for just mentioning that he gave these access to the internal model
View on Reddit #35155430

StartledWatermelon@reddit

Shumer wasn't hesitating to claim it's the "world’s top \*open-source\* model" in the initial tweet. And now some "internal" model emerges? You certainly didn't deserve the downvotes. But the entire release event, from the beginning up to this date, was one big clusterf-k
View on Reddit #35156758

htrowslledot@reddit

Oh yeah super shady so far, this is a step in the right direction, hopefully he will release the corrected model in a day or two and people could start exploring what further gains there could be had with more test time compute or we know it was a scam and move on.
View on Reddit #35157308

vert1s@reddit

There is nothing toxic about questioning the validity given the inability of anyone to replicate with the released weights. The sheer number of problems including the lack of disclosure that he is invested in both companies that he’s been saying “helped”
View on Reddit #35155730

Significant-Nose-353@reddit

naturally, but I only had complaints in my comment about blatant hetness. Excessive sarcasm, irony and the like
View on Reddit #35156384

Many_SuchCases@reddit

> These were tested on a private API version and not an open-weights version. Admit what exactly? So far we're still correct that he hasn't given us any evidence of this being a model that he trained himself nor has he provided it to us. All we know is that there is a closed model somewhere that is seeing some improvements that aren't verifiable by the public. How do we know it's not 405B with a custom prompt? If you take 1 look at his twitter page for the last 2 days you can see the guy is making excuse after excuse. > "We uploaded it but suddenly it's not the same anymore" Give me a break, the guy could have just given us a torrent like right when drama this started but still hasn't. Why not? He even responded to someone suggesting this asking his coworker what he thought about it. Like why not just do it? It would literally take him like 5 minutes of creating the torrent (if even) and then just seed it.
View on Reddit #35156190

LiquidGunay@reddit

I would like to see a comparison by giving all the models similar inference compute boosts. One way to easily do this is to maybe give all the models a roughly similar token budget ( you can make multiple generations and vote for models that aren't as verbose as reflection)
View on Reddit #35156600