
Is Reflection 70B the most powerful open source LLM or a scam?

Matt Shumer, founder and CEO of HyperWrite, announced that his latest model, Reflection 70B, uses a simple trick to curb LLM hallucinations and delivers impressive benchmark results, beating larger and even closed-source models like GPT-4o.

Shumer worked with synthetic data provider Glaive to create the new model, which is based on Meta's Llama 3.1-70B Instruct model.

In the launch announcement on Hugging Face, Shumer said, “Reflection Llama-3.1 70B is (currently) the world's best open source LLM, trained using a new technique called Reflection-Tuning, which teaches an LLM to detect mistakes in its reasoning and correct course.”

If Shumer has really found a way to solve the AI hallucination problem, that would be incredible. The benchmarks he shared seem to indicate that Reflection 70B is far ahead of other models.

Reflection 70B benchmark results provided by Matt Shumer. Source: Hugging Face

The model's name refers to its ability to self-correct during inference. Shumer doesn't give too much away, but explains that the model reflects on its initial response to a prompt and only outputs it once it is confident it is correct.

Shumer says a 405B version of Reflection is in the works and will outshine other models, including GPT-4o, when it is unveiled next week.

Is Reflection 70B a scam?

Is this all too good to be true? Reflection 70B is available for download on Hugging Face, but early testers were unable to reproduce the impressive performance shown in Shumer's benchmarks.

The Reflection playground lets you try the model, but currently says the demo is temporarily unavailable due to high demand. The prompt suggestions “Count 'r's in strawberry” and “9.11 vs. 9.9” imply that the model answers these notoriously tricky prompts accurately. However, some users claim that Reflection was specifically tuned to answer exactly these prompts.
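For context, both suggested prompts have unambiguous ground truth that LLMs frequently get wrong. A quick sketch in plain Python (illustrative only, not related to how Reflection is evaluated) shows the correct answers and the failure mode behind the second prompt:

```python
# "Count 'r's in strawberry": tokenization makes letter counting
# surprisingly hard for LLMs, but it is trivial in code.
print("strawberry".count("r"))  # 3

# "9.11 vs. 9.9": models often compare the parts as integers,
# like version numbers (11 > 9), instead of as decimals.
print(9.11 > 9.9)  # False -- as a decimal, 9.9 is larger

# Version-style comparison, which mirrors the typical LLM mistake:
a = [int(x) for x in "9.11".split(".")]
b = [int(x) for x in "9.9".split(".")]
print(a > b)  # True -- [9, 11] > [9, 9] as a "version number"
```

A model that has merely memorized these two prompts can answer them correctly without any general improvement in reasoning, which is why testers treated them as weak evidence.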

The Reflection playground is currently unavailable. Source: Reflection playground

Some users questioned the impressive benchmarks. In particular, the GSM8K score of over 99% looked suspicious.

Some of the ground-truth answers in the GSM8K dataset are actually wrong. In other words, the only way to score over 99% on GSM8K would be to reproduce the same flawed answers to those problems.
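The arithmetic behind this suspicion is simple. A short sketch, using the real size of the GSM8K test split but a hypothetical count of flawed labels (the true number is not established here):

```python
total = 1319   # size of the GSM8K test split
flawed = 14    # ASSUMPTION: suppose ~1% of ground-truth labels are wrong

# A model that answers every question *correctly* still "misses"
# every flawed label, capping its measurable accuracy:
honest_ceiling = (total - flawed) / total
print(f"{honest_ceiling:.1%}")  # under this assumption, ~98.9%

# So a reported score above that ceiling would mean the model gave
# the same wrong answers as the flawed labels -- a hallmark of
# training on the test set rather than genuine reasoning.
```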

After some testing, users reported that Reflection actually performed worse than Llama 3.1 and appeared to be just Llama 3 with LoRA tuning applied.

User tests show that Reflection 70B performs worse than the models Shumer claims to outperform. Source: X

In response to the negative feedback, Shumer posted a statement on X saying, “Quick update — we re-uploaded the weights but there's still an issue. We just started training again to rule out any potential problem. Should be resolved soon.”

Shumer explained that there was a bug with the API and that it was being worked on. In the meantime, he provided access to a secret, private API so that doubters could try out Reflection while the fix was underway.

And this is exactly where things fell apart: after some careful questioning, it became clear that the API was actually just a wrapper around Claude 3.5 Sonnet.

In subsequent tests, the API reportedly returned output from Llama and GPT-4o. Shumer insists the original results are correct and that they are working to fix the downloadable model.

Are skeptics being premature in calling Shumer a fraud? Perhaps the release was just poorly handled and Reflection 70B really is a groundbreaking open source model. Or perhaps it's another example of AI hype designed to raise venture capital from investors looking for the next big thing in AI.

We'll have to wait a day or two to see how things develop.
