Matt Shumer, founder and CEO of HyperWrite, announced that his latest model, Reflection 70B, uses a simple trick to solve LLM hallucinations and delivers impressive benchmark results, beating larger and even closed models like GPT-4o.
Shumer worked with synthetic data provider Glaive to create the new model, which is based on Meta's Llama 3.1-70B Instruct model.
In the launch announcement on Hugging Face, Shumer said, “Reflection Llama-3.1 70B is (currently) the world's best open source LLM, trained using a brand new technique called reflection tuning, which teaches an LLM to detect errors in its reasoning and course correct.”
If Shumer has found a way to solve the AI hallucination problem, that would be incredible. The benchmarks he shared seem to indicate that Reflection 70B is well ahead of other models.
The model's name refers to its ability to self-correct during inference. Shumer doesn't give too much away, but explains that the model reflects on its initial response to a prompt and only outputs it once it's confident the answer is correct.
Shumer says a 405B version of Reflection is in the works and will outperform other models, including GPT-4o, when it's unveiled next week.
Is Reflection 70B a scam?
Is this all too good to be true? Reflection 70B is available for download on Hugging Face, but early testers were unable to reproduce the impressive performance shown in Shumer's benchmarks.
The Reflection playground lets you try the model, but says the demo is temporarily unavailable due to high demand. The prompt suggestions “Count 'r's in strawberry” and “9.11 vs. 9.9” imply that the model answers these tricky prompts correctly. However, some users claim that Reflection was specifically tuned to answer these prompts.
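For reference, both suggested prompts have unambiguous answers that are trivial to check in code, which is part of why they became popular stress tests for LLMs. The short Python sketch below is only an illustration of the correct answers; the helper names are hypothetical and have nothing to do with how Reflection or its playground works.

```python
from decimal import Decimal


def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


def larger_decimal(a: str, b: str) -> str:
    """Return whichever of two decimal strings is numerically larger."""
    # Decimal compares by numeric value, so "9.9" is treated as 9.90, not as "9" and "9".
    return a if Decimal(a) > Decimal(b) else b


print(count_letter("strawberry", "r"))  # 3
print(larger_decimal("9.11", "9.9"))    # 9.9
```

A model that gets these two right has shown very little on its own, which is why tuning on them specifically would be a concern.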
Some users questioned the impressive benchmarks. In particular, the GSM8K score of over 99% looked suspicious.
Hey Matt! That's super interesting, but I'm pretty surprised to see a GSM8k score of over 99%. From what I understand, it's likely that more than 1% of GSM8k is mislabeled (the correct answer is actually wrong)!
– Hugh Zhang (@hughbzhang) 5 September 2024
Some of the ground truth answers in the GSM8K dataset are actually wrong. In other words, the only way to score over 99% on GSM8K would be to give the same wrong answers to those problems.
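The arithmetic behind that objection can be sketched in a few lines of Python. The numbers used here (1,000 problems, 15 of them mislabeled, 992 marked correct) are hypothetical and chosen only for illustration; Zhang's post says only that the mislabeled share is likely above 1%.

```python
def accuracy_ceiling(total: int, mislabeled: int) -> float:
    """Best possible score for a model that never repeats a wrong reference answer."""
    return (total - mislabeled) / total


def min_label_errors_matched(total: int, mislabeled: int, marked_correct: int) -> int:
    """Fewest mislabeled problems on which the model must have matched the wrong label.

    Assumes the most generous case: every correctly labeled problem was answered right.
    """
    correctly_labeled = total - mislabeled
    return max(0, marked_correct - correctly_labeled)


# Hypothetical figures, not measured values.
total, mislabeled, marked_correct = 1000, 15, 992

print(f"{accuracy_ceiling(total, mislabeled):.1%}")
print(min_label_errors_matched(total, mislabeled, marked_correct))  # 7
```

Under these assumed figures, a score of 99.2% would require the model to reproduce at least seven of the benchmark's own wrong answers, which is the pattern that made the reported result look suspicious.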
After some testing, users said that Reflection was actually worse than Llama 3.1 and that it appeared to be just Llama 3 with LoRA tuning applied.
In response to the negative feedback, Shumer posted a statement on X saying, “Quick update – we re-uploaded the weights but there is still a problem. We just started training again to rule out any potential issue. Should be resolved soon.”
Shumer explained that there was a bug with the API and that it was being worked on. In the meantime, he provided access to a private API so that doubters could try out Reflection while the fix was in progress.
This is where the story seems to fall apart, because some careful prompting appears to show that the API is really just a Claude 3.5 Sonnet wrapper.
Reflection API is a prompted wrapper for Sonnet 3.5, and currently it’s obfuscated by filtering out the string “claude”. https://t.co/c4Oj8Y3Ol1 https://t.co/k0ECeo9a4i pic.twitter.com/jTm2Q85Q7b
– Joseph (@RealJosephus) 8 September 2024
In subsequent tests, the API reportedly returned output from Llama and GPT-4o. Shumer insists the original results are correct and that they're working to fix the downloadable model.
Are skeptics a bit premature in calling Shumer a fraud? Perhaps the release was just poorly handled and Reflection 70B really is a groundbreaking open source model. Or perhaps it's another example of AI hype used to raise venture capital from investors looking for the next big thing in AI.
We'll have to wait a day or two to see how things develop.