A well-known artificial general intelligence (AGI) test is close to being solved. But the test’s creators say this points to flaws in the test’s design rather than an actual research breakthrough.
In 2019, François Chollet, a leading figure in the AI world, introduced the ARC-AGI benchmark, short for “Abstraction and Reasoning Corpus for Artificial General Intelligence.” It is designed to evaluate whether an AI system can efficiently acquire new skills outside the data it was trained on. Chollet claims that ARC-AGI remains the only AI test that measures progress toward general intelligence (although others have been proposed).
Until this year, the most powerful AI systems could only solve just under a third of the tasks in ARC-AGI. Chollet blamed this on the industry’s focus on large language models (LLMs), which he said are not capable of genuine “reasoning.”
“LLMs struggle with generalization because they rely entirely on memorization,” he said in a series of posts on X in February. “They fail at everything that wasn’t included in their training data.”
For Chollet, LLMs are statistical machines. Trained on many examples, they learn patterns in those examples to make predictions, such as how “to whom” in an email typically precedes “it may concern.”
Chollet claims that while LLMs are capable of memorizing “patterns of thought,” they are unlikely to be able to generate “new reasoning” in novel situations. “If you need to learn many examples of a pattern, even if it is implicit, in order to learn a reusable representation for it, you are memorizing it,” Chollet argued in another post.
To incentivize research beyond LLMs, Chollet and Zapier co-founder Mike Knoop launched a $1 million competition in June to develop open source AI that can beat ARC-AGI. Out of 17,789 submissions, the top scorer achieved 55.5% – about 20 points higher than 2023’s top score, though still below the 85% “human level” threshold required to claim the prize.
But that doesn’t mean we’re 20% closer to AGI, Knoop says.
Today we are announcing the winners of the 2024 ARC Prize. We are also publishing an extensive technical report on what we learned from the competition (link in the next tweet).
The state-of-the-art rose from 33% to 55.5%, the largest single-year increase we have seen since 2020. The…
— François Chollet (@fchollet) December 6, 2024
In a blog post, Knoop said that many of the submissions to ARC-AGI were able to reach a solution through “brute force,” suggesting that a “large portion” of ARC-AGI tasks “don’t provide many useful signals for general intelligence.”
ARC-AGI consists of puzzle-like problems in which an AI must generate the correct “answer” grid from a grid of differently colored squares. The problems were meant to force an AI to adapt to new tasks it hasn’t seen before. But it’s not clear they are achieving that.
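For context, the publicly released ARC-AGI tasks are distributed as JSON files containing a handful of demonstration input/output grid pairs plus held-out test pairs, with each grid encoded as a 2D list of integers representing colors. The following is a minimal sketch, in Python, of how a task file could be loaded and a candidate answer checked; the file name and the trivial solver are hypothetical placeholders, not part of any winning entry.

```python
import json

# Hypothetical ARC-AGI task file. Public tasks are JSON objects with
# "train" (demonstration pairs) and "test" (held-out pairs); each grid
# is a 2D list of integers 0-9, where each integer denotes a color.
with open("example_task.json") as f:
    task = json.load(f)

# Inspect the few demonstration pairs from which the rule must be inferred.
for pair in task["train"]:
    print("input: ", pair["input"])
    print("output:", pair["output"])

def solve(grid):
    # Placeholder solver: a real entry must infer the transformation from
    # the demonstrations alone. Here we simply return the grid unchanged.
    return grid

# A task counts as solved only if the predicted grid matches exactly.
test_pair = task["test"][0]
prediction = solve(test_pair["input"])
print("correct" if prediction == test_pair["output"] else "incorrect")
```

The “brute force” approaches Knoop describes exploit exactly this setup: because answers are checked by exact match against small grids, a program that searches over many candidate transformations can sometimes hit the right one without anything resembling general reasoning.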
“(ARC-AGI) has been unchanged since 2019 and is not perfect,” Knoop acknowledged in his post.
Chollet and Knoop have also faced criticism for overselling ARC-AGI as a benchmark for AGI – at a time when the definition of AGI itself is hotly contested. One OpenAI staff member recently claimed that AGI has “already” been achieved if one defines it as AI that is “better than most humans at most tasks.”
Knoop and Chollet say that, alongside a competition in 2025, they plan to release a second-generation ARC-AGI benchmark to address these issues. “We will continue to focus the efforts of the research community on what we believe are the most important unsolved problems in AI and accelerate the timeline to AGI,” Chollet wrote in an X post.
A fix probably won’t come easily. If the shortcomings of the first ARC-AGI test are any indication, defining intelligence for AI will prove just as intractable – and inflammatory – as it has been for humans.