Most AI benchmarks don't tell us much. They ask questions that can be solved by rote memorization, or cover topics that aren't relevant to the majority of users.
As a result, some AI enthusiasts are turning to games to test AI's problem-solving abilities.
Paul Calcraft, a freelance AI developer, has built an app in which two AI models play a Pictionary-like game against each other. One model doodles while the other tries to guess what the doodle represents.
“I thought that sounded super fun and potentially interesting from a model capabilities perspective,” Calcraft said in an interview with TechCrunch. “So I sat inside on a cloudy Saturday and got it done.”
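The piece doesn't detail how Calcraft's app is wired up, but the core loop is easy to sketch. The snippet below is a minimal illustration, not his implementation: it assumes the OpenAI Python client, uses GPT-4o in both roles, has the drawer emit SVG markup, and passes that markup straight to the guesser as text. All of those choices, along with the prompts and helper names, are assumptions made for illustration.

```python
# Minimal sketch (not Calcraft's implementation) of a two-model Pictionary round.
# Assumes the OpenAI Python client and GPT-4o in both roles; prompts are invented.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send one prompt to the model and return its text reply."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def play_round(secret_word: str) -> bool:
    # Drawer: produce a doodle as SVG markup without naming the word.
    svg = ask(
        f"Draw '{secret_word}' as a simple SVG doodle. "
        "Reply with SVG markup only and never write the word itself."
    )
    # Guesser: sees only the drawing (here, the raw SVG source) and guesses.
    guess = ask(
        "Another model drew this SVG doodle:\n"
        f"{svg}\n"
        "In one word, what does it depict?"
    )
    return secret_word.lower() in guess.lower()

print(play_round("bicycle"))
```

Running many such rounds, with different models taking the drawer and guesser seats, is what would turn a toy loop like this into something you could actually compare models on.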
Calcraft was inspired by a similar project by British programmer Simon Willison, in which models were asked to render a vector drawing of a pelican riding a bicycle. Willison, like Calcraft, chose a challenge he believed would force models to “think” beyond the contents of their training data.
“The idea is to have a benchmark that isn’t gameable,” Calcraft said. “One that can’t be beaten by memorizing specific answers or simple patterns seen during training.”
Minecraft also falls into this “ungameable” category, at least according to 16-year-old Adonis Singh. He created mcbench, a tool that gives a model control of a Minecraft character and tests its ability to design structures, much like Microsoft's Project Malmo.
“I think Minecraft tests the models for ingenuity and gives them more freedom to make decisions,” he told TechCrunch. “It’s nowhere near as limited and saturated as (other) benchmarks.”
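The article doesn't describe mcbench's interface either, so the following is only a hedged sketch of the general pattern such tools tend to follow: ask the model for a build plan, then execute that plan against a Minecraft world. It uses the mcpi library (the old Minecraft: Pi Edition Python API) purely as a stand-in, and the prompt and JSON plan format are invented for illustration; none of this is necessarily how mcbench works.

```python
# Hedged sketch: execute an LLM-generated "build plan" in Minecraft.
# mcpi is used as a stand-in world API; the plan format and prompt are
# invented for illustration and are not mcbench's actual interface.
import json
from openai import OpenAI
from mcpi.minecraft import Minecraft
from mcpi import block

client = OpenAI()
mc = Minecraft.create()  # connects to a local server that speaks the mcpi protocol

# Ask the model to design a structure as a JSON list of block placements.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": (
            "Design a small watchtower. Reply only with a JSON list of objects "
            'like {"x": 0, "y": 0, "z": 0, "block": "STONE"}.'
        ),
    }],
)
plan = json.loads(reply.choices[0].message.content)

# Place each block relative to the player's current position.
pos = mc.player.getTilePos()
for b in plan:
    block_id = getattr(block, b["block"], block.STONE).id
    mc.setBlock(pos.x + b["x"], pos.y + b["y"], pos.z + b["z"], block_id)
```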
Using games to benchmark AI is nothing new. The idea goes back decades: mathematician Claude Shannon argued in 1949 that games like chess were a worthy challenge for “intelligent” software. More recently, Alphabet’s DeepMind developed a model that could play Pong and Breakout; OpenAI trained an AI to compete in Dota 2 matches; and Meta built an algorithm that could hold its own against professional Texas Hold’em players.
But what's different now is that enthusiasts are hooking up large language models (LLMs) – models that can analyze text, images and more – to games to probe how well they reason.
There is a plethora of LLMs, from Gemini and Claude to GPT-4o, and they all have different “vibes,” so to speak. They “feel” different from one interaction to the next – a phenomenon that can be difficult to quantify.
“LLMs are notoriously sensitive to the way questions are asked, and are generally just unreliable and difficult to predict,” Calcraft said.
Unlike text-based benchmarks, games provide a visual, intuitive way to compare a model's performance and behavior, said Matthew Guzdial, an AI researcher and professor at the University of Alberta.
“We can think of each benchmark as a different simplification of reality, focused on specific kinds of problems such as reasoning or communication,” he said. “Games are just another way to do decision-making with AI, so people are using them like any other approach.”
Anyone familiar with the history of generative AI will notice how similar Pictionary is to generative adversarial networks (GANs), in which a generator model sends images to a discriminator model, which then evaluates them.
Calcraft believes Pictionary can capture an LLM's ability to understand concepts such as shapes, colors, and prepositions (e.g., the meaning of “in” versus “on”). He wouldn't go so far as to say the game is a reliable test of logical thinking, but he argued that winning requires strategy and the ability to understand clues – neither of which comes easily to models.
“I also really like the almost adversarial nature of the Pictionary game, similar to GANs, where you have two different roles: one draws and the other guesses,” he said. “The best thing to draw isn’t necessarily the most artistic, but the one that most clearly conveys the idea to the audience of other LLMs (including the faster, much less powerful models!).”
“Pictionary is a toy problem that isn’t immediately practical or realistic,” Calcraft warned. “However, I think spatial understanding and multimodality are crucial elements for the advancement of AI, so LLM Pictionary could be a small, early step on that journey.”
Singh believes Minecraft, too, is a useful benchmark, one that can measure reasoning in LLMs. “The results of the models I've tested so far are actually perfectly consistent with how much I trust each model in terms of reasoning,” he said.
Others aren't so sure.
Mike Cook, a research fellow specializing in AI at Queen Mary University, doesn't think Minecraft is anything special as an AI testbed.
“I think part of the fascination with Minecraft comes from people outside the gaming world who might believe that, because it looks like ‘the real world,’ it has a closer connection to real-world reasoning or actions,” Cook told TechCrunch. “From a problem-solving perspective, it’s not that dissimilar to a video game like Fortnite, Stardew Valley or World of Warcraft. It just has a different feel to it that makes it look more like everyday tasks such as building or exploring.”
Cook points out that even the best game-playing AI systems generally don't adapt well to new environments and can't easily solve problems they haven't seen before. For example, a model that excels at Minecraft is unlikely to play Doom particularly well.
“I think the good things about Minecraft from an AI perspective are the extremely weak reward signals and the procedural world, which creates unpredictable challenges,” Cook continued. “But it's really not that much more representative of the real world than any other video game.”
Even so, there is certainly something fascinating about watching LLMs build castles.