The list of informal, strange AI benchmarks continues to grow.
In recent days, some in the AI community on X have become obsessed with a test of how different AI models, especially so-called reasoning models, handle prompts like this: “Write a Python script for a bouncing yellow ball within a shape. Make the shape rotate slowly, and make sure the ball stays within the shape.”
Some models handle this “ball in a rotating shape” benchmark better than others. According to one user on X, DeepSeek's R1 handily beat OpenAI's o1-pro:
👀 Deepseek R1 (right) crushes O1-pro (left) 👀
Prompt: “Write a Python script for a bouncing yellow ball inside a square, and be sure that collision detection is handled properly. Let the square rotate slowly. Implement it in Python. Make sure the ball stays within the square.” pic.twitter.com/3sad9efpez
– Ivan Fioravanti ᯅ (@ivanfioravanti) January 22, 2025
Per another X poster, Anthropic's Claude 3.5 Sonnet and Google's Gemini 1.5 Pro misjudged the physics, causing the ball to escape the shape. Other users reported that Google's Gemini 2.0 Flash Thinking Experimental, and even OpenAI's older GPT-4o, got it right in one go.
Tested 9 AI models on a physics simulation task: Rotating Triangle + Bouncing Ball. Results:
🥇 Deepseek R1
🥈 Sonar Huge
🥉 GPT-4o
Worst? OpenAI o1: Completely misunderstood the task 😂
Video below ↓ First row = reasoning models, rest = base models. pic.twitter.com/eoyrhvnazr
– aadhithya d (@aadhithya_d2003) January 22, 2025
But what does it prove that an AI can, or can't, code a ball bouncing inside a rotating shape?
Well, simulating a bouncing ball is a classic programming challenge. Accurate simulations incorporate collision detection algorithms, which try to identify when two objects (e.g., a ball and the side of a shape) collide. Poorly written algorithms can hurt the simulation's performance or produce obvious physics errors.
X user N8 Programs, a researcher in residence at AI startup Nous Research, said it took him about two hours to program a bouncing ball in a rotating heptagon from scratch. “You have to keep track of multiple coordinate systems, how the collisions are handled in each system, and design the code from the beginning to be robust,” N8 Programs explained in a post.
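To give a sense of what the models are being asked to produce, here is a minimal, headless sketch of the usual approach: transform the ball into the rotating shape's own frame, handle the collision against axis-aligned walls there, then transform back. All parameter names and values below are illustrative, and the physics is deliberately simplified (no gravity, and the moving wall's contribution to the bounce is ignored); it's a sketch of the technique, not any particular model's output.

```python
import math

# Illustrative parameters (hypothetical values, not from the article)
HALF_SIZE = 1.0   # half the side length of the square
RADIUS = 0.1      # ball radius
OMEGA = 0.5       # square's angular velocity, radians per second
DT = 1 / 60       # simulation timestep, seconds

def rotate(x, y, angle):
    """Rotate the vector (x, y) by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return c * x - s * y, s * x + c * y

def step(px, py, vx, vy, theta):
    """Advance the ball one timestep inside a square rotated by `theta`."""
    # Integrate position in the world frame.
    px, py = px + vx * DT, py + vy * DT

    # Transform into the square's frame, where the walls are axis-aligned.
    lx, ly = rotate(px, py, -theta)
    lvx, lvy = rotate(vx, vy, -theta)

    # Collision detection against each wall, with an elastic bounce.
    limit = HALF_SIZE - RADIUS
    if abs(lx) > limit:
        lx = math.copysign(limit, lx)
        lvx = -lvx
    if abs(ly) > limit:
        ly = math.copysign(limit, ly)
        lvy = -lvy

    # Transform back to the world frame.
    px, py = rotate(lx, ly, theta)
    vx, vy = rotate(lvx, lvy, theta)
    return px, py, vx, vy

if __name__ == "__main__":
    px, py, vx, vy, theta = 0.0, 0.0, 0.7, 0.4, 0.0
    for i in range(300):  # five seconds of simulation
        px, py, vx, vy = step(px, py, vx, vy, theta)
        theta += OMEGA * DT
        if i % 60 == 0:
            print(f"t={i * DT:4.2f}s  pos=({px:+.3f}, {py:+.3f})")
```

The point of the frame change is that, in the square's own coordinates, the walls never move, so collision detection reduces to simple bounds checks. This is also where the two coordinate systems N8 Programs mentions come in: get the transform wrong in either direction and the ball drifts through a wall, which is exactly the failure mode the viral clips show.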
But while bouncing balls and rotating shapes make for an eye-catching test of programming skills, they're not a very empirical AI benchmark. Even slight variations in the prompt can, and do, produce different results. That's why some X users report having better luck with o1, while others say that R1 falls short.
If anything, such viral tests point to the intractable problem of creating useful measurement systems for AI models. It's often difficult to tell what differentiates one model from another, outside of esoteric benchmarks that aren't relevant to most people.
Many efforts are underway to create better tests, such as the ARC-AGI benchmark and Humanity's Last Exam. We'll see how those fare. In the meantime, watch GIFs of balls bouncing in rotating shapes.