
Google's Gecko benchmark determines the best AI image generator

Google DeepMind has released Gecko, a new benchmark for comprehensively evaluating AI text-to-image (T2I) models.

Over the last two years, we have seen AI image generators like DALL-E and Midjourney get better with each version release.

However, deciding which of the underlying models these platforms use is the best has been largely subjective and difficult to compare.

It isn't easy to make a general claim that one model is "better" than another. Different models excel at different aspects of image creation. One may be good at text rendering, while another may be better at object interaction.

A key challenge for T2I models is to track every detail of the prompt and accurately reflect it in the generated image.

With Gecko, the DeepMind researchers have created a benchmark that assesses the capabilities of T2I models in a way that mirrors human judgment.

Skills

The researchers first defined a comprehensive dataset of skills relevant to T2I generation. These include spatial understanding, motion recognition, text rendering, and others. They further divided these into more specific sub-skills.

For example, in text rendering, sub-skills may include rendering different fonts, colours, or text sizes.

An LLM was then used to generate prompts that test a T2I model's ability on a specific skill or sub-skill.

This allows the creators of a T2I model to determine not only exactly which skills challenge their model, but also at what level of complexity a skill becomes a challenge.
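To make the idea concrete, here is a minimal sketch of how a skill/sub-skill prompt set could be organized in code. The data structures and the generate_prompts() helper are illustrative assumptions, not the actual Gecko implementation.

```python
# Illustrative sketch of a skill/sub-skill prompt set (not the Gecko code).
from dataclasses import dataclass, field

@dataclass
class SubSkill:
    name: str
    complexity: int                      # e.g. 1 = simple, 3 = hard
    prompts: list[str] = field(default_factory=list)

@dataclass
class Skill:
    name: str
    sub_skills: list[SubSkill]

def generate_prompts(skill: Skill, llm) -> Skill:
    """Ask an LLM to write test prompts for each sub-skill (hypothetical helper)."""
    for sub in skill.sub_skills:
        instruction = (
            f"Write 5 image-generation prompts that test the skill "
            f"'{skill.name}', sub-skill '{sub.name}', at complexity {sub.complexity}."
        )
        sub.prompts = llm(instruction)   # `llm` is assumed to return a list of strings
    return skill

# Example skill, matching the text-rendering example above
text_rendering = Skill(
    name="text rendering",
    sub_skills=[
        SubSkill("different fonts", complexity=1),
        SubSkill("text colour", complexity=1),
        SubSkill("long text at small sizes", complexity=3),
    ],
)
```

Scoring a model per sub-skill and per complexity level is what lets developers see where their model starts to break down.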

The Gecko benchmark framework uses a dataset of skills and sub-skills (a), human Likert ratings of image accuracy (b), and LLM-generated questions for VQA evaluation, resulting in comprehensive metrics that correlate with human ratings. Source: arXiv

Human vs. automated assessment

Gecko also measures how closely a T2I model follows all the details of a prompt. Again, an LLM was used to isolate the key details in each prompt and then generate a series of questions about those details.

These questions can be either simple, direct questions about visible elements in the image (e.g., "Is there a cat in the image?") or more complex questions that test understanding of the scene or the relationships between objects (e.g., "Is the cat in the image sitting on the book?").

A Visual Question Answering (VQA) model then analyzes the generated image and answers the questions to see how accurately the T2I model aligned its output image with the prompt.
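The following is a rough sketch of this question-based scoring loop, assuming generic `llm` and `vqa` callables; the real Gecko pipeline and models differ in detail.

```python
# Sketch of QA-based image scoring (hypothetical helpers, not the Gecko code).
def qa_score(prompt: str, image, llm, vqa) -> float:
    """Score how well `image` matches `prompt` via question answering."""
    # 1. Ask an LLM for (question, expected_answer) pairs covering the prompt's details.
    qa_pairs = llm(
        f"List yes/no questions and their correct answers that check every "
        f"detail of this image description: {prompt!r}"
    )  # assumed to return e.g. [("Is there a cat in the image?", "yes"), ...]

    # 2. Ask a VQA model each question about the generated image.
    correct = sum(
        1 for question, expected in qa_pairs
        if vqa(image, question).strip().lower() == expected.lower()
    )

    # 3. The score is the fraction of questions answered as expected.
    return correct / len(qa_pairs) if qa_pairs else 0.0
```

A model that ignores part of the prompt (the cat, the book, or their relationship) fails the corresponding questions and loses score.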

The researchers collected over 100,000 human annotations, with participants rating each generated image on how well it matched certain criteria.

Raters were asked to consider a specific aspect of the prompt and rate the image on a scale of 1 to 5 depending on how well it matched.

Using the human annotations as the gold standard, the researchers were able to confirm that their automated evaluation metric "correlates better with human ratings than existing metrics on our new dataset."
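Checking that kind of agreement typically comes down to a rank correlation between the automated scores and the human ratings. Below is a minimal sketch using Spearman correlation; the specific statistic and the numbers shown are assumptions for illustration, not figures from the paper.

```python
# Toy example: agreement between an automated metric and human Likert ratings.
from scipy.stats import spearmanr

human_ratings = [5, 4, 2, 5, 1, 3]              # 1-5 Likert scores per image (made up)
metric_scores = [0.9, 0.8, 0.3, 1.0, 0.2, 0.5]  # automated QA scores per image (made up)

corr, p_value = spearmanr(human_ratings, metric_scores)
print(f"Spearman correlation: {corr:.2f} (p = {p_value:.3f})")
```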

The result is a benchmarking system that can quantify the specific aspects that make a generated image good or not.

Gecko essentially evaluates the output image in a way that closely follows how we intuitively decide whether or not we are happy with the image produced.

So what is the best text-to-image model?

In their paper, the researchers concluded that Google's Muse model outperforms Stable Diffusion 1.5 and SDXL on the Gecko benchmark. They may be biased, but the numbers don't lie.
