Google's Gecko benchmark determines the best AI image generator

Google DeepMind has released Gecko, a new benchmark for the comprehensive evaluation of AI text-to-image (T2I) models.

Over the last two years we have seen AI image generators like DALL-E and Midjourney get better with each version release.

However, deciding which of the underlying models these platforms use is best has been largely subjective and difficult to compare.

It is not easy to make a general claim that one model is "better" than another. Different models excel at different aspects of image creation. One may be good at text rendering, while another may be better at object interaction.

A key challenge for T2I models is to follow every detail of the prompt and accurately reflect it in the generated image.

With Gecko, the DeepMind researchers have created a benchmark that assesses the capabilities of T2I models in much the same way humans do.


The researchers first defined a comprehensive dataset of skills relevant to T2I generation. These include spatial understanding, action recognition, text rendering, and others. They further divided these into more specific sub-skills.

For example, in text rendering, sub-skills might include rendering different fonts, colors, or text sizes.

An LLM was then used to generate prompts that test a T2I model's ability on a specific skill or sub-skill.

This allows the creators of a T2I model to determine not only which capabilities pose a challenge for their model, but also at what level of complexity a capability becomes a challenge.
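The skill-driven prompt generation described above can be sketched roughly as follows. The skill names, sub-skills, and instruction template here are illustrative assumptions, not DeepMind's actual taxonomy or code:

```python
# Hypothetical sketch of a Gecko-style skill taxonomy and prompt generation.
# The skills, sub-skills, and template below are illustrative only.

SKILLS = {
    "text_rendering": ["font", "color", "size"],
    "spatial_understanding": ["left_of", "above", "inside"],
    "action_recognition": ["running", "holding", "pouring"],
}

def build_prompt_requests(skills):
    """Produce one LLM instruction per (skill, sub-skill) pair, asking
    the LLM to write a T2I prompt that targets that capability."""
    requests = []
    for skill, sub_skills in skills.items():
        for sub in sub_skills:
            requests.append(
                f"Write a text-to-image prompt that tests the skill "
                f"'{skill}' with a focus on '{sub}'."
            )
    return requests

requests = build_prompt_requests(SKILLS)
print(len(requests))  # one request per sub-skill: 9
```

In the real benchmark each request would be sent to an LLM, and the returned prompts fed to the T2I model under test.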

The Gecko benchmark framework uses a dataset of skills and sub-skills (a), human Likert ratings of image accuracy (b), and LLM-generated questions for VQA evaluation, resulting in comprehensive metrics that correlate with human ratings. Source: arXiv

Human vs. automated assessment

Gecko also measures how closely a T2I model follows all the details of a prompt. Again, an LLM was used to isolate the key details in each prompt and then generate a series of questions about those details.

These questions can be either simple, direct questions about visible elements in the image (e.g., "Is there a cat in the image?") or more complex questions that test understanding of the scene or the relationships between objects (e.g., "Is the cat in the image sitting on the book?").

A Visual Question Answering (VQA) model then analyzes the generated image and answers the questions, measuring how accurately the T2I model aligned its output image with the prompt.
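The scoring step can be illustrated with a minimal sketch. Here `ask_vqa` is a hypothetical stand-in for a real VQA model, and the scoring rule (fraction of questions answered as expected) is an assumption for illustration:

```python
# Illustrative sketch of question-based scoring. `ask_vqa` stands in
# for a real Visual Question Answering model and is an assumption here.

def score_image(image, questions_with_answers, ask_vqa):
    """Return the fraction of prompt-derived questions the VQA model
    answers as expected for the generated image."""
    correct = 0
    for question, expected in questions_with_answers:
        if ask_vqa(image, question) == expected:
            correct += 1
    return correct / len(questions_with_answers)

# Toy stand-in VQA model, for demonstration only
def fake_vqa(image, question):
    return "yes" if "cat" in question.lower() else "no"

qa = [("Is there a cat in the image?", "yes"),
      ("Is the cat sitting on the book?", "yes")]
print(score_image("generated.png", qa, fake_vqa))  # 1.0
```

A real pipeline would swap `fake_vqa` for an actual VQA model call and aggregate these per-image scores across the whole prompt set.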

The researchers collected over 100,000 human annotations, with participants rating generated images on how well they matched certain criteria.

Participants were asked to consider a specific aspect of the prompt and rate the image on a scale of 1 to 5 according to how well it matched.

Using the human annotations as the gold standard, the researchers were able to confirm that their automated evaluation metric "correlates better with human ratings than existing metrics on our new dataset."

The result is a benchmarking system that can quantify the specific aspects that make a generated image good or not.

Gecko essentially evaluates the output image in a way that closely mirrors how we intuitively decide whether we are happy with the image produced.

So what is the best text-to-image model?

In their paper, the researchers concluded that Google's Muse model outperforms Stable Diffusion 1.5 and SDXL on the Gecko benchmark. They may be biased, but the numbers don't lie.
