Gemini’s data evaluation capabilities aren’t nearly as good as Google claims

A selling point of Google's flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, is the amount of data they can supposedly process and analyze. In press briefings and demos, Google has repeatedly claimed that the models' "long context" allows them to accomplish previously impossible tasks, such as summarizing documents hundreds of pages long or searching for scenes within video footage.

But new research suggests that the models aren't actually very good at these tasks.

Two separate studies investigated how well Google's Gemini models and others make sense of enormous amounts of data — think War and Peace-length works. Both found that Gemini 1.5 Pro and 1.5 Flash struggle to accurately answer questions about large data sets; in a series of document-based tests, the models gave the correct answer only 40% to 50% of the time.

"While models like Gemini 1.5 Pro can technically process long contexts, we have seen many cases indicating that the models don't actually 'understand' the content," Marzena Karpinska, a postdoc at UMass Amherst and co-author of one of the studies, told TechCrunch.

Gemini's context window falls short

A model's context, or context window, refers to the input data (e.g., text) that the model considers before generating output (e.g., additional text). A simple question — "Who won the 2020 U.S. presidential election?" — can serve as context, as can a movie script, a show, or an audio clip. And as context windows grow larger, so does the size of the documents that fit into them.

The newest versions of Gemini can take in over 2 million tokens as context. ("Tokens" are subdivided bits of raw data, such as the syllables "fan," "tas," and "tic" in the word "fantastic.") That's equivalent to roughly 1.4 million words, two hours of video, or 22 hours of audio — the largest context of any commercially available model.
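
To make the token-to-word relationship concrete, here is a rough sketch using the open-source tiktoken library. Note the assumptions: tiktoken is an OpenAI tokenizer used here purely for illustration, and Gemini's own tokenizer splits text differently, so exact counts will vary.

```python
# Rough illustration of tokenization using the open-source tiktoken
# library. This is an OpenAI tokenizer, not Gemini's; the exact IDs
# and splits below are illustrative, not Gemini's actual behavior.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

token_ids = enc.encode("fantastic")
print(token_ids)                              # token IDs vary by tokenizer
print([enc.decode([t]) for t in token_ids])   # the chunks the word splits into

# Google's own figures (2M tokens ~ 1.4M words) imply roughly 1.4
# tokens per word, so a book-length input scales up quickly:
words = 1_400_000
print(f"~{int(words * 1.4):,} tokens at ~1.4 tokens/word (rough heuristic)")
```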

In a briefing earlier this year, Google showed several pre-recorded demos designed to illustrate the potential of Gemini's long-context capabilities. In one of them, Gemini 1.5 Pro searched the transcript of the Apollo 11 moon landing television broadcast — around 402 pages — for quotes containing jokes, then found a scene in the broadcast that resembled a pencil sketch.

Oriol Vinyals, vice president of research at Google DeepMind, who led the briefing, described the model as "magical."

"(1.5 Pro) performs these sorts of reasoning tasks across every single page, every single word," he said.

That could have been an exaggeration.

In one of the aforementioned studies testing these capabilities, Karpinska, along with researchers at the Allen Institute for AI and Princeton, asked the models to evaluate true/false statements about English-language novels. The researchers chose recent works so that the models couldn't "cheat" by relying on prior knowledge, and they peppered the statements with references to specific details and plot points that would be impossible to grasp without reading the books in full.

Given a statement such as "By using her skills as an Apoth, Nusis is able to reverse engineer the type of portal opened by the reagent key found in Rona's wooden chest," Gemini 1.5 Pro and 1.5 Flash — having ingested the relevant book — had to say whether the statement was true or false and explain their reasoning.
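
A minimal sketch of how such a true/false probe might be posed to a long-context model follows, using the google-generativeai SDK. The prompt wording, model handle, and input file are illustrative assumptions, not the study's actual evaluation harness.

```python
# Sketch (not the study's actual harness): feed an entire novel plus a
# true/false claim to Gemini via the google-generativeai SDK.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")          # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")  # illustrative model handle

book_text = open("novel.txt", encoding="utf-8").read()  # hypothetical input
claim = ("By using her skills as an Apoth, Nusis is able to reverse "
         "engineer the type of portal opened by the reagent key found "
         "in Rona's wooden chest.")

prompt = (
    f"{book_text}\n\n"
    "Based solely on the novel above, is the following statement TRUE "
    f"or FALSE? Explain your reasoning.\n\nStatement: {claim}"
)

response = model.generate_content(prompt)
print(response.text)
```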

Image credits: University of Massachusetts Amherst

Tested on one book of around 260,000 words (~520 pages) in length, the researchers found that 1.5 Pro answered the true/false statements correctly 46.7% of the time, while Flash answered correctly only 20% of the time. That means a coin flipped at random would answer questions about the book with significantly better accuracy than Google's newest machine learning model. Averaged across all the benchmark results, neither model managed to achieve higher than chance accuracy in question answering.

"We found that the models have more difficulty verifying claims that require considering larger portions of the book, or even the entire book, compared to claims that can be resolved by retrieving sentence-level evidence," Karpinska said. "Qualitatively, we also observed that the models struggle to verify claims about implicit information that is clear to a human reader but not explicitly stated in the text."

The second of the two studies, co-authored by researchers at UC Santa Barbara, tested the ability of Gemini 1.5 Flash (but not 1.5 Pro) to "reason over" videos — that is, to scan through their contents and answer questions about them.

The co-authors created a dataset of images (e.g., a photo of a birthday cake) paired with questions for the model to answer about the objects depicted in the images (e.g., "What cartoon character is on this cake?"). To evaluate the models, they randomly selected one of the images and inserted "distractor" images before and after it to create slideshow-like footage.
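
Here is a sketch of how such a "visual haystack" might be assembled: one target image, with its associated question, buried at a random position among filler images. The file names and counts are hypothetical; the study's actual construction details may differ.

```python
# Sketch of a slideshow-style "visual haystack": one target image
# (with an associated question) hidden among random distractor images.
# File names and counts are hypothetical.
import random

distractors = [f"distractor_{i:03d}.jpg" for i in range(24)]  # 24 filler images
target = "birthday_cake.jpg"  # the image the question is actually about

position = random.randrange(len(distractors) + 1)
slideshow = distractors[:position] + [target] + distractors[position:]

question = "What cartoon character is on this cake?"
print(f"{len(slideshow)}-image slideshow; target at index {slideshow.index(target)}")
# The model then receives all 25 images plus the question, and must
# locate the relevant frame before it can answer.
```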

Flash didn't perform so well. In a test where the model had to transcribe six handwritten digits from a "slideshow" of 25 images, Flash got around 50% of the transcriptions right. With eight digits, accuracy dropped to around 30%.

"On real question-answering tasks over images, it appears to be particularly hard for all the models we tested," Michael Saxon, a doctoral student at UC Santa Barbara and one of the study's co-authors, told TechCrunch. "That small amount of reasoning — recognizing that a number is in a frame and reading it — might be what's breaking the model."

Google is overpromising with Gemini

Neither of the studies has been peer-reviewed, nor do they probe the 2-million-token context versions of Gemini 1.5 Pro and 1.5 Flash. (Both tested the 1-million-token context versions.) And Flash isn't meant to be as capable as Pro in terms of performance; Google advertises it as a low-cost alternative.

Still, both add fuel to the fire that Google has been overpromising — and underdelivering — with Gemini from the start. None of the models the researchers tested, including OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet, performed well. But Google is the only model provider that has given the context window top billing in its advertisements.

"There's nothing wrong with the simple claim, 'our model can take X number of tokens,' based on objective technical details," Saxon said. "But the question is, what useful thing can you do with it?"

Generative AI broadly speaking is coming under increased scrutiny as companies (and investors) grow frustrated with the technology's limitations.

In a pair of recent surveys from the Boston Consulting Group, about half of the respondents — all C-suite executives — said that they don't expect generative AI to bring about substantial productivity gains and that they're worried about the potential for mistakes and data compromises arising from generative AI-powered tools. PitchBook recently reported that for two consecutive quarters, the number of earliest-stage generative AI deals has declined, plummeting 76% from its Q3 2023 peak.

With meeting-summarizing chatbots that conjure up fictitious details about people, and AI search platforms that are basically plagiarism generators, customers are on the hunt for promising differentiators. Google — which has raced, at times clumsily, to catch up with its generative AI rivals — was desperate to make Gemini's context one of those differentiators.

But the bet was apparently premature.

"We haven't settled on a way to really show that 'reasoning' or 'understanding' over long documents is taking place, and basically every group releasing these models is cobbling together their own ad hoc evaluations to make these claims," Karpinska said. "Without knowing how long-context processing is implemented — and companies don't share these details — it's hard to say how realistic these claims are."

Google didn't respond to a request for comment.

Saxon and Karpinska both believe that the antidotes to the hyped-up claims around generative AI are better benchmarks and, along the same vein, a greater emphasis on third-party critique. Saxon points out that one of the most common tests for long context (cited liberally by Google in its marketing materials), "needle in a haystack," only measures a model's ability to retrieve particular information, like names and numbers, from data sets — not to answer complex questions about that information.
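
To see why the needle-in-a-haystack test demonstrates so little, consider this toy version of it. The filler text, planted fact, and question are invented for illustration; real harnesses vary in their details.

```python
# Toy needle-in-a-haystack probe: plant one fact (the "needle") at a
# random depth inside filler text, then ask the model to retrieve it.
# Passing this demonstrates recall only, not multi-step reasoning.
import random

filler = "The sky was gray and the streets were quiet. " * 5000  # haystack
needle = "The magic number is 48721."

depth = random.randrange(len(filler))
haystack = filler[:depth] + " " + needle + " " + filler[depth:]

prompt = haystack + "\n\nWhat is the magic number?"
# Send `prompt` to the model under test. A correct answer ("48721")
# shows retrieval worked, but says nothing about deeper understanding.
print(len(prompt), "characters in prompt")
```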

"All scientists and most engineers who use these models essentially agree that our existing benchmark culture is broken," Saxon said. "So it's important for the public to understand that these giant reports containing numbers like 'general intelligence across benchmarks' should be viewed with a healthy dose of skepticism."
