
From hallucinations to hardware: lessons from a real computer vision project that went sideways

Computer vision projects rarely go exactly as planned, and this one was no exception. The idea was simple: build a model that could look at a photo of a laptop and identify any physical damage – cracked screens, missing keys, broken hinges. It seemed like a straightforward application of image models and large language models (LLMs), but it quickly turned into something more complicated.

Along the way we ran into problems with hallucinations, unreliable outputs, and images that weren't even laptops. To solve them, we ended up applying an agent framework in an unusual way – not for task automation, but to improve the model's performance.

In this article we'll walk through what we tried, what didn't work, and how a combination of approaches ultimately helped us build something reliable.

Where we started: the monolithic prompt

Our original approach was fairly standard for a multimodal model. We used a single, large prompt to hand an image to a vision-capable LLM and asked it to identify visible damage. This monolithic prompt strategy is easy to implement and works decently for clean, well-defined tasks. But real-world data rarely plays along.

We ran into three major problems early on:

  • Hallucinations: The model sometimes invented damage that didn't exist, or misinterpreted what it saw.
  • Junk image detection: There was no reliable way to flag images that weren't laptops at all – photos of desks, walls, or people occasionally slipped through and received nonsensical damage reports.
  • Inconsistent accuracy: Together, these problems made the model too unreliable for production use.

This was the point where we realized we needed to change our approach.

First fix: mixing image resolutions

One thing we noticed was how sensitive the model was to image quality. Users uploaded all kinds of images, ranging from sharp and high-resolution to blurry. This led us to research that highlights how image resolution affects deep learning models.

We trained and tested the model with a mix of high- and low-resolution images. The idea was to make the model more robust to the wide range of image qualities it would encounter in practice. This helped improve consistency, but the core problems of hallucination and junk image handling remained.
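As a rough illustration, the resolution-mixing step can be sketched like this. The helper names and the `(width, height)` representation are ours, not from the original pipeline; a real version would resample actual pixels with an image library such as Pillow:

```python
import random

def downscale(size, factor):
    """Shrink a (width, height) pair by an integer factor."""
    w, h = size
    return (max(1, w // factor), max(1, h // factor))

def build_mixed_resolution_set(images, low_res_fraction=0.5, factor=4, seed=0):
    """Tag roughly half of the images for degradation to a lower resolution,
    so the training set spans the quality range seen in real uploads."""
    rng = random.Random(seed)
    dataset = []
    for img_id, size in images:
        if rng.random() < low_res_fraction:
            dataset.append((img_id, downscale(size, factor), "low"))
        else:
            dataset.append((img_id, size, "high"))
    return dataset

samples = build_mixed_resolution_set([("a", (1920, 1080)), ("b", (1280, 720))])
```

The fraction and downscale factor are knobs to tune against the quality distribution you actually see from users.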

The multimodal detour: a text-only LLM goes multimodal

Encouraged by recent experiments that combine caption generation with text-only LLMs – in which captions are generated from the images and then interpreted by a language model – we decided to try it out.

Here's how it works:

  • The LLM starts by producing several candidate captions for an image.
  • A second model – a multimodal embedding model – checks how well each caption matches the image. In our case we used SigLIP to score the similarity between image and text.
  • The system keeps the top few captions based on these scores.
  • The LLM uses those top captions as seeds to write new ones, trying to get closer to what the image actually shows.
  • It repeats this process until the captions stop improving, or it hits a predefined limit.
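In code, the loop looks roughly like this. The proposer and scorer below are toy stand-ins we made up for illustration; in the real setup `propose_captions` would call the LLM and `score_caption` would call SigLIP:

```python
def refine_captions(image, propose_captions, score_caption,
                    keep_top=3, max_rounds=5):
    """Iteratively keep the best-scoring captions and reseed the proposer
    with them, stopping when the top score stops improving."""
    captions = propose_captions(image, seeds=[])
    best, best_score = [], float("-inf")
    for _ in range(max_rounds):
        ranked = sorted(captions, key=lambda c: score_caption(image, c),
                        reverse=True)
        top = ranked[:keep_top]
        top_score = score_caption(image, top[0])
        if top_score <= best_score:   # no improvement: stop
            break
        best, best_score = top, top_score
        captions = top + propose_captions(image, seeds=top)
    return best

# Toy stand-ins: score = word overlap with a "ground truth" description.
truth = {"laptop", "cracked", "screen"}

def score_caption(image, caption):
    return len(truth & set(caption.split()))

def propose_captions(image, seeds):
    pool = ["a desk photo", "a laptop", "a laptop with a cracked screen"]
    return pool + list(seeds)

result = refine_captions("laptop.jpg", propose_captions, score_caption)
```

With an embedding model as the scorer, "word overlap" would be replaced by cosine similarity between the image and text embeddings.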

Although this approach is clever in theory, it introduced new problems for our use case:

  • Persistent hallucinations: The captions themselves sometimes contained imaginary damage, which the LLM then confidently reported.
  • Incomplete coverage: Even with several captions, some issues were missed entirely.
  • More complexity, little gain: The extra steps made the system more complicated without reliably outperforming the previous setup.

It was an interesting experiment, but ultimately not a solution.

A creative use of agent frameworks

This was the turning point. Agent frameworks are usually used for orchestrating tasks (think agents that coordinate calendars or customer requests), but we asked ourselves whether breaking the image interpretation task into smaller, specialized agents could help.

We built an agent framework like this:

  • Orchestrator agent: It examined the image and identified which laptop components were visible (screen, keyboard, chassis, ports).
  • Component agents: Dedicated agents inspected each component for specific types of damage – for example, one for cracked screens, one for missing keys.
  • Junk detection agent: A separate agent flagged whether the image was a laptop at all.
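A minimal sketch of that layout might look like the following. Every agent's internals are stubbed out here with simple dictionary lookups; in the real system each agent wraps its own focused multimodal-LLM prompt:

```python
def junk_agent(image):
    """Flag images that are not laptops at all."""
    return image.get("subject") != "laptop"

def orchestrator_agent(image):
    """Decide which laptop components are visible in the image."""
    return image.get("visible_components", [])

# One narrow agent per component/damage type (stubbed checks here).
COMPONENT_AGENTS = {
    "screen":   lambda img: ["cracked screen"] if img.get("screen_cracked") else [],
    "keyboard": lambda img: ["missing keys"] if img.get("keys_missing") else [],
}

def assess(image):
    """Junk gate first, then route each visible component to its agent."""
    if junk_agent(image):
        return {"junk": True, "damage": []}
    damage = []
    for component in orchestrator_agent(image):
        agent = COMPONENT_AGENTS.get(component)
        if agent:
            damage.extend(agent(image))
    return {"junk": False, "damage": damage}

report = assess({"subject": "laptop",
                 "visible_components": ["screen", "keyboard"],
                 "screen_cracked": True})
```

The key design point is that each agent answers one narrow question, which is what made the outputs easy to audit.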

This modular, task-oriented approach produced far more precise and explainable results. Hallucinations dropped dramatically, junk images were reliably flagged, and each agent's task was simple and focused enough to keep quality under control.

The blind spots: trade-offs of an agentic approach

As effective as it was, it wasn't perfect. Two major limitations showed up:

  • Increased latency: Running several sequential agents added to the total turnaround time.
  • Coverage gaps: Agents could only detect problems they were explicitly built to look for. If an image showed something unexpected that no agent was tasked with identifying, it went unnoticed.

We needed a way to balance precision with coverage.

The hybrid solution: combining agentic and monolithic approaches

To close the gaps, we built a hybrid system:

  1. The agent framework ran first and handled precise detection of known damage types and junk images. We limited the number of agents to the most important ones to keep latency down.
  2. Then a monolithic image-LLM prompt scanned the image for anything the agents might have missed.
  3. Finally, we fine-tuned the model on a curated image set for high-priority use cases, such as frequently reported damage scenarios, to further improve accuracy and reliability.
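The first two stages can be sketched as a simple two-pass pipeline. The pass functions below are placeholders of our own invention standing in for the agent framework and the monolithic prompt:

```python
def hybrid_assess(image, agent_pass, monolithic_pass):
    """Run the precise agent pass first, then a broad monolithic sweep
    for anything the agents were not built to catch."""
    agent_report = agent_pass(image)
    if agent_report["junk"]:
        return agent_report           # skip the broad scan for junk images
    broad = monolithic_pass(image)
    # Merge: keep agent findings, append anything new from the broad sweep.
    extra = [d for d in broad if d not in agent_report["damage"]]
    return {"junk": False,
            "damage": agent_report["damage"] + extra,
            "uncatalogued": extra}

report = hybrid_assess(
    {"subject": "laptop"},
    agent_pass=lambda img: {"junk": False, "damage": ["cracked screen"]},
    monolithic_pass=lambda img: ["cracked screen", "dented corner"],
)
```

Tracking which findings came only from the broad sweep (here as `uncatalogued`) is also a useful signal for deciding which new component agents to build next.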

This combination gave us the precision and explainability of the agentic setup, the broad coverage of the monolithic prompt, and the confidence boost of targeted fine-tuning.

What we learned

A few things became clear once we finished this project:

  • Agent frameworks are more versatile than they get credit for: While they're usually associated with workflow management, we found they can meaningfully improve model output when applied in a structured, modular way.
  • Mix approaches rather than relying on just one: Combining precise, agent-based detection with the broad coverage of LLMs – plus a little fine-tuning where it mattered most – gave us far more reliable results than any single method.
  • Vision models are prone to hallucinations: Even the more advanced setups can jump to conclusions or see things that aren't there. Careful system design is needed to keep these errors in check.
  • Image quality variety makes a difference: Training and testing with both crisp, high-resolution images and everyday, lower-quality ones helped the model stay robust on unpredictable, real-world photos.
  • You need a way to catch junk images: A dedicated check for garbage or irrelevant images was one of the simplest changes we made, and it had an outsized impact on the reliability of the overall system.

Final thoughts

What began as a simple idea – using an LLM prompt to detect physical damage in laptop photos – quickly became a much deeper experiment in combining different AI techniques to tackle messy, real-world problems. Along the way we found that some of the most useful tools were ones that weren't originally designed for this kind of work.

Agent frameworks, often regarded as mere workflow plumbing, proved surprisingly effective when repurposed for tasks like structured damage detection and image filtering. With a little creativity, they helped us build a system that was not only more precise, but also easier to understand and maintain in practice.
