
Making it easier to review an AI model's responses

Despite their impressive capabilities, large language models are far from perfect. These artificial intelligence models sometimes "hallucinate," generating false or unsupported information in response to a question.

Because of this hallucination problem, an LLM's answers are sometimes checked by human fact-checkers, especially when a model is deployed in a high-stakes setting such as health care or finance. However, validation typically requires reading through the long documents cited by the model, a task so tedious and error-prone that it may deter some users from deploying generative AI models at all.

To support human validators, MIT researchers have developed a user-friendly system that enables humans to verify an LLM's answers much more quickly. With this tool, called SymGen, an LLM generates responses with citations that point directly to the place in a source document, such as a specific cell in a database.

Users hover over highlighted parts of the text response to see the data the model used to generate a particular word or phrase. At the same time, the unhighlighted parts show users which phrases need extra attention to check and verify.

“We give people the ability to focus selectively on the parts of the text they need to be more concerned about. In the end, SymGen can give people greater confidence in a model’s answers because they can more easily take a closer look to make sure the information is verified,” says Shannon Shen, a doctoral candidate in electrical engineering and computer science and co-lead author of a paper on SymGen.

In a user study, Shen and his collaborators found that SymGen cut verification time by about 20 percent compared with manual procedures. By making it faster and easier for humans to validate model outputs, SymGen could help people catch errors in LLMs used in a variety of real-world settings, from drafting clinical notes to summarizing financial market reports.

Shen is joined on the paper by co-lead author and fellow EECS graduate student Lucas Torroba Hennigen; EECS doctoral student Aniruddha “Ani” Nrusimha; Bernhard Gapp, president of the Good Data Initiative; and senior authors David Sontag, a professor of EECS, a member of the MIT Jameel Clinic, and leader of the Clinical Machine Learning Group of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Yoon Kim, an assistant professor of EECS and a member of CSAIL. The research was recently presented at the Conference on Language Modeling.

Symbolic references

To aid validation, many LLMs are designed to generate citations that point to external documents alongside their language-based responses so users can check them. However, these verification systems are usually designed after the fact, without considering the effort it takes to sift through numerous citations, says Shen.

“Generative AI is intended to reduce the user’s time to complete a task. If you need to spend hours reading through all these documents to verify that the model is saying something reasonable, having the generations is less helpful in practice,” says Shen.

The researchers approached the validation problem from the perspective of the humans who will do the work.

A SymGen user first provides the LLM with data it can reference in its response, such as a table of statistics about a basketball game. Then, rather than immediately asking the model to complete a task, such as generating a game summary from that data, the researchers add an intermediate step: they prompt the model to generate its response in a symbolic form.

With this prompt, every time the model wants to cite words in its response, it must write the specific cell from the data table that contains the information it is referencing. For instance, if the model wants to cite the phrase “Portland Trailblazers” in its response, it replaces that text with the name of the cell in the data table that contains those words.
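As a concrete illustration (the placeholder syntax, cell names, and table below are made up for this article, not SymGen's actual format), a symbolic response over a small game table might look like this:

```python
# Illustrative sketch only: the placeholder syntax and table fields are
# hypothetical, not SymGen's actual output format.
data_table = {
    "winning_team": "Portland Trailblazers",
    "losing_team": "Utah Jazz",
    "final_score": "110-102",
}

# Instead of writing the values directly, the model cites the table cells
# that contain them.
symbolic_response = (
    "The {{winning_team}} defeated the {{losing_team}} "
    "by a final score of {{final_score}}."
)
```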

“Because we have this intermediate step that has the text in a symbolic format, we are able to have really fine-grained references. We can say that, for every single span of text in the output, this is exactly the place in the data that it corresponds to,” says Torroba Hennigen.

SymGen then resolves each reference using a rule-based tool that copies the corresponding text from the data table into the model's response.

“This way, we know it is a verbatim copy, so we know there will not be any errors in the part of the text that corresponds to the actual data variable,” adds Shen.
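A minimal sketch of what such a rule-based resolution step could look like, continuing the hypothetical placeholder syntax from the example above:

```python
import re

def resolve_references(symbolic_response: str, data_table: dict) -> str:
    """Replace each {{cell_name}} placeholder with the verbatim value from
    the data table, so cited spans are copied rather than paraphrased."""
    def lookup(match: re.Match) -> str:
        return data_table[match.group(1)]
    return re.sub(r"\{\{(\w+)\}\}", lookup, symbolic_response)

print(resolve_references(symbolic_response, data_table))
# The Portland Trailblazers defeated the Utah Jazz by a final score of 110-102.
```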

Optimized validation

The model is able to produce symbolic responses because of how it is trained. Large language models are fed reams of data from the internet, and some of that data is recorded in a “placeholder” format, where codes stand in for actual values.

When SymGen prompts the model to generate a symbolic response, it uses a similar structure.

“We design the prompt in a specific way to draw on the LLM’s capabilities,” Shen adds.
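For example (purely illustrative; the paper's actual prompt wording is not reproduced here), an instruction in that spirit, using the same hypothetical placeholder syntax as the earlier sketch, might read:

```python
# Hypothetical prompt wording; not SymGen's actual prompt.
symbolic_prompt = (
    "Write a summary of the game using only the table below. Whenever a word "
    "or phrase comes directly from a table cell, write the cell's name in "
    "double braces, e.g. {{winning_team}}, instead of the text itself.\n\n"
    f"Table: {data_table}"
)
```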

In the user study, the majority of participants said SymGen made it easier to verify LLM-generated text. They could validate the model's responses about 20 percent faster than with standard methods.

However, SymGen is limited by the quality of the source data. The LLM could cite an incorrect variable, and a human verifier would be none the wiser.

In addition, the user must have source data in a structured format, such as a table, to feed into SymGen. Right now, the system works only with tabular data.

Moving forward, the researchers are enhancing SymGen so it can handle arbitrary text and other forms of data. With that capability, it could help validate portions of AI-generated legal document summaries, for instance. They also plan to test SymGen with physicians to study how it could detect errors in AI-generated clinical summaries.

This work is funded, in part, by Liberty Mutual and the MIT Quest for Intelligence Initiative.
