
Charting the future of AI, from safer answers to faster thinking

The adoption of new tools and technologies happens when users find them broadly reliable, accessible, and an improvement in cost over existing methods and workflows. Five graduate students from the inaugural class of the MIT-IBM Watson AI Lab Summer Program are leveraging cutting-edge resources, alleviating AI pain points, and creating new features and capabilities to promote the usefulness and deployment of AI: from learning when to trust a model that predicts another model's accuracy to reasoning more effectively over knowledge bases. Together, the efforts of the students and their mentors form a through line in which practical and technically rigorous research leads to more reliable and useful models across fields.

The students' work, which spans building probes, routers, new attention mechanisms, synthetic datasets, and program-synthesis pipelines, touches on safety, inference efficiency, multimodal data, and knowledge-based reasoning. Their techniques emphasize scaling and integration, always with impact in mind.

Learning whom to trust, and when

MIT mathematics student Andrey Bryutkin's research prioritizes the trustworthiness of models. He looks for internal structures within problems, such as the equations governing a system and its conservation laws, to understand how they can be leveraged to produce more reliable and robust solutions. Armed with this and working with the lab, Bryutkin developed a method to peer into the nature of the behaviors of large language models (LLMs). Together with Veronika Thost of IBM Research in the lab and Marzyeh Ghassemi, associate professor and the Germeshausen Career Development Professor in the MIT Department of Electrical Engineering and Computer Science (EECS) and a member of the Institute for Medical Engineering and Science and the Laboratory for Information and Decision Systems, Bryutkin explored the "uncertainty of uncertainty" of LLMs.

Classically, tiny feed-forward neural networks two to three layers deep, called probes, are trained and deployed alongside LLMs to flag untrustworthy answers from the larger model for developers; however, these classifiers can produce false negatives and provide only point estimates, which don't offer much information about when the LLM is failing. Studying safe/unsafe prompts and question-answer tasks, the MIT-IBM team used prompt-label pairs, as well as hidden states such as activation vectors and final tokens of an LLM, to measure gradient values, sensitivity to prompts, and out-of-distribution data, in order to determine how reliable the probe is and to identify data regions that are difficult to predict. Their method also helps identify potential labeling noise. This is an important capability, because the trustworthiness of AI systems depends entirely on the quality and accuracy of the labeled data they are built on. More accurate and consistent probes are especially important for domains with critical data in applications such as IBM's Granite Guardian family of models.
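To make the idea concrete, here is a minimal sketch of what such a probe could look like, assuming pre-extracted final-token activations and binary trust labels; the probe width, layer choice, training details, and gradient-sensitivity check below are illustrative, not the team's exact setup.

```python
# Minimal sketch: train a small feed-forward probe on LLM hidden states to
# flag untrustworthy answers. Shapes and hyperparameters are illustrative.
import torch
import torch.nn as nn

hidden_dim = 4096          # width of the LLM layer the activations come from
n_examples = 10_000        # hypothetical labeled prompt set

# Assumed to be pre-extracted: final-token activations and 0/1 trust labels.
activations = torch.randn(n_examples, hidden_dim)   # stand-in for real data
labels = torch.randint(0, 2, (n_examples, 1)).float()

probe = nn.Sequential(      # the "tiny feed-forward network" described above
    nn.Linear(hidden_dim, 256),
    nn.ReLU(),
    nn.Linear(256, 1),      # single logit: is this answer untrustworthy?
)

optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(probe(activations), labels)
    loss.backward()
    optimizer.step()

# Gradient sensitivity of the probe's score with respect to its input is one
# kind of signal an analysis like the team's could use to spot inputs the
# probe itself finds hard to judge.
x = activations[:1].clone().requires_grad_(True)
probe(x).sum().backward()
sensitivity = x.grad.norm().item()
```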

Another way to ensure trustworthy responses to queries from an LLM is to supplement it with external, verified knowledge bases to eliminate hallucinations. For structured data such as social media connections, financial transactions, or corporate databases, knowledge graphs (KGs) are the natural fit; however, communications between the LLM and KGs often use fixed, multi-agent pipelines that are computationally inefficient and expensive. Addressing this, physics graduate student Jinyeop Song, along with lab researchers Yada Zhu of IBM Research and EECS Associate Professor Julian Shun, developed a single-agent, multi-turn reinforcement learning framework that streamlines the process. Here, the group designed an API server hosting the Freebase and Wikidata KGs, which consist of general web-based knowledge, and an LLM agent that issues targeted retrieval actions to fetch pertinent information from the server. Then, through continual back-and-forth, the agent appends the gathered data from the KGs to the context and answers the query. Crucially, the system uses reinforcement learning to train itself to deliver answers that strike a balance between accuracy and completeness. The framework pairs an API server with a single reinforcement-learning agent to orchestrate data-grounded reasoning with improved accuracy, transparency, efficiency, and transferability.
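In outline, the query loop might look like the sketch below, where the server URL, the action format, and the agent call are hypothetical stand-ins, and the reinforcement-learning training itself is omitted.

```python
# Rough sketch of a single-agent, multi-turn KG query loop as described above.
# The endpoint, action schema, and agent stub are illustrative assumptions.
import json
import urllib.request

KG_SERVER = "http://localhost:8000/query"   # assumed server hosting Freebase/Wikidata

def agent_next_step(context: str) -> dict:
    """Hypothetical LLM agent call: given the dialogue so far, return either
    {"type": "kg_query", ...} or {"type": "final_answer", "text": ...}."""
    raise NotImplementedError("replace with a trained LLM agent")

def query_kg(action: dict) -> dict:
    """Send one targeted retrieval action to the KG server."""
    req = urllib.request.Request(
        KG_SERVER,
        data=json.dumps(action).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def answer(question: str, max_turns: int = 5) -> str:
    context = [f"Question: {question}"]
    for _ in range(max_turns):
        step = agent_next_step("\n".join(context))
        if step["type"] == "final_answer":            # agent decides it has enough
            return step["text"]
        facts = query_kg(step)                        # targeted KG retrieval
        context.append(f"Retrieved: {json.dumps(facts)}")  # append to context
    return agent_next_step("\n".join(context) + "\nAnswer now.")["text"]
```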

Spending computation wisely

The timeliness and completeness of a model's response carry similar weight to its accuracy. This is especially true when handling long input texts, and those where elements, like the subject of a story, evolve over time. So, EECS doctoral student Songlin Yang is rethinking what models can process at each step of inference. Working with Rameswar Panda of IBM Research and Yoon Kim, NBX Professor and associate professor in EECS, Yang focused on the limitations of transformers as they arise in LLMs, and on developing next-generation language-model architectures beyond transformers.

Transformers face two key limitations: high computational complexity in long-sequence modeling due to the softmax attention mechanism, and limited expressivity due to the weak inductive bias of rotary positional encoding (RoPE). This means that as the input length doubles, the computational cost quadruples. RoPE allows transformers to understand the positional order of tokens (i.e., words); however, it does a poor job of capturing internal state changes over time, such as variable values, and it is limited to the sequence lengths seen during training.
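For readers unfamiliar with RoPE, the sketch below shows the standard mechanism in isolation: each pair of feature dimensions is rotated by an angle proportional to the token's position, so attention scores come to depend on relative position. The dimensions and frequency schedule follow the common convention and are not specific to this work.

```python
# Minimal sketch of Rotary Positional Encoding (RoPE). Each 2-D pair of
# features is rotated by position * frequency; shapes are illustrative.
import numpy as np

def rope(x: np.ndarray, position: int) -> np.ndarray:
    """Apply rotary encoding to one token vector x of even dimension d."""
    d = x.shape[-1]
    freqs = 10000.0 ** (-np.arange(0, d, 2) / d)   # standard RoPE frequencies
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin          # 2-D rotation per pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

token = np.random.randn(64)
rotated = rope(token, position=10)   # the same vector encodes differently by position
```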

To address this, the MIT-IBM team explored theoretically grounded yet hardware-efficient algorithms. As an alternative to softmax attention, they adopted linear attention, reducing the quadratic complexity that caps the feasible sequence length. They also explored hybrid architectures combining softmax and linear attention to strike a better balance between computational efficiency and performance.
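One common way to linearize attention, shown below as an illustration rather than the team's specific mechanism, replaces the softmax with a kernel feature map so the n-by-n score matrix is never materialized; the elu(x) + 1 feature map is one popular choice.

```python
# Sketch contrasting softmax attention (quadratic in sequence length n) with
# kernelized linear attention (linear in n). Feature map is illustrative.
import numpy as np

def softmax_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (n, n): the quadratic cost
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))  # elu + 1
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                                     # (d, d): independent of n
    z = Kf.sum(axis=0)                                # normalizer, O(n * d)
    return (Qf @ kv) / (Qf @ z)[:, None]

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = linear_attention(Q, K, V)    # never materializes the (n, n) score matrix
```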

To increase expressivity, they replaced RoPE with a dynamic, reflective positional encoding based on the Householder transform. This approach enables richer positional interactions for a deeper understanding of sequential information, while keeping computation fast and efficient. The MIT-IBM team's advances reduce the need for transformers to break problems into many steps, instead enabling them to handle more complex subproblems with fewer inference tokens.
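The building block is the Householder reflection, H = I - 2vv^T for a unit vector v, which reflects a vector across the hyperplane orthogonal to v. Making v depend on position (or content) yields richer interactions than a fixed rotation while, like a rotation, preserving norms; the parameterization in this sketch is an assumption for illustration, not the team's actual encoding.

```python
# Minimal sketch of a Householder reflection, the building block of a
# reflective position encoding. The position-dependent vector is a stand-in.
import numpy as np

def householder_apply(x: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Apply H = I - 2 v v^T to x without forming the d x d matrix."""
    v = v / np.linalg.norm(v)
    return x - 2.0 * v * (v @ x)

d = 64
x = np.random.randn(d)
v_pos = np.random.randn(d)            # stand-in for a learned, position-dependent v
y = householder_apply(x, v_pos)

# Reflections are orthogonal, so norms (and hence attention scale) are preserved.
assert np.allclose(np.linalg.norm(x), np.linalg.norm(y))
```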

New visions

Visual data contains a wealth of information that the human brain can quickly parse, internalize, and then imitate. Using vision-language models (VLMs), two graduate students are exploring ways to do this through code.

Over the past two summers, under the mentorship of Aude Oliva, MIT director of the MIT-IBM Watson AI Lab and a senior research scientist in the Computer Science and Artificial Intelligence Laboratory, along with Rogerio Feris, Dan Gutfreund, and Leonid Karlinsky (now at Xero) of IBM Research, Jovana Kondic of EECS explored visual document understanding, specifically charts. These contain elements, such as data points, legends, and axis labels, that require optical character recognition and numerical reasoning, which models still struggle with. To facilitate performance on tasks like these, Kondic's group set out to create a large, open-source, synthetic chart dataset with code that could be used for training and benchmarking.

With their prototype, ChartGen, the researchers created a pipeline that passes seed chart images through a VLM, which is prompted to read the chart and generate a Python script that was likely used to create it in the first place. The LLM component of the framework then iteratively augments the code from many charts to ultimately produce over 200,000 unique chart-code pairs, spanning nearly 30 chart types, as well as supporting data and annotations such as descriptions and question-answer pairs about the charts. The team is continuing to expand the dataset, helping to enable critical multimodal understanding of data visualizations for enterprise applications such as financial and scientific reports, blogs, and more.
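As a rough illustration of such a chart-to-code loop (with hypothetical model calls, an assumed directory layout, and the description and Q&A annotation steps omitted), the pipeline might look like this:

```python
# Sketch of a ChartGen-style loop: a VLM drafts the plotting script for a
# seed chart, an LLM mutates it to grow the dataset, and each script is
# re-rendered so code and image stay paired. Model calls are stubs.
import pathlib
import subprocess

def vlm_chart_to_code(image_path: pathlib.Path) -> str:
    """Hypothetical VLM call: read a seed chart image and return the Python
    plotting script most likely used to create it."""
    raise NotImplementedError("replace with a vision-language model call")

def llm_vary_code(code: str) -> str:
    """Hypothetical LLM call: perturb the data, style, or chart type in a
    plotting script to yield a new, unique chart."""
    raise NotImplementedError("replace with a language model call")

def build_dataset(seed_dir: str, out_dir: str, variants_per_seed: int = 10) -> None:
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, image in enumerate(sorted(pathlib.Path(seed_dir).glob("*.png"))):
        code = vlm_chart_to_code(image)              # image -> candidate script
        for j in range(variants_per_seed):
            script = out / f"chart_{i}_{j}.py"
            script.write_text(llm_vary_code(code))   # grow the dataset by mutation
            # Re-render so every script ships with its matching chart image.
            subprocess.run(["python", script.name], check=True, cwd=out)
```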

Instead of charts, EECS graduate student Leonardo Hernandez Cano has turned his attention to digital design, specifically visual texture generation for CAD applications, with the goal of discovering efficient ways to enable these capabilities in VLMs. Working with the lab groups led by Armando Solar-Lezama, EECS professor and Distinguished Professor of Computing at the MIT Schwarzman College of Computing, and Nathan Fulton of IBM Research, Hernandez Cano developed a program-synthesis system that learns to refine code on its own. The system starts with a texture description given by a user in the form of an image. It then generates an initial Python program that produces visual textures and iteratively refines the code, with the goal of finding a program that produces a texture matching the target description, learning to search for new programs from the data the system itself generates. Through these refinements, the novel program can create visualizations with the desired luminosity, color, iridescence, etc., mimicking real materials.
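In outline, and with hypothetical callables standing in for the system's actual synthesizer, renderer, and similarity metric, such a self-refinement loop could look like the following:

```python
# Sketch of a self-refining program-synthesis loop for textures: propose a
# Python texture program, render it, score it against the target image, and
# feed the score back to the proposer. All components here are stand-ins.
def refine_texture_program(target_image, propose_program, render, similarity,
                           max_iters: int = 20, threshold: float = 0.95):
    best_code, best_score = None, float("-inf")
    feedback = "start from scratch"
    for _ in range(max_iters):
        code = propose_program(target_image, feedback)   # synthesis step
        score = similarity(render(code), target_image)   # compare rendered texture
        if score > best_score:
            best_code, best_score = code, score
        if best_score >= threshold:
            break
        feedback = f"previous score {score:.2f}; match luminosity and color more closely"
    # Each (code, score) pair can also be kept to retrain the proposer on the
    # data the system itself generates, as described above.
    return best_code
```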

Taken together, these projects, and the people behind them, are making a coherent push toward more robust and practical artificial intelligence. By tackling the core challenges of reliability, efficiency, and multimodal reasoning, the work paves the way for AI systems that are not only more powerful, but also more dependable and cost-effective for real-world enterprise and scientific applications.
