As artificial intelligence models become more widely used and are integrated into sectors as diverse as healthcare, finance, education, transportation, and entertainment, it’s critical to understand how they work under the hood. Interpreting the mechanisms underlying AI models lets us audit them for safety and bias, and may deepen our understanding of the science behind intelligence itself.
Imagine if we could study the human brain directly by manipulating each individual neuron to examine its role in perceiving a particular object. While such an experiment would be prohibitively invasive in the human brain, it’s more feasible in another type of neural network: an artificial one. However, much like the human brain, artificial models with hundreds of thousands of neurons are too large and complex to study by hand, which makes interpretability at scale a very difficult task.
To address this problem, researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) took an automated approach to interpreting artificial vision models that evaluate different properties of images. They developed “MAIA” (Multimodal Automated Interpretability Agent), a system that automates a variety of neural network interpretability tasks using a vision-language model backbone equipped with tools for experimenting on other AI systems.
“Our goal is to create an AI researcher that can conduct interpretability experiments autonomously. Existing automated interpretability methods merely label or visualize data in a one-shot process. MAIA, on the other hand, can generate hypotheses, design experiments to test them, and refine its understanding through iterative analysis,” says Tamar Rott Shaham, a postdoctoral researcher in electrical engineering and computer science (EECS) at MIT's CSAIL and co-author of a new paper on the research. “By combining a pre-trained vision-language model with a library of interpretability tools, our multimodal method can respond to user queries by composing and running targeted experiments on specific models, continuously refining its approach until it can provide a comprehensive answer.”
The automated agent handles three key tasks: it labels individual components inside vision models and describes the visual concepts that activate them, it cleans up image classifiers by removing irrelevant features to make them more robust to new situations, and it hunts for hidden biases in AI systems to help uncover potential fairness issues in their outputs. “But a key advantage of a system like MAIA is its flexibility,” says Sarah Schwettmann PhD ’21, a research scientist at CSAIL and co-lead of the research. “We demonstrated MAIA’s usefulness on a few specific tasks, but because the system is built on a foundation model with broad reasoning capabilities, it can answer many different types of interpretability queries from users and design experiments on the fly to investigate them.”
Neuron by neuron
In one example task, a human user asks MAIA to describe the concepts that a particular neuron inside a vision model is responsible for detecting. To investigate this question, MAIA first uses a tool that retrieves “dataset exemplars” from the ImageNet dataset, the images that maximally activate the neuron. For this example neuron, those images show people in formal attire and close-ups of their chins and necks. MAIA proposes several hypotheses for what drives the neuron’s activity: facial expressions, chins, or neckties. MAIA then uses its tools to design experiments that test each hypothesis individually by generating and editing synthetic images. In one experiment, adding a bow tie to an image of a human face increases the neuron’s response. “This approach allows us to determine the specific cause of the neuron’s activity, much like a real scientific experiment,” says Rott Shaham.
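The workflow described above (retrieve top-activating exemplars, then test each hypothesis with a targeted edit) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not MAIA's actual code: each image is stood in for by a dictionary of attribute strengths, and `toy_neuron`, `top_exemplars`, and `intervention_effect` are hypothetical names invented for this example.

```python
from typing import Callable, Dict, List

# Toy stand-in for an image: named visual attributes with strengths in [0, 1].
Image = Dict[str, float]


def toy_neuron(image: Image) -> float:
    """Hypothetical neuron that responds mostly to neckwear (an assumption)."""
    return 2.0 * image.get("tie", 0.0) + 0.1 * image.get("jawline", 0.0)


def top_exemplars(neuron: Callable[[Image], float],
                  dataset: List[Image], k: int) -> List[Image]:
    """Return the k dataset images that activate the neuron most strongly."""
    return sorted(dataset, key=neuron, reverse=True)[:k]


def intervention_effect(neuron: Callable[[Image], float],
                        image: Image, attribute: str,
                        value: float = 1.0) -> float:
    """Change in activation after editing one attribute into the image."""
    edited = dict(image)
    edited[attribute] = value
    return neuron(edited) - neuron(image)


# Made-up dataset, for illustration only.
dataset = [
    {"tie": 1.0, "jawline": 0.8, "smile": 0.2},
    {"tie": 0.0, "jawline": 0.9, "smile": 0.9},
    {"tie": 0.9, "jawline": 0.1, "smile": 0.0},
    {"tie": 0.0, "jawline": 0.0, "smile": 1.0},
]

# Step 1: look at what maximally activates the neuron.
exemplars = top_exemplars(toy_neuron, dataset, k=2)

# Step 2: test each hypothesis by adding one attribute to a neutral image.
neutral_face = {"tie": 0.0, "jawline": 0.0, "smile": 0.0}
hypotheses = ("smile", "jawline", "tie")
effects = {h: intervention_effect(toy_neuron, neutral_face, h)
           for h in hypotheses}

# Step 3: keep the hypothesis whose edit raises the activation the most.
best_hypothesis = max(effects, key=effects.get)
print(best_hypothesis, effects)
```

Testing one attribute at a time is what separates correlation from cause here: the exemplars alone show ties and jawlines together, while the single-attribute edits show which one the toy neuron actually responds to.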
MAIA's explanations of neuron behavior are evaluated in two ways. First, synthetic systems with known ground-truth behaviors are used to assess the accuracy of MAIA's interpretations. Second, for “real” neurons inside trained AI systems, which have no ground-truth descriptions, the authors designed a new automated evaluation protocol that measures how well MAIA's descriptions predict neuron behavior on unseen data.
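One way to picture the second kind of evaluation is to treat a description as a prediction about which held-out images will activate the neuron, then check that prediction against the measured activations. The sketch below is an assumption-laden toy, not the authors' protocol: `predictive_gap`, the attribute-dictionary images, and the two candidate descriptions are all hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List

Image = Dict[str, float]  # toy stand-in for an image


def toy_neuron(image: Image) -> float:
    """Hypothetical neuron that fires on neckties (an assumption)."""
    return 2.0 * image.get("tie", 0.0)


def predictive_gap(neuron: Callable[[Image], float],
                   description_matches: Callable[[Image], bool],
                   held_out: List[Image]) -> float:
    """Mean activation on images the description predicts should fire,
    minus mean activation on images it predicts should not."""
    positive = [neuron(x) for x in held_out if description_matches(x)]
    negative = [neuron(x) for x in held_out if not description_matches(x)]
    if not positive or not negative:
        raise ValueError("need both predicted-positive and "
                         "predicted-negative images")
    return mean(positive) - mean(negative)


# Made-up held-out images the descriptions were not written from.
held_out = [
    {"tie": 1.0, "smile": 0.0},
    {"tie": 0.0, "smile": 1.0},
    {"tie": 1.0, "smile": 1.0},
    {"tie": 0.0, "smile": 0.0},
]

# Two candidate descriptions, expressed as predicates over the toy images.
good_gap = predictive_gap(toy_neuron, lambda x: x["tie"] > 0.5, held_out)
poor_gap = predictive_gap(toy_neuron, lambda x: x["smile"] > 0.5, held_out)
print(good_gap, poor_gap)
```

A description that captures the neuron's behavior produces a large gap; one that does not produces a gap near zero, which gives a score that needs no human-written reference description.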
The CSAIL-led method outperformed baseline methods at describing individual neurons in a variety of computer vision models, such as ResNet, CLIP, and the vision transformer DINO. MAIA also performed well on the new dataset of synthetic neurons with known ground-truth descriptions. For both the real and synthetic systems, the descriptions were often on par with those written by human experts.
How are descriptions of AI system components, such as individual neurons, useful? “Understanding and localizing behaviors inside large AI systems is a key part of auditing these systems for safety before they’re deployed. In some of our experiments, we show how MAIA can be used to find neurons with unwanted behaviors and remove those behaviors from a model,” says Schwettmann. “We are building toward a more resilient AI ecosystem in which tools for understanding and monitoring AI systems keep pace with system scaling, so that we can investigate and hopefully understand unforeseen challenges introduced by new models.”
A glance inside neural networks
The young field of interpretability is maturing into a distinct research area, alongside the rise of “black box” machine learning models. How can researchers crack open these models and understand how they work?
Current methods for peering inside these models tend to be limited either in their scope or in the precision of the explanations they can produce. In addition, existing methods are often tailored to a particular model and a particular task. This led the researchers to ask: How can we build a generic system that helps users answer interpretability questions about AI models, combining the flexibility of human experimentation with the scalability of automated techniques?
One critical area the system was designed to address is bias. To determine whether image classifiers were biased against particular subcategories of images, the team looked at the final layer of the classification stream (the part of a system that sorts or labels items, much like a machine that identifies whether a photo shows a dog, cat, or bird) and at the probability scores of the input images (the confidence levels the machine assigns to its guesses). To probe potential biases in image classification, MAIA was asked to find a subset of images in specific classes (e.g., “Labrador Retriever”) that were likely to be mislabeled by the system. In this example, MAIA found that images of black Labradors were likely to be misclassified, suggesting a bias in the model toward yellow-coated retrievers.
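The kind of check described above can be illustrated by grouping one class's images by a candidate attribute and comparing the classifier's probability scores across the groups. The following is a minimal sketch with made-up numbers, not results or code from the research: `subgroup_report`, the record layout, and the `"other"` class are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def subgroup_report(records: List[dict],
                    target_class: str) -> Dict[str, Dict[str, float]]:
    """Per-subgroup error rate and mean probability for the true class.

    Each record holds a 'subgroup' label and the classifier's final-layer
    'probs' (class name -> probability) for one image of target_class.
    """
    stats = defaultdict(lambda: {"n": 0, "errors": 0, "prob_sum": 0.0})
    for record in records:
        probs = record["probs"]
        group = stats[record["subgroup"]]
        group["n"] += 1
        group["prob_sum"] += probs[target_class]
        predicted = max(probs, key=probs.get)  # top-1 prediction
        group["errors"] += int(predicted != target_class)
    return {
        name: {
            "error_rate": g["errors"] / g["n"],
            "mean_prob": g["prob_sum"] / g["n"],
        }
        for name, g in stats.items()
    }


# Made-up probability scores, for illustration only.
records = [
    {"subgroup": "yellow", "probs": {"Labrador Retriever": 0.9, "other": 0.1}},
    {"subgroup": "yellow", "probs": {"Labrador Retriever": 0.8, "other": 0.2}},
    {"subgroup": "black", "probs": {"Labrador Retriever": 0.4, "other": 0.6}},
    {"subgroup": "black", "probs": {"Labrador Retriever": 0.7, "other": 0.3}},
]

report = subgroup_report(records, "Labrador Retriever")
worst_subgroup = max(report, key=lambda name: report[name]["error_rate"])
print(worst_subgroup, report)
```

A subgroup with a markedly higher error rate or lower mean probability than its siblings is a candidate bias worth testing further, for instance with edited images that change only the suspect attribute.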
Because MAIA relies on external tools to design experiments, its performance is limited by the quality of those tools. But as tools such as image synthesis models improve, so will MAIA. MAIA also sometimes shows confirmation bias, incorrectly confirming its initial hypothesis. To mitigate this, the researchers built an image-to-text tool that uses a separate instance of the language model to summarize experimental results. Another failure mode is overfitting to a particular experiment, where the model sometimes draws premature conclusions from minimal evidence.
“I think a natural next step for our lab is to move beyond artificial systems and apply similar experiments to human perception,” says Rott Shaham. “Testing this has traditionally required designing and testing stimuli manually, which is labor-intensive. With our agent, we can scale up this process, designing and testing numerous stimuli simultaneously. This might also allow us to compare human visual perception with artificial systems.”
“Understanding neural networks is difficult for humans because they have hundreds of thousands of neurons, each with complex behavior patterns. MAIA helps to bridge this gap by developing AI agents that can automatically analyze these neurons and report the findings back to humans in a digestible form,” says Jacob Steinhardt, an assistant professor at the University of California, Berkeley, who was not involved in the research. “Scaling these methods up could be one of the most important routes to understanding and safely overseeing AI systems.”
Rott Shaham and Schwettmann are joined on the work by five fellow CSAIL affiliates: undergraduate student Franklin Wang, MIT graduate student Achyuta Rajaram, EECS graduate student Evan Hernandez SM '22, and EECS professors Jacob Andreas and Antonio Torralba. Their work was supported in part by the MIT-IBM Watson AI Lab, Open Philanthropy, Hyundai Motor Co., the Army Research Laboratory, Intel, the National Science Foundation, the Zuckerman STEM Leadership Program, and the Viterbi Fellowship. The researchers' findings will be presented this week at the International Conference on Machine Learning.