AI agents are emerging as a promising new research area with potential real-world applications. These agents use foundation models such as Large Language Models (LLMs) and Vision Language Models (VLMs) to accept natural language instructions and pursue complex goals autonomously or semi-autonomously. AI agents can use various tools such as browsers, search engines, and code compilers to verify their actions and reason about their goals.
However, a recent analysis by researchers at Princeton University has uncovered several shortcomings in current agent benchmarks and evaluation practices that impair their usefulness in real-world applications.
Their findings highlight that benchmarking agents presents unique challenges and that we cannot evaluate agents in the same way that we benchmark foundation models.
Trade-offs between cost and accuracy
A major problem the researchers highlight in their study is the lack of cost control in agent evaluations. Running AI agents can be far more expensive than a single model call, because agents often rely on stochastic language models that can produce different results when the same query is run multiple times.
To increase accuracy, some agent systems generate multiple responses and use mechanisms such as voting or external verification tools to pick the best one. Sometimes an agent's accuracy can be boosted by sampling hundreds or thousands of responses. While this approach can improve performance, it comes at significant computational cost. In research settings, where the goal is to maximize accuracy, inference costs are not always a priority.
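To make the repetition-and-voting pattern concrete, here is a minimal sketch of majority voting over repeated samples. It is an illustration, not code from the study: `call_model` is a hypothetical stand-in for a single LLM API call, and the returned call count stands in for cost.

```python
from collections import Counter
from typing import Callable

def majority_vote(call_model: Callable[[str], str], query: str, n_samples: int = 5) -> tuple[str, int]:
    """Sample the stochastic model n_samples times and return the most common answer.

    Also returns the number of model calls spent, making the accuracy/cost
    trade-off explicit: more samples tend to raise accuracy, but cost grows
    linearly with n_samples.
    """
    answers = [call_model(query) for _ in range(n_samples)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner, n_samples
```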
In practical applications, however, the budget available for each query is limited, which makes cost control crucial in agent evaluation. Otherwise, researchers might be incentivized to develop extremely expensive agents simply to top the leaderboard. The Princeton researchers propose visualizing evaluation results as a Pareto curve of accuracy and inference cost, and using techniques that optimize the agent jointly for these two metrics.
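As an illustration of the Pareto view the researchers propose, the sketch below (our own code, with made-up numbers) filters a set of evaluated agents down to the points that no other agent beats on both cost and accuracy.

```python
def pareto_frontier(results: list[dict]) -> list[dict]:
    """Keep only agents that are not dominated: no other agent is at least
    as cheap and at least as accurate, with one of the two comparisons strict."""
    frontier = []
    for a in results:
        dominated = any(
            b["cost"] <= a["cost"] and b["accuracy"] >= a["accuracy"]
            and (b["cost"] < a["cost"] or b["accuracy"] > a["accuracy"])
            for b in results
        )
        if not dominated:
            frontier.append(a)
    return sorted(frontier, key=lambda r: r["cost"])

# Hypothetical evaluation results: accuracy in [0, 1], cost in dollars per query.
results = [
    {"name": "single call",    "accuracy": 0.62, "cost": 0.01},
    {"name": "5-way voting",   "accuracy": 0.68, "cost": 0.05},
    {"name": "complex agent",  "accuracy": 0.66, "cost": 0.30},  # dominated by 5-way voting
    {"name": "100-way voting", "accuracy": 0.70, "cost": 1.00},
]
print([r["name"] for r in pareto_frontier(results)])
# -> ['single call', '5-way voting', '100-way voting']
```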
The researchers evaluated the accuracy-cost trade-offs of different prompting techniques and agentic patterns proposed in various papers.
“With roughly the same accuracy, the costs can differ by almost two orders of magnitude,” the researchers write. “Yet the cost of running these agents isn’t listed as a key metric in any of these papers.”
The researchers argue that optimizing for both metrics can lead to “agents that cost less while still being accurate.” Joint optimization also lets researchers and developers trade off the fixed and variable costs of running an agent. For example, they can spend more on optimizing the agent's design while reducing variable costs by using fewer in-context learning examples in the agent's prompt.
The researchers tested joint optimization on HotpotQA, a popular question-answering benchmark. Their results show that the joint optimization formulation provides a way to strike an optimal balance between accuracy and inference cost.
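One simple way to express such a joint formulation, sketched here under our own assumptions rather than as the exact objective used in the HotpotQA experiments, is to search over agent configurations and score each one by cost-penalized accuracy.

```python
from typing import Callable

def best_config(configs: list[dict],
                evaluate: Callable[[dict], tuple[float, float]],
                cost_weight: float = 0.5) -> dict:
    """Pick the configuration with the highest cost-penalized accuracy.

    `evaluate(config)` is assumed to run the agent on a dev set and return
    (accuracy, average_cost_per_query); cost_weight sets how strongly cost
    is traded against accuracy.
    """
    def score(config: dict) -> float:
        accuracy, cost = evaluate(config)
        return accuracy - cost_weight * cost
    return max(configs, key=score)

# Hypothetical search space: fewer in-context examples lower the per-query
# (variable) cost, at the price of more up-front (fixed) optimization effort.
search_space = [{"n_fewshot": k, "use_voting": v}
                for k in (0, 2, 4, 8) for v in (False, True)]
```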
“Meaningful agent evaluations must take cost into account – even if we ultimately don’t care about cost and only want to identify innovative agent designs,” the researchers write. “Accuracy alone cannot detect progress because it can be improved by scientifically meaningless methods such as repetition.”
Model development vs. downstream applications
Another issue the researchers highlight is the difference between evaluating models for research purposes and developing downstream applications. In research, accuracy is often the main focus and inference costs are largely ignored. But when building real-world applications on top of AI agents, inference costs play a crucial role in deciding which model and technique to use.
Estimating the inference cost of AI agents is tricky. For example, different model providers can charge different amounts for the same model. At the same time, the cost of API calls changes regularly and can vary depending on developers' decisions. For example, on some platforms, bulk API calls are priced differently.
To address this issue, the researchers created a website that adjusts model comparisons based on token pricing.
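The details of that site are not described here, but a minimal sketch of the underlying adjustment might look like the following: recompute each run's dollar cost from its recorded token counts and user-supplied per-token prices, so the same evaluation traces can be re-priced whenever providers change their rates. The prices below are placeholders, not real rates.

```python
def call_cost(prompt_tokens: int, completion_tokens: int, price_per_million: dict) -> float:
    """Dollar cost of one model call, given prices per million tokens."""
    return (prompt_tokens * price_per_million["input"]
            + completion_tokens * price_per_million["output"]) / 1_000_000

# Placeholder prices in dollars per million tokens; substitute your provider's rates.
prices = {"input": 5.0, "output": 15.0}
print(call_cost(prompt_tokens=12_000, completion_tokens=800, price_per_million=prices))  # 0.072
```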
They also conducted a case study on NovelQA, a benchmark for question answering over very long texts. They found that benchmarks intended for model evaluation can be misleading when used for downstream evaluation. For example, the original NovelQA setup makes Retrieval-Augmented Generation (RAG) look much worse relative to long-context models than it is in a real-world scenario. Their results show that RAG and long-context models are roughly equally accurate, while long-context models are 20 times more expensive.
Overfitting is an issue
When learning new tasks, machine learning (ML) models often find shortcuts that allow them to perform well on benchmarks. One well-known type of shortcut is “overfitting,” where the model finds ways to game the benchmark and produces results that do not transfer to the real world. The researchers found that overfitting is a serious problem for agent benchmarks, which tend to be small, typically consisting of only a few hundred samples. This problem is more severe than data contamination in training foundation models, since knowledge of test samples can be programmed directly into the agent.
To address this problem, the researchers suggest that benchmark developers create and keep holdout test sets made up of examples that cannot be memorized during training and can only be solved through a genuine understanding of the target task. In their analysis of 17 benchmarks, the researchers found that many lacked proper holdout sets, allowing agents to take shortcuts, even unintentionally.
“Surprisingly, we find that many agent benchmarks do not include held-out test sets,” the researchers write. “Benchmark developers should not only create a test set but also consider keeping it secret to prevent LLM contamination or agent overfitting.”
They also believe that different types of holdout examples are needed depending on the desired level of generality of the task the agent performs.
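As a small sketch of what such splits could look like (our own construction, not the paper's tooling), the function below holds out either individual samples or whole groups of samples keyed by a field such as a task family or website, which corresponds to testing generalization at different levels.

```python
import random
from typing import Optional

def holdout_split(samples: list[dict], holdout_frac: float = 0.2,
                  group_key: Optional[str] = None, seed: int = 0):
    """Split benchmark samples into (public, holdout).

    With group_key=None, individual samples are held out. With a group_key
    such as "task_family" or "website", whole groups are held out, testing
    generalization at that level rather than per-sample memorization.
    """
    rng = random.Random(seed)
    if group_key is None:
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - holdout_frac))
        return shuffled[:cut], shuffled[cut:]
    groups = sorted({s[group_key] for s in samples})
    rng.shuffle(groups)
    held = set(groups[:max(1, int(len(groups) * holdout_frac))])
    public = [s for s in samples if s[group_key] not in held]
    holdout = [s for s in samples if s[group_key] in held]
    return public, holdout
```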
“Benchmark developers must do their best to ensure that shortcuts are impossible,” the researchers write. “We see this as the responsibility of benchmark developers rather than agent developers, because developing benchmarks that don’t allow shortcuts is much easier than checking each individual agent to see if it takes shortcuts.”
The researchers tested agents on WebArena, a benchmark that evaluates how well AI agents solve tasks on different websites. They found several shortcuts in the training datasets that allowed agents to fit the tasks in ways that would easily break down with minor changes in the real world. For example, an agent could make assumptions about the structure of web addresses without accounting for the fact that they might change in the future or would not hold on other websites.
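As a hypothetical illustration of that kind of shortcut (not an actual WebArena agent), an agent that hard-codes a site's URL pattern scores well on a fixed benchmark snapshot but breaks the moment the site reorganizes; navigating from the links actually present on the page avoids baking in that assumption. `fetch_page` and `find_link_by_text` are assumed helpers standing in for the agent's browsing tools.

```python
# Shortcut: assumes every product page lives at <base_url>/item/<id>.
# This scores well on a fixed benchmark snapshot but breaks if the site
# changes its URL scheme or the agent is pointed at a different site.
def product_url_shortcut(base_url: str, product_id: str) -> str:
    return f"{base_url}/item/{product_id}"

# More robust: discover the link from the page the agent actually observes.
def product_url_robust(base_url: str, product_name: str, fetch_page, find_link_by_text) -> str:
    html = fetch_page(base_url)                   # agent's "browse" tool
    return find_link_by_text(html, product_name)  # agent's "find link" tool
```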
These errors inflate accuracy estimates and lead to over-optimism about the agents' capabilities, the researchers warn.
Because AI agents are a new field, the research and development communities still have a lot to learn about how to test the limits of these new systems, which could soon become an important part of everyday applications.
“Benchmarking AI agents is new and there are no best practices yet, making it difficult to distinguish real progress from hype,” the researchers write. “Our thesis is that agents are sufficiently different from models that benchmarking practices need to be rethought.”