
Will updating your AI agents improve or hurt their performance? Raindrop's latest Experiments tool tells you

In the two years since ChatGPT launched, new large language models (LLMs) have been released almost every week, by competing labs and by OpenAI itself. It is difficult for companies to keep up with the pace of change, let alone understand how to adapt to it. Which, if any, of these new models should they adopt to power their workflows and the custom AI agents they build on top of them?

Help has arrived: AI application observability startup Raindrop has launched Experiments, a new analytics feature the company calls the first A/B testing suite designed specifically for enterprise AI agents. It lets companies see and compare how updating agents to new underlying models, or changing their instructions and tool access, affects their performance with real end users.

The release expands Raindrop's existing observability tools and gives developers and teams the ability to see how their agents behave and evolve in real-world conditions.

Experiments lets teams trace how changes, like a new tool, a prompt, a model update, or a full pipeline redesign, affect AI performance across millions of user interactions. The new feature is available now to users of Raindrop's Pro subscription plan ($350 per month) at raindrop.ai.

A data-driven perspective on agent development

Raindrop co-founder and chief technology officer Ben Hylak noted in a product announcement video (above) that Experiments helps teams see “how things have actually changed,” including tool usage, user intent, and issue rates, and examine differences based on demographic factors like language. The goal is to make model iteration more transparent and measurable.

The Experiments interface visually displays results and shows whether an experiment is performing better or worse than its baseline. A rise in negative signals could indicate more frequent task errors or partial code output, while improvements in positive signals could indicate more complete answers or better user experiences.

By making this data easy to interpret, Raindrop encourages AI teams to approach agent iteration with the same care they would bring to deploying modern software: tracking results, sharing insights, and addressing regressions before they become more severe.

Background: From AI observability to experimentation

Raindrop's launch of Experiments builds on the company's foundation as an early entrant in AI-native observability platforms, which help companies monitor and understand the behavior of their generative AI systems in production.

As VentureBeat reported earlier this year, the company, originally known as Dawn AI, emerged to tackle what Hylak, a former Apple human interface designer, called the “black box problem” of AI performance, helping teams identify errors as they occur and explaining to companies what went wrong and why.

At the time, Hylak described how “AI products fail all the time, in ways both hilarious and frightening,” noting that “unlike traditional software, which throws clear exceptions, AI products fail in silence.” Raindrop's original platform focused on detecting these silent failures by analyzing signals such as user feedback, task errors, refusals, and other conversational anomalies across millions of daily events.

The company's co-founders, Hylak, Alexis Gauba, and Zubin Singh Koticha, built Raindrop after experiencing firsthand how difficult it is to debug AI systems in production.

“We started out building AI products, not infrastructure,” Hylak told VentureBeat. “But pretty quickly we realized that we needed tools to understand AI behavior in order to build anything serious, and those tools didn't exist.”

With Experiments, Raindrop extends that same mission from detecting errors to measuring improvements. The new tool turns observability data into actionable comparisons, allowing companies to test whether changes to their models, prompts, or pipelines actually make their AI agents better, or simply different.

Solving the “evals pass, agents fail” problem

Traditional evaluation frameworks, while useful for benchmarking, rarely capture the unpredictable behavior of AI agents operating in dynamic environments.

As Raindrop co-founder Alexis Gauba explained in her LinkedIn announcement: “Traditional evals don't really answer this question. They're great unit tests, but you can't predict your users' actions, and your agent runs for hours calling hundreds of tools.”

Gauba said the company has consistently heard a common frustration among teams: “Evals pass, agents fail.”

Experiments is intended to close this gap by showing what actually changes when developers ship updates to their systems.

The tool enables direct comparisons across models, tools, intents, or properties, and reveals measurable differences in behavior and performance.

Designed for real-world AI behavior

In the announcement video, Raindrop described Experiments as a way to “compare and measure how your agent's behavior in production has actually changed across millions of real-world interactions.”

The platform helps users identify issues such as spikes in task errors, forgetting, or new tools that trigger unexpected failures.

It can also be used the other way around: starting from a known problem, such as an “agent stuck in a loop,” and tracing which model, tool, or flag is causing it.

From there, developers can dive into detailed traces to find the root cause and quickly ship a fix.

Each experiment provides a visual breakdown of metrics such as tool usage frequency, error rates, conversation duration, and response length.

Users can click on any comparison to access the underlying event data, giving a clear view of how agent behavior has changed over time. Shareable links make it easy to collaborate with teammates or report results.
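To make those metrics concrete, here is a minimal sketch, assuming a team keeps its own per-conversation event log, of how tool usage frequency, task-error rate, conversation duration, and response length could be aggregated per experiment variant. The AgentEvent fields and the summarize function are illustrative stand-ins, not Raindrop's schema or API.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class AgentEvent:
    """One logged agent conversation; field names are illustrative, not Raindrop's schema."""
    variant: str         # experiment arm, e.g. "baseline" or "new-model"
    tool_calls: int      # number of tools the agent invoked
    task_failed: bool    # whether the conversation ended in a task error
    duration_s: float    # conversation duration in seconds
    response_chars: int  # total length of the agent's responses

def summarize(events: list[AgentEvent]) -> dict[str, dict[str, float]]:
    """Aggregate per-variant metrics of the kind an experiments dashboard would chart."""
    summary: dict[str, dict[str, float]] = {}
    for variant in sorted({e.variant for e in events}):
        group = [e for e in events if e.variant == variant]
        summary[variant] = {
            "avg_tool_calls": mean(e.tool_calls for e in group),
            "task_error_rate": sum(e.task_failed for e in group) / len(group),
            "avg_duration_s": mean(e.duration_s for e in group),
            "avg_response_chars": mean(e.response_chars for e in group),
        }
    return summary
```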

Integration, scalability and accuracy

According to Hylak, Experiments integrates directly with “the feature flag platforms that companies know and love (like Statsig!)” and is designed to work seamlessly with existing telemetry and analytics pipelines.

For companies without these integrations, performance can still be compared over time, for example yesterday versus today, without additional setup.
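As a rough illustration of how such a setup could work, the sketch below assigns each user to an experiment arm, runs a placeholder agent, and attaches the variant label to the telemetry it logs so downstream analytics can segment metrics by arm. The get_variant and log_event helpers and the model names are hypothetical stand-ins, not Statsig's or Raindrop's actual SDKs.

```python
import hashlib
import json
import time
from typing import Any

def get_variant(user_id: str) -> str:
    """Hypothetical stand-in for a feature-flag lookup (a real setup would ask the flag platform).
    Hashes the user ID into one of two stable buckets."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 2
    return "new-model" if bucket else "baseline"

def log_event(event: dict[str, Any]) -> None:
    """Hypothetical telemetry sink; a real pipeline would ship this to an observability backend."""
    print(json.dumps(event))

def run_agent(user_id: str, prompt: str) -> str:
    variant = get_variant(user_id)
    model = "new-model-id" if variant == "new-model" else "baseline-model-id"  # placeholder names
    reply = f"[{model}] response to: {prompt}"  # stand-in for a real agent/LLM call
    log_event({
        "user_id": user_id,
        "variant": variant,        # the label that lets analytics segment metrics by experiment arm
        "model": model,
        "task_failed": False,      # in practice, derived from task-completion signals
        "response_chars": len(reply),
        "timestamp": time.time(),
    })
    return reply

run_agent("user-123", "Summarize yesterday's error reports")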

According to Hylak, teams typically need around 2,000 users per day to achieve statistically meaningful results.

To keep comparisons accurate, Experiments monitors sample size adequacy and alerts users when a test does not have enough data to draw valid conclusions.
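Raindrop does not disclose which statistical test it uses, but a common way to judge whether a difference in task-error rates between two arms is meaningful, and whether there is enough data at all, is a two-proportion z-test with a minimum-sample check. The sketch below is a generic illustration of that idea, not the product's implementation; the 2,000-user figure above is reused as a default threshold.

```python
from math import erf, sqrt

def compare_error_rates(fail_a: int, n_a: int, fail_b: int, n_b: int,
                        min_n: int = 2000) -> dict[str, float | bool]:
    """Two-proportion z-test comparing task-error rates between a baseline (a)
    and an experiment arm (b), plus a simple minimum-sample-size check."""
    p_a, p_b = fail_a / n_a, fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se if se > 0 else 0.0
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return {
        "baseline_error_rate": p_a,
        "experiment_error_rate": p_b,
        "p_value": p_value,
        "enough_data": n_a >= min_n and n_b >= min_n,
    }

# Example: 120 failures in 2,400 baseline sessions vs. 90 in 2,300 experiment sessions.
print(compare_error_rates(fail_a=120, n_a=2400, fail_b=90, n_b=2300))
```

A small p-value (commonly below 0.05) suggests the change in error rate is unlikely to be noise, provided both arms cleared the sample-size floor.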

“We put a lot of emphasis on making sure metrics like task errors and user frustration are metrics you'd wake up an on-call engineer for,” Hylak explained. He added that teams can drill down into the specific conversations or events driving those metrics, providing transparency behind each aggregate number.

Security and privacy

Raindrop operates as a cloud-hosted platform but also offers on-premises personally identifiable information (PII) redaction for companies that need additional control.

Hylak said the company is SOC 2 compliant and has launched a PII Guard feature that uses AI to automatically remove sensitive information from stored data. “We take the protection of customer data very seriously,” he emphasized.

Pricing and Plans

Experiments is part of Raindrop's Pro plan, which costs $350 per month or $0.0007 per interaction. The Pro tier also includes comprehensive research tools, topic clustering, custom issue tracking, and semantic search capabilities.

Raindrop's Starter plan, at $65 per month or $0.001 per interaction, provides core analytics including issue detection, user feedback signals, Slack notifications, and user tracking. Both plans include a 14-day free trial.

Larger organizations can choose a Business plan with custom pricing and advanced features like SSO login, custom alerts, integrations, edge PII redaction, and priority support.

Continuous improvement for AI systems

With Experiments, Raindrop positions itself at the intersection of AI evaluation and software observability. Its focus on “measuring truth,” as outlined in the product video, reflects a broader industry push toward accountability and transparency in AI operations.

Rather than relying solely on offline benchmarks, Raindrop's approach emphasizes real user data and contextual understanding. The company hopes this will let AI developers move faster, identify root causes earlier, and ship better-performing models with confidence.
