A startup founded by former Meta AI researchers has developed a lightweight AI model that can evaluate other AI systems as effectively as much larger models while providing detailed explanations for its decisions.
Patronus AI today released Glider, an open-source language model with 3.8 billion parameters that outperforms OpenAI's GPT-4o-mini on several key benchmarks for assessing AI output. The model is designed to function as an automated evaluator that can judge the responses of AI systems against a wide range of criteria while explaining its reasoning.
“Everything we do at Patronus is focused on providing powerful and reliable AI evaluation to developers and anyone using language models or building new LLM systems,” said Anand Kannappan, CEO and co-founder of Patronus AI, in an exclusive interview with VentureBeat.
Small but mighty: How Glider keeps pace with GPT-4
The development represents a major breakthrough in AI evaluation technology. Most companies currently depend on large proprietary models such as GPT-4 to evaluate their AI systems, a process that can be expensive and opaque. Besides being cheaper to run thanks to its smaller size, Glider provides detailed explanations for its judgments through its reasoning and highlighted sections of text that show exactly what influenced its decisions.
“Currently, there are numerous LLMs acting as judges, but we don't know which one is best suited to our task,” said Darshan Deshpande, research engineer at Patronus AI, who led the project. “In this paper, we show several advances: we trained a model that runs on-device, uses only 3.8 billion parameters, and provides high-quality reasoning chains.”
Real-time evaluation: Speed meets accuracy
The new model shows that smaller language models can match or exceed the capabilities of much larger models on specific tasks. Glider achieves performance comparable to models 17 times its size while operating with only one second of latency. This makes it practical for real-time applications where companies need to evaluate AI output as it is generated.
A key innovation is Glider's ability to judge multiple aspects of AI output simultaneously. The model can assess qualities such as accuracy, safety, coherence, and tone in a single pass, eliminating the need for separate evaluation runs, as the sketch below illustrates. It also retains strong multilingual capabilities despite being trained primarily on English data.
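To make the single-pass idea concrete, here is a minimal sketch of how a multi-aspect judge prompt and score parser might look. The rubric wording, 1-to-5 score scale, and output format are illustrative assumptions, not Glider's actual interface, which Patronus documents in its technical report:

```python
# Sketch of single-pass, multi-aspect evaluation. The prompt layout and
# score format below are assumptions for illustration, not Glider's rubric.
import re

ASPECTS = ["accuracy", "safety", "coherence", "tone"]

def build_judge_prompt(question: str, answer: str) -> str:
    """Pack every aspect into one prompt so the judge scores them all in a single pass."""
    rubric = "\n".join(f"- {a}: rate 1-5 and justify briefly" for a in ASPECTS)
    return (
        "You are an evaluation model. Score the answer on each aspect below.\n"
        f"{rubric}\n\n"
        f"Question: {question}\nAnswer: {answer}\n\n"
        "Respond with one line per aspect, formatted as 'aspect: score - reason'."
    )

def parse_scores(judge_output: str) -> dict:
    """Recover per-aspect scores from the judge's free-text reply."""
    scores = {}
    for aspect in ASPECTS:
        match = re.search(rf"{aspect}\s*:\s*([1-5])", judge_output, re.IGNORECASE)
        if match:
            scores[aspect] = int(match.group(1))
    return scores

# Canned judge reply standing in for a real model call:
reply = ("accuracy: 4 - mostly correct\nsafety: 5 - benign\n"
         "coherence: 4 - clear\ntone: 3 - slightly curt")
print(parse_scores(reply))  # {'accuracy': 4, 'safety': 5, 'coherence': 4, 'tone': 3}
```

Packing every aspect into one prompt is what eliminates the separate evaluation passes; a single model reply then yields a score and a short rationale for each criterion.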
“When you work in real-time environments, latency needs to be as low as possible,” Kannappan explained. “This model typically responds in less than a second, especially when used through our product.”
Privacy comes first: on-device AI evaluation becomes a reality
For companies developing AI systems, Glider offers several practical advantages. Its small size means it can run directly on consumer hardware, easing privacy concerns about sending data to external APIs. And because it is open source, companies can deploy it on their own infrastructure and customize it to their specific needs; a minimal local-inference sketch follows below.
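Since the model is open source, on-device use should follow the standard local-inference pattern with Hugging Face transformers. The model identifier and prompt below are assumptions made for illustration; verify the actual release name on the Hugging Face hub:

```python
# Minimal local-inference sketch using Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "PatronusAI/glider"  # assumed identifier; check the hub for the real name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # a 3.8B-parameter model in bf16 needs roughly 8 GB
    device_map="auto",           # place on GPU if available, otherwise CPU
)

# Illustrative evaluation prompt, not Glider's documented rubric format.
prompt = "Evaluate this answer for factual accuracy: 'The Eiffel Tower is in Berlin.'"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Strip the prompt tokens and print only the newly generated judgment.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Because the weights stay on local hardware, no response text ever leaves the machine, which is the privacy property Deshpande describes below.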
The model was trained on 183 different evaluation metrics across 685 domains, spanning basic qualities such as accuracy and coherence as well as more nuanced aspects such as creativity and ethical considerations. This comprehensive training helps it generalize to many different types of assessment tasks.
“Customers need on-device models because they can't send their private data to OpenAI or Anthropic,” Deshpande explained. “We also want to show that small language models can be effective evaluators.”
The release comes at a time when companies are increasingly focused on ensuring responsible AI development through robust evaluation and oversight. Glider's ability to offer detailed explanations for its judgments could help companies better understand and improve the behavior of their AI systems.
The Future of AI Assessment: Smaller, Faster, Smarter
Patronus AI, founded by machine learning experts from Meta AI and Meta Reality Labs, has positioned itself as a leading provider of AI evaluation technology. The company provides a platform for automated testing and safety checks of large language models, with Glider being its latest step toward making sophisticated AI evaluation more accessible.
The company plans to publish detailed technical research on Glider on arxiv.org today, demonstrating its performance across various benchmarks. Initial tests show that it achieves state-of-the-art results on several standard metrics while providing more transparent explanations than existing solutions.
“We are in the first innings,” Kannappan said. “We expect that over time more developers and companies will push the boundaries in these areas.”
Glider's development suggests that the future of AI systems may not require ever-larger models, but rather more specialized, efficient models optimized for specific tasks. Its success in matching the performance of larger models while offering better explainability could influence how companies approach AI evaluation and development going forward.