OpenAI today updated its Realtime API, which is currently in beta. The update adds new voices to its platform for speech-to-speech applications and cuts prices through prompt caching.
Realtime API beta users now have five new voices with which to build their applications. OpenAI introduced three of the new voices, Ash, Verse and the British-sounding Ballad, in a post on X.
Two Realtime API updates:
– You can now build speech-to-speech experiences with five new voices, which are much more expressive and steerable.
– We're lowering prices with prompt caching: cached text input gets a 50% discount and cached audio input gets an 80% discount… pic.twitter.com/jLzZDBrR7l
— OpenAI developer (@OpenAIDevs) October 30, 2024
The company said in its API documentation that the native speech-to-speech feature “skips an intermediate text format, meaning low latency and more nuanced output,” while the voices are easier to steer and more expressive than previous voices.
However, OpenAI warns that it cannot currently offer client-side authentication for the API because it is still in beta. It also said there may be issues processing real-time audio.
“Network conditions greatly impact real-time audio, and reliably delivering audio from a client to a server at scale is difficult when network conditions are unpredictable,” the company said.
OpenAI's history with AI-powered speech and voices is controversial. In March, the company released Voice Engine, a voice-cloning platform rivaling ElevenLabs, but limited access to only a small number of researchers. After demonstrating GPT-4o and Voice Mode in May, the company paused use of one of the voices, Sky, after actress Scarlett Johansson commented on its similarity to her voice.
The company launched ChatGPT Advanced Voice Mode in the US in September for paid subscribers (those using ChatGPT Plus, Enterprise, Team and Edu).
Speech-to-speech AI would ideally allow companies to create more real-time voice responses. Suppose a customer calls a company's customer support platform: the speech-to-speech feature can capture the person's voice, understand what they are asking, and respond with an AI-generated voice at lower latency. Speech-to-speech also allows users to generate voice-overs, where a user speaks their lines but the voice heard isn't theirs. Platforms that offer this include Replica and, of course, ElevenLabs.
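To make the interaction concrete, here is a minimal sketch of the kind of event a client sends over the Realtime API's WebSocket connection to configure a speech-to-speech session with one of the new voices. The `build_session_update` helper is hypothetical, and the exact field names follow OpenAI's beta documentation as of this writing, so they may change; the voice list below mixes the three names mentioned in this article with OpenAI's earlier Realtime voices and is an assumption, not an exhaustive roster.

```python
import json

# Hypothetical helper: constructs the "session.update" event that a client
# would send over the Realtime API WebSocket to pick a voice and set
# instructions. Field names follow the beta docs and may change.
SUPPORTED_VOICES = {"alloy", "echo", "shimmer", "ash", "ballad", "verse"}

def build_session_update(voice: str, instructions: str) -> str:
    """Return the JSON-encoded session.update event for the given voice."""
    if voice not in SUPPORTED_VOICES:
        raise ValueError(f"unknown voice: {voice}")
    event = {
        "type": "session.update",
        "session": {
            "voice": voice,                    # e.g. the new "ballad" voice
            "instructions": instructions,      # system-style steering text
            "modalities": ["audio", "text"],   # speech in, speech + text out
        },
    }
    return json.dumps(event)

payload = build_session_update("ballad", "You are a concise support agent.")
print(payload)
```

In a real client, this string would be sent as the first message after the WebSocket handshake, before streaming the caller's audio.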
OpenAI released the Realtime API during its Dev Day this month. The aim of the API is to speed up the development of voice assistants.
Reduce costs
However, using speech-to-speech features can be expensive.
When the Realtime API was introduced, the pricing was $0.06 per minute of audio input and $0.24 per minute of audio output, which is not cheap. The company now plans to reduce Realtime API prices through prompt caching.
Cached text input is discounted by 50% and cached audio input by 80%.
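A back-of-the-envelope estimator shows how the audio-input discount affects a session's bill. This is a sketch for illustration only: it applies the per-minute rates quoted above and the announced 80% cached-audio discount, whereas actual Realtime API billing is token-based, and the `session_cost` helper is hypothetical.

```python
# Assumed per-minute rates from the article's quoted launch pricing.
AUDIO_IN_PER_MIN = 0.06    # $ per minute of audio input
AUDIO_OUT_PER_MIN = 0.24   # $ per minute of audio output
CACHED_AUDIO_IN_DISCOUNT = 0.80  # announced 80% discount on cached audio input

def session_cost(input_min: float, output_min: float,
                 cached_input_min: float = 0.0) -> float:
    """Estimate a session's cost in dollars.

    cached_input_min is the portion of input_min served from the prompt cache.
    """
    fresh_input_min = input_min - cached_input_min
    cost = (
        fresh_input_min * AUDIO_IN_PER_MIN
        + cached_input_min * AUDIO_IN_PER_MIN * (1 - CACHED_AUDIO_IN_DISCOUNT)
        + output_min * AUDIO_OUT_PER_MIN
    )
    return round(cost, 4)

# 10 minutes of input (6 of them cache hits) plus 5 minutes of output:
print(session_cost(10, 5, cached_input_min=6))  # → 1.512
```

Without caching, the same session would cost 10 × $0.06 + 5 × $0.24 = $1.80, so the cached portion shaves off about 16% here.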
OpenAI also announced prompt caching during Dev Day; it keeps frequently used contexts and prompts in the model's memory, reducing the number of tokens that must be processed to generate responses. Lowering input prices could encourage more interested developers to build on the API.
OpenAI isn't the only company introducing prompt caching. Anthropic launched prompt caching for Claude 3.5 Sonnet in August.