Large language models like ChatGPT, Gemini, and Claude, which initially operated primarily through written prompts, are gradually becoming systems capable of real-time conversation with users. Thanks to "voice AI" technologies that have advanced significantly over the last two years, artificial intelligence systems are no longer simple assistants that merely recognize voice commands; they now understand speech, follow context, perform tasks, and engage in natural dialogue. The new generation of voice models introduced by OpenAI this week is one of the most striking examples of this transformation.
OpenAI is offering three new models to developers via its API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. Together they are regarded as the company's most ambitious step in real-time voice AI. According to the company's statement, the models are meant to let users communicate naturally with AI, without a keyboard, while driving, navigating an airport, or dealing with customer service.
GPT-Realtime-2 Can Perform Tasks During Conversation
The most notable model here is GPT-Realtime-2. Described by the company as "the first voice model with GPT-5 level reasoning capabilities," this system can follow long and complex conversations and maintain a natural dialogue even if the user interrupts. One of the points particularly emphasized by OpenAI is that the model can now not only converse but also actively perform tasks during the conversation. Developers can grant this model access to tools such as calendars, search engines, or internal company systems. The model can also provide natural feedback to the user while performing these operations, such as "checking your calendar" or "I'm looking that up now."
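The pattern described above, in which a tool is called mid-conversation while the user hears a short spoken acknowledgement, can be illustrated with a minimal sketch. This is not OpenAI's API: the registry, the function names (check_calendar, handle_tool_request), and the sample calendar data are all hypothetical, chosen only to show the general idea of pairing each tool with a filler phrase and recovering gracefully when a call fails.

```python
from typing import Callable, Dict, List, Tuple


def check_calendar(day: str) -> str:
    """Hypothetical tool: a stand-in for a real calendar lookup."""
    events = {"monday": "10:00 design review"}  # made-up sample data
    return events.get(day.lower(), "no events")


# Hypothetical registry: tool name -> (spoken filler phrase, function).
TOOLS: Dict[str, Tuple[str, Callable[[str], str]]] = {
    "check_calendar": ("Checking your calendar.", check_calendar),
}


def handle_tool_request(name: str, argument: str) -> List[str]:
    """Return the utterances to speak, in order, for one tool request."""
    if name not in TOOLS:
        return ["Sorry, I can't do that yet."]
    filler, func = TOOLS[name]
    utterances = [filler]  # spoken first, so the user is not left in silence
    try:
        utterances.append(func(argument))
    except Exception:
        # Recover from a failed operation instead of ending the dialogue.
        utterances.append("Something went wrong; let me try again.")
    return utterances
```

In a real voice application the filler phrase would be spoken while the tool call is still running; here both utterances are simply returned in order to keep the sketch self-contained.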
The new model also features significant technical improvements. OpenAI has increased its voice model's context window from 32K to 128K tokens, a fourfold increase. This allows the model to follow much longer conversations and perform more complex tasks without losing track of earlier parts of the dialogue. The added capacity is particularly important in use cases such as customer service or lengthy support calls. The company also states that the model recovers better from failed operations and more accurately understands technical terminology used in fields like healthcare.
The performance tests shared by OpenAI also demonstrate a significant improvement in the new model's voice interactions. According to the company's data, GPT-Realtime-2 achieved a 15.2% higher score in Big Bench Audio tests compared to the previous generation.
Competing with Gemini Live
OpenAI's new models put the company in direct competition with Google's Gemini Live system. However, there are distinct differences in the two companies' approaches. While Google focuses more on fast response times and broad language support, OpenAI appears to be prioritizing the development of a natural and seamless conversational experience.
The second model introduced, GPT-Realtime-Translate, focuses on real-time translation. According to OpenAI's statement, this model supports over 70 input languages and can translate them into 13 output languages as the speaker talks, keeping up with the speaker's pace. The company positions this system particularly for customer service, travel applications, and multilingual communication platforms.
OpenAI also shared some examples of companies that have started using this technology. One such company, Deutsche Telekom, is developing voice support systems where customers can speak in their own languages, and the AI instantly translates the conversation. Such systems are expected to reduce the need for human translators, especially in international customer service.
The third model announced, GPT-Realtime-Whisper, focuses on live transcription: it converts speech into text as the user speaks. The technology can be used for meeting notes, call centers, live broadcast subtitles, or voice recording solutions, and is regarded as the next generation of OpenAI's long-running Whisper speech recognition technology.
According to the company's statement, the long-term goal is to create full-fledged AI agents that can listen, think, translate, transcribe, and take action simultaneously. OpenAI's new voice models are seen as an important step toward this next stage in the evolution of artificial intelligence.