OpenAI has officially launched three new real-time speech models for developers building voice applications: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. Each targets a different use case.

GPT-Realtime-2 is the first speech model with GPT-5-level reasoning capabilities. Designed for real-time speech interaction, it can handle complex requests, reason over users' questions and instructions, and keep a conversation coherent across turns. It can also call tools, handle user interruptions and corrections, and tailor its responses to the current context.

The second model, GPT-Realtime-Translate, focuses on real-time translation, supporting more than 70 input languages and 13 output languages. It is designed to keep pace with the speaker, providing an experience close to simultaneous interpretation. This lets users communicate more smoothly in cross-language calls, meetings, and live broadcasts.

GPT-Realtime-Whisper is a streaming speech transcription model built for low-latency speech-to-text. It transcribes speech while the speaker is still talking, making real-time products faster and more responsive. Typical uses include live-stream subtitles and meeting transcripts that keep up with the pace of the discussion.

On access and pricing, OpenAI stated that all three models are available through its Realtime API. GPT-Realtime-2 costs $32 per million audio input tokens and $64 per million audio output tokens. GPT-Realtime-Translate costs $0.034 per minute, and GPT-Realtime-Whisper costs $0.017 per minute. Developers can test the models directly in the Playground or integrate them into existing applications.
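To make the pricing concrete, here is a minimal Python sketch that estimates costs from the rates quoted above. The function names and the sample token and minute counts are hypothetical and chosen only for illustration. The sketch assumes billing is linear in audio tokens or minutes, which may not capture every item on an actual invoice.

```python
from decimal import Decimal

# Rates as quoted in the article, in USD.
REALTIME_2_INPUT_PER_MILLION = Decimal("32")    # per 1M audio input tokens
REALTIME_2_OUTPUT_PER_MILLION = Decimal("64")   # per 1M audio output tokens
TRANSLATE_PER_MINUTE = Decimal("0.034")
WHISPER_PER_MINUTE = Decimal("0.017")

MILLION = Decimal(1_000_000)


def realtime_2_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate GPT-Realtime-2 audio cost (hypothetical helper)."""
    total = (input_tokens * REALTIME_2_INPUT_PER_MILLION
             + output_tokens * REALTIME_2_OUTPUT_PER_MILLION)
    return total / MILLION


def per_minute_cost(minutes: int, rate_per_minute: Decimal) -> Decimal:
    """Estimate cost for a per-minute model (hypothetical helper)."""
    return minutes * rate_per_minute


if __name__ == "__main__":
    # Illustrative workload, not measured data.
    print("GPT-Realtime-2:", realtime_2_cost(500_000, 250_000))
    print("Translate, 60 min:", per_minute_cost(60, TRANSLATE_PER_MINUTE))
    print("Whisper, 60 min:", per_minute_cost(60, WHISPER_PER_MINUTE))
```

Under these assumptions, 500,000 input tokens plus 250,000 output tokens on GPT-Realtime-2 comes to $32, while an hour of audio costs $2.04 with GPT-Realtime-Translate and $1.02 with GPT-Realtime-Whisper.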

As generative AI continues to move toward multimodal, real-time interaction, OpenAI's three new speech models give developers more convenient tools for building speech applications.

Key points:   

🔊 GPT-Realtime-2 has advanced reasoning capabilities, enabling more natural real-time conversations.   

🌐 GPT-Realtime-Translate supports more than 70 input languages and 13 output languages, offering an experience close to simultaneous interpretation.

📝 GPT-Realtime-Whisper delivers low-latency transcription, suited to scenarios such as live-stream subtitles and meeting transcripts.