OpenAI has released what it calls its “most advanced speech-to-speech model yet.” Dubbed gpt-realtime, the model is better at following complex instructions, calling tools with precision, and producing speech that sounds more natural and expressive. “We trained the model in close collaboration with customers to excel at real-world tasks like customer support, personal assistance and education—aligning the model to how developers build and deploy voice agents,” OpenAI said.

OpenAI also said it has made the Realtime API generally available after introducing it in public beta in October and seeing thousands of developers build with it. The API gains new features for building voice agents: support for remote MCP servers, image inputs, and phone calling through Session Initiation Protocol (SIP). These features make voice agents more capable by giving them access to additional tools and context.

Unlike traditional pipelines that chain multiple models across speech-to-text and text-to-speech, the Realtime API processes and generates audio directly through a single model and API. This reduces latency, preserves nuance in speech, and produces more natural, expressive responses.
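In practice, that single-model design means a client holds one WebSocket session and streams events in both directions rather than shuttling audio between separate services. Below is a minimal sketch in Python, assuming the endpoint, headers, and event names from the Realtime API's beta documentation (details may have shifted slightly at general availability); the model name comes from the announcement.

```python
import asyncio
import json
import os

import websockets  # pip install websockets


async def main():
    # Endpoint and model name as described in OpenAI's announcement; the
    # beta required the OpenAI-Beta header, which GA may no longer need.
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }

    # On websockets < 14, pass extra_headers= instead of additional_headers=.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Configure the session once; instructions persist for its lifetime.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "instructions": "You are a concise customer-support agent.",
            },
        }))

        # Ask the model to respond; audio arrives as streamed delta events.
        await ws.send(json.dumps({"type": "response.create"}))

        async for message in ws:
            event = json.loads(message)
            print(event["type"])  # e.g. response.audio.delta (beta name)
            if event["type"] == "response.done":
                break


asyncio.run(main())
```

Because audio never leaves the session to pass through separate transcription and synthesis models, each turn avoids the serialization overhead a chained pipeline incurs.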
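The remote MCP support mentioned above is configured on the same session object, so a voice agent can call externally hosted tools without the client proxying them. A hedged sketch, reusing the WebSocket from the previous example; the field names mirror those shown in OpenAI's announcement, while the server label and URL here are placeholders:

```python
import json

# Attach a remote MCP server as a session-level tool. "billing" and the
# URL are hypothetical; authorization and approval policy depend on the
# server, so check OpenAI's current Realtime API docs for exact fields.
mcp_session_update = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "mcp",
                "server_label": "billing",                # placeholder label
                "server_url": "https://example.com/mcp",  # placeholder URL
                "require_approval": "never",
            }
        ]
    },
}

# Serialized and sent over the open session from the previous sketch:
payload = json.dumps(mcp_session_update)
# await ws.send(payload)
```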