Mistral’s AI voice model can listen to and transcribe up to 30 minutes of audio or 40 minutes of audio understanding without switching to a separate mode and offer summarization at less than half the price of comparable APIs • DigiBanker

Mistral released an open-sourced voice model that could rival paid voice AI, such as those from ElevenLabs and Hume AI, which the company said bridges the gap between proprietary speech recognition models and the more open, yet error-prone versions. The company said Voxtral “offers state-of-the-art accuracy and native semantic understanding in the open, at less than half the price of comparable APIs.” Voxtral, at a 32K token context, can listen to and transcribe up to 30 minutes of audio or 40 minutes of audio understanding. It offers summarization, meaning the model can answer questions based on the audio content and generate summaries without switching to a separate mode. Users can trigger functions and API calls based on spoken instructions. The model is based on Mistral’s Mistral Small 3.1 and supports multiple languages and can automatically detect languages. Mistral added enterprise features to Voxtral, including private deployment, so that organizations can integrate the model into their own ecosystems. These features also include domain-specific fine-tuning and advanced context and priority access to engineering resources for customers who need help integrating Voxtral into their workflows. Mistral stated that Voxtral outperformed existing voice models, including OpenAI’s Whisper, Gemini 2.5 Flash and Scribe from ElevenLabs. Voxtral presented fewer word errors compared to Whisper, which is currently considered the best automatic speech recognition model available.

Read Article