Microsoft’s new open-source AI model creates podcasts and other audio, generating four distinct voices for up to 90 minutes; includes research-only licensing and watermarking safeguards • DigiBanker

Microsoft has released VibeVoice, a new open-source AI model that lets users create podcasts and other audio, a counter to Google’s popular NotebookLM. Microsoft’s text-to-speech model can generate up to four voices and up to 90 minutes of podcast-quality speech, whereas NotebookLM supports two voices.

The two tools also work differently. VibeVoice reads and organizes text, while NotebookLM ingests documents and turns them into two-person podcasts; NotebookLM users can also query their documents and get summaries, according to tech firm Hugging Face. In other words, VibeVoice doesn’t try to understand the text but rather performs it audibly, ostensibly to replace a recording studio.

VibeVoice has 1.5 billion parameters, relatively small for a model capable of sustaining dialogue across multiple speakers. It was built on Alibaba’s open-source Qwen2.5, a large language model that helps orchestrate natural turn-taking and contextually aware speech patterns during dialogues. Microsoft claims this lets VibeVoice produce fluid conversations among four voices while maintaining each voice’s distinct characteristics, even in longer conversations.

Potential research applications of VibeVoice include the following:

Prototyping podcasts and training content: Creators could generate mock podcasts, panel discussions or training modules with multiple AI voices. Instead of hiring four voice actors to test dialogue flow, users can create a synthetic version from text in minutes.

Accessibility and education: Educational material, textbooks or research papers could be turned into long-form audio with distinct narrators. This could help people who learn better by listening, or make dense material more engaging.

Game and media development: Game developers or storytellers could use VibeVoice to prototype dialogue between characters. Because it handles four speakers, they can stage a full in-game conversation without recording sessions.
