Inworld AI launches Realtime TTS-2, a new generation of voice model built to evolve how AI agents handle realtime conversation. The model understands the full context of a conversation and the user’s emotional state, tone, and pacing to determine not only what to say, but how to say it. The result is voice AI that feels as good as it sounds.
Realtime conversation is the most human way to connect with each other. Now, as AI takes up more of our conversations, it must evolve beyond intelligence to offer the same shared emotionality and context-awareness that make connection meaningful. But until today, voice AI has been tuned for static audiobooks and voiceovers rather than live connection. It has been text-to-speech and speech-to-text in the most literal sense: words converted to audio, and audio converted to words, making AI voice interactions feel more like a misinterpreted text message than a meaningful conversation.
The problem with today’s voice AI
When a frustrated customer calls support, today’s voice agents respond with the same bright, even tone they use for everything, because they have no awareness of how the caller is speaking. Inworld’s Realtime TTS-2 hears the frustration. Its voice softens and its pace slows. It reasons through the weight of the moment before it responds.
Or consider a patient calling to discuss lab results. They start the call measured and calm. Then the agent shares an unexpected finding. The patient’s voice tightens; their questions come faster. Today’s voice agents would barrel ahead at the same pace and pitch, ignoring the gravity of the situation. TTS-2 registers the shift in real time. It slows down. It leaves space. It delivers the next piece of information with steadiness and care, not because someone scripted a “nervous patient” pathway, but because the model heard how the person was speaking and adapted the way a human would.
The reason today’s voice agents sound mechanical in conversation is architectural. Conventional voice models receive a string of text and produce audio. They have no access to how the user sounds, what was said before, or what the moment requires.
A new kind of voice model
Inworld’s previous model, TTS-1.5, already ranks #1 on the Artificial Analysis Speech Arena, above Google and ElevenLabs. With voice quality achieved, Inworld set out to build TTS-2 with a fundamentally different architecture – one that can process conversation the way a human listener would, before a single word is spoken.
Before speech is generated, TTS-2 captures the user’s audio and extracts context, emotion, and tone in real time. It then reasons over the full conversational history: what was said in previous turns, what the most important moments were, what can be inferred from how the user sounds right now. From this, it estimates the user’s emotional state and determines the agent’s appropriate response state: not just what to say, but how to say it. What non-verbal expressions are appropriate. How what the agent says might land given everything that came before.
Inside TTS-2, all of that context converges. The model receives what to say, how to say it (full natural-language voice direction, not preset emotion tags), direct audio from the user (to further condition expression in real time), and the full conversation history. TTS-2 synthesizes all of this into emotionally aware, contextual speech, adjusting tone, pacing, and delivery based on the complete picture of the interaction.
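The flow described above can be sketched conceptually in code. This is a minimal illustration of context-aware delivery planning, not the Inworld API: every name here (`UserSignal`, `plan_delivery`, the field names) is hypothetical.

```python
from dataclasses import dataclass

# Illustrative sketch of the context-aware synthesis flow described above.
# All names are hypothetical; this is not the Inworld API.

@dataclass
class UserSignal:
    text: str      # transcript of the user's turn
    emotion: str   # estimated from audio, e.g. "frustrated", "anxious"
    pace: str      # e.g. "fast", "measured"

def plan_delivery(history: list[UserSignal], latest: UserSignal) -> dict:
    """Choose *how* to speak based on how the user sounds, not just what to say."""
    if latest.emotion == "frustrated":
        style = "soften the tone and slow the pace"
    elif latest.pace == "fast" and latest.emotion == "anxious":
        style = "steady and unhurried, leave space between sentences"
    else:
        style = "neutral, warm"
    # The real model conditions on the full conversation; here we only count turns.
    return {"voice_direction": style, "context_turns": len(history) + 1}

plan = plan_delivery([], UserSignal("My order never arrived!", "frustrated", "fast"))
print(plan["voice_direction"])
```

The point of the sketch is the shape of the decision: delivery style is a function of the audio-derived signals and the conversation history, not of the reply text alone.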
The result is a voice system that sounds like a person in conversation, not a person reading an audiobook. Developers steer the model with natural language the way they prompt an LLM: full descriptions like [act like you just got home from a long day, tired but warm], combined with inline controls for specific moments ([whispering], [sigh], [excited]). The voice is as controllable as it is expressive, across over 100 languages with on-the-fly switching inside a single generation, preserving the speaker’s voice identity across every language.
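A request combining the two steering styles described above might look like the following. The field names and payload shape are assumptions for illustration only, not the documented Inworld API schema; the inline tags and the natural-language direction are the controls named in the text.

```python
import json

# Hypothetical synthesis request showing both steering styles from the text:
# a full natural-language voice direction, plus inline tags for specific moments.
# Field names ("text", "voice_direction", "language") are assumed, not official.

payload = {
    "text": "[whispering] I missed you. [sigh] It's been such a long week.",
    "voice_direction": "act like you just got home from a long day, tired but warm",
    "language": "en",  # per the text, 100+ languages with mid-generation switching
}

body = json.dumps(payload)
print(body)
```

The inline tags shape individual moments while the `voice_direction` string sets the overall performance, mirroring how a director might give an actor both a scene note and line readings.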
“We are obsessed with how voice AI feels, not just how it sounds. Realtime voice is the most natural way for people to communicate with AI, because it is the most natural way people communicate with each other. Voice is how we actually connect. We built TTS-2 to make that connection feel real,” said Kylan Gibbs, CEO and Co-Founder of Inworld AI.
“Most TTS models generate speech in isolation from the conversation around them. TTS-2 is trained to use audio context from the full multi-turn exchange, and to take voice direction, so how the model speaks adjusts to how it was spoken to. Building a system that does this in real time, at production quality, with full controllability, required solving problems that the field had treated as future work for years. It is a different generation of system than a text-to-audio model, and it is what is required for voice AI that behaves naturally inside a realtime pipeline,” said Igor Poletaev, Chief Science Officer at Inworld AI.
Availability
Inworld Realtime TTS-2 is available via the Inworld API, and as part of the Inworld Realtime API for end-to-end speech-to-speech over a single persistent connection. Integration partners include Layercode, LiveKit, NLX, Pipecat, Vapi, and Voximplant. Developers can try the live demo or learn more at inworld.ai/tts. See inworld.ai/pricing for current rates.
About Inworld AI
Inworld is a research lab focused on solving realtime interaction. The company’s Realtime TTS is ranked #1 on the Artificial Analysis Speech Arena, with two of the top five positions. Realtime STT offers speech recognition that includes voice profiling to detect detailed user context. The Realtime Router is a user-aware reasoning layer that selects the optimal model and prompt for every context. And the Realtime API unifies everything into a single persistent connection for full-duplex conversational AI that pairs contextual understanding with natural speech. The founding team comes from DeepMind and Google, and has raised $125M+ from leading investors including Lightspeed Venture Partners, Section 32, Bitkraft, Kleiner Perkins, and Founders Fund.
View source version on businesswire.com: https://www.businesswire.com/news/home/20260505096579/en/