Inworld AI launches Realtime TTS-2, a new generation of voice model built to evolve how AI agents handle realtime conversation. The model understands the full context of a conversation and the user’s emotional state, tone, and pacing to determine not only what to say, but how to say it. The result is voice AI that feels as good as it sounds.
Realtime conversation is the most human way to connect with each other. Now, as AI takes up more of our conversations, it must evolve beyond intelligence to offer the same shared emotionality and context-awareness that make connection meaningful. But until today, voice AI has been tuned for static audiobooks and voiceovers rather than live connection. It has been text-to-speech and speech-to-text in the most literal sense: words converted to audio, and audio converted to words, making AI voice interactions feel more like a misinterpreted text message than a meaningful conversation.
The problem with today’s voice AI
When a frustrated customer calls support, today’s voice agents respond with the same bright, even tone they use for everything, because they have no awareness of how the caller is speaking. Inworld’s Realtime TTS-2 hears the frustration. Its voice softens and its pace slows. It reasons through the weight of the moment before it responds.
Or consider a patient calling to discuss lab results. They start the call measured and calm. Then the agent shares an unexpected finding. The patient’s voice tightens; their questions come faster. Today’s voice agents would barrel ahead at the same pace and pitch, ignoring the gravity of the situation. TTS-2 registers the shift in real time. It slows down. It leaves space. It delivers the next piece of information with steadiness and care, not because someone scripted a “nervous patient” pathway, but because the model heard how the person was speaking and adapted the way a human would.
The reason today’s voice agents sound mechanical in conversation is architectural. Conventional voice models receive a string of text and produce audio. They have no access to how the user sounds, what was said before, or what the moment requires.
A new kind of voice model
Inworld’s previous model, TTS-1.5, already ranks #1 on the Artificial Analysis Speech Arena, above Google and ElevenLabs. With voice quality achieved, Inworld set out to build TTS-2 with a fundamentally different architecture – one that can process conversation the way a human listener would, before a single word is spoken.
Before speech is generated, TTS-2 captures the user’s audio and extracts context, emotion, and tone in real time. It then reasons over the full conversational history: what was said in previous turns, what the most important moments were, what can be inferred from how the user sounds right now. From this, it estimates the user’s emotional state and determines the agent’s appropriate response state: not just what to say, but how to say it. What non-verbal expressions are appropriate. How what the agent says might land given everything that came before.
Inside TTS-2, all of that context converges. The model receives what to say, how to say it (full natural-language voice direction, not preset emotion tags), direct audio from the user (to further condition expression in real time), and the full conversation history. TTS-2 synthesizes all of this into emotionally aware, contextual speech, adjusting tone, pacing, and delivery based on the complete picture of the interaction.
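The flow described above can be sketched conceptually in code. This is a minimal illustration of context-aware delivery planning, not the Inworld API: every name here (`UserSignal`, `plan_delivery`, the field names) is hypothetical.

```python
from dataclasses import dataclass

# Illustrative sketch of the context-aware synthesis flow described above.
# All names are hypothetical; this is not the Inworld API.

@dataclass
class UserSignal:
    text: str      # transcript of the user's turn
    emotion: str   # estimated from audio, e.g. "frustrated", "anxious"
    pace: str      # e.g. "fast", "measured"

def plan_delivery(history: list[UserSignal], latest: UserSignal) -> dict:
    """Choose *how* to speak based on how the user sounds, not just what to say."""
    if latest.emotion == "frustrated":
        style = "soften the tone and slow the pace"
    elif latest.pace == "fast" and latest.emotion == "anxious":
        style = "steady and unhurried, leave space between sentences"
    else:
        style = "neutral, warm"
    # The real model conditions on the full conversation; here we only count turns.
    return {"voice_direction": style, "context_turns": len(history) + 1}

plan = plan_delivery([], UserSignal("My order never arrived!", "frustrated", "fast"))
print(plan["voice_direction"])
```

The point of the sketch is the shape of the decision: delivery style is a function of the audio-derived signals and the conversation history, not of the reply text alone.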
The result is a voice system that sounds like a person in conversation, not a person reading an audiobook. Developers steer the model with natural language the way they prompt an LLM: full descriptions like [act like you just got home from a long day, tired but warm], combined with inline controls for specific moments ([whispering], [sigh], [excited]). The voice is as controllable as it is expressive, across over 100 languages with on-the-fly switching inside a single generation, preserving the speaker’s voice identity across every language.
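A request combining the two steering styles described above might look like the following. The field names and payload shape are assumptions for illustration only, not the documented Inworld API schema; the inline tags and the natural-language direction are the controls named in the text.

```python
import json

# Hypothetical synthesis request showing both steering styles from the text:
# a full natural-language voice direction, plus inline tags for specific moments.
# Field names ("text", "voice_direction", "language") are assumed, not official.

payload = {
    "text": "[whispering] I missed you. [sigh] It's been such a long week.",
    "voice_direction": "act like you just got home from a long day, tired but warm",
    "language": "en",  # per the text, 100+ languages with mid-generation switching
}

body = json.dumps(payload)
print(body)
```

The inline tags shape individual moments while the `voice_direction` string sets the overall performance, mirroring how a director might give an actor both a scene note and line readings.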
“We are obsessed with how voice AI feels, not just how it sounds. Realtime voice is the most natural way for people to communicate with AI, because it is the most natural way people communicate with each other. Voice is how we actually connect. We built TTS-2 to make that connection feel real,” said Kylan Gibbs, CEO and Co-Founder of Inworld AI.
“Most TTS models generate speech in isolation from the conversation around them. TTS-2 is trained to use audio context from the full multi-turn exchange, and to take voice direction, so how the model speaks adjusts to how it was spoken to. Building a system that does this in real time, at production quality, with full controllability, required solving problems that the field had treated as future work for years. It is a different generation of system than a text-to-audio model, and it is what is required for voice AI that behaves naturally inside a realtime pipeline,” said Igor Poletaev, Chief Science Officer at Inworld AI.
Availability
Inworld Realtime TTS-2 is available via the Inworld API, and as part of the Inworld Realtime API for end-to-end speech-to-speech over a single persistent connection. Integration partners include Layercode, LiveKit, NLX, Pipecat, Vapi, and Voximplant. Developers can try the live demo or learn more at inworld.ai/tts. See inworld.ai/pricing for current rates.
About Inworld AI
Inworld is a research lab focused on solving realtime interaction. The company’s Realtime TTS is ranked #1 on the Artificial Analysis Speech Arena, with two of the top five positions. Realtime STT offers speech recognition that includes voice profiling to detect detailed user context. The Realtime Router is a user-aware reasoning layer that selects the optimal model and prompt for every context. And the Realtime API unifies everything into a single persistent connection for full-duplex conversational AI that pairs contextual understanding with natural speech. The founding team comes from DeepMind and Google, and has raised $125M+ from leading investors including Lightspeed Venture Partners, Section 32, Bitkraft, Kleiner Perkins, and Founders Fund.
View source version on businesswire.com: https://www.businesswire.com/news/home/20260505096579/en/