Inworld AI Releases TTS-1.5 for Real-Time, Production-Grade Voice Agents

Inworld AI has introduced Inworld TTS-1.5, an upgrade to its TTS-1 family that targets real-time voice agents with strict constraints on latency, quality, and cost. TTS-1.5 is described as the top-ranked text-to-speech system on Artificial Analysis and is designed to be more expressive and more stable than previous generations while remaining suitable for consumer-scale deployments.

Real-time latency for interactive agents

TTS-1.5 focuses on P90 time-to-first-audio latency, a key metric for how responsive an agent feels to the user. For TTS-1.5 Max, P90 time to first audio is under 250 ms. For TTS-1.5 Mini, it is under 130 ms. According to Inworld, these values are about 4 times faster than the previous TTS generation.
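
To make the P90 metric concrete, here is a minimal sketch of how a 90th-percentile latency could be computed from a set of measurements. The sample values are hypothetical and are not Inworld data; the nearest-rank method used here is one common percentile definition among several.

```python
import math

def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical time-to-first-audio measurements in milliseconds (illustrative only).
ttfa_ms = [112, 98, 120, 105, 240, 101, 117, 109, 125, 99]

p90 = percentile_nearest_rank(ttfa_ms, 90)
print(f"P90 time to first audio: {p90} ms")
```

A P90 figure says that 9 out of 10 requests start producing audio at or below that latency, so a single slow outlier (240 ms in the sample above) does not move it the way it would move a maximum or a mean.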

The TTS-1.5 stack supports streaming over WebSocket, so synthesis and playback can start as soon as the first audio chunk is generated. In practice this keeps end-to-end latency in the same range as real-time language model responses when the models run on modern GPUs, which matters when TTS is one stage of a full agent pipeline.
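
The benefit of streaming can be sketched without any real network code. The example below uses a simulated stream, not the Inworld API: `fake_tts_stream`, `play_streaming`, and the chunk sizes and delays are all hypothetical stand-ins. It shows the general pattern of handing each audio chunk to the player on arrival and recording time to first audio separately from total synthesis time.

```python
import asyncio
import time

async def fake_tts_stream(num_chunks=5, chunk_delay_s=0.01):
    """Hypothetical stand-in for a streaming TTS connection.

    Yields placeholder byte chunks with a small delay between them,
    imitating audio that arrives while synthesis is still running.
    """
    for i in range(num_chunks):
        await asyncio.sleep(chunk_delay_s)
        yield bytes([i]) * 320  # placeholder bytes, not real audio

async def play_streaming(stream, play_chunk):
    """Pass each chunk to play_chunk as it arrives and time the first one."""
    start = time.perf_counter()
    first_audio_s = None
    total_bytes = 0
    async for chunk in stream:
        if first_audio_s is None:
            # Playback can begin here, before the rest of the audio exists.
            first_audio_s = time.perf_counter() - start
        play_chunk(chunk)
        total_bytes += len(chunk)
    total_s = time.perf_counter() - start
    return first_audio_s, total_s, total_bytes

played = []  # a real player would write chunks to an audio device instead
first_audio_s, total_s, total_bytes = asyncio.run(
    play_streaming(fake_tts_stream(), played.append)
)
print(f"first audio after {first_audio_s * 1000:.1f} ms, "
      f"all audio after {total_s * 1000:.1f} ms")
```

With a non-streaming interface the listener would wait for the full synthesis time before hearing anything; with streaming, the perceived delay is only the time to the first chunk.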

Inworld recommends TTS-1.5 Max for most applications because it balances latency near 200 ms with higher stability and quality. TTS-1.5 Mini is positioned for latency-sensitive workloads such as real-time playback or highly responsive voice agents where every millisecond matters.

Expression, stability, and benchmarking

TTS-1.5 builds on TTS-1 and delivers about 30 percent more expression and about 40 percent lower word error rate than the previous models.

Here expression refers to features such as prosody, emphasis, and emotional variation. Stability is measured by metrics such as word error rate and the consistency of output across long sequences and varied inputs. A lower word error rate reduces problems such as truncated sentences, unintended word substitutions, or artifacts, which matters when TTS is driven directly by text that a language model generates.
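
For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference text and a transcript, divided by the number of reference words. For TTS, the transcript typically comes from running speech recognition on the synthesized audio. The sketch below is a generic implementation with hypothetical strings; it is not Inworld's evaluation code.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical example: the text sent to TTS vs. a transcript of the audio.
reference = "your order will arrive on tuesday morning"
transcript = "your order will arrive tuesday"

wer = word_error_rate(reference, transcript)
print(f"WER: {wer:.3f}")
```

In this example two of the seven reference words are missing from the transcript, the kind of truncation or dropped-word failure that a lower word error rate is meant to reduce.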

Pricing and cost profile for consumer scale

TTS-1.5 is priced in two main configurations. Inworld TTS-1.5 Mini costs $5 per 1 million characters, which is about $0.005 per minute of speech. Inworld TTS-1.5 Max costs $10 per 1 million characters, which is about $0.01 per minute.
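
The per-minute figures follow from the per-million prices under two assumptions that are labeled in the code: that the billing unit is characters, and that one minute of speech corresponds to roughly 1,000 characters (the ratio implied by the quoted numbers, not a published constant). The sketch below reproduces the arithmetic and extends it to a hypothetical monthly workload.

```python
# Prices quoted in the article, in US dollars per 1 million billed characters.
PRICE_PER_MILLION_CHARS = {"mini": 5.0, "max": 10.0}

# Assumption: about 1,000 characters per minute of speech. This is the ratio
# implied by the quoted per-minute prices, not a figure stated by Inworld.
CHARS_PER_MINUTE = 1000

def cost_per_minute(model):
    """Approximate dollars per minute of synthesized speech."""
    return PRICE_PER_MILLION_CHARS[model] / 1_000_000 * CHARS_PER_MINUTE

def monthly_cost(model, minutes_per_day, days=30):
    """Approximate monthly TTS spend for a steady daily volume of speech."""
    return cost_per_minute(model) * minutes_per_day * days

for model in ("mini", "max"):
    print(f"{model}: ${cost_per_minute(model):.3f} per minute, "
          f"${monthly_cost(model, minutes_per_day=10_000):,.0f} per month "
          f"at 10,000 minutes per day")
```

The daily volume of 10,000 minutes is an illustrative workload, chosen only to show how the per-minute price scales.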

This cost profile makes it feasible to run TTS continuously in high-usage products such as voice companions, education platforms, or customer support lines without TTS becoming the dominant variable cost.

Multilingual support, voice cloning, and deployment options

Inworld TTS-1.5 supports 15 languages. The list includes English, Spanish, French, Korean, Dutch, Chinese, German, Italian, Japanese, Polish, Portuguese, Russian, Hindi, Arabic, and Hebrew. This allows a single TTS pipeline to cover a broad set of markets without separate models for each region.

The platform provides instant voice cloning and professional voice cloning. Instant voice cloning can create a custom voice from about 15 seconds of audio and is exposed directly in the Inworld portal and API. Professional voice cloning uses at least 30 minutes of clean audio, with 20 minutes or more recommended for best results, and targets branded voices and unusual accents.

For deployment, TTS-1.5 is available as a cloud API and also as an on-prem solution, where the full model runs inside the customer's infrastructure for data sovereignty and compliance. The same quality profile is maintained across both deployment modes, and the models also integrate with partner platforms such as LiveKit, Pipecat, and Vapi for complete voice agent stacks.

Key Takeaways

Inworld TTS-1.5 delivers real-time performance, with a P90 time to first audio under 250 ms for the Max model and under 130 ms for the Mini model, about 4 times faster than the previous generation.
The model increases speech expression by about 30 percent and improves stability with a 40 percent lower word error rate.
Pricing is optimized for consumer scale: TTS-1.5 Mini costs about $5 per 1 million characters and TTS-1.5 Max costs about $10 per 1 million characters, which is cheaper per minute than many competing systems.
TTS-1.5 supports 15 languages and offers instant and professional voice cloning, allowing custom voices to be built from short reference audio or longer recorded datasets.
The system is available as a cloud API and as an on-prem deployment, and it integrates with existing voice agent stacks, making it suitable for production real-time agents that require clear guarantees on latency, quality, and data control.
