FlashLabs Researchers Release Chroma 1.0: A 4B Real-Time Chat Model With Personalized Voice Cloning

Chroma 1.0 is a real-time speech-to-speech conversation model that takes audio as input and returns audio as output while preserving speaker identity across multi-turn conversations. It is presented as the first open-source end-to-end spoken conversation system that combines high-fidelity voice interaction with voice personalization from just a few seconds of reference audio.
The model operates directly on speech representations rather than on intermediate text. It targets the same use cases as commercial real-time voice agents, but with a combined 4B-parameter design that treats speaker consistency as a primary goal rather than an auxiliary feature. Chroma achieves a reported 10.96% relative improvement in speaker similarity over the human baseline and a Real Time Factor (RTF) of 0.43, so it can produce speech more than 2 times faster than playback.

From the cascaded ASR ➡️ LLM ➡️ TTS pipeline to end-to-end S2S
Most production assistants still use a three-stage pipeline: automatic speech recognition to convert audio to text, a large language model to generate a response, and text-to-speech to synthesize audio. This structure is flexible but introduces delay and loses paralinguistic information such as timbre, emotion, speech rate and prosody when the system converts sound to text. In real-time dialogue this loss of acoustic information directly hurts how natural and personalized the interaction feels.
Chroma follows a newer class of speech-to-speech systems that map between sequences of codec tokens. A speech tokenizer, or neural codec, converts audio into discrete acoustic codes. The language model then reasons and responds over a sequence that interleaves text tokens and audio codes, without an explicit intermediate transcription. This keeps prosody and speaker identity available to the model across the entire processing chain.
Architecture: Reasoner plus speech generation stack
Chroma 1.0 has two major subsystems. Chroma Reasoner handles multimodal understanding and text processing. The speech stack, consisting of Chroma Backbone, Chroma Decoder and Chroma Codec Decoder, transforms that semantic output into audio in the personalized voice.
Chroma Reasoner is built on the Thinker module from the Qwen-Omni series and uses the Qwen2 Audio encoding pipeline. It processes text and audio inputs, fuses them through attention, and aligns them in time using Time-aligned Multimodal Rotary Position Embedding (TM-RoPE). The output is a sequence of hidden states that carry both linguistic content and acoustic cues, for example rhythm and stress.


Chroma Backbone is a 1B-parameter LLaMA-style model based on Llama 3. Following the CSM-1B approach, it is conditioned on the target voice by encoding a short reference audio clip and its transcription into prefix embeddings. During inference, text token embeddings and hidden states from the Reasoner are provided as conditioning context, so the Backbone always sees the semantic context of the conversation while generating acoustic codes.
To support streaming, the system uses a fixed 1-to-2 output schedule: for every text token from the Reasoner, the Backbone generates 2 audio code tokens. This lets the model start emitting speech as soon as text generation begins, without waiting for full sentences. This interleaving is the main mechanism behind the low Time to First Token.
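A minimal sketch of how such a 1-to-2 interleaved schedule can drive streaming generation is shown below. The function and variable names are hypothetical placeholders, not Chroma's API; the caller supplies the Backbone as a callable that predicts the next coarse audio code from the interleaved context.

```python
# Hypothetical sketch of a 1-to-2 interleaved generation schedule:
# for every text token from the Reasoner, the Backbone emits two audio
# code tokens, so audio can start streaming as soon as text appears.

def interleaved_stream(reasoner_tokens, backbone, audio_per_text=2):
    """reasoner_tokens: iterable of text tokens.
    backbone: callable that predicts the next coarse audio code from the context."""
    context = []
    for text_tok in reasoner_tokens:
        context.append(("text", text_tok))
        for _ in range(audio_per_text):
            audio_code = backbone(context)      # coarse RVQ code for the next frame
            context.append(("audio", audio_code))
            yield audio_code                    # handed off to Decoder / Codec Decoder
```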
Chroma Decoder is a smaller LLaMA-style model with 100M parameters. The Backbone only predicts the first Residual Vector Quantization (RVQ) codebook per frame, which is a coarse representation. The Decoder then takes the Backbone's hidden state and the coarse code and autoregressively predicts the remaining RVQ levels within the same frame. This factorization keeps the long-context temporal structure in the Backbone and limits the Decoder to per-frame refinement, which reduces computation and improves fine-grained prosody and pronunciation.
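A sketch of this coarse-plus-fine factorization, assuming 8 RVQ codebooks per frame as described below; the `decoder` callable is a placeholder for the 100M refinement model and is not Chroma's actual interface.

```python
# Hypothetical sketch of the coarse/fine RVQ split: the Backbone predicts
# only codebook 0 for each frame; the small Decoder then autoregressively
# fills in the remaining codebooks within that same frame.

NUM_CODEBOOKS = 8  # total RVQ levels used by the codec

def generate_frame(backbone_hidden, coarse_code, decoder):
    """Return the full stack of RVQ codes for one audio frame."""
    codes = [coarse_code]                       # level 0 comes from the Backbone
    for level in range(1, NUM_CODEBOOKS):
        nxt = decoder(backbone_hidden, codes)   # refine using codes predicted so far
        codes.append(nxt)
    return codes                                # 8 codes describing one frame
```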
Chroma Codec Decoder combines the coarse and fine codes and decodes them into waveform samples. It follows the Mimi vocoder decoder design and uses a causal convolutional neural network, so each output sample depends only on past context, which is required for streaming. The system uses 8 codebooks, which reduces the number of autoregressive steps for the Decoder while preserving enough information for speech synthesis.
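To illustrate the causality property the Codec Decoder relies on, here is a minimal causal 1-D convolution in PyTorch. It is a generic sketch of the technique, not Chroma's or Mimi's actual implementation: padding only on the left guarantees that each output sample never sees future samples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that only looks at past samples (left padding only)."""

    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # pad only on the left
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                                # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))                      # no future samples are visible
        return self.conv(x)

# Each output sample y[t] depends only on x[:t+1], which is what a
# streaming vocoder decoder needs in order to emit audio chunk by chunk.
```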
Training setup and speech-to-speech (S2S) data pipeline
High-quality conversational speech data with rich reasoning content is scarce, so Chroma uses a synthetic speech-to-speech (S2S) pipeline. A reasoning LLM first generates written answers to user questions. A text-to-speech (TTS) system then synthesizes target speech for those responses, conditioned on a reference audio timbre. These synthetic pairs train the Backbone and Decoder for acoustic modeling and voice cloning. The Reasoner remains frozen and serves as a provider of text embeddings and multimodal hidden states.
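A minimal sketch of this data-synthesis step, assuming the LLM and TTS components are supplied by the caller. The function and parameter names are illustrative only and do not correspond to a released API.

```python
# Hypothetical sketch of the synthetic S2S training-data pipeline.
# llm_answer and tts_synthesize are caller-supplied placeholders for whatever
# LLM and TTS systems are actually used; they are not part of Chroma's API.

def build_s2s_pair(user_question_audio, user_question_text,
                   reference_audio, llm_answer, tts_synthesize):
    response_text = llm_answer(user_question_text)      # written answer from an LLM
    response_audio = tts_synthesize(                     # rendered in the target timbre
        text=response_text,
        speaker_reference=reference_audio,
    )
    # (input speech, target speech) pair used to train Backbone + Decoder;
    # the Reasoner stays frozen and only provides embeddings / hidden states.
    return user_question_audio, response_audio
```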
Voice synthesis quality and comparison with existing systems
Objective evaluation uses the SEED-TTS-EVAL protocol on English CommonVoice speakers. Chroma operates at a 24 kHz sampling rate and achieves a speaker similarity score of 0.81. The human baseline is 0.73. CosyVoice-3 reaches 0.72 and many other TTS baselines fall below the human reference. The research team reports this as a 10.96% relative improvement over the human baseline, suggesting that on this metric the model reproduces speaker characteristics more consistently than real human recordings.
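Speaker similarity in protocols of this kind is typically computed as the cosine similarity between speaker embeddings of the reference clip and the generated audio. A minimal sketch is below; `embed` stands in for a speaker-verification encoder and is an assumption, not part of the paper's released tooling.

```python
import numpy as np

def speaker_similarity(ref_embedding: np.ndarray, gen_embedding: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (higher = more similar)."""
    ref = ref_embedding / np.linalg.norm(ref_embedding)
    gen = gen_embedding / np.linalg.norm(gen_embedding)
    return float(np.dot(ref, gen))

# embed() is a placeholder for a speaker-verification model (e.g. an ECAPA- or
# WavLM-style encoder); averaging per-utterance scores gives a benchmark number.
# score = speaker_similarity(embed(reference_wav), embed(generated_wav))
```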


The test below compares Chroma to the ElevenLabs eleven_multilingual_v2 model. In the naturalness CMOS test, listeners preferred ElevenLabs 57.2% of the time versus 24.4% for Chroma, with an 18.3% tie rate. In the speaker-similarity CMOS comparison, the scores are very close, 42.4% for ElevenLabs and 40.6% for Chroma, with a 17.0% tie rate. A follow-up test asking which audio sounded more natural, ElevenLabs output or the actual recording, showed a 92.0% preference for ElevenLabs over 8.0% for the ground truth, indicating that perceived naturalness and speaker fidelity are not the same thing.
Latency and real-time behavior
Latency is measured for a single stream. For a 38.80-second response, the total generation time is 16.58 seconds, giving a Real Time Factor (RTF) of 0.43. The Reasoner contributes 119.12 ms to TTFT, the Backbone 8.48 ms and the Decoder 19.27 ms per frame on average. The Codec Decoder processes groups of 4 frames and is not counted in the TTFT figure. The total Time to First Token is 146.87 ms, well under one second and suitable for conversational use.
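The reported figures are internally consistent, as a quick check of the RTF and TTFT arithmetic shows:

```python
# Quick sanity check of the reported latency figures.

audio_duration_s = 38.80      # length of the generated response
generation_time_s = 16.58     # wall-clock time to produce it

rtf = generation_time_s / audio_duration_s
print(f"RTF = {rtf:.2f}")     # 0.43 -> generation runs ~2.3x faster than playback

ttft_ms = 119.12 + 8.48 + 19.27   # Reasoner + Backbone + Decoder contributions
print(f"TTFT = {ttft_ms:.2f} ms") # 146.87 ms to the first audible token
```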


Spoken dialogue and reasoning benchmarks
Chroma is evaluated on the URO-Bench basic track. With only 4B parameters it reaches an overall score of 57.44%. GLM-4-Voice, a 9B-parameter model, leads with 69.09%. Chroma ranks second overall and surpasses several 7B and 0.5B omni baselines in most dimensions. It reaches 71.14% on Storal, 51.69% on TruthfulQA and 22.74% on GSM8K. In the spoken conversation metrics it scores 60.26% on MLC and 62.07% on CommonVoice.


Importantly, Chroma is the only model in this comparison that supports personalized voice cloning. All other systems focus on speech and reasoning only. This means that Chroma delivers competitive reasoning while also enabling reliable real-time voice personalization.
Key Takeaways
- End-to-end real-time speech to speech: Chroma 1.0 is a 4B-parameter spoken conversation model that maps speech to speech directly using codec tokens, avoiding explicit ASR and TTS stages and preserving prosody and speaker identity throughout the pipeline.
- Reasoner plus speech generation stack: The system combines a Qwen-based Chroma Reasoner with a 1B LLaMA-style Backbone, a 100M Chroma Decoder and a Mimi-based Codec Decoder, using RVQ codebooks and an interleaved 1-to-2 text-to-audio token schedule to support streaming and a low Time to First Token.
- Strong personalized voice cloning: On SEED-TTS-EVAL with CommonVoice speakers, Chroma achieves a speaker similarity score of 0.81 at 24 kHz, reported as a 10.96 percent relative improvement over the human baseline of 0.73, surpassing CosyVoice-3 and other TTS baselines.
- Sub-second latency and faster-than-real-time generation: Single-stream inference on an H200 GPU shows a total Time to First Token of about 147 ms; for a 38.80-second response the model generates audio in 16.58 seconds, a Real Time Factor of 0.43, more than 2 times faster than playback.
- Competitive dialogue and reasoning with voice cloning as a unique feature: On the URO-Bench basic track, Chroma achieves 57.44 percent overall and competitive scores on Storal, TruthfulQA, GSM8K, MLC and CommonVoice.