Gemini 3.1 Flash TTS

Capability demo: 38 clips, about 8 minutes in total, rendered 2026-07-05 straight from the API with no post-processing or editing, and no cherry-picking beyond a single retry policy. Every clip shows the exact text that produced it.

★ Persona-text casting · Prebuilt voices · Emotion · Inline tags · Non-verbal sounds · Pacing & delivery · Accents on demand · Creative & character styles · Multilingual · Multi-speaker dialogue

Model: gemini-3.1-flash-tts-preview (public preview since Apr 2026; siblings: 2.5 Flash / 2.5 Pro TTS)

Blind-test rank: Elo ~1211–1215 on the public TTS arenas; top 3, above ElevenLabs v3 (~1178)

Voices: 30 prebuilt, each with a personality; accents & styles steerable on top of any of them

Languages: ~80, auto-detected from the text

Emotion control: Natural-language director's notes + 200+ inline [bracket] tags, switchable mid-sentence

Non-verbals: [laughs] [giggles] [cough] [sniffs] [sighs] [gasp] [crying] … performed, never read aloud

Multi-speaker: One-pass dialogue, up to 2 speakers per generation, independent voice + style each

Audio out: 24 kHz 16-bit PCM; streaming supported on 3.1

Context: 32k tokens per session (chunk long scripts; quality drifts past a few minutes per generation)

Price: $20 / 1M audio tokens ≈ $0.03 per finished minute ($0.015 batch); free tier in AI Studio

Provenance: Every clip carries Google's SynthID inaudible watermark

Status: Preview, not GA; no availability SLA yet
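
The audio and price figures above imply some arithmetic worth spelling out. The sketch below assumes the raw output is mono (the channel count is not stated here) and that pricing follows a fixed token rate: $0.03 per minute at $20 per 1M tokens works out to 1,500 audio tokens per minute, or 25 per second. The function names are illustrative and not part of any SDK.

```python
import io
import wave

SAMPLE_RATE_HZ = 24_000          # from the spec list: 24 kHz
SAMPLE_WIDTH_BYTES = 2           # 16-bit PCM
CHANNELS = 1                     # assumption: mono output
USD_PER_MILLION_TOKENS = 20.0    # from the price row
TOKENS_PER_SECOND = 25           # derived: $0.03/min at $20/1M = 1500 tokens/min


def pcm_duration_seconds(pcm: bytes) -> float:
    """Length of a raw PCM buffer in seconds."""
    return len(pcm) / (SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap headerless PCM in a WAV container so ordinary players can open it."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(SAMPLE_RATE_HZ)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


def estimate_cost_usd(seconds: float, batch: bool = False) -> float:
    """Approximate list price for `seconds` of finished audio."""
    tokens = seconds * TOKENS_PER_SECOND
    cost = tokens * USD_PER_MILLION_TOKENS / 1_000_000
    # The batch price quoted above is half the standard price.
    return cost / 2 if batch else cost
```

By the same arithmetic, a set of clips totalling about 8 minutes comes to roughly $0.24 at the standard rate.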

★ Persona-text casting (A/B)

A technique demo: each host is defined only by a free-text persona paragraph, which is resolved to a base voice plus an accent and delivery brief; a bickering two-host conversation is then rendered in one generation. Compare it (A/B) against the identical script rendered with the persona-derived briefs and inline tags stripped.

persona text drives the voices Kore + Algenib · 88s

text sent to the model

HANK's persona text: "Raspy older Texan man... dry, deadpan humor, slow unhurried drawl. Mildly cynical but secretly fond of his co-host." → voice Algenib + West-Texas-drawl brief. MORAG's: "Sharp, fast-talking young Glaswegian woman. Warm, quick to laugh, impatient. Teases Hank relentlessly but clearly adores him." → voice Kore + Glaswegian brief. One 2-speaker generation, sparse inline tags at real beats.

same script, briefs stripped (baseline) Kore + Algenib · 81s

text sent to the model

Identical script and voices, but rendered WITHOUT the persona-derived accent/delivery briefs and tags — the control clip.
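
The page does not show how a persona is resolved or how the briefs are attached to the prompt, so the following is only a sketch of one way to build the A/B pair. It assumes the briefs are prefixed as plain sentences and reuses the "TTS the following conversation:" framing seen in the multi-speaker clips. `Casting`, both builder functions, the brief wording, and the two script lines are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

TAG_PATTERN = re.compile(r"\[[^\]]+\]\s*")


@dataclass(frozen=True)
class Casting:
    speaker: str   # label used in the script, e.g. "HANK"
    voice: str     # prebuilt voice name chosen for the persona
    brief: str     # accent + delivery brief derived from the persona text


def directed_prompt(cast: List[Casting], script: List[Tuple[str, str]]) -> str:
    """Variant A: persona-derived briefs first, then the tagged script."""
    briefs = " ".join(f"{c.speaker}: {c.brief}" for c in cast)
    lines = " ".join(f"{speaker}: {line}" for speaker, line in script)
    return f"{briefs} TTS the following conversation: {lines}"


def baseline_prompt(script: List[Tuple[str, str]]) -> str:
    """Variant B (control): same words, no briefs, inline tags removed."""
    lines = " ".join(
        f"{speaker}: {TAG_PATTERN.sub('', line).strip()}" for speaker, line in script
    )
    return f"TTS the following conversation: {lines}"


# Voices follow the clip notes; brief wording and script lines are invented.
CAST = [
    Casting("HANK", "Algenib", "slow, dry West Texas drawl."),
    Casting("MORAG", "Kore", "fast, warm Glaswegian accent."),
]
SCRIPT = [
    ("HANK", "[sighs] We are doing the weather again."),
    ("MORAG", "[laughs] You love the weather, Hank."),
]

variant_a = directed_prompt(CAST, SCRIPT)
variant_b = baseline_prompt(SCRIPT)
```

Keeping the words and the voice assignments identical across both variants means any audible difference can be attributed to the briefs and tags alone.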

Prebuilt voices

30 named voices ship with the model, each with its own character. Below, six of them read the same line with no steering, just the voice.

Zephyr — bright female Zephyr · 6s

text sent to the model

Welcome back to the show. Today we are talking about the strangest discovery of the decade.

Puck — upbeat male Puck · 6s

text sent to the model

Welcome back to the show. Today we are talking about the strangest discovery of the decade.

Kore — firm female Kore · 6s

text sent to the model

Welcome back to the show. Today we are talking about the strangest discovery of the decade.

Charon — informative male Charon · 6s

text sent to the model

Welcome back to the show. Today we are talking about the strangest discovery of the decade.

Enceladus — breathy Enceladus · 6s

text sent to the model

Welcome back to the show. Today we are talking about the strangest discovery of the decade.

Algenib — gravelly male Algenib · 7s

text sent to the model

Welcome back to the show. Today we are talking about the strangest discovery of the decade.
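
A sweep like this amounts to one request per voice with the text held constant. The sketch below only builds the request bodies and makes no network call. The JSON field names follow the REST request shape documented for the 2.5 TTS siblings; that the 3.1 preview accepts the same body is an assumption, and `build_request` is an illustrative helper.

```python
MODEL = "gemini-3.1-flash-tts-preview"
LINE = (
    "Welcome back to the show. Today we are talking about "
    "the strangest discovery of the decade."
)

# Voice names and one-word descriptors as listed in the clips above.
VOICES = {
    "Zephyr": "bright female",
    "Puck": "upbeat male",
    "Kore": "firm female",
    "Charon": "informative male",
    "Enceladus": "breathy",
    "Algenib": "gravelly male",
}


def build_request(text: str, voice: str) -> dict:
    """Body for a single-voice, audio-only generation (assumed request shape)."""
    return {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}}
            },
        },
    }


# Same text for every voice, so the voice is the only variable.
requests_by_voice = {voice: build_request(LINE, voice) for voice in VOICES}
```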

Emotion — director's notes

The same sentence, re-performed from a one-line natural-language direction prefixed to the text. Nothing else changes: same voice (Kore), same words.

no steering (baseline) Kore · 5s

text sent to the model

I just got the results back from the lab. You are not going to believe what they found.

director's note: bursting with excitement Kore · 6s

text sent to the model

Say this bursting with excitement and joy, almost out of breath: I just got the results back from the lab. You are not going to believe what they found.

director's note: devastated, near tears Kore · 9s

text sent to the model

Say this quietly, devastated, on the verge of tears, slowly: I just got the results back from the lab. You are not going to believe what they found.

director's note: barely-controlled anger Kore · 9s

text sent to the model

Say this seething with barely-controlled anger, clipped and cold: I just got the results back from the lab. You are not going to believe what they found.

director's note: panicked, terrified Kore · 6s

text sent to the model

Say this panicked and terrified, voice shaking, out of breath: I just got the results back from the lab. You are not going to believe what they found.

director's note: gentle bedtime story Kore · 8s

text sent to the model

Say this softly and warmly, like a gentle bedtime story: I just got the results back from the lab. You are not going to believe what they found.

director's note: dripping sarcasm Kore · 8s

text sent to the model

Say this dripping with sarcasm, thoroughly unimpressed: I just got the results back from the lab. You are not going to believe what they found.
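
The prompts above all share one pattern: direction, colon, unchanged sentence. A minimal sketch of that pattern follows; `with_direction` and the dictionary keys are illustrative names, while the direction strings are copied from the clips.

```python
from typing import Optional

SENTENCE = (
    "I just got the results back from the lab. "
    "You are not going to believe what they found."
)


def with_direction(direction: Optional[str], text: str) -> str:
    """Prefix a one-line director's note; no note returns the baseline text."""
    if not direction:
        return text
    return f"{direction.rstrip(' :')}: {text}"


DIRECTIONS = {
    "baseline": None,
    "excited": "Say this bursting with excitement and joy, almost out of breath",
    "devastated": "Say this quietly, devastated, on the verge of tears, slowly",
    "angry": "Say this seething with barely-controlled anger, clipped and cold",
    "panicked": "Say this panicked and terrified, voice shaking, out of breath",
    "bedtime": "Say this softly and warmly, like a gentle bedtime story",
    "sarcastic": "Say this dripping with sarcasm, thoroughly unimpressed",
}

# One prompt per direction; the spoken words never change.
prompts = {name: with_direction(note, SENTENCE) for name, note in DIRECTIONS.items()}
```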

Inline tags — emotion switches mid-utterance

Square-bracket tags placed inside the text change the performance mid-sentence. One utterance walks through four registers.

one utterance: [excited] → [whispers] → [sarcastic] → [laughs] Kore · 13s

text sent to the model

[excited] We won the grant! [whispers] Don't tell anyone yet — it's not announced. [sarcastic] I'm sure the committee will be absolutely thrilled about the leak. [laughs]
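
When preparing tagged scripts, it can help to see which words each tag governs. The sketch below splits a script into (tag, words) pairs, treating each tag as applying until the next one. That scoping rule matches how the clip reads but is an assumption about model behavior. `split_segments` is an illustrative helper, and the sample is a shortened variant of the clip text.

```python
import re
from typing import List, Optional, Tuple

TAG_SPLIT = re.compile(r"\[([^\]]+)\]")


def split_segments(text: str) -> List[Tuple[Optional[str], str]]:
    """Return (tag, words) pairs; words before the first tag get tag None."""
    # With one capture group, re.split yields: text, tag, text, tag, ..., text.
    parts = TAG_SPLIT.split(text)
    segments: List[Tuple[Optional[str], str]] = []
    lead = parts[0].strip()
    if lead:
        segments.append((None, lead))
    for tag, words in zip(parts[1::2], parts[2::2]):
        segments.append((tag.strip(), words.strip()))
    return segments


sample = "[excited] We won! [whispers] Keep it quiet. [laughs]"
segments = split_segments(sample)
```

A trailing tag such as `[laughs]` comes back with empty words, which distinguishes a standalone vocal event from a change of register.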

Non-verbal sounds

Tags like [laughs], [cough], [sniffs], [sighs], [gasp], [crying] render as actual vocal events in the speaker's voice — the words around them stay intact, and the tag itself is never read aloud.

[laughs] / [giggles] / [chuckles] Puck · 13s

text sent to the model

[laughs] Okay, okay — [giggles] I'm sorry, I can't read this with a straight face. [chuckles] Give me a second. Okay. I'm good. I'm good.

[sniffs] / [cough] / [sighs] Charon · 13s

text sent to the model

[sniffs] I'm fine, really. [cough] Okay — maybe I'm not fine. [sighs] I should have stayed in bed. [sniffs] Is there any soup left?

[gasp] / [crying] / [sighs] Zephyr · 11s

text sent to the model

[gasp] No. That can't be right. [crying] He was standing right there... [sighs] and then he was gone.
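
Because the tags are performed rather than spoken, the words a listener should hear are simply the script with every bracket tag removed. The sketch below derives that expected transcript and lists the non-verbal events in order. Comparing the transcript against a speech-recognition pass is one possible check, not something this page describes. The helper names are illustrative, and the tag set contains only tags that appear in this section.

```python
import re
from typing import List

# Non-verbal tags used in the clips of this section.
NONVERBAL_TAGS = {
    "laughs", "giggles", "chuckles", "cough",
    "sniffs", "sighs", "gasp", "crying",
}
TAG = re.compile(r"\[([^\]]+)\]")


def expected_transcript(text: str) -> str:
    """Words that should be audible: all bracket tags removed, spacing tidied."""
    return " ".join(TAG.sub(" ", text).split())


def nonverbal_events(text: str) -> List[str]:
    """Non-verbal tags in order of appearance; style tags are ignored."""
    return [t for t in TAG.findall(text) if t.strip().lower() in NONVERBAL_TAGS]


script = (
    "[gasp] No. That can't be right. [crying] He was standing right there... "
    "[sighs] and then he was gone."
)
transcript = expected_transcript(script)
events = nonverbal_events(script)
```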

Pacing & delivery

Whispering, shouting, speed, and dramatic pauses are all tag-driven.

[whispers] Enceladus · 8s

text sent to the model

[whispers] Everyone's asleep. If we're going to get to the kitchen, we move now — and we do not wake the dog.

[shouting] Puck · 7s

text sent to the model

[shouting] GOAL! GOAL! I do not believe what we have just witnessed here in the hundredth minute!

[very fast] — disclaimer read Kore · 7s

text sent to the model

[very fast] Terms and conditions apply, offer not valid in all regions, consult your physician before starting any new exercise program, batteries not included.

[very slow] Algenib · 14s

text sent to the model

[very slow] Some things... cannot be rushed. Good barbecue. Old whiskey. And this sentence.

[long pause] / [short pause] Charon · 8s

text sent to the model

You're asking if I can do it. Let me think. [long pause] Yes. [short pause] The answer is yes.
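
The effect of the speed tags can be estimated from the listed clip lengths. The sketch below counts spoken words (tags excluded) and divides by the duration shown for each clip. Durations are rounded to whole seconds and include pauses, so the rates are rough; `words_per_second` is an illustrative helper.

```python
import re

TAG = re.compile(r"\[[^\]]+\]")
WORD = re.compile(r"[A-Za-z']+")


def words_per_second(text: str, seconds: float) -> float:
    """Spoken words (bracket tags excluded) divided by clip duration."""
    words = WORD.findall(TAG.sub(" ", text))
    return len(words) / seconds


FAST = (
    "[very fast] Terms and conditions apply, offer not valid in all regions, "
    "consult your physician before starting any new exercise program, "
    "batteries not included."
)
SLOW = (
    "[very slow] Some things... cannot be rushed. Good barbecue. "
    "Old whiskey. And this sentence."
)

fast_rate = words_per_second(FAST, 7)    # clip listed at 7s
slow_rate = words_per_second(SLOW, 14)   # clip listed at 14s
```

With 22 words in 7 seconds against 12 words in 14 seconds, the two clips come out at roughly 3.1 and 0.9 words per second.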

Accents on demand

Same voice (Kore), same sentence, six accents — steered entirely by a one-line direction. No separate voice models.

Glasgow, Scotland Kore · 6s

text sent to the model

Speak with a thick Glaswegian Scottish accent: I just got the results back from the lab. You are not going to believe what they found.

Texas drawl Kore · 10s

text sent to the model

Speak with a slow Texas Southern drawl: I just got the results back from the lab. You are not going to believe what they found.

Lagos, Nigeria Kore · 6s

text sent to the model

Speak with a Nigerian English accent, Lagos: I just got the results back from the lab. You are not going to believe what they found.

Mumbai, India Kore · 6s

text sent to the model

Speak with an Indian English accent, Mumbai: I just got the results back from the lab. You are not going to believe what they found.

Australian Kore · 6s

text sent to the model

Speak with a broad Australian accent: I just got the results back from the lab. You are not going to believe what they found.

French-accented English Kore · 6s

text sent to the model

Speak in English with a strong Parisian French accent: I just got the results back from the lab. You are not going to believe what they found.

Creative & character styles

The steering is freeform: character voices, personas, and formats the docs never enumerate.

[like dracula] Algenib · 10s

text sent to the model

[like dracula] Good evening. I have been expecting you. Please — come in. Leave the garlic bread outside.

sports commentator Puck · 6s

text sent to the model

Say like an over-caffeinated sports commentator calling the final seconds of a championship: Three seconds left — she takes the shot from half court — IT'S GOOD! IT'S GOOD! THE CROWD IS ON THEIR FEET!

movie-trailer narrator Algenib · 10s

text sent to the model

Say like a deep, gravelly movie-trailer narrator: In a world where every podcast sounds the same... one model dared to clear its throat.

Multilingual (~80 languages)

The input language is auto-detected — no language parameter at all.

Spanish — auto-detected Kore · 7s

text sent to the model

Bienvenidos de nuevo al programa. Hoy hablamos del descubrimiento más extraño de la década.

French — auto-detected Charon · 7s

text sent to the model

Bienvenue dans l'émission. Aujourd'hui, nous parlons de la découverte la plus étrange de la décennie.

Japanese — auto-detected Zephyr · 7s

text sent to the model

番組へようこそ。今日は、この十年で最も奇妙な発見についてお話しします。

Multi-speaker dialogue — one generation

Two voices rendered in a single pass, so turn-taking prosody emerges: the speakers react to each other. Inline tags work inside the conversation. (API cap: 2 speakers per generation.)

2 voices, one generation — turn-taking prosody + inline tags Kore + Charon · 16s

text sent to the model

Read this as a natural, warm podcast conversation - real banter, the hosts reacting to each other. TTS the following conversation: S1: [excited] Okay I have to tell you about this study before I explode. S2: [laughs] You said that last week about the octopus thing. S1: [mock offended] The octopus thing was incredible! [sighs] Fine. This one's better. S2: [skeptical] Go on then, convince me.

emotion arc across speakers: calm → furious → defeated Zephyr + Algenib · 19s

text sent to the model

Read this as a tense dramatic scene - quiet intensity building to a breaking point. TTS the following conversation: S1: [calm] You knew. This whole time, you knew. S2: [nervous] I was going to tell you. [sighs] I just needed the right moment. S1: [furious] The right moment was three years ago! S2: [quietly, defeated] ...I know. [long pause] I know.
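
A two-speaker request pairs a prompt like the ones above with one voice per speaker label. The sketch below assembles such a request body and enforces the 2-speaker cap before anything is sent. The `multiSpeakerVoiceConfig` field names follow the request shape documented for the 2.5 TTS siblings, and their use with the 3.1 preview is an assumption. The mapping of S1 to Kore and S2 to Charon is also assumed from the order of the clip header. Turns are joined with spaces to match the text as displayed here.

```python
from typing import Dict, List, Tuple

MAX_SPEAKERS = 2  # API cap stated in this section


def build_dialogue_request(
    style: str, turns: List[Tuple[str, str]], voices: Dict[str, str]
) -> dict:
    """Request body for a one-pass dialogue (assumed request shape)."""
    speakers = list(dict.fromkeys(speaker for speaker, _ in turns))  # ordered, unique
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"{len(speakers)} speakers exceeds the cap of {MAX_SPEAKERS}")
    missing = [s for s in speakers if s not in voices]
    if missing:
        raise ValueError(f"no voice assigned for: {missing}")
    script = " ".join(f"{speaker}: {line}" for speaker, line in turns)
    prompt = f"{style} TTS the following conversation: {script}"
    speaker_configs = [
        {
            "speaker": s,
            "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voices[s]}},
        }
        for s in speakers
    ]
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "multiSpeakerVoiceConfig": {"speakerVoiceConfigs": speaker_configs}
            },
        },
    }


request = build_dialogue_request(
    "Read this as a natural, warm podcast conversation - real banter, "
    "the hosts reacting to each other.",
    [
        ("S1", "[excited] Okay I have to tell you about this study before I explode."),
        ("S2", "[laughs] You said that last week about the octopus thing."),
        ("S1", "[mock offended] The octopus thing was incredible! [sighs] Fine. This one's better."),
        ("S2", "[skeptical] Go on then, convince me."),
    ],
    {"S1": "Kore", "S2": "Charon"},  # assumed speaker-to-voice mapping
)
```

Scripts with more than two characters would need to be split into separate generations.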