Text-to-speech for Russian — what 2026 models can do

In 2020 I worked on a project that needed to synthesize Russian speech for voice assistants. We tried everything on the market, settled on the least bad option, and for the next two years I couldn't listen to any TTS narration without flinching. Reflex. In 2026 I listen to AI-narrated audiobooks and ten minutes in I forget they're synthesized. What happened in those six years, and where the holes still are, is what I'll try to break down here.

What changed under the hood

The big shift is that the model stopped being a splicer. Old pipelines went text → phonetic markup → library of pre-recorded sound bites → stitch into a waveform. That's the call-center sound, because stitching can't reason about context.

Modern models generate the audio waveform directly through a neural network that sees the surrounding sentence, often the whole paragraph. That means the pause after a comma is actually natural, the rise before a question mark is actually a rise, and a dramatic scene gets a softer drop in timbre. Not because the rules say so — because the model has seen millions of examples of how live people do it.

In 2026 the production stack is mostly transformer-based: the same architecture as large language models, adapted to generate audio. Diffusion-based and flow-matching models compete in research and sometimes win on quality, but they're slower. Gemini TTS, which we run, is transformer-based.

What Russian TTS can do in 2026

In normal use, almost everything you'd want.

Read prose with natural intonation — yes. Stress correct 95–97% of the time — yes. Context-aware emotional coloring (sadness, joy, tension) — yes, and not overcooked. Voice differentiation in dialogue — works. Pace adjusts to text type (literary slower, non-fiction faster).

Middle ground: question and exclamation intonation (sometimes slightly oversold), proper nouns (native — fine, foreign — coin flip), pauses in the right places (mostly yes, occasionally breathing in odd spots).

Still hard: poetry with meter preserved — no. Subtle irony or sarcasm without explicit cues — almost none. Texts heavy with footnotes — confused. Specialized terminology (medical, legal) — many errors. Formulas and math — disaster.

Why Russian is harder than English

Briefly: because the language asks the model for more understanding than English does.

Homographs that differ only in stress are an especially Russian pain. "За́мок" (castle) and "замо́к" (lock) are written identically in ordinary text, which doesn't mark stress. Without context the model is just guessing.
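
One cheap defense is to find the risky words before synthesis and resolve them by hand. Here is a minimal Python sketch of that idea, not a real stress dictionary. The names `HOMOGRAPHS`, `find_homographs`, and `mark_stress` are made up for illustration, the word list is a two-entry stub, and it only matches exact dictionary forms, not inflected ones. Whether a given engine honors the combining acute accent (U+0301) is an assumption to check against that engine's documentation.

```python
import re

ACUTE = "\u0301"  # combining acute accent, the mark used in "за́мок"
VOWELS = "аеёиоуыэюяАЕЁИОУЫЭЮЯ"

# Illustrative stub, not a real dictionary: homograph -> possible readings.
HOMOGRAPHS = {
    "замок": ("за́мок (castle)", "замо́к (lock)"),
    "мука": ("му́ка (torment)", "мука́ (flour)"),
}

WORD_RE = re.compile(r"[А-Яа-яЁё]+")


def find_homographs(text):
    """Return (offset, word) for every word a human should disambiguate."""
    return [
        (m.start(), m.group())
        for m in WORD_RE.finditer(text)
        if m.group().lower() in HOMOGRAPHS
    ]


def mark_stress(word, stressed_vowel):
    """Insert a combining acute after the N-th vowel (1-based) of the word."""
    seen = 0
    for i, ch in enumerate(word):
        if ch in VOWELS:
            seen += 1
            if seen == stressed_vowel:
                return word[: i + 1] + ACUTE + word[i + 1 :]
    raise ValueError(f"{word!r} has fewer than {stressed_vowel} vowels")


if __name__ == "__main__":
    sample = "Старый замок стоял на холме, а на двери висел замок."
    for offset, word in find_homographs(sample):
        print(offset, word, HOMOGRAPHS[word.lower()])
```

The point of the split is that the script only flags candidates; picking the first or second vowel for each hit is still a human decision, applied afterwards with `mark_stress`.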

Cases. Russian word endings carry grammatical role — subject, object, instrument, destination. To intonate correctly the model has to figure out the structure of the sentence, not just read the words in order.

Free word order. "Я написал книгу" and "Книгу я написал" mean the same thing but emphasize different things. English word order is rigid, easier to navigate. In Russian the model has to infer what's important from context.

Training data. There's an order of magnitude more English audiobooks with transcripts in datasets than Russian ones. Pure math: the model learns from what's there, and English has more.

Verb aspect. "Делать" and "сделать" are different verbs — process versus result. English doesn't draw that line so cleanly; it works through tense and context.

Gemini handles all of this better than the alternatives — Google has put real effort into Russian localization. But head-to-head, the same model's English remains cleaner than its Russian. That's normal, and the gap will probably narrow over the next couple of years.

Concrete glitches everyone hits

Numerals. "1 500 000" — one model reads "one and a half million," another reads "one five hundred thousand." If the number matters, write it out.
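
Writing numbers out can be automated for the simple cases. The sketch below is one way to do it in Python, under loud assumptions: the function names are mine, it covers 0 to 999,999,999 only, and it always produces the nominative form. Russian numerals decline in running text, so the output still needs a proofread. Dotted dates would be read as three separate numbers by this pass, so they need to be handled before it.

```python
import re

ONES = ["", "один", "два", "три", "четыре", "пять", "шесть", "семь", "восемь", "девять"]
ONES_FEM = ["", "одна", "две"] + ONES[3:]  # thousands are feminine in Russian
TEENS = ["десять", "одиннадцать", "двенадцать", "тринадцать", "четырнадцать",
         "пятнадцать", "шестнадцать", "семнадцать", "восемнадцать", "девятнадцать"]
TENS = ["", "", "двадцать", "тридцать", "сорок", "пятьдесят",
        "шестьдесят", "семьдесят", "восемьдесят", "девяносто"]
HUNDREDS = ["", "сто", "двести", "триста", "четыреста", "пятьсот",
            "шестьсот", "семьсот", "восемьсот", "девятьсот"]
SCALES = [None, ("тысяча", "тысячи", "тысяч"), ("миллион", "миллиона", "миллионов")]

# Digits grouped by spaces or no-break spaces ("1 500 000"), or a plain run of digits.
NUMBER_RE = re.compile(r"\d{1,3}(?:[ \u00a0]\d{3})+(?!\d)|\d+")


def plural(n, forms):
    """Pick the scale word form that agrees with n (1 / 2-4 / everything else)."""
    if 11 <= n % 100 <= 19:
        return forms[2]
    if n % 10 == 1:
        return forms[0]
    if n % 10 in (2, 3, 4):
        return forms[1]
    return forms[2]


def triplet(n, feminine=False):
    """Words for a number from 1 to 999."""
    words = [HUNDREDS[n // 100]]
    rest = n % 100
    if 10 <= rest <= 19:
        words.append(TEENS[rest - 10])
    else:
        words.append(TENS[rest // 10])
        words.append((ONES_FEM if feminine else ONES)[rest % 10])
    return [w for w in words if w]


def number_to_words(n):
    """Nominative Russian cardinal for 0 <= n < 1_000_000_000."""
    if n == 0:
        return "ноль"
    if not 0 < n < 10**9:
        raise ValueError("supported range is 0..999999999")
    groups = []
    while n:
        groups.append(n % 1000)
        n //= 1000
    words = []
    for scale in reversed(range(len(groups))):
        group = groups[scale]
        if group == 0:
            continue
        words += triplet(group, feminine=(scale == 1))
        if scale:
            words.append(plural(group, SCALES[scale]))
    return " ".join(words)


def spell_out_numbers(text):
    """Replace digit runs with words; numbers outside the range stay as digits."""
    def repl(match):
        value = int(re.sub(r"\D", "", match.group()))
        return number_to_words(value) if value < 10**9 else match.group()
    return NUMBER_RE.sub(repl, text)
```

For example, `spell_out_numbers("Тираж 1 500 000 экземпляров")` gives "Тираж один миллион пятьсот тысяч экземпляров", which leaves the model nothing to guess.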

Dates. "12.04.2026" can come out as "twelve dot zero four dot two thousand twenty-six." Brutal. Write "April 12, 2026" and the model handles it.
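
The same rewrite is easy to script. A minimal Python sketch, assuming dates come in the DD.MM.YYYY form; `expand_dates` is a name I made up, and the regex will also catch anything else that happens to look like a valid date (a version number, for instance), so treat the result as a draft.

```python
import re
from datetime import date

# Month names in the genitive case, as they appear in written-out Russian dates.
MONTHS_GENITIVE = [
    "января", "февраля", "марта", "апреля", "мая", "июня",
    "июля", "августа", "сентября", "октября", "ноября", "декабря",
]

DATE_RE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")


def expand_dates(text):
    """Rewrite DD.MM.YYYY as 'D <month> YYYY'; leave impossible dates untouched."""
    def repl(match):
        day, month, year = map(int, match.groups())
        try:
            date(year, month, day)  # validation only, e.g. rejects 31.02.2026
        except ValueError:
            return match.group(0)
        return f"{day} {MONTHS_GENITIVE[month - 1]} {year}"
    return DATE_RE.sub(repl, text)


if __name__ == "__main__":
    print(expand_dates("Встреча назначена на 12.04.2026."))
```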

Abbreviations. "USSR" usually goes letter by letter, which is correct. "NATO," "VAT," shorter agency names — coin flip. Verify.

Foreign technical terms in Russian text. "DevOps engineer" might end up "dev-ops engineer" syllable-by-syllable. Or "devops engineer," which is fine. Depends on the model and luck.
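
For abbreviations and foreign terms, the only reliable fix I know is a small pronunciation lexicon applied before synthesis. A sketch in Python, with the caveat that `LEXICON` and `apply_lexicon` are hypothetical names and the respellings are examples, not tested against any particular engine:

```python
import re

# Hypothetical overrides: written form -> respelling that reads predictably.
LEXICON = {
    "НДС": "эн-дэ-эс",
    "DevOps": "девопс",
}


def apply_lexicon(text, lexicon=LEXICON):
    """Replace whole-word, case-sensitive matches with their respellings."""
    if not lexicon:
        return text
    # Longest keys first, so a longer term wins over a shorter one it contains.
    keys = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: lexicon[m.group(1)], text)


if __name__ == "__main__":
    print(apply_lexicon("Наш DevOps-инженер считает НДС."))
```

Matching is deliberately strict (whole words, exact case), so an entry never rewrites part of a longer word. The cost is that each inflected or differently capitalized form needs its own entry.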

URLs and emails. The model doesn't know what to do with @: it either skips it or reads it as "at," which is awkward. If an email address in the text matters, write it out: "name at domain dot com."
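
Spelling addresses out can be done mechanically. A minimal Python sketch, assuming plain ASCII addresses; `speak_email` and `expand_emails` are illustrative names, and the regex is intentionally loose rather than a full address validator.

```python
import re

# Loose pattern for ASCII addresses; the domain part never ends on a dot,
# so sentence-final punctuation stays outside the match.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def speak_email(address, at=" at ", dot=" dot "):
    """Turn 'name@domain.com' into 'name at domain dot com'."""
    return address.replace("@", at).replace(".", dot)


def expand_emails(text, at=" at ", dot=" dot "):
    """Spell out every email address found in the text."""
    return EMAIL_RE.sub(lambda m: speak_email(m.group(0), at, dot), text)


if __name__ == "__main__":
    print(expand_emails("Write to name@domain.com."))
```

For Russian narration you would pass Russian words instead, for example `at=" собака "` and `dot=" точка "`.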

Where this is heading

I don't love forecasts, but a few things look clear at the 2027–2028 horizon.

Voice cloning becomes routine. Right now it's largely an English-and-Western-services story. Russian clones will get close enough to the original voice to reshape podcasting, first-person audiobooks, and personal archives.

Multilingual models with preserved character. Today, switching from Russian to English mid-book (a quotation, a name, a term) makes the voice "jump": different timbre, different manner. Soon: smooth crossover without losing identity.

Real-time. Right now a full book renders in hours. That's a model limitation, not a service one. By 2028, expect real-time for most jobs — upload, listen.

Explicit emotional control. Style hints work today, but unevenly. The future is clear in-text tags — <gentle>, <grim> — handled cleanly and predictably.

Book-level context. Today the model sees a paragraph at most. Soon — a chapter, eventually a whole book. That gives consistent character intonation from page one to page eight hundred, instead of "cheerful sometimes, sad sometimes, no clear reason."

What I'd pick today

If you have to choose a Russian TTS in 2026, here's how I'd shop.

For literary fiction and audiobooks — Gemini TTS. Currently the top of the market. We run on it; I don't know anything better.

For technical and system content (IVR, navigation, voice assistants) — Yandex SpeechKit. Stable, narrow voice roster, but the voices are quality and built for the job.

For multilingual projects with Russian — Gemini again. If you specifically need a clone of a voice for short pieces — ElevenLabs, with the awareness that Russian is weaker than English.

For open-source and personal pet projects — Silero. Free for personal use, narrow voice roster, quality is fine for home, not for production.

The list will shift in a year or two. The market moves fast, new models drop quarterly, and "top three today" is something you keep updating in your head. Worth checking back on reviews every six months instead of telling yourself your current pick is settled for years.