How to turn any text into an AI audiobook in 2026

For a long time I didn't believe you could just upload a file and get back an audiobook you'd actually want to finish. Every attempt before about 2024 ended the same way: five minutes in, ears go heavy, brain checks out, robot voice droning through the fog. Then I tried what's available now — and closed the freelance narrator price page that had been sitting in my browser tabs for six weeks.

This piece is about what actually changed, and where the holes still are so you don't fall into them.

What "AI narration" even means now

Short version: it sounds like a person. Longer version: it sounds like a person who, unlike the actual person, isn't tired, didn't trip on a word, didn't go on vacation, and didn't ask for a re-record fee.

The technical reason is straightforward. Old TTS stitched together short prerecorded speech fragments, which is where the call-center voice came from. New models generate the whole audio waveform conditioned on the surrounding sentence, sometimes the whole paragraph. So when a character is angry, the voice doesn't just get louder on cue; it actually sounds tighter, lower, closer to the teeth. Not flawless, but in a blind test against a human narrator I now guess right only about half the time, which is no better than chance. Two years ago I could tell within a second.

What you can throw at it

Almost any text format you've got: epub, fb2, txt, markdown, doc. A fanfic from AO3 or FFN. A lecture transcript. A 4-year-old unpublished novel that's been sitting in your drafts being shy about being read. I've even pushed my own year of journal entries through it and listened in the car like a summary of my own head.

What it dislikes:

PDFs, especially scanned ones — extract the text first or you get garbage.
Books with math and formulas. AI still can't read ∫(x²+1)dx gracefully, and I wouldn't pretend it can.
Footnotes and reference markers. They either get dropped or read in the same flat tone, which destroys the flow in academic writing.
Poetry. It's better than it was, but most models still don't really get rhythm.

Languages: English and Russian are tier one. German, French, and Spanish are fine. Smaller languages vary.

How the pipeline runs

I'll spare you the twelve-step recipe, because once you open the actual UI it's mostly self-evident. Quick version:

You upload the file. The service splits it into chapters and paragraphs, walks the text, and tries to figure out who's in it. For a novel with dialogue, it picks up speakers and assigns each a voice — male, female, age, sometimes character archetype. You can take the suggestions or rework them. Then it renders.
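
To make the splitting step concrete, here is a minimal sketch of how a text could be cut into chapters and paragraphs. It is not how any particular service does it: the heading pattern ("Chapter 1", "Chapter One") and the names split_chapters and split_paragraphs are my own assumptions, and real books need far more forgiving heuristics. Speaker detection is a separate, harder step and is left out.

```python
import re

# Assumed heading pattern: a line such as "Chapter 1" or "Chapter One".
# Real books vary a lot; this pattern is illustrative only.
CHAPTER_HEADING = re.compile(r"^chapter\s+\S+.*$", re.IGNORECASE | re.MULTILINE)


def split_paragraphs(block: str) -> list[str]:
    """Paragraphs are runs of text separated by one or more blank lines."""
    parts = re.split(r"\n\s*\n", block)
    return [p.strip() for p in parts if p.strip()]


def split_chapters(text: str) -> list[dict]:
    """Split raw text into chapters, each with a title and its paragraphs.

    Anything before the first heading (front matter) is dropped in this sketch.
    """
    headings = list(CHAPTER_HEADING.finditer(text))
    if not headings:
        # No recognizable headings: treat the whole text as one chapter.
        return [{"title": "Untitled", "paragraphs": split_paragraphs(text)}]

    chapters = []
    for i, match in enumerate(headings):
        body_start = match.end()
        body_end = headings[i + 1].start() if i + 1 < len(headings) else len(text)
        chapters.append({
            "title": match.group(0).strip(),
            "paragraphs": split_paragraphs(text[body_start:body_end]),
        })
    return chapters


sample = 'Chapter 1\n\nIt was late.\n\n"Who\'s there?" Anna asked.\n\nChapter 2\n\nMorning came.'
for chapter in split_chapters(sample):
    print(chapter["title"], len(chapter["paragraphs"]), "paragraphs")
```

Keeping chapters as separate units is also what lets a service render and deliver them one at a time.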

Render time scales with length. A short story finishes in a few minutes. An essay collection in half an hour. A full novel in a few hours, sometimes overnight if the queue is busy or the provider is using batch APIs. Decent services (we're one) prioritize the first chapters so you can start listening before the rest is done — useful if quality turns out to be off and you don't want to wait through the whole render.

Money

Pricing is almost universally per character, not per audio minute. That's the right call: 1,000 characters comes out to roughly a minute of audio, but the exact ratio depends on language and voice tempo, so the character count is the only number you can know before rendering.

In real numbers: a 50k-character short story comes in cheaper than a movie ticket. A 600k-character novel costs about as much as a decent dinner out. A 1.5M-character tome lands around the price of a mid-range board game. Not "cheap as coffee," but not the kind of money you regret either.
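
If you want to run the numbers for your own file, the arithmetic is simple enough to sketch. The 1,000-characters-per-minute figure is the rough ratio from above, not a guarantee, and the price per 1,000 characters is deliberately left as an argument you fill in from your provider's price page; I'm not assuming any real rate here. The function names are mine.

```python
CHARS_PER_MINUTE = 1000  # rough ratio quoted above; varies by language and voice tempo


def estimate_minutes(char_count: int, chars_per_minute: float = CHARS_PER_MINUTE) -> float:
    """Approximate audio length in minutes for a text of the given size."""
    return char_count / chars_per_minute


def estimate_cost(char_count: int, price_per_1k_chars: float) -> float:
    """Cost under per-character billing. The rate is a placeholder you supply."""
    return char_count / 1000 * price_per_1k_chars


# The three text sizes mentioned above.
for label, chars in [("short story", 50_000), ("novel", 600_000), ("tome", 1_500_000)]:
    hours = estimate_minutes(chars) / 60
    print(f"{label}: {chars:,} chars, about {hours:.1f} h of audio")
```

Under that ratio the short story is a bit under an hour of audio, the novel about ten hours, and the tome about twenty-five.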

Subscriptions are mostly gone. Most services (us included) bill per use: you narrate, you pay, you keep. No silent monthly charges on something you forgot you signed up for.

Per-character voices

The most noticeable improvement in the last couple of years is that one narrator no longer reads everyone. Anna and Maria used to sound the same, and you had to guess from intonation which was the mother and which the daughter. Now Anna gets a low warm voice, Maria gets a clearer mid-range, the narrator stays neutral, and you stop flipping back to figure out who said what.

For thrillers and mysteries, where the plot literally hinges on "who said that," this is a quality-of-life upgrade you don't realize you needed until you have it. For non-fiction, a single voice is fine.

What you typically get to control on top of voice picks:

Style hints — "cold, detached," "warm, conversational."
Age range — teen, adult, older.
Per-character voice override.
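
One way to picture these controls is as a small cast sheet: the service proposes a voice per character, and your overrides are layered on top. The sketch below is purely illustrative. VoiceSpec, resolve_cast, and the voice identifiers ("low-warm", "mid-clear", "neutral") are hypothetical names, not any service's real API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VoiceSpec:
    voice_id: str              # hypothetical identifier, e.g. "low-warm"
    age_range: str = "adult"   # "teen", "adult", or "older"
    style_hint: str = ""       # free-text hint such as "cold, detached"


def resolve_cast(auto_assigned: dict[str, VoiceSpec],
                 overrides: dict[str, dict]) -> dict[str, VoiceSpec]:
    """Start from the auto-assigned cast and apply per-character overrides."""
    cast = dict(auto_assigned)  # copy, so the suggestions stay untouched
    for character, changes in overrides.items():
        base = cast.get(character, VoiceSpec(voice_id="neutral"))
        cast[character] = replace(base, **changes)
    return cast


auto = {
    "Narrator": VoiceSpec("neutral"),
    "Anna": VoiceSpec("mid-clear"),
    "Maria": VoiceSpec("mid-clear"),
}
cast = resolve_cast(auto, {
    "Anna": {"voice_id": "low-warm", "style_hint": "warm, conversational"},
})
print(cast["Anna"])
```

Only the fields you override change; everything else keeps the suggested value, which matches the "take the suggestions or rework them" workflow.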

A piece of advice that saved me hours of re-listening: don't trust the auto-assignment for your main characters. For background players it's fine, you can't really tell. But for the protagonist, listen to three voice options before committing the whole book.

Emotion — without overdoing it

Modern models do emotional coloring, and the temptation is to crank it. Don't. Real audiobook narrators read more flatly than you'd think — most of the work happens in the pauses and the pacing, not in screams. Push the emotion sliders all the way and after ten minutes you need a break.

We keep a pinned internal note on this: light coloring only when the intensity is genuinely high. Sadness, yes. Tension, yes. Hysterics and screaming, almost never. Not because the model can't, but because nobody actually wants to listen to that for fifteen hours.

How long is one book, really

A 300-page novel is 12-15 hours of audio at standard pace. That's a lot. Most people I know listen at 1.25x — clarity holds, and a book finishes in a couple of work weeks of commute time.
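
The "couple of work weeks" figure is easy to check. The sketch below assumes one hour of commuting per workday, which is my assumption rather than a number from anywhere; plug in your own.

```python
import math


def commute_days(audio_hours: float, speed: float, commute_hours_per_day: float) -> int:
    """Whole days of commuting needed to finish a book at a given playback speed."""
    listening_hours = audio_hours / speed
    return math.ceil(listening_hours / commute_hours_per_day)


# 12 to 15 hours of audio at 1.25x, assuming one hour of commuting per workday.
for audio_hours in (12, 15):
    days = commute_days(audio_hours, speed=1.25, commute_hours_per_day=1.0)
    print(f"{audio_hours} h of audio: {audio_hours / 1.25:.1f} h of listening, {days} workdays")
```

At 1.25x, 12 hours of audio becomes 9.6 hours of listening and 15 hours becomes 12, so ten to twelve workdays under that commute assumption.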

Render times in our setup:

Tiny (under 10k chars) — minutes.
Average (under 100k) — about a quarter of an hour, give or take.
Full novel — a few hours.

If the queue is busy, the service will still ship the early chapters first so you can start listening while the tail catches up.

Where AI still breaks

I'm not going to pretend it's perfect. There are real bumps.

Stress in proper nouns, especially non-English ones, lands wrong about half the time. "Bertholt," "Proust," "Margarita Therese" — flip a coin. On your own book this can be patched with stress markers in the source (за+мок instead of замок), but spending an evening hand-marking a list of names is not the high point of the workflow.
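
The hand-marking can at least be done once per name instead of once per occurrence. A minimal sketch, assuming you keep your own dictionary from the plain spelling to the marked one: apply_stress_marks is a made-up helper, and the "+" notation simply follows the example above, since the exact marker syntax depends on the engine you use.

```python
import re


def apply_stress_marks(text: str, marked_forms: dict[str, str]) -> str:
    """Replace whole-word occurrences of each key with its stress-marked form.

    Matching is case-sensitive, so a capitalized form needs its own entry.
    """
    if not marked_forms:
        return text
    # Longest keys first, so a longer name wins over a shorter one it contains.
    keys = sorted(marked_forms, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: marked_forms[m.group(0)], text)


marks = {"замок": "за+мок"}
print(apply_stress_marks("Старый замок стоял на холме.", marks))
```

Run it over the source text before upload and keep the dictionary next to the manuscript, so a re-render doesn't mean another evening of marking.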

Charts, tables, infographics — gone. If a book leans on visual material, audio won't carry it, and no AI will fix that.

With Russian and other Cyrillic-script languages, the models occasionally trip on archaic vocabulary. 19th-century classics are listenable, but every once in a while a word comes out that the model clearly hadn't seen and is guessing on. Annoying if you're a linguist or an editor. Almost invisible if you're not.

And there's the personal-fatigue thing. Some listeners' brains, after about 40 minutes, sense that something's off, even if they can't say what. Most don't. Just a fact, no fixing it.

Your books vs the public catalog

A lot of services (us included) split this in two. Your own book: you upload, you listen, and personal listening generally doesn't run into rights problems. Public catalog: the service licenses popular titles and makes them available to everyone. Commercial distribution of AI-narrated work is a separate topic, and by default it's not allowed almost everywhere; you need explicit licenses and rights-holder consent.

What I'd do the first time

Don't go straight to a thousand pages. Pick a short story or novella — something you'd read in a couple of hours, where you can feel how it actually sits in your ears. Within ten minutes of upload you'll have the first finished chapter. If it works, push the rest. If not, fiddle with voices, tempo, style, try again. Odds are by the third or fourth pass you'll find a combination where the book sounds like yours.

That's how I assembled my first one. It still lives in my commute playlist, and somewhere around hour eight I stop noticing it isn't a person reading.
