Now Translating in 103 Languages — With AI Voice in 74
May 2026
Live Translate Live just got a major engine refresh. We’ve swapped our speech-recognition stack from Deepgram to ElevenLabs Scribe v2 Realtime, our translation layer from Google Cloud Translation to Google Gemini 2.5, and added live AI voice playback in Audio mode powered by ElevenLabs v3. The headline number: we’re live in 103 languages for real-time speech, with AI voice in 74 of them, and translation in any direction between any two.
If you read the old supported-languages post when we launched at 47, the count has more than doubled. That post is now updated to reflect the new reality; this post is the announcement of what changed and why it matters when you’re picking up the phone for a real conversation.
What you can now do that you couldn’t before
The most visible change for users is in the language picker. Scroll down and you’ll see twice as many entries. The languages that joined are not obscure — they’re languages a lot of you have been asking for: Persian, Bengali, Tamil, Telugu, Marathi, Hindi at higher quality, Cantonese as a distinct entry from Mandarin, Burmese, Khmer, Lao, Mongolian, Hausa, Swahili, Yoruba, Zulu, Welsh, Irish, Hebrew, and many more.
The second change is harder to spot but you’ll feel it in conversation: translation quality is noticeably better, especially on longer or more nuanced turns. Gemini 2.5 carries conversational context across turns instead of translating each sentence in isolation. Pronouns get the right antecedent. Gendered agreement holds across a sequence. Idioms get unpacked into the target language’s nearest equivalent rather than translated word-for-word. The marquee feels more like a translator and less like a dictionary.
The third change is brand new: Audio mode now plays the translation out loud in a natural AI voice. You speak, you tap Translate, your phone speaks the translated sentence in the listener’s language. This is the mode for taxis, market stalls, hospital waiting rooms — places where reading a scrolling display is impractical and you’d naturally hand the phone back and forth.
How accurate is the speech recognition?
ElevenLabs publishes a four-tier accuracy grid for Scribe v2 based on word-error-rate (WER) benchmarks. We surface those tiers as colored dots next to each language in the in-app picker, and we’ve reproduced the grouping here so you can find your language at a glance. A lower WER means more words come through correctly.
| Tier | WER | Languages |
|---|---|---|
| Excellent | ≤ 5% | Belarusian, Bosnian, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Latvian, Macedonian, Malay, Malayalam, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Turkish, Ukrainian, Vietnamese (36) |
| High | 5–10% | Armenian, Azerbaijani, Bengali, Cantonese, Filipino, Georgian, Gujarati, Hindi, Kazakh, Lithuanian, Maltese, Mandarin Chinese, Marathi, Nepali, Odia, Persian, Serbian, Slovenian, Swahili, Tamil, Telugu (21) |
| Good | 10–15% | Afrikaans, Arabic, Assamese, Asturian, Burmese, Hausa, Hebrew, Javanese, Korean, Kyrgyz, Luxembourgish, Māori, Occitan, Punjabi, Tajik, Thai, Uzbek, Welsh (18) |
| Developing | 15%+ | Amharic, Ganda, Igbo, Irish, Khmer, Kurdish, Lao, Mongolian, Northern Sotho, Pashto, Shona, Sindhi, Somali, Urdu, Wolof, Xhosa, Yoruba, Zulu (18) |
WER ranges are from ElevenLabs’ published Scribe v2 benchmarks. In practice, microphone position and ambient noise matter more than the gap between the top two tiers; in a quiet room with a good mic, an “Excellent” and a “High” language are hard to tell apart in conversation.
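If you’re curious what WER actually measures: it’s the word-level edit distance between what you said and what the engine transcribed, divided by the number of words you said. Here’s a minimal sketch in Python — an illustration of the metric, not anyone’s production benchmark code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, but over words, not characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word in ten ≈ 10% WER — the boundary between "High" and "Good".
print(wer("the quick brown fox jumps over the lazy sleeping dog",
          "the quick brown fox jumps over the lazy sleepy dog"))  # 0.1
```

So a 5% WER means roughly one word in twenty comes through wrong, which in conversation usually reads as a typo rather than a misunderstanding.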
A useful way to read the table: if your pair sits in the top two tiers, the recognition layer effectively disappears — the words you say show up on screen as you say them. If one side of the pair is in the Good tier, you’ll see occasional substitutions on uncommon words, especially proper nouns. If a side is in the Developing tier, the language works but transcripts are rougher; in those cases Audio mode can be friendlier because you can review the transcript before tapping Translate.
Live AI voice playback in 74 languages
Audio mode is the bigger product change. The marquee was always the right tool for two people sharing one screen at a table. Audio mode is for the rest of the time — one phone, one hand, two people who need to hear each other rather than read.
When you tap Translate Now, ElevenLabs v3 (or its faster sibling Flash v2.5, depending on the language) generates the spoken translation and the phone plays it back. It sounds like a person, not a robot. And if your listener didn’t catch it, you can tap Replay as many times as you want without spending more credits.
Of the 103 languages we recognize, 74 have AI voice coverage today. Languages outside that set still work in Audio mode — you’ll see the translated text on screen — just without the spoken playback. The voice list grows as ElevenLabs ships coverage updates; the app re-checks on startup and picks up new languages automatically.
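Under the hood, the startup check is just a set comparison between the voice languages the app saw last time and the list the provider reports now. A sketch of the idea — the function and data here are hypothetical, not our actual client code or the ElevenLabs API:

```python
def coverage_updates(cached: set[str], fetched: set[str]) -> tuple[set[str], set[str]]:
    """Diff the voice-language list against the cached copy from last launch.

    Returns (newly_added, removed) language codes.
    """
    return fetched - cached, cached - fetched

# Hypothetical example: Swahili voice coverage shipped since the last launch.
cached = {"es", "fr", "ja"}
fetched = {"es", "fr", "ja", "sw"}
added, removed = coverage_updates(cached, fetched)
print(added)  # {'sw'}
```

The point is that nothing is hardcoded: when a new voice ships upstream, it simply appears in the diff and the picker the next time you open the app.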
If you haven’t tried Audio mode yet, the deep dive lives in our Audio mode post and the canonical reference is at /languages.
Translation covers all 103 in any direction
Gemini 2.5 handles the translation layer, and it’s any-to-any. You can speak Japanese and have it land directly in Portuguese, no English in the middle. Hindi to Arabic. Korean to Swahili. Vietnamese to Polish. With 103 languages, that’s 10,506 directed pairs (every source-and-target combination), all of them supported simultaneously in two-way conversation mode.
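Where does 10,506 come from? Each ordered (source, target) combination counts as its own pair, since translating Japanese → Portuguese and Portuguese → Japanese are different directions:

```python
n = 103                       # recognized languages
directed_pairs = n * (n - 1)  # every ordered (source, target) combination
print(directed_pairs)         # 10506

two_way_conversations = directed_pairs // 2  # unordered pairs: 5253
```

Counting unordered pairs instead, that’s 5,253 distinct two-person conversations, each of which uses both directions at once.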
The thing that’s different about Gemini 2.5 in particular is conversational context. Older translation engines treat each sentence as an independent string. That’s why you sometimes get pronouns translated into the wrong noun, or formality registers that flip mid-conversation, or idioms that come out as literal nonsense. Gemini 2.5 sees the previous few turns and translates the next one with that context in scope. The result feels less like a phrase-by-phrase lookup and more like a translator who’s actually been in the room with you the whole conversation.
What this means for which pairs feel best
All 10,506 pairs work. Some feel more fluent than others. Three factors drive the experience of a given pair in practice:
- Both sides in the top tier. When both languages are Excellent or High, conversation flows. Examples: English ↔ Spanish, English ↔ French, English ↔ Japanese, Spanish ↔ Portuguese, German ↔ Dutch.
- One side in Good or Developing. Still works, but transcripts are rougher on the lower-tier side — expect occasional substitutions on uncommon words and proper nouns. Audio mode can mitigate this because you see the transcript before committing to translate.
- Different scripts. Latin ↔ non-Latin pairs (English ↔ Japanese, Arabic ↔ French, Hindi ↔ Korean) all work; the marquee has to do a small font swap mid-sentence, which is essentially imperceptible in 2026 but is the one place rendering can feel like work.
Why we changed engines
The short version: the language coverage and quality we used to get from Deepgram + Google Cloud Translation was excellent at launch but stopped scaling once we wanted to cover more of the world. Scribe v2 ships with broader streaming coverage at lower latency, Gemini 2.5 carries context across turns, and ElevenLabs v3 unlocked the voice playback we needed for Audio mode. The long version is in a separate post with latencies, benchmarks, and the architectural decisions behind the swap.
Try it
Pick your two languages and start a real-time bilingual conversation. No app to download. Translation credits start at $1 for 15 minutes in the marquee; in Audio mode, transcription is free until you tap Translate.
Start in the marquee · Try Audio mode · Full language reference · View pricing