Why We Swapped Deepgram + Google Translate for ElevenLabs Scribe v2 + Gemini 2.5
May 2026
This post is the behind-the-scenes companion to the 103-language launch announcement. If that one was “here’s what changed,” this one is “here’s why we picked these vendors.” The headline result: roughly double the language coverage, lower-latency speech recognition, conversational-context translation, and live AI voice playback in 74 languages.
Why we changed vendors
The old stack was Deepgram for speech recognition and Google Cloud Translation for the translation layer. It was solid at launch. The ceiling was language coverage. Deepgram’s streaming model handled roughly 40–50 languages at production quality, and the list wasn’t growing fast enough to keep up with users asking for Bengali, Tamil, Telugu, Marathi, Cantonese (as a distinct entry from Mandarin), Burmese, Khmer, Welsh, Hebrew, and more.
The second pressure was on the output side. We wanted to ship Audio mode — turn-based translation with the result spoken aloud in the listener’s language. That meant adding a TTS layer the old stack didn’t have. Once you’re bringing in a vendor for one part of the pipeline, it’s worth asking whether to consolidate the rest of it too.
Why Scribe v2 for speech recognition
ElevenLabs released Scribe v2 Realtime in January 2026. The headline claims from ElevenLabs: ~150 ms streaming latency, 5.8% multilingual word error rate on the FLEURS benchmark, and 93.5% accuracy across the 30 benchmark languages they evaluated against industry-standard ASR models. The supported language list is around 100, with a published four-tier accuracy grid spanning Excellent (≤5% WER), High (5–10%), Good (10–15%), and Developing (>15%).
We did our own bake-off against Deepgram on the languages we’d been running. The latency claim held up — transcribed words appear under the speaker’s voice almost beat-by-beat, fast enough that the perceived bottleneck shifts to the translation step. Head-to-head transcription quality was at parity or better on the languages we’d already supported, with the biggest wins on languages that had been weak: Hindi went from “works but rough” to “works cleanly,” Bengali and Tamil went from “not in production” to “in production at High tier.”
The other thing we liked: Scribe ships with native per-segment language identification, which simplified our two-speaker handling considerably and meant we could grow the language list without compounding integration work for each addition.
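To make the two-speaker point concrete, here is a minimal sketch of routing by per-segment language ID. The segment shape (`{"text": ..., "language": ...}`) and function name are our illustration, not ElevenLabs’ actual payload format:

```python
# Route a transcript segment to a translation target based on the
# language the STT layer detected for that segment.
# Assumed (hypothetical) segment shape: {"text": str, "language": str}.

def route_segment(segment: dict, speaker_a_lang: str, speaker_b_lang: str) -> str:
    """Pick the translation target from the per-segment language ID."""
    detected = segment["language"]
    if detected == speaker_a_lang:
        return speaker_b_lang   # speaker A spoke; translate for B
    if detected == speaker_b_lang:
        return speaker_a_lang   # speaker B spoke; translate for A
    # Unrecognized language: default to speaker B's side here;
    # a real app might surface this to the UI instead.
    return speaker_b_lang

target = route_segment({"text": "नमस्ते", "language": "hi"}, "hi", "en")
# target == "en"
```

Because the language ID arrives per segment, neither side has to declare who is speaking; the routing falls out of the transcript itself.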
Why Gemini 2.5 for translation
Stateless per-sentence machine translation has a known set of failure modes. Pronouns get translated without their antecedents, gendered languages drift mid-conversation, formality registers flip, and idioms come out as literal nonsense. These all share a root cause: the translator only sees the current sentence.
Gemini 2.5 carries conversational context across turns. The model sees the recent history of the conversation when it translates the next utterance, which fixes most of those drift problems without us having to bolt anything special on top. In practice the translations feel less like dictionary lookups and more like the work of someone who’s been in the room with you for the whole conversation. The trade-off is slightly higher per-call latency than the old stateless MT — in the low hundreds of milliseconds rather than tens — but end-to-end “speaker stops talking” to “listener sees the translation” is still well under a second on the languages we’ve measured.
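A rough sketch of what “carries conversational context” means mechanically: keep a short window of recent utterances and include it in each translation request. The prompt wording, window size, and history shape below are our assumptions, not Gemini’s required API format:

```python
# Sketch: context-aware translation request assembly.
# HISTORY_WINDOW and the prompt template are illustrative choices.

from collections import deque

HISTORY_WINDOW = 6  # keep the last N utterances as context

def build_translation_prompt(history: deque, utterance: str,
                             src: str, tgt: str) -> str:
    context = "\n".join(f"{who}: {text}" for who, text in history)
    return (
        f"Conversation so far:\n{context}\n\n"
        f"Translate the next {src} utterance into {tgt}, keeping pronouns, "
        f"gender, and formality consistent with the conversation above:\n"
        f"{utterance}"
    )

history = deque(maxlen=HISTORY_WINDOW)
history.append(("A", "Did you see my sister yesterday?"))
prompt = build_translation_prompt(history, "Yes, I saw her.", "English", "Spanish")
```

With the sister mentioned in the window, “her” can resolve correctly in a gendered target language; a stateless per-sentence translator never sees that antecedent.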
The other reason we like Gemini for this: language coverage on the translation side isn’t the constraint anymore. Gemini 2.5 covers every language Scribe recognizes, in any direction, which is what makes the any-to-any 10,506-pair claim true rather than aspirational.
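The pair counts quoted in this post follow directly from the language counts: with n mutually translatable languages, there are n × (n − 1) ordered (source, target) pairs.

```python
# Ordered (source, target) pairs among n mutually translatable languages.

def pair_count(n_languages: int) -> int:
    return n_languages * (n_languages - 1)

print(pair_count(103))  # 10506 — the any-to-any figure quoted above
print(pair_count(47))   # 2162 — the old stack's pair count
```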
Why ElevenLabs v3 for Audio mode TTS
Audio mode introduced a new pipeline stage: convert the translated text back to spoken audio in the listener’s language. We picked ElevenLabs v3 because of language coverage (~74 languages today) and voice quality. The voices sound like people, not like dictation software, and the multilingual support means the same product surface works across the whole top half of our supported language list. For the languages where ElevenLabs Flash v2.5 is available, we prefer it: it’s faster and cheaper, with quality close enough that side-by-side comparison is hard.
The list of languages with live voice playback grows as ElevenLabs ships coverage updates; the app picks up new languages automatically when they become available.
What users notice
- More languages in the picker. 103 entries, roughly double the previous list, covering most of the top-requested additions.
- Translations feel more natural. Pronouns resolve correctly, formality holds across turns, idioms get unpacked sensibly. This is the conversational-context effect.
- Audio mode plays the translation out loud. 74 languages with AI voice today; the rest still work in Audio mode with text-only output.
- Tier dots in the language picker. A small colored dot next to each language signals the expected speech-recognition accuracy — green Excellent, yellow High, orange Good, red Developing — based on ElevenLabs’ published WER benchmarks.
- Two-way conversation still feels two-way. Both sides get translated simultaneously, no turn-taking, no awkward pauses.
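The tier-dot mapping in the list above reduces to a threshold lookup on the published WER bands. The function and color names are ours; the thresholds come from the four-tier grid:

```python
# Map a language's benchmark WER (percent) to its picker dot color,
# following the published four-tier accuracy grid.

def tier_color(wer_percent: float) -> str:
    if wer_percent <= 5:
        return "green"    # Excellent
    if wer_percent <= 10:
        return "yellow"   # High
    if wer_percent <= 15:
        return "orange"   # Good
    return "red"          # Developing

print(tier_color(4.2))   # green
print(tier_color(12.0))  # orange
```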
Numbers
- Languages (STT): 103, up from 47
- Languages (live TTS): 74
- Language pairs (translation): 10,506 (up from 2,162)
- Scribe v2 streaming latency: ~150 ms (ElevenLabs published)
- Multilingual WER on FLEURS: 5.8% (ElevenLabs published)
- Billing: per-character, evenly applied across transcription, translation, and TTS (one credit per character processed); in Audio mode, transcription is free until you tap Translate
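The billing rule in the last bullet can be sketched as follows. The function, parameter names, and the exact stage accounting are our illustration of the rule as stated, not a copy of the production meter:

```python
# Sketch: one credit per character at each stage; in Audio mode,
# transcription is free until the user taps Translate.

def credits_for_turn(transcript: str, translation: str,
                     tts_text: str = "",
                     audio_mode: bool = False,
                     translate_tapped: bool = True) -> int:
    credits = 0
    transcription_free = audio_mode and not translate_tapped
    if not transcription_free:
        credits += len(transcript)                  # transcription stage
    if translate_tapped:
        credits += len(translation) + len(tts_text)  # translation + TTS
    return credits

# Audio mode, before tapping Translate: nothing is charged.
print(credits_for_turn("hola", "hello", audio_mode=True,
                       translate_tapped=False))  # 0
```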
If you want the user-facing version
The launch announcement post covers the same change from the user’s side — what’s new in the language picker, what to expect from each accuracy tier, and how Audio mode feels in practice. The full canonical language reference is at /languages. And if you want to try it, the marquee is here and Audio mode is here.