Audio Mode — Turn-Based Translation with Voice Playback
March 2026
Live Translate Live has two ways to translate a live conversation. The primary mode is the scrolling marquee — continuous speech recognition streaming translated text across a shared screen. The second mode is audio mode: turn-based, push-to-talk, with the translated result spoken out loud by an AI voice. You speak, you review your transcript, you fix anything the recognizer misheard, you tap Translate, and then the other person hears the translation in their language. Then you pass the phone across the counter, across the table, or back to yourself, and it is their turn.
The marquee is built for shared screens. Audio mode is built for the phone in your hand.
Audio mode vs. marquee mode — when to choose which
Both modes ship with every plan and you can switch between them mid-conversation. They solve different problems. The marquee works best when two people can comfortably see one screen at the same time, and neither of them needs to hold the device. Audio mode works best when there is no good surface to rest a phone on, when the environment is too loud to read a scrolling display at a glance, or when one or both speakers cannot read the translation for any reason.
| Situation | Marquee | Audio | Why |
|---|---|---|---|
| Restaurant table with a shared phone flat between you | Yes | Okay | Both speakers can read their side of the screen; continuous flow feels natural over a meal. |
| Busy market stall, no flat surface | No | Yes | You are standing and holding the phone. Audio mode lets you talk, show, and hand over. |
| Live stream or broadcast overlay | Yes | No | OBS captures the scrolling marquee as a window source. Audio mode has no screen for the audience. |
| Hearing-impaired user on one side | Yes | No | Text on screen is the whole point. Spoken playback does not help. |
| Blind or low-vision user on one side | No | Yes | Voice playback removes the requirement to read a scrolling display. |
| Loud construction site or factory floor | Okay | Yes, with earbuds | Audio mode paired with earbuds delivers the translation directly; screen reading is hard with a hard hat on. |
| Quiet conference room or hotel lobby | Yes | Okay | Continuous marquee shines when nobody has to interrupt to hand a phone over. |
| Rideshare or taxi | No | Yes | Driver keeps their eyes on the road. Voice playback over the phone speaker handles it. |
A good rule of thumb: if you would naturally pass the phone back and forth, use audio mode. If you would naturally set the phone down, use the marquee.
The turn-based workflow, step by step
Audio mode is deliberately linear. Each turn is one round trip — you speak, you translate, you hand over. Here is exactly what happens on each turn:
- Hold push-to-talk and speak your sentence. Hold your phone in a comfortable talking position. You can use push-to-talk (hold the button, release when done) or the always-listening toggle. Push-to-talk is better in noisy places because the mic is only active while you are talking.
- Watch the live transcription appear on screen. Your words are transcribed in real time with dynamic font sizing so they fit the display. Transcription is free — no credits are spent at this step, no matter how long you talk or how many times you restart.
- Review and edit the transcript if needed. Speech recognizers make mistakes on proper nouns, numbers, and unusual technical terms. Tap the transcript to fix a word before translating. This is the step the marquee cannot give you — it translates immediately, so a misheard word is already on the other side of the screen. In audio mode the translation is based on exactly what you meant to say.
- Tap Translate. This is the only step that costs credits. You are billed per character of translated text and per character of synthesized speech — nothing for the transcription that came before it.
- Hear the AI voice play back in the target language. The translation is spoken aloud through the phone speaker (or earbuds, if connected). The translated text also appears on screen as a fallback for anyone who prefers to read.
- Hand the device over, or replay. Pass the phone to the other speaker for their turn, or tap replay if they want to hear the translation again. Clear the screen when you are ready to start the next exchange.
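The steps above can be read as a small state machine. The sketch below is illustrative only: the class, method, and field names are hypothetical and are not the app's actual code. It shows the one property that matters for billing, which is that the character counters change only inside `translate`.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable


class Step(Enum):
    IDLE = auto()
    LISTENING = auto()
    REVIEWING = auto()
    PLAYED = auto()


@dataclass
class Turn:
    """One speaker's turn. All names here are hypothetical."""
    step: Step = Step.IDLE
    transcript: str = ""
    text_chars: int = 0    # billable characters of translated text
    speech_chars: int = 0  # billable characters of synthesized speech

    def start_talking(self) -> None:
        self.step = Step.LISTENING

    def hear(self, words: str) -> None:
        # Live transcription is free: no counter is touched here.
        if self.step is not Step.LISTENING:
            raise RuntimeError("mic is closed")
        self.transcript = (self.transcript + " " + words).strip()

    def stop_talking(self) -> None:
        self.step = Step.REVIEWING

    def edit(self, corrected: str) -> None:
        # Fix misheard words before anything is translated.
        if self.step is not Step.REVIEWING:
            raise RuntimeError("nothing to review yet")
        self.transcript = corrected

    def translate(self, translator: Callable[[str], str]) -> str:
        # The only step that spends credits.
        if self.step is not Step.REVIEWING:
            raise RuntimeError("review the transcript first")
        translated = translator(self.transcript)
        self.text_chars += len(translated)
        self.speech_chars += len(translated)
        self.step = Step.PLAYED
        return translated
```

Restarting a turn is just a new `Turn()`, which matches the idea that clearing and re-recording costs nothing.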
Credit efficiency — transcribe freely, translate selectively
This is the part of audio mode that surprises people. The marquee bills continuously, because it is continuously listening and continuously translating — that is what makes it feel live. Audio mode does not. In audio mode:
- Speech recognition runs only while you are actively holding the push-to-talk button (or while the always-listening mic is on).
- Transcription is free. Talk, clear the screen, restart, talk again — no credits move.
- Credits are consumed only when you tap Translate, and only for that specific sentence.
A ten-exchange conversation at a market stall — "how much is this," "do you have it in blue," "I will take two" — is typically under a thousand characters of translated text and a thousand characters of synthesized speech. That is pennies. The same ten exchanges in marquee mode would involve keeping the recognizer running continuously between sentences (including the awkward pauses, the vendor talking to another customer, the ambient noise), which adds up at the time-based rate. Audio mode is dramatically cheaper for short, transactional conversations — the kind of conversations that happen when you are on your feet and handing a phone around.
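To make the arithmetic concrete, here is a rough estimator for per-character billing. The two rates are invented placeholders, not real prices, and the function names are hypothetical; the pricing page has the actual numbers.

```python
from decimal import Decimal

# Placeholder rates for illustration only. These are NOT real prices.
CREDITS_PER_TEXT_CHAR = Decimal("0.001")
CREDITS_PER_SPEECH_CHAR = Decimal("0.002")


def turn_cost(translated_text: str) -> Decimal:
    """Credits for one Translate tap: text characters plus speech characters."""
    n = len(translated_text)
    return n * CREDITS_PER_TEXT_CHAR + n * CREDITS_PER_SPEECH_CHAR


def conversation_cost(translations: list[str]) -> Decimal:
    """Total credits for every translated turn in a conversation."""
    return sum((turn_cost(t) for t in translations), Decimal("0"))
```

Only translated turns enter the sum. Transcription time, pauses, and re-recordings never appear as inputs, which is the structural reason short exchanges stay cheap.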
The trade-off is obvious and worth being honest about: audio mode is not continuous. You are choosing when to translate, and that introduces small pauses between turns. For a sit-down dinner or a meeting where you want the translation to feel uninterrupted, the marquee is the right tool. For everything else — especially the scenarios below — audio mode pays for itself.
Scenarios where audio mode earns its keep
Street-vendor transactions
You are at a night market in Taipei or a souk in Marrakesh. You are holding the phone in one hand and a paper bag in the other. There is no table. The vendor is behind a counter, three feet away, with their own stream of customers. You hold push-to-talk, ask your question, tap Translate, and the vendor hears your question in their language without needing to lean over a screen. If they want to reply, you hand the phone across the counter for their turn. The whole exchange takes maybe fifteen seconds and costs a fraction of a credit.
Noisy markets and tourist areas
Audio playback over earbuds cuts through ambient noise in a way that screen reading does not. If both speakers have earbuds — or if you share a pair — the translation plays directly into the ear, even if the street around you is at 85 dB. Push-to-talk is the right input choice here because it keeps the mic closed when you are not actively speaking, so the recognizer is not trying to transcribe the crowd.
Accessibility for low-vision users
The AI voice playback is not a convenience feature for low-vision users — it is the core feature. You speak, the translation is spoken in the target language, and no one ever has to read a scrolling display. This is one of the clearest wins audio mode has over the marquee, and one reason we keep both modes in the product instead of picking a side.
Rideshare and taxi conversations
The driver is driving. They are not going to look at your screen, and you do not want them to. Audio mode over the phone speaker lets you give directions, ask about the route, or agree on a fare without the driver taking their eyes off the road. For the driver's reply, you can hand the phone to a front-seat passenger, or use the always-listening mode while they speak briefly.
Healthcare intake and clinical questions
A nurse reads a question from a clipboard. You answer in your own language. You tap Translate, and the nurse hears the answer spoken aloud, hands-free, while they write or type into the intake form. Because transcription is free, you can take as long as you need to answer, reword as you go, and only spend credits when the answer is final. For medical proper nouns (drug names, conditions), the review-and-edit step is especially useful.
Hotel front desk and service counters
You hold the phone on your side of the counter, speak, and then slide it across for the clerk to respond. The audio plays loudly enough for both of you, and the transcript on screen works as a backup when the lobby is echoey. For short exchanges (check-in, check-out, "is there a pharmacy nearby"), audio mode costs almost nothing and removes the awkwardness of two people leaning over one phone.
Device placement and volume tips
A few things that make audio mode work better in the real world:
- Hold the phone close enough to catch clean audio, but not so close the mic clips. Six to twelve inches from your mouth is a good range. Phones with multiple mics handle wind and background noise reasonably well, but they cannot rescue audio recorded from across a table.
- Use push-to-talk when ambient noise is loud. Always-listening will try to transcribe whatever it hears, including the person next to you. Push-to-talk closes the mic between turns.
- Turn the media volume up before your first translation. The AI voice playback routes through the phone's media channel, not the ringer. If your media volume is at zero, the first playback will be silent and you will think something broke.
- Earbuds are better than a speaker in crowded places, both for playback clarity and for privacy. If you are sharing a pair, hand the free earbud across with the phone.
- For long exchanges, plug in. Continuous push-to-talk over a long conversation drains the battery noticeably — less than the marquee, but still noticeable.
Honest limits of AI voice playback
The AI voice is good. It is not human. A few things to know:
- Prosody is better in some languages than others. English, Spanish, French, German, Japanese, and Mandarin tend to sound most natural. Smaller-population languages can sound more clipped or robotic, especially on longer sentences.
- Proper nouns are a known weakness. Personal names, street names, brand names, and technical terms sometimes get pronounced as if they were common words in the target language. Reviewing and lightly rewriting the transcript before translation helps — for example, spelling out "Saint-Laurent Boulevard" phonetically.
- Short pauses between sentences, not natural flow. Each translation is generated as a complete utterance. Two back-to-back translations sound like two separate sentences, not like a continuous speaker. This is usually fine in a turn-based conversation and is the correct behavior given that you are tapping Translate between each one.
- 32 languages support voice playback. Languages outside that set still translate correctly in text — they just do not play back audibly. The marquee handles those languages without this constraint.
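The last point amounts to a simple fallback rule: always show the translated text, and speak it only when the target language has a voice. A minimal sketch, assuming a hypothetical `present` function and a stand-in language set (the full list of 32 voice-enabled languages is not reproduced here):

```python
# Stand-in subset for illustration; the real set of 32 voice languages
# is not listed in this article.
VOICE_LANGUAGES = frozenset({"en", "es", "fr", "de", "ja", "zh"})


def present(translated: str, target_lang: str,
            voice_languages: frozenset = VOICE_LANGUAGES) -> dict:
    """Decide how to deliver a translation (hypothetical helper).

    Text is always shown; audio plays only for voice-enabled languages.
    """
    return {
        "text": translated,
        "speak": target_lang in voice_languages,
    }
```

For a language outside the set, the result still carries the translated text, so the on-screen fallback keeps working even when playback is unavailable.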
FAQ
How many credits does a single translation cost in audio mode?
It depends on the length of what you translate, but short conversational sentences (a question, a price, a one-line reply) typically cost a fraction of a credit each — billed per character of translated text and per character of generated speech. A ten-turn market conversation usually comes out to pennies. See the pricing page for exact rates.
Can I use audio mode without an internet connection?
No. Speech recognition, translation, and voice synthesis all run in the cloud. A stable connection matters more than a fast one — audio mode sends short bursts of audio, not a continuous stream, so it works well on shaky cellular data as long as it is connected.
What happens if I misspeak — can I re-record?
Yes, and you should. Transcription is free, so there is no penalty for restarting. Clear the transcript and press push-to-talk again, or just keep speaking — the transcript updates live. You only commit to a translation when you tap Translate, and you can edit the transcript text directly before that point.
Can I switch to marquee mode mid-conversation?
Yes. Mode selection is a toggle, not a session boundary. If the conversation shifts from a standing market exchange to a sit-down coffee, switch to the marquee without losing your language pair or your history. See same-language transcription mode for a third related mode that overlaps with audio mode's free transcription.
Try audio mode
If you have a traveller's conversation in your near future — a market, a taxi, a clinic, a hotel desk — audio mode is the one to try first. Pair it with the general talking-to-someone-who-does-not-share-your-language habits (short sentences, one question at a time, confirm proper nouns) and it will handle the majority of real-world exchanges at a cost you will not notice on your bill.