Generate Subtitle (STT)

Transcribe audio or video into timed subtitles. The tool detects speech and writes SRT, VTT, ASS, or SSA output, with optional translation in the same pass.

What you need

FFmpeg on PATH for audio/video decoding — see FFmpeg setup.
A transcription backend, one of:
- faster-whisper — local, offline, free (default; no setup needed)
- Google Cloud Speech-to-Text — cloud, paid, more accurate on noisy audio. See Google Cloud setup.
- Soniox — cloud, paid, real-time and speaker diarization. See Soniox setup.
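
A quick way to confirm the FFmpeg requirement before queuing files is to look the executable up on PATH. The sketch below is only an illustration, not part of the app: the helper names `find_ffmpeg` and `ffmpeg_version` are hypothetical, and it assumes the executable is named `ffmpeg`.

```python
import shutil
import subprocess


def find_ffmpeg():
    """Return the full path of the ffmpeg executable on PATH, or None."""
    return shutil.which("ffmpeg")


def ffmpeg_version(path):
    """Return the first line of `ffmpeg -version`, or None if the call fails."""
    try:
        result = subprocess.run(
            [path, "-version"],
            capture_output=True,
            text=True,
            check=True,
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        # Missing binary, non-zero exit status, or timeout.
        return None
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("ffmpeg not found on PATH")
    else:
        print(path, "->", ffmpeg_version(path))
```

If the lookup returns `None`, follow the FFmpeg setup page before generating subtitles.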

1. Click Generate Subtitle in the sidebar.
2. Drop one or more audio / video files (.mp3, .wav, .m4a, .flac, .ogg, .aac, .wma, .mp4, .webm, .mkv, .avi, .mov, .wmv).
3. Pick the Source language (the language spoken in the audio), or leave it on Auto-detect to let Whisper identify it.
4. Pick a Target language: No translation for a plain transcript, or any of the 45 supported languages to get the transcript translated in the same pass.
5. Pick the Output format (SRT / VTT / ASS / SSA).
6. Click Generate (or press Ctrl+Enter).
7. Watch the queue, and open the row when it is done.
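
For readers curious what a local transcription pass looks like in code, here is a minimal sketch using the faster-whisper Python library directly. It is not the app's own implementation. `WhisperModel` and its `transcribe` method come from faster-whisper; the helper names (`srt_timestamp`, `to_srt`, `transcribe_to_srt`) and the CPU / `int8` settings are assumptions chosen for illustration. Translation is not covered here.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues):
    """Render (start_seconds, end_seconds, text) tuples as an SRT document."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text.strip()}\n")
    return "\n".join(blocks)


def transcribe_to_srt(media_path, model_size="base", language=None):
    """Transcribe one file with faster-whisper; language=None auto-detects."""
    # Third-party dependency, imported lazily: pip install faster-whisper
    from faster_whisper import WhisperModel

    # Assumed settings: CPU inference with int8 quantization.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(media_path, language=language)
    return to_srt((seg.start, seg.end, seg.text) for seg in segments)
```

Calling `transcribe_to_srt("talk.mp3")` (a hypothetical file name) would return the SRT text as a string, which can then be written to disk.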

| Format | Best for |
| --- | --- |
| SRT | Universal; almost every player supports it |
| VTT | HTML5 `<video>` `<track>` elements |
| ASS / SSA | Karaoke, styled subtitles, fansub workflows |

The four formats round-trip through the same parser, so you can switch output format on a re-translate without losing timing.
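
To illustrate why timing survives a format switch, the sketch below parses SRT into neutral cues and re-emits them as WebVTT. The app's actual parser is not shown in this page, so treat `parse_srt` and `to_vtt` as hypothetical helpers. The only format facts relied on are that SRT separates milliseconds with a comma, and that WebVTT uses a period and starts with a `WEBVTT` header.

```python
import re

# One SRT timing line: "HH:MM:SS,mmm --> HH:MM:SS,mmm"
TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})"
)


def parse_srt(text):
    """Parse SRT text into (start, end, text) cues with '.' millisecond separators."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                start = f"{match.group(1)}.{match.group(2)}"
                end = f"{match.group(3)}.{match.group(4)}"
                # Everything after the timing line is the cue text.
                cues.append((start, end, "\n".join(lines[i + 1:])))
                break
    return cues


def to_vtt(cues):
    """Render cues as WebVTT: header first, then one block per cue."""
    parts = ["WEBVTT"]
    for start, end, text in cues:
        parts.append(f"{start} --> {end}\n{text}")
    return "\n\n".join(parts) + "\n"
```

Because the cue start and end times are carried over unchanged, only the container syntax differs between the two outputs.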

Switch the Whisper model in Settings → Subtitle:

| Model | Size | Speed | Accuracy |
| --- | --- | --- | --- |
| `tiny` | ~75 MB | very fast | low |
| `base` (default) | ~150 MB | fast | decent |
| `small` | ~500 MB | medium | good |
| `medium` | ~1.5 GB | slow | high |
| `large` | ~3 GB | very slow | best |

Models download on first use and cache locally. On a slow connection the first run feels long; subsequent runs are fast.

| Backend | Cost | Online? | Speaker diarization | Languages |
| --- | --- | --- | --- | --- |
| Whisper (local) | Free | No | No | 99 |
| Google Cloud STT | Paid | Yes | Yes (`latest_long` model) | 125+ |
| Soniox | Paid | Yes | Yes (per-token speaker labels) | 60+ |

Switch the backend in Settings → Subtitle → STT method.

- Stop button: interrupts an in-flight batch. Files queued behind the active one stay queued, so you can resume later.
- Re-generate: right-click a Done entry to re-run it with a different format, language, or STT method.
- Long-form audio: Whisper handles hours of audio; budget about 1 minute of processing per minute of audio with the `base` model on a CPU.
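
The long-form budget above is simple arithmetic, sketched here for planning a batch. The function names are hypothetical, the default ratio of 1.0 is the rough CPU `base` figure quoted above, and the sum assumes queued files are processed one after another. Real throughput depends on your hardware and the chosen model.

```python
def estimate_processing_minutes(audio_minutes, ratio=1.0):
    """Estimate processing time; ratio is processing minutes per audio minute."""
    if audio_minutes < 0 or ratio <= 0:
        raise ValueError("audio_minutes must be >= 0 and ratio must be > 0")
    return audio_minutes * ratio


def estimate_batch_minutes(durations_minutes, ratio=1.0):
    """Sum the per-file estimates, assuming files run sequentially."""
    return sum(estimate_processing_minutes(d, ratio) for d in durations_minutes)


if __name__ == "__main__":
    # Three recordings of 90, 45, and 12.5 minutes.
    print(estimate_batch_minutes([90, 45, 12.5]))
```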