Smatter provides real-time, local translation of livestreams (on the viewer side). It is currently in a very basic state and has only been tested on Windows, but initially appears usable for YouTube and Twitch streams. It relies primarily on faster-whisper for translation, with some help from silero_vad for voice activity detection.
You will need binaries for these applications installed for Smatter to function:
The following are also recommended:
- mpv - required to use the 'watch' output.
- PhantomJS - yt-dlp recommends this to avoid throttling issues.
- CUDA Toolkit - allows faster-whisper to use an Nvidia GPU. See the CTranslate2 documentation for details.
- Check out or download the latest code
- Add the required libraries using `pip install -r .\requirements.txt`
- Install the dependencies and add them to your PATH (or to the ./libs/bin directory).
- On Windows, add mpv-2.dll to the ./libs/bin folder. See here.
- Pre-downloading the faster-whisper model(s) is also recommended; the application doesn't currently wait for the download to finish, and will probably fail if the model is missing.
```
usage: main.py [-h] --source SOURCE [--quality QUALITY] [--start START] --output {srt,vtt,watch} [--output-dir OUTPUT_DIR] [--output-file OUTPUT_FILE]
               [--model-size {tiny,base,small,medium,large,large-v2,tiny.en,base.en,small.en,medium.en}] [--force-gpu] [--source-language SOURCE_LANGUAGE]
               [--goal {translate,transcribe}] [--log-level {debug,info,warning,error,critical,none}]

options:
  -h, --help            show this help message and exit
  --source SOURCE       URL of stream/video
  --quality QUALITY     Max vertical video size (e.g. 480 for 480p). If not specified, the best possible quality is chosen.
  --start START         Start point of a vod in HH:mm:ss (defaults to 0:00:00)
  --output {srt,vtt,watch}
                        What output format is desired (file or video window)
  --output-dir OUTPUT_DIR
                        Directory to store output files (defaults to ./output)
  --output-file OUTPUT_FILE
                        Filename for any output file (defaults to output.srt)
  --model-size {tiny,base,small,medium,large,large-v2,tiny.en,base.en,small.en,medium.en}
                        Whisper model selection (defaults to base)
  --force-gpu           Force using GPU for translation (requires CUDA libraries). Shouldn't be necessary in most cases.
  --source-language SOURCE_LANGUAGE
                        Source language short code (defaults to en for English)
  --goal {translate,transcribe}
                        Select between translation or transcription (defaults to transcription)
  --log-level {debug,info,warning,error,critical,none}
                        How much log info to show in command window (defaults to warning)
```
Open a video window with transcribed English speech shown alongside the video stream (NOTE: currently, the quality default of 'best' seems to cause problems for MPV, possibly related to 60fps streams):

```
python main.py --source https://www.youtube.com/watch?v=lKDZ_hmDqMI --output watch --quality 480
```

Save an srt file to be used with a separate application later:

```
python main.py --source https://www.youtube.com/watch?v=lKDZ_hmDqMI --output srt
```

Save a webvtt file instead:

```
python main.py --source https://www.youtube.com/watch?v=lKDZ_hmDqMI --output vtt
```

(NOTE: srt and webvtt files can be used with browser plugins for vods.)

Watch a translated video stream (NOTE: a larger model seems to do better at translation):

```
python main.py --source https://www.youtube.com/watch?v=jjiXgRO8qDw --output watch --quality 480 --goal translate --source-language it --model-size large-v2
```

Watch another translated video stream:

```
python main.py --source https://www.youtube.com/watch?v=D_DtKgsr9WQ --output watch --quality 480 --goal translate --source-language ja --model-size large-v2 --start 0:08
```
- Three confidence markers are prefixed to each line (e.g. `[---]`). In summary, `[---]` is likely to be fairly accurate, but once you see `?` or `!` the line is more likely to be inaccurate, or completely wrong. The markers represent, in order:
  - Probability (from Whisper's log probability): `-`, `?` and `!` represent high to low respectively.
  - No speech (noise) probability: `-`, `?` and `!` represent low to high respectively.
  - Compression: `-`, `?` and `!` represent low to high respectively.
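The marker scheme can be sketched in Python. The threshold values below are illustrative assumptions for how such a mapping could work, not Smatter's actual cutoffs:

```python
# Illustrative mapping from faster-whisper segment metrics to the three
# confidence markers. All thresholds here are assumptions for the sketch.

def prob_marker(avg_logprob: float) -> str:
    """Whisper log probability: higher is better."""
    if avg_logprob > -0.5:
        return "-"
    if avg_logprob > -1.0:
        return "?"
    return "!"

def noise_marker(no_speech_prob: float) -> str:
    """No-speech (noise) probability: lower is better."""
    if no_speech_prob < 0.3:
        return "-"
    if no_speech_prob < 0.6:
        return "?"
    return "!"

def compression_marker(ratio: float) -> str:
    """Compression ratio: lower is better; high values suggest repetition."""
    if ratio < 2.0:
        return "-"
    if ratio < 2.4:
        return "?"
    return "!"

def confidence_prefix(avg_logprob: float, no_speech_prob: float, ratio: float) -> str:
    """Build the three-marker prefix for one subtitle line."""
    return f"[{prob_marker(avg_logprob)}{noise_marker(no_speech_prob)}{compression_marker(ratio)}]"

print(confidence_prefix(-0.2, 0.1, 1.5))  # prints "[---]"
```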
- Translation works reasonably well with background music or sound, so long as it isn't too loud compared to the speech.
- Singing or speaking with unusual speech patterns may produce poor results.
- Multiple speakers will often be translated well, but the output will not differentiate between them.
- The Whisper models will sometimes produce repeated false translations. For now, you can filter these with `gigo_phrases.txt`.
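A phrase filter of this kind might look as follows. The helper names and the example phrase are hypothetical, not Smatter's actual implementation; the real filtering logic lives in the application itself:

```python
# Sketch of phrase-based filtering: drop any output line that exactly
# matches (case-insensitively) a known junk phrase from gigo_phrases.txt.

def load_gigo_phrases(path: str) -> set[str]:
    """Read one junk phrase per line; blank lines are ignored."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().casefold() for line in f if line.strip()}

def filter_gigo(lines: list[str], phrases: set[str]) -> list[str]:
    """Keep only lines that are not known junk phrases."""
    return [line for line in lines if line.strip().casefold() not in phrases]

# Illustrative phrase only; a common Whisper hallucination on silence.
phrases = {"thank you for watching"}
print(filter_gigo(["Hello everyone", "Thank you for watching"], phrases))
# prints "['Hello everyone']"
```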
If possible, I would like to add these features in the future:
- Restreaming, primarily to allow the use of video players other than MPV (e.g. on a mobile device).
- Separating the translation component, so it can run remotely or in a Docker container.
- Better handling of various scenarios (seeking, disconnection, upcoming streams, etc).
- A nicer interface, and easier installation (of app and dependencies).
- Streamer-side translation (OBS plugin?)
- Other nice things (e.g. more destination languages, other models)