smatter

Smatter provides real-time, viewer-side translation of livestreams. It is currently in a very basic state and has only been tested on Windows, but it appears usable for YouTube and Twitch streams. It relies primarily on faster-whisper for translation, with some help from silero_vad.
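
As a rough illustration of how those two pieces fit together (this is not Smatter's actual pipeline code), silero_vad can be used to flag speech in a captured audio chunk so that faster-whisper is only invoked when there is something to transcribe. The file name and source language below are placeholders:

```python
# vad_then_whisper.py - rough illustration only, not Smatter's own code.
import torch
from faster_whisper import WhisperModel

# Load the Silero VAD model and its helper functions from torch.hub.
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

# A 16 kHz mono chunk captured from the stream (a file is used here for simplicity).
audio = read_audio("chunk.wav", sampling_rate=16000)

# Only hand the chunk to Whisper when the VAD actually detects speech in it.
if get_speech_timestamps(audio, vad_model, sampling_rate=16000):
    whisper = WhisperModel("base", device="cpu", compute_type="int8")
    segments, _ = whisper.transcribe("chunk.wav", task="translate", language="it")
    for seg in segments:
        print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text}")
```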

Dependencies

You will need binaries for the following applications installed for Smatter to function:

The following are also recommended (a quick check script is sketched after this list):

  • mpv - required to use the 'watch' output.
  • PhantomJS - yt-dlp recommends this to avoid throttling issues.
  • CUDA Toolkit - allows faster-whisper to use an NVIDIA GPU. See CTranslate2 for details.
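
As a quick sanity check before running Smatter, a small script along these lines can confirm that the external binaries are reachable from PATH or from ./libs/bin. It is not part of Smatter itself; yt-dlp and mpv are only examples, so adjust the list to whatever your setup actually needs:

```python
# check_deps.py - illustrative helper, not part of Smatter itself.
import shutil
from pathlib import Path

# Adjust this list to match the binaries your setup actually needs.
BINARIES = ["yt-dlp", "mpv"]

def find_binary(name: str):
    """Look for a binary on PATH, then in the local ./libs/bin directory."""
    return shutil.which(name) or shutil.which(name, path=str(Path("libs/bin")))

for name in BINARIES:
    path = find_binary(name)
    print(f"{name:10s} -> {path if path else 'NOT FOUND'}")
```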

Installation

  1. Check out or download the latest code.
  2. Install the required Python libraries with pip install -r .\requirements.txt
  3. Install the dependencies and add them to your PATH (or to the ./libs/bin directory).
  4. On Windows, add mpv-2.dll to the ./libs/bin folder. See here.
  5. Pre-downloading the faster-whisper model(s) is also recommended; the application does not currently wait for the download to finish and will probably fail without a local model (see the sketch below).
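
For step 5, a one-off script like the following (assuming faster-whisper has already been installed from requirements.txt) will pull the chosen model into the local cache before you start Smatter. Pick the size you intend to pass to --model-size:

```python
# predownload_model.py - one-off helper; run before starting Smatter.
from faster_whisper import WhisperModel

MODEL_SIZE = "base"  # match the value you plan to pass to --model-size

# Constructing the model downloads it into the local cache if it is not
# already present. The device/compute settings here only affect this
# warm-up step, not how Smatter loads the model later.
WhisperModel(MODEL_SIZE, device="cpu", compute_type="int8")
print(f"Model '{MODEL_SIZE}' is now available locally.")
```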

Command Line Arguments

usage: main.py [-h] --source SOURCE [--quality QUALITY] [--start START] --output {srt,vtt,watch} [--output-dir OUTPUT_DIR] [--output-file OUTPUT_FILE]
               [--model-size {tiny,base,small,medium,large,large-v2,tiny.en,base.en,small.en,medium.en}] [--force-gpu] [--source-language SOURCE_LANGUAGE] 
               [--goal {translate,transcribe}] [--log-level {debug,info,warning,error,critical,none}]

options:
  -h, --help            show this help message and exit
  --source SOURCE       URL of stream/video
  --quality QUALITY     Max vertical video size (e.g. 480 for 480p). If not specified, the best possible quality is chosen.
  --start START         Start point of a vod in HH:mm:ss (defaults to 0:00:00)
  --output {srt,vtt,watch}
                        What output format is desired (file or video window)
  --output-dir OUTPUT_DIR
                        Directory to store output files (defaults to ./output)
  --output-file OUTPUT_FILE
                        Filename for any output file (defaults to output.srt)
  --model-size {tiny,base,small,medium,large,large-v2,tiny.en,base.en,small.en,medium.en}
                        Whisper model selection (defaults to base)
  --force-gpu           Force using GPU for translation (requires CUDA
                        libraries). Shouldn't be necessary in most cases.
  --source-language SOURCE_LANGUAGE
                        Source language short code (defaults to en for English)
  --goal {translate,transcribe}
                        Select between translation or transcription (defaults to transcription)
  --log-level {debug,info,warning,error,critical,none}
                        How much log info to show in command window (defaults to warning)

Examples

Minimum configuration

Open a video window with transcribed English speech shown alongside the video stream. (NOTE: the quality default of 'best' currently seems to cause problems for MPV, possibly due to 60 fps streams.)

python main.py --source https://www.youtube.com/watch?v=lKDZ_hmDqMI --output watch --quality 480

Save an SRT file to be used with a separate application later

python main.py --source https://www.youtube.com/watch?v=lKDZ_hmDqMI --output srt

Save a WebVTT file instead

python main.py --source https://www.youtube.com/watch?v=lKDZ_hmDqMI --output vtt

(NOTE: SRT and WebVTT files can be used with browser plugins for VODs.)

Translate to English

Watch the video stream. (NOTE: a larger model seems to do better at translation.)

python main.py --source https://www.youtube.com/watch?v=jjiXgRO8qDw --output watch --quality 480 --goal translate --source-language it --model-size large-v2

Start later into a video stream

Watch the video stream

python main.py --source https://www.youtube.com/watch?v=D_DtKgsr9WQ --output watch --quality 480 --goal translate --source-language ja --model-size large-v2 --start 0:08

Notes on Translation/Transcription

  • Each line is prefixed with three confidence markers (e.g. [---]). In short, [---] is likely to be fairly accurate, but once you see ? or ! the line is more likely to be inaccurate or completely wrong. The markers represent, in order (see the sketch after this list):
    • Probability (from Whisper's log probability): -, ? and ! represent high to low, respectively.
    • No-speech (noise) probability: -, ? and ! represent low to high, respectively.
    • Compression ratio: -, ? and ! represent low to high, respectively.
  • Translation works reasonably well with background music or sound, so long as it isn't too loud compared to the speech.
  • Singing or speaking with unusual speech patterns may produce poor results.
  • Multiple speakers will often be translated well, but the output will not differentiate between them.
  • The Whisper models will sometimes produce repeated false translations. You can filter these with gigo_phrases.txt for now.
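
The three markers correspond to per-segment statistics that faster-whisper exposes (avg_logprob, no_speech_prob and compression_ratio). The sketch below shows how such a prefix, and a gigo_phrases.txt-style filter, could be built; the threshold values are illustrative guesses, not necessarily the ones Smatter uses:

```python
# confidence_prefix.py - illustrative sketch; thresholds are guesses, not Smatter's.
def grade(value: float, good: float, bad: float) -> str:
    """Return '-' (good), '?' (uncertain) or '!' (poor); higher values are worse."""
    if value <= good:
        return "-"
    if value <= bad:
        return "?"
    return "!"

def confidence_prefix(segment) -> str:
    """Build the [---]-style marker from a faster-whisper segment, in README order."""
    return "[{}{}{}]".format(
        grade(-segment.avg_logprob, good=0.3, bad=0.8),       # probability (higher logprob is better)
        grade(segment.no_speech_prob, good=0.2, bad=0.6),     # no-speech / noise probability
        grade(segment.compression_ratio, good=2.0, bad=2.4),  # compression ratio
    )

def is_gigo(text: str, phrases: set) -> bool:
    """Drop lines matching a known junk phrase list, in the spirit of gigo_phrases.txt."""
    return text.strip().lower() in phrases
```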

Future plans

If possible, I would like to add these features in the future:

  • Restreaming, primarily to allow watching with video players other than MPV, for example on a mobile device.
  • Separating out the translation component so it can be run remotely or in a Docker container.
  • Better handling of various scenarios (seeking, disconnection, upcoming streams, etc).
  • A nicer interface, and easier installation (of app and dependencies).
  • Streamer-side translation (OBS plugin?)
  • Other nice things (e.g. more destination languages, other models, easier installation, etc.)