You can install the WhisperX package from the Python Package Index (PyPI):
```shell
conda create --name whisperx python=3.10
conda activate whisperx
pip install whisperx
```
Alternatively, you can install the latest version from the [GitHub repository](https://github.com/m-bain/whisperX):
```shell
git clone git@github.com:m-bain/whisperX.git
cd whisperX
pip install -e .
```
To transcribe an audio file from the command line:

```shell
whisperx examples/sample01.wav
```
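The CLI also exposes options mirroring the Python API shown below; the flags in this sketch are assumptions based on common WhisperX usage, so verify them against `whisperx --help` for your installed version:

```shell
# Transcribe with an explicit model, language, and compute type;
# reduce --batch_size if you run out of GPU memory.
whisperx examples/sample01.wav --model large-v2 --language en --compute_type float32 --batch_size 16
```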
```python
import whisperx

device = "cpu"  # or "cuda" if available
audio_file = "/path/to/file.wav"
batch_size = 16  # reduce if low on GPU memory
compute_type = "float32"  # change to "int8" if low on GPU memory (may reduce accuracy)
language = "en"

# 1. Load the model
model = whisperx.load_model("large-v2", device=device, compute_type=compute_type, task="translate", language=language)

# 2. Load the audio
audio = whisperx.load_audio(audio_file)

# 3. Perform translation
result = model.transcribe(audio, batch_size=batch_size)
```
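The returned `result` is a plain dict whose `"segments"` list carries start/end timestamps and text. A minimal sketch for pretty-printing segments; the hand-made `sample` dict below stands in for a real WhisperX result:

```python
def print_segments(result):
    """Print each segment as '[start -> end] text' and return the formatted lines."""
    lines = []
    for seg in result["segments"]:
        line = f"[{seg['start']:7.2f} -> {seg['end']:7.2f}] {seg['text'].strip()}"
        lines.append(line)
        print(line)
    return lines

# Hand-made example mimicking the shape of a WhisperX result
sample = {"segments": [
    {"start": 0.0, "end": 2.5, "text": " Hello world."},
    {"start": 2.5, "end": 5.0, "text": " Second segment."},
]}
lines = print_segments(sample)
```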
The following model sizes are available, in increasing order of accuracy and resource usage:
- tiny
- base
- small
- medium
- large
Note:
- Live translation is fast with the base model on CPU, so WhisperX is well suited to that use case.
- For live transcription, the large model (more accurate) needs a GPU; with the tiny and base models, a CPU is enough and reaches roughly 90% word accuracy. The smaller models miss some tricky words, while the large model detects all words well, but a GPU is recommended for it.
- Speaker detection is compute-heavy; a CPU is not enough, and a GPU machine is needed for the best performance.
- Translation completes in under a second (Mac M1, CPU); speaker detection takes around 30 seconds (Mac M1, CPU).
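To check these timings on your own hardware, a minimal stopwatch helper can wrap any call; the workload below is a stand-in, and in practice you would wrap the transcribe call instead (e.g. `timed(model.transcribe, audio, batch_size=16)`):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn with the given arguments and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; replace with the real transcription call
result, elapsed = timed(sum, range(1_000_000))
print(f"took {elapsed:.3f}s")
```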