- Git: Download from https://git-scm.com/download/win.
- Python 3.10.11 (3.11 also works). Don't use the Windows Store version; if you have that installed, uninstall it and install from python.org. During installation, remember to check the box "Add Python to PATH" when you are at the "Customize Python" screen. (A quick sanity check is sketched after this list.)
- Visual C++ Runtime: Download vc_redist.x64.exe and install it.
- Install HIP SDK 5.7.1 from HERE. Make sure you get the correct version, "Windows 10 & 11 5.7.1 HIP SDK".
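As a quick sanity check that the right Python ended up on PATH, you can run the following in a command prompt (a minimal sketch; the expected output is noted in the comments):

```
:: Should print 3.10.x or 3.11.x, and a python.org install path,
:: NOT one under \WindowsApps\ (that would be the Store version).
python --version
where python
```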
- To see system variables: right-click My Computer > Properties > Advanced System Settings (on the right side of the menu) > Environment Variables.
- Add the system variable HIP_PATH with the value:
C:\Program Files\AMD\ROCm\5.7\
(This is the default folder; if you installed the SDK on another drive, change it accordingly.) Check the variables in the lower pane (System Variables): there should now be a variable called HIP_PATH.
- Also check the variables in the lower pane (System Variables); there should be a variable called "Path". Double-click it, click "New", and add this:
C:\Program Files\AMD\ROCm\5.7\bin
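To confirm the variables are live, open a new command prompt (already-open windows won't see the change) and run something like the following; it assumes the default install path set above:

```
:: HIP_PATH ends with a trailing backslash, so %HIP_PATH%bin resolves
:: to C:\Program Files\AMD\ROCm\5.7\bin on a default install
echo %HIP_PATH%
dir "%HIP_PATH%bin\rocblas"
```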
- If you have an AMD GPU below the 6800 (6700, 6600, etc.), download the recommended library files for your GPU from the Brknsoul Repository.
- Go to the folder "C:\Program Files\AMD\ROCm\5.7\bin\rocblas"; there you will find a "library" folder. Back up the files inside it somewhere else.
- Open your downloaded optimized library archive and extract its contents into the library folder (overwriting if necessary): "C:\Program Files\AMD\ROCm\5.7\bin\rocblas\library"
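The same backup-and-replace can be done from an elevated command prompt. A sketch, where "rocblas-for-your-gpu" is a placeholder for wherever you extracted the downloaded archive:

```
cd /d "C:\Program Files\AMD\ROCm\5.7\bin\rocblas"
:: keep a copy of the stock kernels so you can roll back later
xcopy /E /I library library_backup
:: overwrite with the optimized files for your GPU
xcopy /E /Y "%USERPROFILE%\Downloads\rocblas-for-your-gpu\library" library
```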
- Reboot your system.
Open a cmd prompt. (PowerShell doesn't work; you have to use the command prompt.)
[ You can open a command prompt by typing "cmd" into Start/Run, or, more easily, by navigating in Explorer to the drive or directory you want to install fish-speech-zluda into, clicking the address bar, typing "cmd", and pressing Enter; this opens a command-line window in the directory Explorer is currently showing. ]
git clone https://github.com/patientx/fish-speech-zluda.git
cd fish-speech-zluda
install-amd.bat
To start it for later use, run (or create a shortcut to):
fsz.bat
******** The first time you run the webui and generate, it may look like your computer is doing nothing, and you will see a message saying "Compiling in progress..". ******** That's normal: ZLUDA is creating a database for future use. It only happens once (or at least goes very fast in later sessions).
- We use the "--half" parameter by default; it makes generation almost 3 times faster than normal on my RX 6600. If you have a higher-end GPU that might gain from the standard bf16 instead, feel free to try that.
- "--compile" doesn't work because it requires Triton, which is hard to install correctly on Windows and needs torch 2.5, which brings other problems with it.
- DO NOT use non-English characters in the folder path you put fish-speech-zluda under.
- Wipe your pip cache at "C:\Users\USERNAME\AppData\Local\pip\cache" if needed. You can also do this while the venv is active with: pip cache purge
- Have the latest drivers installed for your AMD GPU. Also, remove any Nvidia drivers you might have left over from previous Nvidia GPUs.
- If you see ZLUDA errors, make sure these three files are inside "fish-speech-zluda\venv\Lib\site-packages\torch\lib": cublas64_11.dll (231 KB), cusparse64_11.dll (199 KB), and nvrtc64_112_0.dll (129 KB). If they are there but much bigger in size, run: patchzluda.bat (a quick size check is sketched below).
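One way to eyeball the file sizes from a command prompt (dir prints sizes in bytes; exact numbers may vary slightly between ZLUDA builds):

```
dir "fish-speech-zluda\venv\Lib\site-packages\torch\lib" | findstr /i "cublas64_11 cusparse64_11 nvrtc64_112_0"
```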
- If for some reason you can't solve things with these steps and want to start from zero, delete the "venv" folder and re-run install-amd.bat
- If you can't git pull to the latest version, run git fetch --all and then git reset --hard origin/master; after that, git pull works again (full sequence below).
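For reference, the full sequence; note that the reset discards any local changes you have made in the repository folder:

```
git fetch --all
git reset --hard origin/master
git pull
```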
- Problems with caffe2_nvrtc.dll: if you are sure you properly installed HIP and can see it on PATH, please DON'T use Python from the Windows Store; use the link provided above or 3.11 from the official website. After uninstalling the Windows Store Python and installing the one from the website, be sure to delete the venv folder and run install-amd.bat once again.
- rocBLAS error: if you have an integrated AMD GPU (e.g. AMD Radeon(TM) Graphics), you need to add HIP_VISIBLE_DEVICES=1 to your environment variables; otherwise it will default to using your iGPU (see the sketch below).
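The variable can also be set from a command prompt with setx. A sketch, assuming the iGPU enumerates as device 0 and the discrete card as device 1 (verify on your own system, and open a new prompt afterwards for the change to take effect):

```
:: expose only HIP device 1 (usually the discrete GPU when an iGPU is present)
setx HIP_VISIBLE_DEVICES 1
```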
This codebase is released under Apache License and all model weights are released under CC-BY-NC-SA-4.0 License. Please refer to LICENSE for more details.
We are very excited to announce that we have made our self-researched agent demo open source. You can now try the agent demo online at demo for instant English chat, or run English and Chinese chat locally by following the docs.
Note that the content is released under a CC BY-NC-SA 4.0 license, and that the demo is an early alpha test version: the inference speed needs to be optimized, and there are a lot of bugs waiting to be fixed. If you've found a bug or want to fix one, we'd be very happy to receive an issue or a pull request.
- Zero-shot & Few-shot TTS: Input a 10 to 30-second vocal sample to generate high-quality TTS output. For detailed guidelines, see Voice Cloning Best Practices.
- Multilingual & Cross-lingual Support: Simply copy and paste multilingual text into the input box; no need to worry about the language. Currently supports English, Japanese, Korean, Chinese, French, German, Arabic, and Spanish.
- No Phoneme Dependency: The model has strong generalization capabilities and does not rely on phonemes for TTS. It can handle text in any language script.
- Highly Accurate: Achieves a low CER (Character Error Rate) and WER (Word Error Rate) of around 2% for 5-minute English texts.
- Fast: With fish-tech acceleration, the real-time factor is approximately 1:5 on an Nvidia RTX 4060 laptop and 1:15 on an Nvidia RTX 4090.
- WebUI Inference: Features an easy-to-use, Gradio-based web UI compatible with Chrome, Firefox, Edge, and other browsers.
- GUI Inference: Offers a PyQt6 graphical interface that works seamlessly with the API server. Supports Linux, Windows, and macOS. See GUI.
- Deploy-Friendly: Easily set up an inference server with native support for Linux, Windows, and macOS, minimizing speed loss.
- Completely End-to-End: Automatically integrates ASR and TTS parts; no need to plug in other models, i.e., true end-to-end, not three-stage (ASR+LLM+TTS).
- Timbre Control: Can use reference audio to control the speech timbre.
- Emotional: The model can generate speech with strong emotion.
We do not hold any responsibility for any illegal usage of the codebase. Please refer to your local laws regarding the DMCA and other related laws.
V1.4 Demo Video: YouTube
@misc{fish-speech-v1.4,
  title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
  author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
  year={2024},
  eprint={2411.01156},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2411.01156},
}