📃 Paper | 🤗 Model | 📼 Online Demo
- A Speech-to-Text Model Bridging the Gap Between Speech and Text
- Utilizes a Data-Efficient Strategy and a Unique Architecture, Trained on Only 10k Hours of Data
- Exceptional Performance in Speech Translation and AIR-Bench Speech Tasks
- Retains Intelligence During Conversations, Ideal for Interactive Tasks
- [05/03/2025] 🔥 We released our Soundwave weights: 🤗 Model!
- [19/02/2025] Try our model now in the 📼 Online Demo.
- [19/02/2025] The online demo and model weights are coming soon.
- [18/02/2025] Released the model architecture and inference code.
.
├── assets/
│ └── audio/ # Directory for test audio files (e.g., .wav files)
├── README.md
├── run_inference.py # Main inference script
└── Soundwave.py # Model architecture
The Soundwave project uses Python 3.10.11. Set up the environment with:
conda create -n soundwave python=3.10.11
conda activate soundwave
pip install -r requirements.txt
Before starting, make sure you have at least 21 GB of GPU memory available to run model inference.
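If you are unsure whether your card is large enough, here is a minimal sketch using PyTorch's standard device query (it checks GPU 0; adjust the index if you have several GPUs):

import torch

# Report the total memory of GPU 0; Soundwave inference needs roughly 21 GB.
if torch.cuda.is_available():
    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU 0: {torch.cuda.get_device_name(0)}, {total_gib:.1f} GiB total")
else:
    print("No CUDA device detected.")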
To run the inference script and process the audio, use the following command:
python run_inference.py --model_path <model_path>
# model_path: Path to the pre-trained Soundwave model.
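If the weights are not on disk yet, one way to fetch them is with huggingface_hub's snapshot_download (a sketch; the local_dir path is just an example, and as the quick-start snippet below shows, the Hub repo id can also be passed to load_model directly):

from huggingface_hub import snapshot_download

# Download the released checkpoint into a local folder (path is an example).
local_path = snapshot_download("FreedomIntelligence/Soundwave", local_dir="./Soundwave")
print(local_path)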
Below are some quick usage examples you can try:
import torch
import librosa
from run_inference import load_model, gen_model_inputs, CONFIG
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, audio_processor, tokenizer = load_model("FreedomIntelligence/Soundwave", device)
# apply chat template
prompt = "What does the person say?"
model_inputs = gen_model_inputs(tokenizer, prompt, device)
# audio preprocess
audio_path = "assets/audio/example_1.wav"
audio, _ = librosa.load(audio_path, sr=CONFIG.sampling_rate, mono=True)
audio_feat = audio_processor(
    audio, sampling_rate=CONFIG.sampling_rate, return_tensors="pt"
).input_features.to(device, dtype=torch.float16)

# inference
output_ids = model.generate(
    **model_inputs,
    audios=audio_feat,
    max_new_tokens=512,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_p=0.9,
    temperature=0.2,
)
input_token_len = model_inputs["input_ids"].shape[1]
response = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
print(response)
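The same pipeline extends naturally to several clips. Below is a minimal sketch that reuses the objects loaded above (model, tokenizer, audio_processor) to run every .wav file under assets/audio, following the directory layout shown earlier; the prompt is just an example:

from pathlib import Path

for wav_path in sorted(Path("assets/audio").glob("*.wav")):
    # load and featurize each clip at the model's sampling rate
    audio, _ = librosa.load(str(wav_path), sr=CONFIG.sampling_rate, mono=True)
    audio_feat = audio_processor(
        audio, sampling_rate=CONFIG.sampling_rate, return_tensors="pt"
    ).input_features.to(device, dtype=torch.float16)
    model_inputs = gen_model_inputs(tokenizer, "What does the person say?", device)
    output_ids = model.generate(
        **model_inputs,
        audios=audio_feat,
        max_new_tokens=512,
        eos_token_id=tokenizer.eos_token_id,
        do_sample=True,
        top_p=0.9,
        temperature=0.2,
    )
    # decode only the newly generated tokens
    input_token_len = model_inputs["input_ids"].shape[1]
    response = tokenizer.batch_decode(
        output_ids[:, input_token_len:], skip_special_tokens=True
    )[0]
    print(wav_path.name, response)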
If you find this repository useful, please consider citing this work:
@article{zhang2025soundwave,
  title={Soundwave: Less is More for Speech-Text Alignment in LLMs},
  author={Zhang, Yuhao and Liu, Zhiheng and Bu, Fan and Zhang, Ruiyu and Wang, Benyou and Li, Haizhou},
  journal={arXiv preprint arXiv:2502.12900},
  year={2025}
}