Soundwave: Less is More for Speech-Text Alignment in LLMs

🤗 Model | 📃 Paper | 📼 Online Demo

✨ Highlights of Our Soundwave Model!

  • A Speech-to-Text Model Bridging the Gap Between Speech and Text
  • Utilizes a Data-Efficient Strategy and a Unique Architecture, Trained on Only 10k Hours of Data
  • Exceptional Performance in Speech Translation and AIR-Bench Speech Tasks
  • Retains Intelligence During Conversations, Ideal for Interactive Tasks

💌 News

  • [05/03/2025] 🔥 We released our Soundwave weights: 🤗 Model!
  • [19/02/2025] Try our model now in the 📼 Online Demo.
  • [19/02/2025] The online demo and model weights are coming soon.
  • [18/02/2025] Released the model architecture and inference code.

Project Structure

.
├── assets/
│   └── audio/                     # Directory for test audio files (e.g., .wav files)
├── README.md                      
├── run_inference.py               # Main inference script
└── Soundwave.py                   # Model architecture

Getting Started

Installation Requirements

The Soundwave project uses Python 3.10.11.

conda create -n soundwave python=3.10.11
conda activate soundwave
pip install -r requirements.txt 
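
After installation, a quick import check can confirm the environment loads correctly. This sketch assumes requirements.txt pulls in torch and librosa, which this README does not list explicitly:

import torch
import librosa

# Sanity check: torch and librosa are assumed to come from
# requirements.txt (the exact dependency list is not shown here).
print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"librosa {librosa.__version__}")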

Inference

Before starting, make sure you have at least 21GB of GPU memory available to run model inference.
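
If you are unsure whether your GPU clears that bar, a minimal PyTorch sketch can report the total device memory (the 21GB figure is the requirement stated above):

import torch

# Report the total memory of GPU 0; 21GB is the inference
# requirement stated in this README.
if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU 0 total memory: {total_gb:.1f} GB")
    if total_gb < 21:
        print("Warning: below the recommended 21GB for inference.")
else:
    print("No CUDA device detected; GPU inference is unavailable.")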

Usage Command

To run the inference script and process the audio, use the following command:

python run_inference.py --model_path <model_path>
# model_path: Path to the pre-trained Soundwave model.

Below are some quick usage examples you can try:

import torch
import librosa
from run_inference import load_model, gen_model_inputs, CONFIG

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model, audio_processor, tokenizer = load_model("FreedomIntelligence/Soundwave", device)

# apply chat template
prompt = "What does the person say?"
model_inputs = gen_model_inputs(tokenizer, prompt, device)

# audio preprocessing
audio_path = "assets/audio/example_1.wav"
audio, _ = librosa.load(audio_path, sr=CONFIG.sampling_rate, mono=True)
audio_feat = audio_processor(
    audio, sampling_rate=CONFIG.sampling_rate, return_tensors="pt"
).input_features.to(device, dtype=torch.float16)

# inference
output_ids = model.generate(
    **model_inputs,
    audios=audio_feat,
    max_new_tokens=512,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_p=0.9,
    temperature=0.2
)

input_token_len = model_inputs["input_ids"].shape[1]
response = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]

print(response)
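
To transcribe several clips in one pass, the same helpers can be reused in a loop. This is a sketch: the glob pattern and the prompt are illustrative choices, not part of the repository:

import glob

# Hypothetical batch loop reusing the objects loaded above; the *.wav
# pattern and the prompt are illustrative, not taken from the repo.
for path in sorted(glob.glob("assets/audio/*.wav")):
    audio, _ = librosa.load(path, sr=CONFIG.sampling_rate, mono=True)
    feat = audio_processor(
        audio, sampling_rate=CONFIG.sampling_rate, return_tensors="pt"
    ).input_features.to(device, dtype=torch.float16)
    inputs = gen_model_inputs(tokenizer, "What does the person say?", device)
    output_ids = model.generate(
        **inputs, audios=feat, max_new_tokens=512,
        eos_token_id=tokenizer.eos_token_id,
        do_sample=True, top_p=0.9, temperature=0.2,
    )
    text = tokenizer.batch_decode(
        output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]
    print(f"{path}: {text}")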

Citation

If you find this repository useful, please consider citing this work:

@article{zhang2025soundwave,
  title={Soundwave: Less is More for Speech-Text Alignment in LLMs},
  author={Zhang, Yuhao and Liu, Zhiheng and Bu, Fan and Zhang, Ruiyu and Wang, Benyou and Li, Haizhou},
  journal={arXiv preprint arXiv:2502.12900},
  year={2025}
}
