
Soundwave: Less is More for Speech-Text Alignment in LLMs

🤗 Paper | 🤗 Model | 📃 Paper | 📼 Online Demo

✨ Highlights of Our Soundwave Model!

  • A Speech-to-Text Model Bridging the Gap Between Speech and Text
  • Utilizes Data-Efficient Strategy and Unique Architecture, Trained on Only 10k Hours of Data
  • Exceptional Performance in Speech Translation and AIR-Bench Speech Tasks
  • Retains Intelligence During Conversations, Ideal for Interactive Tasks

💌 News

  • [05/03/2025] 🔥 We released our Soundwave weights: 🤗 Model!
  • [19/02/2025] Try our model now in the 📼 Online Demo.
  • [19/02/2025] The online demo and model weights are coming soon.
  • [18/02/2025] Released the model architecture and inference code.

Project Structure

.
├── assets/
│   └── audio/                     # Directory for test audio files (e.g., .wav files)
├── README.md                      
├── run_inference.py               # Main inference script
└── Soundwave.py                   # Model architecture

Getting Started

Installation Requirements

The Soundwave project uses Python 3.10.11. We recommend creating a dedicated conda environment:

conda create -n soundwave python=3.10.11
conda activate soundwave
pip install -r requirements.txt 
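
After installing, an optional sanity check confirms that PyTorch is available and can see your GPU (the inference code below requires torch):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"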

Inference

Before starting, ensure you have at least 21GB of GPU memory to run our model inference.
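
To check whether your GPU meets this requirement, you can query its total memory with the standard torch.cuda API (a minimal sketch; assumes at least one visible CUDA device):

import torch

# Report the total memory of the first visible GPU in GiB;
# inference needs roughly 21 GB free.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB")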

Usage Command

To run the inference script and process the audio, use the following command:

python run_inference.py --model_path <model_path>
# model_path: Path to the pre-trained Soundwave model.
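
For example, assuming the script accepts the released Hugging Face model ID (the same one passed to load_model below) as well as a local path:

python run_inference.py --model_path FreedomIntelligence/Soundwave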

Alternatively, you can call the model directly from Python, as in the following quick example:

import torch
import librosa
from run_inference import load_model, gen_model_inputs, CONFIG

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model, audio_processor, tokenizer = load_model("FreedomIntelligence/Soundwave", device)

# apply chat template
prompt = "What does the person say?"
model_inputs = gen_model_inputs(tokenizer, prompt, device)

# audio preprocessing
audio_path = "assets/audio/example_1.wav"
audio, _ = librosa.load(audio_path, sr=CONFIG.sampling_rate, mono=True)
audio_feat = audio_processor(
    audio, sampling_rate=CONFIG.sampling_rate, return_tensors="pt"
).input_features.to(device, dtype=torch.float16)

# inference
output_ids = model.generate(
    **model_inputs,
    audios=audio_feat,
    max_new_tokens=512,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_p=0.9,
    temperature=0.2
)

input_token_len = model_inputs["input_ids"].shape[1]
response = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]

print(response)
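
Building on the snippet above, you can also transcribe a whole directory of audio files. This is an illustrative sketch, not an official script; it reuses load_model, gen_model_inputs, and CONFIG from run_inference and assumes .wav files under assets/audio/:

import glob

import librosa
import torch
from run_inference import load_model, gen_model_inputs, CONFIG

device = "cuda" if torch.cuda.is_available() else "cpu"
model, audio_processor, tokenizer = load_model("FreedomIntelligence/Soundwave", device)

# The prompt is fixed, so the chat-template inputs can be built once.
prompt = "What does the person say?"
model_inputs = gen_model_inputs(tokenizer, prompt, device)
input_token_len = model_inputs["input_ids"].shape[1]

for audio_path in sorted(glob.glob("assets/audio/*.wav")):
    # Load and resample each file to the model's expected sampling rate.
    audio, _ = librosa.load(audio_path, sr=CONFIG.sampling_rate, mono=True)
    audio_feat = audio_processor(
        audio, sampling_rate=CONFIG.sampling_rate, return_tensors="pt"
    ).input_features.to(device, dtype=torch.float16)

    output_ids = model.generate(
        **model_inputs,
        audios=audio_feat,
        max_new_tokens=512,
        eos_token_id=tokenizer.eos_token_id,
        do_sample=True,
        top_p=0.9,
        temperature=0.2,
    )
    response = tokenizer.batch_decode(
        output_ids[:, input_token_len:], skip_special_tokens=True
    )[0]
    print(f"{audio_path}: {response}")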

Citation

If you find this repository useful, please consider citing our work:

@article{zhang2025soundwave,
  title={Soundwave: Less is More for Speech-Text Alignment in LLMs},
  author={Zhang, Yuhao and Liu, Zhiheng and Bu, Fan and Zhang, Ruiyu and Wang, Benyou and Li, Haizhou},
  journal={arXiv preprint arXiv:2502.12900},
  year={2025}
}