[Model][VLM] Add Qwen2.5-Omni model support (end-to-end full support) #16347

Draft
wants to merge 7 commits into base: main

fyabc
Contributor

@fyabc fyabc commented Apr 9, 2025

This draft PR adds support for the Qwen2.5-Omni model (end-to-end full support).

This PR is a later version of #15130. It adds support for the talker and code2wav components, plus an OmniLLMEngine class that manages the end-to-end audio generation process.
See #15130 for more details about the Qwen2.5-Omni model architecture.
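
As a rough illustration of how the stages named above could chain together, here is a minimal sketch. It assumes the order thinker, then talker, then code2wav; `ThinkerOutput`, `run_pipeline`, and the toy stage functions are hypothetical names invented for this example and do not reflect the actual OmniLLMEngine interface in this PR.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ThinkerOutput:
    """Hypothetical container: generated text plus placeholder hidden states."""
    text: str
    hidden_states: List[float]


def run_pipeline(
    prompt: str,
    thinker: Callable[[str], ThinkerOutput],
    talker: Callable[[ThinkerOutput], List[int]],
    code2wav: Callable[[List[int]], List[float]],
) -> Tuple[str, List[float]]:
    """Chain the three stages: text generation, audio codes, waveform."""
    thinker_out = thinker(prompt)   # stage 1: text (and hidden states)
    codes = talker(thinker_out)     # stage 2: audio codec tokens
    waveform = code2wav(codes)      # stage 3: waveform samples
    return thinker_out.text, waveform


# Toy stand-ins so the sketch runs without any model weights (all hypothetical).
def toy_thinker(prompt: str) -> ThinkerOutput:
    return ThinkerOutput(text=prompt.upper(), hidden_states=[float(len(prompt))])


def toy_talker(out: ThinkerOutput) -> List[int]:
    return [int(h) for h in out.hidden_states]


def toy_code2wav(codes: List[int]) -> List[float]:
    return [c / 10.0 for c in codes]


text, waveform = run_pipeline("hi", toy_thinker, toy_talker, toy_code2wav)
```

The point of the sketch is only the data flow between stages; scheduling, batching, and streaming are outside its scope.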

NOTE: Since this PR makes significant changes to vLLM, it is a draft and will not be merged in the short term.

Requirements

This PR requires huggingface/transformers#36752.

pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8

Note: transformers must be installed from source, from the branch of that PR (the command above pins a specific commit).
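
To confirm which commit of transformers is actually installed, one option is to read the `direct_url.json` metadata that pip records for VCS installs (PEP 610). The following is a minimal sketch under that assumption; the helper names `commit_from_direct_url` and `installed_commit` are made up for this example.

```python
import json
from importlib import metadata
from typing import Optional


def commit_from_direct_url(direct_url_text: Optional[str]) -> Optional[str]:
    """Extract the VCS commit id from the contents of direct_url.json, if any."""
    if not direct_url_text:
        return None
    info = json.loads(direct_url_text)
    # Only VCS installs (e.g. pip install git+https://...) carry "vcs_info".
    return info.get("vcs_info", {}).get("commit_id")


def installed_commit(dist_name: str) -> Optional[str]:
    """Return the commit a distribution was installed from, or None if unknown."""
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None
    # read_text returns None when the file is absent (e.g. a normal PyPI install).
    return commit_from_direct_url(dist.read_text("direct_url.json"))
```

Calling `installed_commit("transformers")` after the install command above should return the pinned commit hash; a `None` result suggests the package came from PyPI or is not installed.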

Example Usage

python examples/offline_inference/qwen2_5_omni/end2end.py --model Qwen/Qwen2.5-Omni-7B --prompt audio-in-video-v2 --enforce-eager --do-wave --voice-type m02 --warmup-voice-type m02

This command prints the text output and writes .wav output files to the current folder.
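
A quick way to sanity-check the generated audio is to list the .wav files and their durations with the standard library. This is a minimal sketch that assumes the files are PCM-encoded (the `wave` module cannot read other encodings, so such files are skipped); `describe_wavs` is a hypothetical helper, not part of the example script.

```python
import wave
from pathlib import Path
from typing import List, Tuple


def describe_wavs(folder: str = ".") -> List[Tuple[str, int, float]]:
    """Return (file name, sample rate in Hz, duration in seconds) per .wav file."""
    results = []
    for path in sorted(Path(folder).glob("*.wav")):
        try:
            with wave.open(str(path), "rb") as wf:
                rate = wf.getframerate()
                frames = wf.getnframes()
        except wave.Error:
            # Not a PCM WAV file that the stdlib reader understands; skip it.
            continue
        results.append((path.name, rate, frames / rate))
    return results


if __name__ == "__main__":
    for name, rate, seconds in describe_wavs("."):
        print(f"{name}: {rate} Hz, {seconds:.2f} s")
```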

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>

github-actions bot commented Apr 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added documentation Improvements or additions to documentation frontend multi-modality Related to multi-modality (#4194) tpu Related to Google TPUs labels Apr 9, 2025

mergify bot commented Apr 9, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fyabc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 9, 2025
@DarkLight1337
Member

DarkLight1337 commented Apr 9, 2025

I think we can further split this PR, with the first one (after Qwen2.5-Omni thinker only) adding prompt_embeds support to vLLM. For reference, here are some previous/ongoing efforts to add this feature:

@ywang96
Member

ywang96 commented Apr 9, 2025

Thanks for this contribution! As we discussed offline, we'll be carefully reviewing this PR/design and thinking about how to enable end-to-end support for models like this with vLLM!

@mergify mergify bot added the ci/build label Apr 10, 2025
fyabc and others added 3 commits April 10, 2025 22:37
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
(cherry picked from commit 005879f2b22e40b7d03be7063e80686862a72e2d)
Signed-off-by: fyabc <suyang.fy@alibaba-inc.com>
elif 'video' in ele:
    audio_key = 'video'
    audios.append(librosa.load(ele[audio_key], sr=16000)[0])
    videos.append(fetch_and_read_video(audio_key))

Suggested change
- videos.append(fetch_and_read_video(audio_key))
+ videos.append(fetch_and_read_video(ele[audio_key]))
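
The suggestion matters because `audio_key` holds the dictionary key (the literal string 'video'), not the path stored under it. A minimal sketch of the difference follows; `fetch_and_read_video_stub` is a hypothetical stand-in for the real `fetch_and_read_video`, and `collect_videos` exists only for this illustration.

```python
from typing import Dict, List


def fetch_and_read_video_stub(path: str) -> str:
    """Hypothetical stand-in: 'loads' a video only if given a plausible path."""
    if not path.endswith(".mp4"):
        raise FileNotFoundError(path)
    return f"frames:{path}"


def collect_videos(elements: List[Dict[str, str]], use_fix: bool) -> List[str]:
    """Mimic the reviewed branch, with or without the suggested change."""
    videos = []
    for ele in elements:
        if 'video' in ele:
            audio_key = 'video'
            # Original code passes the key itself; the fix passes the value.
            arg = ele[audio_key] if use_fix else audio_key
            videos.append(fetch_and_read_video_stub(arg))
    return videos


elements = [{"video": "clip.mp4"}]
fixed = collect_videos(elements, use_fix=True)   # loads "clip.mp4"
try:
    collect_videos(elements, use_fix=False)      # tries to load "video"
    buggy_failed = False
except FileNotFoundError:
    buggy_failed = True
```

With the original line, the loader receives the string 'video' and fails in this sketch, while the audio branch on the preceding line already indexes `ele[audio_key]` correctly.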

Labels
ci/build documentation Improvements or additions to documentation frontend multi-modality Related to multi-modality (#4194) needs-rebase
6 participants