
Add Qwen2.5-Omni #36752

Merged
merged 57 commits into from Apr 14, 2025

Conversation

BakerBunker (Contributor)

What does this PR do?

Add Qwen2.5 Omni Model

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

@github-actions github-actions bot marked this pull request as draft March 16, 2025 13:48
@BakerBunker BakerBunker marked this pull request as ready for review March 16, 2025 14:15
@Rocketknight1 (Member)

cc @qubvel @zucchini-nlp

@zucchini-nlp zucchini-nlp (Member) left a comment


Super excited for the model release, and to see more modalities integrated within an LLM!

The modeling file is huge, which is expected given that it supports all 3 modalities plus speech generation. For an easier review process and general inspection, I suggest adding a modular_qwen_omni.py file. There is more info about modular in the internal Slack, and if you run into any trouble with it, lmk there (for faster replies).
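
For reference, a rough illustration of the modular pattern (the import and class names below are illustrative only, not necessarily what this PR ends up using): a modular_*.py file subclasses components from existing models, and the repo's converter script expands it into the generated modeling file.

# modular_qwen2_5_omni.py -- illustrative sketch only; names are hypothetical.
# Reuse an existing vision block instead of re-implementing it; the modular
# converter then expands this into a standalone modeling_*.py file.
from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import Qwen2_5_VLVisionBlock


class Qwen2_5OmniVisionBlock(Qwen2_5_VLVisionBlock):
    """Identical to the Qwen2.5-VL vision block; only the name changes."""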

Regarding the PR, I did a first quick pass and left some comments, most of which are about the general API and style of the audio generation blocks. Unfortunately we don't have DiT support within transformers yet, so let's try to write it as generally as possible. In the future we might want to add DiT as a new model.

The general points are about renaming one-letter variables, adding docs to some functions, and minimizing the possible code paths. I see some config values are never assigned and we rely on the defaults, so we don't need to support the other path.

The Token2Wav model needs some alignment with the other models' APIs; for example, we already have an NTK RoPE layer in Llama-like models that can be reused. Same for the attention-like DiT layers, which should look similar to the text attention layers with only minor changes.

I can review one more time on Monday after the changes. Also, if you have strict deadlines for the release, lmk so we can prioritize tasks internally.

Comment on lines 69 to 70
# Need to install ffmpeg to read non-wav/flac audio
audios, images, videos = process_mm_info(conversation, USE_AUDIO_IN_VIDEO)
Member

We recently updated chat templates to load and process all modalities. More info in https://huggingface.co/docs/transformers/main/en/chat_templating_multimodal

Can you let us know if that works for you? Audio support is coming; we're waiting for the Qwen2-Audio PR on the Hub to be merged 😅

If that doesn't fully work for qwen-omni, lmk which parts are missing so we can add them. The aim is to let users do inference in one line without depending on external helpers.
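
For illustration, a minimal sketch of that one-line flow as the linked docs describe it; the checkpoint id and message keys here are assumptions for the example, not something this PR prescribes.

from transformers import AutoProcessor

# Hypothetical checkpoint id, used only for illustration.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "draw.mp4"},
            {"type": "text", "text": "What is happening in this video?"},
        ],
    }
]

# The chat template loads and preprocesses the video itself, so no external
# helper such as process_mm_info is needed.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)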

Contributor Author

A missing part is extracting the audio track from the video and informing the model about it, like the USE_AUDIO_IN_VIDEO parameter in this line.

Member

Okay, I will add it before the Qwen release. Should not be hard since we already have a PR for audio and already support videos without the audio part.

Member

Support added: with a single flag users can load audio from video (#36955). Merging in a few minutes.
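
A hedged sketch of how that flag might be used; load_audio_from_video is the flag named in the follow-up comment below, and the exact call site may differ in the final API.

# Ask the chat template to extract the audio track from the video as well.
inputs = processor.apply_chat_template(
    messages,
    load_audio_from_video=True,  # flag referenced in #36955
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)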

Contributor Author

The current implementation breaks when load_audio_from_video is set to True and the video has no audio track.

Contributor Author

It can also crash if the audio track of the audio or video is in a format that librosa does not support, such as the AAC format in this case.

ffprobe message here:

ffprobe version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2007-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 70.100 / 56. 70.100
  libavcodec     58.134.100 / 58.134.100
  libavformat    58. 76.100 / 58. 76.100
  libavdevice    58. 13.100 / 58. 13.100
  libavfilter     7.110.100 /  7.110.100
  libswscale      5.  9.100 /  5.  9.100
  libswresample   3.  9.100 /  3.  9.100
  libpostproc    55.  9.100 / 55.  9.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4':
  Metadata:
    major_brand     : qt  
    minor_version   : 0
    compatible_brands: qt  
    creation_time   : 2025-03-14T07:52:19.000000Z
    com.apple.quicktime.artwork: {"data":{"editType":"default","edittime":835,"infoStickerId":"","is_ai_lyric":0,"is_aimusic_mv":0,"is_use_ai_image_generation":0,"is_use_ai_sound":0,"is_use_ai_video_generation":0,"is_use_aimusic_bgm":0,"is_use_aimusic_vocal":0,"is_use_graph_chart":0,"is_
  Duration: 00:00:20.97, start: 0.000000, bitrate: 13130 kb/s
  Stream #0:0(und): Video: hevc (Main 10) (hvc1 / 0x31637668), yuv420p10le(tv, bt709), 3840x2160, 12992 kb/s, 30 fps, 30 tbr, 600 tbn, 600 tbc (default)
    Metadata:
      creation_time   : 2025-03-14T07:52:19.000000Z
      handler_name    : Core Media Video
      vendor_id       : [0][0][0][0]
      encoder         : HEVC
  Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 131 kb/s (default)
    Metadata:
      creation_time   : 2025-03-14T07:52:19.000000Z
      handler_name    : Core Media Audio
      vendor_id       : [0][0][0][0]
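
As an aside, a minimal way to guard against the missing-audio-track case is to probe the file first. This is a hedged sketch, not part of this PR; it assumes ffprobe is available on PATH.

import json
import subprocess

def has_audio_stream(path: str) -> bool:
    # Return True if ffprobe reports at least one audio stream in the file/URL.
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "a",
         "-show_entries", "stream=codec_type", "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    return len(json.loads(result.stdout).get("streams", [])) > 0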

Member

I wouldn't expect one to load audio if the video has no audio in it. We will add support for other formats with moviepy in subsequent PRs. For the release, though, I would prefer to use processor-only examples.

@wangxiongts

In addition, starting today we will be on a national holiday in China. Could the merge be done next Monday instead, so that we can update the readme/docker/vllm/cookbooks at the same time and prevent users from running into problems after pulling the latest transformers? Thanks :)

@zucchini-nlp (Member)

@wangxiongts sure, we will release after you are back then. Enjoy!

positions = sorted([match.group() for match in re.finditer(pattern, sample)])

for special_token in positions:
    if special_token == self.audio_token:

Hi @ArthurZucker @zucchini-nlp @wangxiongts, in our vLLM PR the processor is called with multimodal tokens but without the related mm data (e.g., text='<|VIDEO|>', videos=None; this happens in _apply_hf_processor_text_only). So we need to change the conditions to check if audio is not None and self.audio_token, and to provide the video_second_per_grid argument.
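
A minimal sketch of that suggested guard, folded into the loop quoted above; audio stands for whatever audio input the processor received and is an illustrative name here.

for special_token in positions:
    # Only expand the audio placeholder when audio data was actually provided,
    # so text-only calls (e.g. from vLLM) don't crash.
    if audio is not None and special_token == self.audio_token:
        ...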

@zucchini-nlp zucchini-nlp (Member) Apr 10, 2025

Hmm, afaik vLLM creates dummy mm data in that case, since most VLMs in transformers raise an error when the number of tokens in the text doesn't correspond to the number of images 🤔

This is how to work around it in vLLM as per the docs (https://docs.vllm.ai/en/latest/design/mm_processing.html#dummy-text)

@weedge commented Apr 13, 2025

Hi, in my tests the thinker can stream out text, but how can the talker stream out audio at the same time? https://github.com/fyabc/vllm/blob/729feed3ec2beefe63fda30a345ef363d08062f8/vllm/engine/omni_llm_engine.py#L1965

# run streaming cases
## run thinker-only token streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c asr_stream -d L4

## run thinker-only chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c thinker_chunk_stream -d L4

## run thinker talker-token code2wav-chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_stream -d L4

## run text -> text+speech | thinker-chunk talker-token code2wav-chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_segment_stream -d L40s

## run  vision (video with audio) -> text+speech | thinker-chunk talker-token code2wav-chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c screen_recording_interaction_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c screen_recording_interaction_chunk_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c video_information_extracting_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c video_information_extracting_chunk_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_math_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_music_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_math_chunk_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_music_chunk_stream -d L40s

@zucchini-nlp (Member)

Merging, thanks everyone for your help 🚀

@zucchini-nlp zucchini-nlp merged commit 4b8c6d4 into huggingface:main Apr 14, 2025
18 checks passed
cyr0930 pushed a commit to cyr0930/transformers that referenced this pull request Apr 18, 2025
* Add qwen2.5-omni

* Remove einops dependency

* Add torchdiffeq dependency

* Sort init

* Add torchdiffeq to extras['diffeq']

* Fix repo consistency

* use cached_file

* del odeint

* renew pytest

* format

* Remove torchdiffeq

* format

* fixed batch infer bug

* Change positional_embedding to parameter

* Change default speaker

* Config revision

* Use modular & code clean

* code clean

* decouple padding with model & code cleaning

* sort init

* fix

* fix

* Second code review

* fix

* fix

* rename vars to full name + some comments

* update pytest

* Code clean & fix

* fix

* style

* more clean up

* fixup

* smaller vision model in tests

* fix processor test

* deflake a bit the tests (still flaky though)

* de-flake tests finally + add generation mixin

* final nits i hope

* make sure processor tests are complete

* replace with Qwen2_5OmniForConditionalGeneration

* fix tests after updating ckpt

* fix typos when cleaning, also we can't change ckpt

* fixup

* images and videos kwargs for processor

* thinker and talker loadable from hub ckpt

* address comments and update tests after rebase

* fixup

* skip for now

* fixup

* fixup

* remove torch dependency in processors

---------

Co-authored-by: lvyuanjun.lyj <lvyuanjun.lyj@alibaba-inc.con>
Co-authored-by: feizi.wx <feizi.wx@alibaba-inc.com>
Co-authored-by: raushan <raushan@huggingface.co>
@guangy10 guangy10 mentioned this pull request Apr 29, 2025