Add Qwen2.5-Omni #36752
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button.
Super excited for the model release, and to see more modalities integrated within an LLM!
The modeling file is huge, which is expected since it supports all three modalities plus speech generation. For an easier review process and general inspection, I suggest adding a modular_qwen_omni.py file. There is more info about modular in internal Slack, and if you have any trouble with it, lmk in Slack (for faster replies).
Regarding the PR, I did a first quick review and left some comments, most of which are about the general API and style of the audio generation blocks. Unfortunately we don't have DiT support within transformers, so let's try to write it as generally as possible. In the future we might want to add DiT as a new model.
The general points are about naming one-letter variables, adding docs to some functions, and minimizing possible code paths. I see some config values are never assigned and we rely on defaults; in that case we don't need to support the other path.
The Token2Wav model needs some alignment with the API of other models. For example, we already have an NTK RoPE layer in Llama-like models, so it should be reused. The same goes for the attention-like DiT layers, which should be similar to the text attention layers with some minor changes.
I can review one more time on Monday after the changes. Also, if you have strict deadlines for the release, lmk so we can prioritize tasks internally.
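To make the modular suggestion concrete, here is a minimal sketch of what a modular_qwen_omni.py could contain, assuming the usual modular convention of subclassing existing Llama/Qwen2 layers and letting the converter generate the full modeling file; the class names are illustrative and are not the ones that ended up being merged.

# Minimal sketch of a modular file (illustrative names, not the final code):
# subclass existing layers and only override what differs.
from transformers.models.llama.modeling_llama import (
    LlamaRMSNorm,
    LlamaRotaryEmbedding,
)
from transformers.models.qwen2.modeling_qwen2 import Qwen2Attention


class Qwen2_5OmniRMSNorm(LlamaRMSNorm):
    pass


class Qwen2_5OmniRotaryEmbedding(LlamaRotaryEmbedding):
    # NTK/dynamic RoPE variants are picked through the config's rope_scaling,
    # so Token2Wav can reuse this instead of shipping its own RoPE code
    pass


class Qwen2_5OmniDiTAttention(Qwen2Attention):
    # keep the DiT attention block shaped like the text attention layer,
    # layering only the omni-specific changes on top
    pass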
# Need to install ffmpeg to read audio formats other than wav & flac
audios, images, videos = process_mm_info(conversation, USE_AUDIO_IN_VIDEO)
We recently updated chat templates to load and process all modalities. More info at https://huggingface.co/docs/transformers/main/en/chat_templating_multimodal
Can you let us know if that works for you? Audio support is coming; we're waiting for the Qwen2-Audio PR on the Hub to be merged 😅
If that doesn't fully work for qwen-omni, lmk which parts are missing so we can add them. The aim is to let users do inference in one line without depending on external helpers.
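For reference, a rough sketch of the one-line path the linked docs describe; the checkpoint name and message layout are assumptions for illustration, not the exact example from the docs.

from transformers import AutoProcessor

# checkpoint name is an assumption for illustration
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "draw.mp4"},
            {"type": "text", "text": "What is happening in this video?"},
        ],
    }
]

# apply_chat_template loads and preprocesses the media itself,
# so no external helper like process_mm_info is needed
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)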
A missing part is extracting the audio track from the video and informing the model about it, like the USE_AUDIO_IN_VIDEO parameter in this line.
Okay, I will add it before the Qwen release. It should not be hard since we already have a PR for audio and already support videos without an audio track.
Support added: with a single flag, users can load audio from video (#36955). Merging in a few minutes.
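A sketch of how the flag from #36955 is meant to be used, reusing the processor and conversation from the earlier sketch; the exact keyword name and signature may differ from what was merged.

# hedged sketch: with load_audio_from_video=True the audio track of each
# video entry is extracted and fed to the model alongside the frames
inputs = processor.apply_chat_template(
    conversation,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)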
The current implementation breaks when setting load_audio_from_video to True for a video that has no audio track.
Or, if the audio track of the audio or video file is in a format that librosa does not support, such as the AAC format in this case, it can also cause a crash.
ffprobe output:
ffprobe version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2007-2021 the FFmpeg developers
built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
libavutil 56. 70.100 / 56. 70.100
libavcodec 58.134.100 / 58.134.100
libavformat 58. 76.100 / 58. 76.100
libavdevice 58. 13.100 / 58. 13.100
libavfilter 7.110.100 / 7.110.100
libswscale 5. 9.100 / 5. 9.100
libswresample 3. 9.100 / 3. 9.100
libpostproc 55. 9.100 / 55. 9.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4':
Metadata:
major_brand : qt
minor_version : 0
compatible_brands: qt
creation_time : 2025-03-14T07:52:19.000000Z
com.apple.quicktime.artwork: {"data":{"editType":"default","edittime":835,"infoStickerId":"","is_ai_lyric":0,"is_aimusic_mv":0,"is_use_ai_image_generation":0,"is_use_ai_sound":0,"is_use_ai_video_generation":0,"is_use_aimusic_bgm":0,"is_use_aimusic_vocal":0,"is_use_graph_chart":0,"is_
Duration: 00:00:20.97, start: 0.000000, bitrate: 13130 kb/s
Stream #0:0(und): Video: hevc (Main 10) (hvc1 / 0x31637668), yuv420p10le(tv, bt709), 3840x2160, 12992 kb/s, 30 fps, 30 tbr, 600 tbn, 600 tbc (default)
Metadata:
creation_time : 2025-03-14T07:52:19.000000Z
handler_name : Core Media Video
vendor_id : [0][0][0][0]
encoder : HEVC
Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 131 kb/s (default)
Metadata:
creation_time : 2025-03-14T07:52:19.000000Z
handler_name : Core Media Audio
vendor_id : [0][0][0][0]
I wouldn't expect anyone to load audio from a video that has no audio in it. We will add support for other formats with moviepy in subsequent PRs. For the release, though, I would prefer to use processor-only examples.
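Until that lands, a defensive pattern callers can use is to probe the file first and only request audio extraction when an audio stream is actually present. The sketch below uses PyAV (assumed to be installed) and reuses the processor and conversation from the earlier sketch; it is not part of this PR.

import av  # PyAV, assumed available

def video_has_audio(path: str) -> bool:
    # open the container and check whether it exposes any audio stream
    with av.open(path) as container:
        return len(container.streams.audio) > 0

use_audio = video_has_audio("draw.mp4")
inputs = processor.apply_chat_template(
    conversation,
    load_audio_from_video=use_audio,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)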
In addition, starting today we will be on a national holiday in China, so could the merge be done next Monday instead? That way we can update the readme/docker/vllm/cookbooks at the same time and prevent users from running into problems after pulling the latest transformers. Thanks :)
@wietsedv sure, we will release after you are back then. Enjoy!
positions = sorted([match.group() for match in re.finditer(pattern, sample)])

for special_token in positions:
    if special_token == self.audio_token:
Hi @ArthurZucker @zucchini-nlp @wangxiongts, in our vLLM PR the processor will be called with multimodal tokens but without the related mm data (e.g. text='<|VIDEO|>', videos=None), via _apply_hf_processor_text_only. So we need to change the conditions to if audio is not None and self.audio_token, and provide a video_second_per_grid argument.
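A small, self-contained illustration of the guard being suggested; the function and token names are hypothetical stand-ins for the processor code above, not the final implementation.

import re

def expand_placeholders(text, audio=None, videos=None,
                        audio_token="<|AUDIO|>", video_token="<|VIDEO|>"):
    # only touch a placeholder when the matching modality data was provided,
    # so a call like text='<|VIDEO|>' with videos=None passes through untouched
    positions = sorted(m.group() for m in re.finditer(r"<\|(?:AUDIO|VIDEO)\|>", text))
    for special_token in positions:
        if special_token == audio_token and audio is not None:
            pass  # expand the audio placeholder from the audio feature lengths
        elif special_token == video_token and videos is not None:
            pass  # expand the video placeholder, using video_second_per_grid
    return text

# vLLM-style call: multimodal token in the text, but no media attached
expand_placeholders("<|VIDEO|>", videos=None)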
Hmm, afaik vLLM creates dummy mm data in that case, since most VLMs in transformers raise an error when the number of tokens in the text doesn't correspond to the number of images 🤔
This is how to work around it with vLLM as per the docs: https://docs.vllm.ai/en/latest/design/mm_processing.html#dummy-text
Hi, in my tests the thinker can stream out text, but how can the talker stream out audio together with it? See https://github.com/fyabc/vllm/blob/729feed3ec2beefe63fda30a345ef363d08062f8/vllm/engine/omni_llm_engine.py#L1965
# run streaming cases
## run thinker-only token streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c asr_stream -d L4
## run thinker-only chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c thinker_chunk_stream -d L4
## run thinker talker-token code2wav-chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_stream -d L4
## run text -> text+speech | thinker-chunk talker-token code2wav-chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_segment_stream -d L40s
## run vision (video with audio) -> text+speech | thinker-chunk talker-token code2wav-chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c screen_recording_interaction_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c screen_recording_interaction_chunk_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c video_information_extracting_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c video_information_extracting_chunk_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_math_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_music_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_math_chunk_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_music_chunk_stream -d L40s
Merging, thanks everyone for your help 🚀
* Add qwen2.5-omni
* Remove einops dependency
* Add torchdiffeq dependency
* Sort init
* Add torchdiffeq to extras['diffeq']
* Fix repo consistency
* use cached_file
* del odeint
* renew pytest
* format
* Remove torchdiffeq
* format
* fixed batch infer bug
* Change positional_embedding to parameter
* Change default speaker
* Config revision
* Use modular & code clean
* code clean
* decouple padding with model & code cleaning
* sort init
* fix
* fix
* Second code review
* fix
* fix
* rename vars to full name + some comments
* update pytest
* Code clean & fix
* fix
* style
* more clean up
* fixup
* smaller vision model in tests
* fix processor test
* deflake a bit the tests (still flaky though)
* de-flake tests finally + add generation mixin
* final nits i hope
* make sure processor tests are complete
* replace with Qwen2_5OmniForConditionalGeneration
* fix tests after updating ckpt
* fix typos when cleaning, also we can't change ckpt
* fixup
* images and videos kwargs for processor
* thinker and talker loadable from hub ckpt
* address comments and update tests after rebase
* fixup
* skip for now
* fixup
* fixup
* remove torch dependency in processors

---------

Co-authored-by: lvyuanjun.lyj <lvyuanjun.lyj@alibaba-inc.con>
Co-authored-by: feizi.wx <feizi.wx@alibaba-inc.com>
Co-authored-by: raushan <raushan@huggingface.co>
What does this PR do?
Add Qwen2.5 Omni Model
Before submitting
* Did you read the contributor guideline, Pull Request section?
* Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
* Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.