
Add Qwen2.5-Omni #36752

Merged
merged 57 commits into from Apr 14, 2025

Conversation

BakerBunker (Contributor)

What does this PR do?

Add Qwen2.5 Omni Model

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

@github-actions github-actions bot marked this pull request as draft March 16, 2025 13:48
@BakerBunker BakerBunker marked this pull request as ready for review March 16, 2025 14:15
@Rocketknight1 (Member)

cc @qubvel @zucchini-nlp

@zucchini-nlp zucchini-nlp (Member) left a comment


Super excited for the model release, and to see more modalities integrated within an LLM!

The modeling file is huge, which is expected given that it supports all 3 modalities plus speech generation. For an easier review process and general inspection, I suggest adding a modular_qwen_omni.py file. There is more info about modular in the internal Slack, and if you run into any trouble with it, lmk there (for faster replies).
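
For reference, a rough illustration of the modular pattern (the import and class names below are illustrative only, not necessarily what this PR ends up using): a modular_*.py file subclasses components from existing models, and the repo's converter script expands it into the generated modeling file.

# modular_qwen2_5_omni.py -- illustrative sketch only; names are hypothetical.
# Reuse an existing vision block instead of re-implementing it; the modular
# converter then expands this into a standalone modeling_*.py file.
from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import Qwen2_5_VLVisionBlock


class Qwen2_5OmniVisionBlock(Qwen2_5_VLVisionBlock):
    """Identical to the Qwen2.5-VL vision block; only the name changes."""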

Regarding the PR, I did a first quick pass and left some comments, most of which are about the general API and style of the audio generation blocks. Unfortunately we don't have DiT support within transformers yet, so let's try to write it as generally as possible. In the future we might want to add DiT as a new model.

The general points are about renaming one-letter variables, adding docs to some functions, and minimizing the possible code paths. I see some config values are never assigned and we rely on the defaults, so we don't need to support the other path.

The Token2Wav model needs some alignment with the other models' APIs; for example, we already have an NTK RoPE layer in Llama-like models that can be reused. Same for the attention-like DiT layers, which should look similar to the text attention layers with only minor changes.

I can review one more time on Monday after the changes. Also, if you have strict deadlines for the release, lmk so we can prioritize tasks internally.

Comment on lines 69 to 70
# Need to install ffmpeg to read non-wav/flac audio
audios, images, videos = process_mm_info(conversation, USE_AUDIO_IN_VIDEO)
Member

We recently updated chat templates to load and process all modalities. More info in https://huggingface.co/docs/transformers/main/en/chat_templating_multimodal

Can you let us know if that works for you? Audio support is coming; we're waiting for the Qwen2-Audio PR on the Hub to be merged 😅

If that doesn't fully work for qwen-omni, lmk which parts are missing so we can add them. The aim is to let users do inference in one line without depending on external helpers.
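
For illustration, a minimal sketch of that one-line flow as the linked docs describe it; the checkpoint id and message keys here are assumptions for the example, not something this PR prescribes.

from transformers import AutoProcessor

# Hypothetical checkpoint id, used only for illustration.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "draw.mp4"},
            {"type": "text", "text": "What is happening in this video?"},
        ],
    }
]

# The chat template loads and preprocesses the video itself, so no external
# helper such as process_mm_info is needed.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)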

Contributor Author

A missing part is extracting the audio track from the video and informing the model about it, like the USE_AUDIO_IN_VIDEO parameter in this line.

Member

Okay, I will add it before the Qwen release. Should not be hard since we already have a PR for audio and already support videos without the audio part.

Member

Support added: with a single flag users can load audio from video (#36955). Merging in a few minutes.
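
A hedged sketch of how that flag might be used; load_audio_from_video is the flag named in the follow-up comment below, and the exact call site may differ in the final API.

# Ask the chat template to extract the audio track from the video as well.
inputs = processor.apply_chat_template(
    messages,
    load_audio_from_video=True,  # flag referenced in #36955
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)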

Contributor Author

The current implementation breaks when load_audio_from_video is set to True and the video has no audio track.

Contributor Author

It can also crash if the audio track of the audio or video is in a format that librosa does not support, such as the AAC format in this case.

ffprobe message here:

ffprobe version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2007-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 70.100 / 56. 70.100
  libavcodec     58.134.100 / 58.134.100
  libavformat    58. 76.100 / 58. 76.100
  libavdevice    58. 13.100 / 58. 13.100
  libavfilter     7.110.100 /  7.110.100
  libswscale      5.  9.100 /  5.  9.100
  libswresample   3.  9.100 /  3.  9.100
  libpostproc    55.  9.100 / 55.  9.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4':
  Metadata:
    major_brand     : qt  
    minor_version   : 0
    compatible_brands: qt  
    creation_time   : 2025-03-14T07:52:19.000000Z
    com.apple.quicktime.artwork: {"data":{"editType":"default","edittime":835,"infoStickerId":"","is_ai_lyric":0,"is_aimusic_mv":0,"is_use_ai_image_generation":0,"is_use_ai_sound":0,"is_use_ai_video_generation":0,"is_use_aimusic_bgm":0,"is_use_aimusic_vocal":0,"is_use_graph_chart":0,"is_
  Duration: 00:00:20.97, start: 0.000000, bitrate: 13130 kb/s
  Stream #0:0(und): Video: hevc (Main 10) (hvc1 / 0x31637668), yuv420p10le(tv, bt709), 3840x2160, 12992 kb/s, 30 fps, 30 tbr, 600 tbn, 600 tbc (default)
    Metadata:
      creation_time   : 2025-03-14T07:52:19.000000Z
      handler_name    : Core Media Video
      vendor_id       : [0][0][0][0]
      encoder         : HEVC
  Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 131 kb/s (default)
    Metadata:
      creation_time   : 2025-03-14T07:52:19.000000Z
      handler_name    : Core Media Audio
      vendor_id       : [0][0][0][0]
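
As an aside, a minimal way to guard against the missing-audio-track case is to probe the file first. This is a hedged sketch, not part of this PR; it assumes ffprobe is available on PATH.

import json
import subprocess

def has_audio_stream(path: str) -> bool:
    # Return True if ffprobe reports at least one audio stream in the file/URL.
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "a",
         "-show_entries", "stream=codec_type", "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    return len(json.loads(result.stdout).get("streams", [])) > 0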

Member

I wouldn't expect one to load audio if the video has no audio in it. We will add support for other formats with moviepy in subsequent PRs. For the release, though, I would prefer to use processor-only examples.

@wangxiongts

In addition, starting today we will be on a national holiday in China. Could the merge be done next Monday instead, so that we can update the readme/docker/vllm/cookbooks at the same time and prevent users from running into problems after pulling the latest transformers? Thanks :)

@zucchini-nlp (Member)

@wangxiongts sure, we will release after you are back then. Enjoy!

positions = sorted([match.group() for match in re.finditer(pattern, sample)])

for special_token in positions:
    if special_token == self.audio_token:

Hi @ArthurZucker @zucchini-nlp @wangxiongts, in our vLLM PR the processor is called with multimodal tokens but without the related mm data (e.g., text='<|VIDEO|>', videos=None; this happens in _apply_hf_processor_text_only). So we need to change the conditions to check if audio is not None and self.audio_token, and to provide the video_second_per_grid argument.
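
A minimal sketch of that suggested guard, folded into the loop quoted above; audio stands for whatever audio input the processor received and is an illustrative name here.

for special_token in positions:
    # Only expand the audio placeholder when audio data was actually provided,
    # so text-only calls (e.g. from vLLM) don't crash.
    if audio is not None and special_token == self.audio_token:
        ...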

@zucchini-nlp zucchini-nlp (Member) Apr 10, 2025

Hmm, afaik vLLM creates dummy mm data in that case, since most VLMs in transformers raise an error when the number of tokens in the text doesn't correspond to the number of images 🤔

This is how to work around it in vLLM as per the docs (https://docs.vllm.ai/en/latest/design/mm_processing.html#dummy-text)

@weedge commented Apr 13, 2025

Hi, in my tests the thinker can stream out text, but how can the talker stream out audio at the same time? https://github.com/fyabc/vllm/blob/729feed3ec2beefe63fda30a345ef363d08062f8/vllm/engine/omni_llm_engine.py#L1965

# run streaming cases
## run thinker-only token streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c asr_stream -d L4

## run thinker-only chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c thinker_chunk_stream -d L4

## run thinker talker-token code2wav-chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_stream -d L4

## run text -> text+speech | thinker-chunk talker-token code2wav-chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_segment_stream -d L40s

## run  vision (video with audio) -> text+speech | thinker-chunk talker-token code2wav-chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c screen_recording_interaction_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c screen_recording_interaction_chunk_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c video_information_extracting_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c video_information_extracting_chunk_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_math_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_music_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_math_chunk_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_music_chunk_stream -d L40s

@zucchini-nlp (Member)

Merging, thanks everyone for your help 🚀

@zucchini-nlp zucchini-nlp merged commit 4b8c6d4 into huggingface:main Apr 14, 2025
18 checks passed
cyr0930 pushed a commit to cyr0930/transformers that referenced this pull request Apr 18, 2025
* Add qwen2.5-omni

* Remove einops dependency

* Add torchdiffeq dependency

* Sort init

* Add torchdiffeq to extras['diffeq']

* Fix repo consistency

* use cached_file

* del odeint

* renew pytest

* format

* Remove torchdiffeq

* format

* fixed batch infer bug

* Change positional_embedding to parameter

* Change default speaker

* Config revision

* Use modular & code clean

* code clean

* decouple padding with model & code cleaning

* sort init

* fix

* fix

* Second code review

* fix

* fix

* rename vars to full name + some comments

* update pytest

* Code clean & fix

* fix

* style

* more clean up

* fixup

* smaller vision model in tests

* fix processor test

* deflake a bit the tests (still flaky though)

* de-flake tests finally + add generation mixin

* final nits i hope

* make sure processor tests are complete

* replace with Qwen2_5OmniForConditionalGeneration

* fix tests after updating ckpt

* fix typos when cleaning, also we can't change ckpt

* fixup

* images and videos kwargs for processor

* thinker and talker loadable from hub ckpt

* address comments and update tests after rebase

* fixup

* skip for now

* fixup

* fixup

* remove torch dependency in processors

---------

Co-authored-by: lvyuanjun.lyj <lvyuanjun.lyj@alibaba-inc.con>
Co-authored-by: feizi.wx <feizi.wx@alibaba-inc.com>
Co-authored-by: raushan <raushan@huggingface.co>
@guangy10 guangy10 mentioned this pull request Apr 29, 2025