# 🔥🔥🔥 Awesome MLLMs/Benchmarks for Short/Long/Streaming Video Understanding 📹
Title | Venue | Date | Code | Frames |
---|---|---|---|---|
[Benchmark] OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | arXiv | 2025-01 | GitHub | Streaming |
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction | arXiv | 2025-01 | GitHub | Streaming |
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction | arXiv | 2025-01 | GitHub | Streaming |
Streaming Long Video Understanding with Large Language Models | arXiv | 2024-05 | - | 16 (Streaming) |
Title | Venue | Date | Code | Frames |
---|---|---|---|---|
T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs | arXiv | 2024-12 | GitHub | - |
Title | Venue | Date | Repo | Leaderboard |
---|---|---|---|---|
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | - | 2023-12 | GitHub | - |
TempCompass: Do Video LLMs Really Understand Videos? | ACL | 2024-03 | GitHub | - |
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | - | 2024-06 | GitHub | - |
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding | NeurIPS D&B | 2024-06 | GitHub | - |
MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding | arXiv | 2024-06 | GitHub | - |
HourVideo: 1-Hour Video-Language Understanding | NeurIPS D&B | 2024-11 | GitHub | Coming soon |