
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, and Zilong Zheng#

CVPR · 2025 · arXiv: arxiv.org/abs/2503.22952


Abstract

The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses 1,121 real-world interactive videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enhance real-time interactive reasoning with minimal finetuning on pre-trained MLLMs. Extensive experimental results reveal that existing MLLMs fall short in interactive streaming understanding, particularly struggling with proactive tasks and multi-turn queries. Our proposed M4, though lightweight, achieves significant improvements in handling proactive tasks and real-time interactions.

Figure 1. An overview of OmniMMI. OmniMMI consists of two categories of multi-modal interactive challenges: streaming video understanding (top) and proactive reasoning (bottom). Each query is processed into natural language text and synthetic audio as input.



Citation

@inproceedings{cvpr25omnimmi,
    title={OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts},
    author={Wang, Yuxuan and Wang, Yueqian and Chen, Bo and Wu, Tong and Zhao, Dongyan and Zheng, Zilong},
    booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2025}
}

Related Publications

  • Zheng et al., MCU: An Evaluation Framework for Open-Ended Game Agents, in ICML, 2025.
  • Wang et al., Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge for Long Video Understanding, in EMNLP, 2024.
  • Wu et al., An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding, in NeurIPS, 2024.
  • Wang et al., JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models, TPAMI, 2024.
  • Wang et al., ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning, in CoLM, 2024.