PUBLICATION     ICCV'25

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

Yuxuan Wang*, Yiqi Song*, Cihang Xie, Yang Liu, and Zilong Zheng#

ICCV · 2025 · arXiv: arxiv.org/abs/2409.01071


Abstract

Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interaction. However, their high computational demands and the scarcity of annotated datasets limit their practicality for academic researchers. In this work, we introduce VideoLLaMB, a novel framework that utilizes temporal memory tokens within bridge layers to encode entire video sequences alongside historical visual data, effectively preserving semantic continuity and enhancing model performance across various tasks. The approach combines recurrent memory tokens with a SceneTilling algorithm that segments videos into independent semantic units, preserving semantic integrity. Empirically, VideoLLaMB significantly outperforms existing video-language models, achieving a 5.5-point improvement over its competitors across three VideoQA benchmarks and a 2.06-point improvement on egocentric planning. Comprehensive results on MVBench show that VideoLLaMB-7B achieves markedly better results than previous 7B models built on the same LLM. Remarkably, it maintains performance as robust as PLLaVA's even as video length increases up to 8 times. In addition, frame retrieval results on our specialized Needle in a Video Haystack (NIAVH) benchmark further validate VideoLLaMB's ability to accurately identify specific frames within lengthy videos. Our SceneTilling algorithm also enables the generation of streaming video captions directly, without additional training. In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single NVIDIA A100 GPU with linear GPU memory scaling, ensuring both high performance and cost-effectiveness, thereby setting a new foundation for long-form video-language models in both academic and practical applications.
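For intuition, the recurrent memory-bridge idea can be sketched in PyTorch as below. This is a minimal illustration under assumptions: a single Transformer encoder layer stands in for the bridge, and all names (MemoryBridge, num_memory_tokens, d_model) are hypothetical rather than the paper's actual implementation.

# Minimal sketch of a recurrent memory bridge over semantic video segments.
# Module/parameter names are illustrative assumptions, not the released code.
import torch
import torch.nn as nn


class MemoryBridge(nn.Module):
    """Carries a small set of memory tokens recurrently across video segments."""

    def __init__(self, d_model: int = 1024, num_memory_tokens: int = 32, n_heads: int = 8):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_memory_tokens, d_model) * 0.02)
        self.layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)

    def forward(self, segments: list[torch.Tensor]) -> list[torch.Tensor]:
        """segments: list of (num_tokens_i, d_model) feature tensors, one per segment."""
        bridged = []
        mem = self.memory.unsqueeze(0)  # (1, M, d)
        for seg in segments:
            seg = seg.unsqueeze(0)  # (1, T, d)
            # Jointly encode the current segment with the running memory tokens.
            out = self.layer(torch.cat([mem, seg], dim=1))
            mem_new, seg_out = out[:, : mem.size(1)], out[:, mem.size(1):]
            # Retrieval-style update: blend the new memory with the previous
            # memory slots it attends to most (a stand-in for the paper's
            # memory-retrieval mechanism, not its exact formulation).
            attn = torch.softmax(mem_new @ mem.transpose(1, 2) / mem.size(-1) ** 0.5, dim=-1)
            mem = mem_new + attn @ mem
            # Memory-token-augmented segment features, to be projected into the LLM.
            bridged.append(torch.cat([mem, seg_out], dim=1).squeeze(0))
        return bridged

Because each segment is encoded together with only a fixed number of memory tokens, per-segment cost stays constant and GPU memory grows roughly linearly with the number of frames, consistent with the scaling behavior described above.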

Figure 1. An overview of VideoLLaMB. VideoLLaMB first extracts video features with an off-the-shelf vision encoder, then applies SceneTilling to segment the video into semantic segments. Next, recurrent memory is applied over these semantic segments to store video information in memory tokens, and a retrieval mechanism updates the memory tokens to address long-range dependencies. Finally, the memory-token-augmented features of the current video segment are projected into the LLM.
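A SceneTilling-style segmentation can be sketched as a TextTiling-like procedure over per-frame features: score the cosine similarity of consecutive frames and cut where the similarity dips unusually deep. The depth-score definition, pooling, and threshold below are assumptions for illustration, not the paper's exact algorithm.

# Sketch of a SceneTilling-style segmentation over pooled per-frame features.
# Depth score and threshold are illustrative assumptions.
import torch
import torch.nn.functional as F


def scene_tilling(frame_feats: torch.Tensor, std_factor: float = 0.5) -> list[torch.Tensor]:
    """frame_feats: (num_frames, d) pooled per-frame features from the vision encoder."""
    # Cosine similarity between consecutive frames.
    sims = F.cosine_similarity(frame_feats[:-1], frame_feats[1:], dim=-1)
    # Depth score: how far each similarity dips below its running maxima on both sides.
    left = torch.cummax(sims, dim=0).values
    right = torch.flip(torch.cummax(torch.flip(sims, dims=(0,)), dim=0).values, dims=(0,))
    depth = (left - sims) + (right - sims)
    # Cut where the dip is unusually deep (mean + std_factor * std is an assumed rule).
    threshold = depth.mean() + std_factor * depth.std()
    cuts = (depth > threshold).nonzero(as_tuple=True)[0] + 1
    bounds = [0, *cuts.tolist(), frame_feats.size(0)]
    return [frame_feats[s:e] for s, e in zip(bounds[:-1], bounds[1:]) if e > s]

Each resulting segment would then be passed through the memory bridge sketched above before projection into the LLM.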



Citation

@inproceedings{wang2025videollamb,
    title={VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges}, 
    author={Yuxuan Wang and Cihang Xie and Yang Liu and Zilong Zheng},
    year={2025},
    booktitle={International Conference on Computer Vision},
    url={https://arxiv.org/abs/2409.01071}, 
}

Related Publications

  • Wang et al., OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts, in CVPR, 2025.
  • Zheng et al., MCU: An Evaluation Framework for Open-Ended Game Agents, in ICML, 2025.
  • Wang et al., Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge for Long Video Understanding, in EMNLP, 2024.
  • Wu et al., An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding, in NeurIPS, 2024.
  • Wang et al., JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models, in TPAMI, 2024.
  • Wang et al., ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning, in CoLM, 2024.