Zilong Zheng

Email: z.zheng[at]ucla[dot]edu

I received my Ph.D. ('21) from the Department of Computer Science at the University of California, Los Angeles (UCLA). My research interests lie at the intersection of statistical machine learning, natural language processing, and cognition. Current research themes include:

  • Human-AI Alignment: Building interactive models that align with human values and social norms.
  • Long-context Language Models: Efficient training and inference of long-context language models.
  • Generative Modeling: Statistical generative modeling (e.g., energy-based models and diffusion models) on high-dimensional data.

I am always looking for self-motivated students and long-term collaborators. Please contact me if you have a strong background or share similar research interests.

News

2024/12 I will co-host the 1st Workshop on Large Language Models and Structure Modeling. Stay tuned :fire:.
2024/12 DiveR-CT is accepted to AAAI'25. Congratulations to Andrew!
2024/09 Two papers on long context window extension and situated inductive reasoning are accepted to NeurIPS’24. Congratulations to Tong and Xiaojuan!
2024/09 Two papers on video understanding and sentence representation are accepted to EMNLP’24 Main. Congratulations to Yuxuan and Ziyong!
2024/07 One paper on theory of mind (ToM) for dialogue modeling is accepted for an Oral presentation at SIGDial'24. Congratulations to Shuwen!

Selected Publications

*: Equal contribution, : Corresponding author
  1. DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints AAAI'25
    Andrew Zhao, Quentin Xu, Matthieu Liu, Shenzhi Wang, Yong-jin Liu, Zilong Zheng, and Gao Huang, in AAAI, 2025.
  2. Mars: Situated Inductive Reasoning in an Open-World Environment NeurIPS'24
    Xiaojuan Tang, Jiaqi Li, Yitao Liang, Muhan Zhang, and Zilong Zheng, in NeurIPS D&B Track, 2024.
  3. An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding NeurIPS'24
    Tong Wu, Yanpeng Zhao, and Zilong Zheng, in NeurIPS, 2024.
  4. How to Synthesize Text Data without Model Collapse?
    Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, and Bowen Zhou.
  5. Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers
    Chao Lou, Zixia Jia, Zilong Zheng, and Kewei Tu.
  6. In-Context Editing: Learning Knowledge from Self-Induced Distributions
    Siyuan Qi, Bangcheng Yang, Kailin Jiang, Xiaobo Wang, Jiaqi Li, Yifan Zhong, Yaodong Yang, and Zilong Zheng.
  7. In situ bidirectional human-robot value alignment Science Robotics
    Luyao Yuan*, Xiaofeng Gao*, Zilong Zheng*, Mark Edmonds, Ying Nian Wu, Federico Rossano, Hongjing Lu, Yixin Zhu, and Song-Chun Zhu, Science Robotics, 2022.
  8. Patchwise Generative ConvNet: Training Energy-Based Models from a Single Natural Image for Internal Learning Oral CVPR'21
    Zilong Zheng, Jianwen Xie, and Ping Li, in CVPR, 2021.
  9. Reasoning Visual Dialogs with Structural and Partial Observations Oral CVPR'19
    Zilong Zheng*, Wenguan Wang*, Siyuan Qi*, and Song-Chun Zhu, in CVPR, 2019.
  10. Learning Descriptor Networks for 3D Shape Synthesis and Analysis Oral CVPR'18
    Jianwen Xie*, Zilong Zheng*, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, and Ying Nian Wu, in CVPR, 2018.