Zilong Zheng's Homepage

Preprint

VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges

Yuxuan Wang, Cihang Xie, Yang Liu, and Zilong Zheng^✉.

Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interactions. However, their high computational demands and the scarcity of annotated datasets limit their practicality for academic researchers. In this work, we introduce VideoLLaMB, a novel framework that utilizes temporal memory tokens within bridge layers to allow for the encoding of entire video sequences alongside historical visual data, effectively preserving semantic continuity and enhancing model performance across various tasks. This approach includes recurrent memory tokens and a SceneTilling algorithm, which segments videos into independent semantic units to preserve semantic integrity. Empirically, VideoLLaMB significantly outstrips existing video-language models, demonstrating a 5.5-point improvement over its competitors across three VideoQA benchmarks and a 2.06-point improvement on egocentric planning. Comprehensive results on MVBench show that VideoLLaMB-7B achieves markedly better results than previous 7B models built on the same LLM. Remarkably, it remains as robust as PLLaVA even as video length increases up to 8 times. In addition, the frame retrieval results on our specialized Needle in a Video Haystack (NIAVH) benchmark further validate VideoLLaMB's prowess in accurately identifying specific frames within lengthy videos. Our SceneTilling algorithm also enables the generation of streaming video captions directly, without necessitating additional training. In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU with linear GPU memory scaling, ensuring both high performance and cost-effectiveness, thereby setting a new foundation for long-form video-language models in both academic and practical applications.

@article{wang2024videollamb, title={VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges}, author={Wang, Yuxuan and Xie, Cihang and Liu, Yang and Zheng, Zilong}, journal = {arXiv preprint arXiv: 2409.01071}, year={2024} }
Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Chao Lou, Zixia Jia, Zilong Zheng^✉, and Kewei Tu^✉.

Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements inherent in self-attention mechanisms. In this work, we introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome these computational and memory obstacles while maintaining performance. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query, thereby enabling gradient-based optimization. As a result, SPARSEK Attention offers linear time complexity and constant memory footprint during generation. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods and provides significant speed improvements during both training and inference, particularly in language modeling and downstream tasks. Furthermore, our method can be seamlessly integrated into pre-trained Large Language Models (LLMs) with minimal fine-tuning, offering a practical solution for effectively managing long-range dependencies in diverse applications.
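
To make the selection step concrete, here is a minimal sketch of attention restricted to a constant number of KV pairs per query. It is not the SPARSEK operator from the abstract: the differentiable top-k mask is replaced by a plain hard top-k, the per-KV scores are assumed to be given rather than produced by a scoring network, and the names `select_topk_kv` and `sparse_attention` are hypothetical.

```python
import math


def select_topk_kv(scores, k):
    """Return the indices of the k highest-scoring KV pairs, in original order.

    Hard (non-differentiable) top-k, used here as a stand-in for the
    differentiable operator described in the abstract.
    """
    k = min(k, len(scores))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def sparse_attention(query, keys, values, scores, k):
    """Softmax attention from one query over only the k selected KV pairs."""
    idx = select_topk_kv(scores, k)
    scale = math.sqrt(len(query))
    logits = [sum(q * x for q, x in zip(query, keys[i])) / scale for i in idx]
    # Numerically stable softmax over the selected entries only.
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [
        sum(w / total * values[i][d] for w, i in zip(weights, idx))
        for d in range(dim)
    ]
    return idx, out
```

Once the selection is made, the attention itself touches only `k` entries regardless of sequence length. The sort used for selection in this sketch is a simplification and does not reflect the complexity claims in the abstract.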

@article{lou2024sparsek, title={Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers}, author={Lou, Chao and Jia, Zixia and Zheng, Zilong and Tu, Kewei}, journal = {arXiv preprint arXiv: 2406.16747}, year={2024} }
Large language models are in-context semantic reasoners rather than symbolic reasoners

Xiaojuan Tang*, Zilong Zheng*, Jiaqi Li, Fanxu Meng, Song-Chun Zhu, Yitao Liang, and Muhan Zhang.

The emergent few-shot reasoning capabilities of Large Language Models (LLMs) have excited the natural language and machine learning community over recent years. Despite numerous successful applications, the underlying mechanism of such in-context capabilities remains unclear. In this work, we hypothesize that the learned semantics of language tokens do most of the heavy lifting during the reasoning process. Unlike humans' symbolic reasoning process, the semantic representations of LLMs could create strong connections among tokens, thus composing a superficial logical chain. To test our hypothesis, we decouple semantics from the language reasoning process and evaluate three kinds of reasoning abilities, i.e., deduction, induction and abduction. Our findings reveal that semantics play a vital role in LLMs' in-context reasoning -- LLMs perform significantly better when semantics are consistent with commonsense but struggle to solve symbolic or counter-commonsense reasoning tasks by leveraging in-context new knowledge. The surprising observations question whether modern LLMs have mastered the inductive, deductive and abductive reasoning abilities as in human intelligence, and motivate research on unveiling the magic existing within the black-box LLMs. On the whole, our analysis provides a novel perspective on the role of semantics in developing and evaluating language models' reasoning abilities.

@article{tang2023icsr, title = {Large language models are in-context semantic reasoners rather than symbolic reasoners}, author = {Tang, Xiaojuan and Zheng, Zilong and Li, Jiaqi and Meng, Fanxu and Zhu, Song-Chun and Liang, Yitao and Zhang, Muhan}, year = {2023}, journal = {arXiv preprint arXiv: 2305.14825} }

2025

Lossless Acceleration of Ultra Long Sequence Generation ICML'25

Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, and Zilong Zheng^✉, in ICML, 2025.

Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model's inherent quality. Experimental results demonstrate that TOKENSWIFT achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at this URL.
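
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of plain greedy speculative decoding, not of TOKENSWIFT itself. The function name `speculative_decode` and the callables `target_next` and `draft_next` are hypothetical; each is assumed to map a token list to the next token. Real systems verify the whole draft in one batched forward pass, which this toy version replaces with sequential calls.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, draft_len=4):
    """Greedy speculative decoding with toy next-token functions.

    The result is identical to greedy decoding with `target_next` alone,
    which is the sense in which the scheme is lossless.
    """
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1) Draft a short continuation with the cheap model.
        draft = []
        for _ in range(min(draft_len, goal - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2) Verify the draft against the target model, token by token.
        for guess in draft:
            expected = target_next(tokens)
            tokens.append(expected)  # always the target's own choice
            if expected != guess:
                break  # first mismatch: discard the rest of the draft
    return tokens
```

Each round emits at least one token, and every emitted token is the target model's greedy choice, so a poor draft model costs speed but never changes the output.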

@inproceedings{wu2025tokenswift, title={Lossless Acceleration of Ultra Long Sequence Generation}, author={Wu, Tong and Shen, Junzhe and Jia, Zixia and Wang, Yuxuan and Zheng, Zilong}, booktitle = {Forty-Second International Conference on Machine Learning}, year={2025} }
How to Synthesize Text Data without Model Collapse? ICML'25

Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin^✉, Zilong Zheng^✉, and Bowen Zhou^✉, in ICML, 2025.

Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-{n} models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how can data be synthesized without model collapse? We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data to uncover a distributional shift phenomenon and an over-concentration of n-gram features. Inspired by the above findings, we propose token editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves data quality and enhances model performance.

@inproceedings{zhu2025toedit, title={How to Synthesize Text Data without Model Collapse?}, author={Zhu, Xuekai and Cheng, Daixuan and Li, Hengli and Zhang, Kaiyan and Hua, Ermo and Lv, Xingtai and Ding, Ning and Lin, Zhouhan and Zheng, Zilong and Zhou, Bowen}, booktitle = {Forty-Second International Conference on Machine Learning}, year={2025} }
MCU: An Evaluation Framework for Open-Ended Game Agents Spotlight ICML'25

Xinyue Zheng*, Haowei Lin*, Kaichen He, Zihao Wang, Zilong Zheng, and Yitao Liang, in ICML, 2025.

Developing AI agents capable of interacting with open-world environments to solve diverse tasks is a compelling challenge. However, evaluating such open-ended agents remains difficult, with current benchmarks facing scalability limitations. To address this, we introduce Minecraft Universe (MCU), a comprehensive evaluation framework set within the open-world video game Minecraft. MCU incorporates three key components: (1) an expanding collection of 3,452 composable atomic tasks that encompasses 11 major categories and 41 subcategories of challenges; (2) a task composition mechanism capable of generating infinitely many diverse tasks of varying difficulty; and (3) a general evaluation framework that achieves 91.5% alignment with human ratings for open-ended task assessment. Empirical results reveal that even state-of-the-art foundation agents struggle with the increasing diversity and complexity of tasks. These findings highlight the necessity of MCU as a robust benchmark to drive progress in AI agent development within open-ended environments.

@inproceedings{zheng2025mcu, title={MCU: An Evaluation Framework for Open-Ended Game Agents}, author={Zheng, Xinyue and Lin, Haowei and He, Kaichen and Wang, Zihao and Zheng, Zilong and Liang, Yitao}, booktitle = {Forty-Second International Conference on Machine Learning}, year={2025} }
Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs ICLR'25

Zhaowei Zhang, Fengshuo Bai, Qizhi Chen, Chengdong Ma, Mingzhi Wang, Haoran Sun, Zilong Zheng^✉, and Yaodong Yang^✉, in ICLR, 2025.

How to align large language models (LLMs) with user preferences from a static general dataset has been frequently studied. However, user preferences are usually personalized, changing, and diverse. This leads to the problem that the actual user preferences often do not coincide with those trained by the model developers in the practical use of LLMs. Since we cannot collect enough data and retrain for every demand, researching efficient real-time preference adaptation methods based on the backbone LLMs during test time is important. To this end, we introduce Amulet, a novel, training-free framework that formulates the decoding process of every token as a separate online learning problem with the guidance of simple user-provided prompts, thus enabling real-time optimization to satisfy users' personalized preferences. To reduce the computational cost brought by this optimization process for each token, we additionally provide a closed-form solution for each iteration step of the optimization process, thereby reducing the computational time cost to a negligible level. The detailed experimental results demonstrate that Amulet can achieve significant performance improvements in rich settings with combinations of different LLMs, datasets, and user preferences, while maintaining acceptable computational efficiency.

@inproceedings{zhang2025amulet, title={Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs}, author={Zhang, Zhaowei and Bai, Fengshuo and Chen, Qizhi and Ma, Chengdong and Wang, Mingzhi and Sun, Haoran and Zheng, Zilong and Yang, Yaodong}, booktitle={The Thirteenth International Conference on Learning Representations}, year={2025} }
MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge ICLR'25

Yuntao Du, Kailin Jiang, Zhi Gao, Chenrui Shi, Zilong Zheng^✉, Siyuan Qi, and Qing Li^✉, in ICLR, 2025.

Knowledge editing techniques have emerged as essential tools for updating the factual knowledge of large language models (LLMs) and multimodal models (LMMs), allowing them to correct outdated or inaccurate information without retraining from scratch. However, existing benchmarks for multimodal knowledge editing primarily focus on entity-level knowledge represented as simple triplets, which fail to capture the complexity of real-world multimodal information. To address this issue, we introduce MMKE-Bench, a comprehensive MultiModal Knowledge Editing Benchmark, designed to evaluate the ability of LMMs to edit diverse visual knowledge in real-world scenarios. MMKE-Bench addresses these limitations by incorporating three types of editing tasks: visual entity editing, visual semantic editing, and user-specific editing. Besides, MMKE-Bench uses free-form natural language to represent and edit knowledge, offering a more flexible and effective format. The benchmark consists of 2,940 pieces of knowledge and 7,229 images across 110 fine-grained types, with evaluation questions automatically generated and human-verified. We assess five state-of-the-art knowledge editing methods on three prominent LMMs, revealing that no method excels across all criteria, and that visual and user-specific edits are particularly challenging. MMKE-Bench sets a new standard for evaluating the robustness of multimodal knowledge editing techniques, driving progress in this rapidly evolving field.

@inproceedings{du2025mmke, title={MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge}, author={Du, Yuntao and Jiang, Kailin and Gao, Zhi and Shi, Chenrui and Zheng, Zilong and Qi, Siyuan and Li, Qing}, booktitle={The Thirteenth International Conference on Learning Representations}, year={2025} }
In-Context Editing: Learning Knowledge from Self-Induced Distributions ICLR'25

Siyuan Qi^✉, Bangcheng Yang, Kailin Jiang, Xiaobo Wang, Jiaqi Li, Yifan Zhong, Yaodong Yang, and Zilong Zheng^✉, in ICLR, 2025.

The existing fine-tuning paradigm for language models is brittle in knowledge editing scenarios, where the model must incorporate new information without extensive retraining. This brittleness often results in overfitting, reduced performance, and unnatural language generation. To address this, we propose Consistent In-Context Editing (ICE), a novel approach that leverages the model's in-context learning capability to tune toward a contextual distribution rather than a one-hot target. ICE introduces a straightforward optimization framework that includes both a target and a procedure, enhancing the robustness and effectiveness of gradient-based tuning methods. We provide analytical insights into ICE across four critical aspects of knowledge editing: accuracy, locality, generalization, and linguistic quality, showing its advantages. Experimental results across four datasets confirm the effectiveness of ICE and demonstrate its potential for continual editing, ensuring that updated information is incorporated while preserving the integrity of the model.

@inproceedings{qi2025ice, title={In-Context Editing: Learning Knowledge from Self-Induced Distributions}, author={Qi, Siyuan and Yang, Bangcheng and Jiang, Kailin and Wang, Xiaobo and Li, Jiaqi and Zhong, Yifan and Yang, Yaodong and Zheng, Zilong}, booktitle={The Thirteenth International Conference on Learning Representations}, year={2025} }
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts CVPR'25

Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, and Zilong Zheng^✉, in CVPR, 2025.

The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 real-world interactive videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enhance real-time interactive reasoning with minimum finetuning on pre-trained MLLMs. Extensive experimental results reveal that the existing MLLMs fall short in interactive streaming understanding, particularly struggling with proactive tasks and multi-turn queries. Our proposed M4, though lightweight, demonstrates a significant improvement in handling proactive tasks and real-time interactions.

@inproceedings{cvpr25omnimmi, title={OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts}, author={Wang, Yuxuan and Wang, Yueqian and Chen, Bo and Wu, Tong and Zhao, Dongyan and Zheng, Zilong}, booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)}, year={2025} }
DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints Oral AAAI'25

Andrew Zhao, Quentin Xu, Matthieu Liu, Shenzhi Wang, Yong-jin Liu, Zilong Zheng^✉, and Gao Huang^✉, in AAAI, 2025.

Recent advances in large language models (LLMs) have made them indispensable, raising significant concerns over managing their safety. Automated red teaming offers a promising alternative to the labor-intensive and error-prone manual probing for vulnerabilities, providing more consistent and scalable safety evaluations. However, existing approaches often compromise diversity by focusing on maximizing attack success rate. Additionally, methods that decrease the cosine similarity from historical embeddings with semantic diversity rewards lead to novelty stagnation as history grows. To address these issues, we introduce DiveR-CT, which relaxes conventional constraints on the objective and semantic reward, granting greater freedom for the policy to enhance diversity. Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that perform better in various diversity metrics across different attack success rate levels, 2) better enhancing the resiliency of blue team models through safety tuning based on collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward overoptimization. Project details and code can be found at https://andrewzh112.github.io/#diverct.

@article{zhao2025diverct, title={DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints}, author={Zhao, Andrew and Xu, Quentin and Liu, Matthieu and Wang, Shenzhi and Liu, Yong-jin and Zheng, Zilong and Huang, Gao}, journal={Proceedings of the AAAI Conference on Artificial Intelligence}, volume={39}, year={2025} }

2024

JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models TPAMI'24

Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, and Yitao Liang, TPAMI, 2024.

Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks could potentially be infinite and lack the capability to progressively enhance task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans are ultimately dispatched to the goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. JARVIS-1 is the most general existing agent in Minecraft, capable of completing over 200 different tasks using control and observation spaces similar to those of humans. These tasks range from short-horizon tasks, e.g., "chopping trees", to long-horizon tasks, e.g., "obtaining a diamond pickaxe". JARVIS-1 performs exceptionally well in short-horizon tasks, achieving nearly perfect performance. In the classic long-term task of ObtainDiamondPickaxe, JARVIS-1 surpasses the reliability of current state-of-the-art agents by 5 times and can successfully complete longer-horizon and more challenging tasks.

@article{wang2023jarvis1, title = {JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models}, author = {Zihao Wang and Shaofei Cai and Anji Liu and Yonggang Jin and Jinbing Hou and Bowei Zhang and Haowei Lin and Zhaofeng He and Zilong Zheng and Yaodong Yang and Xiaojian Ma and Yitao Liang}, year = {2023}, journal = {arXiv preprint arXiv: 2311.05997} }
MindDial: Belief Dynamics Tracking with Theory-of-Mind Modeling for Situated Neural Dialogue Generation Oral SIGDIAL'24

Shuwen Qiu, Mingdian Liu, Hengli Li, Song-Chun Zhu, and Zilong Zheng^✉, in SIGDIAL, 2024. (also in Workshop on Theory-of-Mind at ICML 2023)

Humans talk in daily conversations while aligning and negotiating the expressed meanings or common ground. Despite the impressive conversational abilities of large generative language models, they do not consider the individual differences in contextual understanding in a shared situated environment. In this work, we propose MindDial, a novel conversational framework that can generate situated free-form responses to align and negotiate common ground. We design an explicit mind module that can track three-level beliefs -- the speaker's belief, the speaker's prediction of the listener's belief, and the belief gap between the first two. Then the next response is generated to resolve the belief difference and take task-related action. Our framework is applied to both prompting and fine-tuning-based models, and is evaluated across scenarios involving both common ground alignment and negotiation. Experiments show that models with mind modeling can generate more human-like responses when aligning and negotiating common ground. The ablation study further validates that the three-level belief design can aggregate information and improve task outcomes in both cooperative and negotiating settings.

@inproceedings{qiu2023minddial, title={MindDial: Belief Dynamics Tracking with Theory-of-Mind Modeling for Situated Neural Dialogue Generation}, author={Qiu, Shuwen and Liu, Mingdian and Li, Hengli and Zhu, Song-Chun and Zheng, Zilong}, booktitle={Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue}, year={2024} }
Mars: Situated Inductive Reasoning in an Open-World Environment NeurIPS'24

Xiaojuan Tang, Jiaqi Li, Yitao Liang, Muhan Zhang, and Zilong Zheng^✉, in NeurIPS D&B Track, 2024.

Large Language Models (LLMs) trained on massive corpora have shown remarkable success in knowledge-intensive tasks. Yet, most of them rely on pre-stored knowledge. Inducing new general knowledge from a specific environment and performing reasoning with the acquired knowledge, i.e., situated inductive reasoning, is crucial and challenging for machine intelligence. In this paper, we design Mars, an interactive environment devised for situated inductive reasoning. It introduces counter-commonsense game mechanisms by modifying terrain, survival setting and task dependency while adhering to certain principles. In Mars, agents need to actively interact with their surroundings, derive useful rules and perform decision-making tasks in specific contexts. We conduct experiments on various RL-based and LLM-based methods, finding that they all struggle on this challenging situated inductive reasoning benchmark. Furthermore, we explore Induction from Reflection, where we instruct agents to perform inductive reasoning from historical trajectories. The superior performance underscores the importance of inductive reasoning in Mars. Through Mars, we aim to galvanize advancements in situated inductive reasoning and set the stage for developing the next generation of AI systems that can reason in an adaptive and context-sensitive way.

@inproceedings{tang2024mars, title={Mars: Situated Inductive Reasoning in an Open-World Environment}, author={Tang, Xiaojuan and Li, Jiaqi and Liang, Yitao and Zhang, Muhan and Zheng, Zilong}, booktitle={38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks}, year={2024} }
An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding NeurIPS'24

Tong Wu, Yanpeng Zhao, and Zilong Zheng^✉, in NeurIPS, 2024.

Recently, many methods have been developed to extend the context length of pre-trained large language models (LLMs), but they often require fine-tuning at the target length (>> 4K) and struggle to effectively utilize information from the middle part of the context. To address these issues, we propose Continuity-Relativity indExing with gAussian Middle (CREAM), which interpolates positional encodings by manipulating position indices. Apart from being simple, CREAM is training-efficient: it only requires fine-tuning at the pre-trained context window (e.g., Llama 2-4K) and can extend LLMs to a much longer target context length (e.g., 256K). To ensure that the model focuses more on the information in the middle, we introduce a truncated Gaussian to encourage sampling from the middle part of the context during fine-tuning, thus alleviating the "Lost-in-the-Middle" problem faced by long-context LLMs. Experimental results show that CREAM successfully extends LLMs to the target length for both Base and Chat versions of Llama2-7B with "Never Miss A Beat".
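
The middle-focused sampling idea can be illustrated with a small sketch. This is not CREAM's actual indexing scheme: it only shows one way to draw a position from a Gaussian centered on the middle of the target context and truncated to the valid range, using rejection sampling. The function name `sample_middle_position` and the `std_frac` parameter are assumptions made for illustration.

```python
import random


def sample_middle_position(target_len, std_frac=0.15, rng=None):
    """Draw a position in [0, target_len) from a truncated Gaussian.

    The Gaussian is centered on the middle of the context; draws that fall
    outside the valid range are rejected and redrawn.
    """
    rng = rng or random.Random()
    mean = (target_len - 1) / 2
    std = std_frac * target_len
    while True:
        x = rng.gauss(mean, std)
        if 0 <= x <= target_len - 1:
            return round(x)
```

With the default spread, most draws land in the central half of the context, so positions near the middle are seen more often than they would be under uniform sampling.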

@inproceedings{wu2024cream, title={An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding}, author={Wu, Tong and Zhao, Yanpeng and Zheng, Zilong}, booktitle = {Advances in Neural Information Processing Systems (NeurIPS)}, volume = {37}, year={2024} }
Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge for Long Video Understanding EMNLP'24

Yuxuan Wang, Yueqian Wang, Pengfei Wu, Jianxin Liang, Dongyan Zhao, Yang Liu, and Zilong Zheng^✉, in EMNLP, 2024.

Despite progress in multimodal large language models (MLLMs), the challenge of interpreting long-form videos in response to linguistic queries persists, largely due to the inefficiency in temporal grounding and limited pre-trained context window size. In this work, we introduce Temporal Grounding Bridge (TGB), a novel framework that bootstraps MLLMs with advanced temporal grounding capabilities and broadens their contextual scope. Our framework significantly enhances the temporal capabilities of current MLLMs through three key innovations: an efficient multi-span temporal grounding algorithm applied to low-dimension temporal features projected from flow; a multimodal length extrapolation training paradigm that utilizes low-dimension temporal features to extend the training context window size; and a bootstrapping framework that bridges our model with pluggable MLLMs without requiring annotation. We validate TGB across seven video benchmarks and demonstrate substantial performance improvements compared with prior MLLMs. Notably, our model, initially trained on sequences of four frames, effectively handles sequences up to 16x longer without sacrificing performance, highlighting its scalability and effectiveness in real-world applications. Our code is publicly available.

@inproceedings{wang2024videotgb, title={Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge}, author={Wang, Yuxuan and Wang, Yueqian and Wu, Pengfei and Liang, Jianxin and Zhao, Dongyan and Liu, Yang and Zheng, Zilong}, booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)}, year={2024} }
Varying Sentence Representations via Condition-Specified Routers EMNLP'24

Ziyong Lin, Quansen Wang, Zixia Jia^✉, and Zilong Zheng^✉, in EMNLP, 2024.

Semantic similarity between two sentences is inherently subjective and can vary significantly based on the specific aspects emphasized. Consequently, traditional sentence encoders must be capable of generating conditioned sentence representations that account for diverse conditions or aspects. In this paper, we propose a novel yet efficient framework based on transformer-based language models that facilitates advanced conditioned sentence representation while maintaining model parameters and computational efficiency. Empirical evaluations on the Conditional Semantic Textual Similarity task demonstrate the superiority of our proposed framework.

@inproceedings{lin2024csr, title={Varying Sentence Representations via Condition-Specified Routers}, author={Lin, Ziyong and Wang, Quansen and Jia, Zixia and Zheng, Zilong}, booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)}, year={2024} }
ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning CoLM'24

Yuxuan Wang, Alan Yuille, Zhuowan Li^✉, and Zilong Zheng^✉, in CoLM, 2024.

Compositional visual reasoning methods, which translate a complex query into a structured composition of feasible visual tasks, have exhibited a strong potential in complicated multimodal tasks like visual question answering, language-guided image editing, etc. Empowered by recent advances in large language models (LLMs), this multimodal challenge has been brought to a new stage by treating LLMs as few-shot/zero-shot planners, i.e., visual-language programming. Such methods, despite their numerous merits, suffer from challenges due to LLM planning mistakes or inaccuracy of visual execution modules, lagging behind the non-compositional models. In this work, we devise a "plug-and-play" method, ExoViP, to correct the errors at both the planning and execution stages through introspective verification. We employ verification modules as "exoskeletons" to enhance current vision-language programming schemes. Specifically, our proposed verification module utilizes a mixture of three sub-verifiers to validate predictions after each reasoning step, subsequently calibrating the visual module predictions and refining the reasoning trace planned by LLMs. Experimental results on two representative vision-language programming methods showcase consistent improvements on five compositional reasoning tasks on standard benchmarks. In light of this, we believe ExoViP can foster better performance and generalization on open-domain multimodal challenges.

@inproceedings{wang2024exovip, title={ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning}, author={Wang, Yuxuan and Yuille, Alan and Li, Zhuowan and Zheng, Zilong}, booktitle={The first Conference on Language Modeling (CoLM)}, year={2024} }
Boosting LLM Agents with Recursive Contemplation for Effective Deception Handling ACL'24

Shenzhi Wang, Chang Liu, Zilong Zheng^✉, Siyuan Qi, Shuo Chen, Qisen Yang, Andrew Zhao, Shaofei Wang, Shiji Song, and Gao Huang^✉, in ACL Findings, 2024.

Abs arXiv Code Bibtex

Recent breakthroughs in large language models (LLMs) have brought remarkable success in the field of LLM-as-Agent. Nevertheless, a prevalent assumption is that the information processed by LLMs is consistently honest, neglecting the pervasive deceptive or misleading information in human society and AI-generated content. This oversight makes LLMs susceptible to malicious manipulations, potentially resulting in detrimental outcomes. This study utilizes the intricate Avalon game as a testbed to explore LLMs' potential in deceptive environments. Avalon, full of misinformation and requiring sophisticated logic, manifests as a "Game-of-Thoughts". Inspired by the efficacy of humans' recursive thinking and perspective-taking in the Avalon game, we introduce a novel framework, Recursive Contemplation (ReCon), to enhance LLMs' ability to identify and counteract deceptive information. ReCon combines formulation and refinement contemplation processes; formulation contemplation produces initial thoughts and speech, while refinement contemplation further polishes them. Additionally, we incorporate first-order and second-order perspective transitions into these processes respectively. Specifically, the first-order allows an LLM agent to infer others' mental states, and the second-order involves understanding how others perceive the agent's mental state. After integrating ReCon with different LLMs, extensive experiment results from the Avalon game indicate its efficacy in aiding LLMs to discern and maneuver around deceptive information without extra fine-tuning and data. Finally, we offer a possible explanation for the efficacy of ReCon and explore the current limitations of LLMs in terms of safety, reasoning, speaking style, and format, potentially furnishing insights for subsequent research.

@inproceedings{wang2024avalon, title={Boosting LLM Agents with Recursive Contemplation for Effective Deception Handling}, author={Wang, Shenzhi and Liu, Chang and Zheng, Zilong and Qi, Siyuan and Chen, Shuo and Yang, Qisen and Zhao, Andrew and Wang, Shaofei and Song, Shiji and Huang, Gao}, booktitle={Findings of the Association for Computational Linguistics: ACL-Findings}, year={2024} }
LooGLE: Can Long-Context Language Models Understand Long Contexts? ACL'24

Jiaqi Li, Mengmeng Wang, Zilong Zheng^✉, and Muhan Zhang^✉, in ACL, 2024.

Abs arXiv Code Bibtex

Large language models (LLMs), despite their impressive performance in various language tasks, are typically limited to processing texts within context-window size. This limitation has spurred significant research efforts to enhance LLMs' long-context understanding with high-quality long-sequence benchmarks. However, prior datasets in this regard suffer from shortcomings, such as short context length compared to the context window of modern LLMs; outdated documents that have data leakage problems; and an emphasis on short dependency tasks rather than long dependency tasks. In this paper, we present LooGLE, a Long Context Generic Language Evaluation benchmark for LLMs' long context understanding. LooGLE features relatively new documents post-2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning diverse domains. Human annotators meticulously crafted more than 1,100 high-quality question-answer pairs to meet the long dependency requirements. These pairs underwent thorough cross-validation, yielding the most precise assessment of LLMs' long dependency capabilities. The evaluation of eight state-of-the-art LLMs on LooGLE revealed key findings: (i) commercial models outperformed open-sourced models; (ii) LLMs excelled in short dependency tasks like short question-answering and cloze tasks but struggled with more intricate long dependency tasks; (iii) in-context learning and chaining thoughts offered only marginal improvements; (iv) retrieval-based techniques demonstrated substantial benefits for short question-answering, while strategies for extending context window length had limited impact on long context understanding. As such, LooGLE not only provides a systematic and comprehensive evaluation schema on long-context LLMs, but also sheds light on future development of enhanced models towards "true long-context understanding".

@inproceedings{li2024loogle, title={LooGLE: Can Long-Context Language Models Understand Long Contexts?}, author={Li, Jiaqi and Wang, Mengmeng and Zheng, Zilong and Zhang, Muhan}, booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, year={2024} }
LangSuit⋅E: Controlling, Planning, and Interacting with Large Language Models in Embodied Text Environments ACL'24

Zixia Jia, Mengmeng Wang, Baichen Tong, Song-Chun Zhu, and Zilong Zheng^✉, in ACL Findings, 2024. (also in SpLU-RoboNLP Workshop at ACL 2024)

Abs arXiv Code Bibtex

Recent advances in Large Language Models (LLMs) have shown inspiring achievements in constructing autonomous agents that rely on language descriptions as inputs. However, it remains unclear how well LLMs can function as few-shot or zero-shot embodied agents in dynamic interactive environments. To address this gap, we introduce LangSuit·E, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds. Compared with previous LLM-based testbeds, LangSuit·E (i) offers adaptability to diverse environments without multiple simulation engines, (ii) evaluates agents’ capacity to develop “internalized world knowledge” with embodied observations, and (iii) allows easy customization of communication and action strategies. To address the embodiment challenge, we devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states w.r.t. history information. Comprehensive benchmark results illustrate challenges and insights of embodied planning. LangSuit·E represents a significant step toward building embodied generalists in the context of language models.

@inproceedings{jia2024langsuite, title={LangSuit$\cdot$E: Controlling, Planning, and Interacting with Large Language Models in Embodied Text Environments}, author={Jia, Zixia and Wang, Mengmeng and Tong, Baichen and Zhu, Song-Chun and Zheng, Zilong}, booktitle={Findings of the Association for Computational Linguistics: ACL-Findings 2024}, year={2024} }
Combining Supervised Learning and Reinforcement Learning for Multi-Label Classification Tasks with Partial Labels ACL'24

Zixia Jia, Junpeng Li, Shichuan Zhang, and Zilong Zheng^✉, in ACL, 2024.

Abs arXiv PDF Code Bibtex

Traditional supervised learning heavily relies on human-annotated datasets, especially in data-hungry neural approaches. However, various tasks, especially multi-label tasks like document-level relation extraction, pose challenges in fully manual annotation due to the specific domain knowledge and large class sets. Therefore, we address the multi-label positive-unlabelled learning (MLPUL) problem, where only a subset of positive classes is annotated. We propose Mixture Learner for Partially Annotated Classification (MLPAC), an RL-based framework combining the exploration ability of reinforcement learning and the exploitation ability of supervised learning. Experimental results across various tasks, including document-level relation extraction, multi-label image classification, and binary PU learning, demonstrate the generalization and effectiveness of our framework.

@inproceedings{jia2024combining, title={Combining Supervised Learning and Reinforcement Learning for Multi-Label Classification Tasks with Partial Labels}, author={Jia, Zixia and Li, Junpeng and Zhang, Shichuan and Zheng, Zilong}, booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, year={2024} }
MindAgent: Emergent Gaming Interaction

Ran Gong, Qiuyuan Huang, Xiaojian Ma, Hoi Vo, Zane Durante, Yusuke Noda, Zilong Zheng, Demetri Terzopoulos, Fei-Fei Li, and Jianfeng Gao, in NAACL Findings, 2024.

Abs arXiv Website Bibtex

Large Language Models (LLMs) have the capacity to perform complex scheduling in a multi-agent system and can coordinate these agents to complete sophisticated tasks that require extensive collaboration. However, despite the introduction of numerous gaming frameworks, the community has insufficient benchmarks for building general multi-agent collaboration infrastructure that encompasses both LLM and human-NPC collaborations. In this work, we propose a novel infrastructure, MindAgent, to evaluate emergent planning and coordination capabilities for gaming interaction. In particular, our infrastructure leverages an existing gaming framework to i) require understanding of the coordinator for a multi-agent system, ii) collaborate with human players via un-finetuned proper instructions, and iii) establish in-context learning on few-shot prompts with feedback. Furthermore, we introduce CUISINEWORLD, a new gaming scenario and related benchmark that dispatches a multi-agent collaboration efficiency and supervises multiple agents playing the game simultaneously. We conduct comprehensive evaluations with a new auto-metric, CoS, for calculating the collaboration efficiency. Finally, our infrastructure can be deployed into real-world gaming scenarios in a customized VR version of CUISINEWORLD and adapted to the existing broader Minecraft gaming domain. We hope our findings on LLMs and the new infrastructure for general-purpose scheduling and coordination can help shed light on how such skills can be obtained by learning from large language corpora.

@inproceedings{gong2024mindagent, title={MindAgent: Emergent Gaming Interaction}, author={Gong, Ran and Huang, Qiuyuan and Ma, Xiaojian and Vo, Hoi and Durante, Zane and Noda, Yusuke and Zheng, Zilong and Terzopoulos, Demetri and Li, Fei-Fei and Gao, Jianfeng}, booktitle={Findings of the North American Chapter of the Association for Computational Linguistics: NAACL-Findings}, year={2024} }

2023

ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab NeurIPS'23

Jieming Cui*, Ziren Gong*, Baoxiong Jia*, Siyuan Huang, Zilong Zheng^✉, Jianzhu Ma^✉, and Yixin Zhu^✉, in NeurIPS D&B Track, 2023.

Abs Code Website Bibtex

The challenge of replicating research results has posed a significant impediment to the field of molecular biology. The advent of modern intelligent systems has led to notable progress in various domains. Consequently, we embarked on an investigation of intelligent monitoring systems as a means of tackling the issue of the reproducibility crisis. Specifically, we first curate a comprehensive multimodal dataset, named ProBio, as an initial step towards this objective. This dataset comprises fine-grained hierarchical annotations intended for the purpose of studying activity understanding in Molecular Biology Lab (BioLab). Next, we devise two challenging benchmarks, transparent solution tracking and multimodal action recognition, to emphasize the unique characteristics and difficulties associated with activity understanding in BioLab settings. Finally, we provide a thorough experimental evaluation of contemporary video understanding models and highlight their limitations in this specialized domain to identify potential avenues for future research. We hope ProBio with associated benchmarks may garner increased focus on modern AI techniques in the realm of molecular biology.

@inproceedings{cui2023probio, title={ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab}, author={Cui, Jieming and Gong, Ziren and Jia, Baoxiong and Huang, Siyuan and Zheng, Zilong and Ma, Jianzhu and Zhu, Yixin}, booktitle={The Thirty-Seventh Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS D\&B 2023)}, year={2023} }
DiPlomat: A Dialogue Dataset for Situated Pragmatic Reasoning NeurIPS'23

Hengli Li, Song-Chun Zhu, and Zilong Zheng^✉, in NeurIPS D&B Track, 2023.

Abs arXiv Code Website Bibtex

Pragmatic reasoning plays a pivotal role in deciphering implicit meanings that frequently arise in real-life conversations and is essential for the development of communicative social agents. In this paper, we introduce a novel challenge, DiPlomat, aiming at benchmarking machines’ capabilities on pragmatic reasoning and situated conversational understanding. Compared with previous works that treat different figurative expressions (e.g. metaphor, sarcasm) as individual tasks, DiPlomat provides a cohesive framework towards general pragmatic understanding. Our dataset is created through the utilization of Amazon Mechanical Turk (AMT), resulting in a total of 4,177 multi-turn dialogues. In conjunction with the dataset, we propose two tasks, Pragmatic Identification and Reasoning (PIR) and Conversational Question Answering (CQA). Experimental results with state-of-the-art (SOTA) neural architectures reveal several significant findings: 1) large language models (LLMs) exhibit poor performance in tackling this subjective domain; 2) comprehensive comprehension of context emerges as a critical factor for establishing benign human-machine interactions; 3) current models are deficient in the application of pragmatic reasoning. As a result, we call for more attention to improving the ability of context understanding, reasoning, and implied meaning modeling.

@inproceedings{li2023diplomat, title={DiPlomat: A Dialogue Dataset for Situated Pragmatic Reasoning}, author={Li, Hengli and Zhu, Song-Chun and Zheng, Zilong}, booktitle={The Thirty-Seventh Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS D\&B 2023)}, year={2023} }
SQA3D: Situated Question Answering in 3D Scenes ICLR'23

Xiaojian Ma*, Silong Yong*, Zilong Zheng^✉, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang^✉, in ICLR, 2023.

Abs arXiv PDF Code Website Bibtex

We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation. Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations. These questions examine a wide spectrum of reasoning capabilities for an intelligent agent, ranging from spatial relation comprehension to commonsense understanding, navigation, and multi-hop reasoning. SQA3D imposes a significant challenge to current multi-modal especially 3D reasoning models. We evaluate various state-of-the-art approaches and find that the best one only achieves an overall score of 47.20%, while amateur human participants can reach 90.06%. We believe SQA3D could facilitate future embodied AI research with stronger situation understanding and reasoning capability.

@inproceedings{ma2022sqa3d, title={SQA3D: Situated Question Answering in 3D Scenes}, author={Ma, Xiaojian and Yong, Silong and Zheng, Zilong and Li, Qing and Liang, Yitao and Zhu, Song-Chun and Huang, Siyuan}, booktitle={International Conference on Learning Representations}, year={2023}, url={https://openreview.net/forum?id=IDJx97BC38} }
Semi-automatic Data Enhancement for Document-Level Relation Extraction with Distant Supervision from Large Language Models EMNLP'23

Junpeng Li*, Zixia Jia*, and Zilong Zheng^✉, in EMNLP, 2023.

Abs arXiv PDF Code Bibtex

Document-level Relation Extraction (DocRE), which aims to extract relations from a long context, is a critical challenge in achieving fine-grained structural comprehension and generating interpretable document representations. Inspired by recent advances in in-context learning capabilities emergent from large language models (LLMs), such as ChatGPT, we aim to design an automated annotation method with minimum human effort. Unfortunately, vanilla in-context learning is infeasible for document-level Relation Extraction (RE) due to the large number of predefined fine-grained relation types and the uncontrolled generations of LLMs. To tackle this issue, we propose a method integrating a large language model (LLM) and a natural language inference (NLI) module to generate external relation triples, thereby augmenting document-level relation datasets. We demonstrate the effectiveness of our approach by introducing an enhanced dataset known as DocGNRE, which excels in re-annotating numerous long-tail relation types. We are confident that our method holds the potential for broader applications in domain-specific relation type definitions and offers tangible benefits in advancing generalized language semantic comprehension.

@inproceedings{li2023docngre, title={Semi-automatic Data Enhancement for Document-Level Relation Extraction with Distant Supervision from Large Language Models}, author={Li, Junpeng and Jia, Zixia and Zheng, Zilong}, booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)}, year={2023} }
VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions ACL'23

Yuxuan Wang, Zilong Zheng^✉, Xueliang Zhao, Jinpeng Li, Yueqian Wang, and Dongyan Zhao^✉, in ACL, 2023.

Abs arXiv PDF Code Website Bibtex

Video-grounded dialogue understanding is a challenging problem that requires machines to perceive, parse and reason over situated semantics extracted from weakly aligned video and dialogues. Most existing benchmarks treat both modalities the same as a frame-independent visual understanding task, while neglecting the intrinsic attributes in multimodal dialogues, such as scene and topic transitions. In this paper, we present the Video-grounded Scene&Topic AwaRe dialogue (VSTAR) dataset, a large-scale video-grounded dialogue understanding dataset based on 395 TV series. Based on VSTAR, we propose two benchmarks for video-grounded dialogue understanding: scene segmentation and topic segmentation, and one benchmark for video-grounded dialogue generation. Comprehensive experiments are performed on these benchmarks to demonstrate the importance of multimodal information and segments in video-grounded dialogue understanding and generation.

@inproceedings{wang2023vstar, title={VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions}, author={Wang, Yuxuan and Zheng, Zilong and Zhao, Xueliang and Li, Jinpeng and Wang, Yueqian and Zhao, Dongyan}, booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)}, year={2023} }
Modeling Instance Interactions for Joint Information Extraction with Neural High-Order Conditional Random Field ACL'23

Zixia Jia, Zhaohui Yan, Wenjuan Han, Zilong Zheng^✉, and Kewei Tu^✉, in ACL, 2023.

Abs PDF Code Bibtex

Prior works on joint Information Extraction (IE) typically model instance (e.g., event triggers, entities, roles, relations) interactions by representation enhancement, type dependencies scoring, or global decoding. We find that the previous models generally consider binary type dependency scoring of a pair of instances, and leverage local search such as beam search to approximate global solutions. To better integrate cross-instance interactions, in this work, we introduce a joint IE framework (CRFIE) that formulates joint IE as a high-order Conditional Random Field. Specifically, we design binary factors and ternary factors to directly model interactions between not only a pair of instances but also triplets. Then, these factors are utilized to jointly predict labels of all instances. To address the intractability problem of exact high-order inference, we incorporate a high-order neural decoder that is unfolded from a mean-field variational inference method, which achieves consistent learning and inference. The experimental results show that our approach achieves consistent improvements on three IE tasks compared with our baseline and prior work.

@inproceedings{jia2023joint, title={Modeling Instance Interactions for Joint Information Extraction with Neural High-Order Conditional Random Field}, author={Jia, Zixia and Yan, Zhaohui and Han, Wenjuan and Zheng, Zilong and Tu, Kewei}, booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)}, year={2023} }
Shuō Wén Jiě Zì: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training ACL'23

Yuxuan Wang, Jianghui Wang, Dongyan Zhao^✉, and Zilong Zheng^✉, in ACL-Findings, 2023.

Abs arXiv Code Bibtex

We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters. We name the two core modules of CDBERT Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries and Jiezi refers to the process of enhancing characters' glyph representations with structure understanding. To facilitate dictionary understanding, we propose three pre-training tasks, i.e., Masked Entry Modeling, Contrastive Learning for Synonym and Antonym, and Example Learning. We evaluate our method on both the modern Chinese understanding benchmark CLUE and the ancient Chinese benchmark CCLUE. Moreover, we propose a new polysemy discrimination task, PolyMRC, based on the collected dictionary of ancient Chinese. Our paradigm demonstrates consistent improvements over previous Chinese PLMs across all tasks. Moreover, our approach yields a significant boost in the few-shot setting of ancient Chinese understanding.

@inproceedings{wang2023shuo, title={Shu\={o} W\'{e}n Ji\v{e} Z\`{i}: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training}, author={Wang, Yuxuan and Wang, Jianghui and Zhao, Dongyan and Zheng, Zilong}, booktitle={Findings of the Association for Computational Linguistics: ACL-Findings}, year={2023} }

2022

In situ bidirectional human-robot value alignment ScienceRobotics

Luyao Yuan*^✉, Xiaofeng Gao*, Zilong Zheng*, Mark Edmonds, Ying Nian Wu, Federico Rossano, Hongjing Lu^✉, Yixin Zhu^✉, and Song-Chun Zhu^✉, Science Robotics, 2022.

Abs Supp Code Video Website Bibtex TechXplore Science and Technology Daily/Xinhuanet

A prerequisite for social coordination is bidirectional communication between teammates, each playing two roles simultaneously: as receptive listeners and expressive speakers. For robots working with humans in complex situations with multiple goals that differ in importance, failure to fulfill the expectation of either role could undermine group performance due to misalignment of values between humans and robots. Specifically, a robot needs to serve as an effective listener to infer human users’ intents from instructions and feedback and as an expressive speaker to explain its decision processes to users. Here, we investigate how to foster effective bidirectional human-robot communications in the context of value alignment—collaborative robots and users form an aligned understanding of the importance of possible task goals. We propose an explainable artificial intelligence (XAI) system in which a group of robots predicts users’ values by taking in situ feedback into consideration while communicating their decision processes to users through explanations. To learn from human feedback, our XAI system integrates a cooperative communication model for inferring human values associated with multiple desirable goals. To be interpretable to humans, the system simulates human mental dynamics and predicts optimal explanations using graphical models. We conducted psychological experiments to examine the core components of the proposed computational framework. Our results show that real-time human-robot mutual understanding in complex cooperative tasks is achievable with a learning model based on bidirectional communication. We believe that this interaction framework can shed light on bidirectional value alignment in communicative XAI systems and, more broadly, in future human-machine teaming systems. An explainable artificial intelligence collaboration framework enables in situ bidirectional human-robot value alignment.

@article{ doi:10.1126/scirobotics.abm4183, author = {Luyao Yuan and Xiaofeng Gao and Zilong Zheng and Mark Edmonds and Ying Nian Wu and Federico Rossano and Hongjing Lu and Yixin Zhu and Song-Chun Zhu }, title = {In situ bidirectional human-robot value alignment}, journal = {Science Robotics}, volume = {7}, number = {68}, pages = {eabm4183}, year = {2022}, doi = {10.1126/scirobotics.abm4183}, URL = {https://www.science.org/doi/abs/10.1126/scirobotics.abm4183}, eprint = {https://www.science.org/doi/pdf/10.1126/scirobotics.abm4183} }
SHARP: Search-Based Adversarial Attack for Structured Prediction NAACL'22

Liwen Zhang, Zixia Jia, Wenjuan Han, Zilong Zheng, and Kewei Tu, in NAACL Findings, 2022.

Abs PDF Bibtex

Understanding what we genuinely mean instead of what we literally say in conversations is challenging for both humans and machines; yet, this direction is mostly left untouched in modern open-ended dialogue systems. To fill in this gap, we present a grammar-based dialogue dataset, GRICE, designed to bring implicature into pragmatic reasoning in the context of conversations. Our design of GRICE also incorporates other essential aspects of modern dialogue modeling (e.g., coreference). The entire dataset is systematically generated using a hierarchical grammar model, such that each dialogue context has intricate implicatures and is temporally consistent. We further present two tasks, the implicature recovery task followed by the pragmatic reasoning task in conversation, to evaluate the model's reasoning capability. In experiments, we adopt baseline methods that claimed to have pragmatics reasoning capability; the results show a large performance gap between baseline methods and human performance. After integrating a simple module that explicitly reasons about implicature, the model shows an overall performance boost in conversational reasoning. These observations demonstrate the significance of implicature recovery for open-ended dialogue reasoning and call for future research in conversational implicature and conversational reasoning.

@inproceedings{zhang2022sharp, title={SHARP: Search-Based Adversarial Attack for Structured Prediction}, author={Zhang, Liwen and Jia, Zixia and Han, Wenjuan and Zheng, Zilong and Tu, Kewei}, booktitle={Findings of the Association for Computational Linguistics: NAACL-Findings}, year={2022} }
VGStore: A Multimodal Extension to SPARQL for Querying RDF Scene Graph ISWC'22

Yanzeng Li, Zilong Zheng, Wenjuan Han, and Lei Zou, in ISWC Poster & Demo Track, 2022.

Abs arXiv Bibtex

Semantic Web technology has successfully facilitated many RDF models with rich data representation methods. It also has the potential to represent and store multimodal knowledge bases such as multimodal scene graphs. However, most existing query languages, especially SPARQL, barely explore implicit multimodal relationships like semantic similarity, spatial relations, etc. We first explored this issue by organizing a large-scale scene graph dataset, namely Visual Genome, in an RDF graph database. Based on the proposed RDF-stored multimodal scene graph, we extended SPARQL queries to answer questions containing relational reasoning about color, spatial relations, etc. A further demo (i.e., VGStore) shows the effectiveness of customized queries and of displaying multimodal data.

@inproceedings{vgstore22iswc, title={VGStore: A Multimodal Extension to SPARQL for Querying RDF Scene Graph}, author={Li, Yanzeng and Zheng, Zilong and Han, Wenjuan and Zou, Lei}, booktitle={The 21st International Semantic Web Conference (ISWC) Poster \& Demo Track}, year={2022} }
Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling Oral ICLR'22

Bo Wan, Wenjuan Han, Zilong Zheng, and Tinne Tuytelaars, in ICLR, 2022.

Abs PDF Bibtex

We introduce a new task, unsupervised vision-language (VL) grammar induction. Given an image-caption pair, the goal is to extract a shared hierarchical structure for both image and language simultaneously. We argue that such structured output, grounded in both modalities, is a clear step towards the high-level understanding of multimodal information. Besides challenges existing in conventional visually grounded grammar induction tasks, VL grammar induction requires a model to capture contextual semantics and perform a fine-grained alignment. To address these challenges, we propose a novel method, CLIORA, which constructs a shared vision-language constituency tree structure with context-dependent semantics for all constituents in different levels of the tree. It computes a matching score between each constituent and image region, trained via contrastive learning. It integrates two levels of fusion, namely at feature-level and at score-level, so as to allow fine-grained alignment. We introduce a new evaluation metric: Critical Concept Recall Rate (CCRR) to explicitly evaluate VL grammar induction, and show a 2.6% improvement over a strong baseline on Flickr30k Entities. We also evaluate our model via two derived tasks, i.e., language grammar induction and phrase grounding, and improve over the state-of-the-art for both.

@inproceedings{wan2022unsupervised, title={Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling}, author={Wan, Bo and Han, Wenjuan and Zheng, Zilong and Tuytelaars, Tinne}, booktitle={The Tenth International Conference on Learning Representations (ICLR)}, year={2022} }
Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships CVPR'22

Chao Lou*, Wenjuan Han, Yuhuan Lin, and Zilong Zheng*, in CVPR, 2022.

Abs arXiv Bibtex

Understanding realistic visual scene images together with language descriptions is a fundamental task towards generic visual understanding. Previous works have shown compelling comprehensive results by building hierarchical structures for visual scenes (e.g., scene graphs) and natural languages (e.g., dependency trees), individually. However, how to construct a joint vision-language (VL) structure has barely been investigated. More challenging but worthwhile, we introduce a new task that targets inducing such a joint VL structure in an unsupervised manner. Our goal is to bridge the visual scene graphs and linguistic dependency trees seamlessly. Due to the lack of VL structural data, we start by building a new dataset, VLParse. Rather than using labor-intensive labeling from scratch, we propose an automatic alignment procedure to produce coarse structures, followed by human refinement to produce high-quality ones. Moreover, we benchmark our dataset by proposing a contrastive learning (CL)-based framework, VLGAE, short for Vision-Language Graph Autoencoder. Our model obtains superior performance on two derived tasks, i.e., language grammar induction and VL phrase grounding. Ablations show the effectiveness of both visual cues and dependency relationships on fine-grained VL structure construction.

@inproceedings{lou2022unsupervised, title={Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships}, author={Lou, Chao and Han, Wenjuan and Lin, Yuhuan and Zheng, Zilong}, booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)}, year={2022} }
Energy-Based Generative Cooperative Saliency Prediction Oral AAAI'22

Jing Zhang, Jianwen Xie, Zilong Zheng, and Nick Barnes, in AAAI, 2022.

Abs arXiv Code Bibtex

Conventional saliency prediction models typically learn a deterministic mapping from images to the corresponding ground truth saliency maps. In this paper, we study the saliency prediction problem from the perspective of generative models by learning a conditional probability distribution over saliency maps given an image, and treating the prediction as a sampling process. Specifically, we propose a generative cooperative saliency prediction framework based on the generative cooperative networks, where a conditional latent variable model and a conditional energy-based model are jointly trained to predict saliency in a cooperative manner. We call our model the SalCoopNets. The latent variable model serves as a fast but coarse predictor to efficiently produce an initial prediction, which is then refined by the iterative Langevin revision of the energy-based model that serves as a fine predictor. Such a coarse-to-fine cooperative saliency prediction strategy offers the best of both worlds. Moreover, we generalize our framework to the scenario of weakly supervised saliency prediction, where saliency annotation of training images is partially observed, by proposing a cooperative learning while recovering strategy. Lastly, we show that the learned energy function can serve as a refinement module that can refine the results of other pretrained saliency prediction models. Experimental results show that our generative model can achieve state-of-the-art performance.

@article{zhang2022energy, title = {Energy-Based Generative Cooperative Saliency Prediction}, author = {Zhang, Jing and Xie, Jianwen and Zheng, Zilong and Barnes, Nick}, journal={The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI)}, year = {2022} }

2021

Cooperative Training of Fast Thinking Initializer and Slow Thinking Solver for Multi-Modal Conditional Learning TPAMI

Jianwen Xie*, Zilong Zheng*, Xiaolin Fang, Song-Chun Zhu, and Ying Nian Wu, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021.

Abs arXiv PDF Bibtex

This paper studies the supervised learning of the conditional distribution of a high-dimensional output given an input, where the output and input may belong to two different modalities, e.g., the output is a photo image and the input is a sketch image. We solve this problem by cooperative training of a fast thinking initializer and a slow thinking solver. The initializer generates the output directly by a non-linear transformation of the input as well as a noise vector that accounts for latent variability in the output. The slow thinking solver learns an objective function in the form of a conditional energy function, so that the output can be generated by optimizing the objective function, or more rigorously by sampling from the conditional energy-based model. We propose to learn the two models jointly, where the fast thinking initializer serves to initialize the sampling of the slow thinking solver, and the solver refines the initial output by an iterative algorithm. The solver learns from the difference between the refined output and the observed output, while the initializer learns from how the solver refines its initial output. We demonstrate the effectiveness of the proposed method on various multi-modal conditional learning tasks, e.g., class-to-image generation, image-to-image translation, and image recovery.

@article{xie2021cooperative, title={Cooperative Training of Fast Thinking Initializer and Slow Thinking Solver for Multi-Modal Conditional Learning}, author={Xie, Jianwen and Zheng, Zilong and Fang, Xiaolin and Zhu, Song-Chun and Wu, Ying Nian}, journal={IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)}, abbr={TPAMI}, year={2021} }
Learning Triadic Belief Dynamics in Nonverbal Communication from Videos Oral CVPR'21

Lifeng Fan, Shuwen Qiu, Zilong Zheng, Tao Gao, Song-Chun Zhu, and Yixin Zhu, in CVPR, 2021.

Abs arXiv PDF Supp Code Video Bibtex

Humans possess a unique social cognition capability; nonverbal communication can convey rich social information among agents. In contrast, such crucial social characteristics are mostly missing in the existing scene understanding literature. In this paper, we incorporate different nonverbal communication cues (e.g., gaze, human poses, and gestures) to represent, model, learn, and infer agents’ mental states from pure visual inputs. Crucially, such a mental representation takes the agent’s belief into account so that it represents what the true world state is and infers the beliefs in each agent’s mental state, which may differ from the true world states. By aggregating different beliefs and true world states, our model essentially forms “five minds” during the interactions between two agents. This “five minds” model differs from prior works that infer beliefs in an infinite recursion; instead, agents’ beliefs are converged into a “common mind”. Based on this representation, we further devise a hierarchical energy-based model that jointly tracks and predicts all five minds. From this new perspective, a social event is interpreted by a series of nonverbal communication and belief dynamics, which transcends the classic keyframe video summary. In the experiments, we demonstrate that using such a social account provides a better video summary on videos with rich social interactions compared with state-of-the-art keyframe video summary methods.

@inproceedings{fan2021learning, title = {Learning Triadic Belief Dynamics in Nonverbal Communication from Videos}, author = {Lifeng Fan and Shuwen Qiu and Zilong Zheng and Tao Gao and Song-Chun Zhu and Yixin Zhu}, year = {2021}, booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)} }
Patchwise Generative ConvNet: Training Energy-Based Models from a Single Natural Image for Internal Learning Oral CVPR'21

Zilong Zheng, Jianwen Xie, and Ping Li, in CVPR, 2021.

Abs PDF Supp Code Website Bibtex

Exploiting internal statistics of a single natural image has long been recognized as a significant research paradigm where the goal is to learn the internal distribution of patches within the image without relying on external training data. Different from prior works that model such a distribution implicitly with a top-down latent variable model (e.g., generator), this paper proposes to explicitly represent the statistical distribution within a single natural image by using an energy-based generative framework, where a pyramid of energy functions, each parameterized by a bottom-up deep neural network, are used to capture the distributions of patches at different resolutions. Meanwhile, a coarse-to-fine sequential training and sampling strategy is presented to train the model efficiently. Besides learning to generate random samples from white noise, the model can learn in parallel with a self-supervised task (e.g., recover the input image from its corrupted version), which can further improve the descriptive power of the learned model. The proposed model is simple and natural in that it does not require an auxiliary model (e.g., discriminator) to assist the training. Besides, it also unifies internal statistics learning and image generation in a single framework. Experimental results presented on various image generation and manipulation tasks, including super-resolution, image editing, harmonization, style transfer, etc., have demonstrated the effectiveness of our model for internal learning.

@inproceedings{zheng2021patchgencn, title={Patchwise Generative ConvNet: Training Energy-Based Models from a Single Natural Image for Internal Learning}, author={Zheng, Zilong and Xie, Jianwen and Li, Ping}, booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)}, year={2021} }
Generative PointNet: Deep Energy-Based Learning on Unordered Point Sets for 3D Generation, Reconstruction and Classification CVPR'21

Jianwen Xie, Yifei Xu, Zilong Zheng, Song-Chun Zhu, and Ying Nian Wu, in CVPR, 2021.

Abs arXiv PDF Website Bibtex

We propose a generative model of unordered point sets, such as point clouds, in the form of an energy-based model, where the energy function is parameterized by an input-permutation-invariant bottom-up neural network. The energy function learns a coordinate encoding of each point and then aggregates all individual point features into energy for the whole point cloud. We show that our model can be derived from the discriminative PointNet. The model can be trained by MCMC-based maximum likelihood learning (as well as its variants), without the help of any assisting networks like those in GANs and VAEs. Unlike most point cloud generators, which rely on hand-crafted distance metrics, our model does not rely on a hand-crafted distance metric for point cloud generation, because it synthesizes point clouds by matching observed examples in terms of statistical properties defined by the energy function. Furthermore, we can learn a short-run MCMC toward the energy-based model as a flow-like generator for point cloud reconstruction and interpretation. The learned point cloud representation can also be useful for point cloud classification. Experiments demonstrate the advantages of the proposed generative model of point clouds.

@inproceedings{xie2021GPointent, title={Generative PointNet: Deep Energy-Based Learning on Unordered Point Sets for 3D Generation, Reconstruction and Classification}, author={Xie, Jianwen and Xu, Yifei and Zheng, Zilong and Zhu, Song-Chun and Wu, Ying Nian}, booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)}, year={2021} }
GRICE: A Grammar-based Dataset for Recovering Implicature and Conversational rEasoning ACL'21

Zilong Zheng, Shuwen Qiu, Lifeng Fan, Yixin Zhu, and Song-Chun Zhu, in ACL Findings, 2021.

Abs PDF Code Bibtex

Understanding what we genuinely mean instead of what we literally say in conversations is challenging for both humans and machines; yet, this direction is mostly left untouched in modern open-ended dialogue systems. To fill in this gap, we present a grammar-based dialogue dataset, GRICE, designed to bring implicature into pragmatic reasoning in the context of conversations. Our design of GRICE also incorporates other essential aspects of modern dialogue modeling (e.g., coreference). The entire dataset is systematically generated using a hierarchical grammar model, such that each dialogue context has intricate implicatures and is temporally consistent. We further present two tasks, the implicature recovery task followed by the pragmatic reasoning task in conversation, to evaluate the model's reasoning capability. In experiments, we adopt baseline methods that claimed to have pragmatics reasoning capability; the results show a large performance gap between baseline methods and human performance. After integrating a simple module that explicitly reasons about implicature, the model shows an overall performance boost in conversational reasoning. These observations demonstrate the significance of implicature recovery for open-ended dialogue reasoning and call for future research in conversational implicature and conversational reasoning.

@inproceedings{zheng2021grice, title={GRICE: A Grammar-based Dataset for Recovering Implicature and Conversational Reasoning}, author={Zheng, Zilong and Qiu, Shuwen and Fan, Lifeng and Zhu, Yixin and Zhu, Song-Chun}, booktitle={Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021}, year={2021}, pages = {2074--2085} }
Learning Energy-Based Model with Variational Auto-Encoder as Amortized Sampler AAAI'21

Jianwen Xie, Zilong Zheng, and Ping Li, in AAAI, 2021.

Abs arXiv Bibtex

Due to the intractable partition function, training energy-based models (EBMs) by maximum likelihood requires Markov chain Monte Carlo (MCMC) sampling to approximate the gradient of the Kullback–Leibler divergence between data and model distributions. However, it is non-trivial to sample from an EBM because of the difficulty of mixing between modes. In this paper, we propose to learn a variational auto-encoder (VAE) to initialize the finite-step MCMC, such as Langevin dynamics that is derived from the energy function, for efficient amortized sampling of the EBM. With these amortized MCMC samples, the EBM can be trained by maximum likelihood, which follows an “analysis by synthesis” scheme; while the variational auto-encoder learns from these MCMC samples via variational Bayes. We call this joint training algorithm the variational MCMC teaching, in which the VAE chases the EBM toward data distribution. We interpret the learning algorithm as a dynamic alternating projection in the context of information geometry. Our proposed models can generate samples comparable to GANs and EBMs. Additionally, we demonstrate that our models can learn effective probabilistic distribution toward supervised conditional learning experiments.

@article{xie2021vaeebm, title={Learning Energy-Based Model with Variational Auto-Encoder as Amortized Sampler}, author={Xie, Jianwen and Zheng, Zilong and Li, Ping}, journal={The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI)}, year={2021} }
Learning Cycle-Consistent Cooperative Networks via Alternating MCMC Teaching for Unsupervised Cross-Domain Translation AAAI'21

Jianwen Xie*, Zilong Zheng*, Xiaolin Fang, Song-Chun Zhu, and Ying Nian Wu, in AAAI, 2021.

Abs arXiv PDF Website Bibtex

This paper studies the unsupervised cross-domain translation problem by proposing a generative framework, in which the probability distribution of each domain is represented by a generative cooperative network that consists of an energy-based model and a latent variable model. The use of a generative cooperative network enables maximum likelihood learning of the domain model by MCMC teaching, where the energy-based model seeks to fit the data distribution of the domain and distills its knowledge to the latent variable model via MCMC. Specifically, in the MCMC teaching process, the latent variable model parameterized by an encoder-decoder maps examples from the source domain to the target domain, while the energy-based model further refines the mapped results by Langevin revision such that the revised results match the examples in the target domain in terms of the statistical properties, which are defined by the learned energy function. For the purpose of building up a correspondence between two unpaired domains, the proposed framework simultaneously learns a pair of cooperative networks with cycle consistency, accounting for a two-way translation between two domains, by alternating MCMC teaching. Experiments show that the proposed framework is useful for unsupervised image-to-image translation and unpaired image sequence translation.

@article{xie2021cycle, title={Learning Cycle-Consistent Cooperative Networks via Alternating MCMC Teaching for Unsupervised Cross-Domain Translation}, author={Xie, Jianwen and Zheng, Zilong and Fang, Xiaolin and Zhu, Song-Chun and Wu, Ying Nian}, journal={The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI)}, year={2021} }

2020

Generative VoxelNet: Learning Energy-Based Models for 3D Shape Synthesis and Analysis TPAMI

Jianwen Xie*, Zilong Zheng*, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, and Ying Nian Wu, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020.

Abs PDF Website Bibtex

3D data that contains rich geometry information of objects and scenes is a valuable asset for understanding the 3D physical world. With the recent emergence of large-scale 3D datasets, it becomes increasingly crucial to have a powerful 3D generative model for 3D shape synthesis and analysis. This paper proposes a 3D shape descriptor network, which is a deep 3D convolutional energy-based model, for representing volumetric shape patterns. The maximum likelihood training of the model follows an “analysis by synthesis” scheme. The benefits of the proposed model are five-fold: first, unlike GANs and VAEs, the training of the model does not rely on any auxiliary models; second, the model can synthesize realistic 3D shapes by sampling from the probability distribution via MCMC, such as Langevin dynamics; third, the conditional version of the model can be applied to 3D object recovery and super-resolution; fourth, the model can be used to train a 3D generator network via MCMC teaching; fifth, the unsupervisedly trained model provides a powerful feature extractor for 3D data, which can be useful for 3D object classification. Experiments demonstrate that the proposed model can generate high-quality 3D shape patterns and can be useful for a wide variety of 3D shape analysis tasks.

@article{xie2020gvoxelnet, title={Generative VoxelNet: Learning Energy-Based Models for 3D Shape Synthesis and Analysis}, author= {Xie, Jianwen and Zheng, Zilong and Gao, Ruiqi and Wang, Wenguan and Zhu, Song-Chun and Wu, Ying Nian}, journal={IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)}, year={2020} }
Joint Inference of States, Robot Knowledge, and Human (False-)Beliefs ICRA'20

Tao Yuan, Hangxin Liu, Lifeng Fan, Zilong Zheng, Tao Gao, Yixin Zhu, and Song-Chun Zhu, in ICRA, 2020.

Abs PDF Video Bibtex

Aiming to understand how human (false-)belief, a core socio-cognitive ability, would affect human interactions with robots, this paper proposes to adopt a graphical model to unify the representation of object states, robot knowledge, and human (false-)beliefs. Specifically, a parse graph (PG) is learned from a single-view spatiotemporal parsing by aggregating various object states over time; such a learned representation is accumulated as the robot’s knowledge. An inference algorithm is derived to fuse the individual PGs from all robots across multiple views into a joint PG, which affords more effective reasoning and inference capability to overcome the errors originating from a single view. In the experiments, through the joint inference over PGs, the system correctly recognizes human (false-)belief in various settings and achieves better cross-view accuracy on a challenging small object tracking dataset.

@inproceedings{yuan2020joint, title={Joint Inference of States, Robot Knowledge, and Human (False-)Beliefs}, author={Yuan, Tao and Liu, Hangxin and Fan, Lifeng and Zheng, Zilong and Gao, Tao and Zhu, Yixin and Zhu, Song-Chun}, booktitle={Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)}, year={2020} }
Motion-Based Generator Model: Unsupervised Disentanglement of Appearance, Trackable and Intrackable Motions in Dynamic Patterns Oral AAAI'20

Jianwen Xie*, Ruiqi Gao*, Zilong Zheng, Song-Chun Zhu, and Ying Nian Wu, in AAAI, 2020.

Abs arXiv Code Website Bibtex

Dynamic patterns are characterized by complex spatial and motion patterns. Understanding dynamic patterns requires a disentangled representational model that separates the factorial components. A commonly used model for dynamic patterns is the state space model, where the state evolves over time according to a transition model and the state generates the observed image frames according to an emission model. To model the motions explicitly, it is natural for the model to be based on the motions or the displacement fields of the pixels. Thus in the emission model, we let the hidden state generate the displacement field, which warps the trackable component in the previous image frame to generate the next frame while adding a simultaneously emitted residual image to account for the change that cannot be explained by the deformation. The warping of the previous image is about the trackable part of the change of image frame, while the residual image is about the intrackable part of the image. We use a maximum likelihood algorithm to learn the model parameters that iterates between inferring latent noise vectors that drive the transition model and updating the parameters given the inferred latent vectors. Meanwhile, we adopt a regularization term to penalize the norms of the residual images to encourage the model to explain the change of image frames by trackable motion. Unlike existing methods on dynamic patterns, we learn our model in an unsupervised setting without ground truth displacement fields or optical flows. In addition, our model defines a notion of intrackability by the separation of the warped component and the residual component in each image frame. We show that our method can synthesize realistic dynamic patterns and disentangle appearance, trackable and intrackable motions. The learned models can be useful for motion transfer, and it is natural to adopt them to define and measure the intrackability of a dynamic pattern.

@article{xie2020motion, title={Motion-Based Generator Model: Unsupervised Disentanglement of Appearance, Trackable and Intrackable Motions in Dynamic Patterns}, author={Xie, Jianwen and Gao, Ruiqi and Zheng, Zilong and Zhu, Song-Chun and Wu, Ying Nian}, journal={The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI)}, year={2020} }

2019

Reasoning Visual Dialogs with Structural and Partial Observations Oral CVPR'19

Zilong Zheng*, Wenguan Wang*, Siyuan Qi*, and Song-Chun Zhu, in CVPR, 2019.

Abs arXiv Code Bibtex

We propose a novel model to address the task of Visual Dialog, which exhibits complex dialog structures. To obtain a reasonable answer based on the current question and the dialog history, the underlying semantic dependencies between dialog entities are essential. In this paper, we explicitly formalize this task as inference in a graphical model with partially observed nodes and unknown graph structures (relations in dialog). The given dialog entities are viewed as the observed nodes. The answer to a given question is represented by a node with a missing value. We first introduce an Expectation Maximization algorithm to infer both the underlying dialog structures and the missing node values (desired answers). Based on this, we proceed to propose a differentiable graph neural network (GNN) solution that approximates this process. Experimental results on the VisDial and VisDial-Q datasets show that our model outperforms comparative methods. It is also observed that our method can infer the underlying dialog structure for better dialog reasoning.

@inproceedings{zheng2019reasoning, title={Reasoning Visual Dialogs with Structural and Partial Observations}, author={Zheng, Zilong and Wang, Wenguan and Qi, Siyuan and Zhu, Song-Chun}, booktitle={Computer Vision and Pattern Recognition (CVPR), 2019 IEEE Conference on}, year={2019} }
Learning Dynamic Generator Model by Alternating Back-Propagation Through Time Spotlight AAAI'19

Jianwen Xie*, Ruiqi Gao*, Zilong Zheng, Song-Chun Zhu, and Ying Nian Wu, in AAAI, 2019.

Abs arXiv Code Website Bibtex

This paper studies the dynamic generator model for spatial-temporal processes such as dynamic textures and action sequences in video data. In this model, each time frame of the video sequence is generated by a generator model, which is a non-linear transformation of a latent state vector, where the non-linear transformation is parametrized by a top-down neural network. The sequence of latent state vectors follows a non-linear auto-regressive model, where the state vector of the next frame is a non-linear transformation of the state vector of the current frame as well as an independent noise vector that provides randomness in the transition. The non-linear transformation of this transition model can be parametrized by a feedforward neural network. We show that this model can be learned by an alternating back-propagation through time algorithm that iteratively samples the noise vectors and updates the parameters in the transition model and the generator model. We show that our training method can learn realistic models for dynamic textures and action patterns.

@article{xie2019DG, title = {Learning Dynamic Generator Model by Alternating Back-Propagation Through Time}, author = {Xie, Jianwen and Gao, Ruiqi and Zheng, Zilong and Zhu, Song-Chun and Wu, Ying Nian}, journal={The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI)}, year = {2019} }

2018

Learning Descriptor Networks for 3D Shape Synthesis and Analysis Oral CVPR'18

Jianwen Xie*, Zilong Zheng*, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, and Ying Nian Wu, in CVPR, 2018.

Abs arXiv Code Website Bibtex

This paper proposes a 3D shape descriptor network, which is a deep convolutional energy-based model, for modeling volumetric shape patterns. The maximum likelihood training of the model follows an “analysis by synthesis” scheme and can be interpreted as a mode seeking and mode shifting process. The model can synthesize 3D shape patterns by sampling from the probability distribution via MCMC such as Langevin dynamics. The model can be used to train a 3D generator network via MCMC teaching. The conditional version of the 3D shape descriptor net can be used for 3D object recovery and 3D object super-resolution. Experiments demonstrate that the proposed model can generate realistic 3D shape patterns and can be useful for 3D shape analysis.

@inproceedings{xie2018learning, title={Learning Descriptor Networks for 3D Shape Synthesis and Analysis}, author={Xie, Jianwen and Zheng, Zilong and Gao, Ruiqi and Wang, Wenguan and Zhu, Song-Chun and Wu, Ying Nian}, booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)}, pages={8629--8638}, year={2018} }
