Publications
*: Equal contribution
#: Corresponding author
2025
- Navi2Gaze: Leveraging Foundation Models for Navigation and Target Gazing IROS'25 Oral Jun Zhu, Zihao Du, Haotian Xu, Fengbo Lan, Zilong Zheng, Bo Ma, Shengjie Wang, and Tao Zhang, in IROS, 2025.
Task-aware navigation continues to be a challenging area of research, especially in scenarios involving open vocabulary. Previous studies primarily focus on finding suitable locations for task completion, often overlooking the importance of the robot's pose. However, the robot's orientation is crucial for successfully completing tasks because of how objects are arranged (e.g., to open a refrigerator door). Humans intuitively navigate to objects with the right orientation using semantics and common sense. For instance, when opening a refrigerator, we naturally stand in front of it rather than to the side. Recent advances suggest that Vision-Language Models (VLMs) can provide robots with similar common sense. Therefore, we develop a VLM-driven method called Navigation-to-Gaze (Navi2Gaze) for efficient navigation and object gazing based on task descriptions. This method uses the VLM to score and select the best pose from numerous candidates automatically. In evaluations on multiple photorealistic simulation benchmarks, Navi2Gaze significantly outperforms existing approaches by precisely determining the optimal orientation relative to target objects, resulting in a 68.8% reduction in Distance to Goal (DTG).
@inproceedings{zhu2025navi2gaze, title={Navi2Gaze: Leveraging Foundation Models for Navigation and Target Gazing}, author={Jun Zhu and Zihao Du and Haotian Xu and Fengbo Lan and Zilong Zheng and Bo Ma and Shengjie Wang and Tao Zhang}, booktitle={IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, year={2025} }
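The pose-selection loop sketched in the abstract above (sample candidate robot poses around the target, let a VLM score each candidate for the task, keep the best) can be pictured with a minimal sketch. The `vlm_score` hook and the circle-sampling heuristic are illustrative assumptions, not the paper's actual pipeline.

```python
import math
import random
from typing import Callable, Tuple, List

Pose = Tuple[float, float, float]  # (x, y, yaw)

def sample_candidate_poses(target_xy: Tuple[float, float],
                           radius: float = 1.0,
                           n: int = 36) -> List[Pose]:
    """Sample poses on a circle around the target, each oriented to face it."""
    tx, ty = target_xy
    poses = []
    for i in range(n):
        theta = 2 * math.pi * i / n
        x, y = tx + radius * math.cos(theta), ty + radius * math.sin(theta)
        yaw = math.atan2(ty - y, tx - x)  # gaze direction toward the target
        poses.append((x, y, yaw))
    return poses

def select_best_pose(target_xy: Tuple[float, float], task: str,
                     vlm_score: Callable[[Pose, str], float]) -> Pose:
    """Score every candidate with the (hypothetical) VLM scorer and keep the best."""
    candidates = sample_candidate_poses(target_xy)
    return max(candidates, key=lambda pose: vlm_score(pose, task))

if __name__ == "__main__":
    # Stand-in scorer: prefer poses roughly in front of an object placed at (2, 0).
    dummy_score = lambda pose, task: -abs(math.atan2(pose[1], pose[0] - 2.0)) + 0.1 * random.random()
    print(select_best_pose((2.0, 0.0), "open the refrigerator", dummy_score))
```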
- In-situ Value-aligned Human-Robot Interactions with Physical Constraints IROS'25 Oral Hongtao Li, Ziyuan Jiao, Xiaofeng Liu, Hangxin Liu, and Zilong Zheng, in IROS, 2025.
Equipped with large language models (LLMs), human-centered robots are now capable of performing a wide range of tasks that were previously deemed challenging or unattainable. However, merely completing tasks is insufficient for cognitive robots, who should learn and apply human preferences to future scenarios. In this work, we propose a framework that combines human preferences with physical constraints, requiring robots to complete tasks while considering both. Firstly, we developed a benchmark of everyday household activities, which are often evaluated based on specific preferences. We then introduced in-context learning from human feedback (ICLHF), where human feedback comes from direct instructions and adjustments made intentionally or unintentionally in daily life. Extensive sets of experiments, testing the ICLHF to generate task plans and balance physical constraints with preferences, have demonstrated the efficiency of our approach.
@inproceedings{li2025iclhf, title={In-situ Value-aligned Human-Robot Interactions with Physical Constraints}, author={Hongtao Li and Ziyuan Jiao and Xiaofeng Liu and Hangxin Liu and Zilong Zheng}, booktitle={IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, year={2025} }
- How to Synthesize Text Data without Model Collapse? ICML'25 Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin#, Zilong Zheng#, and Bowen Zhou#, in ICML, 2025.
Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-{n} models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how to synthesize data without model collapse? We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data to uncover distributional shift phenomenon and over-concentration of n-gram features. Inspired by the above findings, we propose token editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves data quality and enhances model performance.
@inproceedings{zhu2025toedit, title={How to Synthesize Text Data without Model Collapse?}, author={Zhu, Xuekai and Cheng, Daixuan and Li, Hengli and Zhang, Kaiyan and Hua, Ermo and Lv, Xingtai and Ding, Ning and Lin, Zhouhan and Zheng, Zilong and Zhou, Bowen}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, year={2025} }
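A minimal sketch of the token-level editing idea described above: instead of training on fully model-generated text, resample individual tokens of human-written text wherever a prior language model is overly confident, yielding semi-synthetic data. GPT-2 as the prior and the 0.99 threshold are assumptions for demonstration, not the paper's exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed demo setup; the prior model and threshold in the paper may differ.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def token_edit(text: str, p_threshold: float = 0.99) -> str:
    """Resample tokens whose probability under the prior LM exceeds a threshold,
    turning human-written text into 'semi-synthetic' data."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    logits = lm(ids.unsqueeze(0)).logits[0]            # [T, V], next-token logits
    probs = torch.softmax(logits, dim=-1)
    edited = ids.clone()
    for t in range(1, len(ids)):                       # token t is predicted from prefix < t
        p_next = probs[t - 1]
        if p_next[ids[t]] > p_threshold:               # overly predictable token
            edited[t] = torch.multinomial(p_next, 1)   # resample from the prior's distribution
    return tok.decode(edited, skip_special_tokens=True)

print(token_edit("The quick brown fox jumps over the lazy dog."))
```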
- MCU: An Evaluation Framework for Open-Ended Game Agents ICML'25 Xinyue Zheng, Haowei Lin, Kaichen He, Zihao Wang, Zilong Zheng, and Yitao Liang, in ICML, 2025.
Developing AI agents capable of interacting with open-world environments to solve diverse tasks is a compelling challenge. However, evaluating such open-ended agents remains difficult, with current benchmarks facing scalability limitations. To address this, we introduce Minecraft Universe (MCU), a comprehensive evaluation framework set within the open-world video game Minecraft. MCU incorporates three key components: (1) an expanding collection of 3,452 composable atomic tasks that encompasses 11 major categories and 41 subcategories of challenges; (2) a task composition mechanism capable of generating infinite diverse tasks with varying difficulty; and (3) a general evaluation framework that achieves 91.5% alignment with human ratings for open-ended task assessment. Empirical results reveal that even state-of-the-art foundation agents struggle with the increasing diversity and complexity of tasks. These findings highlight the necessity of MCU as a robust benchmark to drive progress in AI agent development within open-ended environments.
@inproceedings{zheng2025mcu, title={MCU: An Evaluation Framework for Open-Ended Game Agents}, author={Zheng, Xinyue and Lin, Haowei and He, Kaichen and Wang, Zihao and Zheng, Zilong and Liang, Yitao}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, year={2025} }
- Lossless Acceleration of Ultra Long Sequence Generation ICML'25 Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, and Zilong Zheng, in ICML, 2025.
Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model's inherent quality. Experimental results demonstrate that TOKENSWIFT achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at this URL.
@inproceedings{wu2025tokenswift, title={Lossless Acceleration of Ultra Long Sequence Generation}, author={Wu, Tong and Shen, Junzhe and Jia, Zixia and Wang, Yuxuan and Zheng, Zilong}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, year={2025} }
- Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs ICLR'25 Zhaowei Zhang, Fengshuo Bai, Qizhi Chen, Chengdong Ma, Mingzhi Wang, Haoran Sun, Zilong Zheng#, and Yaodong Yang#, in ICLR, 2025.
How to align large language models (LLMs) with user preferences from a static general dataset has been frequently studied. However, user preferences are usually personalized, changing, and diverse. This leads to the problem that the actual user preferences often do not coincide with those trained by the model developers in the practical use of LLMs. Since we cannot collect enough data and retrain for every demand, researching efficient real-time preference adaptation methods based on the backbone LLMs during test time is important. To this end, we introduce Amulet, a novel, training-free framework that formulates the decoding process of every token as a separate online learning problem with the guidance of simple user-provided prompts, thus enabling real-time optimization to satisfy users' personalized preferences. To reduce the computational cost brought by this optimization process for each token, we additionally provide a closed-form solution for each iteration step of the optimization process, thereby reducing the computational time cost to a negligible level. The detailed experimental results demonstrate that Amulet can achieve significant performance improvements in rich settings with combinations of different LLMs, datasets, and user preferences, while maintaining acceptable computational efficiency.
@inproceedings{zhang2025amulet, title={Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs}, author={Zhang, Zhaowei and Bai, Fengshuo and Chen, Qizhi and Ma, Chengdong and Wang, Mingzhi and Sun, Haoran and Zheng, Zilong and Yang, Yaodong}, booktitle={The Thirteenth International Conference on Learning Representations}, year={2025} }
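One way to picture the per-token test-time optimization described above is as an exponential tilt of the base next-token distribution toward a preference-prompted one. The sketch below shows that generic closed form; the single-step update and the blending weight `beta` are illustrative assumptions, not the paper's exact closed-form solution.

```python
import torch

def amulet_style_step(base_logits: torch.Tensor,
                      pref_logits: torch.Tensor,
                      beta: float = 1.0) -> torch.Tensor:
    """Illustrative closed-form step: tilt the base distribution by the advantage of
    the preference-prompted model, log p ∝ log p_base + beta * (log p_pref - log p_base)."""
    base_logp = torch.log_softmax(base_logits, dim=-1)
    pref_logp = torch.log_softmax(pref_logits, dim=-1)
    tilted = base_logp + beta * (pref_logp - base_logp)
    return torch.softmax(tilted, dim=-1)

# Toy usage with a 5-token vocabulary: sample the next token from the tilted distribution.
base = torch.randn(5)   # logits from the plain prompt
pref = torch.randn(5)   # logits from the prompt prefixed with the user's preference
next_token = torch.multinomial(amulet_style_step(base, pref, beta=0.5), 1)
print(next_token.item())
```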
- In-Context Editing: Learning Knowledge from Self-Induced Distributions ICLR'25 Siyuan Qi#, Bangcheng Yang, Kailin Jiang, Xiaobo Wang, Jiaqi Li, Yifan Zhong, Yaodong Yang, and Zilong Zheng#, in ICLR, 2025.
The existing fine-tuning paradigm for language models is brittle in knowledge editing scenarios, where the model must incorporate new information without extensive retraining. This brittleness often results in overfitting, reduced performance, and unnatural language generation. To address this, we propose Consistent In-Context Editing (ICE), a novel approach that leverages the model's in-context learning capability to tune toward a contextual distribution rather than a one-hot target. ICE introduces a straightforward optimization framework that includes both a target and a procedure, enhancing the robustness and effectiveness of gradient-based tuning methods. We provide analytical insights into ICE across four critical aspects of knowledge editing: accuracy, locality, generalization, and linguistic quality, showing its advantages. Experimental results across four datasets confirm the effectiveness of ICE and demonstrate its potential for continual editing, ensuring that updated information is incorporated while preserving the integrity of the model.
@inproceedings{qi2025ice, title={In-Context Editing: Learning Knowledge from Self-Induced Distributions}, author={Qi, Siyuan and Yang, Bangcheng and Jiang, Kailin and Wang, Xiaobo and Li, Jiaqi and Zhong, Yifan and Yang, Yaodong and Zheng, Zilong}, booktitle={The Thirteenth International Conference on Learning Representations}, year={2025} }
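A rough sketch of tuning toward a self-induced contextual distribution rather than a one-hot target, as described above: sample continuations with the new knowledge in-context, then maximize their likelihood (together with the edit target) without the context. GPT-2, the sampling settings, and the whole-sequence loss are assumptions for illustration only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def ice_style_loss(context: str, query: str, target: str, n_samples: int = 4) -> torch.Tensor:
    """Pull p(. | query) toward the self-induced distribution p(. | context + query)."""
    # 1) Sample continuations from the in-context distribution (no gradients needed here).
    with torch.no_grad():
        prompt_ids = tok(context + query, return_tensors="pt").input_ids
        samples = model.generate(prompt_ids, do_sample=True, max_new_tokens=16,
                                 num_return_sequences=n_samples,
                                 pad_token_id=tok.eos_token_id)
        continuations = [tok.decode(s[prompt_ids.shape[1]:], skip_special_tokens=True)
                         for s in samples]
    # 2) Maximize their likelihood *without* the context, plus the edit target itself.
    #    (For simplicity this sketch computes the loss over the whole sequence.)
    losses = []
    for cont in continuations + [target]:
        ids = tok(query + cont, return_tensors="pt").input_ids
        losses.append(model(ids, labels=ids).loss)
    return torch.stack(losses).mean()

loss = ice_style_loss("New fact: The Eiffel Tower is in Rome. ",
                      "Where is the Eiffel Tower? It is in", " Rome.")
loss.backward()
```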
- MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge ICLR'25 Yuntao Du*, Kailin Jiang*, Zhi Gao, Chenrui Shi, Zilong Zheng#, Siyuan Qi, and Qing Li#, in ICLR, 2025.
Knowledge editing techniques have emerged as essential tools for updating the factual knowledge of large language models (LLMs) and multimodal models (LMMs), allowing them to correct outdated or inaccurate information without retraining from scratch. However, existing benchmarks for multimodal knowledge editing primarily focus on entity-level knowledge represented as simple triplets, which fail to capture the complexity of real-world multimodal information. To address this issue, we introduce MMKE-Bench, a comprehensive MultiModal Knowledge Editing Benchmark, designed to evaluate the ability of LMMs to edit diverse visual knowledge in real-world scenarios. MMKE-Bench addresses these limitations by incorporating three types of editing tasks: visual entity editing, visual semantic editing, and user-specific editing. Besides, MMKE-Bench uses free-form natural language to represent and edit knowledge, offering a more flexible and effective format. The benchmark consists of 2,940 pieces of knowledge and 7,229 images across 110 fine-grained types, with evaluation questions automatically generated and human-verified. We assess five state-of-the-art knowledge editing methods on three prominent LMMs, revealing that no method excels across all criteria, and that visual and user-specific edits are particularly challenging. MMKE-Bench sets a new standard for evaluating the robustness of multimodal knowledge editing techniques, driving progress in this rapidly evolving field.
@inproceedings{du2025mmke, title={MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge}, author={Du, Yuntao and Jiang, Kailin and Gao, Zhi and Shi, Chenrui and Zheng, Zilong and Qi, Siyuan and Li, Qing}, booktitle={The Thirteenth International Conference on Learning Representations}, year={2025} }
- Probing and Inducing Combinational Creativity in Vision-Language Models CogSci'25 Oral Yongqian Peng*, Yuxi Ma*, Mengmeng Wang, Yuxuan Wang, Yizhou Wang, Chi Zhang, Yixin Zhu#, and Zilong Zheng#, in CogSci, 2025.
The ability to combine existing concepts into novel ideas stands as a fundamental hallmark of human intelligence. Recent advances in VLMs like GPT-4V and DALL-E 3 have sparked debate about whether their outputs reflect combinational creativity---defined by M. A. Boden (1998) as synthesizing novel ideas through combining existing concepts---or sophisticated pattern matching of training data. Drawing inspiration from cognitive science, we investigate the combinational creativity of VLMs from the lens of concept blending. We propose the Identification-Explanation-Implication (IEI) framework, which decomposes creative processes into three levels: identifying input spaces, extracting shared attributes, and deriving novel semantic implications. To validate this framework, we curate CreativeMashup, a high-quality dataset of 666 artist-generated visual mashups annotated according to the IEI framework. Through extensive experiments, we demonstrate that in comprehension tasks, the best VLMs have surpassed average human performance while falling short of expert-level understanding; in generation tasks, incorporating our IEI framework into the generation pipeline significantly enhances the creative quality of VLM outputs. Our findings establish both a theoretical foundation for evaluating artificial creativity and practical guidelines for improving creative generation in VLMs.
@inproceedings{peng2025creativity, title={Probing and Inducing Combinational Creativity in Vision-Language Models}, author={Peng, Yongqian and Ma, Yuxi and Wang, Mengmeng and Wang, Yuxuan and Wang, Yizhou and Zhang, Chi and Zhu, Yixin and Zheng, Zilong}, booktitle={The 47th Annual Meeting of the Cognitive Science Society (CogSci)}, year={2025} }
- OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts CVPR'25 Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, and Zilong Zheng, in CVPR, 2025.
The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 real-world interactive videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enhance real-time interactive reasoning with minimum finetuning on pre-trained MLLMs. Extensive experimental results reveal that the existing MLLMs fall short in interactive streaming understanding, particularly struggling with proactive tasks and multi-turn queries. Our proposed M4, though lightweight, demonstrates a significant improvement in handling proactive tasks and real-time interactions.
@inproceedings{cvpr25omnimmi, title={OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts}, author={Wang, Yuxuan and Wang, Yueqian and Chen, Bo and Wu, Tong and Zhao, Dongyan and Zheng, Zilong}, booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)}, year={2025} }
- Look Both Ways and No Sink: Converting LLMs into Text Encoders without Training ACL'25 Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang, in ACL, 2025.
Recent advancements have demonstrated the advantage of converting pretrained large language models into powerful text encoders by enabling bidirectional attention in transformer layers. However, existing methods often require extensive training on large-scale datasets, posing challenges in domain-specific scenarios. In this work, we show that a domain-specific pretrained large language model can be converted into a strong domain-specific text encoder without additional training. We first conduct a comprehensive empirical study to investigate different conversion strategies and identify the impact of the attention sink phenomenon on the performance of converted encoder models. Based on our findings, we propose a novel approach that enables bidirectional attention and suppresses the attention sink phenomenon, resulting in superior performance. Extensive experiments on multiple domains demonstrate the effectiveness of our approach. Our work provides new insights into the training-free conversion of text encoders in low-resource scenarios and contributes to the advancement of domain-specific text representation generation.
@inproceedings{lin2025converting, title={Look Both Ways and No Sink: Converting LLMs into Text Encoders without Training}, author={Li, Jiaqi and Wang, Mengmeng and Zheng, Zilong and Zhang, Muhan}, booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, year={2025} }
- ReflectEvo: Improving Meta Introspection of Small LLMs by Learning Self-Reflection ACL'25 Jiaqi Li, Xinyi Dong, Yang Liu, Zhizhuo Yang, Quansen Wang, Xiaobo Wang, Song-Chun Zhu, Zixia Jia#, and Zilong Zheng#, in ACL Findings, 2025.
We present a novel pipeline, ReflectEvo, to demonstrate that small language models (SLMs) can enhance meta introspection through reflection learning. This process iteratively generates self-reflection for self-training, fostering a continuous and self-evolving process. Leveraging this pipeline, we construct ReflectEvo-460k, a large-scale, comprehensive, self-generated reflection dataset with broadened instructions and diverse multi-domain tasks. Building upon this dataset, we demonstrate the effectiveness of reflection learning to improve SLMs' reasoning abilities using SFT and DPO with remarkable performance, substantially boosting Llama-3 from 52.4% to 71.2% and Mistral from 44.4% to 71.1%. It validates that ReflectEvo can rival or even surpass the reasoning capability of the three prominent open-sourced models on BIG-bench without distillation from superior models or fine-grained human annotation. We further conduct a deeper analysis of the high quality of self-generated reflections and their impact on error localization and correction. Our work highlights the potential of continuously enhancing the reasoning performance of SLMs through iterative reflection learning in the long run.
@inproceedings{li2025reflectevo, title={ReflectEvo: Improving Meta Introspection of Small LLMs by Learning Self-Reflection}, author={Li, Jiaqi and Dong, Xinyi and Liu, Yang and Yang, Zhizhuo and Wang, Quansen and Wang, Xiaobo and Zhu, Song-Chun and Jia, Zixia and Zheng, Zilong}, booktitle={Findings of the Association for Computational Linguistics: ACL-Findings 2025}, year={2025} }
- Are the Values of LLMs Structurally Aligned with Humans? A Causal Perspective ACL'25 Yipeng Kang, Junqi Wang, Yexin Li, Mengmeng Wang, Wenming Tu, Quansen Wang, Hengli Li, Tingjun Wu, Xue Feng, Fangwei Zhong, and Zilong Zheng#, in ACL Findings, 2025.
As large language models (LLMs) become increasingly integrated into critical applications, aligning their behavior with human values presents significant challenges. Current methods, such as Reinforcement Learning from Human Feedback (RLHF), typically focus on a limited set of coarse-grained values and are resource-intensive. Moreover, the correlations between these values remain implicit, leading to unclear explanations for value-steering outcomes. Our work argues that a latent causal value graph underlies the value dimensions of LLMs and that, despite alignment training, this structure remains significantly different from human value systems. We leverage these causal value graphs to guide two lightweight value-steering methods: role-based prompting and sparse autoencoder (SAE) steering, effectively mitigating unexpected side effects. Furthermore, SAE provides a more fine-grained approach to value steering. Experiments on Gemma-2B-IT and Llama3-8B-IT demonstrate the effectiveness and controllability of our methods.
@inproceedings{kang2025valuecausal, title={Are the Values of LLMs Structurally Aligned with Humans? A Causal Perspective}, author={Kang, Yipeng and Wang, Junqi and Li, Yexin and Wang, Mengmeng and Tu, Wenming and Wang, Quansen and Li, Hengli and Wu, Tingjun and Feng, Xue and Zhong, Fangwei and Zheng, Zilong}, booktitle={Findings of the Association for Computational Linguistics: ACL-Findings 2025}, year={2025} }
- DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints AAAI'25 Oral Andrew Zhao, Quentin Xu, Matthieu Lin, Shenzhi Wang, Yong-jin Liu, Zilong Zheng#, and Gao Huang#, in AAAI, 2025.
Recent advances in large language models (LLMs) have made them indispensable, raising significant concerns over managing their safety. Automated red teaming offers a promising alternative to the labor-intensive and error-prone manual probing for vulnerabilities, providing more consistent and scalable safety evaluations. However, existing approaches often compromise diversity by focusing on maximizing attack success rate. Additionally, methods that decrease the cosine similarity from historical embeddings with semantic diversity rewards lead to novelty stagnation as history grows. To address these issues, we introduce DiveR-CT, which relaxes conventional constraints on the objective and semantic reward, granting greater freedom for the policy to enhance diversity. Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that perform better in various diversity metrics across different attack success rate levels, 2) better-enhancing resiliency in blue team models through safety tuning based on collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward overoptimization. Project details and code can be found at https://andrewzh112.github.io/#diverct.
@article{zhao2025diverct, title={DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints}, author={Zhao, Andrew and Xu, Quentin and Lin, Matthieu and Wang, Shenzhi and Liu, Yong-jin and Zheng, Zilong and Huang, Gao}, journal={Proceedings of the AAAI Conference on Artificial Intelligence}, volume={39}, year={2025} }
- Absolute Zero: Reinforced Self-play Reasoning with Zero Data Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng#, and Gao Huang#, Preprint, 2025.
@misc{zhao2025absolutezeroreinforcedselfplay, title={Absolute Zero: Reinforced Self-play Reasoning with Zero Data}, author={Andrew Zhao and Yiran Wu and Yang Yue and Tong Wu and Quentin Xu and Yang Yue and Matthieu Lin and Shenzhi Wang and Qingyun Wu and Zilong Zheng and Gao Huang}, year={2025}, eprint={2505.03335}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2505.03335}, }
- RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling Yang Liu, Jiaqi Li, and Zilong Zheng#, Preprint, 2025.
Rule-based reasoning has been acknowledged as one of the fundamental problems in reasoning, while deviations in rule formats, types, and complexity in real-world applications pose severe challenges. Recent studies have shown that large reasoning models (LRMs) have remarkable reasoning capabilities, and their performance is substantially enhanced by reinforcement learning (RL). However, it remains an open question whether small reasoning models (SRMs) can learn rule-based reasoning effectively with robust generalization across diverse tasks and domains. To address this, we introduce Reinforced Rule-based Reasoning, a.k.a. RuleReasoner, a simple yet effective method to conduct rule-based reasoning via a wide collection of curated tasks and a novel domain-aware dynamic sampling approach. Specifically, RuleReasoner resamples each training batch by updating the sampling weights of different domains based on historical rewards. This facilitates domain augmentation and flexible online learning schedules for RL, obviating the need for pre-hoc human-engineered mix-training recipes used in existing methods. Empirical evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a significant margin (Δ4.1% average points on eight ID tasks and Δ10.4% average points on three OOD tasks over OpenAI-o1). Notably, our approach also exhibits higher computational efficiency compared to prior dynamic sampling methods for RL.
@misc{liu2025rulereasoner, title={RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling}, author={Yang Liu and Jiaqi Li and Zilong Zheng}, year={2025}, eprint={2506.08672}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2506.08672}, }
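The domain-aware dynamic sampling described above, re-weighting how often each domain is drawn into a training batch based on historical rewards, might look roughly like the following sketch. The specific weighting rule (a softmax over each domain's reward deficit) and the domain names are assumptions for illustration.

```python
import math
import random
from collections import defaultdict

class DomainSampler:
    """Sample training domains more often when their recent rewards are low."""

    def __init__(self, domains, temperature: float = 1.0):
        self.domains = list(domains)
        self.temperature = temperature
        self.history = defaultdict(list)   # domain -> recent rewards

    def update(self, domain: str, reward: float) -> None:
        self.history[domain].append(reward)

    def weights(self):
        # Deficit = 1 - mean reward; domains the policy struggles on get larger weight.
        deficits = [1.0 - (sum(self.history[d]) / len(self.history[d]) if self.history[d] else 0.0)
                    for d in self.domains]
        exps = [math.exp(x / self.temperature) for x in deficits]
        z = sum(exps)
        return [e / z for e in exps]

    def sample_batch(self, k: int):
        return random.choices(self.domains, weights=self.weights(), k=k)

sampler = DomainSampler(["logic_puzzles", "legal_rules", "math_word_problems"])
sampler.update("legal_rules", 0.9)      # doing well -> sampled less often
sampler.update("logic_puzzles", 0.2)    # struggling -> sampled more often
print(sampler.sample_batch(8))
```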
- Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu#, and Zilong Zheng#, Preprint, 2025.
Reasoning ability, a core component of human intelligence, continues to pose a significant challenge for Large Language Models (LLMs) in the pursuit of AGI. Although model performance has improved under the training scaling law, significant challenges remain, particularly with respect to training algorithms, such as catastrophic forgetting, and the limited availability of novel training data. As an alternative, test-time scaling enhances reasoning performance by increasing test-time computation without parameter updating. Unlike prior methods in this paradigm focused on token space, we propose leveraging latent space for more effective reasoning and better adherence to the test-time scaling law. We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. Specifically, LatentSeek leverages policy gradient to iteratively update latent representations, guided by self-generated reward signals. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures. Results show that LatentSeek consistently outperforms strong baselines, such as Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our analysis demonstrates that LatentSeek is highly efficient, typically converging within a few iterations for problems of average complexity, while also benefiting from additional iterations, thereby highlighting the potential of test-time scaling in the latent space. These findings position LatentSeek as a lightweight, scalable, and effective solution for enhancing the reasoning capabilities of LLMs.
@misc{li2025seekdarkreasoningtesttime, title={Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space}, author={Hengli Li and Chenxi Li and Tong Wu and Xuekai Zhu and Yuxuan Wang and Zhaoxin Yu and Eric Hanchen Jiang and Song-Chun Zhu and Zixia Jia and Ying Nian Wu and Zilong Zheng}, year={2025}, eprint={2505.13308}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2505.13308}, }
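A hedged sketch of the test-time loop described above: treat a latent representation as the only trainable object, decode from it, score the output with a self-generated reward, and update the latent with a REINFORCE-style surrogate. The `decode_from_latent` and reward hooks below are hypothetical stand-ins (here replaced by toy functions), not the paper's implementation.

```python
import torch

def latentseek_style_search(latent: torch.Tensor,
                            decode_from_latent,   # latent -> (log_prob, answer); assumed hook
                            self_reward,          # answer -> float in [0, 1]; assumed hook
                            steps: int = 10, lr: float = 0.1) -> torch.Tensor:
    """Test-time instance-level adaptation: update only the latent, never the model weights."""
    z = latent.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        log_prob, answer = decode_from_latent(z)  # log-likelihood surrogate of the decoded answer
        reward = self_reward(answer)
        loss = -reward * log_prob                 # REINFORCE-style policy-gradient surrogate
        opt.zero_grad()
        loss.backward()
        opt.step()
        if reward > 0.95:                         # stop early once the self-reward is high
            break
    return z.detach()

# Toy stand-ins: "decoding" measures closeness of the latent to an unknown target direction.
target = torch.randn(16)
decode = lambda z: (-(z - target).pow(2).sum(), z)                       # (log_prob surrogate, "answer")
reward_fn = lambda ans: float(1.0 / (1.0 + (ans.detach() - target).pow(2).sum()))
z_star = latentseek_style_search(torch.zeros(16), decode, reward_fn)
print(torch.cosine_similarity(z_star, target, dim=0).item())
```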
2024
- JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models TPAMI'24 Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma#, and Yitao Liang#, TPAMI, 2024.
Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks could potentially be infinite and lack the capability to progressively enhance task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans will be ultimately dispatched to the goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. JARVIS-1 is the existing most general agent in Minecraft, capable of completing over 200 different tasks using control and observation space similar to humans. These tasks range from short-horizon tasks, e.g., "chopping trees" to long-horizon tasks, e.g., "obtaining a diamond pickaxe". JARVIS-1 performs exceptionally well in short-horizon tasks, achieving nearly perfect performance. In the classic long-term task of ObtainDiamondPickaxe, JARVIS-1 surpasses the reliability of current state-of-the-art agents by 5 times and can successfully complete longer-horizon and more challenging tasks.
@article{wang2023jarvis1, title = {JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models}, author = {Zihao Wang and Shaofei Cai and Anji Liu and Yonggang Jin and Jinbing Hou and Bowei Zhang and Haowei Lin and Zhaofeng He and Zilong Zheng and Yaodong Yang and Xiaojian Ma and Yitao Liang}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)}, year = {2024} }
- MindDial: Belief Dynamics Tracking with Theory-of-Mind Modeling for Situated Neural Dialogue Generation SIGDIAL'24 Oral Shuwen Qiu, Mingdian Liu, Hengli Li, Song-Chun Zhu, and Zilong Zheng, in SIGDIAL, 2024.
Humans talk in daily conversations while aligning and negotiating the expressed meanings or common ground. Despite the impressive conversational abilities of the large generative language models, they do not consider the individual differences in contextual understanding in a shared situated environment. In this work, we propose MindDial, a novel conversational framework that can generate situated free-form responses to align and negotiate common ground. We design an explicit mind module that can track three-level beliefs -- the speaker's belief, the speaker's prediction of the listener's belief, and the belief gap between the first two. Then the next response is generated to resolve the belief difference and take task-related action. Our framework is applied to both prompting and fine-tuning-based models, and is evaluated across scenarios involving both common ground alignment and negotiation. Experiments show that models with mind modeling can generate more human-like responses when aligning and negotiating common ground. The ablation study further validates the three-level belief design can aggregate information and improve task outcomes in both cooperative and negotiating settings.
@inproceedings{qiu2023minddial, title={MindDial: Belief Dynamics Tracking with Theory-of-Mind Modeling for Situated Neural Dialogue Generation}, author={Qiu, Shuwen and Liu, Mingdian and Li, Hengli and Zhu, Song-Chun and Zheng, Zilong}, booktitle={Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue}, year={2024} }
- An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding NeurIPS'24 Tong Wu, Yanpeng Zhao, and Zilong Zheng, in NeurIPS, 2024.
Recently, many methods have been developed to extend the context length of pre-trained large language models (LLMs), but they often require fine-tuning at the target length (≫4K) and struggle to effectively utilize information from the middle part of the context. To address these issues, we propose Continuity-Relativity indExing with gAussian Middle (CREAM), which interpolates positional encodings by manipulating position indices. Apart from being simple, CREAM is training-efficient: it only requires fine-tuning at the pre-trained context window (e.g., Llama 2-4K) and can extend LLMs to a much longer target context length (e.g., 256K). To ensure that the model focuses more on the information in the middle, we introduce a truncated Gaussian to encourage sampling from the middle part of the context during fine-tuning, thus alleviating the "Lost-in-the-Middle" problem faced by long-context LLMs. Experimental results show that CREAM successfully extends LLMs to the target length for both Base and Chat versions of Llama2-7B with "Never Miss A Beat".
@inproceedings{wu2024cream, title={An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding}, author={Wu, Tong and Zhao, Yanpeng and Zheng, Zilong}, booktitle = {Advances in Neural Information Processing Systems (NeurIPS)}, volume = {37}, year={2024} }
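A hedged sketch of the position-index manipulation described above: keep indices contiguous at the head and tail of the pre-trained window (continuity), map the rest onto the long target range (relativity), and draw those middle indices from a truncated Gaussian so fine-tuning emphasizes mid-context positions. The head/tail split and Gaussian parameters below are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def cream_style_positions(train_len: int = 4096, target_len: int = 262144,
                          head: int = 1024, tail: int = 1024,
                          sigma_frac: float = 0.25, seed: int = 0) -> np.ndarray:
    """Map a pre-trained window of `train_len` positions onto a long target range:
    contiguous head/tail indices plus middle indices sampled from a truncated Gaussian
    centered on the middle of the target range."""
    rng = np.random.default_rng(seed)
    mid_count = train_len - head - tail
    lo, hi = head, target_len - tail
    mu, sigma = (lo + hi) / 2.0, (hi - lo) * sigma_frac
    samples = rng.normal(mu, sigma, size=4 * mid_count)
    samples = samples[(samples >= lo) & (samples < hi)][:mid_count]   # truncate to the middle band
    middle = np.sort(samples).astype(int)
    return np.concatenate([np.arange(head),                            # contiguous head
                           middle,                                     # Gaussian-weighted middle
                           np.arange(target_len - tail, target_len)])  # contiguous tail

pos = cream_style_positions()
print(pos[:5], pos[-5:], len(pos))   # 4096 indices spread over a 256K range
```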
- Mars: Situated Inductive Reasoning in an Open-World Environment NeurIPS'24 Xiaojuan Tang, Jiaqi Li, Yitao Liang, Muhan Zhang, and Zilong Zheng, in NeurIPS D&B Track, 2024.
Large Language Models (LLMs) trained on massive corpora have shown remarkable success in knowledge-intensive tasks. Yet, most of them rely on pre-stored knowledge. Inducing new general knowledge from a specific environment and performing reasoning with the acquired knowledge---situated inductive reasoning, is crucial and challenging for machine intelligence. In this paper, we design Mars, an interactive environment devised for situated inductive reasoning. It introduces counter-commonsense game mechanisms by modifying terrain, survival setting and task dependency while adhering to certain principles. In Mars, agents need to actively interact with their surroundings, derive useful rules and perform decision-making tasks in specific contexts. We conduct experiments on various RL-based and LLM-based methods, finding that they all struggle on this challenging situated inductive reasoning benchmark. Furthermore, we explore Induction from Reflection, where we instruct agents to perform inductive reasoning from history trajectory. The superior performance underscores the importance of inductive reasoning in Mars. Through Mars, we aim to galvanize advancements in situated inductive reasoning and set the stage for developing the next generation of AI systems that can reason in an adaptive and context-sensitive way.
@inproceedings{tang2024mars, title={Mars: Situated Inductive Reasoning in an Open-World Environment}, author={Tang, Xiaojuan and Li, Jiaqi and Liang, Yitao and Zhang, Muhan and Zheng, Zilong}, booktitle={38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks}, year={2024} }
- Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge for Long Video Understanding EMNLP'24 Yuxuan Wang, Yueqian Wang, Pengfei Wu, Jianxin Liang, Dongyan Zhao, Yang Liu, and Zilong Zheng#, in EMNLP, 2024.
Despite progress in multimodal large language models (MLLMs), the challenge of interpreting long-form videos in response to linguistic queries persists, largely due to the inefficiency in temporal grounding and limited pre-trained context window size. In this work, we introduce Temporal Grounding Bridge (TGB), a novel framework that bootstraps MLLMs with advanced temporal grounding capabilities and broadens their contextual scope. Our framework significantly enhances the temporal capabilities of current MLLMs through three key innovations: an efficient multi-span temporal grounding algorithm applied to low-dimension temporal features projected from flow; a multimodal length extrapolation training paradigm that utilizes low-dimension temporal features to extend the training context window size; and a bootstrapping framework that bridges our model with pluggable MLLMs without requiring annotation. We validate TGB across seven video benchmarks and demonstrate substantial performance improvements compared with prior MLLMs. Notably, our model, initially trained on sequences of four frames, effectively handles sequences up to 16x longer without sacrificing performance, highlighting its scalability and effectiveness in real-world applications. Our code is publicly available.
@inproceedings{wang2024videotgb, title={Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge}, author={Wang, Yuxuan and Wang, Yueqian and Wu, Pengfei and Liang, Jianxin and Zhao, Dongyan and Liu, Yang and Zheng, Zilong}, booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)}, year={2024} }
- Varying Sentence Representations via Condition-Specified Routers EMNLP'24 Ziyong Lin, Quansen Wang, Zixia Jia, and Zilong Zheng, in EMNLP, 2024.
Semantic similarity between two sentences is inherently subjective and can vary significantly based on the specific aspects emphasized. Consequently, traditional sentence encoders must be capable of generating conditioned sentence representations that account for diverse conditions or aspects. In this paper, we propose a novel yet efficient framework based on transformer-based language models that facilitates advanced conditioned sentence representation while maintaining model parameters and computational efficiency. Empirical evaluations on the Conditional Semantic Textual Similarity task demonstrate the superiority of our proposed framework.
@inproceedings{lin2024csr, title={Varying Sentence Representations via Condition-Specified Routers}, author={Lin, Ziyong and Wang, Quansen and Jia, Zixia and Zheng, Zilong}, booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)}, year={2024} }
- ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning CoLM'24 Yuxuan Wang, Alan Yuille, Zhuowan Li, and Zilong Zheng, in CoLM, 2024.
Compositional visual reasoning methods, which translate a complex query into a structured composition of feasible visual tasks, have exhibited a strong potential in complicated multimodal tasks like visual question answering, language-guided image editing, etc. Empowered by recent advances in large language models (LLMs), this multimodal challenge has been brought to a new stage by treating LLMs as few-shot/zero-shot planners, i.e., visual-language programming. Such methods, despite their numerous merits, suffer from challenges due to LLM planning mistakes or inaccuracy of visual execution modules, lagging behind the non-compositional models. In this work, we devise a "plug-and-play" method, ExoViP, to correct the errors at both the planning and execution stages through introspective verification. We employ verification modules as "exoskeletons" to enhance current vision-language programming schemes. Specifically, our proposed verification module utilizes a mixture of three sub-verifiers to validate predictions after each reasoning step, subsequently calibrating the visual module predictions and refining the reasoning trace planned by LLMs. Experimental results on two representative vision-language programming methods showcase consistent improvements on five compositional reasoning tasks on standard benchmarks. In light of this, we believe ExoViP can foster better performance and generalization on open-domain multimodal challenges.
@inproceedings{wang2024exovip, title={ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning}, author={Wang, Yuxuan and Yuille, Alan and Li, Zhuowan and Zheng, Zilong}, booktitle={The first Conference on Language Modeling (CoLM)}, year={2024} }
- Boosting LLM Agents with Recursive Contemplation for Effective Deception Handling ACL'24 Shenzhi Wang, Chang Liu, Zilong Zheng#, Siyuan Qi, Shuo Chen, Qisen Yang, Andrew Zhao, Shaofei Wang, Shiji Song, and Gao Huang#, in ACL Findings, 2024.
Recent breakthroughs in large language models (LLMs) have brought remarkable success in the field of LLM-as-Agent. Nevertheless, a prevalent assumption is that the information processed by LLMs is consistently honest, neglecting the pervasive deceptive or misleading information in human society and AI-generated content. This oversight makes LLMs susceptible to malicious manipulations, potentially resulting in detrimental outcomes. This study utilizes the intricate Avalon game as a testbed to explore LLMs' potential in deceptive environments. Avalon, full of misinformation and requiring sophisticated logic, manifests as a "Game-of-Thoughts". Inspired by the efficacy of humans' recursive thinking and perspective-taking in the Avalon game, we introduce a novel framework, Recursive Contemplation (ReCon), to enhance LLMs' ability to identify and counteract deceptive information. ReCon combines formulation and refinement contemplation processes; formulation contemplation produces initial thoughts and speech, while refinement contemplation further polishes them. Additionally, we incorporate first-order and second-order perspective transitions into these processes respectively. Specifically, the first-order allows an LLM agent to infer others' mental states, and the second-order involves understanding how others perceive the agent's mental state. After integrating ReCon with different LLMs, extensive experiment results from the Avalon game indicate its efficacy in aiding LLMs to discern and maneuver around deceptive information without extra fine-tuning and data. Finally, we offer a possible explanation for the efficacy of ReCon and explore the current limitations of LLMs in terms of safety, reasoning, speaking style, and format, potentially furnishing insights for subsequent research.
@inproceedings{wang2024avalon, title={Boosting LLM Agents with Recursive Contemplation for Effective Deception Handling}, author={Wang, Shenzhi and Liu, Chang and Zheng, Zilong and Qi, Siyuan and Chen, Shuo and Yang, Qisen and Zhao, Andrew and Wang, Shaofei and Song, Shiji and Huang, Gao}, booktitle={Findings of the Association for Computational Linguistics: ACL-Findings}, year={2024} }
- LooGLE: Can Long-Context Language Models Understand Long Contexts? ACL'24 Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang, in ACL, 2024.
Large language models (LLMs), despite their impressive performance in various language tasks, are typically limited to processing texts within context-window size. This limitation has spurred significant research efforts to enhance LLMs' long-context understanding with high-quality long-sequence benchmarks. However, prior datasets in this regard suffer from shortcomings, such as short context length compared to the context window of modern LLMs; outdated documents that have data leakage problems; and an emphasis on short dependency tasks rather than long dependency tasks. In this paper, we present LooGLE, a Long Context Generic Language Evaluation benchmark for LLMs' long context understanding. LooGLE features relatively new documents post-2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning diverse domains. Human annotators meticulously crafted more than 1,100 high-quality question-answer pairs to meet the long dependency requirements. These pairs underwent thorough cross-validation, yielding the most precise assessment of LLMs' long dependency capabilities. The evaluation of eight state-of-the-art LLMs on LooGLE revealed key findings: (i) commercial models outperformed open-sourced models; (ii) LLMs excelled in short dependency tasks like short question-answering and cloze tasks but struggled with more intricate long dependency tasks; (iii) in-context learning and chaining thoughts offered only marginal improvements; (iv) retrieval-based techniques demonstrated substantial benefits for short question-answering, while strategies for extending context window length had limited impact on long context understanding. As such, LooGLE not only provides a systematic and comprehensive evaluation schema on long-context LLMs, but also sheds light on future development of enhanced models towards "true long-context understanding".
@inproceedings{li2024loogle, title={LooGLE: Can Long-Context Language Models Understand Long Contexts?}, author={Li, Jiaqi and Wang, Mengmeng and Zheng, Zilong and Zhang, Muhan}, booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, year={2024} }
- LangSuit⋅E: Controlling, Planning, and Interacting with Large Language Models in Embodied Text Environments ACL'24 Zixia Jia, Mengmeng Wang, Baichen Tong, Song-Chun Zhu, and Zilong Zheng, in ACL Findings, 2024.
Recent advances in Large Language Models (LLMs) have shown inspiring achievements in constructing autonomous agents that rely on language descriptions as inputs. However, it remains unclear how well LLMs can function as few-shot or zero-shot embodied agents in dynamic interactive environments. To address this gap, we introduce LangSuit·E, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds. Compared with previous LLM-based testbeds, LangSuit·E (i) offers adaptability to diverse environments without multiple simulation engines, (ii) evaluates agents’ capacity to develop “internalized world knowledge” with embodied observations, and (iii) allows easy customization of communication and action strategies. To address the embodiment challenge, we devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states w.r.t. history information. Comprehensive benchmark results illustrate challenges and insights of embodied planning. LangSuit·E represents a significant step toward building embodied generalists in the context of language models.
@inproceedings{jia2024langsuite, title={LangSuit$\cdot$E: Controlling, Planning, and Interacting with Large Language Models in Embodied Text Environments}, author={Jia, Zixia and Wang, Mengmeng and Tong, Baichen and Zhu, Song-Chun and Zheng, Zilong}, booktitle={Findings of the Association for Computational Linguistics: ACL-Findings 2024}, year={2024} }
- Combining Supervised Learning and Reinforcement Learning for Multi-Label Classification Tasks with Partial Labels ACL'24 Zixia Jia, Junpeng Li, Shichuan Zhang, and Zilong Zheng, in ACL, 2024.
Traditional supervised learning heavily relies on human-annotated datasets, especially in data-hungry neural approaches. However, various tasks, especially multi-label tasks like document-level relation extraction, pose challenges in fully manual annotation due to the specific domain knowledge and large class sets. Therefore, we address the multi-label positive-unlabelled learning (MLPUL) problem, where only a subset of positive classes is annotated. We propose Mixture Learner for Partially Annotated Classification (MLPAC), an RL-based framework combining the exploration ability of reinforcement learning and the exploitation ability of supervised learning. Experimental results across various tasks, including document-level relation extraction, multi-label image classification, and binary PU learning, demonstrate the generalization and effectiveness of our framework.
@inproceedings{jia2024combining, title={Combining Supervised Learning and Reinforcement Learning for Multi-Label Classification Tasks with Partial Labels}, author={Jia, Zixia and Li, Junpeng and Zhang, Shichuan and Zheng, Zilong}, booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, year={2024} }
- MindAgent: Emergent Gaming Interaction Ran Gong, Qiuyuan Huang, Xiaojian Ma, Hoi Vo, Zane Durante, Yusuke Noda, Zilong Zheng, Demetri Terzopoulos, Fei-Fei Li, and Jianfeng Gao, in NAACL Findings, 2024.
Large Language Models (LLMs) have the capacity of performing complex scheduling in a multi-agent system and can coordinate these agents into completing sophisticated tasks that require extensive collaboration. However, despite the introduction of numerous gaming frameworks, the community has insufficient benchmarks towards building general multi-agents collaboration infrastructure that encompass both LLM and human-NPCs collaborations. In this work, we propose a novel infrastructure - MindAgent - to evaluate planning and coordination emergent capabilities for gaming interaction. In particular, our infrastructure leverages existing gaming framework, to i) require understanding of the coordinator for a multi-agent system, ii) collaborate with human players via un-finetuned proper instructions, and iii) establish an in-context learning on few-shot prompt with feedback. Furthermore, we introduce CUISINEWORLD, a new gaming scenario and related benchmark that dispatch a multi-agent collaboration efficiency and supervise multiple agents playing the game simultaneously. We conduct comprehensive evaluations with new auto-metric CoS for calculating the collaboration efficiency. Finally, our infrastructure can be deployed into real-world gaming scenarios in a customized VR version of CUISINEWORLD and adapted in existing broader Minecraft gaming domain. We hope our findings on LLMs and the new infrastructure for general-purpose scheduling and coordination can help shed light on how such skills can be obtained by learning from large language corpora.
@inproceedings{gong2024mindagent, title={Mindagent: Emergent gaming interaction}, author={Gong, Ran and Huang, Qiuyuan and Ma, Xiaojian and Vo, Hoi and Durante, Zane and Noda, Yusuke and Zheng, Zilong and Terzopoulos, Demetri and Li, Fei-Fei and Gao, Jianfeng}, booktitle={Findings of the North American Chapter of the Association for Computational Linguistics: NAACL-Findings}, year={2024} }
2023
- ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab NeurIPS'23 Jieming Cui*, Ziren Gong*, Baoxiong Jia*, Siyuan Huang, Zilong Zheng#, Jianzhu Ma#, and Yixin Zhu#, in NeurIPS D&B Track, 2023.
The challenge of replicating research results has posed a significant impediment to the field of molecular biology. The advent of modern intelligent systems has led to notable progress in various domains. Consequently, we embarked on an investigation of intelligent monitoring systems as a means of tackling the issue of the reproducibility crisis. Specifically, we first curate a comprehensive multimodal dataset, named ProBio, as an initial step towards this objective. This dataset comprises fine-grained hierarchical annotations intended for the purpose of studying activity understanding in Molecular Biology Lab (BioLab). Next, we devise two challenging benchmarks, transparent solution tracking and multimodal action recognition, to emphasize the unique characteristics and difficulties associated with activity understanding in BioLab settings. Finally, we provide a thorough experimental evaluation of contemporary video understanding models and highlight their limitations in this specialized domain to identify potential avenues for future research. We hope ProBio with associated benchmarks may garner increased focus on modern AI techniques in the realm of molecular biology.
@inproceedings{cui2023probio, title={ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab}, author={Cui, Jieming and Gong, Ziren and Jia, Baoxiong and Huang, Siyuan and Zheng, Zilong and Ma, Jianzhu and Zhu, Yixin}, booktitle={The Thirty-Seventh Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS D&B 2023)}, year={2023} }
- DiPlomat: A Dialogue Dataset for Situated Pragmatic Reasoning NeurIPS'23 Hengli Li, Song-Chun Zhu, and Zilong Zheng, in NeurIPS D&B Track, 2023.
Pragmatic reasoning plays a pivotal role in deciphering implicit meanings that frequently arise in real-life conversations and is essential for the development of communicative social agents. In this paper, we introduce a novel challenge, DiPlomat, aiming at benchmarking machines' capabilities on pragmatic reasoning and situated conversational understanding. Compared with previous works that treat different figurative expressions (e.g. metaphor, sarcasm) as individual tasks, DiPlomat provides a cohesive framework towards general pragmatic understanding. Our dataset is created through the utilization of Amazon Mechanical Turk (AMT), resulting in a total of 4,177 multi-turn dialogues. In conjunction with the dataset, we propose two tasks, Pragmatic Identification and Reasoning (PIR) and Conversational Question Answering (CQA). Experimental results with state-of-the-art (SOTA) neural architectures reveal several significant findings: 1) large language models (LLMs) exhibit poor performance in tackling this subjective domain; 2) comprehensive comprehension of context emerges as a critical factor for establishing benign human-machine interactions; 3) current models defect in the application of pragmatic reasoning. As a result, we call on more attention to improve the ability of context understanding, reasoning, and implied meaning modeling.
@inproceedings{li2023diplomat, title={DiPlomat: A Dialogue Dataset for Situated Pragmatic Reasoning}, author={Li, Hengli and Zhu, Song-Chun and Zheng, Zilong}, booktitle={The Thirty-Seventh Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS D&B 2023)}, year={2023} }
- SQA3D: Situated Question Answering in 3D Scenes ICLR'23 Xiaojian Ma*, Silong Yong*, Zilong Zheng#, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang#, in ICLR, 2023.
We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation. Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations. These questions examine a wide spectrum of reasoning capabilities for an intelligent agent, ranging from spatial relation comprehension to commonsense understanding, navigation, and multi-hop reasoning. SQA3D imposes a significant challenge to current multi-modal especially 3D reasoning models. We evaluate various state-of-the-art approaches and find that the best one only achieves an overall score of 47.20%, while amateur human participants can reach 90.06%. We believe SQA3D could facilitate future embodied AI research with stronger situation understanding and reasoning capability.
@inproceedings{ma2022sqa3d, title={SQA3D: Situated Question Answering in 3D Scenes}, author={Ma, Xiaojian and Yong, Silong and Zheng, Zilong and Li, Qing and Liang, Yitao and Zhu, Song-Chun and Huang, Siyuan}, booktitle={International Conference on Learning Representations}, year={2023}, url={https://openreview.net/forum?id=IDJx97BC38} }
- Semi-automatic Data Enhancement for Document-Level Relation Extraction with Distant Supervision from Large Language Models EMNLP'23 Junpeng Li, Zixia Jia, and Zilong Zheng, in EMNLP, 2023.
Document-level Relation Extraction (DocRE), which aims to extract relations from a long context, is a critical challenge in achieving fine-grained structural comprehension and generating interpretable document representations. Inspired by recent advances in in-context learning capabilities emergent from large language models (LLMs), such as ChatGPT, we aim to design an automated annotation method with minimum human effort. Unfortunately, vanilla in-context learning is infeasible for document-level Relation Extraction (RE) due to the large number of predefined fine-grained relation types and the uncontrolled generations of LLMs. To tackle this issue, we propose a method integrating a large language model (LLM) and a natural language inference (NLI) module to generate external relation triples, thereby augmenting document-level relation datasets. We demonstrate the effectiveness of our approach by introducing an enhanced dataset known as DocGNRE, which excels in re-annotating numerous long-tail relation types. We are confident that our method holds the potential for broader applications in domain-specific relation type definitions and offers tangible benefits in advancing generalized language semantic comprehension.
@inproceedings{li2023docngre, title={Semi-automatic Data Enhancement for Document-Level Relation Extraction with Distant Supervision from Large Language Models}, author={Li, Junpeng and Jia, Zixia and Zheng, Zilong}, booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)}, year={2023} }
- VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions ACL'23Yuxuan Wang, Zilong Zheng#, Xueliang Zhao, Jinpeng Li, Yueqian Wang, and Dongyan Zhao#, in ACL, 2023.
Video-grounded dialogue understanding is a challenging problem that requires machines to perceive, parse, and reason over situated semantics extracted from weakly aligned video and dialogues. Most existing benchmarks treat both modalities as in a frame-independent visual understanding task, neglecting the intrinsic attributes of multimodal dialogues, such as scene and topic transitions. In this paper, we present the Video-grounded Scene&Topic AwaRe dialogue (VSTAR) dataset, a large-scale video-grounded dialogue understanding dataset based on 395 TV series. Based on VSTAR, we propose two benchmarks for video-grounded dialogue understanding: scene segmentation and topic segmentation, and one benchmark for video-grounded dialogue generation. Comprehensive experiments are performed on these benchmarks to demonstrate the importance of multimodal information and segments in video-grounded dialogue understanding and generation.
@inproceedings{wang2023vstar, title={VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions}, author={Wang, Yuxuan and Zheng, Zilong and Zhao, Xueliang and Li, Jinpeng and Wang, Yueqian and Zhao, Dongyan}, booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)}, year={2023} }
- Modeling Instance Interactions for Joint Information Extraction with Neural High-Order Conditional Random Field ACL'23
Prior works on joint Information Extraction (IE) typically model instance (e.g., event triggers, entities, roles, relations) interactions by representation enhancement, type dependencies scoring, or global decoding. We find that the previous models generally consider binary type dependency scoring of a pair of instances, and leverage local search such as beam search to approximate global solutions. To better integrate cross-instance interactions, in this work, we introduce a joint IE framework (CRFIE) that formulates joint IE as a high-order Conditional Random Field. Specifically, we design binary factors and ternary factors to directly model interactions between not only a pair of instances but also triplets. Then, these factors are utilized to jointly predict labels of all instances. To address the intractability problem of exact high-order inference, we incorporate a high-order neural decoder that is unfolded from a mean-field variational inference method, which achieves consistent learning and inference. The experimental results show that our approach achieves consistent improvements on three IE tasks compared with our baseline and prior work.
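To make the factorization concrete, here is an illustrative rendering (my notation, not the paper's exact equations) of a high-order CRF over instance labels with unary, binary, and ternary factors:
p(\mathbf{y} \mid \mathbf{x}) \;\propto\; \exp\Big( \sum_i \psi_i(y_i) + \sum_{(i,j)} \psi_{ij}(y_i, y_j) + \sum_{(i,j,k)} \psi_{ijk}(y_i, y_j, y_k) \Big)
Exact inference over such high-order factors is intractable in general, which is why the decoder unfolds mean-field variational updates into neural network layers.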
@inproceedings{jia2023joint, title={Modeling Instance Interactions for Joint Information Extraction with Neural High-Order Conditional Random Field}, author={Jia, Zixia and Yan, Zhaohui and Han, Wenjuan and Zheng, Zilong and Tu, Kewei}, booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)}, year={2023} }
- Shuō Wén Jiě Zì: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training ACL'23
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters. We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries and Jiezi refers to the process of enhancing characters' glyph representations with structure understanding. To facilitate dictionary understanding, we propose three pre-training tasks, i.e., Masked Entry Modeling, Contrastive Learning for Synonym and Antonym, and Example Learning. We evaluate our method on both the modern Chinese understanding benchmark CLUE and the ancient Chinese benchmark CCLUE. Moreover, we propose a new polysemy discrimination task, PolyMRC, based on the collected dictionary of ancient Chinese. Our paradigm demonstrates consistent improvements over previous Chinese PLMs across all tasks. Furthermore, our approach yields significant gains in the few-shot setting of ancient Chinese understanding.
@inproceedings{wang2023shuo, title={Shu\={o} W\'{e}n Ji\v{e} Z\`{i}: \\ Rethinking Dictionaries and Glyphs for Chinese Language Pre-training}, author={Wang, Yuxuan and Wang, Jianghui and Zhao, Dongyan and Zheng, Zilong}, booktitle={Findings of the Association for Computational Linguistics: ACL-Findings}, year={2023} }
2022
- In situ bidirectional human-robot value alignment ScienceRoboticsLuyao Yuan*#, Xiaofeng Gao*, Zilong Zheng*, Mark Edmonds#, Ying Nian Wu, Federico Rossano, Hongjing Lu#, Yixin Zhu#, and Song-Chun Zhu#, Science Robotics, 2022.
A prerequisite for social coordination is bidirectional communication between teammates, each playing two roles simultaneously: as receptive listeners and expressive speakers. For robots working with humans in complex situations with multiple goals that differ in importance, failure to fulfill the expectation of either role could undermine group performance due to misalignment of values between humans and robots. Specifically, a robot needs to serve as an effective listener to infer human users’ intents from instructions and feedback and as an expressive speaker to explain its decision processes to users. Here, we investigate how to foster effective bidirectional human-robot communications in the context of value alignment—collaborative robots and users form an aligned understanding of the importance of possible task goals. We propose an explainable artificial intelligence (XAI) system in which a group of robots predicts users’ values by taking in situ feedback into consideration while communicating their decision processes to users through explanations. To learn from human feedback, our XAI system integrates a cooperative communication model for inferring human values associated with multiple desirable goals. To be interpretable to humans, the system simulates human mental dynamics and predicts optimal explanations using graphical models. We conducted psychological experiments to examine the core components of the proposed computational framework. Our results show that real-time human-robot mutual understanding in complex cooperative tasks is achievable with a learning model based on bidirectional communication. We believe that this interaction framework can shed light on bidirectional value alignment in communicative XAI systems and, more broadly, in future human-machine teaming systems. An explainable artificial intelligence collaboration framework enables in situ bidirectional human-robot value alignment.
@article{doi:10.1126/scirobotics.abm4183, author = {Luyao Yuan and Xiaofeng Gao and Zilong Zheng and Mark Edmonds and Ying Nian Wu and Federico Rossano and Hongjing Lu and Yixin Zhu and Song-Chun Zhu}, title = {In situ bidirectional human-robot value alignment}, journal = {Science Robotics}, volume = {7}, number = {68}, pages = {eabm4183}, year = {2022}, doi = {10.1126/scirobotics.abm4183}, URL = {https://www.science.org/doi/abs/10.1126/scirobotics.abm4183}, eprint = {https://www.science.org/doi/pdf/10.1126/scirobotics.abm4183} }
- SHARP: Search-Based Adversarial Attack for Structured Prediction NAACL'22
@inproceedings{zhang2022sharp, title={SHARP: Search-Based Adversarial Attack for Structured Prediction}, author={Zhang, Liwen and Jia, Zixia and Han, Wenjuan and Zheng, Zilong and Tu, Kewei}, booktitle={Findings of the Association for Computational Linguistics: NAACL-Findings}, year={2022} }
- VGStore: A Multimodal Extension to SPARQL for Querying RDF Scene Graph ISWC'22
Semantic Web technology has successfully facilitated many RDF models with rich data representation methods. It also has the potential to represent and store multimodal knowledge bases such as multimodal scene graphs. However, most existing query languages, especially SPARQL, barely explore implicit multimodal relationships like semantic similarity, spatial relations, etc. We first explored this issue by organizing a large-scale scene graph dataset, namely Visual Genome, in an RDF graph database. Based on the proposed RDF-stored multimodal scene graph, we extended SPARQL queries to answer questions involving relational reasoning about color, spatial relations, etc. A further demo (i.e., VGStore) shows the effectiveness of customized queries and the display of multimodal data.
@inproceedings{vgstore22iswc, title={VGStore: A Multimodal Extension to SPARQL for Querying RDF Scene Graph}, author={Li, Yanzeng and Zheng, Zilong and Han, Wenjuan and Zou, Lei}, booktitle={The 21st International Semantic Web Conference (ISWC) Poster & Demo Track}, year={2022} }
- Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling ICLR'22 Oral
We introduce a new task, unsupervised vision-language (VL) grammar induction. Given an image-caption pair, the goal is to extract a shared hierarchical structure for both image and language simultaneously. We argue that such structured output, grounded in both modalities, is a clear step towards the high-level understanding of multimodal information. Besides challenges existing in conventional visually grounded grammar induction tasks, VL grammar induction requires a model to capture contextual semantics and perform a fine-grained alignment. To address these challenges, we propose a novel method, CLIORA, which constructs a shared vision-language constituency tree structure with context-dependent semantics for all constituents in different levels of the tree. It computes a matching score between each constituent and image region, trained via contrastive learning. It integrates two levels of fusion, namely at feature-level and at score-level, so as to allow fine-grained alignment. We introduce a new evaluation metric: Critical Concept Recall Rate (CCRR) to explicitly evaluate VL grammar induction, and show a 2.6% improvement over a strong baseline on Flickr30k Entities. We also evaluate our model via two derived tasks, i.e., language grammar induction and phrase grounding, and improve over the state-of-the-art for both.
@article{wan2022unsupervised, title={Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling}, author={Wan, Bo and Han, Wenjuan and Zheng, Zilong and Tuytelaars, Tinne}, journal={The Tenth International Conference on Learning Representations (ICLR)}, year={2022} }
- Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships CVPR'22
Understanding realistic visual scene images together with language descriptions is a fundamental task towards generic visual understanding. Previous works have shown compelling comprehensive results by building hierarchical structures for visual scenes (e.g., scene graphs) and natural languages (e.g., dependency trees), individually. However, how to construct a joint vision-language (VL) structure has barely been investigated. We therefore introduce a new, more challenging but worthwhile task that targets inducing such a joint VL structure in an unsupervised manner. Our goal is to bridge visual scene graphs and linguistic dependency trees seamlessly. Due to the lack of VL structural data, we start by building a new dataset, VLParse. Rather than using labor-intensive labeling from scratch, we propose an automatic alignment procedure to produce coarse structures, followed by human refinement to produce high-quality ones. Moreover, we benchmark our dataset by proposing a contrastive learning (CL)-based framework, VLGAE, short for Vision-Language Graph Autoencoder. Our model obtains superior performance on two derived tasks, i.e., language grammar induction and VL phrase grounding. Ablations show the effectiveness of both visual cues and dependency relationships on fine-grained VL structure construction.
@inproceedings{lou2022unsupervised, title={Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships}, author={Lou, Chao and Han, Wenjuan and Lin, Yuhuan and Zheng, Zilong}, booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)}, year={2022} }
- Energy-Based Generative Cooperative Saliency Prediction AAAI'22 Oral
Conventional saliency prediction models typically learn a deterministic mapping from images to the corresponding ground truth saliency maps. In this paper, we study the saliency prediction problem from the perspective of generative models by learning a conditional probability distribution over saliency maps given an image, and treating the prediction as a sampling process. Specifically, we propose a generative cooperative saliency prediction framework based on generative cooperative networks, where a conditional latent variable model and a conditional energy-based model are jointly trained to predict saliency in a cooperative manner. We call our model SalCoopNets. The latent variable model serves as a fast but coarse predictor to efficiently produce an initial prediction, which is then refined by the iterative Langevin revision of the energy-based model that serves as a fine predictor. Such a coarse-to-fine cooperative saliency prediction strategy offers the best of both worlds. Moreover, we generalize our framework to the scenario of weakly supervised saliency prediction, where saliency annotation of training images is partially observed, by proposing a cooperative learning-while-recovering strategy. Lastly, we show that the learned energy function can serve as a refinement module for the results of other pretrained saliency prediction models. Experimental results show that our generative model can achieve state-of-the-art performance.
@article{zhang2022energy, title = {Energy-Based Generative Cooperative Saliency Prediction}, author = {Zhang, Jing and Xie, Jianwen and Zheng, Zilong and Barnes, Nick}, journal={The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI)}, year = {2022} }
2021
- Cooperative Training of Fast Thinking Initializer and Slow Thinking Solver for Multi-Modal Conditional Learning TPAMIJianwen Xie*, Zilong Zheng*, Xiaolin Fang, Song-Chun Zhu, and Ying Nian Wu, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021.
This paper studies the supervised learning of the conditional distribution of a high-dimensional output given an input, where the output and input may belong to two different modalities, e.g., the output is a photo image and the input is a sketch image. We solve this problem by cooperative training of a fast thinking initializer and a slow thinking solver. The initializer generates the output directly by a non-linear transformation of the input as well as a noise vector that accounts for latent variability in the output. The slow thinking solver learns an objective function in the form of a conditional energy function, so that the output can be generated by optimizing the objective function, or more rigorously by sampling from the conditional energy-based model. We propose to learn the two models jointly, where the fast thinking initializer serves to initialize the sampling of the slow thinking solver, and the solver refines the initial output by an iterative algorithm. The solver learns from the difference between the refined output and the observed output, while the initializer learns from how the solver refines its initial output. We demonstrate the effectiveness of the proposed method on various multi-modal conditional learning tasks, e.g., class-to-image generation, image-to-image translation, and image recovery.
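Schematically (notation mine, abstracted from the description above), the initializer proposes and the solver refines under a conditional energy U_\theta:
y_0 = g_\alpha(x, z), \quad z \sim \mathcal{N}(0, I); \qquad y_{k+1} = y_k - \tfrac{\delta^2}{2} \nabla_y U_\theta(y_k, x) + \delta\, \epsilon_k, \quad \epsilon_k \sim \mathcal{N}(0, I)
so that the refined output approximately samples from p_\theta(y \mid x) \propto \exp(-U_\theta(y, x)); the solver is updated from the gap between refined and observed outputs, and the initializer from how its proposal was revised.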
@article{xie2021cooperative, title={Cooperative Training of Fast Thinking Initializer and Slow Thinking Solver for Multi-Modal Conditional Learning}, author={Xie, Jianwen and Zheng, Zilong and Fang, Xiaolin and Zhu, Song-Chun and Wu, Ying Nian}, journal={IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)}, year={2021} }
- Patchwise Generative ConvNet: Training Energy-Based Models from a Single Natural Image for Internal Learning CVPR'21 Oral
Exploiting the internal statistics of a single natural image has long been recognized as a significant research paradigm, where the goal is to learn the internal distribution of patches within the image without relying on external training data. Different from prior works that model such a distribution implicitly with a top-down latent variable model (e.g., generator), this paper proposes to explicitly represent the statistical distribution within a single natural image by using an energy-based generative framework, where a pyramid of energy functions, each parameterized by a bottom-up deep neural network, is used to capture the distributions of patches at different resolutions. Meanwhile, a coarse-to-fine sequential training and sampling strategy is presented to train the model efficiently. Besides learning to generate random samples from white noise, the model can learn in parallel with a self-supervised task (e.g., recover the input image from its corrupted version), which can further improve the descriptive power of the learned model. The proposed model is simple and natural in that it does not require an auxiliary model (e.g., discriminator) to assist the training. Besides, it unifies internal statistics learning and image generation in a single framework. Experimental results presented on various image generation and manipulation tasks, including super-resolution, image editing, harmonization, style transfer, etc., have demonstrated the effectiveness of our model for internal learning.
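For reference, a sketch (in my notation) of the Langevin update used to sample from an energy-based model p_\theta(x) \propto \exp(f_\theta(x)) at a given pyramid resolution:
x_{k+1} = x_k + \tfrac{\delta^2}{2} \nabla_x f_\theta(x_k) + \delta\, \epsilon_k, \qquad \epsilon_k \sim \mathcal{N}(0, I)
where \delta is the step size; in the coarse-to-fine scheme described above, sampling at a coarser resolution would plausibly initialize sampling at the next finer one.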
@inproceedings{zheng2021patchgencn, title={Patchwise Generative ConvNet: Training Energy-Based Models from a Single Natural Image for Internal Learning}, author={Zheng, Zilong and Xie, Jianwen and Li, Ping}, booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)}, year={2021} }
- Learning Triadic Belief Dynamics in Nonverbal Communication from Videos CVPR'21 Oral
Humans possess a unique social cognition capability; nonverbal communication can convey rich social information among agents. In contrast, such crucial social characteristics are mostly missing in the existing scene understanding literature. In this paper, we incorporate different nonverbal communication cues (e.g., gaze, human poses, and gestures) to represent, model, learn, and infer agents’ mental states from pure visual inputs. Crucially, such a mental representation takes the agent’s belief into account so that it represents what the true world state is and infers the beliefs in each agent’s mental state, which may differ from the true world states. By aggregating different beliefs and true world states, our model essentially forms “five minds” during the interactions between two agents. This “five minds” model differs from prior works that infer beliefs in an infinite recursion; instead, agents’ beliefs converge into a “common mind”. Based on this representation, we further devise a hierarchical energy-based model that jointly tracks and predicts all five minds. From this new perspective, a social event is interpreted by a series of nonverbal communication and belief dynamics, which transcends the classic keyframe video summary. In the experiments, we demonstrate that using such a social account provides a better video summary on videos with rich social interactions compared with state-of-the-art keyframe video summary methods.
@inproceedings{fan2021learning, title = {Learning Triadic Belief Dynamics in Nonverbal Communication from Videos}, author = {Lifeng Fan and Shuwen Qiu and Zilong Zheng and Tao Gao and Song-Chun Zhu and Yixin Zhu}, year = {2021}, booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)} }
- Generative PointNet: Deep Energy-Based Learning on Unordered Point Sets for 3D Generation, Reconstruction and Classification CVPR'21
We propose a generative model of unordered point sets, such as point clouds, in the form of an energy-based model, where the energy function is parameterized by an input-permutation-invariant bottom-up neural network. The energy function learns a coordinate encoding of each point and then aggregates all individual point features into an energy for the whole point cloud. We show that our model can be derived from the discriminative PointNet. The model can be trained by MCMC-based maximum likelihood learning (as well as its variants), without the help of any assisting networks like those in GANs and VAEs. Unlike most point cloud generators that rely on hand-crafted distance metrics, our model does not require such a metric for point cloud generation, because it synthesizes point clouds by matching observed examples in terms of statistical properties defined by the energy function. Furthermore, we can learn a short-run MCMC toward the energy-based model as a flow-like generator for point cloud reconstruction and interpretation. The learned point cloud representation can also be useful for point cloud classification. Experiments demonstrate the advantages of the proposed generative model of point clouds.
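As a sketch of the permutation-invariant energy described above (notation mine; the exact pooling operator is an assumption):
e_\theta(X) = h_\theta\big( \operatorname{agg}_{x_i \in X} \phi_\theta(x_i) \big), \qquad p_\theta(X) \propto \exp\big(-e_\theta(X)\big)
where \phi_\theta encodes each point, \operatorname{agg} is a symmetric pooling (e.g., max or average), and h_\theta maps the pooled feature to a scalar energy, so the energy is invariant to the ordering of points in X.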
@inproceedings{xie2021GPointent, title={Generative PointNet: Deep Energy-Based Learning on Unordered Point Sets for 3D Generation, Reconstruction and Classification}, author={Xie, Jianwen and Xu, Yifei and Zheng, Zilong and Zhu, Song-Chun and Wu, Ying Nian}, booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)}, year={2021} }
- GRICE: A Grammar-based Dataset for Recovering Implicature and Conversational rEasoning ACL'21
Understanding what we genuinely mean instead of what we literally say in conversations is challenging for both humans and machines; yet, this direction is mostly left untouched in modern open-ended dialogue systems. To fill in this gap, we present a grammar-based dialogue dataset, GRICE, designed to bring implicature into pragmatic reasoning in the context of conversations. Our design of GRICE also incorporates other essential aspects of modern dialogue modeling (e.g., coreference). The entire dataset is systematically generated using a hierarchical grammar model, such that each dialogue context has intricate implicatures and is temporally consistent. We further present two tasks, the implicature recovery task followed by the pragmatic reasoning task in conversation, to evaluate the model's reasoning capability. In experiments, we adopt baseline methods claimed to have pragmatic reasoning capability; the results show a large performance gap between baseline methods and human performance. After integrating a simple module that explicitly reasons about implicature, the model shows an overall performance boost in conversational reasoning. These observations demonstrate the significance of implicature recovery for open-ended dialogue reasoning and call for future research in conversational implicature and conversational reasoning.
@inproceedings{zheng2021grice, title={GRICE: A Grammar-based Dataset for Recovering Implicature and Conversational Reasoning}, author={Zheng, Zilong and Qiu, Shuwen and Fan, Lifeng and Zhu, Yixin and Zhu, Song-Chun}, booktitle={Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021}, year={2021}, pages = {2074--2085} }
- Learning Energy-Based Model with Variational Auto-Encoder as Amortized Sampler AAAI'21
Due to the intractable partition function, training energy-based models (EBMs) by maximum likelihood requires Markov chain Monte Carlo (MCMC) sampling to approximate the gradient of the Kullback–Leibler divergence between data and model distributions. However, it is non-trivial to sample from an EBM because of the difficulty of mixing between modes. In this paper, we propose to learn a variational auto-encoder (VAE) to initialize the finite-step MCMC, such as Langevin dynamics derived from the energy function, for efficient amortized sampling of the EBM. With these amortized MCMC samples, the EBM can be trained by maximum likelihood, which follows an “analysis by synthesis” scheme, while the variational auto-encoder learns from these MCMC samples via variational Bayes. We call this joint training algorithm variational MCMC teaching, in which the VAE chases the EBM toward the data distribution. We interpret the learning algorithm as a dynamic alternating projection in the context of information geometry. Our proposed models can generate samples comparable to GANs and EBMs. Additionally, we demonstrate that our models can learn effective probabilistic distributions in supervised conditional learning experiments.
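A minimal sketch, assuming a PyTorch-style setup (function and variable names here are hypothetical, not the paper's released code): the VAE decoder proposes an initial sample, and a few Langevin steps on the EBM energy refine it.

import torch

def langevin_refine(energy, x_init, n_steps=20, step_size=0.1):
    # Refine amortized proposals with short-run Langevin dynamics on an EBM energy.
    # `energy` maps a batch of samples to per-sample scalar energies, i.e. p(x) is
    # proportional to exp(-energy(x)).
    x = x_init.detach()
    for _ in range(n_steps):
        x = x.detach().requires_grad_(True)
        grad = torch.autograd.grad(energy(x).sum(), x)[0]
        # Langevin update: gradient descent on the energy plus Gaussian noise.
        x = x - 0.5 * step_size ** 2 * grad + step_size * torch.randn_like(x)
    return x.detach()

# Hypothetical usage: x0 = vae_decoder(torch.randn(batch, z_dim)); x = langevin_refine(ebm_energy, x0)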
@article{xie2021vaeebm, title={Learning Energy-Based Model with Variational Auto-Encoder as Amortized Sampler}, author={Xie, Jianwen and Zheng, Zilong and Li, Ping}, journal={The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI)}, year={2021} }
- Learning Cycle-Consistent Cooperative Networks via Alternating MCMC Teaching for Unsupervised Cross-Domain Translation AAAI'21
This paper studies the unsupervised cross-domain translation problem by proposing a generative framework, in which the probability distribution of each domain is represented by a generative cooperative network that consists of an energy-based model and a latent variable model. The use of a generative cooperative network enables maximum likelihood learning of the domain model by MCMC teaching, where the energy-based model seeks to fit the data distribution of the domain and distills its knowledge to the latent variable model via MCMC. Specifically, in the MCMC teaching process, the latent variable model parameterized by an encoder-decoder maps examples from the source domain to the target domain, while the energy-based model further refines the mapped results by Langevin revision such that the revised results match the examples in the target domain in terms of the statistical properties defined by the learned energy function. For the purpose of building up a correspondence between two unpaired domains, the proposed framework simultaneously learns a pair of cooperative networks with cycle consistency, accounting for a two-way translation between the two domains, by alternating MCMC teaching. Experiments show that the proposed framework is useful for unsupervised image-to-image translation and unpaired image sequence translation.
@article{xie2021cycle, title={Learning Cycle-Consistent Cooperative Networks via Alternating MCMC Teaching for Unsupervised Cross-Domain Translation}, author={Xie, Jianwen and Zheng, Zilong and Fang, Xiaolin and Zhu, Song-Chun and Wu, Ying Nian}, journal={The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI)}, year={2021} }
2020
- Generative VoxelNet: Learning Energy-Based Models for 3D Shape Synthesis and Analysis TPAMIJianwen Xie*, Zilong Zheng*, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, and Ying Nian Wu, in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020.
3D data that contains rich geometry information of objects and scenes is a valuable asset for understanding 3D physical world. With the recent emergence of large-scale 3D datasets, it becomes increasingly crucial to have a powerful 3D generative model for 3D shape synthesis and analysis. This paper proposes a 3D shape descriptor network, which is a deep 3D convolutional energy-based model, for representing volumetric shape patterns. The maximum likelihood training of the model follows an “analysis by synthesis” scheme. The benefits of the proposed model are five-fold: first, unlike GANs and VAEs, the training of the model does not rely on any auxiliary models; second, the model can synthesize realistic 3D shapes by sampling from the probability distribution via MCMC, such as Langevin dynamics; third, the conditional version of the model can be applied to 3D object recovery and super-resolution; fourth, the model can be used to train a 3D generator network via MCMC teaching; fifth, the unsupervisedly trained model provides a powerful feature extractor for 3D data, which can be useful for 3D object classification. Experiments demonstrate that the proposed model can generate high-quality 3D shape patterns and can be useful for a wide variety of 3D shape analysis.
@article{xie2020gvoxelnet, title={Generative VoxelNet: Learning Energy-Based Models for 3D Shape Synthesis and Analysis}, author= {Xie, Jianwen and Zheng, Zilong and Gao, Ruiqi and Wang, Wenguan and Zhu, Song-Chun and Wu, Ying Nian}, journal={IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)}, year={2020} }
- Joint Inference of States, Robot Knowledge, and Human (False-)Beliefs ICRA'20Tao Yuan, Hangxin Liu, Lifeng Fan, Zilong Zheng, Tao Gao, Yixin Zhu, and Song-Chun Zhu, in ICRA, 2020.
Aiming to understand how human (false-)belief—a core socio-cognitive ability—would affect human interactions with robots, this paper proposes to adopt a graphical model to unify the representation of object states, robot knowledge, and human (false-)beliefs. Specifically, a parse graph (PG) is learned from a single-view spatiotemporal parsing by aggregating various object states over time; such a learned representation is accumulated as the robot’s knowledge. An inference algorithm is derived to fuse individual PGs from all robots across multiple views into a joint PG, which affords more effective reasoning and inference capability to overcome errors originating from a single view. In the experiments, through joint inference over PGs, the system correctly recognizes human (false-)belief in various settings and achieves better cross-view accuracy on a challenging small-object tracking dataset.
@inproceedings{yuan2020joint, title={Joint Inference of States, Robot Knowledge, and Human (False-)Beliefs}, author={Yuan, Tao and Liu, Hangxin and Fan, Lifeng and Zheng, Zilong and Gao, Tao and Zhu, Yixin and Zhu, Song-Chun}, booktitle={Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)}, year={2020} }
- Motion-Based Generator Model: Unsupervised Disentanglement of Appearance, Trackable and Intrackable Motions in Dynamic Patterns AAAI'20 Oral
Dynamic patterns are characterized by complex spatial and motion patterns. Understanding dynamic patterns requires a disentangled representational model that separates the factorial components. A commonly used model for dynamic patterns is the state space model, where the state evolves over time according to a transition model and the state generates the observed image frames according to an emission model. To model the motions explicitly, it is natural for the model to be based on the motions or the displacement fields of the pixels. Thus, in the emission model, we let the hidden state generate the displacement field, which warps the trackable component in the previous image frame to generate the next frame, while adding a simultaneously emitted residual image to account for the change that cannot be explained by the deformation. The warping of the previous image captures the trackable part of the change between frames, while the residual image captures the intrackable part. We use a maximum likelihood algorithm to learn the model parameters that iterates between inferring latent noise vectors that drive the transition model and updating the parameters given the inferred latent vectors. Meanwhile, we adopt a regularization term to penalize the norms of the residual images to encourage the model to explain the change of image frames by trackable motion. Unlike existing methods on dynamic patterns, we learn our model in an unsupervised setting without ground truth displacement fields or optical flows. In addition, our model defines a notion of intrackability by the separation of the warped component and the residual component in each image frame. We show that our method can synthesize realistic dynamic patterns and disentangle appearance, trackable, and intrackable motions. The learned model can be useful for motion transfer, and it is natural to adopt it to define and measure the intrackability of a dynamic pattern.
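Schematically (notation mine), the emission step described above decomposes each frame into a warped trackable part plus a residual:
x_t = \mathrm{Warp}(x_{t-1}, d_t) + r_t
where the latent state generates both the displacement field d_t and the residual image r_t, and penalizing \|r_t\| pushes the model to explain as much of the frame-to-frame change as possible by trackable motion.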
@article{xie2020motion, title={Motion-Based Generator Model: Unsupervised Disentanglement of Appearance, Trackable and Intrackable Motions in Dynamic Patterns}, author={Xie, Jianwen and Gao, Ruiqi and Zheng, Zilong and Zhu, Song-Chun and Wu, Ying Nian}, journal={The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI)}, year={2020} }
2019
- Reasoning Visual Dialogs with Structural and Partial Observations CVPR'19 Oral
We propose a novel model to address the task of Visual Dialog which exhibits complex dialog structures. To obtain a reasonable answer based on the current question and the dialog history, the underlying semantic dependencies between dialog entities are essential. In this paper, we explicitly formalize this task as inference in a graphical model with partially observed nodes and unknown graph structures (relations in dialog). The given dialog entities are viewed as the observed nodes. The answer to a given question is represented by a node with missing value. We first introduce an Expectation Maximization algorithm to infer both the underlying dialog structures and the missing node values (desired answers). Based on this, we proceed to propose a differentiable graph neural network (GNN) solution that approximates this process. Experiment results on the VisDial and VisDial-Q datasets show that our model outperforms comparative methods. It is also observed that our method can infer the underlying dialog structure for better dialog reasoning.
@inproceedings{zheng2019reasoning, title={Reasoning Visual Dialogs with Structural and Partial Observations}, author={Zheng, Zilong and Wang, Wenguan and Qi, Siyuan and Zhu, Song-Chun}, booktitle={Computer Vision and Pattern Recognition (CVPR), 2019 IEEE Conference on}, year={2019} }
- Learning Dynamic Generator Model by Alternating Back-Propagation Through Time AAAI'19 Spotlight
This paper studies the dynamic generator model for spatial-temporal processes such as dynamic textures and action sequences in video data. In this model, each time frame of the video sequence is generated by a generator model, which is a non-linear transformation of a latent state vector, where the non-linear transformation is parametrized by a top-down neural network. The sequence of latent state vectors follows a non-linear auto-regressive model, where the state vector of the next frame is a non-linear transformation of the state vector of the current frame as well as an independent noise vector that provides randomness in the transition. The non-linear transformation of this transition model can be parametrized by a feedforward neural network. We show that this model can be learned by an alternating back-propagation through time algorithm that iteratively samples the noise vectors and updates the parameters in the transition model and the generator model. We show that our training method can learn realistic models for dynamic textures and action patterns.
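In symbols (my notation, abstracted from the description above), the model couples a non-linear transition with a frame-wise emission:
s_{t+1} = F_\alpha(s_t, \xi_t), \quad \xi_t \sim \mathcal{N}(0, I); \qquad x_t = G_\beta(s_t) + \epsilon_t
where F_\alpha is the feedforward transition network and G_\beta the top-down frame generator; alternating back-propagation through time iterates between inferring the noise vectors \xi_t and updating (\alpha, \beta).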
@article{xie2019DG, title = {Learning Dynamic Generator Model by Alternating Back-Propagation Through Time}, author = {Xie, Jianwen and Gao, Ruiqi and Zheng, Zilong and Zhu, Song-Chun and Wu, Ying Nian}, journal={The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI)}, year = {2019} }
2018
- Learning Descriptor Networks for 3D Shape Synthesis and Analysis CVPR'18 OralJianwen Xie*, Zilong Zheng*, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, and Ying Nian Wu, in CVPR, 2018.
This paper proposes a 3D shape descriptor network, which is a deep convolutional energy-based model, for modeling volumetric shape patterns. The maximum likelihood training of the model follows an “analysis by synthesis” scheme and can be interpreted as a mode seeking and mode shifting process. The model can synthesize 3D shape patterns by sampling from the probability distribution via MCMC such as Langevin dynamics. The model can be used to train a 3D generator network via MCMC teaching. The conditional version of the 3D shape descriptor net can be used for 3D object recovery and 3D object super-resolution. Experiments demonstrate that the proposed model can generate realistic 3D shape patterns and can be useful for 3D shape analysis.
@inproceedings{xie2018learning, title={Learning Descriptor Networks for 3D Shape Synthesis and Analysis}, author={Xie, Jianwen and Zheng, Zilong and Gao, Ruiqi and Wang, Wenguan and Zhu, Song-Chun and Wu, Ying Nian}, booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)}, pages={8629--8638}, year={2018} }