Zilong Zheng's Homepage

2023

ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab NeurIPS'23

Jieming Cui*, Ziren Gong*, Baoxiong Jia*, Siyuan Huang, Zilong Zheng^✉, Jianzhu Ma^✉, and Yixin Zhu^✉

In The Thirty-Seventh Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS D&B Track) , 2023

The challenge of replicating research results has posed a significant impediment to the field of molecular biology. The advent of modern intelligent systems has led to notable progress in various domains. Consequently, we embarked on an investigation of intelligent monitoring systems as a means of tackling the issue of the reproducibility crisis. Specifically, we first curate a comprehensive multimodal dataset, named ProBio, as an initial step towards this objective. This dataset comprises fine-grained hierarchical annotations intended for the purpose of studying activity understanding in Molecular Biology Lab (BioLab). Next, we devise two challenging benchmarks, transparent solution tracking and multimodal action recognition, to emphasize the unique characteristics and difficulties associated with activity understanding in BioLab settings. Finally, we provide a thorough experimental evaluation of contemporary video understanding models and highlight their limitations in this specialized domain to identify potential avenues for future research. We hope ProBio with associated benchmarks may garner increased focus on modern AI techniques in the realm of molecular biology.

@inproceedings{cui2023probio, title={ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab}, author={Cui, Jieming and Gong, Ziren and Jia, Baoxiong and Huang, Siyuan and Zheng, Zilong and Ma, Jianzhu and Zhu, Yixin}, booktitle={The Thirty-Seventh Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS D&B 2023)}, year={2023} }
DiPlomat: A Dialogue Dataset for Situated Pragmatic Reasoning NeurIPS'23

Hengli Li, Song-Chun Zhu, and Zilong Zheng^✉

In The Thirty-Seventh Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS D&B Track) , 2023

Pragmatic reasoning plays a pivotal role in deciphering implicit meanings that frequently arise in real-life conversations and is essential for the development of communicative social agents. In this paper, we introduce a novel challenge, DiPlomat, aiming at benchmarking machines’ capabilities on pragmatic reasoning and situated conversational understanding. Compared with previous works that treat different figurative expressions (e.g., metaphor, sarcasm) as individual tasks, DiPlomat provides a cohesive framework towards general pragmatic understanding. Our dataset is created using Amazon Mechanical Turk (AMT), resulting in a total of 4,177 multi-turn dialogues. In conjunction with the dataset, we propose two tasks, Pragmatic Identification and Reasoning (PIR) and Conversational Question Answering (CQA). Experimental results with state-of-the-art (SOTA) neural architectures reveal several significant findings: 1) large language models (LLMs) exhibit poor performance in tackling this subjective domain; 2) comprehensive comprehension of context emerges as a critical factor for establishing benign human-machine interactions; 3) current models are deficient in applying pragmatic reasoning. As a result, we call for more attention to improving the ability of context understanding, reasoning, and implied meaning modeling.

@inproceedings{li2023diplomat, title={DiPlomat: A Dialogue Dataset for Situated Pragmatic Reasoning}, author={Li, Hengli and Zhu, Song-Chun and Zheng, Zilong}, booktitle={The Thirty-Seventh Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS D&B 2023)}, year={2023} }
MindDial: Belief Dynamics Tracking with Theory-of-Mind Modeling for Situated Neural Dialogue Generation ICML'23

Shuwen Qiu, Song-Chun Zhu, and Zilong Zheng^✉

In Workshop on Theory-of-Mind at Fortieth International Conference on Machine Learning (ICML) , 2023

Humans talk in free-form while negotiating the expressed meanings or common ground. Despite the impressive conversational abilities of large generative language models, they do not consider individual differences in contextual understanding in a shared situated environment. In this work, we propose MindDial, a novel conversational framework that can generate situated free-form responses to negotiate common ground. We design an explicit mind module that can track three-level beliefs -- the speaker's belief, the speaker's prediction of the listener's belief, and the common belief based on the gap between the first two. Then the speaking act classification head decides whether to continue to talk, end this turn, or take a task-related action. We augment a common ground alignment dataset, MutualFriend, whose goal is to find a single mutual friend based on the free chat between two agents, with belief dynamics annotation. Experiments show that our model with mental state modeling can resemble human responses when aligning common ground while mimicking the natural flow of human conversation. The ablation study further validates that the third-level common belief can aggregate information from the first- and second-order beliefs and align common ground more efficiently.

@inproceedings{qiu2023minddial, title={MindDial: Belief Dynamics Tracking with Theory-of-Mind Modeling for Situated Neural Dialogue Generation}, author={Qiu, Shuwen and Zhu, Song-Chun and Zheng, Zilong}, booktitle={Workshop on Theory-of-Mind at ICML 2023}, year={2023} }
SQA3D: Situated Question Answering in 3D Scenes ICLR'23

Xiaojian Ma*, Silong Yong*, Zilong Zheng^✉, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang^✉

The Eleventh International Conference on Learning Representations (ICLR) , 2023

We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation. Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations. These questions examine a wide spectrum of reasoning capabilities for an intelligent agent, ranging from spatial relation comprehension to commonsense understanding, navigation, and multi-hop reasoning. SQA3D imposes a significant challenge to current multi-modal especially 3D reasoning models. We evaluate various state-of-the-art approaches and find that the best one only achieves an overall score of 47.20%, while amateur human participants can reach 90.06%. We believe SQA3D could facilitate future embodied AI research with stronger situation understanding and reasoning capability.

@inproceedings{ma2022sqa3d, title={SQA3D: Situated Question Answering in 3D Scenes}, author={Ma, Xiaojian and Yong, Silong and Zheng, Zilong and Li, Qing and Liang, Yitao and Zhu, Song-Chun and Huang, Siyuan}, booktitle={International Conference on Learning Representations}, year={2023}, url={https://openreview.net/forum?id=IDJx97BC38} }
Semi-automatic Data Enhancement for Document-Level Relation Extraction with Distant Supervision from Large Language Models EMNLP'23

Junpeng Li*, Zixia Jia*, and Zilong Zheng^✉

In The Conference on Empirical Methods in Natural Language Processing (EMNLP) , 2023

Document-level Relation Extraction (DocRE), which aims to extract relations from a long context, is a critical challenge in achieving fine-grained structural comprehension and generating interpretable document representations. Inspired by recent advances in in-context learning capabilities emergent from large language models (LLMs), such as ChatGPT, we aim to design an automated annotation method with minimum human effort. Unfortunately, vanilla in-context learning is infeasible for document-level Relation Extraction (RE) due to the large number of predefined fine-grained relation types and the uncontrolled generations of LLMs. To tackle this issue, we propose a method integrating a large language model (LLM) and a natural language inference (NLI) module to generate external relation triples, thereby augmenting document-level relation datasets. We demonstrate the effectiveness of our approach by introducing an enhanced dataset known as DocGNRE, which excels in re-annotating numerous long-tail relation types. We are confident that our method holds the potential for broader applications in domain-specific relation type definitions and offers tangible benefits in advancing generalized language semantic comprehension.

@inproceedings{li2023docngre, title={Semi-automatic Data Enhancement for Document-Level Relation Extraction with Distant Supervision from Large Language Models}, author={Li, Junpeng and Jia, Zixia and Zheng, Zilong}, booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)}, year={2023} }
VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions ACL'23

Yuxuan Wang, Zilong Zheng^✉, Xueliang Zhao, Jinpeng Li, Yueqian Wang, and Dongyan Zhao^✉

In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL) , 2023

Video-grounded dialogue understanding is a challenging problem that requires machines to perceive, parse and reason over situated semantics extracted from weakly aligned video and dialogues. Most existing benchmarks treat both modalities the same as a frame-independent visual understanding task, while neglecting the intrinsic attributes in multimodal dialogues, such as scene and topic transitions. In this paper, we present the Video-grounded Scene&Topic AwaRe dialogue (VSTAR) dataset, a large-scale video-grounded dialogue understanding dataset based on 395 TV series. Based on VSTAR, we propose two benchmarks for video-grounded dialogue understanding: scene segmentation and topic segmentation, and one benchmark for video-grounded dialogue generation. Comprehensive experiments are performed on these benchmarks to demonstrate the importance of multimodal information and segments in video-grounded dialogue understanding and generation.

@inproceedings{wang2023vstar, title={VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions}, author={Wang, Yuxuan and Zheng, Zilong and Zhao, Xueliang and Li, Jinpeng and Wang, Yueqian and Zhao, Dongyan}, booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)}, year={2023} }
Modeling Instance Interactions for Joint Information Extraction with Neural High-Order Conditional Random Field ACL'23

Zixia Jia, Zhaohui Yan, Wenjuan Han, Zilong Zheng^✉, and Kewei Tu^✉

In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL) , 2023

Prior works on joint Information Extraction (IE) typically model instance (e.g., event triggers, entities, roles, relations) interactions by representation enhancement, type dependencies scoring, or global decoding. We find that the previous models generally consider binary type dependency scoring of a pair of instances, and leverage local search such as beam search to approximate global solutions. To better integrate cross-instance interactions, in this work, we introduce a joint IE framework (CRFIE) that formulates joint IE as a high-order Conditional Random Field. Specifically, we design binary factors and ternary factors to directly model interactions between not only a pair of instances but also triplets. Then, these factors are utilized to jointly predict labels of all instances. To address the intractability problem of exact high-order inference, we incorporate a high-order neural decoder that is unfolded from a mean-field variational inference method, which achieves consistent learning and inference. The experimental results show that our approach achieves consistent improvements on three IE tasks compared with our baseline and prior work.
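
As a rough illustration of decoding by unrolled mean-field inference, the sketch below runs a fixed number of parallel mean-field updates over label distributions. It is a simplified toy, not the CRFIE decoder: it uses only unary and binary factor scores (ternary factors are omitted), the scores are random rather than produced by a neural network, and the names `mean_field`, `unary`, and `binary` are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def mean_field(unary, binary, iterations=3):
    """Unrolled mean-field updates (assumed toy setup, binary factors only).

    unary:  (n, k) score of each of k labels for each of n instances.
    binary: (n, n, k, k); binary[i, j, a, b] scores label a at instance i
            together with label b at instance j. binary[i, i] must be zero.
    """
    q = softmax(unary)  # initialise beliefs from unary scores alone
    for _ in range(iterations):
        # message[i, a] = sum over j and b of binary[i, j, a, b] * q[j, b]
        message = np.einsum("ijab,jb->ia", binary, q)
        q = softmax(unary + message)
    return q

rng = np.random.default_rng(0)
n_instances, n_labels = 3, 2
unary = rng.normal(size=(n_instances, n_labels))
binary = rng.normal(size=(n_instances, n_instances, n_labels, n_labels))
idx = np.arange(n_instances)
binary[idx, idx] = 0.0  # an instance sends no message to itself

beliefs = mean_field(unary, binary)
predicted_labels = beliefs.argmax(axis=1)
```

Because the number of iterations is fixed, every step is a differentiable array operation, which is what makes it possible to treat such a loop as network layers and train it end to end.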

@inproceedings{jia2023joint, title={Modeling Instance Interactions for Joint Information Extraction with Neural High-Order Conditional Random Field}, author={Jia, Zixia and Yan, Zhaohui and Han, Wenjuan and Zheng, Zilong and Tu, Kewei}, booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)}, year={2023} }
Shuō Wén Jiě Zì: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training ACL'23

Yuxuan Wang, Jianghui Wang, Dongyan Zhao^✉, and Zilong Zheng^✉

In Findings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL-Findings) , 2023

We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters. We name the two core modules of CDBERT Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries and Jiezi refers to the process of enhancing characters' glyph representations with structure understanding. To facilitate dictionary understanding, we propose three pre-training tasks, i.e., Masked Entry Modeling, Contrastive Learning for Synonym and Antonym, and Example Learning. We evaluate our method on both the modern Chinese understanding benchmark CLUE and the ancient Chinese benchmark CCLUE. Moreover, we propose a new polysemy discrimination task, PolyMRC, based on the collected dictionary of ancient Chinese. Our paradigm demonstrates consistent improvements over previous Chinese PLMs across all tasks. Moreover, our approach yields a significant boost in the few-shot setting of ancient Chinese understanding.

@inproceedings{wang2023shuo, title={Shu\={o} W\'{e}n Ji\v{e} Z\`{i}: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training}, author={Wang, Yuxuan and Wang, Jianghui and Zhao, Dongyan and Zheng, Zilong}, booktitle={Findings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)}, year={2023} }

2022

In situ bidirectional human-robot value alignment ScienceRobotics

Luyao Yuan*, Xiaofeng Gao*, Zilong Zheng*, Mark Edmonds, Ying Nian Wu, Federico Rossano, Hongjing Lu, Yixin Zhu, and Song-Chun Zhu

Science Robotics , 2022

A prerequisite for social coordination is bidirectional communication between teammates, each playing two roles simultaneously: as receptive listeners and expressive speakers. For robots working with humans in complex situations with multiple goals that differ in importance, failure to fulfill the expectation of either role could undermine group performance due to misalignment of values between humans and robots. Specifically, a robot needs to serve as an effective listener to infer human users’ intents from instructions and feedback and as an expressive speaker to explain its decision processes to users. Here, we investigate how to foster effective bidirectional human-robot communications in the context of value alignment—collaborative robots and users form an aligned understanding of the importance of possible task goals. We propose an explainable artificial intelligence (XAI) system in which a group of robots predicts users’ values by taking in situ feedback into consideration while communicating their decision processes to users through explanations. To learn from human feedback, our XAI system integrates a cooperative communication model for inferring human values associated with multiple desirable goals. To be interpretable to humans, the system simulates human mental dynamics and predicts optimal explanations using graphical models. We conducted psychological experiments to examine the core components of the proposed computational framework. Our results show that real-time human-robot mutual understanding in complex cooperative tasks is achievable with a learning model based on bidirectional communication. We believe that this interaction framework can shed light on bidirectional value alignment in communicative XAI systems and, more broadly, in future human-machine teaming systems. An explainable artificial intelligence collaboration framework enables in situ bidirectional human-robot value alignment.

@article{ doi:10.1126/scirobotics.abm4183, author = {Luyao Yuan and Xiaofeng Gao and Zilong Zheng and Mark Edmonds and Ying Nian Wu and Federico Rossano and Hongjing Lu and Yixin Zhu and Song-Chun Zhu }, title = {In situ bidirectional human-robot value alignment}, journal = {Science Robotics}, volume = {7}, number = {68}, pages = {eabm4183}, year = {2022}, doi = {10.1126/scirobotics.abm4183}, URL = {https://www.science.org/doi/abs/10.1126/scirobotics.abm4183}, eprint = {https://www.science.org/doi/pdf/10.1126/scirobotics.abm4183} }
SHARP: Search-Based Adversarial Attack for Structured Prediction NAACL'22

Liwen Zhang, Zixia Jia, Wenjuan Han, Zilong Zheng, and Kewei Tu

In Findings of Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) , 2022

Understanding what we genuinely mean instead of what we literally say in conversations is challenging for both humans and machines; yet, this direction is mostly left untouched in modern open-ended dialogue systems. To fill in this gap, we present a grammar-based dialogue dataset, GRICE, designed to bring implicature into pragmatic reasoning in the context of conversations. Our design of GRICE also incorporates other essential aspects of modern dialogue modeling (e.g., coreference). The entire dataset is systematically generated using a hierarchical grammar model, such that each dialogue context has intricate implicatures and is temporally consistent. We further present two tasks, the implicature recovery task followed by the pragmatic reasoning task in conversation, to evaluate the model's reasoning capability. In experiments, we adopt baseline methods that claimed to have pragmatics reasoning capability; the results show a large performance gap between baseline methods and human performance. After integrating a simple module that explicitly reasons about implicature, the model shows an overall performance boost in conversational reasoning. These observations demonstrate the significance of implicature recovery for open-ended dialogue reasoning and call for future research in conversational implicature and conversational reasoning.

@inproceedings{zhang2022sharp, title={SHARP: Search-Based Adversarial Attack for Structured Prediction}, author={Zhang, Liwen and Jia, Zixia and Han, Wenjuan and Zheng, Zilong and Tu, Kewei}, booktitle={Findings of Annual Conference of the North American Chapter of the Association for Computational Linguistics}, year={2022} }
VGStore: A Multimodal Extension to SPARQL for Querying RDF Scene Graph ISWC'22

Yanzeng Li, Zilong Zheng, Wenjuan Han, and Lei Zou

In The 21st International Semantic Web Conference (ISWC) Poster & Demo Track , 2022

Semantic Web technology has successfully facilitated many RDF models with rich data representation methods. It also has the potential ability to represent and store multimodal knowledge bases such as multimodal scene graphs. However, most existing query languages, especially SPARQL, barely explore the implicit multimodal relationships like semantic similarity, spatial relations, etc. We first explored this issue by organizing a large-scale scene graph dataset, namely Visual Genome, in the RDF graph database. Based on the proposed RDF-stored multimodal scene graph, we extended SPARQL queries to answer questions containing relational reasoning about color, spatial, etc. Further demo (i.e., VGStore) shows the effectiveness of customized queries and displaying multimodal data.

@inproceedings{vgstore22iswc, title={VGStore: A Multimodal Extension to SPARQL for Querying RDF Scene Graph}, author={Li, Yanzeng and Zheng, Zilong and Han, Wenjuan and Zou, Lei}, booktitle={The 21st International Semantic Web Conference (ISWC) Poster & Demo Track}, year={2022} }
Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling Oral ICLR'22

Bo Wan, Wenjuan Han, Zilong Zheng, and Tinne Tuytelaars

The Tenth International Conference on Learning Representations (ICLR) , 2022

We introduce a new task, unsupervised vision-language (VL) grammar induction. Given an image-caption pair, the goal is to extract a shared hierarchical structure for both image and language simultaneously. We argue that such structured output, grounded in both modalities, is a clear step towards the high-level understanding of multimodal information. Besides challenges existing in conventional visually grounded grammar induction tasks, VL grammar induction requires a model to capture contextual semantics and perform a fine-grained alignment. To address these challenges, we propose a novel method, CLIORA, which constructs a shared vision-language constituency tree structure with context-dependent semantics for all constituents in different levels of the tree. It computes a matching score between each constituent and image region, trained via contrastive learning. It integrates two levels of fusion, namely at feature-level and at score-level, so as to allow fine-grained alignment. We introduce a new evaluation metric: Critical Concept Recall Rate (CCRR) to explicitly evaluate VL grammar induction, and show a 2.6% improvement over a strong baseline on Flickr30k Entities. We also evaluate our model via two derived tasks, i.e., language grammar induction and phrase grounding, and improve over the state-of-the-art for both.

@inproceedings{wan2022unsupervised, title={Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling}, author={Wan, Bo and Han, Wenjuan and Zheng, Zilong and Tuytelaars, Tinne}, booktitle={The Tenth International Conference on Learning Representations (ICLR)}, year={2022} }
Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships CVPR'22

Chao Lou*, Wenjuan Han, Yuhuan Lin, and Zilong Zheng*

In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) , 2022

Understanding realistic visual scene images together with language descriptions is a fundamental task towards generic visual understanding. Previous works have shown compelling comprehensive results by building hierarchical structures for visual scenes (e.g., scene graphs) and natural languages (e.g., dependency trees), individually. However, how to construct a joint vision-language (VL) structure has barely been investigated. More challenging but worthwhile, we introduce a new task that targets inducing such a joint VL structure in an unsupervised manner. Our goal is to bridge the visual scene graphs and linguistic dependency trees seamlessly. Due to the lack of VL structural data, we start by building a new dataset, VLParse. Rather than using labor-intensive labeling from scratch, we propose an automatic alignment procedure to produce coarse structures followed by human refinement to produce high-quality ones. Moreover, we benchmark our dataset by proposing a contrastive learning (CL)-based framework VLGAE, short for Vision-Language Graph Autoencoder. Our model obtains superior performance on two derived tasks, i.e., language grammar induction and VL phrase grounding. Ablations show the effectiveness of both visual cues and dependency relationships on fine-grained VL structure construction.

@inproceedings{lou2022unsupervised, title={Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships}, author={Lou, Chao and Han, Wenjuan and Lin, Yuhuan and Zheng, Zilong}, booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)}, year={2022} }
Energy-Based Generative Cooperative Saliency Prediction Oral AAAI'22

Jing Zhang, Jianwen Xie, Zilong Zheng, and Nick Barnes

The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI) , 2022

Conventional saliency prediction models typically learn a deterministic mapping from images to the corresponding ground truth saliency maps. In this paper, we study the saliency prediction problem from the perspective of generative models by learning a conditional probability distribution over saliency maps given an image, and treating the prediction as a sampling process. Specifically, we propose a generative cooperative saliency prediction framework based on the generative cooperative networks, where a conditional latent variable model and a conditional energy-based model are jointly trained to predict saliency in a cooperative manner. We call our model the SalCoopNets. The latent variable model serves as a fast but coarse predictor to efficiently produce an initial prediction, which is then refined by the iterative Langevin revision of the energy-based model that serves as a fine predictor. Such a coarse-to-fine cooperative saliency prediction strategy offers the best of both worlds. Moreover, we generalize our framework to the scenario of weakly supervised saliency prediction, where saliency annotation of training images is partially observed, by proposing a cooperative learning while recovering strategy. Lastly, we show that the learned energy function can serve as a refinement module that can refine the results of other pretrained saliency prediction models. Experimental results show that our generative model can achieve state-of-the-art performance.

@inproceedings{zhang2022energy, title = {Energy-Based Generative Cooperative Saliency Prediction}, author = {Zhang, Jing and Xie, Jianwen and Zheng, Zilong and Barnes, Nick}, booktitle={The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI)}, year = {2022} }

2021

Cooperative Training of Fast Thinking Initializer and Slow Thinking Solver for Multi-Modal Conditional Learning TPAMI

Jianwen Xie*, Zilong Zheng*, Xiaolin Fang, Song-Chun Zhu, and Ying Nian Wu

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) , 2021

This paper studies the supervised learning of the conditional distribution of a high-dimensional output given an input, where the output and input may belong to two different modalities, e.g., the output is a photo image and the input is a sketch image. We solve this problem by cooperative training of a fast thinking initializer and a slow thinking solver. The initializer generates the output directly by a non-linear transformation of the input as well as a noise vector that accounts for latent variability in the output. The slow thinking solver learns an objective function in the form of a conditional energy function, so that the output can be generated by optimizing the objective function, or more rigorously by sampling from the conditional energy-based model. We propose to learn the two models jointly, where the fast thinking initializer serves to initialize the sampling of the slow thinking solver, and the solver refines the initial output by an iterative algorithm. The solver learns from the difference between the refined output and the observed output, while the initializer learns from how the solver refines its initial output. We demonstrate the effectiveness of the proposed method on various multi-modal conditional learning tasks, e.g., class-to-image generation, image-to-image translation, and image recovery.
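
The initialize-then-refine sampling step described above can be sketched on a toy problem. The example below is only an illustration under loud assumptions: the energy is a hand-written quadratic rather than a learned network, the initializer is a deliberately poor linear map, and the learning updates for both models are omitted. All names (`initializer`, `energy`, `langevin_refine`, `target_map`) are hypothetical.

```python
import numpy as np

def initializer(x, weights, rng):
    """Fast thinking: map the input plus a noise vector directly to an output."""
    z = rng.normal(size=weights.shape[0])
    return weights @ x + 0.1 * z

def energy(y, x, target_map):
    """Toy conditional energy (assumption): lowest when y equals target_map @ x."""
    return 0.5 * np.sum((y - target_map @ x) ** 2)

def langevin_refine(y, x, target_map, rng, steps=100, step_size=0.1):
    """Slow thinking: refine y by Langevin dynamics on the conditional energy."""
    for _ in range(steps):
        grad = y - target_map @ x  # gradient of the quadratic energy w.r.t. y
        noise = rng.normal(size=y.shape)
        y = y - 0.5 * step_size * grad + np.sqrt(step_size) * noise
    return y

rng = np.random.default_rng(0)
x = np.ones(3)
target_map = rng.normal(size=(4, 3))
poor_weights = target_map + 5.0  # initializer that is far from the energy minimum

y_initial = initializer(x, poor_weights, rng)
y_refined = langevin_refine(y_initial, x, target_map, rng)

energy_initial = energy(y_initial, x, target_map)
energy_refined = energy(y_refined, x, target_map)
```

In this toy setting the refined output ends up at much lower energy than the initial one. In the cooperative scheme, the gap between `y_initial` and `y_refined` is the signal the initializer would learn from.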

@article{xie2021cooperative, title={Cooperative Training of Fast Thinking Initializer and Slow Thinking Solver for Multi-Modal Conditional Learning}, author={Xie, Jianwen and Zheng, Zilong and Fang, Xiaolin and Zhu, Song-Chun and Wu, Ying Nian}, journal={IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)}, abbr={TPAMI}, year={2021} }
Learning Triadic Belief Dynamics in Nonverbal Communication from Videos Oral CVPR'21

Lifeng Fan, Shuwen Qiu, Zilong Zheng, Tao Gao, Song-Chun Zhu, and Yixin Zhu

In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) , 2021

Humans possess a unique social cognition capability; nonverbal communication can convey rich social information among agents. In contrast, such crucial social characteristics are mostly missing in the existing scene understanding literature. In this paper, we incorporate different nonverbal communication cues (e.g., gaze, human poses, and gestures) to represent, model, learn, and infer agents’ mental states from pure visual inputs. Crucially, such a mental representation takes the agent’s belief into account so that it represents what the true world state is and infers the beliefs in each agent’s mental state, which may differ from the true world states. By aggregating different beliefs and true world states, our model essentially forms “five minds” during the interactions between two agents. This “five minds” model differs from prior works that infer beliefs in an infinite recursion; instead, agents’ beliefs are converged into a “common mind”. Based on this representation, we further devise a hierarchical energy-based model that jointly tracks and predicts all five minds. From this new perspective, a social event is interpreted by a series of nonverbal communication and belief dynamics, which transcends the classic keyframe video summary. In the experiments, we demonstrate that using such a social account provides a better video summary on videos with rich social interactions compared with state-of-the-art keyframe video summary methods.

@inproceedings{fan2021learning, title = {Learning Triadic Belief Dynamics in Nonverbal Communication from Videos}, author = {Lifeng Fan and Shuwen Qiu and Zilong Zheng and Tao Gao and Song-Chun Zhu and Yixin Zhu}, year = {2021}, booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)} }
Patchwise Generative ConvNet: Training Energy-Based Models from a Single Natural Image for Internal Learning Oral CVPR'21

Zilong Zheng, Jianwen Xie, and Ping Li

In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) , 2021

Exploiting internal statistics of a single natural image has long been recognized as a significant research paradigm where the goal is to learn the internal distribution of patches within the image without relying on external training data. Different from prior works that model such a distribution implicitly with a top-down latent variable model (e.g., generator), this paper proposes to explicitly represent the statistical distribution within a single natural image by using an energy-based generative framework, where a pyramid of energy functions, each parameterized by a bottom-up deep neural network, are used to capture the distributions of patches at different resolutions. Meanwhile, a coarse-to-fine sequential training and sampling strategy is presented to train the model efficiently. Besides learning to generate random samples from white noise, the model can learn in parallel with a self-supervised task (e.g., recover the input image from its corrupted version), which can further improve the descriptive power of the learned model. The proposed model is simple and natural in that it does not require an auxiliary model (e.g., discriminator) to assist the training. Besides, it also unifies internal statistics learning and image generation in a single framework. Experimental results presented on various image generation and manipulation tasks, including super-resolution, image editing, harmonization, style transfer, etc., have demonstrated the effectiveness of our model for internal learning.

@inproceedings{zheng2021patchgencn, title={Patchwise Generative ConvNet: Training Energy-Based Models from a Single Natural Image for Internal Learning}, author={Zheng, Zilong and Xie, Jianwen and Li, Ping}, booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)}, year={2021} }
Generative PointNet: Deep Energy-Based Learning on Unordered Point Sets for 3D Generation, Reconstruction and Classification CVPR'21

Jianwen Xie, Yifei Xu, Zilong Zheng, Song-Chun Zhu, and Ying Nian Wu

In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) , 2021

We propose a generative model of unordered point sets, such as point clouds, in the form of an energy-based model, where the energy function is parameterized by an input-permutation-invariant bottom-up neural network. The energy function learns a coordinate encoding of each point and then aggregates all individual point features into energy for the whole point cloud. We show that our model can be derived from the discriminative PointNet. The model can be trained by MCMC-based maximum likelihood learning (as well as its variants), without the help of any assisting networks like those in GANs and VAEs. Unlike most point cloud generators, which rely on hand-crafted distance metrics, our model needs no such metric for point cloud generation, because it synthesizes point clouds by matching observed examples in terms of statistical properties defined by the energy function. Furthermore, we can learn a short-run MCMC toward the energy-based model as a flow-like generator for point cloud reconstruction and interpretation. The learned point cloud representation can also be useful for point cloud classification. Experiments demonstrate the advantages of the proposed generative model of point clouds.

@inproceedings{xie2021GPointent, title={Generative PointNet: Deep Energy-Based Learning on Unordered Point Sets for 3D Generation, Reconstruction and Classification}, author={Xie, Jianwen and Xu, Yifei and Zheng, Zilong and Zhu, Song-Chun and Wu, Ying Nian}, booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)}, year={2021} }
GRICE: A Grammar-based Dataset for Recovering Implicature and Conversational rEasoning ACL'21

Zilong Zheng, Shuwen Qiu, Lifeng Fan, Yixin Zhu, and Song-Chun Zhu

In Findings of the Association for Computational Linguistics: ACL-IJCNLP (ACL-Findings) , 2021

Understanding what we genuinely mean instead of what we literally say in conversations is challenging for both humans and machines; yet, this direction is mostly left untouched in modern open-ended dialogue systems. To fill in this gap, we present a grammar-based dialogue dataset, GRICE, designed to bring implicature into pragmatic reasoning in the context of conversations. Our design of GRICE also incorporates other essential aspects of modern dialogue modeling (e.g., coreference). The entire dataset is systematically generated using a hierarchical grammar model, such that each dialogue context has intricate implicatures and is temporally consistent. We further present two tasks, the implicature recovery task followed by the pragmatic reasoning task in conversation, to evaluate the model's reasoning capability. In experiments, we adopt baseline methods that claimed to have pragmatics reasoning capability; the results show a large performance gap between baseline methods and human performance. After integrating a simple module that explicitly reasons about implicature, the model shows an overall performance boost in conversational reasoning. These observations demonstrate the significance of implicature recovery for open-ended dialogue reasoning and call for future research in conversational implicature and conversational reasoning.

@inproceedings{zheng2021implicature, title={GRICE: A Grammar-based Dataset for Recovering Implicature and Conversational Reasoning}, author={Zheng, Zilong and Qiu, Shuwen and Fan, Lifeng and Zhu, Yixin and Zhu, Song-Chun}, booktitle={Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021}, year={2021}, pages = {2074--2085} }
Learning Energy-Based Model with Variational Auto-Encoder as Amortized Sampler AAAI'21

Jianwen Xie, Zilong Zheng, and Ping Li

The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI) , 2021

Due to the intractable partition function, training energy-based models (EBMs) by maximum likelihood requires Markov chain Monte Carlo (MCMC) sampling to approximate the gradient of the Kullback–Leibler divergence between data and model distributions. However, it is non-trivial to sample from an EBM because of the difficulty of mixing between modes. In this paper, we propose to learn a variational auto-encoder (VAE) to initialize the finite-step MCMC, such as Langevin dynamics that is derived from the energy function, for efficient amortized sampling of the EBM. With these amortized MCMC samples, the EBM can be trained by maximum likelihood, which follows an “analysis by synthesis” scheme; while the variational auto-encoder learns from these MCMC samples via variational Bayes. We call this joint training algorithm the variational MCMC teaching, in which the VAE chases the EBM toward data distribution. We interpret the learning algorithm as a dynamic alternating projection in the context of information geometry. Our proposed models can generate samples comparable to GANs and EBMs. Additionally, we demonstrate that our models can learn effective probabilistic distribution toward supervised conditional learning experiments.

@article{xie2021vaeebm, title={Learning Energy-Based Model with Variational Auto-Encoder as Amortized Sampler}, author={Xie, Jianwen and Zheng, Zilong and Li, Ping}, journal={The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI)}, year={2021} }
Learning Cycle-Consistent Cooperative Networks via Alternating MCMC Teaching for Unsupervised Cross-Domain Translation AAAI'21

Jianwen Xie*, Zilong Zheng*, Xiaolin Fang, Song-Chun Zhu, and Ying Nian Wu

The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI) , 2021

This paper studies the unsupervised cross-domain translation problem by proposing a generative framework, in which the probability distribution of each domain is represented by a generative cooperative network that consists of an energy-based model and a latent variable model. The use of the generative cooperative network enables maximum likelihood learning of the domain model by MCMC teaching, where the energy-based model seeks to fit the data distribution of the domain and distills its knowledge to the latent variable model via MCMC. Specifically, in the MCMC teaching process, the latent variable model parameterized by an encoder-decoder maps examples from the source domain to the target domain, while the energy-based model further refines the mapped results by Langevin revision such that the revised results match the examples in the target domain in terms of the statistical properties, which are defined by the learned energy function. To build up a correspondence between two unpaired domains, the proposed framework simultaneously learns a pair of cooperative networks with cycle consistency, accounting for a two-way translation between two domains, by alternating MCMC teaching. Experiments show that the proposed framework is useful for unsupervised image-to-image translation and unpaired image sequence translation.

@article{xie2021cycle, title={Learning Cycle-Consistent Cooperative Networks via Alternating MCMC Teaching for Unsupervised Cross-Domain Translation}, author={Xie, Jianwen and Zheng, Zilong and Fang, Xiaolin and Zhu, Song-Chun and Wu, Ying Nian}, journal={The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI)}, year={2021} }

2020

Generative VoxelNet: Learning Energy-Based Models for 3D Shape Synthesis and Analysis TPAMI

Jianwen Xie*, Zilong Zheng*, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, and Ying Nian Wu

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) , 2020

3D data that contains rich geometry information of objects and scenes is a valuable asset for understanding the 3D physical world. With the recent emergence of large-scale 3D datasets, it becomes increasingly crucial to have a powerful 3D generative model for 3D shape synthesis and analysis. This paper proposes a 3D shape descriptor network, which is a deep 3D convolutional energy-based model, for representing volumetric shape patterns. The maximum likelihood training of the model follows an “analysis by synthesis” scheme. The benefits of the proposed model are five-fold: first, unlike GANs and VAEs, the training of the model does not rely on any auxiliary models; second, the model can synthesize realistic 3D shapes by sampling from the probability distribution via MCMC, such as Langevin dynamics; third, the conditional version of the model can be applied to 3D object recovery and super-resolution; fourth, the model can be used to train a 3D generator network via MCMC teaching; fifth, the model, trained without supervision, provides a powerful feature extractor for 3D data, which can be useful for 3D object classification. Experiments demonstrate that the proposed model can generate high-quality 3D shape patterns and can be useful for a wide variety of 3D shape analysis.

@article{xie2020gvoxelnet, title={Generative VoxelNet: Learning Energy-Based Models for 3D Shape Synthesis and Analysis}, author= {Xie, Jianwen and Zheng, Zilong and Gao, Ruiqi and Wang, Wenguan and Zhu, Song-Chun and Wu, Ying Nian}, journal={IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)}, year={2020} }
Joint Inference of States, Robot Knowledge, and Human (False-)Beliefs ICRA'20

Tao Yuan, Hangxin Liu, Lifeng Fan, Zilong Zheng, Tao Gao, Yixin Zhu, and Song-Chun Zhu

In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) , 2020

Aiming to understand how human (false-)belief, a core socio-cognitive ability, would affect human interactions with robots, this paper proposes to adopt a graphical model to unify the representation of object states, robot knowledge, and human (false-)beliefs. Specifically, a parse graph (PG) is learned from a single-view spatiotemporal parsing by aggregating various object states over time; such a learned representation is accumulated as the robot’s knowledge. An inference algorithm is derived to fuse the individual PGs from all robots across multiple views into a joint PG, which affords more effective reasoning and inference capability to overcome errors originating from a single view. In the experiments, through the joint inference over PGs, the system correctly recognizes human (false-)belief in various settings and achieves better cross-view accuracy on a challenging small object tracking dataset.

@inproceedings{yuan2020joint, title={Joint Inference of States, Robot Knowledge, and Human (False-)Beliefs}, author={Yuan, Tao and Liu, Hangxin and Fan, Lifeng and Zheng, Zilong and Gao, Tao and Zhu, Yixin and Zhu, Song-Chun}, booktitle={Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)}, year={2020} }
Motion-Based Generator Model: Unsupervised Disentanglement of Appearance, Trackable and Intrackable Motions in Dynamic Patterns Oral AAAI'20

Jianwen Xie*, Ruiqi Gao*, Zilong Zheng, Song-Chun Zhu, and Ying Nian Wu

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI) , 2020

Dynamic patterns are characterized by complex spatial and motion patterns. Understanding dynamic patterns requires a disentangled representational model that separates the factorial components. A commonly used model for dynamic patterns is the state space model, where the state evolves over time according to a transition model and the state generates the observed image frames according to an emission model. To model the motions explicitly, it is natural for the model to be based on the motions or the displacement fields of the pixels. Thus, in the emission model, we let the hidden state generate the displacement field, which warps the trackable component in the previous image frame to generate the next frame while adding a simultaneously emitted residual image to account for the change that cannot be explained by the deformation. The warping of the previous image is about the trackable part of the change of image frame, while the residual image is about the intrackable part of the image. We use a maximum likelihood algorithm to learn the model parameters that iterates between inferring latent noise vectors that drive the transition model and updating the parameters given the inferred latent vectors. Meanwhile, we adopt a regularization term to penalize the norms of the residual images to encourage the model to explain the change of image frames by trackable motion. Unlike existing methods on dynamic patterns, we learn our model in an unsupervised setting without ground truth displacement fields or optical flows. In addition, our model defines a notion of intrackability by the separation of warped component and residual component in each image frame. We show that our method can synthesize realistic dynamic patterns and disentangle appearance, trackable and intrackable motions. The learned models can be useful for motion transfer, and it is natural to adopt them to define and measure the intrackability of a dynamic pattern.

@article{xie2020motion, title={Motion-Based Generator Model: Unsupervised Disentanglement of Appearance, Trackable and Intrackable Motions in Dynamic Patterns}, author={Xie, Jianwen and Gao, Ruiqi and Zheng, Zilong and Zhu, Song-Chun and Wu, Ying Nian}, journal={The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI)}, year={2020} }

2019

Reasoning Visual Dialogs with Structural and Partial Observations Oral CVPR'19

Zilong Zheng*, Wenguan Wang*, Siyuan Qi*, and Song-Chun Zhu

In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) , 2019

We propose a novel model to address the task of Visual Dialog, which exhibits complex dialog structures. To obtain a reasonable answer based on the current question and the dialog history, the underlying semantic dependencies between dialog entities are essential. In this paper, we explicitly formalize this task as inference in a graphical model with partially observed nodes and unknown graph structures (relations in dialog). The given dialog entities are viewed as the observed nodes. The answer to a given question is represented by a node with a missing value. We first introduce an Expectation Maximization algorithm to infer both the underlying dialog structures and the missing node values (desired answers). Based on this, we proceed to propose a differentiable graph neural network (GNN) solution that approximates this process. Experimental results on the VisDial and VisDial-Q datasets show that our model outperforms comparative methods. It is also observed that our method can infer the underlying dialog structure for better dialog reasoning.

@inproceedings{zheng2019reasoning, title={Reasoning Visual Dialogs with Structural and Partial Observations}, author={Zheng, Zilong and Wang, Wenguan and Qi, Siyuan and Zhu, Song-Chun}, booktitle={Computer Vision and Pattern Recognition (CVPR), 2019 IEEE Conference on}, year={2019} }
Learning Dynamic Generator Model by Alternating Back-Propagation Through Time Spotlight AAAI'19

Jianwen Xie*, Ruiqi Gao*, Zilong Zheng, Song-Chun Zhu, and Ying Nian Wu

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI) , 2019

This paper studies the dynamic generator model for spatial-temporal processes such as dynamic textures and action sequences in video data. In this model, each time frame of the video sequence is generated by a generator model, which is a non-linear transformation of a latent state vector, where the non-linear transformation is parametrized by a top-down neural network. The sequence of latent state vectors follows a non-linear auto-regressive model, where the state vector of the next frame is a non-linear transformation of the state vector of the current frame as well as an independent noise vector that provides randomness in the transition. The non-linear transformation of this transition model can be parametrized by a feedforward neural network. We show that this model can be learned by an alternating back-propagation through time algorithm that iteratively samples the noise vectors and updates the parameters in the transition model and the generator model. We show that our training method can learn realistic models for dynamic textures and action patterns.

@article{xie2019DG, title = {Learning Dynamic Generator Model by Alternating Back-Propagation Through Time}, author = {Xie, Jianwen and Gao, Ruiqi and Zheng, Zilong and Zhu, Song-Chun and Wu, Ying Nian}, journal={The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI)}, year = {2019} }

2018

Learning Descriptor Networks for 3D Shape Synthesis and Analysis Oral CVPR'18

Jianwen Xie*, Zilong Zheng*, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, and Ying Nian Wu

In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) , 2018

This paper proposes a 3D shape descriptor network, which is a deep convolutional energy-based model, for modeling volumetric shape patterns. The maximum likelihood training of the model follows an “analysis by synthesis” scheme and can be interpreted as a mode seeking and mode shifting process. The model can synthesize 3D shape patterns by sampling from the probability distribution via MCMC such as Langevin dynamics. The model can be used to train a 3D generator network via MCMC teaching. The conditional version of the 3D shape descriptor net can be used for 3D object recovery and 3D object super-resolution. Experiments demonstrate that the proposed model can generate realistic 3D shape patterns and can be useful for 3D shape analysis.

@inproceedings{xie2018learning, title={Learning Descriptor Networks for 3D Shape Synthesis and Analysis}, author={Xie, Jianwen and Zheng, Zilong and Gao, Ruiqi and Wang, Wenguan and Zhu, Song-Chun and Wu, Ying Nian}, booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)}, pages={8629--8638}, year={2018} }
