Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia

Concordia Authors

NeurIPS D&B Track  ·  2025


Abstract

Large language model (LLM) agents have demonstrated impressive capabilities for social interaction and are increasingly being deployed in situations where they might engage with both human and artificial agents. These interactions represent a critical frontier for LLM-based agents, yet existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. In this paper, we introduce a method for evaluating the ability of LLM-based agents to cooperate in zero-shot, mixed-motive environments using Concordia, a natural language multi-agent simulation environment. Our approach measures human-appropriate cooperative intelligence, emphasizing an agent’s ability to identify and exploit opportunities for mutual gain across diverse partners and contexts. We present empirical results from the NeurIPS 2024 Concordia Contest, in which agents were evaluated on their ability to achieve mutual gains across a suite of diverse scenarios ranging from negotiation to collective action problems. Our findings reveal significant gaps between current agent capabilities and the robust generalization required for reliable cooperation, particularly in scenarios demanding persuasion and norm enforcement.
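
To make the evaluation protocol concrete, the sketch below shows its overall shape in Python: a submitted ("focal") agent is dropped into a held-out scenario alongside a background partner it has never encountered, and its payoff is averaged over repeated episodes. Every name here (ToyNegotiationScenario, GreedyAgent, propose_split, run_episode, evaluate) is a hypothetical stand-in invented for illustration, not Concordia's actual API; in the contest itself, episodes are full natural-language simulations orchestrated by Concordia's Game Master, with scenario-specific scoring.

# Hypothetical sketch of the cross-play evaluation protocol. All interfaces
# are stand-ins for illustration, not Concordia's actual API.
from statistics import mean


class ToyNegotiationScenario:
    """Stand-in scenario: split a pie of size 10 with a scripted partner."""
    name = "toy_negotiation"

    def run_episode(self, focal_agent, seed: int) -> float:
        # The background partner accepts any offer that leaves it at least 4.
        offer = focal_agent.propose_split(pie=10, seed=seed)
        partner_share = 10 - offer
        return float(offer) if partner_share >= 4 else 0.0


class GreedyAgent:
    """Stand-in focal agent: claims as much as the partner will tolerate."""

    def propose_split(self, pie: int, seed: int) -> int:
        return pie - 4  # leave the partner exactly its reservation value


def evaluate(focal_agent, scenarios, episodes: int = 5) -> dict:
    """Average the focal agent's payoff over repeated episodes of each scenario."""
    return {
        scenario.name: mean(
            scenario.run_episode(focal_agent, seed) for seed in range(episodes)
        )
        for scenario in scenarios
    }


if __name__ == "__main__":
    print(evaluate(GreedyAgent(), [ToyNegotiationScenario()]))
    # -> {'toy_negotiation': 6.0}

Note that a greedy agent scores well against this particular scripted partner but would fail against one with different reservation values, which is exactly the kind of brittleness the zero-shot, cross-partner evaluation is designed to expose.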




Citation

@inproceedings{concordia2025,
    title={Evaluating Generalization Capabilities of {LLM}-Based Agents in Mixed-Motive Scenarios Using Concordia},
    author={Chandler Smith and Marwa Abdulhai and Manfred Diaz and Marko Tesic and Rakshit Trivedi and Sasha Vezhnevets and Lewis Hammond and Jesse Clifton and Minsuk Chang and Edgar A. Du{\'e}{\~n}ez-Guzm{\'a}n and John P Agapiou and Jayd Matyas and Danny Karmon and Beining Zhang and Jim Dilkes and Akash Kundu and Jord Nguyen and Emanuel Tewolde and Jebish Purbey and Ram Mohan Rao Kadiyala and Siddhant Gupta and Aliaksei Korshuk and Buyantuev Alexander and Ilya Makarov and Gang Zhao and Rolando Fernandez and Zhihan Wang and Caroline Wang and Jiaxun Cui and Lingyun Xiao and Di Yang Shi and Yoonchang Sung and Arrasy Rahman and Peter Stone and Yipeng Kang and Hyeonggeun Yun and Ananya Ananya and Taehun Cha and Zhiqiang Wu and Elizaveta Tennant and Olivia Macmillan-Scott and Marta Emili Garc{\'\i}a Segura and Diana Riazi and Fuyang Cui and Sriram Ganapathi Subramanian and Toryn Q. Klassen and Nico Schiavone and Mogtaba Alim and Sheila A. McIlraith and Manuel Sebastian Rios Beltran and Oswaldo Pe{\~n}a and Carlos Saith Rodriguez Rojas and Manuela Chacon-Chamorro and Ruben Manrique and Luis Felipe Giraldo and Nicanor Quijano and Yiding Wang and Yuxuan Chen and Fangwei Zhong and Mengmeng Wang and Wenming Tu and Zhaowei Zhang and Ziang Chen and Zixia Jia and Xue Feng and Zilong Zheng and Chichen Lin and Weijian Fan and Chenao Liu and Sneheel Sarangi and Ziyan Wang and Shuqing Shi and Yali Du and Avinaash Anand Kulandaivel and Yang Liu and Wu Ruiyang and Chetan Talele and 陆孙嘉 and Gema Parre{\~n}o Piqueras and Shamika Dhuri and Bain McHale and Tim Baarslag and Dylan Hadfield-Menell and Natasha Jaques and Jose Hernandez-Orallo and Joel Z Leibo},
    booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
    year={2025},
    url={https://openreview.net/forum?id=yG4Fj0voJZ}
}
