PUBLICATION | ICML'26

UniCode: Augmenting Evaluation for Code Reasoning

Xinyue Zheng^*, Haowei Lin^*, Shaofei Cai, Yaodong Yang, Zilong Zheng^#, and Yitao Liang^#

ICML · 2026 · arXiv: arxiv.org/abs/2510.17868

Abstract

Current coding benchmarks often overstate Large Language Model (LLM) capabilities due to static paradigms and data contamination, allowing models to exploit statistical shortcuts rather than genuine reasoning. To address this, we introduce UniCode, a generative evaluation framework that systematically probes LLM reasoning boundaries via: (1) multi-dimensional augmentation operators to create diverse algorithmic variants; (2) a scalable test generation pipeline achieving 94.5% correctness without human-written solutions; and (3) fine-grained diagnostic metrics for rich error signals. Our evaluation of state-of-the-art models reveals a significant 31.2% performance collapse. Critically, we observe high variance across different reasoning axes, revealing a profound fragility under structural shifts despite surface-level robustness. Furthermore, we identify a "seed-problem regression," where models fail by defaulting to memorized seed logic and inefficient complexities.

Citation

@inproceedings{zheng2026unicode,
    title={UniCode: Augmenting Evaluation for Code Reasoning},
    author={Xinyue Zheng and Haowei Lin and Shaofei Cai and Yaodong Yang and Zilong Zheng and Yitao Liang},
    booktitle={Proceedings of the 43rd International Conference on Machine Learning},
    year={2026}
}

Related Publications

Li et al., Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space, arXiv, 2026.

Liu et al., RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling, in ICLR, 2026.

Qin et al., Reinforced Query Reasoners for Reasoning-intensive Retrieval Tasks, in EMNLP, 2025.

Zhao et al., Absolute Zero: Reinforced Self-play Reasoning with Zero Data, in NeurIPS, 2025.