X-Talk: On the Underestimated Potential of Modular Speech-to-Speech Dialogue System

Zhanxun Liu, Yifan Duan, Mengmeng Wang, Pengchao Feng, Haotian Zhang, Xiaoyu Xing, Yijia Shan, Haina Zhu, Yuhang Dai, Chaochao Lu, Xipeng Qiu, Lei Xie, Lan Wang, Nan Yan, Zilong Zheng, Ziyang Ma, Kai Yu, and Xie Chen

2026 · arXiv: arxiv.org/abs/2512.18706

Abstract

We present X-Talk, an open-source framework that champions a decoupled, modular design for LLM-driven speech-to-speech (S2S) systems. While the dominant trend favors end-to-end (E2E) modeling to optimize information flow, these “omni-models” often struggle to balance the competing objectives of complex speech tasks within a single network. X-Talk challenges this paradigm by demonstrating that a systematically optimized cascaded pipeline can achieve sub-second latency without sacrificing modular flexibility. Our framework seamlessly integrates specialized front-end components (e.g., VAD, speech enhancement) and diverse understanding models (e.g., ASR, emotion, and environmental sound analysis) with LLM capabilities like retrieval-augmented generation (RAG) and tool use. By revitalizing the cascaded approach, X-Talk highlights the underestimated potential of modular S2S systems and provides a robust foundation for future research and applications.
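
To make the idea of a cascaded, modular S2S pipeline concrete, here is a minimal sketch in Python. It is not X-Talk's actual API: the `CascadedPipeline` class, the shared context dictionary, and the `toy_*` stages are all hypothetical stand-ins chosen for illustration. The point is only that each stage (VAD, ASR, LLM, TTS) sits behind the same small interface, so any one of them can be swapped without touching the others.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stage interface: each stage reads and extends a shared context.
Stage = Callable[[Dict], Dict]


@dataclass
class CascadedPipeline:
    """Runs a list of independent stages in order (illustrative only)."""
    stages: List[Stage] = field(default_factory=list)

    def add(self, stage: Stage) -> "CascadedPipeline":
        self.stages.append(stage)
        return self

    def run(self, audio: List[float]) -> Dict:
        ctx: Dict = {"audio": audio}
        for stage in self.stages:
            ctx = stage(ctx)
        return ctx


def toy_vad(ctx: Dict) -> Dict:
    # Stand-in for voice activity detection: keep samples above a threshold.
    ctx["speech"] = [s for s in ctx["audio"] if abs(s) > 0.1]
    return ctx


def toy_asr(ctx: Dict) -> Dict:
    # Stand-in for speech recognition: emit a fixed transcript if speech exists.
    ctx["text"] = "hello" if ctx["speech"] else ""
    return ctx


def toy_llm(ctx: Dict) -> Dict:
    # Stand-in for the LLM: echo the transcript back as a reply.
    ctx["reply"] = f"You said: {ctx['text']}" if ctx["text"] else ""
    return ctx


def toy_tts(ctx: Dict) -> Dict:
    # Stand-in for speech synthesis: one silent sample per reply character.
    ctx["reply_audio"] = [0.0] * len(ctx["reply"])
    return ctx


pipeline = CascadedPipeline().add(toy_vad).add(toy_asr).add(toy_llm).add(toy_tts)
result = pipeline.run([0.0, 0.5, -0.3, 0.02])
```

Replacing `toy_asr` with a different recognizer, or inserting a speech-enhancement stage before it, only changes the `add` calls. That is the flexibility the abstract attributes to the modular design. A real system would also stream data between stages to reach low latency, which this sketch does not attempt.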

Citation

@misc{liu2025xtalkunderestimatedpotentialmodular,
      title={X-Talk: On the Underestimated Potential of Modular Speech-to-Speech Dialogue System}, 
      author={Zhanxun Liu and Yifan Duan and Mengmeng Wang and Pengchao Feng and Haotian Zhang and Xiaoyu Xing and Yijia Shan and Haina Zhu and Yuhang Dai and Chaochao Lu and Xipeng Qiu and Lei Xie and Lan Wang and Nan Yan and Zilong Zheng and Ziyang Ma and Kai Yu and Xie Chen},
      year={2025},
      eprint={2512.18706},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2512.18706}, 
}
