BOW: Training Language Models to Reason Over Plausible Next Words

Ming Shen; Zhikun Xu; Xiao Ye; Jacob Dineen; Ben Zhou

Pending EMNLP 2026

Abstract

Next-word prediction (NWP) trains language models against a single observed continuation, even though many contexts admit multiple plausible next words. Recent RL-based next-word reasoning methods make this tension explicit: they reward a model for producing a rationale that supports one context-conditioned continuation, which can turn a pre-existing preference into a confident, self-justifying trace. We introduce BOW, an RL framework that instead trains models to produce self-contained, neutral, and comprehensive descriptions of the plausible next-word space. BOW's core reward path routes credit through the reasoning trajectory itself: a policy generates a next-word reasoning trace, and a frozen scorer assigns the main reward from that trace alone, without access to the original context. Our full variant, BOW-Reg, adds a lightweight breadth regularizer that discourages premature collapse to narrow answer naming, while keeping the main reward path trace-only. Across ten general reasoning benchmarks, BOW remains competitive with the original instruction models and often outperforms trained baselines. More directly, on SharedRef and HoWN-Simple, BOW-Reg achieves the best referential-ambiguity correctness and lowest single-sense collapse on both LLM backbones we study. Human evaluation further shows that BOW-Reg elicits broader next-word reasoning traces, while the full method remains strong on intrinsic NWP.

Citation

@misc{shen2025bow,
  title={BOW: Training Language Models to Reason Over Plausible Next Words},
  author={Ming Shen and Zhikun Xu and Xiao Ye and Jacob Dineen and Ben Zhou},
  year={2025},
  eprint={2506.13502},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.13502}
}