BOW: Training Language Models to Reason Over Plausible Next Words
Abstract
Next-word prediction (NWP) trains language models against a single observed continuation, even though many contexts admit multiple plausible next words. Recent RL-based next-word reasoning methods make this tension explicit: they reward a model for producing a rationale that supports one context-conditioned continuation, which can turn a pre-existing preference into a confident, self-justifying trace. We introduce BOW, an RL framework that instead trains models to produce self-contained, neutral, and comprehensive descriptions of the plausible next-word space. BOW's core reward path routes credit through the reasoning trajectory itself: a policy generates a next-word reasoning trace, and a frozen scorer assigns the main reward from that trace alone, without access to the original context. Our full variant, BOW-Reg, adds a lightweight breadth regularizer that discourages premature collapse to narrow answer naming, while keeping the main reward path trace-only. Across ten general reasoning benchmarks, BOW remains competitive with the original instruction models and often outperforms trained baselines. More directly, on SharedRef and HoWN-Simple, BOW-Reg achieves the best referential-ambiguity correctness and lowest single-sense collapse on both LLM backbones we study. Human evaluation further shows that BOW-Reg elicits broader next-word reasoning traces, while the full method remains strong on intrinsic NWP.
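To make the trace-only reward path concrete, the following is a minimal Python sketch under stated assumptions, not the authors' implementation. It assumes the frozen scorer can be treated as a function from a reasoning trace to a probability distribution over candidate next words, and that the main reward is the probability assigned to the observed next word. The names `toy_scorer`, `trace_only_reward`, and `combined_reward`, the toy vocabulary, and the additive form and weight of the breadth term are all hypothetical; the abstract does not specify how the breadth regularizer is computed, so it is passed in as an opaque number.

```python
import re
from typing import Callable, Dict

# Assumed interface: a frozen scorer sees ONLY the reasoning trace and
# returns a probability for each candidate next word.
Scorer = Callable[[str], Dict[str, float]]


def toy_scorer(trace: str) -> Dict[str, float]:
    """Stand-in for a frozen scorer (hypothetical): counts how often each
    word of a tiny fixed vocabulary appears in the trace and normalizes."""
    vocab = ("mat", "sofa", "floor", "roof")
    tokens = re.findall(r"[a-z]+", trace.lower())
    counts = {word: tokens.count(word) for word in vocab}
    total = sum(counts.values())
    if total == 0:
        return {word: 0.0 for word in vocab}
    return {word: count / total for word, count in counts.items()}


def trace_only_reward(trace: str, gold_next_word: str, scorer: Scorer) -> float:
    """Main reward: the scorer's probability for the observed next word.
    The original context is deliberately not an argument, so credit can
    only flow through what the trace itself says."""
    return scorer(trace).get(gold_next_word, 0.0)


def combined_reward(main_reward: float, breadth_term: float, weight: float = 0.1) -> float:
    """Assumed additive combination of the main reward with a breadth term.
    How the breadth term is computed is left unspecified here."""
    return main_reward + weight * breadth_term


if __name__ == "__main__":
    trace = "Plausible continuations include mat, sofa, or floor."
    main = trace_only_reward(trace, "mat", toy_scorer)
    print(main, combined_reward(main, breadth_term=0.5))
```

In this sketch, a trace that names several plausible words spreads the toy scorer's probability mass across them, while a trace that never mentions the observed word earns no main reward. A real scorer would be a frozen language model rather than a word counter.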
Citation
@misc{shen2025bow,
  title={BOW: Training Language Models to Reason Over Plausible Next Words},
  author={Ming Shen and Zhikun Xu and Xiao Ye and Jacob Dineen and Ben Zhou},
  year={2025},
  eprint={2506.13502},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.13502}
}