Authors:

(1) Chaojie Wang*, Skywork AI;

(2) Yanchen Deng*, Nanyang Technological University;

(3) Zhiyi Lyu, Nanyang Technological University;

(4) Liang Zeng, Skywork AI;

(5) Jujie He, Skywork AI.

Abstract and 1. Introduction

  2. Related Works
  3. Preliminary
  4. Q*: A General, Versatile and Agile Deliberation Framework for LLMs
  5. Experiments
  6. Conclusion and References

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in many natural language tasks. However, the auto-regressive generation process makes LLMs prone to producing errors, hallucinations and inconsistent statements when performing multi-step reasoning. In this paper, by casting the multi-step reasoning of LLMs as a heuristic search problem, we aim to alleviate this pathology by introducing Q*, a general, versatile and agile framework for guiding the decoding process of LLMs with deliberative planning. By learning a plug-and-play Q-value model as a heuristic function for estimating expected future rewards, our Q* can effectively guide LLMs to select the most promising next reasoning step without fine-tuning LLMs for the current task, which avoids the significant computational overhead and the potential risk of performance degeneration on other tasks. Extensive experiments on GSM8K, MATH and MBPP demonstrate the superiority of our method, contributing to improving the reasoning performance of existing open-source LLMs.

1 Introduction

Large Language Models (LLMs) have exhibited impressive capabilities in solving various reasoning tasks posed in natural language, including math word problems [1–6], code generation [7–9] and planning [10–12]. Unfortunately, even the most advanced LLMs still face significant challenges and are prone to introducing errors, hallucinations and inconsistent statements as the number of reasoning steps grows, due to their auto-regressive nature [13, 14]. In fact, the auto-regressive generation process of LLMs can be characterized as “System 1” [15], a mode of thought that is fast and instinctive but less accurate. Most recent works focus on improving LLMs’ “System 1” capability by (1) constructing sophisticated prompts with extensive expertise to trigger the potential capacities of LLMs without modifying their parameters [16–19], (2) fine-tuning LLMs with massive task-specific corpora at the price of significant computational burdens and the potential risk of performance degeneration on other tasks [5, 6, 20, 21], or (3) training reward models to rank the candidate responses [4, 22–24].

On the other hand, solving complex reasoning problems requires more in-depth, deliberative and logical thinking, i.e., the “System 2” mode [15]. Taking math word problems as an example, any incorrect intermediate reasoning step (e.g., a calculation error or misinterpretation) can lead to an incorrect final answer. Prior attempts at enhancing the “System 2” reasoning capability [25–28] include performing deliberation with basic tree-search algorithms (e.g., BFS or DFS), Monte Carlo Tree Search (MCTS) [29], and A* [30]. Nonetheless, the utility functions used in these methods often require laborious, task-specific expertise to design, making them difficult to extend to new scenarios. Furthermore, deliberation with MCTS requires a significant number of rollouts before finding high-quality responses on problems with many reasoning steps, which substantially slows down the overall decoding process.

In light of this, we propose Q*, a general, versatile and agile framework for improving the multi-step reasoning capability of LLMs with deliberative planning. Different from existing deliberation methods, our method does not rely on domain knowledge to design the heuristic function. Besides, by leveraging a plug-and-play Q-value model as the heuristic function, our Q* can effectively solve various tasks by guiding LLMs to select the most promising next step without fine-tuning LLMs beforehand, which avoids the significant computational overhead and the potential risk of performance degeneration on other tasks. Finally, Q* considers only a single step at a time when performing deliberation, which is much cheaper than completing rollouts in MCTS. Specifically, the main contributions of our work are summarized as follows:

• We formalize the multi-step reasoning of LLMs as a Markov Decision Process (MDP), where the state is the concatenation of the input prompt and the reasoning steps generated so far, the action is the next reasoning step, and the reward measures how well the task is solved.

• We present several general approaches to estimating the optimal Q-value of state-action pairs, i.e., offline reinforcement learning, the best sequence from rollouts, and completion with stronger LLMs. It is noteworthy that our methods only need the ground truth of the training problems and can be easily applied to various reasoning tasks without modification.

• We cast solving multi-step reasoning tasks as a heuristic search problem, where the objective is to find the most appropriate reasoning trace with maximum utility. Built upon A* search, our deliberation framework, Q*, leverages plug-and-play Q-value models as the heuristic function and guides LLMs to select the most promising next reasoning step in a best-first fashion (see the sketch after this list).

• We conduct extensive experiments on math reasoning and code generation tasks, demonstrating that Q* can significantly improve the multi-step reasoning capability of existing open-source LLMs.
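To make the formulation concrete, the minimal Python sketch below illustrates the MDP view and the A*-style best-first loop outlined above. The interfaces llm.propose_steps, llm.is_terminal and q_model.score are hypothetical placeholders introduced purely for illustration, not APIs from the paper, and the frontier is ranked by the Q-value heuristic alone; the details of how the utility of a partial reasoning trace is computed are omitted here.

```python
import heapq
from dataclasses import dataclass, field

# A minimal, hypothetical sketch of Q*-style deliberation; `llm.propose_steps(state, k)`,
# `llm.is_terminal(state)` and `q_model.score(state, step)` are assumed interfaces,
# not APIs from the paper.

@dataclass(order=True)
class Node:
    priority: float                    # negated heuristic, so the min-heap pops the best node
    state: str = field(compare=False)  # input prompt + reasoning steps generated so far

def q_star_search(prompt, llm, q_model, k=4, max_expansions=64):
    """Best-first (A*-style) search guided by a plug-and-play Q-value model."""
    frontier = [Node(0.0, prompt)]
    for _ in range(max_expansions):
        if not frontier:
            break
        node = heapq.heappop(frontier)                 # most promising partial reasoning trace
        if llm.is_terminal(node.state):                # a complete answer has been reached
            return node.state
        for step in llm.propose_steps(node.state, k):  # sample k candidate next reasoning steps
            h = q_model.score(node.state, step)        # estimated expected future reward
            child = node.state + "\n" + step           # state = concatenation of prompt and steps
            heapq.heappush(frontier, Node(-h, child))  # higher Q-value -> expanded earlier
    return None
```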

2 Related Works

LLM alignment. Alignment has become an important technique for preventing the output of LLMs from deviating from human expectations. Supervised Fine-Tuning (SFT) is probably the most fundamental alignment approach, directly minimizing the cross-entropy loss between the output and the ground truth. Reinforcement Learning from Human Feedback (RLHF) [31], on the other hand, first learns a reward model (RM) from human preferences and then optimizes the SFT model with reinforcement learning algorithms to maximize the cumulative rewards from the RM. Direct Preference Optimization (DPO) [32] aligns LLMs directly from the ranking information in human feedback without explicitly learning an RM. Recently, Aligner [33] was proposed as a model-agnostic alignment method that learns to re-write the output of LLMs. Compared to these methods, our Q* achieves the goal of alignment with distinct merits. Different from SFT and Aligner, Q* does not rely on massive human-annotated preference pairs, which are expensive to collect; different from RLHF and DPO, our Q* does not modify the parameters of LLMs, which avoids the potential risk of performance degeneration on other tasks.

Enhancing LLMs with planning. Tree-of-Thoughts (ToT) [25] improves the reasoning capability of LLMs by exploring the intermediate steps towards problem solving with basic tree-search algorithms. In the same vein, A* search and MCTS have been applied as planning techniques to enhance the performance of LLMs on challenging complex reasoning problems [26–28, 34]. Unfortunately, the utility function used in these methods is either constructed from LLMs’ feedback (e.g., [25, 27]), which can be highly inaccurate on complex problems, or designed specifically for each individual task (e.g., [28, 34]). Moreover, planning with MCTS often requires performing costly rollouts, which can significantly slow down the overall decoding process. In contrast, our Q* solely relies on training a Q-value model to guide LLMs to select the most promising next reasoning step, and the pipeline can be easily applied to various reasoning tasks without modification. Besides, Q* considers only a single step at a time, which is much cheaper than the complete rollouts in MCTS-based methods.
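As a purely illustrative sketch of this cost argument (the interfaces q_model.score, llm.complete and reward below are assumptions, not the paper's API), compare scoring a candidate next step with a single Q-value query against estimating its value from complete rollouts, each of which requires decoding an entire continuation:

```python
# Purely illustrative cost comparison (not code from the paper): `q_model.score(state, step)`,
# `llm.complete(state)` and `reward(trace)` are hypothetical placeholders.

def score_step_single_query(q_model, state, step):
    """Q*-style deliberation: one Q-value query per candidate step, no continuation needed."""
    return q_model.score(state, step)

def score_step_by_rollouts(llm, reward, state, step, n_rollouts=16):
    """Rollout-based value estimate: decode n complete continuations and average their rewards."""
    total = 0.0
    for _ in range(n_rollouts):
        trace = llm.complete(state + step)  # full decoding pass until a final answer
        total += reward(trace)
    return total / n_rollouts               # n full rollouts per candidate step
```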

LLMs for math reasoning & code generation. Math reasoning and code generation require LLMs to perform multi-step reasoning over relations, quantities and logic, which is inherently challenging. Current techniques include: 1) prompt engineering, which triggers the potential capacities of LLMs with sophisticated prompts [16–19, 35, 36]; however, constructing such prompts requires extensive expertise and case-by-case tuning, which is difficult to generalize across tasks; 2) fine-tuning LLMs with massive math/code corpora [5, 6, 8, 9, 20, 21, 37], which usually comes at the price of a significant computational burden and may compromise performance on other tasks; 3) training RMs/verifiers to rank candidate solutions without providing any guidance on intermediate steps [4, 22–24]. By contrast, our Q* leverages a plug-and-play Q-value model to direct the deliberation process of LLMs, effectively providing guidance for each intermediate step without modifying the parameters of LLMs. Moreover, by casting the multi-step reasoning of LLMs as a heuristic search problem, our method can be generalized to various reasoning tasks without laborious prompt engineering.

This paper is available on arXiv under a CC BY 4.0 license.

∗The first two authors contributed equally, and the order of authors was determined by a coin toss.