Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs






Published 6/14/2024 by Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, Min Lin
The recent development of chain-of-thought (CoT) decoding has enabled large language models (LLMs) to generate explicit logical reasoning paths for complex problem-solving. However, research indicates that these paths are not always deliberate and optimal. The tree-of-thought (ToT) method employs tree-searching to extensively explore the reasoning space and find better reasoning paths that CoT decoding might overlook. This deliberation, however, comes at the cost of significantly increased inference complexity. In this work, we demonstrate that fine-tuning LLMs leveraging the search tree constructed by ToT allows CoT to achieve similar or better performance, thereby avoiding the substantial inference burden. This is achieved through Chain of Preference Optimization (CPO), where LLMs are fine-tuned to align each step of the CoT reasoning paths with those of ToT using the inherent preference information in the tree-search process. Extensive experimental results show that CPO significantly improves LLM performance in solving a variety of complex problems, including question answering, fact verification, and arithmetic reasoning, demonstrating its effectiveness. Our code is available at

Overview

  • This paper introduces a technique called "Chain of Preference Optimization" (CPO) that improves the chain-of-thought (CoT) reasoning of large language models (LLMs).
  • Chain-of-thought reasoning breaks a complex problem into a sequence of simpler steps, each building on the previous ones.
  • CPO fine-tunes the model on the step-level preference information contained in a tree-of-thought (ToT) search tree, so that ordinary CoT decoding can reach similar or better performance without the inference cost of running the tree search.

Plain English Explanation

Chain-of-thought reasoning is a powerful technique that allows language models to tackle complex problems by breaking them down into a series of logical steps. However, current models can struggle to maintain a consistent and effective chain of reasoning, often getting sidetracked or making illogical leaps.

The Chain of Preference Optimization (CPO) approach addresses this by training the model to prefer coherent and effective reasoning steps. The key idea is to optimize the model's preferences at each step of the reasoning process, rather than only for the final output. The preference signal comes from tree-of-thought search: steps that the search keeps are treated as better than the alternatives it discards.

For example, when solving a math word problem, the model might be encouraged to prefer steps like "identify the relevant information," "determine the appropriate mathematical operation," and "calculate the answer" over less coherent alternatives. By shaping the model's preferences in this way, CPO helps it stay on track and produce more logical chains of thought.

This approach builds on techniques like chain-of-thought reasoning without prompting and multi-step reasoning across languages, which have also explored ways to improve the reasoning abilities of language models. By working on the model's step-by-step preferences rather than on prompting or decoding alone, CPO offers a different route to better step-by-step reasoning.

Technical Explanation

The key innovation of Chain of Preference Optimization (CPO) is the way it shapes the language model's preferences for individual steps in the reasoning process. Rather than optimizing only for the final answer, CPO also optimizes the model's preferences over the intermediate steps that lead to it, using the search tree that ToT constructs as the source of preference information.
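To make the idea of step-level preference information concrete, here is a minimal sketch of how preference pairs might be read off a search tree. It assumes a simple in-memory tree where the search has already picked one leaf; the `Thought` class, the `preference_pairs` function, and the toy question are all hypothetical names invented for this illustration, not the authors' code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Thought:
    """One node in a hypothetical ToT search tree (illustrative only)."""
    text: str
    children: List["Thought"] = field(default_factory=list)
    parent: Optional["Thought"] = field(default=None, repr=False)

    def add(self, text: str) -> "Thought":
        """Attach a child thought and return it."""
        child = Thought(text=text, parent=self)
        self.children.append(child)
        return child


def preference_pairs(leaf: Thought) -> List[Tuple[str, str, str]]:
    """Return (prefix, preferred, dispreferred) triples, root-first.

    Assumption: at each depth, the thought on the chosen path is preferred
    over every sibling that the search generated but did not follow.
    """
    # Collect the chosen path from the leaf back to (but excluding) the root.
    path = []
    node = leaf
    while node.parent is not None:
        path.append(node)
        node = node.parent
    path.reverse()

    pairs = []
    prefix_parts = [node.text]  # node is now the root, i.e. the question
    for chosen in path:
        prefix = "\n".join(prefix_parts)
        for sibling in chosen.parent.children:
            if sibling is not chosen:
                pairs.append((prefix, chosen.text, sibling.text))
        prefix_parts.append(chosen.text)
    return pairs


# Toy tree: two candidate thoughts per step, one kept at each depth.
root = Thought("Q: Pens cost $2 per pack of 3. How much do 12 pens cost?")
kept_1 = root.add("12 pens is 4 packs of 3.")
root.add("Each pen costs $2.")
kept_2 = kept_1.add("4 packs cost 4 * $2 = $8.")
kept_1.add("4 packs cost 4 + $2 = $6.")

for prefix, good, bad in preference_pairs(kept_2):
    print(f"prefer {good!r} over {bad!r}")
```

Each triple shares the same prefix of earlier steps, so the comparison is always between alternative next steps in the same context. Every step of the chain contributes its own pairs, not just the final answer.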

This is done with a preference loss applied at each step of the chain, which measures how well the model's preferences align with the reasoning steps chosen by the tree search. During training, the model is encouraged to assign higher probability to the chosen step than to the alternatives the search discarded, given the same preceding steps. This helps it learn to generate more coherent and effective chains of thought.
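The per-step preference objective described above can be sketched as a pairwise loss. The version below assumes a DPO-style form (a logistic loss on the difference of log-probability ratios against a frozen reference model), which this summary does not spell out. The function names, the `beta` value, and the example log-probabilities are assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def step_preference_loss(
    policy_logp_preferred: float,
    policy_logp_dispreferred: float,
    ref_logp_preferred: float,
    ref_logp_dispreferred: float,
    beta: float = 0.1,  # assumed temperature; not taken from the source
) -> float:
    """DPO-style loss for ONE (preferred, dispreferred) step pair.

    Each argument is the summed token log-probability of a candidate step,
    conditioned on the same prefix of earlier steps.
    """
    margin = beta * (
        (policy_logp_preferred - ref_logp_preferred)
        - (policy_logp_dispreferred - ref_logp_dispreferred)
    )
    return -log_sigmoid(margin)


# If the policy equals the reference, the margin is 0 and the loss is log(2).
neutral = step_preference_loss(-5.0, -6.0, -5.0, -6.0)

# If the policy has shifted probability toward the preferred step,
# the margin is positive and the loss drops below log(2).
improved = step_preference_loss(-4.0, -7.0, -5.0, -6.0)

print(neutral, improved)
```

In a training loop this quantity would be averaged over all step pairs drawn from the search trees and minimized with respect to the policy parameters only. The reference model stays fixed.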

The authors evaluate CPO on a range of reasoning tasks, including question answering, fact verification, and arithmetic reasoning. They report that CPO significantly improves LLM performance on these tasks, and that fine-tuning on the ToT search tree allows CoT decoding to achieve similar or better performance while avoiding the substantial inference burden of tree search.

One key insight from the paper is that CPO helps the model avoid committing early to a weak reasoning step or making illogical leaps. By shaping the model's preferences at each step, CPO guides it toward the better reasoning paths that plain CoT decoding might overlook.
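A toy example can show why step-by-step greedy choices may miss a better reasoning path that a tree search finds. The tree, the step labels, and the scores below are invented for illustration, and the exhaustive search stands in for ToT-style exploration rather than reproducing the method from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical search tree: each key is a partial path, each value maps a
# candidate next step to an illustrative quality score.
TREE: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"A": 0.6, "B": 0.5},
    ("A",): {"A1": 0.2, "A2": 0.1},
    ("B",): {"B1": 0.9, "B2": 0.3},
}


def greedy_path(tree: Dict[Tuple[str, ...], Dict[str, float]]) -> Tuple[List[str], float]:
    """Pick the locally best step at every depth, never looking ahead."""
    path: Tuple[str, ...] = ()
    total = 0.0
    while path in tree:
        step, score = max(tree[path].items(), key=lambda kv: kv[1])
        path = path + (step,)
        total += score
    return list(path), total


def best_path(
    tree: Dict[Tuple[str, ...], Dict[str, float]],
    path: Tuple[str, ...] = (),
    total: float = 0.0,
) -> Tuple[List[str], float]:
    """Explore every branch and return the path with the highest total score."""
    if path not in tree:
        return list(path), total
    candidates = [
        best_path(tree, path + (step,), total + score)
        for step, score in tree[path].items()
    ]
    return max(candidates, key=lambda result: result[1])


print("greedy:", greedy_path(TREE))
print("search:", best_path(TREE))
```

Here the greedy choice takes step A because it looks best locally, while the search finds that starting with B leads to a higher-scoring chain. Training the model to prefer B over A at the first step is the kind of correction that step-level preference data can supply.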

Critical Analysis

The Chain of Preference Optimization (CPO) approach is a promising step toward stronger reasoning in large language models. By training on preferences over individual reasoning steps, the technique addresses a key limitation of current models: their difficulty maintaining coherent and effective chains of thought.

That said, the paper does not explore some potential limitations of CPO. For example, it is unclear how well the technique would scale to more complex, open-ended reasoning tasks, where the space of possible reasoning steps is much larger and less structured.

Additionally, the paper does not delve into potential negative societal impacts or ethical considerations around deploying such a powerful reasoning system. As language models become more advanced, it will be increasingly important to carefully examine these types of issues.

Overall, though, CPO is a valuable contribution to the field of AI reasoning. By focusing on the model's step-by-step decisions rather than just the final output, it points toward more robust and effective chain-of-thought reasoning in language models.


Conclusion

The Chain of Preference Optimization (CPO) approach introduced in this paper is an important step toward better chain-of-thought reasoning in large language models. By explicitly optimizing the model's preferences for individual steps in the reasoning process, CPO helps the model maintain more coherent and effective chains of thought and avoid illogical leaps.

The authors demonstrate the effectiveness of CPO across a range of reasoning tasks, showing that models trained with this technique consistently outperform baseline models. While the paper does not explore all potential limitations or negative implications of the approach, it is a valuable contribution to ongoing efforts to enhance the reasoning abilities of AI systems.

As language models continue to advance, techniques like CPO will likely become increasingly important for complex problem-solving and decision-making. Because the tree search is needed only to build training data, the fine-tuned model keeps the low inference cost of ordinary chain-of-thought decoding.

