V-STaR: Training Verifiers for Self-Taught Reasoners

Read original: arXiv:2402.06457 - Published 8/15/2024 by Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, Rishabh Agarwal

Overview

Existing self-improvement approaches for large language models (LLMs) discard incorrect solutions generated during the training process.
V-STaR, a new self-improvement method, utilizes both correct and incorrect solutions to train a verifier that judges the correctness of model-generated solutions.
Running V-STaR for multiple iterations results in progressively better reasoners and verifiers, delivering improved performance on code generation and math reasoning tasks.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can understand and generate human-like text. To improve their problem-solving abilities, researchers have developed self-improvement approaches that fine-tune the models on self-generated solutions.

However, these approaches discard the incorrect solutions generated during the training process, potentially missing out on valuable information. To address this, the researchers propose V-STaR, a new self-improvement method that uses both correct and incorrect solutions to train a verifier that can judge the correctness of the model's solutions.

By running V-STaR for multiple iterations, the researchers are able to create progressively better reasoners and verifiers. This leads to a 4% to 17% improvement in test accuracy on common code generation and math reasoning benchmarks, compared to existing self-improvement and verification approaches.

Technical Explanation

The researchers propose a new self-improvement approach called V-STaR; the name builds on STaR, the Self-Taught Reasoner method. Unlike previous methods that discard incorrect solutions, V-STaR uses both the correct and incorrect solutions generated during the self-improvement process to train a verifier with DPO (Direct Preference Optimization).
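
To make the verifier's training signal concrete, here is a minimal Python sketch, not the authors' code. It assumes each preference pair matches a correct solution (chosen) with an incorrect one (rejected) for the same problem, and it evaluates the standard DPO loss for a single pair from sequence log-probabilities. The names build_preference_pairs and dpo_loss, the is_correct callback, and the beta default are illustrative assumptions, not taken from the paper.

```python
import math
from itertools import product
from typing import Callable, Dict, List, Sequence


def build_preference_pairs(
    problem: str,
    solutions: Sequence[str],
    is_correct: Callable[[str], bool],
) -> List[Dict[str, str]]:
    """Pair every correct solution with every incorrect one (assumed pairing scheme)."""
    correct = [s for s in solutions if is_correct(s)]
    incorrect = [s for s in solutions if not is_correct(s)]
    return [
        {"prompt": problem, "chosen": c, "rejected": r}
        for c, r in product(correct, incorrect)
    ]


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative default, not a value from the paper
) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * (chosen margin - rejected margin)))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss shrinks as the trained model assigns relatively more probability to the correct solution than the reference model does, which is what lets the trained model act as a correctness judge.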

This verifier is then used at inference time to select one solution among the many candidate solutions generated by the model. By running V-STaR for multiple iterations, the researchers are able to create progressively better reasoners and verifiers, leading to improved performance on code generation and math reasoning tasks.
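
The inference-time selection step can be sketched as follows. This is a minimal illustration under stated assumptions: select_best and the verifier_score callback are hypothetical names, and the score can be any function that maps a (problem, solution) pair to a number where higher means more likely correct.

```python
from typing import Callable, Sequence


def select_best(
    problem: str,
    candidates: Sequence[str],
    verifier_score: Callable[[str, str], float],
) -> str:
    """Return the candidate the verifier scores highest (first one wins on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    return max(candidates, key=lambda solution: verifier_score(problem, solution))
```

In practice the candidates would be sampled from the fine-tuned generator and scored by the trained verifier; both are left as inputs here so the sketch stays independent of any particular model library.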

The researchers evaluate V-STaR using LLaMA2 models and report a 4% to 17% improvement in test accuracy over existing self-improvement and verification approaches.

Critical Analysis

The researchers acknowledge that V-STaR may still discard some potentially useful information from the incorrect solutions generated during the self-improvement process. They suggest that further research is needed to fully leverage the insights contained in these incorrect solutions.

Additionally, the researchers note that the performance improvements of V-STaR may be limited by the quality of the initial LLM. Weaker models may benefit less from the iterative self-improvement and verification process.

Researchers may also want to explore the impact of the DPO objective used to train the verifier, as different optimization methods could potentially yield further performance gains.

Conclusion

The V-STaR approach represents a promising advancement in the field of self-improvement for large language models. By utilizing both correct and incorrect solutions generated during the training process, V-STaR is able to create progressively better reasoners and verifiers, leading to significant improvements in code generation and math reasoning tasks.

While the approach has some limitations, the researchers have demonstrated the value of considering all available information, even when it may initially appear to be "incorrect." This underscores the importance of developing robust verification methods that can extract insights from a wide range of model outputs, rather than simply discarding them.

Overall, the V-STaR research highlights the potential for innovative self-improvement techniques to drive continued advancements in large language model capabilities, with important implications for various applications that rely on these powerful AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

V-STaR: Training Verifiers for Self-Taught Reasoners

Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, Rishabh Agarwal

Common self-improvement approaches for large language models (LLMs), such as STaR, iteratively fine-tune LLMs on self-generated solutions to improve their problem-solving ability. However, these approaches discard the large amounts of incorrect solutions generated during this process, potentially neglecting valuable information in such solutions. To address this shortcoming, we propose V-STaR that utilizes both the correct and incorrect solutions generated during the self-improvement process to train a verifier using DPO that judges correctness of model-generated solutions. This verifier is used at inference time to select one solution among many candidate solutions. Running V-STaR for multiple iterations results in progressively better reasoners and verifiers, delivering a 4% to 17% test accuracy improvement over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models.

8/15/2024

Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy

The performance of Large Vision Language Models (LVLMs) is dependent on the size and quality of their training datasets. Existing video instruction tuning datasets lack diversity as they are derived by prompting large language models with video captions to generate question-answer pairs, and are therefore mostly descriptive. Meanwhile, many labeled video datasets with diverse labels and supervision exist - however, we find that their integration into LVLMs is non-trivial. Herein, we present Video Self-Training with augmented Reasoning (Video-STaR), the first video self-training approach. Video-STaR allows the utilization of any labeled video dataset for video instruction tuning. In Video-STaR, an LVLM cycles between instruction generation and finetuning, which we show (I) improves general video understanding and (II) adapts LVLMs to novel downstream tasks with existing supervision. During generation, an LVLM is prompted to propose an answer. The answers are then filtered only to those that contain the original video labels, and the LVLM is then re-trained on the generated dataset. By only training on generated answers that contain the correct video labels, Video-STaR utilizes these existing video labels as weak supervision for video instruction tuning. Our results demonstrate that Video-STaR-enhanced LVLMs exhibit improved performance in (I) general video QA, where TempCompass performance improved by 10%, and (II) on downstream tasks, where Video-STaR improved Kinetics700-QA accuracy by 20% and action quality assessment on FineDiving by 15%.

7/9/2024

Lean-STaR: Learning to Interleave Thinking and Proving

Haohan Lin, Zhiqing Sun, Yiming Yang, Sean Welleck

Traditional language model-based theorem proving assumes that by training on a sufficient amount of formal proof data, a model will learn to prove theorems. Our key observation is that a wealth of informal information that is not present in formal proofs can be useful for learning to prove theorems. For instance, humans think through steps of a proof, but this thought process is not visible in the resulting code. We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof, thereby boosting the model's theorem-proving capabilities. Lean-STaR uses retrospective ground-truth tactics to generate synthetic thoughts for training the language model. At inference time, the trained model directly generates the thoughts prior to the prediction of the tactics in each proof step. Building on the self-taught reasoner framework, we then apply expert iteration to further fine-tune the model on the correct proofs it samples and verifies using the Lean solver. Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment, significantly outperforming base models (43.4% → 46.3%, Pass@64). We also analyze the impact of the augmented thoughts on various aspects of the theorem proving process, providing insights into their effectiveness.

8/12/2024

Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, Mao Yang

This paper introduces rStar, a self-play mutual reasoning approach that significantly improves reasoning capabilities of small language models (SLMs) without fine-tuning or superior models. rStar decouples reasoning into a self-play mutual generation-discrimination process. First, a target SLM augments the Monte Carlo Tree Search (MCTS) with a rich set of human-like reasoning actions to construct higher quality reasoning trajectories. Next, another SLM, with capabilities similar to the target SLM, acts as a discriminator to verify each trajectory generated by the target SLM. The mutually agreed reasoning trajectories are considered mutual consistent, thus are more likely to be correct. Extensive experiments across five SLMs demonstrate rStar can effectively solve diverse reasoning problems, including GSM8K, GSM-Hard, MATH, SVAMP, and StrategyQA. Remarkably, rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B, from 36.46% to 81.88% for Mistral-7B, from 74.53% to 91.13% for LLaMA3-8B-Instruct. Code will be available at https://github.com/zhentingqi/rStar.

8/13/2024