Prometheus-eval

Models by this creator

🏅

prometheus-13b-v1.0

prometheus-eval

Total Score

115

prometheus-13b-v1.0 is an alternative to GPT-4 for fine-grained evaluation of language models. Developed by prometheus-eval, it uses Llama-2-Chat as a base model and fine-tunes it on 100K feedback samples from the Feedback Collection dataset. This specialized fine-tuning allows prometheus-13b-v1.0 to outperform GPT-3.5-Turbo and Llama-2-Chat 70B, and to perform on par with GPT-4 on various benchmarks. In contrast to GPT-4, prometheus-13b-v1.0 is a more affordable and customizable evaluation model that can be tuned to assess language models against specific criteria such as child readability, cultural sensitivity, or creativity.

Model inputs and outputs

Inputs

- **Instruction**: The task or prompt to be evaluated
- **Response**: The text response to be evaluated
- **Reference answer**: A reference answer that would receive a score of 5
- **Score rubric**: A set of criteria and descriptions for scoring the response on a scale of 1 to 5

Outputs

- **Feedback**: A detailed assessment of the response quality based on the provided score rubric
- **Score**: An integer between 1 and 5 indicating the quality of the response, as per the score rubric

Capabilities

prometheus-13b-v1.0 excels at fine-grained evaluation of language model outputs. It can provide detailed feedback and scoring for responses across a wide range of criteria, making it a powerful tool for model developers and researchers who want to assess the performance of their language models. Its specialized fine-tuning on feedback data lets it ground its critiques in the supplied rubric and reference answer, a key capability for producing nuanced, consistent evaluations.

What can I use it for?

prometheus-13b-v1.0 can be used as a cost-effective alternative to GPT-4 for evaluating the performance of language models. It is particularly well suited to assessing models against customized criteria, such as child readability, cultural sensitivity, or creativity. The model can also be used as a reward model in Reinforcement Learning from Human Feedback (RLHF) approaches, helping to fine-tune language models to align with human preferences and values.

Things to try

One interesting use case for prometheus-13b-v1.0 is providing detailed feedback on the outputs of large language models, helping to identify areas for improvement and guide further model development. Researchers and developers could evaluate their models on a wide range of benchmarks and tasks, then use the detailed feedback to inform their fine-tuning and training processes. The model could also be used to assess the safety and appropriateness of language model outputs, ensuring that they align with ethical guidelines and promote positive behavior. A minimal usage sketch follows below.
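Below is a minimal sketch of absolute grading (direct assessment) with prometheus-13b-v1.0 via Hugging Face transformers. The Hub id and the prompt layout are assumptions inferred from the inputs/outputs described above, not the canonical template; check the model card before trusting the resulting scores.

```python
# Hedged sketch: absolute grading with prometheus-13b-v1.0.
# MODEL_ID and the "###..." prompt layout are assumptions, not a documented API.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "prometheus-eval/prometheus-13b-v1.0"  # assumed Hub id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

instruction = "Summarize the water cycle for a 7-year-old."
response = "Water goes up as vapor, makes clouds, and falls back down as rain."
reference_answer = (
    "The sun warms water so it rises as vapor, the vapor forms clouds, "
    "and the water falls back to the ground as rain or snow, starting over again."
)
rubric = (
    "Child readability: 1 = uses jargon a child cannot follow; "
    "3 = mostly simple but misses a step; "
    "5 = simple, complete, and engaging for a young child."
)

# Assemble the four inputs described above into a single evaluation prompt.
prompt = f"""###Task Description:
An instruction, a response to evaluate, a reference answer that gets a score of 5,
and a score rubric are given. Write detailed feedback, then end with an integer
score from 1 to 5 in the form "[RESULT] <score>".

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubric:
{rubric}

###Feedback:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Print only the newly generated feedback + score, not the prompt itself.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```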


Updated 5/30/2024

💬

prometheus-7b-v2.0

prometheus-eval

Total Score

57

The prometheus-7b-v2.0 is a language model developed by the team at prometheus-eval. It is an alternative to GPT-4 for fine-grained evaluation of language models and as a reward model for Reinforcement Learning from Human Feedback (RLHF). The model is based on the Mistral-Instruct base model and has been fine-tuned on 100K pieces of feedback from the Feedback Collection and 200K pieces of feedback from the Preference Collection datasets. It supports both absolute grading (direct assessment) and relative grading (pairwise ranking), and, surprisingly, the weight-merging process used to support both formats also improves performance on each. Similar models include prometheus-13b-v1.0 and prometheus-8x7b-v2.0, which use different base models and training approaches.

Model Inputs and Outputs

The prometheus-7b-v2.0 model is a language model that can be used for text-to-text generation tasks. It requires different prompt formats for absolute grading (direct assessment) and relative grading (pairwise ranking).

Inputs

- An instruction (which might include an input)
- A response to evaluate
- A reference answer that gets a score of 5
- A score rubric representing the evaluation criteria

Outputs

- Detailed feedback assessing the quality of the response based on the given score rubric
- An integer score between 1 and 5 referring to the score rubric

Capabilities

The prometheus-7b-v2.0 model excels at fine-grained evaluation of language models, outperforming GPT-3.5-Turbo and performing on par with GPT-4 on various benchmarks. It can be used to evaluate LLMs with customized criteria, such as child readability, cultural sensitivity, or creativity. Additionally, it can be used as a reward model for Reinforcement Learning from Human Feedback (RLHF).

What can I use it for?

The prometheus-7b-v2.0 model can be leveraged for a variety of applications, particularly in language model evaluation and development. It can be used to assess the performance of other language models, providing detailed feedback and scoring to help improve their capabilities. It can also be employed as a reward model in Reinforcement Learning from Human Feedback (RLHF) workflows, helping to fine-tune language models to better align with human preferences and values.

Things to try

One interesting aspect of the prometheus-7b-v2.0 model is that it performs well on both absolute grading (direct assessment) and relative grading (pairwise ranking), aided by the weight-merging process used to support both formats. Experimenting with different prompts and evaluation criteria could yield insights into how the model achieves this; a sketch of a pairwise-ranking prompt follows this section. Another area to explore is using prometheus-7b-v2.0 in conjunction with other language models, either as a specialized evaluation tool or as part of a broader model-development workflow. Combining this model's evaluation capabilities with other state-of-the-art language models could yield interesting and powerful applications.
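The following is a minimal sketch of relative grading (pairwise ranking) with prometheus-7b-v2.0. The prompt layout and the "[RESULT] A" / "[RESULT] B" convention are assumptions based on the description above rather than the documented template; consult the model card for the exact format.

```python
# Hedged sketch: pairwise ranking with prometheus-7b-v2.0.
# MODEL_ID, the prompt layout, and the "[RESULT] A/B" convention are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "prometheus-eval/prometheus-7b-v2.0"  # assumed Hub id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

instruction = "Explain why the sky is blue to a 10-year-old."
response_a = "Sunlight scatters off air molecules; blue light scatters most, so the sky looks blue."
response_b = "The sky is blue because it reflects the color of the ocean."
rubric = "Is the explanation scientifically accurate and easy for a child to understand?"

user_message = f"""###Task Description:
Two responses to an instruction and a score rubric are given. Compare the responses
against the rubric, write feedback, and finish with "[RESULT] A" or "[RESULT] B".

###Instruction:
{instruction}

###Response A:
{response_a}

###Response B:
{response_b}

###Score Rubric:
{rubric}

###Feedback:"""

# prometheus-7b-v2.0 is Mistral-Instruct based, so wrap the request in its chat template.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": user_message}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```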


Updated 7/2/2024

prometheus-8x7b-v2.0

prometheus-eval

Total Score

43

The prometheus-8x7b-v2.0 model is an alternative to GPT-4 for evaluation, designed to provide fine-grained evaluation of language models and to serve as a reward model for Reinforcement Learning from Human Feedback (RLHF). It is a language model trained by prometheus-eval on the Mixtral-8x7B-Instruct base model, fine-tuned on 100K pieces of feedback from the Feedback Collection and 200K pieces of feedback from the Preference Collection datasets. This fine-tuning allows the model to excel at evaluating long-form responses, outperforming GPT-3.5-Turbo and matching the performance of GPT-4 on various benchmarks. Surprisingly, the weight merging used to support both absolute grading (direct assessment) and relative grading (pairwise ranking) also improves the model's performance on each format.

Model inputs and outputs

The prometheus-8x7b-v2.0 model is designed to evaluate language model responses based on specific criteria. It requires four key components as input:

Inputs

- **Instruction**: The task or context that the response should be evaluated against.
- **Response to evaluate**: The language model response that needs to be evaluated.
- **Reference answer**: A high-quality reference answer (score of 5) that serves as a benchmark.
- **Score rubric**: A set of criteria and descriptions for scoring the response on a scale of 1 to 5.

Outputs

- **Feedback**: A detailed assessment of the response based on the provided score rubric.
- **Score**: An integer score between 1 and 5, reflecting the quality of the response according to the rubric.

Capabilities

The prometheus-8x7b-v2.0 model is particularly adept at providing fine-grained, objective evaluations of language model outputs. It can assess responses across a wide range of criteria, such as coherence, factual accuracy, creativity, and emotional intelligence. By leveraging the provided reference answer and score rubric, the model can deliver consistent and nuanced feedback, making it a valuable tool for language model development and benchmarking.

What can I use it for?

The prometheus-8x7b-v2.0 model can be used in a variety of applications related to language model evaluation and development. Some potential use cases include:

- **Fine-grained evaluation of language models**: Assess the quality of responses generated by other language models, such as GPT-4, against customizable criteria.
- **Reward model for RLHF**: Incorporate prometheus-8x7b-v2.0 as the reward model in Reinforcement Learning from Human Feedback (RLHF) pipelines to guide the training of language models (a scoring sketch follows this section).
- **Benchmarking and comparison**: Leverage the model's evaluation capabilities to benchmark different language models and compare their strengths and weaknesses.
- **Prompt engineering**: Use the model's feedback to iteratively refine prompts and improve the performance of language models on specific tasks.

Things to try

One interesting aspect of the prometheus-8x7b-v2.0 model is its ability to handle both absolute grading (direct assessment) and relative grading (pairwise ranking) effectively. This flexibility enables a wide range of evaluation scenarios, from assessing individual responses against a fixed rubric to comparing multiple responses and identifying the most suitable one. Another key feature to explore is the model's capacity to provide detailed, actionable feedback: by leveraging the provided reference answer and score rubric, it can offer valuable insights into the strengths and weaknesses of a given response, which can be instrumental in guiding language model development and refinement.
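Below is a minimal sketch of how the judge's text output could be turned into a scalar reward for an RLHF loop. The "[RESULT] <score>" separator and the idea of generating the judgement with prometheus-8x7b-v2.0 upstream are assumptions for illustration, not a documented API; adapt the parsing to the output format shown on the model card.

```python
# Hedged sketch: parsing a Prometheus-style judgement into an RLHF reward.
# The "[RESULT] <score>" convention is an assumption; adjust to the real output format.
import re
from typing import Optional, Tuple


def parse_judgement(generated_text: str) -> Tuple[str, Optional[int]]:
    """Split the judge output into (feedback, score), where score is 1-5 or None."""
    match = re.search(r"\[RESULT\]\s*([1-5])", generated_text)
    if not match:
        return generated_text.strip(), None
    feedback = generated_text[: match.start()].strip()
    return feedback, int(match.group(1))


def reward_from_score(score: Optional[int]) -> float:
    """Map a 1-5 rubric score onto a [-1.0, 1.0] reward; a missing score gets the minimum."""
    if score is None:
        return -1.0
    return (score - 3) / 2.0


if __name__ == "__main__":
    # In practice this text would come from generating with prometheus-8x7b-v2.0
    # on an (instruction, response, reference answer, rubric) prompt as described above.
    example_output = "The response covers the key points but omits the units. [RESULT] 4"
    feedback, score = parse_judgement(example_output)
    print(feedback)
    print(score, reward_from_score(score))
```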


Updated 9/6/2024