Laboratory-Scale AI: Open-Weight Models are Competitive with ChatGPT Even in Low-Resource Settings

2405.16820

Published 5/28/2024 by Robert Wolfe, Isaac Slaughter, Bin Han, Bingbing Wen, Yiwei Yang, Lucas Rosenblatt, Bernease Herman, Eva Brown, Zening Qu, Nic Weber and 1 other

cs.LG cs.AI cs.CY cs.HC

Abstract

The rapid proliferation of generative AI has raised questions about the competitiveness of lower-parameter, locally tunable, open-weight models relative to high-parameter, API-guarded, closed-weight models in terms of performance, domain adaptation, cost, and generalization. Centering under-resourced yet risk-intolerant settings in government, research, and healthcare, we see for-profit closed-weight models as incompatible with requirements for transparency, privacy, adaptability, and standards of evidence. Yet the performance penalty in using open-weight models, especially in low-data and low-resource settings, is unclear. We assess the feasibility of using smaller, open-weight models to replace GPT-4-Turbo in zero-shot, few-shot, and fine-tuned regimes, assuming access to only a single, low-cost GPU. We assess value-sensitive issues around bias, privacy, and abstention on three additional tasks relevant to those topics. We find that with relatively low effort, very low absolute monetary cost, and relatively little data for fine-tuning, small open-weight models can achieve competitive performance in domain-adapted tasks without sacrificing generality. We then run experiments considering practical issues in bias, privacy, and hallucination risk, finding that open models offer several benefits over closed models. We intend this work as a case study in understanding the opportunity cost of reproducibility and transparency over for-profit state-of-the-art zero shot performance, finding this cost to be marginal under realistic settings.

Overview

This paper compares smaller, locally tunable open-weight language models against closed-weight ChatGPT models (specifically GPT-4-Turbo) in low-resource settings.
The researchers found that, with relatively little fine-tuning data and a single low-cost GPU, small open-weight models reach competitive performance on domain-adapted tasks without sacrificing generality.
This suggests that open-weight language models can provide a viable and more transparent alternative to large commercial models like ChatGPT.

Plain English Explanation

The paper looks at how well open-weight language models, whose internal parameters (or "weights") are publicly available, perform compared to ChatGPT, a highly capable but opaque commercial language model.

Rather than training models from scratch, the researchers adapted existing small open-weight models using relatively little fine-tuning data and only a single, low-cost GPU. They found that these models achieved performance competitive with GPT-4-Turbo on domain-specific tasks while keeping their general abilities.

This is significant because open-weight models can be tuned and run locally, whereas commercial models like ChatGPT are closed and reachable only through an API. The fact that small open-weight models can approach ChatGPT's performance, even with little data and modest hardware, suggests they could provide a viable and more transparent alternative for many applications.

Technical Explanation

The paper evaluates smaller open-weight language models against GPT-4-Turbo in zero-shot, few-shot, and fine-tuned regimes, assuming access to only a single, low-cost GPU. The researchers report that, with relatively little fine-tuning data and very low absolute monetary cost, the small open-weight models achieve competitive performance on domain-adapted tasks without sacrificing generality.

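Specifically, the team experimented with a technique called QLoRA to efficiently fine-tune a pre-trained open-weight language model. QLoRA keeps the base model frozen in a 4-bit quantized form and trains only small low-rank adapter matrices, which allowed them to adapt the model to new tasks using relatively little additional training data and modest GPU memory.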
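To make the idea of parameter-efficient fine-tuning concrete, the following back-of-the-envelope sketch counts the trainable parameters a low-rank adapter adds to one weight matrix and estimates weight memory at different precisions. The layer shape (4096 x 4096), the rank (16), and the 7-billion-parameter model size are illustrative assumptions, not values taken from the paper, and the helper names are hypothetical.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Number of weights in a dense (d_out x d_in) matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights in a low-rank adapter for that matrix.

    The adapter is a product B @ A with A of shape (rank x d_in)
    and B of shape (d_out x rank), so it has rank * (d_in + d_out) weights.
    """
    return rank * (d_in + d_out)


def weight_memory_gib(n_params: int, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB.

    Ignores quantization metadata, activations, and optimizer state,
    so real usage is higher than this estimate.
    """
    return n_params * bits_per_param / 8 / 2**30


if __name__ == "__main__":
    # Assumed, illustrative sizes (not from the paper).
    d_out, d_in, rank = 4096, 4096, 16
    dense = full_params(d_out, d_in)
    adapter = lora_trainable_params(d_out, d_in, rank)
    print(f"dense weights:   {dense}")
    print(f"adapter weights: {adapter}")
    print(f"adapter / dense: {adapter / dense:.4%}")

    n_model = 7_000_000_000  # assumed model size
    for bits in (16, 4):
        print(f"{bits}-bit weights: {weight_memory_gib(n_model, bits):.2f} GiB")
```

Under these assumptions the adapter holds well under 1% of the weights of the matrix it modifies, and storing the frozen base model at 4 bits takes a quarter of the memory of 16-bit storage. That combination is what makes fine-tuning on a single GPU plausible.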

On the domain-adapted tasks, the fine-tuned open-weight models were competitive with GPT-4-Turbo. The authors also ran experiments on three additional tasks covering bias, privacy, and abstention (which bears on hallucination risk), and report that open models offer several benefits over closed models on these value-sensitive issues.
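The zero-shot and few-shot regimes named in the abstract differ only in whether labeled demonstrations are placed in the prompt ahead of the query. A minimal sketch of that difference follows; the prompt template, the `build_prompt` helper, and the example task are hypothetical illustrations, not the prompts used in the paper.

```python
from typing import Sequence, Tuple


def build_prompt(
    instruction: str,
    query: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a prompt from an instruction, optional demonstrations, and a query.

    With no examples this is a zero-shot prompt; with k examples it is k-shot.
    """
    parts = [instruction.strip()]
    for text, label in examples:
        # Each demonstration shows the model an input with its expected output.
        parts.append(f"Input: {text}\nOutput: {label}")
    # The query is left open-ended so the model completes the output.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    instruction = "Decide whether the two records refer to the same entity. Answer yes or no."
    query = "ACME Corp., Seattle | Acme Corporation (Seattle, WA)"

    zero_shot = build_prompt(instruction, query)
    few_shot = build_prompt(
        instruction,
        query,
        examples=[
            ("J. Smith, Boston | John Smith (Boston, MA)", "yes"),
            ("J. Smith, Boston | Jane Smythe (Austin, TX)", "no"),
        ],
    )
    print(zero_shot)
    print("-" * 40)
    print(few_shot)
```

The fine-tuned regime typically uses the zero-shot form of the prompt, since the demonstrations have been absorbed into the adapter weights during training rather than supplied at inference time.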

Critical Analysis

The paper makes a compelling case that open-weight language models can be competitive with highly capable commercial models like ChatGPT, even when fine-tuned on relatively little data with modest hardware. This is an encouraging finding for the development of more transparent and accessible AI systems.

However, the work has potential limitations. First, the benchmarks used may not fully capture the breadth of capabilities exhibited by ChatGPT, and there may be some tasks where the commercial model still maintains a significant advantage. Additionally, while the paper does account for monetary cost, other deployment factors such as energy efficiency may also matter in practice.

Further research is needed to fully understand the tradeoffs between open-weight and commercial models, and to explore ways of enhancing the capabilities of open-source alternatives. As noted in this related paper, continued advancements in open-source AI could have significant implications for the democratization of AI technology.

Conclusion

This paper provides evidence that small open-weight language models can achieve performance competitive with the industry-leading ChatGPT on domain-adapted tasks, even when fine-tuned on relatively little data with a single low-cost GPU. This suggests that transparent, open-weight AI systems can be a viable and competitive alternative to large, opaque commercial models.

As the field of generative AI continues to advance, the ability to develop powerful language models with open architectures and publicly available parameters could have important implications for AI transparency and accessibility. The findings in this paper represent an encouraging step in that direction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Fine-Tuned 'Small' LLMs (Still) Significantly Outperform Zero-Shot Generative AI Models in Text Classification

Martin Juan José Bucher, Marco Martini

Generative AI offers a simple, prompt-based alternative to fine-tuning smaller BERT-style LLMs for text classification tasks. This promises to eliminate the need for manually labeled training data and task-specific model training. However, it remains an open question whether tools like ChatGPT can deliver on this promise. In this paper, we show that smaller, fine-tuned LLMs (still) consistently and significantly outperform larger, zero-shot prompted models in text classification. We compare three major generative AI models (ChatGPT with GPT-3.5/GPT-4 and Claude Opus) with several fine-tuned LLMs across a diverse set of classification tasks (sentiment, approval/disapproval, emotions, party positions) and text categories (news, tweets, speeches). We find that fine-tuning with application-specific training data achieves superior performance in all cases. To make this approach more accessible to a broader audience, we provide an easy-to-use toolkit alongside this paper. Our toolkit, accompanied by non-technical step-by-step guidance, enables users to select and fine-tune BERT-like LLMs for any classification task with minimal technical and computational effort.

6/14/2024

cs.CL cs.AI

GEB-1.3B: Open Lightweight Large Language Model

Jie Wu, Yufeng Zhu, Lei Shen, Xuqing Lu

Recently developed large language models (LLMs) such as ChatGPT, Claude, and Llama have demonstrated impressive abilities, and even surpass human-level performance in several tasks. Despite their success, the resource-intensive demands of these models, requiring significant computational power for both training and inference, limit their deployment to high-performance servers. Additionally, the extensive calculation requirements of the models often lead to increased latency in response times. With the increasing need for LLMs to operate efficiently on CPUs, research about lightweight models that are optimized for CPU inference has emerged. In this work, we introduce GEB-1.3B, a lightweight LLM trained on 550 billion tokens in both Chinese and English languages. We employ novel training techniques, including ROPE, Group-Query-Attention, and FlashAttention-2, to accelerate training while maintaining model performance. Additionally, we fine-tune the model using 10 million samples of instruction data to enhance alignment. GEB-1.3B exhibits outstanding performance on general benchmarks such as MMLU, C-Eval, and CMMLU, outperforming comparative models such as MindLLM-1.3B and TinyLLaMA-1.1B. Notably, the FP32 version of GEB-1.3B achieves commendable inference times on CPUs, with ongoing efforts to further enhance speed through advanced quantization techniques. The release of GEB-1.3B as an open-source model marks a significant contribution to the development of lightweight LLMs, promising to foster further research and innovation in the field.

6/17/2024

cs.CL

A Survey on the Real Power of ChatGPT

Ming Liu, Ran Liu, Ye Zhu, Hua Wang, Youyang Qu, Rongsheng Li, Yongpan Sheng, Wray Buntine

ChatGPT has changed the AI community and an active research line is the performance evaluation of ChatGPT. A key challenge for the evaluation is that ChatGPT is still closed-source and traditional benchmark datasets may have been used by ChatGPT as the training data. In this paper, (i) we survey recent studies which uncover the real performance levels of ChatGPT in seven categories of NLP tasks, (ii) review the social implications and safety issues of ChatGPT, and (iii) emphasize key challenges and opportunities for its evaluation. We hope our survey can shed some light on its blackbox manner, so that researchers are not misled by its surface generation.

5/13/2024

cs.CL cs.AI

Improving Large Models with Small models: Lower Costs and Better Performance

Dong Chen, Shuo Zhang, Yueting Zhuang, Siliang Tang, Qidong Liu, Hua Wang, Mingliang Xu

Pretrained large models (PLMs), such as ChatGPT, have demonstrated remarkable performance across diverse tasks. However, the significant computational requirements of PLMs have discouraged most product teams from running or fine-tuning them. In such cases, to harness the exceptional performance of PLMs, one must rely on expensive APIs, thereby exacerbating the economic burden. Despite the overall inferior performance of small models, in specific distributions, they can achieve comparable or even superior results. Consequently, some input can be processed exclusively by small models. On the other hand, certain tasks can be broken down into multiple subtasks, some of which can be completed without powerful capabilities. Under these circumstances, small models can handle the simple subtasks, allowing large models to focus on challenging subtasks, thus improving the performance. We propose Data Shunt$^+$ (DS$^+$), a general paradigm for collaboration of small and large models. DS$^+$ not only substantially reduces the cost associated with querying large models but also effectively improves large models' performance. For instance, ChatGPT achieves an accuracy of $94.43\%$ on Amazon Product sentiment analysis, and DS$^+$ achieves an accuracy of $95.64\%$, while the cost has been reduced to only $31.18\%$. Besides, experiments also prove that the proposed collaborative-based paradigm can better inject specific task knowledge into PLMs compared to fine-tuning.

6/26/2024

cs.CL cs.AI cs.LG