H2O-Danube-1.8B Technical Report

2401.16818

Published 4/16/2024 by Philipp Singer, Pascal Pfeiffer, Yauhen Babakhin, Maximilian Jeblick, Nischay Dhankhar, Gabor Fodor, Sri Satish Ambati

cs.CL cs.LG

Abstract

We present H2O-Danube, a series of small 1.8B language models consisting of H2O-Danube-1.8B, trained on 1T tokens, and the incrementally improved H2O-Danube2-1.8B, trained on an additional 2T tokens. Our models exhibit highly competitive metrics across a multitude of benchmarks and, as of the time of this writing, H2O-Danube2-1.8B achieves the top ranking on the Open LLM Leaderboard for all models below the 2B parameter range. The models follow core principles of Llama 2 and Mistral, and we leverage and refine various techniques for pre-training large language models. We additionally release chat models trained with supervised fine-tuning followed by direct preference optimization. We make all models openly available under the Apache 2.0 license, further democratizing LLMs to a wider audience economically.

Overview

The paper presents H2O-Danube, a series of small 1.8B language models
H2O-Danube-1.8B is trained on 1T tokens, and H2O-Danube2-1.8B is trained on an additional 2T tokens
The models show highly competitive metrics across multiple benchmarks
At the time of writing, H2O-Danube2-1.8B ranks first on the Open LLM Leaderboard among models below 2B parameters
The models follow core principles of Llama 2 and Mistral, leveraging and refining techniques for pre-training large language models
Chat models are also released, trained with supervised fine-tuning followed by direct preference optimization
All models are openly available under the Apache 2.0 license to democratize LLMs

Plain English Explanation

The researchers have developed a series of small 1.8 billion parameter language models called H2O-Danube. The first model, H2O-Danube-1.8B, was trained on 1 trillion tokens of text data, while the second model, H2O-Danube2-1.8B, was trained on an additional 2 trillion tokens. These models score competitively on a variety of benchmarks, and at the time the paper was written H2O-Danube2-1.8B ranked first on the Open LLM Leaderboard among all models with under 2 billion parameters.

The models follow the core design principles of Llama 2 and Mistral, two other influential large language models. The researchers have also refined existing techniques for pre-training such models.

In addition to the base language models, the researchers have also released chat models. These were first fine-tuned on example conversations (supervised fine-tuning) and then trained with direct preference optimization, a method that teaches the model to favor responses people preferred over ones they rejected. All of these models are freely available under the Apache 2.0 license, which makes them easier for a wide audience to access and use.

Technical Explanation

The H2O-Danube series of language models consists of two main versions: H2O-Danube-1.8B, which was trained on 1 trillion tokens of text data, and H2O-Danube2-1.8B, which was trained on an additional 2 trillion tokens. Both models have 1.8 billion parameters, placing them in the "small" category of large language models.
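To put the training budget in perspective, the token counts above can be turned into a rough tokens-per-parameter ratio. The sketch below is only back-of-the-envelope arithmetic on the figures quoted in this summary; it assumes that H2O-Danube2-1.8B continues from the first model, so that its total is 1T + 2T tokens, and the helper name `tokens_per_parameter` is purely illustrative.

```python
# Back-of-the-envelope token budget for the two models described above.
PARAMS = 1.8e9            # parameter count shared by both models
TOKENS_V1 = 1e12          # H2O-Danube-1.8B: 1T training tokens
TOKENS_V2_EXTRA = 2e12    # H2O-Danube2-1.8B: an additional 2T tokens


def tokens_per_parameter(tokens: float, params: float = PARAMS) -> float:
    """Return the ratio of training tokens to model parameters."""
    return tokens / params


# Assumption: the second model builds on the first, so the totals add up.
total_v2 = TOKENS_V1 + TOKENS_V2_EXTRA
ratio_v1 = tokens_per_parameter(TOKENS_V1)
ratio_v2 = tokens_per_parameter(total_v2)

print(f"H2O-Danube-1.8B:  {ratio_v1:,.0f} tokens per parameter")
print(f"H2O-Danube2-1.8B: {ratio_v2:,.0f} tokens per parameter")
```

Under that assumption the ratios come out to roughly 556 and 1,667 tokens per parameter, which shows how heavily these small models are trained relative to their size.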

These models were developed by leveraging and refining the core principles and techniques used in the Llama 2 and Mistral language models. The researchers integrated various advancements in pre-training large language models to achieve highly competitive performance across a wide range of benchmarks.

In addition to the main language models, the researchers also trained chat models using supervised fine-tuning followed by direct preference optimization. These chat models are designed to engage in more natural, conversational interactions with users.
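The preference-optimization step can be illustrated with the standard DPO objective for a single preference pair. This is a minimal sketch of the general technique, not the authors' training code: the function names, the example log-probabilities, and the value `beta=0.1` are assumptions chosen for illustration, and the summary does not state the hyperparameters the paper used.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical value, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the model being trained (policy) or the frozen reference model,
    which is typically the supervised fine-tuned checkpoint.
    """
    # Implicit rewards: how much more likely the policy makes each response
    # compared with the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    return -log_sigmoid(margin)


# Illustrative numbers only.
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy == reference
loss_improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)   # policy favors chosen
print(loss_at_start, loss_improved)
```

When the policy equals the reference model, the margin is zero and the loss is log 2. The loss falls as the policy raises the probability of the preferred response relative to the rejected one, which is the behavior the fine-tuning stage rewards.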

All of the H2O-Danube models, including the chat variants, are made openly available under the Apache 2.0 license. This open-source approach helps democratize access to large language models, allowing a wider audience to utilize and build upon these powerful AI systems.

Critical Analysis

The H2O-Danube models are a notable contribution among small language models, chiefly because of their competitive results across a wide range of benchmarks. The researchers reached these results by building on the principles of Llama 2 and Mistral while refining existing pre-training techniques.

However, it's important to note that the paper does not provide detailed information about the specific techniques and methodologies used in the pre-training process. While the researchers mention leveraging and refining various approaches, a more in-depth explanation of the innovations and modifications would be helpful for a deeper understanding of the models' capabilities and potential limitations.

Additionally, the paper does not discuss the potential biases or ethical considerations associated with the H2O-Danube models. As large language models can sometimes exhibit undesirable biases or generate harmful content, it would be valuable for the researchers to address these concerns and outline their strategies for mitigating such issues.

Furthermore, the paper lacks a comprehensive analysis of the chat models' performance and their ability to engage in natural, contextual conversations. While the release of these chat models is a positive step, a more detailed evaluation of their conversational skills and user experience would provide valuable insights.

Conclusion

The H2O-Danube series shows that small language models can be highly competitive. By building on the principles of Llama 2 and Mistral and refining pre-training techniques, the researchers have developed 1.8B-parameter models that perform strongly across a variety of benchmarks.

The open-source release of these models, including the chat variants, is a commendable effort to democratize access to powerful AI systems and foster a wider ecosystem of language model development and application. However, the paper could benefit from more detailed explanations of the technical innovations, potential biases and ethical considerations, as well as a more in-depth evaluation of the chat models' conversational abilities.

Overall, the H2O-Danube models are a promising development in the ongoing quest to create highly capable and accessible large language models that can positively impact various domains and applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

OpenLLM-Ro -- Technical Report on Open-source Romanian LLMs trained starting from Llama 2

Mihai Masala, Denis C. Ilie-Ablachim, Dragos Corlatescu, Miruna Zavelca, Marius Leordeanu, Horia Velicu, Marius Popescu, Mihai Dascalu, Traian Rebedea

In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English. Hence, their performance in English greatly exceeds their performance in other languages. This document presents our approach to training and evaluating the first foundational and chat LLM specialized for Romanian.

5/20/2024

cs.CL

Tele-FLM Technical Report

Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Chao Wang, Xinzhang Liu, Zihan Wang, Yu Zhao, Xin Wang, Yuyao Huang, Shuangyong Song, Yongxiang Li, Zheng Zhang, Bo Zhao, Aixin Sun, Yequan Wang, Zhongjiang He, Zhongyuan Wang, Xuelong Li, Tiejun Huang

Large language models (LLMs) have showcased profound capabilities in language understanding and generation, facilitating a wide array of applications. However, there is a notable paucity of detailed, open-sourced methodologies on efficiently scaling LLMs beyond 50 billion parameters with minimum trial-and-error cost and computational resources. In this report, we introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities. Tele-FLM demonstrates superior multilingual language modeling abilities, measured by BPB on textual corpus. Besides, in both English and Chinese foundation model evaluation, it is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B. In addition to the model weights, we share the core designs, engineering practices, and training details, which we expect to benefit both the academic and industrial communities.

4/26/2024

cs.CL cs.AI

HLAT: High-quality Large Language Model Pre-trained on AWS Trainium

Haozheng Fan, Hao Zhou, Guangtai Huang, Parameswaran Raman, Xinwei Fu, Gaurav Gupta, Dhananjay Ram, Yida Wang, Jun Huan

Getting large language models (LLMs) to perform well on the downstream tasks requires pre-training over trillions of tokens. This typically demands a large number of powerful computational devices in addition to a stable distributed training framework to accelerate the training. The growing number of applications leveraging AI/ML has led to a scarcity of the expensive conventional accelerators (such as GPUs), which creates the need for alternative specialized accelerators that are scalable and cost-efficient. AWS Trainium is the second-generation machine learning accelerator that has been purposely built for training large deep learning models. Its corresponding instance, Amazon EC2 trn1, is an alternative to GPU instances for LLM training. However, training LLMs with billions of parameters on trn1 is challenging due to its relatively nascent software ecosystem. In this paper, we showcase HLAT: a 7 billion parameter decoder-only LLM pre-trained using trn1 instances over 1.8 trillion tokens. The performance of HLAT is benchmarked against popular open source baseline models including LLaMA and OpenLLaMA, which have been trained on NVIDIA GPUs and Google TPUs, respectively. On various evaluation tasks, we show that HLAT achieves model quality on par with the baselines. We also share the best practice of using the Neuron Distributed Training Library (NDTL), a customized distributed training library for AWS Trainium to achieve efficient training. Our work demonstrates that AWS Trainium powered by the NDTL is able to successfully pre-train state-of-the-art LLM models with high performance and cost-effectiveness.

4/17/2024

cs.CL cs.LG

JetMoE: Reaching Llama2 Performance with 0.1M Dollars

Yikang Shen, Zhen Guo, Tianle Cai, Zengyi Qin

Large Language Models (LLMs) have achieved remarkable results, but their increasing resource demand has become a major obstacle to the development of powerful and accessible super-human intelligence. This report introduces JetMoE-8B, a new LLM trained with less than $0.1 million, using 1.25T tokens from carefully mixed open-source corpora and 30,000 H100 GPU hours. Despite its low cost, the JetMoE-8B demonstrates impressive performance, with JetMoE-8B outperforming the Llama2-7B model and JetMoE-8B-Chat surpassing the Llama2-13B-Chat model. These results suggest that LLM training can be much more cost-effective than generally thought. JetMoE-8B is based on an efficient Sparsely-gated Mixture-of-Experts (SMoE) architecture, composed of attention and feedforward experts. Both layers are sparsely activated, allowing JetMoE-8B to have 8B parameters while only activating 2B for each input token, reducing inference computation by about 70% compared to Llama2-7B. Moreover, JetMoE-8B is highly open and academia-friendly, using only public datasets and training code. All training parameters and data mixtures have been detailed in this report to facilitate future efforts in the development of open foundation models. This transparency aims to encourage collaboration and further advancements in the field of accessible and efficient LLMs. The model weights are publicly available at https://github.com/myshell-ai/JetMoE.

4/12/2024

cs.CL cs.AI