SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

Published 11/6/2024 by Raghu Prabhakar, Ram Sivaramakrishnan, Darshan Gandhi, Yun Du, Mingran Wang, Xiangyu Song, Kejie Zhang and 23 more...

Overview

Monolithic large language models (LLMs) like GPT-4 have enabled modern generative AI applications, but training, serving, and maintaining them at scale remains expensive and challenging.
The Composition of Experts (CoE) approach is a modular alternative that can reduce the cost and complexity, but it faces challenges with hardware utilization and model switching.
This paper describes how combining CoE, streaming dataflow, and a three-tier memory system can address the AI memory wall and scale CoE systems.

Figure, panel (a): BS=8, TP=8

Plain English Explanation

The paper discusses a new approach to building and deploying large AI models called Composition of Experts (CoE). Traditionally, AI models have been built as a single, monolithic system, like GPT-4. While these large models have enabled amazing AI applications, they are very expensive and complex to train, serve, and maintain at scale.

The CoE approach is a modular alternative that breaks the model down into smaller "expert" components that can be trained and deployed more efficiently. However, this modular approach presents its own challenges when using conventional hardware. The smaller expert models may not be able to fully utilize the available computing power, and rapidly switching between a large number of expert models can be slow or costly.

The researchers in this paper propose a solution that pairs a CoE system called Samba-CoE with a new AI accelerator chip, the SambaNova SN40L. The chip has a three-level memory system designed to address the challenges of deploying large-scale CoE models. The result is a system that can run CoE models much more efficiently, reducing cost and complexity compared to traditional monolithic AI models.

Technical Explanation

This paper introduces Samba-CoE, a Composition of Experts (CoE) system with 150 experts and a trillion total parameters. Samba-CoE is deployed on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU), a custom-designed dataflow accelerator architecture for enterprise AI applications.
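To give a feel for the scale involved, the following is a rough capacity check, not a calculation from the paper. The expert and parameter counts come from the text and the per-socket capacities from Table II; BF16 weights (2 bytes per parameter), decimal units, and an 8-socket node are assumptions made here for illustration.

```python
# Back-of-the-envelope capacity check for a CoE the size of Samba-CoE.
# From the text: 150 experts, one trillion total parameters.
# From Table II: 64 GB HBM and 1.5 TB DDR per socket.
# ASSUMPTIONS (illustrative only): BF16 weights, decimal GB/TB, 8 sockets.

TOTAL_PARAMS = 1e12
NUM_EXPERTS = 150
BYTES_PER_PARAM = 2        # assumption: BF16 weights
SOCKETS = 8                # assumption: node size
HBM_PER_SOCKET = 64e9      # bytes, Table II
DDR_PER_SOCKET = 1.5e12    # bytes, Table II


def footprint():
    """Compare total weight bytes against node-level HBM and DDR capacity."""
    total_bytes = TOTAL_PARAMS * BYTES_PER_PARAM
    node_hbm = SOCKETS * HBM_PER_SOCKET
    node_ddr = SOCKETS * DDR_PER_SOCKET
    return {
        "total_weights_TB": total_bytes / 1e12,
        "avg_expert_GB": total_bytes / NUM_EXPERTS / 1e9,
        "node_hbm_GB": node_hbm / 1e9,
        "node_ddr_TB": node_ddr / 1e12,
        "fits_in_hbm": total_bytes <= node_hbm,
        "fits_in_ddr": total_bytes <= node_ddr,
    }


if __name__ == "__main__":
    for key, value in footprint().items():
        print(f"{key}: {value}")
```

Under these assumptions the weights alone exceed the node's HBM but fit in its DDR, which is one way to read the motivation for a large third memory tier. The estimate ignores activations, KV caches, and any replication of weights across sockets.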

The key hardware features behind Samba-CoE include:

Three-Tier Memory System: The SN40L chip features a hierarchy of memory types - on-chip distributed SRAM, on-package High-Bandwidth Memory (HBM), and off-package DDR DRAM. This provides high-performance memory access for the CoE models.
Dedicated Inter-RDU Network: Multiple SN40L chips can be connected via a dedicated network, enabling the scaling up and out of the CoE system over multiple sockets.
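Streaming Dataflow Architecture: The dataflow design of the SN40L chip, combined with the multi-level memory system, lets many operators be fused and pipelined on chip, so intermediate results do not have to round-trip through off-chip memory. This matters because, without fusion, smaller expert models have low operation intensity and tend to be memory bound.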
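One way to picture how the memory tiers support model switching is a small cache of experts in HBM backed by a full copy in DDR. The sketch below is a toy model under stated assumptions, not the paper's runtime: the class `ExpertCache`, its least-recently-used eviction policy, and the expert sizes are all hypothetical, and the switch cost is approximated as expert size divided by DDR bandwidth.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy LRU cache: every expert lives in DDR, a subset is resident in HBM.

    Hypothetical illustration only. It ignores activations, KV caches,
    sharding across sockets, and any overlap of copies with compute.
    """

    def __init__(self, hbm_bytes, ddr_bandwidth):
        self.hbm_bytes = hbm_bytes          # HBM capacity in bytes
        self.ddr_bandwidth = ddr_bandwidth  # DDR bandwidth in bytes/second
        self.resident = OrderedDict()       # name -> size, least recent first
        self.used = 0

    def activate(self, name, size_bytes):
        """Make an expert resident; return estimated copy time in seconds."""
        if size_bytes > self.hbm_bytes:
            raise ValueError("expert does not fit in HBM")
        if name in self.resident:
            # Hit: mark as most recently used, no copy needed.
            self.resident.move_to_end(name)
            return 0.0
        # Miss: evict least recently used experts until the new one fits.
        while self.used + size_bytes > self.hbm_bytes:
            _, evicted_size = self.resident.popitem(last=False)
            self.used -= evicted_size
        self.resident[name] = size_bytes
        self.used += size_bytes
        return size_bytes / self.ddr_bandwidth


if __name__ == "__main__":
    # Single-socket numbers from Table II: 64 GB HBM, 200 GB/s DDR.
    cache = ExpertCache(hbm_bytes=64e9, ddr_bandwidth=200e9)
    for expert in ["a", "b", "c", "d", "a", "e"]:
        seconds = cache.activate(expert, 14e9)  # 14 GB is an assumed size
        print(expert, f"{seconds:.3f} s", list(cache.resident))
```

In this toy model a hit costs nothing and a miss costs one DDR-to-HBM copy, so the switching cost is set by DDR bandwidth and by how often the router picks an expert that is not already resident.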

The researchers evaluate the system on various benchmarks running on eight RDU sockets and report speedups ranging from 2x to 13x compared to an unfused baseline. For CoE inference deployments, they report that an 8-socket RDU node reduces machine footprint by up to 19x, speeds up model switching time by 15x to 31x, and achieves an overall speedup of 3.7x over a DGX H100 and 6.6x over a DGX A100.

Critical Analysis

The paper presents a compelling solution to the challenges of deploying large-scale CoE systems, but it leaves several questions open:

Hardware Specificity: The Samba-CoE system is closely tied to the SambaNova SN40L hardware, which may limit its broader applicability. The researchers do not provide a clear path for adapting the approach to other hardware platforms.
Scalability Concerns: While the system can scale up and out using multiple SN40L chips, the researchers do not explore the limits of this scalability or the potential bottlenecks that may arise as the system grows larger.
Power and Energy Efficiency: The paper focuses on performance metrics like speedup and footprint reduction, but does not address the power consumption or energy efficiency of the Samba-CoE system. This could be an important consideration for real-world deployment.
Complexity and Maintenance: Introducing a new hardware architecture and multi-level memory system adds complexity to the system. The researchers do not discuss the potential challenges of maintaining and updating such a complex system over time.

Overall, the Samba-CoE approach represents a promising step forward in addressing the challenges of deploying large-scale AI models, but further research may be needed to assess its broader applicability and long-term feasibility.

Conclusion

This paper presents a novel solution for scaling Composition of Experts (CoE) systems, a modular approach to building large AI models. By combining CoE with a custom-designed hardware architecture and a three-tier memory system, the researchers have developed a system called Samba-CoE that can significantly improve performance while reducing the cost and complexity of deploying large-scale AI models compared to traditional monolithic approaches.

The key innovations in Samba-CoE, such as the dedicated inter-RDU network and the streaming dataflow architecture, demonstrate how hardware-software co-design can address the challenges of the AI memory wall and enable more efficient deployment of modular AI systems. While the solution is closely tied to the SambaNova SN40L hardware, the principles and insights from this research could inform the development of similar systems on other platforms.

As AI models continue to grow in size and complexity, the need for scalable and cost-effective deployment solutions will become increasingly important. The Samba-CoE system represents a significant step forward in addressing these challenges and could pave the way for more accessible and widespread adoption of large-scale AI applications.

Fusion impact on operation intensity. Without full fusion, memory limitations are likely.

Fusion Level	Operation Intensity (Ops/Byte)
No Fusion	39.5
Gemm0 - Mul - Transpose	102.6
Fully Spatially Fused	410.4

Original caption: TABLE I: Impact of different levels of fusion on operation intensity for the example in Figure 3. Without full fusion, this example will be memory bound on most architectures.

Parameter	Value
Compute Capability (BF16)	638 TFLOPs
SRAM Capacity	520 MB
HBM Capacity	64 GB
HBM Bandwidth	1.8 TB/s
DDR Capacity	1.5 TB
DDR Bandwidth	200 GB/s
PCU Count	1040
PMU Count	1040
Clock Frequency	<2 GHz
Process Technology	5nm
Die Size	<650 mm²
Dies per Socket	2

Original caption: TABLE II: Chip parameters for the SN40L RDU.
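The two tables can be read together with a simple roofline estimate. This is a sketch rather than an analysis from the paper: it treats the operation intensities in Table I as FLOPs per byte of HBM traffic and assumes peak BF16 compute and peak HBM bandwidth from Table II are both achievable. The function names are illustrative.

```python
# Roofline-style estimate combining Table I and Table II.
# ASSUMPTIONS: Table I intensities are taken as FLOPs per byte of HBM
# traffic, and the peak figures in Table II are sustained.

PEAK_FLOPS = 638e12   # BF16 compute, Table II
HBM_BW = 1.8e12       # bytes/second, Table II

FUSION_LEVELS = {     # operation intensity in ops/byte, Table I
    "No Fusion": 39.5,
    "Gemm0 - Mul - Transpose": 102.6,
    "Fully Spatially Fused": 410.4,
}


def ridge_point(bandwidth, peak=PEAK_FLOPS):
    """Intensity above which a kernel stops being bandwidth-limited."""
    return peak / bandwidth


def attainable_flops(intensity, bandwidth, peak=PEAK_FLOPS):
    """Roofline bound: the lower of peak compute and intensity * bandwidth."""
    return min(peak, intensity * bandwidth)


def classify(bandwidth=HBM_BW):
    """Label each fusion level as memory-bound or compute-bound."""
    ridge = ridge_point(bandwidth)
    return {
        name: "compute-bound" if oi >= ridge else "memory-bound"
        for name, oi in FUSION_LEVELS.items()
    }


if __name__ == "__main__":
    print(f"ridge point: {ridge_point(HBM_BW):.1f} ops/byte")
    for name, oi in FUSION_LEVELS.items():
        tflops = attainable_flops(oi, HBM_BW) / 1e12
        print(f"{name}: {tflops:.1f} TFLOPs attainable, {classify()[name]}")
```

Under these assumptions only the fully fused version clears the ridge point, which is consistent with the Table I caption's remark that the example is memory bound without full fusion.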

Full paper

Read original: arXiv:2405.07518
