Atlas: Hierarchical Partitioning for Quantum Circuit Simulation on GPUs (Extended Version)

Read original: arXiv:2408.09055 - Published 8/20/2024 by Mingkuan Xu, Shiyi Cao, Xupeng Miao, Umut A. Acar, Zhihao Jia

Atlas: Hierarchical Partitioning for Quantum Circuit Simulation on GPUs (Extended Version)

Overview

Presents an efficient GPU-based approach called Atlas for simulating large quantum circuits
Introduces a novel hierarchical partitioning technique to distribute the workload across GPU cores
Demonstrates significant performance improvements over state-of-the-art quantum circuit simulators

Plain English Explanation

Atlas: Hierarchical Partitioning for Quantum Circuit Simulation on GPUs (Extended Version) describes a new method for simulating quantum circuits on graphics processing units (GPUs). Quantum computers have the potential to solve certain problems much faster than classical computers, but building and programming them is incredibly challenging. Researchers are exploring ways to simulate quantum circuits on traditional computers to better understand quantum algorithms and test new designs.

The key innovation in this work is a hierarchical partitioning technique that divides the quantum circuit into smaller pieces that can be processed in parallel on a GPU. This allows the simulation to scale to much larger circuits than previous methods. The researchers implemented this approach in a tool called Atlas and showed that it outperforms other state-of-the-art quantum circuit simulators, sometimes by orders of magnitude.

Technical Explanation

The Atlas system uses a novel hierarchical partitioning algorithm to distribute the workload of simulating a quantum circuit across the many cores of a GPU. First, it recursively splits the circuit into smaller subcircuits based on the connectivity of the quantum gates. This creates a tree-like structure that can be mapped efficiently to the GPU's thread hierarchy.

Within each subcircuit, Atlas employs an optimized tensor contraction kernel to perform the necessary linear algebra operations. By batching many small tensor contractions together, the kernel achieves high GPU utilization and memory throughput.

The experimental evaluation demonstrates that Atlas can simulate circuits with up to 60 qubits, far exceeding the capabilities of prior GPU-based simulators. On a range of benchmark circuits, Atlas outperforms the state-of-the-art CPU-based simulator Qsharp by up to 100x and the GPU-based simulator Pennylane by up to 10x.

Critical Analysis

The authors acknowledge that Atlas's performance advantage decreases as the circuit depth increases, as the benefits of parallelism become outweighed by the overhead of the partitioning algorithm. They also note that the current implementation does not support certain advanced quantum circuit features, such as measurement and feedback, which limits its applicability for some use cases.

Additionally, while the hierarchical partitioning approach is a key innovation, the authors do not provide a rigorous theoretical analysis of its optimality or explore alternative partitioning schemes that could further improve performance.

Conclusion

Atlas represents a significant advance in the field of quantum circuit simulation, demonstrating the power of leveraging GPU hardware to tackle this computationally intensive task. The hierarchical partitioning technique allows Atlas to scale to much larger circuits than previous simulators, opening the door to more ambitious quantum algorithm research and testing. While there are still some limitations to address, this work represents an important step forward in bridging the gap between quantum theory and practical implementation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Atlas: Hierarchical Partitioning for Quantum Circuit Simulation on GPUs (Extended Version)

Mingkuan Xu, Shiyi Cao, Xupeng Miao, Umut A. Acar, Zhihao Jia

This paper presents techniques for theoretically and practically efficient and scalable Schrodinger-style quantum circuit simulation. Our approach partitions a quantum circuit into a hierarchy of subcircuits and simulates the subcircuits on multi-node GPUs, exploiting available data parallelism while minimizing communication costs. To minimize communication costs, we formulate an Integer Linear Program that rewards simulation of nearby gates on nearby GPUs. To maximize throughput, we use a dynamic programming algorithm to compute the subcircuit simulated by each kernel at a GPU. We realize these techniques in Atlas, a distributed, multi-GPU quantum circuit simulator. Our evaluation on a variety of quantum circuits shows that Atlas outperforms state-of-the-art GPU-based simulators by more than 2$times$ on average and is able to run larger circuits via offloading to DRAM, outperforming other large-circuit simulators by two orders of magnitude.

8/20/2024

Efficient Quantum Circuit Simulation by Tensor Network Methods on Modern GPUs

Feng Pan, Hanfeng Gu, Lvlin Kuang, Bing Liu, Pan Zhang

Efficient simulation of quantum circuits has become indispensable with the rapid development of quantum hardware. The primary simulation methods are based on state vectors and tensor networks. As the number of qubits and quantum gates grows larger in current quantum devices, traditional state-vector based quantum circuit simulation methods prove inadequate due to the overwhelming size of the Hilbert space and extensive entanglement. Consequently, brutal force tensor network simulation algorithms become the only viable solution in such scenarios. The two main challenges faced in tensor network simulation algorithms are optimal contraction path finding and efficient execution on modern computing devices, with the latter determines the actual efficiency. In this study, we investigate the optimization of such tensor network simulations on modern GPUs and propose general optimization strategies from two aspects: computational efficiency and accuracy. Firstly, we propose to transform critical Einstein summation operations into GEMM operations, leveraging the specific features of tensor network simulations to amplify the efficiency of GPUs. Secondly, by analyzing the data characteristics of quantum circuits, we employ extended precision to ensure the accuracy of simulation results and mixed precision to fully exploit the potential of GPUs, resulting in faster and more precise simulations. Our numerical experiments demonstrate that our approach can achieve a 3.96x reduction in verification time for random quantum circuit samples in the 18-cycle case of Sycamore, with sustained performance exceeding 21 TFLOPS on one A100. This method can be easily extended to the 20-cycle case, maintaining the same performance, accelerating by 12.5x compared to the state-of-the-art CPU-based results and 4.48-6.78x compared to the state-of-the-art GPU-based results reported in the literature.

8/13/2024

🛠️

Design and execution of quantum circuits using tens of superconducting qubits and thousands of gates for dense Ising optimization problems

Filip B. Maciejewski, Stuart Hadfield, Benjamin Hall, Mark Hodson, Maxime Dupont, Bram Evert, James Sud, M. Sohaib Alam, Zhihui Wang, Stephen Jeffrey, Bhuvanesh Sundar, P. Aaron Lott, Shon Grabbe, Eleanor G. Rieffel, Matthew J. Reagor, Davide Venturelli

We develop a hardware-efficient ansatz for variational optimization, derived from existing ansatze in the literature, that parametrizes subsets of all interactions in the Cost Hamiltonian in each layer. We treat gate orderings as a variational parameter and observe that doing so can provide significant performance boosts in experiments. We carried out experimental runs of a compilation-optimized implementation of fully-connected Sherrington-Kirkpatrick Hamiltonians on a 50-qubit linear-chain subsystem of Rigetti Aspen-M-3 transmon processor. Our results indicate that, for the best circuit designs tested, the average performance at optimized angles and gate orderings increases with circuit depth (using more parameters), despite the presence of a high level of noise. We report performance significantly better than using a random guess oracle for circuits involving up to approx 5000 two-qubit and approx 5000 one-qubit native gates. We additionally discuss various takeaways of our results toward more effective utilization of current and future quantum processors for optimization.

9/16/2024

Achieving Energetic Superiority Through System-Level Quantum Circuit Simulation

Rong Fu, Zhongling Su, Han-Sen Zhong, Xiti Zhao, Jianyang Zhang, Feng Pan, Pan Zhang, Xianhe Zhao, Ming-Cheng Chen, Chao-Yang Lu, Jian-Wei Pan, Zhiling Pei, Xingcheng Zhang, Wanli Ouyang

Quantum Computational Superiority boasts rapid computation and high energy efficiency. Despite recent advances in classical algorithms aimed at refuting the milestone claim of Google's sycamore, challenges remain in generating uncorrelated samples of random quantum circuits. In this paper, we present a groundbreaking large-scale system technology that leverages optimization on global, node, and device levels to achieve unprecedented scalability for tensor networks. This enables the handling of large-scale tensor networks with memory capacities reaching tens of terabytes, surpassing memory space constraints on a single node. Our techniques enable accommodating large-scale tensor networks with up to tens of terabytes of memory, reaching up to 2304 GPUs with a peak computing power of 561 PFLOPS half-precision. Notably, we have achieved a time-to-solution of 14.22 seconds with energy consumption of 2.39 kWh which achieved fidelity of 0.002 and our most remarkable result is a time-to-solution of 17.18 seconds, with energy consumption of only 0.29 kWh which achieved a XEB of 0.002 after post-processing, outperforming Google's quantum processor Sycamore in both speed and energy efficiency, which recorded 600 seconds and 4.3 kWh, respectively.

7/2/2024