Scaling on Frontier: Uncertainty Quantification Workflow Applications using ExaWorks to Enable Full System Utilization

Read original: arXiv:2407.01484 - Published 7/2/2024 by Mikhail Titov, Robert Carson, Matthew Rolchigo, John Coleman, James Belak, Matthew Bement, Daniel Laney, Matteo Turilli, Shantenu Jha

Overview

Explores the use of RADICAL-EnTK (Ensemble Toolkit), a component of the ExaWorks SDK, to run Uncertainty Quantification (UQ) workflow applications at full-system scale on the Frontier supercomputer
Focuses on scaling UQ workflows, which run many simulations with varying input parameters to quantify the uncertainty in the output
Demonstrates that these workflows can be scheduled and managed efficiently on Frontier, an exascale system, on up to 8,000 compute nodes with resource utilization of up to 90%

Plain English Explanation

The paper discusses how researchers used a software system called ExaWorks to help run complex scientific simulations on the Frontier supercomputer, a powerful new machine. The goal was to fully utilize the capabilities of Frontier to study the uncertainty in the results of these simulations.

Uncertainty quantification (UQ) workflows involve running many similar simulations with slightly different input parameters, in order to understand how changes in the inputs affect the outputs. This is an important technique in fields like additive manufacturing, where small variations in the manufacturing process can have a big impact on the final product.
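As a rough illustration of the ensemble idea described above, the following minimal Python sketch samples an uncertain input, runs one toy "simulation" per sample in parallel, and summarizes the spread of the outputs. Everything in it is an assumption made for illustration: `toy_simulation`, the `laser_power` parameter, and the Gaussian input distribution are hypothetical stand-ins, not the ExaAM codes or anything taken from the paper.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def toy_simulation(laser_power: float) -> float:
    """Hypothetical stand-in for one expensive simulation run."""
    return 0.5 * laser_power ** 2 + 3.0


def run_ensemble(n_members: int, mean: float, spread: float, seed: int = 0):
    """Sample n_members inputs and evaluate the toy model on each one."""
    rng = random.Random(seed)  # fixed seed keeps the ensemble reproducible
    inputs = [rng.gauss(mean, spread) for _ in range(n_members)]
    # Ensemble members are independent, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=8) as pool:
        outputs = list(pool.map(toy_simulation, inputs))
    return inputs, outputs


inputs, outputs = run_ensemble(1000, mean=10.0, spread=0.5)
print("output mean:", statistics.mean(outputs))
print("output std: ", statistics.stdev(outputs))
```

The standard deviation of the outputs is the simplest uncertainty estimate one could report here. On a real system each ensemble member would be a multi-node job rather than a function call, which is why a workflow manager is needed.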

The researchers showed that ExaWorks could effectively manage these UQ workflows on Frontier, scheduling the individual simulations efficiently and ensuring the full system was being used to its maximum potential. This allowed them to run large ensembles of simulations in a coordinated way, providing deeper insights into the uncertainties involved.

The ExaWorks project is an effort to develop advanced workflow management systems for exascale computing, which will be crucial as supercomputers become more powerful and complex. By using ExaWorks on Frontier, the researchers demonstrated how these new workflow tools can help scientists take full advantage of the latest high-performance computing resources.

Technical Explanation

The paper explores the use of RADICAL-EnTK (Ensemble Toolkit), one of the SDK components of the ExaWorks project, to execute Uncertainty Quantification (UQ) workflow applications efficiently on the Frontier supercomputer. UQ workflows run many simulations with varying input parameters to quantify the uncertainty in the output, an important technique in fields like additive manufacturing. The workflows studied here are the Exascale Additive Manufacturing (ExaAM) workflows.

The researchers used EnTK to implement, schedule, and execute these UQ workloads on Frontier, the exascale system at the Oak Ridge Leadership Computing Facility. Frontier provides immense computational power, but keeping all of its resources busy is a challenge in its own right.

By integrating ExaWorks with Frontier's job scheduling and resource management systems, the researchers were able to demonstrate the ability to fully utilize the system's capabilities for UQ workflow applications. ExaWorks handled the complexities of managing the individual simulation tasks, dependencies, and data movement, allowing the researchers to focus on the scientific objectives.
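To make "full system utilization" concrete, here is a minimal sketch of how utilization could be computed for a set of tasks with different node counts and runtimes. It is an assumption-laden toy: the greedy first-fit policy, the `Task` type, and `simulate_utilization` are hypothetical and do not represent EnTK's scheduler or Frontier's batch system. Utilization is taken as busy node-time divided by allocated node-time (total nodes times makespan).

```python
import heapq
from dataclasses import dataclass


@dataclass
class Task:
    nodes: int      # compute nodes the task occupies
    runtime: float  # wall-clock duration in seconds


def simulate_utilization(tasks, total_nodes):
    """Greedy first-fit scheduling; returns (makespan, utilization)."""
    if not tasks:
        return 0.0, 0.0
    pending = list(tasks)
    running = []        # min-heap of (finish_time, nodes)
    free = total_nodes
    now = 0.0
    busy = 0.0          # accumulated node-seconds of useful work
    while pending or running:
        waiting = []
        for task in pending:  # start every waiting task that fits right now
            if task.nodes <= free:
                free -= task.nodes
                heapq.heappush(running, (now + task.runtime, task.nodes))
                busy += task.nodes * task.runtime
            else:
                waiting.append(task)
        pending = waiting
        if not running:
            raise ValueError("a task needs more nodes than the allocation has")
        now, released = heapq.heappop(running)  # advance to the next completion
        free += released
    return now, busy / (total_nodes * now)


makespan, utilization = simulate_utilization(
    [Task(6, 10.0), Task(4, 10.0)], total_nodes=8
)
print(f"makespan={makespan}s utilization={utilization:.1%}")
```

In this example the two tasks cannot run side by side on 8 nodes, so nodes sit idle while each runs. Gaps of this kind are what a workflow manager tries to minimize when tasks vary in size and runtime.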

The paper presents performance results and insights from running real-world additive manufacturing UQ workflows on up to 8,000 Frontier compute nodes, with resource utilization of up to 90%. These results highlight the benefit of using a workflow management toolkit to enable full system utilization on leading supercomputing platforms.

Critical Analysis

The paper provides a valuable case study on the practical application of workflow management systems, such as ExaWorks, to enable efficient utilization of state-of-the-art supercomputing resources for complex scientific simulations. The researchers have done a commendable job in demonstrating the capabilities of ExaWorks and its integration with the Frontier system.

However, the paper could have delved deeper into some of the specific challenges and trade-offs encountered in this integration process. For example, it would be interesting to understand how ExaWorks handles task scheduling, load balancing, and data management in the context of Frontier's unique hardware and software architecture. Insights into the adaptations or extensions made to ExaWorks to better suit the Frontier environment would strengthen the paper's contribution.

Additionally, the paper could have explored the broader implications of this work, such as the potential for using similar workflow management approaches to enable efficient utilization of the computing continuum, where computational resources span from edge devices to exascale supercomputers. This could provide a more comprehensive perspective on the significance of the research.

Overall, the paper presents a valuable demonstration of the practical benefits of advanced workflow management systems in the context of exascale computing. Further exploration of the technical details and broader implications could strengthen the paper's impact and contribution to the field.

Conclusion

The paper showcases the use of RADICAL-EnTK, a component of the ExaWorks SDK, to enable efficient execution of Uncertainty Quantification (UQ) workflow applications on the Frontier exascale supercomputer. By integrating the toolkit with Frontier's job scheduling and resource management systems, the researchers were able to use the system's capabilities fully when running large ensembles of UQ simulations.

The success of this work highlights the importance of advanced workflow management tools, such as ExaWorks, in unlocking the potential of cutting-edge supercomputing resources for complex scientific research. As the computing landscape continues to evolve towards the computing continuum, with diverse hardware platforms and increasing computational demands, the ability to effectively manage and orchestrate scientific workflows will become increasingly crucial.

This research represents an important step forward in bridging the gap between the capabilities of exascale systems and their practical utilization for real-world scientific applications. The insights and lessons learned from this work can inform the ongoing development of workflow management systems and contribute to the broader efforts to maximize the potential of high-performance computing for scientific discovery and innovation.

Related Papers

Scaling on Frontier: Uncertainty Quantification Workflow Applications using ExaWorks to Enable Full System Utilization

Mikhail Titov, Robert Carson, Matthew Rolchigo, John Coleman, James Belak, Matthew Bement, Daniel Laney, Matteo Turilli, Shantenu Jha

When running at scale, modern scientific workflows require middleware to handle allocated resources, distribute computing payloads and guarantee a resilient execution. While individual steps might not require sophisticated control methods, bringing them together as a whole workflow requires advanced management mechanisms. In this work, we used RADICAL-EnTK (Ensemble Toolkit) - one of the SDK components of the ECP ExaWorks project - to implement and execute the novel Exascale Additive Manufacturing (ExaAM) workflows on up to 8000 compute nodes of the Frontier supercomputer at the Oak Ridge Leadership Computing Facility. EnTK allowed us to address challenges such as varying resource requirements (e.g., heterogeneity, size, and runtime), different execution environment per workflow, and fault tolerance. And a native portability feature of the developed EnTK applications allowed us to adjust these applications for Frontier runs promptly, while ensuring an expected level of resource utilization (up to 90%).

7/2/2024

ExaWorks Software Development Kit: A Robust and Scalable Collection of Interoperable Workflow Technologies

Matteo Turilli, Mihael Hategan-Marandiuc, Mikhail Titov, Ketan Maheshwari, Aymen Alsaadi, Andre Merzky, Ramon Arambula, Mikhail Zakharchanka, Matt Cowan, Justin M. Wozniak, Andreas Wilke, Ozgur Ozan Kilic, Kyle Chard, Rafael Ferreira da Silva, Shantenu Jha, Daniel Laney

Scientific discovery increasingly requires executing heterogeneous scientific workflows on high-performance computing (HPC) platforms. Heterogeneous workflows contain different types of tasks (e.g., simulation, analysis, and learning) that need to be mapped, scheduled, and launched on different computing resources. That requires a software stack that enables users to code their workflows and automate resource management and workflow execution. Currently, there are many workflow technologies with diverse levels of robustness and capabilities, and users face difficult choices of software that can effectively and efficiently support their use cases on HPC machines, especially when considering the latest exascale platforms. We contributed to addressing this issue by developing the ExaWorks Software Development Kit (SDK). The SDK is a curated collection of workflow technologies engineered following current best practices and specifically designed to work on HPC platforms. We present our experience with (1) curating those technologies, (2) integrating them to provide users with new capabilities, (3) developing a continuous integration platform to test the SDK on DOE HPC platforms, (4) designing a dashboard to publish the results of those tests, and (5) devising an innovative documentation platform to help users to use those technologies. Our experience details the requirements and the best practices needed to curate workflow technologies, and it also serves as a blueprint for the capabilities and services that DOE will have to offer to support a variety of scientific heterogeneous workflows on the newly available exascale HPC platforms.

7/24/2024

PETSc/TAO Developments for Early Exascale Systems

Richard Tran Mills, Mark Adams, Satish Balay, Jed Brown, Jacob Faibussowitsch, Toby Isaac, Matthew Knepley, Todd Munson, Hansol Suh, Stefano Zampini, Hong Zhang, Junchao Zhang

The Portable Extensible Toolkit for Scientific Computation (PETSc) library provides scalable solvers for nonlinear time-dependent differential and algebraic equations and for numerical optimization via the Toolkit for Advanced Optimization (TAO). PETSc is used in dozens of scientific fields and is an important building block for many simulation codes. During the U.S. Department of Energy's Exascale Computing Project, the PETSc team has made substantial efforts to enable efficient utilization of the massive fine-grain parallelism present within exascale compute nodes and to enable performance portability across exascale architectures. We recap some of the challenges that designers of numerical libraries face in such an endeavor, and then discuss the many developments we have made, which include the addition of new GPU backends, features supporting efficient on-device matrix assembly, better support for asynchronicity and GPU kernel concurrency, and new communication infrastructure. We evaluate the performance of these developments on some pre-exascale systems as well as the early exascale systems Frontier and Aurora, using compute kernel, communication layer, solver, and mini-application benchmark studies, and then close with a few observations drawn from our experiences on the tension between portable performance and other goals of numerical libraries.

6/14/2024

Paving the Way to Hybrid Quantum-Classical Scientific Workflows

Sandeep Suresh Cranganore, Vincenzo De Maio, Ivona Brandic, Ewa Deelman

The increasing growth of data volume, and the consequent explosion in demand for computational power, are affecting scientific computing, as shown by the rise of extreme data scientific workflows. As the need for computing power increases, quantum computing has been proposed as a way to deliver it. It may provide significant theoretical speedups for many scientific applications (e.g., molecular dynamics, quantum chemistry, combinatorial optimization, and machine learning). Therefore, integrating quantum computers into the computing continuum constitutes a promising way to speed up scientific computation. However, the scientific computing community still lacks the necessary tools and expertise to fully harness the power of quantum computers in the execution of complex applications such as scientific workflows. In this work, we describe the main characteristics of quantum computing and its main benefits for scientific applications, then we formalize hybrid quantum-classic workflows, explore how to identify quantum components and map them onto resources. We demonstrate concepts on a real use case and define a software architecture for a hybrid workflow management system.

4/17/2024