CloverLeaf on Intel Multi-Core CPUs: A Case Study in Write-Allocate Evasion

Read original: arXiv:2311.04797 - Published 5/20/2024 by Jan Laukemann, Thomas Gruber, Georg Hager, Dossay Oryspayev, Gerhard Wellein


Overview

Researchers analyze the performance of the CloverLeaf code, a benchmark from the SPEChpc 2021 suite, on recent Intel Xeon Ice Lake and Sapphire Rapids server CPUs.
They observe unexpected performance breakdowns when the number of processes is prime.
The researchers create data traffic models to understand the root cause of these performance issues, connecting them to SpecI2M, a new write-allocate evasion feature in current Intel CPUs.

Plain English Explanation

The researchers looked at the performance of a benchmark called CloverLeaf on newer Intel server CPUs. They noticed something strange: performance would suddenly drop when the benchmark ran with a prime number of processes. To figure out why, they built models of how much data moves between the CPU and main memory during the benchmark.

They found that the performance drops are tied to a new feature in the Intel CPUs called SpecI2M. This feature is meant to cut the extra memory traffic caused by writing data, but it fails to work properly when the number of processes is prime, because the way the problem is split up in that case produces very short inner loops. The researchers were able to explain why this was the case and rule out other possible reasons for the performance issues.

Technical Explanation

The researchers analyzed the MPI-only version of the CloverLeaf code from the SPEChpc 2021 benchmark suite, running it on recent Intel Xeon Ice Lake and Sapphire Rapids server CPUs.

They observed that the performance of the code broke down in unexpected ways when the number of processes was prime. To investigate this effect, the researchers created first-principles data traffic models for each of the stencil-like hotspot loops in the code, and used both application measurements and microbenchmarks to study the memory data traffic behavior.

Through this analysis, the researchers were able to connect the performance breakdowns to SpecI2M, a new write-allocate evasion feature in current Intel CPUs that can reduce the memory traffic caused by write misses. However, SpecI2M fails to work properly when the number of processes is prime, which the researchers attribute to the short inner loops that emerge from the one-dimensional domain decomposition in this case.
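The link between a prime process count and short inner loops can be illustrated with a small Python sketch. It is not taken from the paper or from CloverLeaf itself: the function names, the near-square factorization rule, the assignment of the larger factor to the inner dimension, and the grid width of 15360 cells are all assumptions made for illustration.

```python
import math


def near_square_factors(nprocs):
    """Split nprocs into (px, py) with px * py == nprocs and px >= py,
    choosing the pair closest to a square (hypothetical decomposition rule)."""
    py = math.isqrt(nprocs)
    while nprocs % py != 0:
        py -= 1
    return nprocs // py, py


def inner_loop_length(nx, nprocs):
    """Inner-loop length per process, assuming the larger factor px
    cuts the inner (contiguous) grid dimension of width nx."""
    px, _ = near_square_factors(nprocs)
    return -(-nx // px)  # ceiling division


if __name__ == "__main__":
    nx = 15360  # hypothetical grid width, chosen only for illustration
    for nprocs in (71, 72):
        px, py = near_square_factors(nprocs)
        print(nprocs, (px, py), inner_loop_length(nx, nprocs))
```

Under these assumptions, a prime count such as 71 can only be factored as 71 x 1, so the decomposition becomes one-dimensional and each process gets a much shorter inner loop than with the neighboring count of 72, which factors as 9 x 8.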

The researchers were able to analytically predict the memory data volume for the serial and full-node cases with an error of just a few percent. They also ruled out other potential causes of the prime number effect, such as breaking layer conditions, MPI communication overhead, and load imbalance.
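To give a feel for what a data traffic model accounts for, here is a deliberately simplified Python sketch of memory traffic per loop iteration with and without write-allocate evasion. It is not the paper's model: the function name, the `wa_evaded_fraction` parameter, and the choice of a plain copy loop are assumptions for illustration. It relies only on the general fact that a store miss on a write-allocate cache first reads the target cache line from memory before it is later written back.

```python
def bytes_per_iteration(loads, stores, wa_evaded_fraction=0.0, word_bytes=8):
    """Estimate main-memory traffic per iteration of a streaming loop.

    loads / stores: number of distinct arrays read / written per iteration.
    wa_evaded_fraction: assumed share of write-allocate reads that the
        hardware manages to avoid (0.0 = none, 1.0 = all).
    """
    read = loads * word_bytes
    write_back = stores * word_bytes
    # Each store miss normally triggers an extra read of the target line.
    write_allocate = stores * word_bytes * (1.0 - wa_evaded_fraction)
    return read + write_back + write_allocate


if __name__ == "__main__":
    # Copy loop a[i] = b[i] on doubles: one load stream, one store stream.
    print(bytes_per_iteration(1, 1, wa_evaded_fraction=0.0))  # 24.0
    print(bytes_per_iteration(1, 1, wa_evaded_fraction=1.0))  # 16.0
```

In this toy case, full evasion removes a third of the memory traffic for a copy loop, which shows why a feature like SpecI2M failing to engage can visibly hurt a memory-bound code.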

Critical Analysis

The researchers provide a thorough and well-designed study of the performance issues in the CloverLeaf benchmark, using a combination of application measurements, microbenchmarks, and first-principles data traffic modeling.

However, the paper does not discuss potential limitations or caveats of its approach. For example, it would be interesting to see how the performance issues scale with problem size or the number of nodes used. Additionally, the researchers do not explore whether the SpecI2M feature could be tuned or disabled to mitigate the prime number effect.

Further research could also investigate whether similar performance issues are observed in other HPC benchmarks or real-world applications, and whether the insights from this study can be generalized to other hardware architectures or programming models.

Conclusion

The researchers have provided valuable insights into the performance characteristics of the CloverLeaf benchmark on recent Intel server CPUs. By uncovering the root cause of the unexpected performance breakdowns when using a prime number of processes, the study highlights the importance of understanding the interplay between hardware features, software design, and problem decomposition in high-performance computing. These findings could have implications for optimizing the performance of a wide range of HPC applications and benchmarks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


CloverLeaf on Intel Multi-Core CPUs: A Case Study in Write-Allocate Evasion

Jan Laukemann, Thomas Gruber, Georg Hager, Dossay Oryspayev, Gerhard Wellein

In this paper we analyze the MPI-only version of the CloverLeaf code from the SPEChpc 2021 benchmark suite on recent Intel Xeon Ice Lake and Sapphire Rapids server CPUs. We observe peculiar breakdowns in performance when the number of processes is prime. Investigating this effect, we create first-principles data traffic models for each of the stencil-like hotspot loops. With application measurements and microbenchmarks to study memory data traffic behavior, we can connect the breakdowns to SpecI2M, a new write-allocate evasion feature in current Intel CPUs. For serial and full-node cases we are able to predict the memory data volume analytically with an error of a few percent. We find that if the number of processes is prime, SpecI2M fails to work properly, which we can attribute to short inner loops emerging from the one-dimensional domain decomposition in this case. We can also rule out other possible causes of the prime number effect, such as breaking layer conditions, MPI communication overhead, and load imbalance.

5/20/2024


Microarchitectural comparison and in-core modeling of state-of-the-art CPUs: Grace, Sapphire Rapids, and Genoa

Jan Laukemann, Georg Hager, Gerhard Wellein

With Nvidia's release of the Grace Superchip, all three big semiconductor companies in HPC (AMD, Intel, Nvidia) are currently competing in the race for the best CPU. In this work we analyze the performance of these state-of-the-art CPUs and create an accurate in-core performance model for their microarchitectures Zen 4, Golden Cove, and Neoverse V2, extending the Open Source Architecture Code Analyzer (OSACA) tool and comparing it with LLVM-MCA. Starting from the peculiarities and up- and downsides of a single core, we extend our comparison by a variety of microbenchmarks and the capabilities of a full node. The write-allocate (WA) evasion feature, which can automatically reduce the memory traffic caused by write misses, receives special attention; we show that the Grace Superchip has a next-to-optimal implementation of WA evasion, and that the only way to avoid write allocates on Zen 4 is the explicit use of non-temporal stores.

9/14/2024

Benchmarking with Supernovae: A Performance Study of the FLASH Code

Joshua Martin, Catherine Feldman, Eva Siegmann, Tony Curtis, David Carlson, Firat Coskun, Daniel Wood, Raul Gonzalez, Robert J. Harrison, Alan C. Calder

Astrophysical simulations are computation, memory, and thus energy intensive, thereby requiring new hardware advances for progress. Stony Brook University recently expanded its computing cluster SeaWulf with an addition of 94 new nodes featuring Intel Sapphire Rapids Xeon Max series CPUs. We present a performance and power efficiency study of this hardware performed with FLASH: a multi-scale, multi-physics, adaptive mesh-based software instrument. We extend this study to compare performance to that of Stony Brook's Ookami testbed which features ARM-based A64FX-700 processors, and SeaWulf's AMD EPYC Milan and Intel Skylake nodes. Our application is a stellar explosion known as a thermonuclear (Type Ia) supernova and for this 3D problem, FLASH includes operators for hydrodynamics, gravity, and nuclear burning, in addition to routines for the material equation of state. We perform a strong-scaling study with a 220 GB problem size to explore both single- and multi-node performance. Our study explores the performance of different MPI mappings and the distribution of processors across nodes. From these tests, we determined the optimal configuration to balance runtime and energy consumption for our application.

8/30/2024


Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels

Dane C. Lacey, Christie L. Alappat, Florian Lange, Georg Hager, Holger Fehske, Gerhard Wellein

Sparse matrix-vector products (SpMVs) are a bottleneck in many scientific codes. Due to the heavy strain on the main memory interface from loading the sparse matrix and the possibly irregular memory access pattern, SpMV typically exhibits low arithmetic intensity. Repeating these products multiple times with the same matrix is required in many algorithms. This so-called matrix power kernel (MPK) provides an opportunity for data reuse since the same matrix data is loaded from main memory multiple times, an opportunity that has only recently been exploited successfully with the Recursive Algebraic Coloring Engine (RACE). Using RACE, one considers a graph-based formulation of the SpMV and employs a level-based implementation of SpMV for reuse of relevant matrix data. However, the underlying data dependencies have restricted the use of this concept to shared memory parallelization and thus to single compute nodes. Enabling cache blocking for distributed-memory parallelization of MPK is challenging due to the need for explicit communication and synchronization of data in neighboring levels. In this work, we propose and implement a flexible method that interleaves the cache-blocking capabilities of RACE with an MPI communication scheme that fulfills all data dependencies among processes. Compared to a traditional distributed memory parallel MPK, our new Distributed Level-Blocked MPK yields substantial speed-ups on modern Intel and AMD architectures across a wide range of sparse matrices from various scientific applications. Finally, we address a modern quantum physics problem to demonstrate the applicability of our method, achieving a speed-up of up to 4x on 832 cores of an Intel Sapphire Rapids cluster.

5/24/2024