TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs

Read original: arXiv:2308.03578 - Published 6/12/2024 by Laxman Dhulipala, Jason Lee, Jakub {L}k{a}cki, Vahab Mirrokni

🔗

Overview

Introduces TeraHAC, a $(1+\epsilon)$-approximate hierarchical agglomerative clustering (HAC) algorithm
Scales to trillion-edge graphs by combining the nearest-neighbor chain algorithm and $(1+\epsilon)$-approximate HAC
Allows partitioning the graph across multiple machines to compute the clustering within each partition before communicating with other partitions

Plain English Explanation

TeraHAC is a new algorithm for grouping related items in very large datasets, called hierarchical agglomerative clustering (HAC). Traditional HAC methods struggle to handle datasets with trillions of connections, but TeraHAC overcomes this by using a clever combination of two techniques: the nearest-neighbor chain algorithm and an approximate version of HAC. This allows TeraHAC to divide the dataset across multiple computers and make significant progress on each piece before needing to share information between them. The result is a system that can cluster datasets orders of magnitude larger than what was previously possible, while still maintaining high-quality groupings.

Technical Explanation

TeraHAC is a $(1+\epsilon)$-approximate hierarchical agglomerative clustering (HAC) algorithm designed to scale to trillion-edge graphs. It achieves this scalability through a novel combination of the nearest-neighbor chain algorithm and the concept of $(1+\epsilon)$-approximate HAC.

The key insight is that by partitioning the graph across multiple machines and making significant progress in computing the clustering within each partition before any communication with other partitions is needed, TeraHAC can vastly reduce the number of communication rounds required compared to previous distributed HAC approaches. This allows it to be up to 8.3x faster than the previous state-of-the-art SCC algorithm, while achieving 1.16x higher quality clustering.

In essence, TeraHAC retains the quality of the celebrated HAC algorithm while dramatically improving the running time, enabling it to scale to datasets orders of magnitude larger than what was previously possible, as demonstrated by the experiments on graphs with up to 8 trillion edges.

Critical Analysis

The paper provides a thorough experimental evaluation of TeraHAC, demonstrating its significant performance advantages over prior methods. However, it does not extensively discuss potential limitations or areas for further research.

One aspect that could be explored further is the sensitivity of TeraHAC's performance to the choice of the $\epsilon$ parameter, which controls the trade-off between clustering quality and computational efficiency. The paper shows results for a single value of $\epsilon$, but understanding how the algorithm behaves across a wider range of $\epsilon$ values could provide additional insights.

Additionally, while the paper highlights TeraHAC's ability to scale to trillion-edge graphs, it would be valuable to understand how the algorithm's performance and resource requirements scale as the graph size increases further, potentially into the exascale regime and beyond.

Conclusion

The TeraHAC algorithm represents a significant advancement in hierarchical agglomerative clustering, enabling the analysis of datasets with trillions of connections. By combining the nearest-neighbor chain algorithm and $(1+\epsilon)$-approximate HAC, TeraHAC can partition the graph across multiple machines and make substantial progress on each partition before needing to communicate. This results in a system that is orders of magnitude faster than previous approaches, while still maintaining high-quality clustering.

The ability to scale HAC to such massive datasets opens up new possibilities for understanding complex, large-scale phenomena in fields like social network analysis, drug discovery, and transport optimization. As the volume and connectivity of data continues to grow, innovations like TeraHAC will be increasingly crucial for unlocking the insights hidden within.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🔗

TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs

Laxman Dhulipala, Jason Lee, Jakub {L}k{a}cki, Vahab Mirrokni

We introduce TeraHAC, a $(1+epsilon)$-approximate hierarchical agglomerative clustering (HAC) algorithm which scales to trillion-edge graphs. Our algorithm is based on a new approach to computing $(1+epsilon)$-approximate HAC, which is a novel combination of the nearest-neighbor chain algorithm and the notion of $(1+epsilon)$-approximate HAC. Our approach allows us to partition the graph among multiple machines and make significant progress in computing the clustering within each partition before any communication with other partitions is needed. We evaluate TeraHAC on a number of real-world and synthetic graphs of up to 8 trillion edges. We show that TeraHAC requires over 100x fewer rounds compared to previously known approaches for computing HAC. It is up to 8.3x faster than SCC, the state-of-the-art distributed algorithm for hierarchical clustering, while achieving 1.16x higher quality. In fact, TeraHAC essentially retains the quality of the celebrated HAC algorithm while significantly improving the running time.

6/12/2024

👁️

It's Hard to HAC with Average Linkage!

MohammadHossein Bateni, Laxman Dhulipala, Kishen N Gowda, D Ellis Hershkowitz, Rajesh Jayaram, Jakub {L}k{a}cki

Average linkage Hierarchical Agglomerative Clustering (HAC) is an extensively studied and applied method for hierarchical clustering. Recent applications to massive datasets have driven significant interest in near-linear-time and efficient parallel algorithms for average linkage HAC. We provide hardness results that rule out such algorithms. On the sequential side, we establish a runtime lower bound of $n^{3/2-epsilon}$ on $n$ node graphs for sequential combinatorial algorithms under standard fine-grained complexity assumptions. This essentially matches the best-known running time for average linkage HAC. On the parallel side, we prove that average linkage HAC likely cannot be parallelized even on simple graphs by showing that it is CC-hard on trees of diameter $4$. On the possibility side, we demonstrate that average linkage HAC can be efficiently parallelized (i.e., it is in NC) on paths and can be solved in near-linear time when the height of the output cluster hierarchy is small.

4/24/2024

🔍

A Near-Linear Time Approximation Algorithm for Beyond-Worst-Case Graph Clustering

Vincent Cohen-Addad, Tommaso d'Orsi, Aida Mousavifar

We consider the semi-random graph model of [Makarychev, Makarychev and Vijayaraghavan, STOC'12], where, given a random bipartite graph with $alpha$ edges and an unknown bipartition $(A, B)$ of the vertex set, an adversary can add arbitrary edges inside each community and remove arbitrary edges from the cut $(A, B)$ (i.e. all adversarial changes are textit{monotone} with respect to the bipartition). For this model, a polynomial time algorithm is known to approximate the Balanced Cut problem up to value $O(alpha)$ [MMV'12] as long as the cut $(A, B)$ has size $Omega(alpha)$. However, it consists of slow subroutines requiring optimal solutions for logarithmically many semidefinite programs. We study the fine-grained complexity of the problem and present the first near-linear time algorithm that achieves similar performances to that of [MMV'12]. Our algorithm runs in time $O(|V(G)|^{1+o(1)} + |E(G)|^{1+o(1)})$ and finds a balanced cut of value $O(alpha)$. Our approach appears easily extendible to related problem, such as Sparsest Cut, and also yields an near-linear time $O(1)$-approximation to Dagupta's objective function for hierarchical clustering [Dasgupta, STOC'16] for the semi-random hierarchical stochastic block model inputs of [Cohen-Addad, Kanade, Mallmann-Trenn, Mathieu, JACM'19].

6/10/2024

🌐

A Versatile Framework for Attributed Network Clustering via K-Nearest Neighbor Augmentation

Yiran Li, Gongyao Guo, Jieming Shi, Renchi Yang, Shiqi Shen, Qing Li, Jun Luo

Attributed networks containing entity-specific information in node attributes are ubiquitous in modeling social networks, e-commerce, bioinformatics, etc. Their inherent network topology ranges from simple graphs to hypergraphs with high-order interactions and multiplex graphs with separate layers. An important graph mining task is node clustering, aiming to partition the nodes of an attributed network into k disjoint clusters such that intra-cluster nodes are closely connected and share similar attributes, while inter-cluster nodes are far apart and dissimilar. It is highly challenging to capture multi-hop connections via nodes or attributes for effective clustering on multiple types of attributed networks. In this paper, we first present AHCKA as an efficient approach to attributed hypergraph clustering (AHC). AHCKA includes a carefully-crafted K-nearest neighbor augmentation strategy for the optimized exploitation of attribute information on hypergraphs, a joint hypergraph random walk model to devise an effective AHC objective, and an efficient solver with speedup techniques for the objective optimization. The proposed techniques are extensible to various types of attributed networks, and thus, we develop ANCKA as a versatile attributed network clustering framework, capable of attributed graph clustering (AGC), attributed multiplex graph clustering (AMGC), and AHC. Moreover, we devise ANCKA with algorithmic designs tailored for GPU acceleration to boost efficiency. We have conducted extensive experiments to compare our methods with 19 competitors on 8 attributed hypergraphs, 16 competitors on 6 attributed graphs, and 16 competitors on 3 attributed multiplex graphs, all demonstrating the superb clustering quality and efficiency of our methods.

8/13/2024