Spectral Regularized Kernel Two-Sample Tests

Read original: arXiv:2212.09201 - Published 5/3/2024 by Omar Hagrass, Bharath K. Sriperumbudur, Bing Li

➖

Overview

The paper focuses on understanding the optimality of two-sample tests based on the reproducing kernel Hilbert space (RKHS) embedding of probability distributions.
The authors show that the popular Maximum Mean Discrepancy (MMD) two-sample test is not optimal in terms of the separation boundary measured in Hellinger distance.
They propose a modified MMD test that incorporates covariance information and prove it to be minimax optimal with a smaller separation boundary.
An adaptive version of the proposed test is also introduced, which is almost minimax optimal up to a logarithmic factor.
Numerical experiments on synthetic and real data demonstrate the superior performance of the proposed tests compared to the MMD test and other popular tests.

Plain English Explanation

The paper explores a popular approach for comparing two sets of data, where the data points may not be in a standard Euclidean space. This approach involves embedding probability distributions in a reproducing kernel Hilbert space (RKHS), which is a mathematical space that can handle more complex data structures.

The authors first show that a commonly used two-sample test, called the Maximum Mean Discrepancy (MMD) test, is not optimal in terms of how well it can distinguish between the two data sets. To address this, they propose a modified version of the MMD test that takes into account the covariance information of the data, which the original MMD test ignores. They prove that this modified test is the best possible test in a specific mathematical sense.

The authors then develop an adaptive version of their proposed test, which automatically chooses the appropriate amount of regularization based on the data. This adaptive test is shown to be almost as good as the best possible test, with only a small loss in performance.

Through numerical experiments on both synthetic and real-world data, the authors demonstrate that their proposed tests outperform the original MMD test and other popular approaches in the literature. This suggests that their new tests could be valuable tools for comparing and analyzing complex data sets in a wide range of applications.

Technical Explanation

The paper focuses on the problem of nonparametric two-sample testing, where the goal is to determine whether two samples of data come from the same underlying probability distribution or not. Over the past decade, an approach based on reproducing kernel Hilbert space (RKHS) embedding of probability distributions has gained significant popularity for tackling this problem on general (non-Euclidean) domains.

The authors first show that the popular MMD two-sample test is not optimal in terms of the separation boundary measured in Hellinger distance, which is a widely used metric for comparing probability distributions. To address this, they propose a modified MMD test that incorporates covariance information, which is not captured by the original MMD test. They prove that this proposed test is minimax optimal, meaning it achieves the smallest possible separation boundary among all valid tests.

The authors then introduce an adaptive version of the proposed test, which involves a data-driven strategy to choose the regularization parameter. They show that this adaptive test is almost minimax optimal, up to a logarithmic factor. Importantly, the authors also consider the permutation variant of the tests, where the test threshold is chosen elegantly through the permutation of the samples.

Through numerical experiments on both synthetic and real-world data, the authors demonstrate the superior performance of their proposed tests compared to the MMD test and other popular tests in the literature. This suggests that their new tests could be valuable tools for comparing and analyzing complex data sets in a wide range of applications.

Critical Analysis

The paper presents a thorough theoretical analysis of the optimality of two-sample tests based on RKHS embedding of probability distributions. The authors' proof of the minimax optimality of their proposed test is a strong mathematical result, which is supported by the numerical experiments.

However, the paper does not discuss the computational complexity of the proposed tests, which could be an important practical consideration. Additionally, the authors only consider the case of two-sample testing, and it would be interesting to see how their approach could be extended to handle multiple samples or other related problems.

Another potential limitation is the reliance on the Hellinger distance as the metric for comparing probability distributions. While Hellinger distance is a widely used and well-studied metric, there may be other relevant distance measures that could be explored in this context.

Despite these minor caveats, the paper makes a significant contribution to the literature on nonparametric testing in general (non-Euclidean) domains. The authors' insights and the proposed tests could have important implications for a wide range of applications where comparing and analyzing complex data sets is a crucial task.

Conclusion

This paper presents an important advance in the field of nonparametric two-sample testing on general (non-Euclidean) domains. The authors show that the popular MMD two-sample test is not optimal and propose a modified version that is proven to be minimax optimal. They also introduce an adaptive variant of their test that is nearly as good as the best possible test.

The numerical experiments demonstrate the superior performance of the proposed tests, suggesting that they could be valuable tools for comparing and analyzing complex data sets in a wide range of applications. This work contributes to a better understanding of the optimal testing procedures in the RKHS embedding framework and could inspire further research in this direction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

➖

Spectral Regularized Kernel Two-Sample Tests

Omar Hagrass, Bharath K. Sriperumbudur, Bing Li

Over the last decade, an approach that has gained a lot of popularity to tackle nonparametric testing problems on general (i.e., non-Euclidean) domains is based on the notion of reproducing kernel Hilbert space (RKHS) embedding of probability distributions. The main goal of our work is to understand the optimality of two-sample tests constructed based on this approach. First, we show the popular MMD (maximum mean discrepancy) two-sample test to be not optimal in terms of the separation boundary measured in Hellinger distance. Second, we propose a modification to the MMD test based on spectral regularization by taking into account the covariance information (which is not captured by the MMD test) and prove the proposed test to be minimax optimal with a smaller separation boundary than that achieved by the MMD test. Third, we propose an adaptive version of the above test which involves a data-driven strategy to choose the regularization parameter and show the adaptive test to be almost minimax optimal up to a logarithmic factor. Moreover, our results hold for the permutation variant of the test where the test threshold is chosen elegantly through the permutation of the samples. Through numerical experiments on synthetic and real data, we demonstrate the superior performance of the proposed test in comparison to the MMD test and other popular tests in the literature.

5/3/2024

🧪

Computational-Statistical Trade-off in Kernel Two-Sample Testing with Random Fourier Features

Ikjun Choi, Ilmun Kim

Recent years have seen a surge in methods for two-sample testing, among which the Maximum Mean Discrepancy (MMD) test has emerged as an effective tool for handling complex and high-dimensional data. Despite its success and widespread adoption, the primary limitation of the MMD test has been its quadratic-time complexity, which poses challenges for large-scale analysis. While various approaches have been proposed to expedite the procedure, it has been unclear whether it is possible to attain the same power guarantee as the MMD test at sub-quadratic time cost. To fill this gap, we revisit the approximated MMD test using random Fourier features, and investigate its computational-statistical trade-off. We start by revealing that the approximated MMD test is pointwise consistent in power only when the number of random features approaches infinity. We then consider the uniform power of the test and study the time-power trade-off under the minimax testing framework. Our result shows that, by carefully choosing the number of random features, it is possible to attain the same minimax separation rates as the MMD test within sub-quadratic time. We demonstrate this point under different distributional assumptions such as densities in a Sobolev ball. Our theoretical findings are corroborated by simulation studies.

7/15/2024

Robust Kernel Hypothesis Testing under Data Corruption

Antonin Schrab, Ilmun Kim

We propose two general methods for constructing robust permutation tests under data corruption. The proposed tests effectively control the non-asymptotic type I error under data corruption, and we prove their consistency in power under minimal conditions. This contributes to the practical deployment of hypothesis tests for real-world applications with potential adversarial attacks. One of our methods inherently ensures differential privacy, further broadening its applicability to private data analysis. For the two-sample and independence settings, we show that our kernel robust tests are minimax optimal, in the sense that they are guaranteed to be non-asymptotically powerful against alternatives uniformly separated from the null in the kernel MMD and HSIC metrics at some optimal rate (tight with matching lower bound). Finally, we provide publicly available implementations and empirically illustrate the practicality of our proposed tests.

5/31/2024

Convergence Conditions of Online Regularized Statistical Learning in Reproducing Kernel Hilbert Space With Non-Stationary Data

Xiwei Zhang, Tao Li

We study the convergence of recursive regularized learning algorithms in the reproducing kernel Hilbert space (RKHS) with dependent and non-stationary online data streams. Firstly, we study the mean square asymptotic stability of a class of random difference equations in RKHS, whose non-homogeneous terms are martingale difference sequences dependent on the homogeneous ones. Secondly, we introduce the concept of random Tikhonov regularization path, and show that if the regularization path is slowly time-varying in some sense, then the output of the algorithm is consistent with the regularization path in mean square. Furthermore, if the data streams also satisfy the RKHS persistence of excitation condition, i.e. there exists a fixed length of time period, such that the conditional expectation of the operators induced by the input data accumulated over every time period has a uniformly strictly positive compact lower bound in the sense of the operator order with respect to time, then the output of the algorithm is consistent with the unknown function in mean square. Finally, for the case with independent and non-identically distributed data streams, the algorithm achieves the mean square consistency provided the marginal probability measures induced by the input data are slowly time-varying and the average measure over each fixed-length time period has a uniformly strictly positive lower bound.

6/11/2024