k均值算法在高维空间的灾难性失效及哈蒂根算法如何避免它 / The Catastrophic Failure of The k-Means Algorithm in High Dimensions, and How Hartigan's Algorithm Avoids It
1️⃣ 一句话总结
这篇论文证明,在高维、高噪声数据中,经典的Lloyd k均值算法会完全失效,几乎任何初始划分都会成为最终结果,而Hartigan的k均值算法则能避免这个问题,从而解释了为何高维数据下k均值效果不佳。
Lloyd's k-means algorithm is one of the most widely used clustering methods. We prove that in high-dimensional, high-noise settings, the algorithm exhibits catastrophic failure: with high probability, essentially every partition of the data is a fixed point. Consequently, Lloyd's algorithm simply returns its initial partition - even when the underlying clusters are trivially recoverable by other methods. In contrast, we prove that Hartigan's k-means algorithm does not exhibit this pathology. Our results show the stark difference between these algorithms and offer a theoretical explanation for the empirical difficulties often observed with k-means in high dimensions.
k均值算法在高维空间的灾难性失效及哈蒂根算法如何避免它 / The Catastrophic Failure of The k-Means Algorithm in High Dimensions, and How Hartigan's Algorithm Avoids It
这篇论文证明,在高维、高噪声数据中,经典的Lloyd k均值算法会完全失效,几乎任何初始划分都会成为最终结果,而Hartigan的k均值算法则能避免这个问题,从而解释了为何高维数据下k均值效果不佳。
源自 arXiv: 2602.09936