Prof. Mireille Boutin

Real data is often much easier to cluster than synthetically generated data. This phenomenon, first described in an ICIP paper and in the thesis of my student Sangchun Han, occurs because real data tends to have a lot of structure: so much that a projection of the data onto a random line is likely to exhibit a clear binary clustering.

Thus, an easy way to cluster real data is to project it onto a random line and look for clusters in the resulting one-dimensional data. The clusterability of the projected data can be measured with a quantity W that compares the scatter within the classes to the scatter between the classes. Since the projection is random, W is a random variable. We measure the clusterability of the original high-dimensional dataset by looking at the probability distribution of W. The original dataset can be clustered using a hierarchy of random projections (RP1D code). If the dataset is very small, one random projection is used repeatedly (n-TARP code), resulting in several different clustering structures.
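The idea of projecting onto a random line and scoring the resulting one-dimensional data can be illustrated with a minimal sketch. This is not the RP1D or n-TARP code. It assumes that W is the within-cluster sum of squares of the best two-way split of the projected data, divided by the total sum of squares, so that W lies between 0 and 1 and small values indicate a clear binary clustering; the exact definition used in the papers may differ. The function names normalized_withinss and random_projection_w, and the parameters n_trials and seed, are hypothetical.

```python
import numpy as np


def normalized_withinss(values):
    """Best two-way split of 1D data; returns (W, threshold).

    Assumed definition: W = within-cluster sum of squares of the best
    split / total sum of squares. Smaller W means clearer clustering.
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    if n < 2:
        raise ValueError("need at least two points")
    mean = x.mean()
    x = x - mean                      # centering improves numerical stability
    total = np.sum(x ** 2)
    if total == 0.0:                  # all points identical: no structure
        return 1.0, float(mean)
    # Prefix sums give the scatter of every candidate split in O(n).
    csum = np.cumsum(x)
    csum2 = np.cumsum(x ** 2)
    k = np.arange(1, n)               # number of points in the left cluster
    left = csum2[:-1] - csum[:-1] ** 2 / k
    right = (csum2[-1] - csum2[:-1]) - (csum[-1] - csum[:-1]) ** 2 / (n - k)
    within = np.maximum(left + right, 0.0)
    best = int(np.argmin(within))
    threshold = 0.5 * (x[best] + x[best + 1]) + mean
    return float(within[best] / total), float(threshold)


def random_projection_w(data, n_trials=100, seed=0):
    """Sample W over random unit directions (rows of data are points)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    ws = np.empty(n_trials)
    for t in range(n_trials):
        direction = rng.standard_normal(data.shape[1])
        direction /= np.linalg.norm(direction)
        ws[t], _ = normalized_withinss(data @ direction)
    return ws
```

A histogram of the values returned by random_projection_w gives an empirical view of the distribution of W for a dataset: the more mass it has near zero, the more often a random line reveals a clean two-cluster split.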

References:

• S. Han and M. Boutin, “The Hidden Structure of Image Datasets,” IEEE International Conference on Image Processing (ICIP), Quebec City, Canada, September 27–30, 2015.
• T. Yellamraju and M. Boutin, “Clusterability and Clustering of Images and Other ‘Real’ High-Dimensional Data,” IEEE Transactions on Image Processing, Vol. 27, No. 4, April 2018, pp. 1927–1938.
• S. Han, “A Method for Clustering High-Dimensional Data Using 1D Random Projections,” Ph.D. thesis, Purdue University, 2014.

Code:

• RP1D clustering code
• n-TARP clustering code

Clusterability and Clustering of High-dimensional Data