Cluster - MATLAB Version

Here you will find my MATLAB re-implementation of Prof. Bouman's Cluster program, which was originally implemented in C. Given a set of multidimensional training vectors, the program models the data as a Gaussian mixture distribution, estimates the order of the mixture using the minimum description length (MDL) criterion, and estimates the parameters of the Gaussian mixture using the expectation-maximization (EM) algorithm. For the theory and a full description of the algorithm, please refer to Prof. Bouman's in-depth documentation of the C-version Cluster program.
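
As a rough orientation only, the model described above can be written out as follows. The symbols (N training vectors y_n of dimension M, K clusters with weights \pi_k, means \mu_k and covariances R_k) are notation chosen here for illustration, and the penalty term is the form I understand the C-version documentation to use; that documentation remains the reference for the exact criterion.

```latex
% Gaussian mixture density of one training vector y_n
p(y_n \mid K, \theta) = \sum_{k=1}^{K} \pi_k \,
  \frac{1}{(2\pi)^{M/2} \lvert R_k \rvert^{1/2}}
  \exp\!\left( -\tfrac{1}{2} (y_n - \mu_k)^{T} R_k^{-1} (y_n - \mu_k) \right)

% Log likelihood of all N training vectors (assumed independent)
\log p(y \mid K, \theta) = \sum_{n=1}^{N} \log p(y_n \mid K, \theta)

% Number of free parameters for K full-covariance clusters in M dimensions
L = K \left( 1 + M + \frac{M(M+1)}{2} \right) - 1

% MDL criterion (assumed form): fit term plus a penalty on model order
\mathrm{MDL}(K, \theta) = -\log p(y \mid K, \theta) + \tfrac{1}{2} L \log(N M)
```

EM maximizes the log likelihood for a fixed K, and the mixture order is the K for which the MDL value is smallest.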

To install, unzip the distribution file to a directory and add that directory to the MATLAB search path. The main program is GaussianMixture.m, and its usage is available through the online help, i.e. by typing help GaussianMixture at the MATLAB prompt. Other utilities include GMClassLikelihood.m, which calculates the log likelihood of data given a particular Gaussian mixture, and SplitClasses.m, which parallels the utility of the same name in the C-version.
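
The installation steps above might look like this at the MATLAB prompt. The directory name cluster_matlab is a placeholder, not the name used by the distribution; substitute the directory you actually unzipped into.

```matlab
% Hypothetical install location; replace with the directory you unzipped into.
clusterDir = fullfile(pwd, 'cluster_matlab');

addpath(clusterDir);      % add the directory to the search path for this session
% savepath;               % optional: keep the path across MATLAB sessions

help GaussianMixture      % print the usage notes for the main program
```

The same help command works for GMClassLikelihood and SplitClasses once the directory is on the path.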

Also included with the distribution are demonstrations in the directories example1, example2 and example3. They correspond to the same demonstrations in the C-version. Each directory contains the script rundemo.m, which executes the demonstration. For details of these demonstrations, please refer to the C-version documentation and the inline comments in rundemo.m.

I put some effort into optimizing the performance of the program, mainly by vectorizing most of the operations. Tests on my Pentium 4 1.8 GHz machine with 512 MB of RAM show that the program finishes reasonably fast for up to two hundred thousand (0.2 million) 2-dimensional training vectors. Beyond that, the C-version is significantly faster. This version is therefore not meant for serious applications or very large-scale experiments, but it is very convenient for testing new ideas in the MATLAB environment.
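
To illustrate what vectorizing means here, the sketch below computes the Gaussian quadratic form for every training vector against one cluster, first with a loop and then in a single vectorized expression. It is a generic example, not code taken from GaussianMixture.m; the data, the variable names and the cluster parameters are all made up for illustration.

```matlab
% Toy data (illustrative only): N training vectors of dimension M, one per row.
N  = 1000;
M  = 2;
X  = randn(N, M);
mu = [0.5, -0.2];               % 1-by-M cluster mean
R  = [1.0, 0.3; 0.3, 2.0];      % M-by-M cluster covariance (positive definite)

% Loop version: one quadratic form (x - mu) * inv(R) * (x - mu)' per vector.
d_loop = zeros(N, 1);
for n = 1:N
    v = X(n, :) - mu;
    d_loop(n) = (v / R) * v';
end

% Vectorized version: all N quadratic forms at once, no explicit loop.
Xc    = bsxfun(@minus, X, mu);  % subtract the mean from every row
d_vec = sum((Xc / R) .* Xc, 2); % row-wise (x - mu) * inv(R) * (x - mu)'

% Gaussian log density of every training vector under this cluster.
logp = -0.5 * d_vec - 0.5 * M * log(2 * pi) - 0.5 * log(det(R));

% The two versions should agree up to rounding error.
maxDiff = max(abs(d_loop - d_vec));
```

In MATLAB the vectorized form is typically much faster for large N because the work is done inside built-in matrix routines rather than in the interpreter loop.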

The implementation follows the C counterpart very closely, except that:

  • It does not support the diag option of the C-version;
  • It accepts the training data set of a single class at a time.

I also performed several tests to verify the results of the MATLAB version, and the result of one of these tests can be downloaded below.

  • - The MATLAB version of the Cluster program. Installation requires unzipping the distribution file to a directory and adding that directory to the MATLAB search path.
  • verify.pdf - The result of the verification test.
  • data - Test data used in the verification test.