Example File: CLUSTERS.XLS
Solution Type: Continuous Chromosomes
The workbook CLUSTERS.XLS contains two examples in different worksheets that illustrate how GeneHunter can be utilized for clustering data. The first example in the sheet 1 CLUSTER shows how to find the cluster center of a set of data. The second example, 3 CLUSTERS. is an example of clustering data into separate clusters. This workbook is located in C:\GENEHUNTER\EXAMPLES\EXCEL if you chose the default directory during installation. (If you install the entire AI Trilogy set of programs, the GeneHunter folder is a subfolder of the C:\AI Trilogy folder.)
In 1 CLUSTER we have formulated the problem of finding the cluster center of data as an optimization problem suitable for GeneHunter solution. The data we are going to cluster is two dimensional, that is there are two features (components) of each point (vector) in the data. We have chosen two dimensional data because it can be easily plotted on a scatter chart. Otherwise, there is no reason why larger dimensions cannot be clustered.
Figure 5.17 The data and cluster center for the 1Cluster example.
The two dimensional data to be clustered is in the first two columns of the worksheet. The two components are called feature 1 and feature 2. The idea is to find a new point in this same two dimensional space whose distances from all the points already in the data is a minimum. In other words, we want to minimize the sum of the distances by moving this new point around. This new point will be the cluster center.
There will be two chromosomes corresponding to the x and y (feature 1 and feature 2) coordinates of the cluster center. You will find them in cells I9:J9.
In order to form a fitness function we will need to know the distance from every point in the space to the (proposed) cluster center. These distances are calculated in column D. The sum of these distances is in cell G9, and this is the fitness function which we want to minimize. That's all there is to it! You can watch GeneHunter move the cluster center around in the scatter plot, but watch closely because it doesn't take long.
This example is a little more complicated because we are clustering the data into three separate clusters, and finding the cluster centers of these clusters at the same time. This is an example of self classification of data, because the clusters are really classes of the data. New data might be classified by determining to which cluster center the new data is closest. This might be a very useful classification algorithm.
Figure 5.18 The data and cluster centers for the 3 Clusters example.
We have again used two dimensional data, so we can do scatter plots. This time there are 6 chromosomes for the cluster centers, because there are three cluster centers we want to find. These chromosomes are in cells N4:O6.
We can't just sum the distances like we did in 1 CLUSTER because we have three cluster centers, and we want the distances from the closest cluster center. So first we have to find out which cluster center is closest to the point in question. That is the purpose of columns D, E, and F. Then in column H we can calculate which distance is the minimum, showing us the closest cluster center to the point. The sum of these minimum distances is located in cell O10, and this is the fitness function.
That might be the end if we had not decided to add a twist to this problem. We would like to make clusters (classes) that have an equal number of points in each class, or as close as possible. This is not always an appropriate thing to do, especially if you are building a classifier as previously discussed. However, depending upon the problem it may be desired, and in our case it gives us a chance to illustrate the use of dual fitness functions.
We decided to build a second fitness function to force the equalization of the three classes. There are several ways this can be done, but we have chosen to minimize the standard deviation of the numbers in each class. If the standard deviation becomes zero, all classes will have an equal number. If it is high, then there will be large inequality.
First we have to compute the number in each class, and these numbers are in cells R4:R6. The standard deviation is computed in R8 using the Excel standard function stdev. Then all we had to do was build a second fitness function where we specified that R8 was to be a minimum.