Data Clustering

Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure. Besides the term data clustering, there are a number of terms with similar meanings, including cluster analysis, automatic classification, numerical taxonomy, botryology and typological analysis. Data clustering algorithms can be hierarchical or partitional. With hierarchical algorithms, successive clusters are found using previously established clusters, whereas partitional algorithms determine all clusters in one go. Hierarchical algorithms can be agglomerative (bottom-up) or divisive (top-down). Agglomerative algorithms begin with each element as a separate cluster and merge them in successively larger clusters. The divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

Hierarchical clustering

Introduction

Hierarchical clustering builds a hierarchy of clusters. The traditional representation of this hierarchy is a tree, with individual elements at one end and a single cluster with every element at the other. Agglomerative algorithms begin at the top of the tree, whereas divisive algorithms begin at the bottom. (In the figure, the arrows indicate an agglomerative clustering.) Traditional representation Cutting the tree at a given height will give a clustering at a selected precision. In the above example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, but with fewer clusters.

Agglomerative hierarchical clustering

This method builds the hierarchy from the individual elements by progressively merging clusters. Again, we have six elements {a} {b} {c} {d} {e} and {f}. the first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, therefore we must define a distance d(\mathrm{element}_1,\mathrm{element}_2) between elements. Suppose we have merged the two closest elements b and c, we now have the following clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. But to do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters. Usually the distance between two clusters \mathcal{A} and \mathcal{B} is one of the following:
  • the maximum distance between elements of each cluster (also called complete linkage clustering) \max_{x \in \mathcal{A},\, y \in \mathcal{B}} d(x,y)
  • the minimum distance between elements of each cluster (also called single linkage clustering) \min_{x \in \mathcal{A},\, y \in \mathcal{B}} d(x,y)
  • the mean distance between elements of each cluster (also called average linkage clustering) {1 \over {\mathrm{card}(\mathcal{A})\mathrm{card}(\mathcal{B})}}\sum_{x \in \mathcal{A}}\sum_{ y \in \mathcal{B}} d(x,y)
  • the sum of all intra cluster variance
  • the increase in variance for the cluster being merged (Ward's criterion)
Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).

Partitional clustering

k-means and derivatives

k-means clustering

The k-means algorithm assigns each point to the cluster whose center (or centroid) is nearest. The centroid is the point generated by computing the arithmetic mean for each dimension separately for all the points in the cluster.
Example: The data set has three dimensions and the cluster has two points: X = (x1, y1, z1) and Y = (x2, y2, z2). Then the centroid Z becomes Z = (x3, y3, z3), where x3 = (x1 + x2)/2 and y3 = (y1 + y2)/2 and z3 = (z1 + z2)/2
This is the basic structure of the algorithm (J. MacQueen, 1967):
  • Randomly generate k clusters and determine the cluster centers or directly generate k seed points as cluster centers
  • Assign each point to the nearest cluster center.
  • Recompute the new cluster centers.
  • Repeat until some convergence criterion is met (usually that the assignment hasn't changed).
The main advantages of this algorithm are its simplicity and speed, which allows it to run on large datasets. Yet it does not systematically yield the same result with each run of the algorithm. Rather, the resulting clusters depend on the initial assignments. The k-means algorithm maximizes inter-cluster (or minimizes intra-cluster) variance, but does not ensure that the solution given is not a local minimum of variance.

Fuzzy c-means clustering

One of the problems of the k-means algorithm is that it gives a hard partitioning of the data, that is to say that each point is attributed to one and only one cluster. But points on the edge of the cluster, or near another cluster may not be as much in the cluster as point in the center of cluster. Therefore, in fuzzy clustering, each point does not pertain to a given cluster, but has a degree of belonging to a certain cluster, as in fuzzy logic. For each point x we have a coefficient giving the degree of being in the kth cluster u_k(x). Usually, the sum of those coefficients has to be one, so that u_k(x) denotes a probability of belonging to a certain cluster:
\forall x \sum_{k=1}^{\mathrm{num.}\ \mathrm{clusters}} u_k(x) \ =1.
With fuzzy c-means, the centroid of a cluster is computed as being the mean of all points, weighted by their degree of belonging to the cluster, that is
   
\mathrm{center}_k = .
The degree of being in a certain cluster is the inverse of the distance to the cluster
u_k(x) = {1 \over d(\mathrm{center}_k,x)},
then the coefficients are normalized and fuzzyfied with a real parameter m>1 so that their sum is 1. So u_k(x) = \frac{1}{\sum_j \left(\frac{d(\mathrm{center}_k,x)}{d(\mathrm{center}_j,x)}\right)^{\frac{1}{m-1}}} The fuzzy c-means algorithm is greatly similar to the k-means algorithm; when m\leftarrow 1, algorithms are equivalent:
  • Choose a number of clusters
  • Assign randomly to each point coefficients for being in the clusters
  • Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than \epsilon, the given sensitivity threshold) :
    • Compute the centroid for each cluster, using the formula above
    • For each point, compute its coefficients of being in the clusters, using the formula above
The fuzzy c-means algorithm minimizes intra-cluster variance as well, but has the same problems as k-means, the minimum is local minimum, and the results depend on the initial choice of weights.

Applications

In biology has two main applications in the fields of computational biology and bioinformatics.

References

  • Jain, Murty and Flynn: Data Clustering: A Review, ACM Comp. Surv, 1999. Available here
  • for another presentation of hierarchical, k-means and fuzzy c-means see this introduction to clustering. Also has an explanation on mixture of gaussians.

See also

k-means, ANN, Self-organizing map, other clustering and mixture model links.

 

<< PreviousWord BrowserNext >>
ship of fools (film)
ayttm
andrew kirkpatrick
modular curve
hackney marshes
jocelyn
jacksboro
places inhabited by rusyns
jacksonport
diagonal lashing
manuel jos de arriaga
kneiphof
victoria ribeiro
tripod lashing
shoreditch station
jamesport
luis guzmn
rusyns
arawhata river
dubai duty free classic (snooker)
michael hardie boys
ladek zdrj
feudal law
lewis nash
hatfield swamp
clustering (demographics)
rudolph schoenheimer
dominum directum
an evening with kevin smith
british film institute
pather panchali
nulle terre sans seigneur
disraeli gears
capillary wave
david rittenberg
patronus charm
brian moore
transatlantic flight
uss macon
sweet baby james
rockaway, new york
curves in differential geometry
usa 3000 airlines
dominum utile