Earlier I wrote a post on "k-means + heatmap" for clustering analysis. Recently, to prepare for the "Bioinformatics Tools" meeting, I made a slide deck with more details on clustering analysis. Here it is:
https://docs.google.com/presentation/d/1vMS3VS3zZmT5Mty1-4R0a-1E-lstjdiXiWOeMdQ_O6M/edit
My learning notes for R, Unix, Perl, statistics, tools/resources, biology etc. everything about Bioinformatics
Sunday, April 29, 2012
Monday, October 10, 2011
k-means clustering + heatmap
If you want more info about clustering, I have another post about "Clustering analysis and its implementation in R". Here is the link:
http://onetipperday.blogspot.com/2012/04/clustering-analysis-2.html
Several R functions in this topic:
1. dist(X) -- calculate the distances between the rows of data matrix X. The default distance method is euclidean; other options include maximum, manhattan, canberra, binary and minkowski.
> a=matrix(sample(9),nrow=3)
> a
     [,1] [,2] [,3]
[1,]    5    2    9
[2,]    8    7    1
[3,]    6    4    3
> dist(a, diag=T, method='max')
  1 2 3
1 0
2 8 0
3 6 3 0
> dist(a, diag=T, method='euc')
         1        2        3
1 0.000000
2 9.899495 0.000000
3 6.403124 4.123106 0.000000
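Since sample(9) gives a different matrix on every run, here is a small sketch that types the same matrix in by hand and checks the row 1 vs row 2 distances against the formulas. The object names (d_max, d_euc, d_man, diff12) are only for illustration.

```r
# Same matrix as in the console output above, entered row by row
a <- matrix(c(5, 2, 9,
              8, 7, 1,
              6, 4, 3), nrow = 3, byrow = TRUE)

# dist() returns a compact "dist" object; as.matrix() expands it
# into a full symmetric matrix that is easier to index
d_max <- as.matrix(dist(a, method = "maximum"))
d_euc <- as.matrix(dist(a, method = "euclidean"))
d_man <- as.matrix(dist(a, method = "manhattan"))

# Distance between row 1 and row 2, computed by hand
diff12 <- abs(a[1, ] - a[2, ])   # absolute differences: 3 5 8
max(diff12)                      # maximum distance
sqrt(sum(diff12^2))              # euclidean distance, sqrt(98)
sum(diff12)                      # manhattan distance

# The same three values taken from dist()
d_max[1, 2]
d_euc[1, 2]
d_man[1, 2]
```

The hand-computed values and the entries from dist() should agree for each method.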
2. hclust(D) -- hierarchical clustering of a distance/dissimilarity matrix (e.g. the output of the dist function): at each step it joins the two most similar clusters (how similarity between clusters is measured depends on the linkage method, e.g. complete, average or single) until there is one single cluster.
The result of hclust(D) can be displayed as a tree (dendrogram) using plot(hclust(D)), or plclust(hclust(D)) in older versions of R.
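To see the "join the two closest, repeat" idea at work, here is a minimal sketch on five made-up points on a line, using complete linkage. The data and names (pts, hc) are assumptions for illustration only.

```r
# Five points on a line, so the merge order is easy to follow by eye
pts <- matrix(c(1, 2, 6, 7, 15), ncol = 1,
              dimnames = list(c("a", "b", "c", "d", "e"), "value"))

hc <- hclust(dist(pts), method = "complete")

hc$merge    # which objects/clusters were joined at each step
hc$height   # distance at which each join happened

plot(hc)    # dendrogram: a+b and c+d join first, e joins last
```

With n objects there are always n - 1 joins, and with complete linkage the height of a join is the largest distance between members of the two clusters being merged.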
3. heatmap(X, distfun = dist, hclustfun = hclust, ...) -- display matrix X as a color image, with rows/columns clustered by the given distance and clustering functions.
One enhanced version is heatmap.2 (in the gplots package), which has more options. For example, you can use
- key, symkey etc. for legend,
- "col=heat.colors(16)" or "col='greenred', breaks=16" to specify colors of image
- cellnote (text matrix with same dim), notecex, notecol for text in grid
- colsep/rowsep to define blocks of separation, e.g. colsep=c(1,3,6,8) will display a white separator at columns 1, 3, 6 and 8.
Both heatmap and heatmap.2 accept ColSideColors/RowSideColors, a color vector with one entry per column/row. Here is an example (http://chromium.liacs.nl/R_users/20060207/Renee_graphs_and_others.pdf).
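As a rough sketch of how ColSideColors fits in, the example below uses only the base heatmap function on made-up data (the matrix X, the gene/sample names and the two colors are all assumptions). The same arguments should also be accepted by heatmap.2 from gplots.

```r
set.seed(1)
# Toy data: 10 "genes" (rows) x 6 "samples" (columns);
# the last three samples are shifted upwards for the first 5 genes
X <- matrix(rnorm(60), nrow = 10,
            dimnames = list(paste0("gene", 1:10), paste0("s", 1:6)))
X[1:5, 4:6] <- X[1:5, 4:6] + 3

# One color per column, marking the two sample groups
group_cols <- rep(c("steelblue", "tomato"), each = 3)

hm <- heatmap(X,
              distfun = dist, hclustfun = hclust,  # the defaults, spelled out
              ColSideColors = group_cols,
              col = heat.colors(16),
              scale = "row")

hm$colInd   # column order after clustering
```

The color vector is given in the original column order; heatmap reorders it together with the columns.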
Another enhanced version is pheatmap (from the pheatmap package), which produces pretty heatmaps with additional options:
- cellwidth/cellheight to set the size of cell
- treeheight_row/treeheight_col: height of tree
- annotation: a data.frame in which each column is an annotation of the columns of X, so nrow(annotation)==ncol(X)
- legend/annotation_legend: whether to show legend
- filename: save to file
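A hedged sketch of these options together, on made-up data. It assumes the pheatmap package is installed (the call is skipped otherwise) and uses annotation_col, which is the argument name in recent pheatmap versions for what is called annotation above. The output file name is hypothetical.

```r
set.seed(1)
X <- matrix(rnorm(60), nrow = 10,
            dimnames = list(paste0("gene", 1:10), paste0("s", 1:6)))

# One row of annotation per column of X; row names must match colnames(X)
ann <- data.frame(group = rep(c("control", "treated"), each = 3),
                  row.names = colnames(X))

if (requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(X,
                     cellwidth = 20, cellheight = 12,
                     treeheight_row = 30, treeheight_col = 30,
                     annotation_col = ann,  # `annotation` in older versions
                     legend = TRUE, annotation_legend = TRUE,
                     filename = "pheatmap_example.pdf")  # hypothetical name
}
```

Setting cellwidth/cellheight fixes the cell size in points, which is handy when several heatmaps need to line up in a figure.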
4. kmeans(X, centers=k) -- partition points (actually rows of matrix X) into k clusters. For example:
# a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex=2)
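The object returned by kmeans holds more than the cluster labels. Here is a short sketch of the fields I find most useful, on the same kind of simulated data; the seed and nstart value are arbitrary choices for illustration.

```r
set.seed(42)   # kmeans starts from random centers, so fix the seed
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# nstart = 10: try 10 random starts and keep the best solution
cl <- kmeans(x, centers = 2, nstart = 10)

cl$size            # number of points in each cluster
cl$centers         # one row per cluster center
table(cl$cluster)  # cluster label of every row of x, tabulated

# The total sum of squares splits into within- and between-cluster parts
cl$totss
cl$tot.withinss + cl$betweenss
```

The last two lines should print the same number, which is the decomposition behind the "within groups sum of squares" plot.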
The number of clusters can be chosen from a plot of the within-group sum of squares against the number of clusters (look for the "elbow" where the curve flattens), e.g.
# Determine number of clusters
wss <- (nrow(x)-1)*sum(apply(x,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(x,centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
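The first value of wss may look odd: (nrow(x)-1)*sum(apply(x,2,var)) is simply the total sum of squares, i.e. the within-group sum of squares when there is only one cluster. A small sketch that checks this and repeats the elbow plot with a fixed seed; max_k, nstart and the check via scale() are my own choices here.

```r
set.seed(42)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# k = 1: (n - 1) * sum of column variances = total sum of squares
wss1 <- (nrow(x) - 1) * sum(apply(x, 2, var))
# Same quantity as squared distances of all points to the overall mean
wss1_check <- sum(scale(x, scale = FALSE)^2)

max_k <- 10   # illustrative upper limit
wss <- numeric(max_k)
wss[1] <- wss1
for (k in 2:max_k) {
  # tot.withinss is the sum of the per-cluster withinss values
  wss[k] <- kmeans(x, centers = k, nstart = 10)$tot.withinss
}

plot(1:max_k, wss, type = "b",
     xlab = "Number of clusters", ylab = "Within groups sum of squares")
```

For data simulated from two groups like this, the curve typically drops sharply from 1 to 2 clusters and then levels off.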
hclust together with cutree can also be used to get a fixed number of clusters:
hc <- hclust(dist(x), "ward.D")   # this method was called "ward" before R 3.1.0
plot(hc)   # the plot can also help to decide the # of clusters
memb <- cutree(hc, k = 2)
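To compare the two approaches on the same data, one option is to cross-tabulate the cutree memberships against the kmeans clusters. This is only a sketch: it uses complete linkage instead of Ward so that it runs on any R version, and the seed and nstart are arbitrary.

```r
set.seed(42)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

hc   <- hclust(dist(x), method = "complete")
memb <- cutree(hc, k = 2)                 # hierarchical tree cut into 2 groups
cl   <- kmeans(x, centers = 2, nstart = 10)

# Cluster numbers are arbitrary labels, so look at the pattern of
# counts in the table rather than expecting a filled diagonal
tab <- table(hclust = memb, kmeans = cl$cluster)
tab
```

If both methods find the same two groups, each row of the table has most of its counts in a single column.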
Note: kmeans clusters by partitioning, while hclust uses a hierarchical clustering method. Here is a series of nice lectures for this. More details on clustering can be found here: CRAN Task View: Cluster Analysis