H-Clust – Tag SNP Selection



H-clust is a simple clustering method that rapidly identifies a set of tag SNPs from genotype data. It does not require haplotype estimation. H-clust consists of two stages. The first stage uses hierarchical clustering to group the SNPs into clusters. The second stage chooses one tag SNP per cluster by finding the SNP most correlated with all the other SNPs in that cluster. Optionally, the quality of each SNP can be included in the analysis, in which case both quality and correlation affect the choice of tag SNPs.

The input for H-clust is a genotype matrix that uses 0, 1, and 2 to denote the number of copies of a particular allele. H-clust computes a similarity matrix based on Pearson's correlation between allele counts. The distance between two SNPs is one minus their squared correlation. By default, H-clust uses the "complete linkage" method.

Hierarchical clustering can be represented as a dendrogram in which any two SNPs diverge at a height equal to their distance. Clusters are obtained by declaring SNPs to be in the same cluster when they converge below a certain cutoff height. In the H-clust program, this cutoff is 1 - hcbound, where hcbound is set by the user. (This is slightly different in the stepwise version, see below.)

The second stage of H-clust finds a tag SNP to represent each cluster. This is done by scoring each SNP based on squared correlation and quality. If multiple SNPs score equally, the one in the middle is chosen as the tag SNP.
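
The two stages described above can be sketched in Python with NumPy and SciPy. This is an illustrative sketch under stated assumptions, not the H-clust program itself. The function name `hclust_tags`, the `quality` argument, and the scoring rule (mean squared correlation with the cluster's members, multiplied by quality) are hypothetical, since the description says only that the score is based on squared correlation and quality. The sketch also assumes at least two SNPs, no monomorphic SNPs (a constant column has no defined correlation), and that "the one in the middle" means the middle entry of the tied SNPs in column order.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def hclust_tags(genotypes, hcbound=0.8, quality=None):
    """Cluster SNPs and pick one tag per cluster.

    genotypes: (individuals x SNPs) array of 0/1/2 allele counts.
    hcbound:   user bound; the dendrogram is cut at height 1 - hcbound.
    quality:   optional per-SNP weights (assumed form, defaults to all ones).
    """
    g = np.asarray(genotypes, dtype=float)
    n_snps = g.shape[1]

    # Stage 1: distance between two SNPs is 1 - r^2 (Pearson r on allele counts).
    r2 = np.corrcoef(g, rowvar=False) ** 2
    dist = np.clip(1.0 - r2, 0.0, None)  # guard against tiny negative round-off
    np.fill_diagonal(dist, 0.0)

    # Complete-linkage tree, cut at height 1 - hcbound.
    tree = linkage(squareform(dist, checks=False), method="complete")
    labels = fcluster(tree, t=1.0 - hcbound, criterion="distance")

    # Stage 2: score each SNP within its cluster (assumed scoring rule).
    q = np.ones(n_snps) if quality is None else np.asarray(quality, dtype=float)
    tags = {}
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        score = r2[np.ix_(members, members)].mean(axis=1) * q[members]
        tied = members[np.isclose(score, score.max())]
        tags[int(c)] = int(tied[len(tied) // 2])  # middle SNP among ties
    return labels, tags


if __name__ == "__main__":
    a = np.array([0, 1, 2, 0, 1, 2])
    b = np.array([0, 0, 0, 2, 2, 2])
    # Columns 0-2 are perfectly correlated with each other, as are columns 3-4.
    geno = np.column_stack([a, 2 - a, a, b, 2 - b])
    labels, tags = hclust_tags(geno, hcbound=0.8)
    print(labels, tags)
```

Cluster labels are arbitrary integers assigned by SciPy, so only the grouping and the chosen column indices are meaningful. How SNPs that merge exactly at the cutoff height are handled is a property of `fcluster`, not something the description above specifies.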


The Devlin lab








Rinaldo, Bacanu, Devlin, Sonpar, Wasserman and Roeder.
Characterization of Multilocus Linkage Disequilibrium
Genet Epidemiol. 2005 Apr;28(3):193-206.
