H-clust is a simple clustering method that can be used to rapidly identify a set of tag SNPs from genotype data. The method does not require haplotype estimation. H-clust consists of two stages. The first stage uses hierarchical clustering to group the SNPs into clusters. In the second stage, a tag SNP is chosen for each cluster by finding the SNP most correlated with all the other SNPs in that cluster. Optionally, the quality of each SNP can be included in the analysis; in this case, both quality and correlation affect the choice of tag SNPs. The input for H-clust is a genotype matrix that uses 0, 1, 2 to denote the number of copies of a particular allele. H-clust then computes a similarity matrix based on Pearson’s correlation between allele counts. The distance between two SNPs is one minus the squared correlation. By default, H-clust uses the “complete linkage” method. Hierarchical clustering can be represented as a dendrogram in which any two SNPs diverge at a height equal to their distance. The clusters are obtained by declaring SNPs to be in the same cluster when they converge below a certain cutoff value. In the H-clust program, this cutoff is 1 - hcbound, where hcbound is set by the user. (This is slightly different in the stepwise version; see below.) The second stage of H-clust finds a tag SNP to represent each cluster. This is done by scoring each SNP based on squared correlation and quality. If multiple SNPs are scored equally, then the one in the middle is chosen as the tag SNP.
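The two stages described above can be illustrated with a minimal sketch in plain Python. It is not the H-clust program itself, and several details are assumptions made for illustration: each row of the input is taken to be one SNP, a SNP with no variation is treated as uncorrelated with everything, clusters are merged when their complete-linkage distance is at most 1 - hcbound, the score of a SNP is its mean squared correlation with the other SNPs in its cluster (the optional quality term is left out because its form is not given here), and ties are broken by taking the middle SNP among those tied. All function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two vectors of allele counts (0, 1, 2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a SNP without variation is uncorrelated
    return sxy / sqrt(sxx * syy)


def r2_matrix(snps):
    """Squared-correlation (similarity) matrix; rows of `snps` are SNPs."""
    m = len(snps)
    return [[pearson(snps[i], snps[j]) ** 2 for j in range(m)] for i in range(m)]


def complete_linkage_clusters(r2, hcbound):
    """Stage 1: agglomerative clustering with distance 1 - r^2."""
    cutoff = 1.0 - hcbound
    clusters = [[i] for i in range(len(r2))]

    def dist(a, b):
        # Complete linkage: the largest pairwise distance between members.
        return max(1.0 - r2[i][j] for i in a for j in b)

    while len(clusters) > 1:
        d, p, q = min(
            (dist(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if d > cutoff:  # assumption: merging at exactly the cutoff is allowed
            break
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return clusters


def choose_tag(cluster, r2):
    """Stage 2: pick the SNP with the highest mean r^2 to its cluster mates."""
    if len(cluster) == 1:
        return cluster[0]
    scores = [
        sum(r2[i][j] for j in cluster if j != i) / (len(cluster) - 1)
        for i in cluster
    ]
    best = max(scores)
    tied = [i for i, s in zip(cluster, scores) if best - s < 1e-9]
    return tied[(len(tied) - 1) // 2]  # assumption: middle of the tied SNPs


def hclust_tags(snps, hcbound=0.8):
    """Return a list of (tag SNP index, cluster member indices) pairs."""
    r2 = r2_matrix(snps)
    return [(choose_tag(c, r2), c) for c in complete_linkage_clusters(r2, hcbound)]


# Toy genotype matrix: 5 SNPs (rows) typed in 6 individuals (columns).
snps = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 1, 0, 2, 1, 0],
    [0, 0, 1, 2, 2, 1],
    [0, 0, 1, 2, 2, 2],
]
result = hclust_tags(snps, hcbound=0.8)
```

In this toy data, SNPs 0, 1 and 2 are perfectly correlated (SNP 2 with opposite sign, which the squared correlation ignores) and fall into one cluster, while SNPs 3 and 4 have a squared correlation of about 0.83 and merge only when hcbound is at or below that value. Raising hcbound therefore yields more, tighter clusters and more tag SNPs.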
SNPPicker is a post-processor that optimizes the selection of tag SNPs produced by common bin-tagging programs. SNPPicker uses a multi-step search strategy in combination with a statistical model to produce optimal genotyping panels. SNPPicker’s algorithm is also designed to optimize tag SNP selection for multi-population panels. It accounts for several assay-specific constraints, such as predicted assay failure of SNPs and avoidance of SNPs that are too close to each other. The latter issue affects one third of all SNPs in the 1000 Genomes Project pilot 1 data. SNPPicker automates the design of tag SNP genotyping panels by maximizing the likelihood of successfully genotyping the selected SNPs while minimizing the number of tag SNPs to assay. Genotyping success is a function of two properties: the genotyping probability of a bin (or cluster of bins), statistically derived from the individual genotyping probability of each SNP; and (for some platforms) the proximity distance between SNPs. The genotyping probabilities currently used by SNPPicker are derived from a prospective analysis of the performance of genotyping assays, and the probability model can be updated or changed for other platforms. SNP proximity is a strictly enforced constraint