Publication Detail

Title: H-CLAP: hierarchical clustering within a linear array with an application in genetics.

Authors: Ghosh, Samiran; Townsend, Jeffrey P

Published In: Stat Appl Genet Mol Biol (2015 Apr)

Abstract: In most cases where clustering of data is desirable, the underlying data distribution to be clustered is unconstrained. However, clustering of site types in a discretely structured linear array, as is often desired in studies of linear sequences such as DNA, RNA or proteins, represents a problem where data points are not necessarily exchangeable and are directionally constrained within the array. Each position in the linear array is fixed, and could be either "marked" (i.e., of interest such as polymorphic or substitute sites) or "non-marked." Here we describe a method for clustering of those marked sites. Since the cluster-generating process is constrained by discrete locality inside such an array, traditional clustering methods need adjustment to be appropriate. We develop a hierarchical Bayesian approach. We adopt a Markov clustering algorithm, revealing any natural partitioning in the pattern of marked sites. The resulting recursive partitioning and clustering algorithm is named hierarchical clustering in a linear array (H-CLAP). It employs domain-specific directional constraints directly in the likelihood construction. Our method, being fully Bayesian, is more flexible in cluster discovery compared to a standard agglomerative hierarchical clustering algorithm. It not only provides hierarchical clustering, but also cluster boundaries, which may have their own biological significance. We have tested the efficacy of our method on data sets, including two biological and several simulated ones.

PubMed ID: 25803088

MeSH Terms: Algorithms; Bayes Theorem; Cluster Analysis*; Computational Biology/methods; Gene Expression Profiling/methods; Genetics; Oligonucleotide Array Sequence Analysis/methods*
