Publication Detail

Title: Sparse principal component analysis for identifying ancestry-informative markers in genome-wide association studies.

Authors: Lee, Seokho; Epstein, Michael P; Duncan, Richard; Lin, Xihong

Published In: Genet Epidemiol (2012 May)

Abstract: Genome-wide association studies (GWAS) routinely apply principal component analysis (PCA) to infer population structure within a sample to correct for confounding due to ancestry. GWAS implementation of PCA uses tens of thousands of single-nucleotide polymorphisms (SNPs) to infer structure, despite the fact that only a small fraction of such SNPs provides useful information on ancestry. The identification of this reduced set of ancestry-informative markers (AIMs) from a GWAS has practical value; for example, researchers can genotype the AIM set to correct for potential confounding due to ancestry in follow-up studies that utilize custom SNP or sequencing technology. We propose a novel technique to identify AIMs from genome-wide SNP data using sparse PCA. The procedure uses penalized regression methods to identify those SNPs in a genome-wide panel that significantly contribute to the principal components while encouraging SNPs that provide negligible loadings to vanish from the analysis. We found that sparse PCA leads to negligible loss of ancestry information compared to traditional PCA of genome-wide SNP data. We further demonstrate the value of sparse PCA for AIM selection using real data from the International HapMap Project and a genome-wide study of inflammatory bowel disease. We have implemented our approach in open-source R software for public use.
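
To illustrate the idea in the abstract (penalized regression that lets negligible SNP loadings vanish from a principal component), here is a minimal Python sketch. It is not the authors' R implementation, and the abstract does not specify their exact algorithm; the sketch swaps in a generic rank-one sparse PCA that alternates between computing scores and soft-thresholding (L1-penalizing) the loadings. The function names, the penalty value `lam=5.0`, and the simulated two-population genotypes are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(x, lam):
    """Lasso shrinkage: pull each entry toward zero by lam, zeroing small ones."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)


def sparse_first_pc(genotypes, lam, n_iter=100, tol=1e-8):
    """Rank-one sparse PCA via alternating L1-penalized regression (illustrative).

    Returns (loadings, scores). A loading of exactly zero marks a SNP that
    has been dropped from the component.
    """
    X = genotypes - genotypes.mean(axis=0)        # center each SNP column
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, 0]                              # ordinary PC1 scores (unit norm)
    loadings = s[0] * Vt[0]                       # ordinary PC1 loadings (dense)
    for _ in range(n_iter):
        # Regress every SNP on the current scores, then shrink the loadings.
        new_loadings = soft_threshold(X.T @ scores, lam)
        if not new_loadings.any():                # penalty too strong: nothing left
            return new_loadings, scores
        new_scores = X @ new_loadings
        new_scores /= np.linalg.norm(new_scores)
        converged = np.linalg.norm(new_loadings - loadings) < tol
        loadings, scores = new_loadings, new_scores
        if converged:
            break
    return loadings, scores


# Simulated toy data (an assumption, not data from the paper): two populations
# that differ in allele frequency at only the first 10 of 200 SNPs.
rng = np.random.default_rng(0)
n_per_pop, n_snps, n_informative = 100, 200, 10
freq_a = np.full(n_snps, 0.5)
freq_b = np.full(n_snps, 0.5)
freq_a[:n_informative] = 0.1
freq_b[:n_informative] = 0.9
pop_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))   # genotypes coded 0/1/2
pop_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
genotypes = np.vstack([pop_a, pop_b]).astype(float)

loadings, scores = sparse_first_pc(genotypes, lam=5.0)
selected = np.flatnonzero(loadings)               # candidate ancestry-informative SNPs
print("SNPs with non-zero loading:", selected)
```

In this toy setting the SNPs that keep a non-zero loading play the role of the AIM set, and the scores computed from them alone can be compared with ordinary PC1 scores to check how much ancestry information is retained. In practice the penalty would need to be tuned rather than fixed by hand.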

PubMed ID: 22508067

MeSH Terms: Algorithms; Chromosome Mapping/methods; Computational Biology/methods; Genetic Variation; Genetics, Population; Genome-Wide Association Study*; HapMap Project; Humans; Inflammatory Bowel Diseases/genetics; Models, Statistical; Polymorphism, Single Nucleotide; Population Groups/genetics*; Principal Component Analysis*; Regression Analysis; Software
