Superfund Research Program
Quantitative Biology: Biostatistics, Bioinformatics, and Computation
Project Leader: Mark J. van der Laan
Co-Investigator: Alan E. Hubbard
Grant Number: P42ES004705
Funding Period: 2006-2017
This Core's leaders are developing innovative methods in a variety of contexts to improve the use, analysis, and interpretation of data. They have made important contributions to the development of methods essential to the use of the great volumes of data produced by new methods in genomics and other “omics” areas. Newer methods are contributing to important new results in Superfund Research at Berkeley.
Overall goals
The overall goal of this Core is to develop better statistical methods suitable for the kinds of data emerging from the more sophisticated “omics” research methods now integral to the Superfund Research Program. These methods are already producing new insights not achievable with older approaches.
Important advances
- Developing more powerful statistical methods that are adapted to the data set under review.
- Devising means to integrate higher-order knowledge with data sets resulting from high-throughput methods.
Accomplishments for the last year
The Core's collaborative work this year concentrated on bioinformatic methods that incorporate higher-order information into the analysis of data produced by high-throughput “omics” assays. In practice, this means developing methods that bring existing knowledge about biological pathways related to disease, exposure, and other biological processes into the analysis of high-throughput omic and exposure data.
This work has been facilitated by the newest member of Core D, Reuben Thomas, a post-doctoral researcher from NIEHS. He has helped develop statistical methods for examining the correspondence between gene expression data and both exposure and existing hypothesized pathways for the relevant biological processes.
Investigators continue to refine their “semiparametric” approach, meaning methods that adapt to the data actually produced rather than relying solely on a model of how the data are assumed to behave. They are using this approach to look for associations between sets of characteristics (such as gene expression, proteomics, and methylation) and exposures to environmental contaminants, estimating each characteristic's independent association with exposure in the presence of confounding variables.
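As a rough illustration of estimating an exposure association in the presence of a confounder, the sketch below compares exposed and unexposed samples within each confounder stratum and then averages the within-stratum differences. It is a deliberately simple stratified estimator, not the Core's semiparametric method; the function name `adjusted_difference`, the record layout, and the toy values are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def adjusted_difference(records):
    """Stratum-weighted difference in mean expression, exposed minus unexposed.

    records: iterable of (exposed, stratum, expression) tuples, where
    `stratum` is the level of a single categorical confounder.
    (Hypothetical layout chosen for this sketch.)
    """
    by_stratum = defaultdict(lambda: {True: [], False: []})
    for exposed, stratum, value in records:
        by_stratum[stratum][bool(exposed)].append(value)

    weighted_sum = 0.0
    total_n = 0
    for groups in by_stratum.values():
        # A stratum contributes only if it holds both exposed and unexposed samples.
        if not groups[True] or not groups[False]:
            continue
        n = len(groups[True]) + len(groups[False])
        weighted_sum += n * (mean(groups[True]) - mean(groups[False]))
        total_n += n

    if total_n == 0:
        raise ValueError("no stratum contains both exposed and unexposed samples")
    return weighted_sum / total_n


# Toy data (invented): smoking status acts as the confounder.
toy_records = [
    (True, "smoker", 3.0), (True, "smoker", 3.0), (False, "smoker", 2.0),
    (True, "non-smoker", 1.0), (False, "non-smoker", 0.0),
    (False, "non-smoker", 0.0), (False, "non-smoker", 0.0),
]
print(adjusted_difference(toy_records))
```

In the toy data the crude exposed-versus-unexposed difference is inflated because exposed samples are mostly smokers; comparing within strata removes that imbalance. Methods that adapt to the data go well beyond this, for example by handling many confounders at once.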
Having reached a relatively satisfactory set of methodologies, along with corresponding code for quick implementation, the investigators have shifted their focus towards refining methods that superimpose these results onto the accumulating knowledge base on biological pathways. Specifically, in the context of the investigators’ study of occupational benzene exposure and genomics, they used a method known as “structurally enhanced pathway enrichment analysis” (Thomas et al. 2009). This method incorporates information about the genome and biological pathways. It uses manually drawn pathway maps representing current knowledge of the molecular interaction and reaction networks involved in cellular processes such as metabolism and the cell cycle.
This procedure revealed highly significant (p < 0.001) impacts of relatively high occupational benzene exposure on several pathways. (These pathways, assessed at the level of the transcriptome, are the toll-like receptor signaling pathway, oxidative phosphorylation, B cell receptor signaling pathway, apoptosis, acute myeloid leukemia, and T cell receptor signaling.)
Perhaps even more important, the combination of statistical methodology for selecting differentially expressed genes and the use of these statistical tests for highlighting “significant” pathways found the same pathways.
The investigators also examined dose-specific pathways and found that some are impacted only among very highly exposed workers. (These include, for instance, gene expression in the nucleosome assembly and ABC transporter pathways.)
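To make the idea of pathway-level testing concrete, the sketch below runs a plain over-representation test: given a list of differentially expressed genes, it asks whether a pathway contains more of them than chance alone would predict, using a hypergeometric tail probability. This is a simpler stand-in and not the structurally enhanced method of Thomas et al. 2009, which additionally draws on the structure of the pathway maps. The function names and the gene and pathway labels are hypothetical placeholders.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K lie in the pathway."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def pathway_enrichment(selected, pathways, universe):
    """Return {pathway name: over-representation p-value}.

    selected: differentially expressed genes
    pathways: mapping from pathway name to its member genes
    universe: all genes measured on the assay
    """
    universe = set(universe)
    selected = set(selected) & universe
    p_values = {}
    for name, members in pathways.items():
        members = set(members) & universe
        overlap = len(selected & members)
        p_values[name] = hypergeom_upper_tail(
            overlap, len(members), len(selected), len(universe)
        )
    return p_values


# Hypothetical assay of 20 genes with two invented pathways.
universe = [f"g{i}" for i in range(20)]
pathways = {
    "pathway_A": ["g0", "g1", "g2", "g3", "g4"],
    "pathway_B": ["g15", "g16", "g17", "g18", "g19"],
}
selected = ["g0", "g1", "g2", "g3", "g10"]

for name, p in pathway_enrichment(selected, pathways, universe).items():
    print(name, round(p, 6))
```

Here four of the five selected genes fall in `pathway_A`, so its p-value is small, while `pathway_B` shares no selected genes and gets a p-value of 1. A real analysis would also correct for testing many pathways at once.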
What investigators plan to do next
The bottom line for the computational core is that the investigators now have in place an analysis stream that both finds individual culprits via rigorous statistical estimation and inference and finds higher-level patterns via methodology designed to identify significantly affected biological pathways, including disease pathways.