Superfund Research Program
Optimizing Sampling and Statistical Analysis for Hazardous Waste Site Assessment
- Project Summary
This project is designed to provide statistical and analysis tools to improve the accuracy and reliability of site and exposure assessment for Superfund hazardous waste sites. The approach taken by Brent Coull and his research team builds on the basic spatial kriging model, combined with optimal design considerations that maximize prediction accuracy while minimizing cost and accounting for practical constraints. The project will clarify under what circumstances a spatial model-based approach provides real benefits over methods that do not account for spatial correlation.
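The basic spatial kriging model referenced above can be illustrated with a minimal ordinary-kriging sketch. This is an illustration only: the exponential covariance function, its parameters, and the toy data are assumptions, not taken from the project.

```python
import numpy as np

def exp_cov(d, sill=1.0, range_=2.0):
    """Exponential covariance function (an assumed, illustrative choice)."""
    return sill * np.exp(-d / range_)

def ordinary_kriging(coords, values, target):
    """Predict the value at `target` as a weighted sum of observations,
    with weights solving the ordinary-kriging system (unbiasedness
    enforced via a Lagrange multiplier)."""
    n = len(values)
    # Pairwise distances among sampled locations, and to the target.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    d0 = np.linalg.norm(coords - target, axis=-1)
    # Augmented system: [C 1; 1' 0] [w; mu] = [c0; 1]
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = exp_cov(d)
    A[n, n] = 0.0
    b = np.append(exp_cov(d0), 1.0)
    w = np.linalg.solve(A, b)[:n]
    return float(w @ values)

coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([1.0, 2.0, 3.0, 4.0])
# Prediction at the center of the unit square; by symmetry the kriging
# weights are all 1/4, so the prediction is the sample mean, 2.5.
pred = ordinary_kriging(coords, values, np.array([0.5, 0.5]))
```

Optimal-design questions then ask where to place additional samples so that predictions like `pred` have the smallest possible error for a given sampling budget.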
The researchers will develop spatial measurement error models for relating environmental concentrations to exposure as measured by biomarkers, thereby accounting for incomplete environmental sampling. They will also develop software tools that enable broad use of the methods by the EPA and site professionals.
Studies and Results
Seasonal modulation smoothing mixed models for time series forecasting: The research team developed methods for monthly time series collected from monitoring stations. For most environmental monitoring data, it is important to decompose the series into trend and seasonality. The researchers consider smooth modulation models using penalized splines to simultaneously estimate both temporal trends and seasonal effects. One characteristic of data collected from monitoring stations is the presence of missing observations (due to equipment failure, maintenance, and the like). The researchers' approach allows not only automatic interpolation of missing values but also extrapolation, or forecasting, of future observations. A manuscript (Lee D-J and Durbán M, 2015) describing the methods has been submitted for publication.
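The trend-plus-seasonality decomposition with automatic interpolation of gaps can be sketched as follows. The actual models use penalized B-spline modulation; this sketch substitutes the closely related Whittaker smoother (a discrete penalized-spline special case) for the trend, plus effect-coded monthly terms, with synthetic data, gap positions, and penalty value chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 120                                    # ten years of monthly data
t = np.arange(n)
month = t % 12
truth = 0.05 * t + 2.0 * np.sin(2 * np.pi * month / 12)
y = truth + rng.normal(0, 0.1, n)
y[[10, 11, 50, 51, 52, 90]] = np.nan       # simulated gaps (equipment failure)

obs = ~np.isnan(y)
W = np.diag(obs.astype(float))             # weight 0 for missing months
y0 = np.where(obs, y, 0.0)

# Effect-coded monthly seasonal terms (11 columns; December rows are -1,
# so the seasonal effects sum to zero and are identified from the trend).
Z = np.zeros((n, 11))
for m in range(11):
    Z[month == m, m] = 1.0
Z[month == 11, :] = -1.0

# Whittaker smoother: trend penalized by squared second differences.
D = np.diff(np.eye(n), 2, axis=0)
lam = 10.0
X = np.hstack([np.eye(n), Z])
P = np.zeros((n + 11, n + 11))
P[:n, :n] = lam * D.T @ D
beta = np.linalg.solve(X.T @ W @ X + P, X.T @ W @ y0)
fitted = X @ beta        # trend + seasonal at every month, gaps included
err = np.abs(fitted - truth)
```

Because the penalty defines the trend at every time point, the fit automatically fills the missing months, which is the interpolation property highlighted above.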
Comparative study of low-rank geoadditive models under different sampling schemes: Spatial low-rank regression models have recently been proposed for modeling spatial data, offering a computationally efficient alternative to standard geostatistical models. Under some sampling schemes, however, low-rank approximations may suffer limitations similar to those of kriging models. Low-rank kriging models involve selecting a reduced number of knots, locating them via space-filling algorithms, and defining a covariance matrix to account for the spatial correlation. This knot selection procedure may not perform well when sampling is preferential. The researchers performed a simulation study to compare the performance of these methods. As an alternative, tensor products of B-spline models with equally spaced knots are shown to perform well across different spatial designs when spatial prediction is the main goal.
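The tensor-product alternative can be sketched directly: build a cubic B-spline basis with equally spaced knots in each coordinate, form the row-wise tensor product, and fit by penalized least squares. The ridge penalty below is a simplified stand-in for the difference penalties used in P-spline tensor products, and the test surface, knot counts, and penalty value are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import splev

def bspline_basis(x, low, high, nseg, k=3):
    """Cubic B-spline basis with equally spaced knots covering [low, high]."""
    h = (high - low) / nseg
    knots = np.linspace(low - k * h, high + k * h, nseg + 2 * k + 1)
    nb = len(knots) - k - 1
    B = np.empty((len(x), nb))
    for j in range(nb):
        c = np.zeros(nb)
        c[j] = 1.0
        B[:, j] = splev(x, (knots, c, k))  # evaluate the j-th basis function
    return B

rng = np.random.default_rng(1)
n = 400
x, y = rng.random(n), rng.random(n)        # scattered (non-gridded) locations
truth = np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)
z = truth + rng.normal(0, 0.05, n)

Bx, By = bspline_basis(x, 0, 1, 10), bspline_basis(y, 0, 1, 10)
# Row-wise tensor product: one column per pair of marginal basis functions.
B = (Bx[:, :, None] * By[:, None, :]).reshape(n, -1)

lam = 1e-3                                 # simple ridge penalty (illustrative)
coef = np.linalg.solve(B.T @ B + lam * np.eye(B.shape[1]), B.T @ z)
rmse = float(np.sqrt(np.mean((B @ coef - truth) ** 2)))
```

Because the knots are fixed on a regular grid rather than chosen from the sample locations, this basis is unaffected by where the sampling happens to concentrate, which is the property that matters under preferential sampling.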
Statistical analysis strategies that integrate multiple data types: In the last year the researchers published a paper (Nikolov et al., 2014) describing computationally efficient strategies for integrating spatio-temporal data from multiple data sources. Briefly, this work considered a Bayesian hierarchical framework in which a joint model consists of a set of submodels, one for each data source, and an integrative model for the latent process that serves to relate the submodels to one another. When a submodel depends on the latent process nonlinearly, inference using standard MCMC techniques can be computationally prohibitive. The researchers used this model to address a temporal change-of-support problem in which interest focuses on pooling daily and multiday black carbon readings in order to maximize the spatial coverage of the study region.
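The actual model is a Bayesian hierarchical one fit by MCMC; as a much-simplified illustration of the change-of-support idea, the sketch below pools one multiday (weekly) reading with daily readings of the same latent week by inverse-variance weighting. All error variances are assumed known here, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(10.0, 2.0, 7)          # latent daily black carbon levels
sd_daily, sd_weekly = 1.0, 0.5        # assumed instrument error SDs
y_daily = z + rng.normal(0, sd_daily, 7)        # daily-scale readings
w_weekly = z.mean() + rng.normal(0, sd_weekly)  # one multiday reading

# Two estimates of the weekly mean, each with its own error variance.
est1, var1 = y_daily.mean(), sd_daily**2 / 7    # average of daily readings
est2, var2 = w_weekly, sd_weekly**2             # direct multiday reading

# Inverse-variance (precision) weighting pools the two support scales.
w1, w2 = 1 / var1, 1 / var2
pooled = (w1 * est1 + w2 * est2) / (w1 + w2)
pooled_var = 1 / (w1 + w2)
```

The pooled estimate always has smaller variance than either source alone, which is why combining the two monitor types extends usable spatial coverage without discarding the coarser readings.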
Correlated measurement error models for spatial health effects analyses: Coull and a former graduate student of the Program, Stacey Ackerman-Alexeeff, submitted a paper proposing a spatial SIMEX approach to adjusting for measurement error due to spatially misaligned exposure and biologic data (Alexeeff, Carroll and Coull, 2014). In a second paper, Alexeeff et al. (2014) used high-resolution satellite data on ambient particle levels to assess the consequences of kriging and land use regression for PM2.5 predictions in epidemiologic analyses. This approach is a critical building block for the spatial prediction modeling of ambient metal concentrations the researchers propose to undertake in Mexico City as part of their competitive renewal of the Optimizing Sampling and Statistical Analysis for Hazardous Waste Site Assessment Project.
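SIMEX works by deliberately adding extra measurement error at increasing multiples of the assumed error variance, tracking how the naive estimate degrades, and extrapolating that relationship back to the no-error case. A minimal non-spatial sketch (classical additive error in a simple linear regression; the spatial version in the paper is considerably more involved, and all parameter values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta, sd_u = 2000, 2.0, np.sqrt(0.5)
x = rng.normal(0, 1, n)                    # true exposure
y = beta * x + rng.normal(0, 0.5, n)       # health outcome
w = x + rng.normal(0, sd_u, n)             # error-prone measurement of x

def ols_slope(a, b):
    return float(np.cov(a, b)[0, 1] / np.var(a, ddof=1))

naive = ols_slope(w, y)                    # attenuated toward zero

# Simulation step: add extra error with variance lam * sd_u**2 and refit,
# averaging over replicates at each lam.
lams = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
means = []
for lam in lams:
    reps = [ols_slope(w + rng.normal(0, np.sqrt(lam) * sd_u, n), y)
            for _ in range(50)]
    means.append(np.mean(reps))

# Extrapolation step: fit a quadratic in lam and evaluate at lam = -1,
# the hypothetical error-free case.
simex = float(np.polyval(np.polyfit(lams, means, 2), -1.0))
```

Here the naive slope is attenuated to roughly beta times the reliability ratio, and the SIMEX extrapolation recovers most of the lost signal.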
Multiple Superfund co-investigators (Coull, Christiani, Wright, Mazumdar, Claus Henn, Valeri) introduced Bayesian kernel machine regression (BKMR) as a new approach to study mixtures, in which the health outcome is regressed on a flexible function of the mixture components (e.g., air pollution or toxic waste) that is specified using a kernel function. In high-dimensional settings, a novel hierarchical variable selection approach is incorporated to identify important mixture components and account for the correlated structure of the mixture. Simulation studies demonstrate the success of BKMR in estimating the exposure-response function and in identifying the individual components of the mixture responsible for health effects. The researchers used the approach to analyze Superfund data on the association between neurodevelopmental test scores and cord blood concentrations of lead, arsenic, and manganese in Bangladesh. This paper is in press in the journal Biostatistics.
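The kernel-machine core of this idea can be sketched with plain (non-Bayesian) kernel ridge regression using a Gaussian kernel; BKMR adds the Bayesian hierarchy and component-wise variable selection on top of this. The simulated mixture, kernel scale, and ridge penalty below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
Z = rng.normal(0, 1, (n, 3))              # three mixture components
h = 0.5 * Z[:, 0] ** 2 + Z[:, 1]          # nonlinear exposure-response
                                          # (third component is inert)
y = h + rng.normal(0, 0.3, n)

def gauss_kernel(A, B, rho=2.0):
    """Gaussian kernel between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / rho)

K = gauss_kernel(Z, Z)
lam = 0.1
alpha = np.linalg.solve(K + lam * np.eye(n), y)   # kernel ridge weights
h_hat = K @ alpha                                 # estimated h at the data
rmse = float(np.sqrt(np.mean((h_hat - h) ** 2)))
```

Regressing the outcome on a kernel-defined function of all components jointly, rather than on each pollutant separately, is what lets the method capture nonlinearity and interactions within the mixture.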
Characterizing exceedances for remediation purposes: Coull collaborated with doctoral student Mark Meyer and colleague Jeffrey Morris on a project that compared different methods for identifying hot spots on a two-dimensional spatial surface using Bayesian false discovery and other multiple comparison adjustments. This paper is currently in press at Biometrics, and presents important preliminary data for the Harvard SRP center renewal.
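The paper's adjustments are Bayesian; as a simple frequentist stand-in for the same multiple-comparison problem, the sketch below flags hot-spot cells on a simulated grid of z-scores using the Benjamini-Hochberg false discovery rate procedure. The grid size, hot-spot location, and effect size are all illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
grid = rng.normal(0, 1, (20, 20))          # background z-scores
grid[5:10, 5:10] += 4.0                    # a 5 x 5 contaminated "hot spot"

# One-sided p-values for exceedance, then Benjamini-Hochberg at level q.
p = norm.sf(grid).ravel()
m, q = p.size, 0.05
order = np.argsort(p)
thresh = q * np.arange(1, m + 1) / m
below = p[order] <= thresh
k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
flagged = np.zeros(m, bool)
flagged[order[:k]] = True                  # step-up rule: reject smallest k
flagged = flagged.reshape(20, 20)

hot_found = int(flagged[5:10, 5:10].sum())       # true hot cells flagged
false_flags = int(flagged.sum() - hot_found)     # background cells flagged
```

Controlling the false discovery rate rather than testing each cell at a fixed level keeps the number of spuriously flagged background cells low while still recovering most of the genuinely elevated region, which is the practical requirement for prioritizing remediation.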
Software development: The research team has developed R scripts for running the analyses proposed by Bobb et al. (2014), and is currently building an R package that will make the methods widely available.
The factor driving the computational complexity of the proposed approaches is the number of observations in the analysis, as the kernel matrix is N × N (where N is the number of observations). In this work the researchers were able to fit the small data set from the toxicology study easily, and were able to apply the model to the larger MOBILIZE study, although with current computing resources this took a couple of days. Computations based on the currently developed model-fitting algorithms are therefore intractable for large cohorts or large time series studies involving tens to hundreds of thousands of observations. The development of computationally fast methods for big datasets is an area the researchers are actively pursuing, and the resulting algorithms will be incorporated into the R package as they become available.
While this project focuses on soil and sediment, as most closely related to the goals of the Superfund program, methods for selecting spatial sampling sites for other media could benefit from insight on how to place samples and conduct analyses. For example, air pollution considered at scales of tens to hundreds of kilometers may be comparable to soil data at scales of meters or tens of meters, provided one adjusts the scale of analysis. This general applicability extends to methods that account for targeted samples; for example, many EPA air pollution monitors have been placed in areas of high concentration to monitor peak levels, which could distort standard spatial statistical analyses.