Superfund Research Program

Machine Learning Creates More Complete Picture of Groundwater Contamination

Release Date: 04/02/2025

Highlights

  • Two machine learning algorithms were tested for predicting missing data points in groundwater datasets.
  • The algorithms accurately predicted missing data points for Arizona and North Carolina groundwater databases.
  • Results suggest that pollution may be present in more groundwater sources than sampling data shows.

Research Summary

Machine learning algorithms can fill gaps in sparse or incomplete groundwater datasets, according to researchers partially funded by the NIEHS Superfund Research Program (SRP). The study tested the ability of two algorithms to help scientists analyze co-occurring pollutants in groundwater by filling in missing field data points. It was led by Paul Westerhoff, Ph.D., a professor at Arizona State University and a researcher at the Harvard University SRP Center, and Yaroslava Yingling, Ph.D., professor at North Carolina State University.

Groundwater can become contaminated with multiple, or co-occurring, pollutants, which can interact to create more severe health effects than one contaminant alone. Additionally, contaminants may require different water treatment methods or may even make certain water treatment methods less effective.

Testing groundwater for co-occurring pollutants can be time-consuming and costly. Further, existing data may not include all pollutants of interest. As a result, historic water quality databases, which often contain data gathered over decades, can be sparse and inconsistent.

“Incomplete groundwater datasets can make it difficult for scientists and state agencies to strategically use resources to determine which locations need more intensive monitoring and prioritize sampling efforts,” said Westerhoff.

“We wanted to see if machine learning algorithms can accurately predict missing values to give us a better understanding of groundwater contamination in an area, particularly if co-occurring contaminants are present,” said Yingling.

Filling in the Gaps

The team tested the AMELIA and MICE machine learning algorithms, which are designed to predict missing data points and have previously been used to accurately generate large datasets.

The algorithms were used separately to process incomplete groundwater sampling databases from Arizona and North Carolina. AMELIA and MICE each generated 10 predicted datasets for each state. The researchers tested the validity of the predicted data by calculating consistency across the 10 sets. They also assessed accuracy by comparing the median values of the predicted data with those of the groundwater sampling data.
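
The validation steps described above can be sketched in code. This is a simplified illustration under stated assumptions, not the study's implementation. It does not call AMELIA or MICE; a basic stochastic regression imputation stands in for them. The readings, the names analyte_a and analyte_b, and all helper functions are hypothetical.

```python
import random
import statistics

# Hypothetical readings: (analyte_a, analyte_b); None marks a missing value.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.2),
        (5.0, None), (6.0, 11.8), (7.0, 14.1), (8.0, None)]

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x, plus residual spread."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return intercept, slope, statistics.stdev(residuals)

def impute_once(rows, rng):
    """Fill each missing analyte_b with a regression prediction plus noise."""
    complete = [(a, b) for a, b in rows if b is not None]
    intercept, slope, resid_sd = fit_line(
        [a for a, _ in complete], [b for _, b in complete]
    )
    filled = []
    for a, b in rows:
        if b is None:
            # Random noise keeps the imputed sets from being identical;
            # concentrations are clipped at zero.
            b = max(0.0, intercept + slope * a + rng.gauss(0.0, resid_sd))
        filled.append((a, b))
    return filled

def multiple_imputation(rows, n_sets=10, seed=0):
    """Generate n_sets completed copies of the data."""
    rng = random.Random(seed)
    return [impute_once(rows, rng) for _ in range(n_sets)]

datasets = multiple_imputation(rows)

# Consistency check: how much does the median vary across the 10 sets?
medians = [statistics.median([b for _, b in d]) for d in datasets]
spread = statistics.pstdev(medians)

# Accuracy check: compare against the median of the observed values only.
observed_median = statistics.median([b for _, b in rows if b is not None])

print("median of each imputed set:", [round(m, 2) for m in medians])
print("spread across sets:", round(spread, 3))
print("observed median:", observed_median)
```

A small spread across the imputed sets suggests the predictions are consistent, and imputed medians close to the observed median suggest the filled-in values are plausible. Methods such as MICE extend this idea to many variables at once by cycling through them in turn.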

Uncovering Unknown Pollution

The researchers used machine learning algorithms to predict missing data points in incomplete groundwater data, which helps create a more complete picture of groundwater contamination.

The researchers found that both AMELIA and MICE generated data that were accurate to within a 5%-10% significance level. Furthermore, while the incomplete data showed that up to 80% of sampling locations had no pollutants or co-occurring pollutants above regulatory limits, the data from AMELIA and MICE predicted that only 15%-55% of locations had no pollutants above regulatory limits. According to the scientists, the predicted data indicate that more locations have co-occurring pollutants than the sampling data alone showed. This suggests groundwater remediation methods should focus on mixtures of pollutants, the authors note.
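
One way to see why filling gaps can lower the share of apparently clean locations is that a pollutant that was never measured cannot register as an exceedance. The sketch below illustrates this with made-up numbers. The limits, the analyte names, and both small datasets are hypothetical and are not taken from the study.

```python
# Hypothetical regulatory limits for two analytes (arbitrary units).
limits = {"analyte_a": 10.0, "analyte_b": 4.0}

# Hypothetical sampling records; None means the analyte was not measured.
sampled = [
    {"analyte_a": 3.0, "analyte_b": None},
    {"analyte_a": None, "analyte_b": 1.5},
    {"analyte_a": 12.0, "analyte_b": None},
    {"analyte_a": 2.0, "analyte_b": None},
    {"analyte_a": None, "analyte_b": 2.0},
]

# The same locations after a hypothetical imputation filled every gap.
imputed = [
    {"analyte_a": 3.0, "analyte_b": 5.2},
    {"analyte_a": 4.0, "analyte_b": 1.5},
    {"analyte_a": 12.0, "analyte_b": 6.1},
    {"analyte_a": 2.0, "analyte_b": 1.0},
    {"analyte_a": 11.0, "analyte_b": 2.0},
]

def exceedances(sample, limits):
    """Names of analytes whose measured value is above the limit."""
    return [
        name
        for name, value in sample.items()
        if value is not None and value > limits[name]
    ]

def share_without_exceedance(locations, limits):
    """Fraction of locations with no analyte above its limit."""
    clean = sum(1 for s in locations if not exceedances(s, limits))
    return clean / len(locations)

def share_co_occurring(locations, limits):
    """Fraction of locations with two or more analytes above their limits."""
    multi = sum(1 for s in locations if len(exceedances(s, limits)) >= 2)
    return multi / len(locations)

print("clean share, sampled:", share_without_exceedance(sampled, limits))
print("clean share, imputed:", share_without_exceedance(imputed, limits))
print("co-occurring share, sampled:", share_co_occurring(sampled, limits))
print("co-occurring share, imputed:", share_co_occurring(imputed, limits))
```

In this toy example, the incomplete records make most locations look clean and show no co-occurrence at all, simply because no location had both analytes measured.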

Impact Statement

“Machine learning algorithms like AMELIA and MICE can help enhance groundwater data and provide a more comprehensive understanding of groundwater contamination in areas that are often data-sparse, such as rural communities,” said Yingling.

“This knowledge can help state agencies identify high-risk regions to prioritize sampling,” said Westerhoff. “In places like rural Arizona that haven’t been monitored intensively, these tools can help identify where interventions are needed to reduce people’s exposure to groundwater pollutants.”

For More Information Contact:

Paul Westerhoff
Arizona State University
Mail code 3005
PO Box 873005
Tempe, Arizona 85287-3005
Phone: 480-965-2885
Email: p.westerhoff@asu.edu

Yaroslava Yingling
North Carolina State University
Email: yara_yingling@ncsu.edu

To learn more about this research, please refer to the following sources:

  • Mahmood AU, Islam M, Gulyuk AV, Briese EA, Velasco CA, Malu M, Sharma N, Spanias A, Yingling Y, Westerhoff P. 2024. Multiple Data Imputation Methods Advance Risk Analysis and Treatability of Co-occurring Inorganic Chemicals in Groundwater. Environ Sci Technol 58(46):20513-20524. doi:10.1021/acs.est.4c05203 PMID:39509340 PMCID:PMC11580165

To receive monthly mailings of the Research Briefs, send your email address to srpinfo@niehs.nih.gov.