Superfund Research Program
June 2022
NIEHS Superfund Research Program (SRP) grantees developed publicly available courses to help their trainees and the broader environmental health sciences research community build data science skills.
Data science is a rapidly growing field focused on building the processes, tools, and techniques needed to organize, manage, merge, and analyze large datasets. With these tools, researchers can bring clarity to complex questions about how the environment affects human health.
Sharing and integrating data is not without challenges. For example, researchers often struggle because data — as well as key information describing the data — are collected, processed, and stored differently across disciplines.
To address these challenges, the SRP courses and workshops help environmental health researchers analyze and manage large datasets to accelerate scientific discoveries.
Tackling Big Data
As an example, the Texas A&M University (TAMU) SRP Center hosted a free interactive seminar series called “Big Data in Environmental Science and Toxicology.”
More than 280 participants in nine countries learned from experts in government, industry, and academia about data science, data sharing, and placing research into real-life contexts. Topics included navigating statistical computing programs, using online resources to predict the toxicity of chemicals, applying methods for chemical safety assessments, and using the U.S. Environmental Protection Agency’s Computational Toxicology Chemicals Dashboard to evaluate the safety of chemicals. Recordings of all six sessions are now available to the public.
Building on the success of this series, the organizers collaborated with the TAMU Institute of Data Science to offer a year-long data science training program for graduate students in the biomedical sciences.
Data Courses for Trainees
At the University of North Carolina at Chapel Hill (UNC) SRP Center, Julia Rager, Ph.D., and collaborators launched the inTelligence And Machine lEarning (TAME) toolkit to promote trainee-driven data science and introduce computational methods that can help researchers more efficiently extract information from complex datasets.
The toolkit contains training modules that are organized into three chapters:
- Introductory data science.
- Chemical-biological analyses and predictive modeling.
- Environmental health database mining.
Rager collaborated with David Reif, Ph.D., from the North Carolina State University SRP Center to develop a module on machine learning and predictive modeling. Other modules include an introduction to coding in R (a programming language for statistical computing and graphics), data management best practices, modeling how chemicals enter and move through the body, and environmental health databases.
Rager tested the toolkit with students enrolled in a graduate-level computational toxicology course at UNC. According to Rager, course evaluations showed improvements in students’ understanding of, and comfort with, complex chemical toxicity and exposure data, analysis techniques, and coding.
One of the modules, on high-dimensional data visualization, was also offered as a highly successful two-hour online training workshop led by UNC graduate students.
Promoting Data Interoperability
Another training series, developed by the University of Louisville SRP Center, focused on best practices to enable data interoperability — creating datasets that can be easily used by other researchers and platforms — and accelerate the impact of environmental health research.
The series included speakers from academia and Microsoft, covering a range of topics, including:
- How a large technology company approaches research.
- Data sharing, management, harmonization, and interoperability.
- Big data.
The last session featured a panel discussion on next steps for data sharing, with representatives from Microsoft and from SRP centers at the University of Louisville, University of Kentucky, and University of Alabama at Birmingham.