
Final Progress Reports: Northeastern University: Data Management and Analysis Core

Superfund Research Program

Data Management and Analysis Core

Project Leader: David Kaeli
Co-Investigators: Akram N. Alshawabkeh, Jennifer Dy, Justin Manjourides, Bhramar Mukherjee (University of Michigan)
Grant Number: P42ES017198
Funding Period: 2010-2025
View this project in the NIH Research Portfolio Online Reporting Tools (RePORT)


Final Progress Reports

Year:   2019  2013 

Studies and Results

The major activities in the past year have included developing an online data dictionary for PROTECT. The goal of this work is to provide researchers both inside and outside of PROTECT an efficient method to learn about the variety of data stored in the database system. The Data Management and Modeling Core has continued to support the projects and cores, building an integrated, reliable and secure system to house all Center data. The Core team has also developed an inventory utility that identifies missing fields and input files on an individual participant basis. This has proven a very useful tool when identifying incomplete data.
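The per-participant inventory idea can be illustrated with a short Python sketch. This is not the Core's utility: the field names, the participant IDs, and the rule that an empty string counts as missing are all assumptions made for the example.

```python
from typing import Dict, List, Optional

# Hypothetical required fields; the real PROTECT data dictionary is not shown here.
REQUIRED_FIELDS = ["consent_date", "urine_sample_id", "visit1_form"]

Record = Dict[str, Optional[str]]


def missing_fields(record: Record, required: List[str]) -> List[str]:
    """Return the required fields that are absent or empty in one participant record."""
    return [name for name in required if record.get(name) in (None, "")]


def inventory(records: Dict[str, Record], required: List[str]) -> Dict[str, List[str]]:
    """Map each participant ID to its missing fields, omitting complete participants."""
    report: Dict[str, List[str]] = {}
    for participant_id, record in records.items():
        gaps = missing_fields(record, required)
        if gaps:
            report[participant_id] = gaps
    return report


if __name__ == "__main__":
    # Made-up records, for illustration only.
    participants = {
        "P001": {"consent_date": "2013-02-01", "urine_sample_id": "U-17", "visit1_form": "done"},
        "P002": {"consent_date": "2013-03-15", "urine_sample_id": ""},
    }
    for participant_id, gaps in inventory(participants, REQUIRED_FIELDS).items():
        print(participant_id, "is missing:", ", ".join(gaps))
```

Reporting only the participants with gaps keeps the output short enough to act on when most records are complete.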

In 2013, the Core database system grew to 3 million records, increasing in size by 1 million records in a single year. This rate of growth is enabled through automation of the data entry process, which has been streamlined through the use of EQuIS and the EDD front-end. The human subject data presently includes 3,552 total fields per participant, representing data collected from 14 different forms. The environmental data stored in the system includes information on 1,048 wells (14 of them include water contaminant data), 35 springs (3 of them include water contaminant data), field data (9 wells and 2 springs, sampled twice a year), and tap water data testing 13 contaminants. The targeted biological data includes information on 51 targeted chemicals (8 fields per participant), 19 phthalates and phenols, 18 trace metals and 14 pesticides. All data is cross-indexed, which allows the Core to build rich queries across these three data sources and to address program-level questions.
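A cross-indexed query of this kind can be sketched with Python's built-in sqlite3 module. The table names, column names, units, and values below are hypothetical and chosen only for illustration; the actual EQuIS schema is not described in this report.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with three small, made-up tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE participant (participant_id TEXT PRIMARY KEY, site_id TEXT);
        CREATE TABLE tap_water (site_id TEXT, contaminant TEXT, conc_ug_l REAL);
        CREATE TABLE biomarker (participant_id TEXT, chemical TEXT, conc_ng_ml REAL);
        """
    )
    conn.executemany("INSERT INTO participant VALUES (?, ?)",
                     [("P001", "S1"), ("P002", "S2")])
    conn.executemany("INSERT INTO tap_water VALUES (?, ?, ?)",
                     [("S1", "TCE", 1.2), ("S2", "TCE", 0.0)])
    conn.executemany("INSERT INTO biomarker VALUES (?, ?, ?)",
                     [("P001", "MEP", 35.0), ("P002", "MEP", 12.5)])
    return conn


def exposure_and_biomarker(conn: sqlite3.Connection, contaminant: str,
                           chemical: str) -> List[Tuple[str, float, float]]:
    """Join human subject, environmental, and biological tables on shared keys."""
    query = """
        SELECT p.participant_id, w.conc_ug_l, b.conc_ng_ml
        FROM participant AS p
        JOIN tap_water AS w ON w.site_id = p.site_id
        JOIN biomarker AS b ON b.participant_id = p.participant_id
        WHERE w.contaminant = ? AND b.chemical = ?
        ORDER BY p.participant_id
    """
    return conn.execute(query, (contaminant, chemical)).fetchall()


if __name__ == "__main__":
    for row in exposure_and_biomarker(build_demo_db(), "TCE", "MEP"):
        print(row)
```

The shared keys (here `participant_id` and `site_id`) play the role of the cross-index: once every source carries them, a single query can relate a participant's tap water to that participant's biological measurements.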

In 2014, the Core team is focusing its efforts on expanding the GIS capabilities of the EQuIS back-end system. This will give Core users the ability to quickly visualize their data in a map-based format to help identify patterns and indicators. Progress on the project aims is documented next. The Core has built a customized user interface to allow for the efficient cleaning and entry of the human subject data being generated. The Human Subject and Sampling Core is using the REDCap system for its initial data entry and cleaning. The Data Management and Modeling Core team has set up a cloud-storage Dropbox system to allow for fast and secure transmission of data from all projects. The team has also implemented data cleaning in EQuIS using EDDs, which provides a second level of cleaning and consistency checking and is based solely on the Data Dictionary. This activity is complete. Data import from the projects continues on a weekly basis.
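Cleaning that is driven solely by a data dictionary can be sketched as follows. This is a generic Python illustration, not the EQuIS EDD mechanism: the dictionary entries, field names, and limits are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical data dictionary: each field lists its type and optional constraints.
DATA_DICTIONARY: Dict[str, Dict[str, Any]] = {
    "participant_id": {"type": str, "required": True},
    "age_years": {"type": int, "required": True, "min": 18, "max": 40},
    "smoker": {"type": str, "required": False, "allowed": {"yes", "no"}},
}


def validate_row(row: Dict[str, Any], dictionary: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of problems found in one row; an empty list means the row is clean."""
    errors: List[str] = []
    for name, rule in dictionary.items():
        value = row.get(name)
        if value is None:
            if rule.get("required"):
                errors.append(f"{name}: missing required value")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{name}: below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{name}: above maximum {rule['max']}")
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{name}: value not in allowed set")
    # Fields the dictionary does not define are flagged rather than silently kept.
    for name in row:
        if name not in dictionary:
            errors.append(f"{name}: not defined in data dictionary")
    return errors


if __name__ == "__main__":
    print(validate_row({"participant_id": "P001", "age_years": 52, "height": 160},
                       DATA_DICTIONARY))
```

Because every rule lives in the dictionary, tightening a constraint means editing one entry rather than changing the validation code.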

The Core has built a customized user interface to allow for the efficient entry and access of the environmental data being used in this project. The Core team has worked closely with the environmental data collection team. The Data Dictionary for that project has been updated and is complete. The Core is developing an integrated database management system that provides a repository in which engineers, scientists and medical doctors can perform relational queries across both environmental/geophysical data and human subject biomedical data. The database system is online, and work continues to integrate EQuIS so that users will have a seamless online system. The main focus will be to provide easy-to-understand exports and GIS reports. This activity will continue to evolve with the needs of the projects. A set of standardized graphical reports that produce layered maps using GIS mapping software is handled by the EQuIS back-end software.

The Core began to provide data mining algorithms integrated into the database user interface to perform clustering analysis and pattern recognition. The Core team has begun experiments utilizing MapReduce and Mahout and has demonstrated the value of these tools to find patterns in their volumes of data. This activity will continue.
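As an illustration of clustering expressed in a map and reduce style, the following Python sketch runs k-means on a handful of made-up points. It is a single-machine toy and does not use the Mahout or Hadoop APIs; the choice of k-means, the data, and the starting centroids are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Return the index of the centroid closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )


def kmeans_step(points: Sequence[Point], centroids: Sequence[Point]) -> List[Point]:
    """One k-means iteration: a map phase assigns points, a reduce phase averages them."""
    # Map phase: key each point by the index of its nearest centroid.
    groups: Dict[int, List[Point]] = defaultdict(list)
    for point in points:
        groups[nearest(point, centroids)].append(point)
    # Reduce phase: average the members of each cluster, coordinate by coordinate.
    updated: List[Point] = []
    for index, centroid in enumerate(centroids):
        members = groups.get(index)
        if not members:
            updated.append(tuple(centroid))  # keep a centroid that attracted no points
        else:
            updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return updated


def kmeans(points: Sequence[Point], centroids: Sequence[Point], max_iter: int = 20) -> List[Point]:
    """Repeat kmeans_step until the centroids stop moving or max_iter is reached."""
    current = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        updated = kmeans_step(points, current)
        if updated == current:
            break
        current = updated
    return current


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 10.0)]))
```

The assignment step is independent for each point and the averaging step is independent for each cluster, which is why this kind of algorithm distributes naturally over a MapReduce framework.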

Significance

The Data Management and Modeling Core is critical to the integration of biomedical and environmental data. The data sets contributed by the PROTECT projects have created a knowledge base that will support the project- and program-level objectives, as well as provide the community with information about their environment. The Core's progress is significant for bringing disciplines together to address the program objectives. The Core also led the submission of a new proposal in response to the NIH Big Data to Knowledge (BD2K) call; the proposal focuses on distributed data sharing and includes PROTECT as a testbed.
