Superfund Research Program
February 2024
By Cassidy Rice
Researchers from the University of Kentucky (UK) SRP Center are using machine learning techniques to help interpret how chemicals are processed, or metabolized, in the body. A series of interconnected processes in the body, known as metabolic pathways, can convert substances into smaller molecules, or metabolites. For certain chemicals, these metabolites can be more toxic than their parent compound.
To understand how these potentially harmful metabolites are created, researchers are mapping them to their metabolic pathways. In a recent publication, data scientists at the UK SRP Center describe the development of a new dataset that can be used in machine learning models to predict the metabolic pathway associations of chemical compounds.
Hunter Moseley, Ph.D., describes below how he and colleagues at the UK SRP Center are applying machine learning technology to their research.
Can you explain your process for developing this new benchmark dataset?
Benchmark datasets are used in machine learning to compare algorithms and ensure that the model represents the real world as reliably as possible.
We developed a new benchmark dataset that classifies, or “maps,” chemical compounds to their metabolic pathways. The dataset was derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG), a collection of databases that link genetic information and biological pathways with chemical substances. Next, we tested the dataset by developing machine learning methods to predict the metabolic pathways of compounds based on their chemical structure.
What was the motivation for this study?
We created the new dataset because a prior benchmark dataset based on KEGG data had duplicate entries, which can confuse the machine learning model and lead to inaccurate results. We currently have a subsequent paper under review that describes the flawed dataset, the affected publications that utilized the flawed dataset, and the catastrophic data leakage caused during machine learning training and testing. Data leakage in machine learning occurs when information from outside the training dataset, such as entries that also appear in the test data, is used to create the model. The model can then appear more accurate in testing than it really is.
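The link between duplicate entries and data leakage can be illustrated with a minimal Python sketch. It is not the tooling the researchers built. The entries, the structure and pathway names, and the helper functions `deduplicate`, `train_test_split`, and `leaked_entries` are all hypothetical placeholders, and the random split is an assumed, simplified evaluation setup.

```python
import random

# Hypothetical toy entries: (structure identifier, pathway label).
# The names are placeholders, not real KEGG records.
entries = [
    ("structure_A", "pathway_1"),
    ("structure_A", "pathway_1"),  # duplicate of the entry above
    ("structure_B", "pathway_2"),
    ("structure_B", "pathway_2"),  # duplicate of the entry above
    ("structure_C", "pathway_1"),
    ("structure_D", "pathway_3"),
    ("structure_E", "pathway_2"),
    ("structure_F", "pathway_3"),
]


def deduplicate(items):
    """Drop repeated entries while keeping first-seen order."""
    return list(dict.fromkeys(items))


def train_test_split(items, test_fraction=0.25, seed=0):
    """Shuffle a copy of the entries and cut it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def leaked_entries(train, test):
    """Return test entries that also appear in the training set."""
    seen = set(train)
    return [entry for entry in test if entry in seen]


if __name__ == "__main__":
    datasets = [("raw", entries), ("deduplicated", deduplicate(entries))]
    for label, data in datasets:
        train, test = train_test_split(data)
        print(label, "leaked test entries:", leaked_entries(train, test))
```

When a duplicated entry lands on both sides of the split, the model is tested on something it has already seen. Once the duplicates are removed, no test entry can match a training entry, whichever way the data are shuffled.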
How can this benchmark dataset improve the research in this field?
The dataset includes the most up-to-date information from KEGG but eliminates the duplicate entries found in the prior benchmark dataset. Furthermore, we can use the tools we developed to recreate the dataset based on updates to KEGG in the future.
In addition, our machine learning methods can improve the analysis of datasets by increasing the number of compounds mapped to specific metabolic pathways. The tools that we created to generate this dataset can also be used to generate related datasets from other metabolic pathway databases.
What are your next steps?
We are developing new machine learning methods to further refine and improve metabolite mapping to metabolic pathways. We will continue to update the dataset as the information available in KEGG increases, and we plan to expand the tools to generate similar datasets from another database called MetaCyc, which contains information on metabolic pathways, enzymes, and metabolites.