Novel machine learning models to predict endocrine disruption activity for high-throughput chemical screening. 2022

Sean P Collins and Tara S Barton-Maclaren
Existing Substances Risk Assessment Bureau, Healthy Environments and Consumer Safety Branch, Health Canada, Ottawa, ON, Canada.

Endocrine disrupting chemicals (EDCs) are an area of ongoing concern in toxicology and chemical risk assessment. However, thousands of legacy chemicals lack the toxicity testing required to assess their EDC potential, and this is where computational toxicology can play a crucial role. The United States Environmental Protection Agency (US EPA) has run two programs, the Collaborative Estrogen Receptor Activity Project (CERAPP) and the Collaborative Modeling Project for Receptor Activity (CoMPARA), which aim to predict estrogen and androgen activity, respectively. The US EPA solicited endocrine receptor activity Qualitative (or Quantitative) Structure Activity Relationship ([Q]SAR) models from research groups around the world and combined them into consensus models for different toxicity endpoints. In this work, Random Forest (RF) models were developed to cover a broader range of substances with high predictive capability, using the large CERAPP and CoMPARA datasets for estrogen and androgen activity, respectively. By combining simple descriptors from open-source software with large training datasets, the RF models expand the domain of applicability for predicting endocrine disrupting activity and help in the screening and prioritization of extensive chemical inventories. In addition, the RFs were trained to predict activity conservatively: the models are more likely to make false-positive predictions in order to minimize the number of false negatives. This work presents twelve binary and multi-class RF models to predict binding, agonism, and antagonism for estrogen and androgen receptors. The RF models were found to have high predictive capability compared to other in silico models, with some models reaching balanced accuracies of 93% with coverage of 89%.
These models are intended to be incorporated into evolving priority-setting workflows and integrated strategies to support the screening and selection of chemicals for further testing and assessment by identifying potential endocrine-disrupting substances.
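
The two reported metrics and the idea of a conservative prediction can be illustrated with a minimal sketch. It assumes a binary model that outputs a probability of activity and that conservatism is obtained by lowering the decision threshold; the abstract does not say how the RF models were made conservative, so this is one possible mechanism, not the authors' method. The function names, the thresholds of 0.5 and 0.3, and the toy data are all hypothetical. Balanced accuracy is taken as the mean of sensitivity and specificity, and coverage as the fraction of chemicals that fall inside the domain of applicability.

```python
from typing import List, Sequence, Tuple


def label(p_active: float, threshold: float) -> int:
    """Call a chemical active (1) when its predicted probability meets the threshold."""
    return 1 if p_active >= threshold else 0


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for binary labels, where 1 = active."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity and specificity; assumes both classes occur in y_true."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def coverage(in_domain: Sequence[bool]) -> float:
    """Fraction of chemicals inside the (assumed) domain of applicability."""
    return sum(in_domain) / len(in_domain)


# Hypothetical toy data: true activity, predicted probability, in-domain flag.
y_true: List[int] = [1, 1, 1, 0, 0, 0, 0, 1]
p_active: List[float] = [0.9, 0.45, 0.35, 0.4, 0.1, 0.2, 0.05, 0.6]
in_domain: List[bool] = [True, True, True, True, True, True, False, True]

# Score only the chemicals the model is allowed to predict.
kept_true = [t for t, d in zip(y_true, in_domain) if d]
kept_prob = [p for p, d in zip(p_active, in_domain) if d]

default_pred = [label(p, 0.5) for p in kept_prob]       # conventional cut-off
conservative_pred = [label(p, 0.3) for p in kept_prob]  # lowered cut-off (assumption)

print("coverage:", coverage(in_domain))
print("default      (tp, tn, fp, fn):", confusion_counts(kept_true, default_pred))
print("conservative (tp, tn, fp, fn):", confusion_counts(kept_true, conservative_pred))
print("default balanced accuracy:     ", balanced_accuracy(kept_true, default_pred))
print("conservative balanced accuracy:", balanced_accuracy(kept_true, conservative_pred))
```

On this toy data, lowering the threshold removes both false negatives at the cost of one false positive, which is the trade-off a conservative screening model accepts.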
