Quantitative Prediction of Systemic Toxicity Points of Departure
Oak Ridge Institute for Science and Education, Oak Ridge, Tennessee
U.S. Environmental Protection Agency, Research Triangle Park, North Carolina
Human health risk assessment associated with environmental chemical exposure is limited by the tens of thousands of chemicals with little or no experimental in vivo toxicity data. Data gap filling techniques, such as quantitative models based on chemical structure information, are commonly used to predict hazard in the absence of experimental data. This study presents a set of predictive models, developed using chemical structural and physicochemical properties, for chronic or sub-chronic in vivo points of departure (POD, the point on the dose-response curve that marks the beginning of a low-dose extrapolation). The in vivo data are taken from the EPA's ToxValDB, a compilation of information on ~3000 unique chemicals from a variety of public data sources. Using these data, with PubChem fingerprints and Chemistry Development Kit (CDK) descriptors as the feature sets, two types of models were developed: (1) rat (756 training chemicals) and (2) mouse (526 training chemicals). Unsupervised feature selection was used to remove the fingerprints with less than 80% variance, and supervised recursive feature elimination with linear regression was used to select the 5 most relevant descriptors. Regression models for both rat and mouse were developed using linear regression, random forest (RF), and K-nearest neighbor algorithms, with hyperparameter tuning performed within a 5-fold cross-validation scheme. The best rat model (RF) had an RMSE of 1.02 log10 mg/kg/day and an R² of 0.36, and the best mouse model (RF) had an RMSE of 0.98 log10 mg/kg/day and an R² of 0.25. Because the training data for both types of models were imbalanced, 5 bootstrap sample datasets were created with 10% duplicate data (randomly selected from the long tail), and the models were re-developed on these bootstrapped datasets. The best resultant rat model (RF) had an average RMSE of 0.92 log10 mg/kg/day and an R² of 0.48, and the best resultant mouse model (RF) had an average RMSE of 0.90 log10 mg/kg/day and an R² of 0.37. Future directions will include adding uncertainty estimates to the predicted POD values. These models will be used in the context of chemical screening and prioritization efforts.
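The imbalance handling and the reported metrics can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the "long tail" means the lowest 10% of log10 POD values (the abstract does not define it), that the 10% duplicates are drawn with replacement, and that RMSE and R² are computed on log10 mg/kg/day values. The function names, the `tail_quantile` parameter, and the toy data are hypothetical.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted log10 PODs."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def augment_with_tail_duplicates(records, frac=0.10, tail_quantile=0.10, seed=0):
    """Return records plus `frac` * len(records) duplicates drawn from the tail.

    records: list of (chemical_id, log10_pod) tuples.
    ASSUMPTION: the tail is the lowest `tail_quantile` share of log10 POD
    values; the abstract does not say which tail or how it was delimited.
    """
    rng = random.Random(seed)
    ordered = sorted(records, key=lambda r: r[1])
    n_tail = max(1, int(len(ordered) * tail_quantile))
    tail = ordered[:n_tail]
    n_extra = int(round(len(records) * frac))
    # Sample with replacement so the same tail chemical may be duplicated twice.
    extras = [rng.choice(tail) for _ in range(n_extra)]
    return list(records) + extras


# Toy data only: 100 made-up chemicals with log10 POD values (not ToxValDB data).
toy_rng = random.Random(42)
toy = [(f"chem{i}", toy_rng.gauss(1.5, 1.0)) for i in range(100)]

# Five augmented datasets, one per seed, mirroring the 5 bootstrap samples.
datasets = [augment_with_tail_duplicates(toy, seed=s) for s in range(5)]
```

A model would then be refit on each of the five datasets and its RMSE and R² averaged across them, which is one plausible reading of the "average" values reported above.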
This abstract does not necessarily represent U.S. EPA policy.