S5: Predicting Cytotoxicity

Understanding and Predicting Cytotoxicity
Fredrik Svensson, Postdoctoral Research Fellow, University of Cambridge


Fredrik Svensson (1), Lewis Merwin (1), Ulf Norinder (2,3), Andreas Bender (1)
(1) Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, UK
(2) Swedish Toxicology Sciences Research Center, Sweden
(3) Department of Computer and Systems Sciences, Stockholm University, Sweden


Even low-level cytotoxicity can be linked to adverse drug effects. Accurate predictions of cytotoxicity therefore have the potential to expedite decision-making and reduce attrition throughout the drug discovery and development process.

A compound can impair cell viability through a multitude of mechanisms, which lead by different routes to cell death or cell cycle arrest. Large-scale assays generally do not probe the underlying mechanisms; instead, cell viability is measured in a way that encompasses all of these changes. Predictive tools can be a valuable addition to these screens, helping to understand a compound's mode of action.

A major challenge when modeling cytotoxicity on large datasets is the imbalanced nature of the data: generally, only a small fraction of compounds display cytotoxicity. Most machine learning techniques struggle to produce models with balanced performance, that is, comparable sensitivity and specificity, on this kind of data.
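
The effect of imbalance on evaluation can be illustrated with a minimal sketch. The counts below are hypothetical (1,000 compounds, 8 of them toxic) and are not taken from the study, and the function name `confusion_metrics` is illustrative only. A trivial model that labels every compound non-toxic reaches high accuracy while detecting no toxic compound at all, which is why sensitivity and specificity are reported separately.

```python
def confusion_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary labels, 1 = toxic."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # toxic, predicted toxic
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-toxic, predicted non-toxic
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-toxic, predicted toxic
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # toxic, predicted non-toxic

    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity


# Hypothetical, strongly imbalanced set: 8 toxic and 992 non-toxic compounds.
y_true = [1] * 8 + [0] * 992
# A trivial "model" that predicts non-toxic for every compound.
y_pred = [0] * 1000

accuracy, sensitivity, specificity = confusion_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f}")
```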

We collected publicly available cytotoxicity data from PubChem. The dataset comprised more than 440,000 unique compounds and was strongly imbalanced, with only 0.8% of the compounds being toxic. Random forest models were constructed using Python and the scikit-learn package, in combination with the nonconformist package for conformal prediction. On external data available for one of the cell lines, the conformal predictor had a sensitivity of 74% and a specificity of 65%, a strong and well-balanced performance for this kind of prediction.
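
One common way for conformal prediction to cope with class imbalance is class-conditional (Mondrian) calibration, in which the p-value for each class is computed against calibration examples of that class only. The sketch below is a hand-written, pure-Python illustration of that idea and does not use the nonconformist API. Whether the models described here were calibrated in exactly this way is an assumption, as are the nonconformity score (1 minus the predicted probability of the class), the function names, and all numbers.

```python
def mondrian_p_values(calibration_scores, test_scores):
    """Class-conditional conformal p-values.

    calibration_scores: dict mapping a class label to the nonconformity
        scores of the calibration examples whose true class is that label.
    test_scores: dict mapping a class label to the nonconformity score the
        test example would have if it were assigned that label.
    """
    p_values = {}
    for label, score in test_scores.items():
        cal = calibration_scores[label]
        # Count same-class calibration examples that are at least as
        # nonconforming as the test example under this tentative label.
        n_geq = sum(1 for s in cal if s >= score)
        p_values[label] = (n_geq + 1) / (len(cal) + 1)
    return p_values


def prediction_set(p_values, significance):
    """Return the labels that cannot be rejected at the given significance."""
    return {label for label, p in p_values.items() if p > significance}


# Hypothetical calibration scores (1 - predicted probability of the true
# class). The toxic class is much smaller, yet it is calibrated on its own.
calibration_scores = {
    "toxic": [0.2, 0.4, 0.6, 0.9],
    "non-toxic": [0.05, 0.1, 0.1, 0.2, 0.3, 0.15, 0.25, 0.5, 0.1],
}
# Hypothetical test compound with predicted toxic probability 0.7.
test_scores = {"toxic": 0.3, "non-toxic": 0.7}

p_values = mondrian_p_values(calibration_scores, test_scores)
labels = prediction_set(p_values, significance=0.2)
print(p_values, labels)
```

Because each class is calibrated separately, the minority class is not swamped by the majority class when p-values are computed. The prediction set may contain one label, both labels, or none, depending on the chosen significance level.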

Our work shows that models for cytotoxicity can be generated successfully from public data using open methods. Conformal prediction facilitates the handling of class imbalance while also delivering predictions with an associated confidence.