Session 6: Limitations of data mining

Limitations of data mining and machine learning for toxicology

Eli Goldberg


ETH Zurich


PhD Candidate


In the last 15 years, the advancement of mechanistic toxicology models has been slow, primarily due to a lack of quantitative understanding of complex biological interactions.

The effort, however, has generated a tremendous amount of in vitro and in vivo toxicological data, much of which is readily available for data mining (DM) and/or the development of machine learning-based (ML) predictive models.

The promise of DM and ML is striking: these methods can assist in decision support, improve the speed and accuracy of risk assessment, and support drug discovery efforts. In the face of such possible improvements, the limitations of DM/ML approaches are often ignored. Most importantly, while DM and ML can provide phenomenological insight, they cannot provide process understanding, except when employed to identify features with potential mechanistic relevance.

Further, proper model development and application has to be accompanied by an investigation of the applicability domain and an examination of model stability and generalizability, and should employ automated feature selection to avoid over-fitting.
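As an illustration of the first two checks, the following is a minimal sketch in Python, not a method taken from the talk. It assumes a simple bounding-box definition of the applicability domain (a query is in-domain only if every descriptor lies within the range seen in training) and uses the spread of k-fold cross-validation errors as a rough indicator of stability. The data are synthetic, the k-nearest-neighbour predictor is only a stand-in model, and all function names (`fit_bounding_box`, `in_domain`, `knn_predict`, `kfold_rmse`) are hypothetical. Feature selection is not covered here.

```python
import random
import statistics

def fit_bounding_box(X):
    """Per-descriptor (min, max) ranges observed in the training data."""
    return [(min(col), max(col)) for col in zip(*X)]

def in_domain(box, x):
    """True if every descriptor of x lies inside the training range."""
    return all(lo <= v <= hi for (lo, hi), v in zip(box, x))

def knn_predict(X_train, y_train, x, k=3):
    """Mean response of the k nearest training points (squared Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), y)
        for row, y in zip(X_train, y_train)
    )
    return statistics.fmean(y for _, y in dists[:k])

def kfold_rmse(X, y, n_folds=5, k=3, seed=0):
    """RMSE on each held-out fold; a large spread suggests an unstable model."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    scores = []
    for fold in folds:
        held = set(fold)
        X_tr = [X[i] for i in idx if i not in held]
        y_tr = [y[i] for i in idx if i not in held]
        sq = [(knn_predict(X_tr, y_tr, X[i], k) - y[i]) ** 2 for i in fold]
        scores.append(statistics.fmean(sq) ** 0.5)
    return scores

# Synthetic stand-in for a descriptor matrix and a measured endpoint
# (assumption: two descriptors in [0, 1], near-linear response with noise).
rng = random.Random(42)
X = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(100)]
y = [2 * a + b + rng.gauss(0, 0.05) for a, b in X]

box = fit_bounding_box(X)
scores = kfold_rmse(X, y)

print("query [0.5, 0.5] in domain:", in_domain(box, [0.5, 0.5]))
print("query [3.0, 0.5] in domain:", in_domain(box, [3.0, 0.5]))
print("fold RMSE mean / stdev:", statistics.fmean(scores), statistics.stdev(scores))
```

A prediction for the second query would be an extrapolation beyond the training descriptors and should be flagged rather than reported. A bounding box is among the crudest domain definitions; distance-based or leverage-based definitions are common alternatives.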

For applications where the only objective is prediction, pure DM and ML approaches may be sufficient. However, where process understanding is critical to the application, mechanistic or domain-restricted DM/ML models should be developed.