POSTER ABSTRACT / DETAILS:
The Ames mutagenicity test provides valuable experimental information for the drug discovery process: an estimate of the potential carcinogenicity of drug candidates. The in silico counterpart of the Ames test is typically a QSAR model applied during virtual screening. We present a recent, thorough study of a large set of Ames QSAR models. Exhaustive data mining was performed on a publicly available set of 6512 chemical compounds and their experimental Ames test results.
The original chemical structures were represented at the topological level by connection tables and/or SMILES linear notation. The 3D geometries of the molecules were generated with OpenMopac. We calculated a large set of 1D, 2D and 3D descriptors as well as more than 10 different sets of molecular fingerprints. The initial pool exceeded 15 000 molecular descriptors.
The Ames models were derived with a variety of data mining techniques: k-nearest neighbors (KNN), support vector machines (SVM), Random Forest classification, logistic regression, Gaussian processes, Radial Basis Function classification, etc. For each modeling technique, several descriptor selection methods were applied: principal component analysis (PCA), a correlation approach, the Best First search method, Information Gain Ratio, genetic algorithms, etc. Additionally, some of the descriptor selection procedures were tested with different parameter settings.
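As an illustration of one of the selection methods named above, the following is a minimal sketch of Information Gain Ratio ranking for discrete features such as binary fingerprint bits. It is not the Weka implementation used in the study; the function names (entropy, gain_ratio, rank_features) and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of a discrete feature divided by its split information."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Entropy of the labels that remains after splitting on the feature.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    # A constant feature has zero split information and carries no signal.
    return info_gain / split_info if split_info > 0 else 0.0


def rank_features(columns, labels, top_k):
    """Indices of the top_k feature columns, best gain ratio first."""
    scores = [(gain_ratio(col, labels), j) for j, col in enumerate(columns)]
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scores[:top_k]]


# Hypothetical toy data: 6 compounds, 1 = mutagenic, 0 = non-mutagenic.
labels = [1, 1, 1, 0, 0, 0]
fingerprint_bits = [
    [1, 1, 1, 0, 0, 0],  # bit 0: matches the label exactly
    [1, 0, 1, 0, 1, 0],  # bit 1: weakly related to the label
    [1, 1, 1, 1, 1, 1],  # bit 2: constant, uninformative
]
selected = rank_features(fingerprint_bits, labels, top_k=2)
```

Continuous descriptors would need to be discretized first before a score of this kind can be computed; how that was handled in the study is not stated here.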
In addition, each modeling technique was applied to various combinations of the molecular descriptor sets and fingerprint sets. More than 100 QSAR models were studied. Model validation was performed by external 5-fold cross-validation, in which descriptor selection was repeated for each of the 5 data set resamplings (validation ‘folds’).
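The key point of the validation scheme, that descriptor selection is redone inside every fold using training rows only, can be sketched as follows. This is a simplified illustration, not the workflow used in the study: the mean-difference selector, the 1-nearest-neighbor classifier, the toy data and all function names are assumptions standing in for the actual Weka components.

```python
import random


def make_toy_data(n=20, seed=1):
    """Toy descriptor table: column 0 tracks the label, columns 1-2 are noise."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[10.0 * label + rng.random(), rng.random(), rng.random()] for label in y]
    return X, y


def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_descriptors(X, y, rows, top_k):
    """Rank descriptors by |mean(class 1) - mean(class 0)| on the given rows only."""
    pos = [r for r in rows if y[r] == 1]
    neg = [r for r in rows if y[r] == 0]
    if not pos or not neg:
        return list(range(top_k))  # degenerate split: fall back to the first columns
    scores = []
    for j in range(len(X[0])):
        mean_pos = sum(X[r][j] for r in pos) / len(pos)
        mean_neg = sum(X[r][j] for r in neg) / len(neg)
        scores.append((abs(mean_pos - mean_neg), j))
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scores[:top_k]]


def predict_1nn(X, y, train_rows, cols, query):
    """Label of the nearest training row, using only the selected columns."""
    def dist(r):
        return sum((X[r][j] - X[query][j]) ** 2 for j in cols)
    return y[min(train_rows, key=dist)]


def cross_validate(X, y, k=5, top_k=1, seed=0):
    """Per-fold accuracy with descriptor selection repeated inside each fold."""
    folds = k_fold_indices(len(y), k, seed)
    accuracies = []
    for i, test_rows in enumerate(folds):
        train_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
        # Selection sees the training rows only, never the held-out fold.
        cols = select_descriptors(X, y, train_rows, top_k)
        hits = sum(predict_1nn(X, y, train_rows, cols, r) == y[r] for r in test_rows)
        accuracies.append(hits / len(test_rows))
    return accuracies


X, y = make_toy_data()
fold_accuracies = cross_validate(X, y, k=5, top_k=1)
```

Selecting descriptors once on the full data set and cross-validating afterwards would let information from the held-out compounds leak into the model, which tends to inflate the reported accuracy.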
The top 10 and top 20 models were selected and used to build a consensus model. Descriptors and fingerprints were calculated with DRAGON software version 5.4 and PaDEL software, together with some custom fingerprints from the Ambit2 software system. Models were built with a collection of machine learning algorithms implemented in Weka software version 3.7.11.
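The abstract does not state how the selected models were combined. One common option is an unweighted majority vote, sketched below purely as an assumption; the model names, the scores, and the tie rule (ties count as mutagenic) are hypothetical.

```python
def select_top_models(cv_scores, n):
    """Names of the n models with the highest cross-validation score."""
    ranked = sorted(cv_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]


def consensus_predict(predictions, model_names):
    """Unweighted majority vote over 0/1 predictions of the chosen models.

    Assumption: a tie (possible with an even number of models, e.g. 10 or 20)
    is resolved as 1 (mutagenic), the more conservative call.
    """
    n_compounds = len(predictions[model_names[0]])
    consensus = []
    for i in range(n_compounds):
        votes = sum(predictions[name][i] for name in model_names)
        consensus.append(1 if 2 * votes >= len(model_names) else 0)
    return consensus


# Hypothetical scores and per-compound predictions (1 = mutagenic).
cv_scores = {"svm": 0.80, "rf": 0.82, "knn": 0.75, "logreg": 0.70}
predictions = {
    "svm": [1, 1, 0],
    "rf": [1, 0, 1],
    "knn": [0, 0, 1],
    "logreg": [0, 0, 0],
}
top_models = select_top_models(cv_scores, 3)
consensus = consensus_predict(predictions, top_models)
```

Weighted voting or averaging of predicted probabilities would be equally plausible readings of "consensus model".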
K. Hansen, S. Mika, T. Schroeter, A. Sutter, A. ter Laak, T. Steger-Hartmann, N. Heinrich, K.-R. Müller, Benchmark Data Set for in Silico Prediction of Ames Mutagenicity, J. Chem. Inf. Model., 2009, 49, 2077–2081
D. Weininger, SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules, J. Chem. Inf. Comput. Sci., 1988, 28(1), 31–36
MOPAC2009, James J. P. Stewart, Stewart Computational Chemistry, Version 11.366W, web: HTTP://OpenMOPAC.net
Talete srl, DRAGON for Windows (Software for Molecular Descriptor Calculations), Version 5.4, 2006, http://www.talete.mi.it/
C. W. Yap, PaDEL-Descriptor: An Open Source Software to Calculate Molecular Descriptors and Fingerprints, J. Comput. Chem., 2011, 32(7), 1466–1474
J. Jaworska, N. Nikolova-Jeliazkova, How can structural similarity analysis help in category formation?, SAR and QSAR in Environmental Research, 2007, 18(3-4), 195–207
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The WEKA Data Mining Software: An Update, SIGKDD Explorations, 2009, 11(1)