Fabio Mendes dos Santos 1, Hans de Winter 2, Koen Augustyns 2 and Julio Cesar Dias Lopes 1,2
1 NEQUIM – Chemoinformatics Group – Departamento de Quimica – Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
2 Medicinal Chemistry Group – Department of Pharmaceutical Sciences – University of Antwerp, Antwerp, Belgium
Molecular modeling work can be divided into three equally important steps. The first is the choice of descriptors, which must be able to describe the studied properties accurately. The second is the modeling method, which must be planned carefully to produce the desired response. The third is the validation process, which must be properly planned in order to assess the validity of the findings. The most popular validation methods are jackknife, cross-validation and bootstrap.
In this work, we present a new method for the validation of molecular modeling studies that combines cross-validation with recursive jackknife modeling. Initially, the instances under study, belonging to two different classes (active/inactive, for instance), are divided into several groups of the same size (typically five to ten). One of these groups is used as an internal validation set, and the remaining groups are recursively divided into two sets, one for training and the other for evaluation. Each of the original groups is used once for internal validation and one or more times for training and evaluation. The number of models generated varies from 20, for five groups, to 840, for 10 groups. Additionally, we apply Y-randomization to each model in order to assess its validity relative to a random model.
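The group rotation described above can be sketched as a simple enumeration. The abstract does not state how many of the remaining groups go to evaluation, so the function name `enumerate_splits` and the parameter `n_eval` are assumptions made for illustration. With one evaluation group out of five, and three out of ten, this enumeration happens to give the 20 and 840 models quoted in the text; the authors' actual recursive division rule may differ.

```python
from itertools import combinations


def enumerate_splits(n_groups, n_eval):
    """Yield (validation, evaluation, training) tuples of group indices.

    Hypothetical sketch: each group serves once as the internal validation
    set, and every choice of n_eval groups among the rest is used for
    evaluation, with the leftover groups used for training.
    """
    groups = range(n_groups)
    for validation in groups:
        rest = [g for g in groups if g != validation]
        for evaluation in combinations(rest, n_eval):
            training = tuple(g for g in rest if g not in evaluation)
            yield validation, evaluation, training


# Assumed evaluation sizes: 1 of the 4 remaining groups, 3 of the 9 remaining.
splits_5 = list(enumerate_splits(5, 1))    # 5 * C(4, 1) = 20 splits
splits_10 = list(enumerate_splits(10, 3))  # 10 * C(9, 3) = 840 splits
print(len(splits_5), len(splits_10))       # prints: 20 840
```

Each tuple identifies one model to train, one evaluation set to tune or select it on, and one internal validation group that the model never sees during training.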
The full set of generated models must be subjected to external validation, since the extensive cross-validation can assess model validity only within the dataset. If no such external group is available, it can be generated from the original dataset using bootstrap.
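One common way to derive an external set by bootstrap is to sample instances with replacement and hold out the instances that were never drawn (the out-of-bag instances). The abstract does not say which bootstrap variant was used, so the following is only a sketch under that assumption; `bootstrap_split` is a hypothetical helper name.

```python
import random


def bootstrap_split(n_instances, seed=0):
    """Return (in_bag, out_of_bag) lists of instance indices.

    Assumption: the external set is formed from the out-of-bag instances,
    i.e. those never drawn when sampling n_instances with replacement.
    """
    rng = random.Random(seed)
    in_bag = [rng.randrange(n_instances) for _ in range(n_instances)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_instances) if i not in drawn]
    return in_bag, out_of_bag


in_bag, out_of_bag = bootstrap_split(100)
# in_bag would feed the cross-validation; out_of_bag would be held out.
```

On average roughly a third of the instances end up out of bag, so the held-out set is of useful size without any manual partitioning.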
We applied the approach described above to build models that predict permeation of the blood-brain barrier (BBB), the outcome of the Ames mutagenicity test, and inhibition of five cytochrome P450 isoforms (3A4, 1A2, 2D6, 2C9 and 2C19). The descriptors were 3D pharmacophore fingerprints generated with in-house software (3DPharma), together with multiple-species and conformational fuzzification. All structures were subjected to manual pre-treatment (desalting and structure correction) and to treatment for multiple tautomers and protomers using ChemAxon software (Structure Checker and Calculation Plugins). Multiple conformations and charges were calculated with the OMEGA and Molcharge software from OpenEye. LibSVM was used to build the models.
For BBB permeation, the generated models achieved a mean accuracy above 95%, versus a randomized-model accuracy of 80%. For the Ames test, the mean accuracy was 75%, against 50% for the randomized model. For the cytochrome isoforms, the accuracy varied from 70% to 85%, with randomized-model accuracies between 50% and 70%. It is worth noting that the accuracy produced by Y-randomization reflects the class composition of the dataset.
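The dependence of the Y-randomization accuracy on dataset composition can be illustrated with a toy calculation. Shuffling the labels destroys any link to the descriptors but keeps the class proportions, so a model that can only fall back on the majority class scores the majority fraction. The 80/20 label set below is hypothetical and is not the authors' data; the helper names are also assumptions.

```python
import random
from collections import Counter


def y_randomize(labels, seed=0):
    """Return a shuffled copy of the labels (class counts are preserved)."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical imbalanced dataset: 80 actives, 20 inactives.
labels = [1] * 80 + [0] * 20
shuffled = y_randomize(labels)
baseline = majority_baseline_accuracy(shuffled)  # 0.8 for this composition
```

Under this reading, a randomized accuracy near 50% suggests a balanced dataset, while a higher value suggests class imbalance, so the gain over the randomized model is more informative than the raw accuracy alone.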
The computational cost of the approach presented here is high, but it allows one to assess the validity of the modeling approach, as well as how well the descriptors represent the modeled instances. Although the number of instances used to generate each model is smaller than in a direct (jackknife) approach, the results are of the same order.
We thank the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) for fellowships (FMS and JCDL), and ChemAxon and OpenEye for academic software licenses (FMS and JCDL).