Workshop: Analyze Databases of Small Molecules

Diversity Genie™: A New Application to Analyze Databases of Small Molecules

Iwona E. Weidlich, Computational Drug Design Systems (CODDES) LLC, Rockville, MD
Igor V. Filippov, VIF Innovations, LLC, Rockville, MD



Many tasks in computational chemistry and chemoinformatics involve parsing and manipulating chemical structure files, such as SDF and SMILES, and identifiers such as InChI. Common tasks include data visualization, filtering by associated property values or structural identity, computation of simple physicochemical descriptors, data merging and extraction, and sorting of molecules.

We present a new tool that allows a user to easily estimate the diversity of a chemical set, visualize groupings of similar molecules, calculate, sort, and filter by various molecular properties, and interconvert between SD, SMILES, and InChI formats. Most existing tools either focus on a single molecule at a time or require a nontrivial amount of script programming. Our tool is designed to work with sets of molecules and presents a simple, straightforward menu-driven interface for a computational chemist’s everyday tasks.
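To make the notion of a diversity estimate concrete, here is a minimal sketch of one common approach: the mean pairwise Tanimoto distance over a set of molecular fingerprints. This is an illustration only, not a description of how Diversity Genie computes its estimate, which the abstract does not specify. Each fingerprint is represented as a set of integer feature identifiers, and the names `tanimoto`, `diversity`, and `toy_set` are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 1.0  # convention assumed here: two empty fingerprints are identical
    return len(a & b) / len(a | b)


def diversity(fingerprints):
    """Mean pairwise Tanimoto distance (1 - similarity) over a set of molecules.

    Returns 0.0 when fewer than two fingerprints are supplied.
    """
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy data (assumption): integer sets standing in for fingerprint "on" bits.
toy_set = [
    frozenset({1, 2, 3}),
    frozenset({2, 3, 4}),
    frozenset({7, 8, 9}),
]
toy_diversity = diversity(toy_set)
print(f"diversity = {toy_diversity:.3f}")
```

A value near 0 indicates a set of near-duplicates, and a value near 1 indicates that the molecules share almost no features. In practice the feature sets would come from a cheminformatics toolkit rather than being written by hand.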

We used it to analyze the data for the Tox21 challenge. Data-mining problems such as SAR/QSAR modeling frequently require the underlying datasets to be diverse so that the resulting models apply to a wide variety of chemotypes. With the increasing availability of large datasets numbering in the tens of millions of structures, there is now enough publicly accessible data to investigate the distribution of diversity across many sets of randomly selected compounds, which has not been possible before. We demonstrate how our tool can simplify such diversity analysis as well as other chemoinformatics workloads.
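The idea of a diversity distribution over many randomly selected sets can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' procedure: diversity is taken to be the mean pairwise Tanimoto distance, fingerprints are modeled as sets of integer features, and the pool is synthetic random data. The names `mean_pairwise_distance`, `diversity_distribution`, and `pool` are hypothetical.

```python
import random
import statistics
from itertools import combinations


def mean_pairwise_distance(fingerprints):
    """Mean pairwise Tanimoto distance over a list of feature sets."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        union = len(a | b)
        similarity = len(a & b) / union if union else 1.0
        total += 1.0 - similarity
    return total / len(pairs)


def diversity_distribution(pool, subset_size, n_subsets, seed=0):
    """Draw random subsets from the pool and return one diversity score per subset."""
    rng = random.Random(seed)  # seeded so the sampling is reproducible
    return [
        mean_pairwise_distance(rng.sample(pool, subset_size))
        for _ in range(n_subsets)
    ]


# Synthetic pool (assumption): 200 "molecules", each with 8 of 64 possible features.
pool_rng = random.Random(42)
pool = [frozenset(pool_rng.sample(range(64), 8)) for _ in range(200)]

scores = diversity_distribution(pool, subset_size=20, n_subsets=50)
print(f"mean = {statistics.mean(scores):.3f}, stdev = {statistics.stdev(scores):.3f}")
```

The spread of `scores` gives a baseline for what a randomly chosen set of a given size looks like, against which the diversity of a curated dataset could be compared.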

Diversity estimate, data mining, databases, (Q)SAR, SDF, SMILES, InChI, Tox21 challenge