If you inspect the TCAMS dataset closely, you will find a column called "Target Hypothesis". You will also find that this column is empty for most compounds. The Tres Cantos group at GlaxoSmithKline checked which TCAMS compounds had already been identified as hits in previous experiments.
If the target in those previous experiments has a homolog in P. falciparum, they hypothesized that the compound also targets this protein in P. falciparum. We will try to extract this information and use it to build a model that predicts whether a compound is a kinase inhibitor or not.
To create the dataset required for model building go to the dataset URI http://apps.ideaconsult.net:8080/ambit2/dataset/584486?page=0&pagesize=10.
Browse the dataset and find the column “Target hypothesis”. You will note that most entries are empty (only ~6% of the compounds have a target hypothesis annotated). In the 10 compounds displayed by default when following the link to the TCAMS data, you will not find a single one with a non-empty target hypothesis. You could either browse through the pages until you find a non-empty target hypothesis, or increase the pagesize to e.g. 100.
Eventually, you should find an entry with the value “Adrenergic receptor antagonist” (e.g. using "?page=1&pagesize=10" it will be towards the bottom of the page). The feature value "Adrenergic receptor antagonist" is actually a link to a search against the whole Ambit2 database for compounds with the "Target_Hypothesis" feature and a feature value equal to "Adrenergic receptor antagonist".
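As a small aid for the browsing step, the page URIs can be generated instead of edited by hand. The following is only a sketch: the helper name dataset_page_uri is made up for illustration, and it assumes nothing beyond the "page" and "pagesize" query parameters already visible in the dataset link above.

```python
from urllib.parse import urlencode

# Dataset URI taken from the tutorial text.
DATASET_URI = "http://apps.ideaconsult.net:8080/ambit2/dataset/584486"


def dataset_page_uri(page: int, pagesize: int) -> str:
    """Return the URI of one page of the TCAMS dataset listing.

    Hypothetical helper: it only appends the 'page' and 'pagesize'
    query parameters shown in the tutorial's link.
    """
    query = urlencode({"page": page, "pagesize": pagesize})
    return f"{DATASET_URI}?{query}"


# Show 100 compounds at once instead of the default 10.
print(dataset_page_uri(0, 100))
```

Opening the printed URI in a browser should have the same effect as changing the pagesize by hand in the address bar.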
What we would like is a list of kinase inhibitors (the target hypothesis for these is called "Ser/Thr protein kinase"). To create a predictive model, however, that's not enough, as it represents only the "positive" compounds. We need some negatives, as well. We cannot use the compounds with an empty target hypothesis, as these may be kinase inhibitors. We can use, however, the compounds with a target hypothesis different from "Ser/Thr protein kinase". Thus, to create our training dataset for model building, let's download all compounds with a non-empty target hypothesis.
To achieve this, we will use the URI searching the whole Ambit2 database for "Adrenergic receptor antagonists" as a starting point, and modify it to search for all non-empty target hypotheses. If you copy-paste the URI given by the "Adrenergic receptor antagonist" link, you will get:
First, we modify the search query: instead of Adrenergic receptor antagonists, we want to search for empty target hypotheses (then we will simply negate the search at the end). Thus, we replace
"search=Adrenergic+receptor+antagonist" with "search=+"
To negate the search, add "&condition=!=" at the end of the URI. The resulting URI is http://apps.ideaconsult.net:8080/ambit2/compound?feature_uris=http://apps.ideaconsult.net:8080/ambit2/feature/636447&property=Target_Hypothesis&search=+&feature_uris=http://apps.ideaconsult.net:8080/ambit2/dataset/584486/feature&condition=!=
When following the above URL, you’ll get a table of compounds that have a non-empty Target_Hypothesis.
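The search URI can also be assembled programmatically, which makes the two modifications (the empty search term and the negated condition) explicit. This is a minimal sketch: the function name build_search_uri and its parameters are illustrative, and the query parameters are simply those that appear in the URI above. A search value of a single space is encoded as "+", which is what "search=+" stands for.

```python
from urllib.parse import urlencode

# Base URI and feature URIs are copied from the tutorial's search URI.
COMPOUND_URI = "http://apps.ideaconsult.net:8080/ambit2/compound"
FEATURE_URI = "http://apps.ideaconsult.net:8080/ambit2/feature/636447"
DATASET_FEATURES_URI = "http://apps.ideaconsult.net:8080/ambit2/dataset/584486/feature"


def build_search_uri(search: str, condition: str = "") -> str:
    """Build an Ambit2 compound search URI for the Target_Hypothesis feature.

    Hypothetical helper. A list of pairs is used because the
    'feature_uris' parameter occurs twice in the query string.
    """
    params = [
        ("feature_uris", FEATURE_URI),
        ("property", "Target_Hypothesis"),
        ("search", search),
        ("feature_uris", DATASET_FEATURES_URI),
    ]
    if condition:
        params.append(("condition", condition))
    # Keep ':', '/', '!' and '=' unescaped so the result matches the
    # hand-edited URI; a space is still encoded as '+'.
    return f"{COMPOUND_URI}?{urlencode(params, safe=':/!=')}"


# Empty search term (a single space) combined with the negation "!=".
print(build_search_uri(" ", condition="!="))
```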
The next step will be to export data.
Click on the left one of the two little Excel icons (when moving the mouse pointer on top of it, a small text box “text/csv” should appear) to save the selected data as CSV.
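If you prefer to script the export, the same data can probably be requested directly as CSV. The sketch below assumes that the Ambit2 service honours an "Accept: text/csv" request header (suggested by the “text/csv” label on the icon, but not verified here) and that the server is still reachable. The function names and the output file name in the comment are hypothetical.

```python
from urllib.request import Request, urlopen

# Search URI for all compounds with a non-empty Target_Hypothesis,
# as given in the tutorial.
SEARCH_URI = (
    "http://apps.ideaconsult.net:8080/ambit2/compound"
    "?feature_uris=http://apps.ideaconsult.net:8080/ambit2/feature/636447"
    "&property=Target_Hypothesis&search=+"
    "&feature_uris=http://apps.ideaconsult.net:8080/ambit2/dataset/584486/feature"
    "&condition=!="
)


def build_csv_request(uri: str) -> Request:
    """Prepare a GET request asking for CSV (assumed content negotiation)."""
    return Request(uri, headers={"Accept": "text/csv"})


def download_csv(uri: str, path: str) -> None:
    """Send the request and write the response body to 'path'."""
    with urlopen(build_csv_request(uri)) as response, open(path, "wb") as out:
        out.write(response.read())


# Example call (needs network access; the file name is made up):
# download_csv(SEARCH_URI, "tcams_target_hypothesis.csv")
```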
Open the saved CSV file in a spreadsheet program and delete all columns except the SMILES column and the Target_Hypothesis column.
Now you should have the Target_Hypothesis in column 1 (or A), and the SMILES in column 2 (or B). If you are using Excel, go to the cell C2.
Type =IF(A2="Ser/Thr protein kinase", 1, 0) and hit “Enter”.
Again click on cell C2 to activate it.
Now double-click on the little black square at the bottom-right corner of the cell’s border to fill the column with this formula.
Name the newly created column "Kinase Inhibitor" by typing the name into cell C1.
Now, copy the whole column C, and paste it (at the same place) using Excel’s “Paste Special” function, pasting only the values.
Once that’s done, delete column A (holding the text entries for the Target_Hypothesis).
Save the resulting table as a CSV text file named TCAMS-kinase_full.csv.
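The spreadsheet steps above can also be sketched as a short script. It assumes that the exported CSV has header cells named exactly "SMILES" and "Target_Hypothesis"; check your file and adjust the two column arguments if the names differ. The function name to_kinase_table is illustrative, and the sample rows are made-up placeholders, not TCAMS data. Unlike the spreadsheet formula, the sketch skips rows with an empty target hypothesis as a safeguard, since such compounds must not be used as negatives.

```python
import csv
import io

# Target hypothesis value that marks a kinase inhibitor (from the tutorial).
POSITIVE_LABEL = "Ser/Thr protein kinase"


def to_kinase_table(source, target,
                    smiles_col="SMILES",
                    hypothesis_col="Target_Hypothesis"):
    """Write SMILES plus a 0/1 'Kinase Inhibitor' column to 'target'.

    Column names are assumptions about the exported CSV header.
    """
    reader = csv.DictReader(source)
    writer = csv.writer(target, lineterminator="\n")
    writer.writerow(["SMILES", "Kinase Inhibitor"])
    for row in reader:
        hypothesis = (row.get(hypothesis_col) or "").strip()
        if not hypothesis:
            continue  # safeguard: unannotated compounds are not negatives
        label = 1 if hypothesis == POSITIVE_LABEL else 0
        writer.writerow([row[smiles_col], label])


# Tiny illustrative input (placeholder structures, not real TCAMS rows).
sample = io.StringIO(
    "Target_Hypothesis,SMILES\n"
    "Ser/Thr protein kinase,CCO\n"
    "Adrenergic receptor antagonist,c1ccccc1\n"
)
result = io.StringIO()
to_kinase_table(sample, result)
print(result.getvalue())
```

For a real export, pass file objects opened with newline="" for the downloaded CSV and for TCAMS-kinase_full.csv instead of the in-memory buffers.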
In your web browser, navigate to www.toxcreate.org.
Read the instructions, and try to create a model using your dataset.
If successful, you can use ToxCreate to apply the model to example compounds (using the drawing tool in ToxCreate).