Frequently asked questions about the validation
What is the test_target_dataset(_uri)?
The test-target-dataset contains the actual endpoint values of the compounds in the test-dataset. The validation needs these actual values to compare them with the predictions of the model, i.e. to perform a validation. The param is required if the actual endpoint is not specified in the test-dataset; it is not needed if the test-dataset already contains the endpoint values.
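The lookup described above might be sketched as follows. This is only an illustration, not the validation service's code: the function name actual_values, the dict-based dataset layout, and the per-compound fallback are all assumptions made for this example.

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-in for a dataset:
# compound id -> {feature name: value}. Not the real dataset format.
Dataset = Dict[str, Dict[str, str]]


def actual_values(
    test_dataset: Dataset,
    endpoint: str,
    test_target_dataset: Optional[Dataset] = None,
) -> Dict[str, str]:
    """Return compound -> actual endpoint value for every test compound."""
    result = {}
    for compound, features in test_dataset.items():
        if endpoint in features:
            # The test dataset itself carries the endpoint value.
            result[compound] = features[endpoint]
        elif (
            test_target_dataset is not None
            and endpoint in test_target_dataset.get(compound, {})
        ):
            # Assumed fallback: take the value from the test-target-dataset.
            result[compound] = test_target_dataset[compound][endpoint]
        else:
            raise ValueError(f"no actual {endpoint!r} value for {compound!r}")
    return result


# Test dataset without endpoint values, so the target dataset is required.
test_dataset = {"c1": {"logP": "1.2"}, "c2": {"logP": "3.4"}}
test_target_dataset = {
    "c1": {"activity": "active"},
    "c2": {"activity": "inactive"},
}
resolved = actual_values(test_dataset, "activity", test_target_dataset)
```

If the test dataset already holds the endpoint values, the third argument can simply be left out; if neither dataset holds them, the sketch raises an error because there is nothing to compare the predictions against.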
How are duplicates handled in datasets?
Duplicate compounds are common in chemical datasets. A duplicate may occur, for example, because a compound was measured twice in an experiment, so the same compound can appear with different endpoint values.
Within the validation, each occurrence of a duplicate compound is handled as a separate compound prediction.
Consider the following example: a test dataset contains a duplicate compound with different values, once as active and once as inactive. The prediction algorithm will of course make only one prediction for this compound, i.e. it predicts it as either active or inactive (for simplicity, we omit the case where the compound is not predicted because it lies outside the applicability domain (AD)). The validation will then count the predicted compound twice, once as a correct prediction and once as an incorrect prediction.
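The counting in this example can be sketched in a few lines. The function count_predictions and the data layout (a list of compound/value pairs for the test dataset, a dict for the predictions) are hypothetical and chosen only to illustrate the behaviour described above.

```python
from collections import Counter
from typing import Dict, List, Tuple


def count_predictions(
    test_entries: List[Tuple[str, str]],
    predictions: Dict[str, str],
) -> Counter:
    """Count correct and incorrect predictions, one count per test entry."""
    counts = Counter()
    for compound, actual in test_entries:
        predicted = predictions.get(compound)
        if predicted is None:
            # Un-predicted compounds are skipped in this simplified sketch.
            continue
        counts["correct" if predicted == actual else "incorrect"] += 1
    return counts


# "c1" is a duplicate: measured once as active, once as inactive.
test_entries = [("c1", "active"), ("c1", "inactive"), ("c2", "inactive")]
# The model makes exactly one prediction per compound.
predictions = {"c1": "active", "c2": "inactive"}

result = count_predictions(test_entries, predictions)
print(result["correct"], result["incorrect"])  # prints: 2 1
```

The single prediction for c1 is compared against both of its entries, so it contributes one correct and one incorrect count, whichever class the model chooses.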