Chemoinformatics-based classification of prohibited substances employed for doping in sport

E. O. Cannon, A. Bender, D. S. Palmer, J. B. Mitchell

Research output: Contribution to journal (Article)

27 Citations (Scopus)

Abstract

Representative molecules from 10 classes of prohibited substances were taken from the World Anti-Doping Agency (WADA) list, augmented by molecules from corresponding activity classes found in the MDDR database. Together with some explicitly allowed compounds, these formed a set of 5245 molecules. Five types of fingerprints were calculated for these substances. The random forest classification method was used to predict membership of each prohibited class on the basis of each type of fingerprint, using 5-fold cross-validation. We also used a k-nearest neighbors (kNN) approach, which worked well for the smallest values of k. The most successful classifiers are based on Unity 2D fingerprints and give very similar Matthews correlation coefficients of 0.836 (kNN) and 0.829 (random forest). The kNN classifiers tend to give a higher recall of positives at the expense of lower precision. A naïve Bayesian classifier, however, lies much further toward the extreme of high recall and low precision. Our results suggest that it will be possible to produce a reliable and quantitative assignment of membership or otherwise of each class of prohibited substances. This should aid the fight against the use of bioactive novel compounds as doping agents, while also protecting athletes against unjust disqualification.
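
To make the evaluation measures named in the abstract concrete, the following is a minimal sketch of a k-nearest-neighbors classifier over binary fingerprints, scored with precision, recall, and the Matthews correlation coefficient (MCC). It is not the authors' implementation. The toy fingerprints, the use of Tanimoto similarity, the majority-vote rule, and the function names `tanimoto`, `knn_predict`, and `binary_metrics` are all assumptions made for illustration; the abstract does not state which similarity measure was used, and the 5-fold cross-validation is omitted here.

```python
from collections import Counter
from math import sqrt


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices.

    Assumption: the abstract does not name the similarity measure used.
    """
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def knn_predict(query, training, k=1):
    """Majority vote among the k training fingerprints most similar to query."""
    ranked = sorted(training, key=lambda item: tanimoto(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def binary_metrics(y_true, y_pred):
    """Precision, recall and MCC for labels 1 (prohibited class) and 0 (not)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # MCC is conventionally reported as 0 when the denominator vanishes.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical toy data: (on-bit indices, label); 1 = member of a prohibited class.
training = [
    ({1, 2, 3, 4}, 1),
    ({1, 2, 3, 5}, 1),
    ({7, 8, 9}, 0),
    ({7, 8, 10}, 0),
]
test_set = [
    ({1, 2, 3}, 1),
    ({1, 2, 5}, 1),
    ({7, 8}, 0),
    ({8, 9, 10}, 0),
    ({1, 2, 3, 9}, 0),
    ({7, 9, 10}, 0),
]

y_true = [label for _, label in test_set]
y_pred = [knn_predict(fp, training, k=1) for fp, _ in test_set]
metrics = binary_metrics(y_true, y_pred)
print(y_pred)
print(metrics)
```

On this toy set the classifier recovers every positive but also produces one false positive, so recall is higher than precision. MCC combines all four cells of the confusion matrix into a single score, which is why it is a useful summary when positives are much rarer than negatives.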
Language: Undefined/Unknown
Pages: 2369-2380
Number of pages: 12
Journal: Journal of Chemical Information and Modeling
Volume: 46
Issue number: 6
DOI: 10.1021/ci0601160
Publication status: Published - 2006

Keywords

  • chemoinformatics
  • doping
  • sport
  • molecular biology

Cite this

@article{faaf91265fed48e6b4c376bb381e51cf,
title = "Chemoinformatics-based classification of prohibited substances employed for doping in sport",
abstract = "Representative molecules from 10 classes of prohibited substances were taken from the World Anti-Doping Agency (WADA) list, augmented by molecules from corresponding activity classes found in the MDDR database. Together with some explicitly allowed compounds, these formed a set of 5245 molecules. Five types of fingerprints were calculated for these substances. The random forest classification method was used to predict membership of each prohibited class on the basis of each type of fingerprint, using 5-fold cross-validation. We also used a k-nearest neighbors (kNN) approach, which worked well for the smallest values of k. The most successful classifiers are based on Unity 2D fingerprints and give very similar Matthews correlation coefficients of 0.836 (kNN) and 0.829 (random forest). The kNN classifiers tend to give a higher recall of positives at the expense of lower precision. A na{\"i}ve Bayesian classifier, however, lies much further toward the extreme of high recall and low precision. Our results suggest that it will be possible to produce a reliable and quantitative assignment of membership or otherwise of each class of prohibited substances. This should aid the fight against the use of bioactive novel compounds as doping agents, while also protecting athletes against unjust disqualification.",
keywords = "chemoinformatics, doping, sport, molecular biology",
author = "Cannon, {E. O.} and Bender, A. and Palmer, {D. S.} and Mitchell, {J. B.}",
year = "2006",
doi = "10.1021/ci0601160",
language = "Undefined/Unknown",
volume = "46",
pages = "2369--2380",
journal = "Journal of Chemical Information and Modeling",
issn = "1549-9596",
publisher = "American Chemical Society",
number = "6",

}

Chemoinformatics-based classification of prohibited substances employed for doping in sport. / Cannon, E. O.; Bender, A.; Palmer, D. S.; Mitchell, J. B.

In: Journal of Chemical Information and Modeling, Vol. 46, No. 6, 2006, p. 2369-2380.

TY - JOUR

T1 - Chemoinformatics-based classification of prohibited substances employed for doping in sport

AU - Cannon, E. O.

AU - Bender, A.

AU - Palmer, D. S.

AU - Mitchell, J. B.

PY - 2006

Y1 - 2006

AB - Representative molecules from 10 classes of prohibited substances were taken from the World Anti-Doping Agency (WADA) list, augmented by molecules from corresponding activity classes found in the MDDR database. Together with some explicitly allowed compounds, these formed a set of 5245 molecules. Five types of fingerprints were calculated for these substances. The random forest classification method was used to predict membership of each prohibited class on the basis of each type of fingerprint, using 5-fold cross-validation. We also used a k-nearest neighbors (kNN) approach, which worked well for the smallest values of k. The most successful classifiers are based on Unity 2D fingerprints and give very similar Matthews correlation coefficients of 0.836 (kNN) and 0.829 (random forest). The kNN classifiers tend to give a higher recall of positives at the expense of lower precision. A naïve Bayesian classifier, however, lies much further toward the extreme of high recall and low precision. Our results suggest that it will be possible to produce a reliable and quantitative assignment of membership or otherwise of each class of prohibited substances. This should aid the fight against the use of bioactive novel compounds as doping agents, while also protecting athletes against unjust disqualification.

KW - chemoinformatics

KW - doping

KW - sport

KW - molecular biology

U2 - 10.1021/ci0601160

DO - 10.1021/ci0601160

M3 - Article

VL - 46

SP - 2369

EP - 2380

JO - Journal of Chemical Information and Modeling

T2 - Journal of Chemical Information and Modeling

JF - Journal of Chemical Information and Modeling

SN - 1549-9596

IS - 6

ER -