Random forest models to predict aqueous solubility

D. S. Palmer, N. M. O'Boyle, R. C. Glen, J. B. Mitchell

Research output: Contribution to journal › Article

148 Citations (Scopus)

Abstract

Random Forest regression (RF), Partial-Least-Squares (PLS) regression, Support Vector Machines (SVM), and Artificial Neural Networks (ANN) were used to develop QSPR models for the prediction of aqueous solubility, based on experimental data for 988 organic molecules. The Random Forest regression model predicted aqueous solubility more accurately than those created by PLS, SVM, and ANN and offered methods for automatic descriptor selection, an assessment of descriptor importance, and an in-parallel measure of predictive ability, all of which serve to recommend its use. The prediction of log molar solubility for an external test set of 330 molecules that are solid at 25 °C gave r² = 0.89 and RMSE = 0.69 log S units. For a standard data set selected from the literature, the model performed well with respect to other documented methods. Finally, the diversity of the training and test sets is compared to the chemical space occupied by molecules in the MDL Drug Data Report, on the basis of molecular descriptors selected by the regression analysis.
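
The abstract reports test-set performance as r² and RMSE in log S units. As a minimal sketch of how two such statistics can be computed, the Python below assumes that r² means the squared Pearson correlation between observed and predicted values (the abstract does not say which definition was used). The function names and the four data points are hypothetical and are not taken from the paper.

```python
import math


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values.

    Assumption: this is one common reading of "r2"; the abstract does not
    state which definition the authors used.
    """
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return (cov * cov) / (var_o * var_p)


def rmse(observed, predicted):
    """Root-mean-square error, in the same units as the inputs (here log S)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Hypothetical log S values, made up for illustration only.
observed = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.5, -2.5, -3.5, -4.5]

r2_value = r_squared(observed, predicted)
rmse_value = rmse(observed, predicted)
print(f"r2 = {r2_value:.2f}, RMSE = {rmse_value:.2f} log S units")
```

In this made-up example every prediction is offset by the same 0.5 log units, so the squared correlation is perfect while the RMSE is 0.5. That is one reason the two statistics are usually quoted together: a correlation-based r² is blind to systematic bias, and RMSE is not.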

Language: Undefined/Unknown
Pages: 150-158
Number of pages: 9
Journal: Journal of Chemical Information and Modeling
Volume: 47
Issue number: 1
DOI: 10.1021/ci060164k
Publication status: Published - 2007

Keywords

  • aqueous solubility
  • forest models
  • random forest regression

Cite this

Palmer, D. S.; O'Boyle, N. M.; Glen, R. C.; Mitchell, J. B. / Random forest models to predict aqueous solubility. In: Journal of Chemical Information and Modeling. 2007; Vol. 47, No. 1. pp. 150-158.
@article{2722daddde1b48eeaf6772c839a05d6e,
title = "Random forest models to predict aqueous solubility",
abstract = "Random Forest regression (RF), Partial-Least-Squares (PLS) regression, Support Vector Machines (SVM), and Artificial Neural Networks (ANN) were used to develop QSPR models for the prediction of aqueous solubility, based on experimental data for 988 organic molecules. The Random Forest regression model predicted aqueous solubility more accurately than those created by PLS, SVM, and ANN and offered methods for automatic descriptor selection, an assessment of descriptor importance, and an in-parallel measure of predictive ability, all of which serve to recommend its use. The prediction of log molar solubility for an external test set of 330 molecules that are solid at 25 degrees C gave an r2 = 0.89 and RMSE = 0.69 log S units. For a standard data set selected from the literature, the model performed well with respect to other documented methods. Finally, the diversity of the training and test sets are compared to the chemical space occupied by molecules in the MDL drug data report, on the basis of molecular descriptors selected by the regression analysis.",
keywords = "aqueous solubility, forest models, random forest regression",
author = "Palmer, {D. S.} and O'Boyle, {N. M.} and Glen, {R. C.} and Mitchell, {J. B.}",
year = "2007",
doi = "10.1021/ci060164k",
language = "Undefined/Unknown",
volume = "47",
pages = "150--158",
journal = "Journal of Chemical Information and Modeling",
issn = "1549-9596",
publisher = "American Chemical Society",
number = "1",

}
