Abstract
Solubility is a physicochemical property that plays a critical role in pharmaceutical formulation and processing. While COSMO-RS offers physics-based solubility estimates, its computational cost limits large-scale application. Building on earlier attempts to incorporate COSMO-RS-derived solubilities into machine learning (ML) models, we present a substantially expanded and systematic hybrid QSAR framework that advances the field in several ways. A direct comparison between COSMOtherm and openCOSMO revealed consistent hybrid augmentation across COSMO engines and enhanced reproducibility. Three widely used ML algorithms, eXtreme Gradient Boosting, Random Forest, and Support Vector Machine, were benchmarked under both 10-fold and leave-one-solute-out cross-validation. A comparison of four major descriptor sets (MOE, Mordred, and RDKit descriptors, and Morgan fingerprints) offers the first descriptor-level assessment of how augmentation with COSMO-RS-calculated solubility interacts with diverse chemical feature spaces. Y-scrambling tests were conducted to confirm that the hybrid improvements are genuine and not artefacts of dimensionality. SHAP-based feature analysis further revealed substructural patterns linked to solubility, providing interpretability and mechanistic insight. This study demonstrates that combining physics-informed features with robust, interpretable ML algorithms enables scalable and generalisable solubility prediction, supporting data-driven pharmaceutical design.
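The validation protocol named in the abstract (leave-one-solute-out cross-validation plus Y-scrambling) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' code: the data are synthetic, the single feature `x` is a hypothetical stand-in for a COSMO-RS-derived solubility estimate, and a one-variable least-squares line replaces the XGBoost, Random Forest, and SVM models used in the study.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def leave_one_solute_out(records):
    """Yield (held_out_solute, train, test); no solute appears in both splits."""
    for solute in sorted({r["solute"] for r in records}):
        train = [r for r in records if r["solute"] != solute]
        test = [r for r in records if r["solute"] == solute]
        yield solute, train, test


def loso_rmse(records, scramble=False, seed=0):
    """Pooled RMSE over all held-out solutes.

    With scramble=True the training targets are shuffled before fitting
    (Y-scrambling), which destroys any real feature-target relationship.
    """
    rng = random.Random(seed)
    sq_errors = []
    for _, train, test in leave_one_solute_out(records):
        xs = [r["x"] for r in train]
        ys = [r["y"] for r in train]
        if scramble:
            ys = ys[:]
            rng.shuffle(ys)
        a, b = fit_line(xs, ys)
        sq_errors += [(a * r["x"] + b - r["y"]) ** 2 for r in test]
    return mean(sq_errors) ** 0.5


def make_toy_data(seed=42):
    """Synthetic data: 5 hypothetical solutes, 8 solvent measurements each."""
    rng = random.Random(seed)
    records = []
    for solute in ["A", "B", "C", "D", "E"]:
        for _ in range(8):
            x = rng.uniform(-4, 0)           # stand-in physics-based estimate
            y = 0.9 * x + rng.gauss(0, 0.2)  # stand-in "measured" log-solubility
            records.append({"solute": solute, "x": x, "y": y})
    return records


records = make_toy_data()
true_rmse = loso_rmse(records)
# Average over several shuffles so the comparison does not hinge on one permutation.
scrambled_rmse = mean(loso_rmse(records, scramble=True, seed=s) for s in range(20))
print(f"LOSO RMSE: {true_rmse:.3f}; Y-scrambled LOSO RMSE: {scrambled_rmse:.3f}")
```

If a model's error on held-out solutes is clearly lower than the error obtained after scrambling the targets, the apparent performance is unlikely to be an artefact of the number of features. Grouping the splits by solute, rather than splitting rows at random, tests generalisation to compounds the model has never seen.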
| Original language | English |
|---|---|
| Number of pages | 18 |
| Journal | Digital Discovery |
| Early online date | 7 Jan 2026 |
| Publication status | E-pub ahead of print - 7 Jan 2026 |
Funding
All authors acknowledge support from the EPSRC “Dialling up performance for on-demand manufacturing” (EP/W017032/1). AKC and IC acknowledge additional support from the EPSRC's Physical Sciences Data Infrastructure Phase 1b (EP/X032701/1).
Keywords
- machine learning
- pharmaceutical formulation
- organic solvents
- drug solubility
Projects
- 1 Active
- Dialling up 3D Printing Performance for On Demand Manufacturing (Programme Grant)
Florence, A. (Principal Investigator) & Johnston, B. (Co-investigator)
EPSRC (Engineering and Physical Sciences Research Council)
1/10/22 → 30/09/27
Project: Research
Research output
- 1 Correction
- Correction: A case study on hybrid machine learning and quantum-informed modelling for solubility prediction of drug compounds in organic solvents
Wang, W., Cooley, I., Alexander, M. R., Wildman, R. D., Croft, A. K. & Johnston, B. F., 4 Feb 2026, (E-pub ahead of print) In: Digital Discovery.
Research output: Contribution to journal › Correction
Open Access