A case study on hybrid machine learning and quantum-informed modelling for solubility prediction of drug compounds in organic solvents

Weiling Wang, Isabel Cooley, Morgan R. Alexander, Ricky D. Wildman, Anna K. Croft, Blair F. Johnston*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Solubility is a physicochemical property that plays a critical role in pharmaceutical formulation and processing. While COSMO-RS offers physics-based solubility estimates, its computational cost limits large-scale application. Building on earlier attempts to incorporate COSMO-RS-derived solubilities into machine learning (ML) models, we present a substantially expanded and systematic hybrid QSAR framework that advances the field in several ways. A direct comparison between COSMOtherm and openCOSMO revealed consistent hybrid augmentation across COSMO engines and enhanced reproducibility. Three widely used ML algorithms, eXtreme Gradient Boosting, Random Forest, and Support Vector Machine, were benchmarked under both 10-fold and leave-one-solute-out cross-validation. A comparison of four major descriptor sets (MOE, Mordred, RDKit descriptors, and Morgan fingerprints) offers the first descriptor-level assessment of how augmentation with COSMO-RS-calculated solubility interacts with diverse chemical feature spaces. Y-scrambling was conducted to confirm that the hybrid improvements are genuine and not artefacts of dimensionality. SHAP-based feature analysis further revealed substructural patterns linked to solubility, providing interpretability and mechanistic insight. This study demonstrates that combining physics-informed features with robust, interpretable ML algorithms enables scalable and generalisable solubility prediction, supporting data-driven pharmaceutical design.
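
The two validation ideas named in the abstract, leave-one-solute-out cross-validation and Y-scrambling, can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic, the single feature `cosmo_estimate` is a hypothetical stand-in for a COSMO-RS-calculated solubility, and a one-feature least-squares line replaces the XGBoost, Random Forest, and SVM models used in the study. All function and variable names are assumptions made for illustration.

```python
import random
from statistics import mean


def leave_one_solute_out(solutes):
    """Yield (held_out, train_idx, test_idx); each fold holds out every row of one solute."""
    for held_out in sorted(set(solutes)):
        train = [i for i, s in enumerate(solutes) if s != held_out]
        test = [i for i, s in enumerate(solutes) if s == held_out]
        yield held_out, train, test


def fit_line(x, y):
    """Ordinary least squares for y = a*x + b (a trivial stand-in for a real ML model)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx if sxx else 0.0
    return a, my - a * mx


def loso_rmse(x, y, solutes):
    """Pooled RMSE over all leave-one-solute-out folds."""
    sq_err = []
    for _, train, test in leave_one_solute_out(solutes):
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        sq_err += [(a * x[i] + b - y[i]) ** 2 for i in test]
    return mean(sq_err) ** 0.5


def y_scrambled_rmse(x, y, solutes, n_rounds=20, seed=0):
    """Repeat the same validation with shuffled targets to estimate chance-level error."""
    shuffler = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = y[:]
        shuffler.shuffle(shuffled)
        scores.append(loso_rmse(x, shuffled, solutes))
    return scores


# Synthetic toy data (assumption): 6 solutes, each measured in 5 solvents.
rng = random.Random(42)
solutes = [f"solute_{k}" for k in range(6) for _ in range(5)]
cosmo_estimate = [rng.uniform(-4.0, 0.0) for _ in solutes]
log_solubility = [0.9 * c + 0.2 + rng.gauss(0.0, 0.1) for c in cosmo_estimate]

real_rmse = loso_rmse(cosmo_estimate, log_solubility, solutes)
scrambled = y_scrambled_rmse(cosmo_estimate, log_solubility, solutes)
print(f"real RMSE: {real_rmse:.3f}")
print(f"best scrambled RMSE: {min(scrambled):.3f}")
```

If the model has learned a genuine relationship, the error on the real targets should sit well below every Y-scrambled run. Grouping the folds by solute means that no held-out compound contributes rows to training, which is a stricter test of generalisation than a random 10-fold split.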
Original language: English
Number of pages: 18
Journal: Digital Discovery
Early online date: 7 Jan 2026
Publication status: E-pub ahead of print - 7 Jan 2026

Funding

All authors acknowledge support from the EPSRC “Dialling up performance for on-demand manufacturing” (EP/W017032/1). AKC and IC acknowledge additional support from the EPSRC's Physical Sciences Data Infrastructure Phase 1b (EP/X032701/1).

Keywords

  • machine learning
  • pharmaceutical formulation
  • organic solvents
  • drug solubility
