Machine learning models for the prediction of pharmaceutical powder properties

Student thesis: Doctoral Thesis


Understanding how particle attributes affect pharmaceutical manufacturing process performance remains a significant challenge for the industry, adding cost and time to the development of robust products and production routes. Tablets can be formed by several techniques; however, direct compression (DC) and granulation are the most widely used in industrial operations. DC is of particular interest as it offers lower-cost manufacturing and a streamlined process with fewer steps compared with other unit operations. However, achieving the full potential benefits of DC for tablet manufacture places strict demands on material flow properties, blend uniformity, compactability, and lubrication, all of which need to be satisfied. DC is increasingly the preferred technique for pharmaceutical companies for oral solid dose manufacture, which makes predicting the flow of pharmaceutical materials increasingly important. Bulk properties are influenced by particle attributes, such as particle size and shape, which are defined during crystallization and/or milling processes. Currently, assessing the suitability of raw materials and/or formulated blends for DC requires detailed characterization of the bulk properties. A key goal of digital design and Industry 4.0 concepts is, through digital transformation of existing development steps, to better predict properties whilst minimizing the amount of material and resources required to inform process selection during early-stage development.

The work presented in Chapter 4 focuses on developing machine learning (ML) models to predict the powder flow behaviour of routine, widely available pharmaceutical materials. Several datasets comprising powder attributes (particle size, shape, surface area, surface energy, and bulk density) and flow properties (flow function coefficient) have been built for pure compounds, binary mixtures, and multicomponent formulations.
Using these datasets, a range of traditional ML classification and regression approaches (random forest, support vector machines, k-nearest neighbour, gradient boosting, AdaBoost, Naïve Bayes, and logistic regression) have been explored for the prediction of flow properties via the flow function coefficient. The models have been evaluated using multiple sampling methods and validated using external datasets, showing a performance of over 80%, which is sufficiently high for their implementation to improve manufacturing efficiency. Finally, interpretability methods, namely SHAP (SHapley Additive exPlanations), have been used to understand the predictions of the ML models by determining how much each variable included in the training dataset contributed to each final prediction.

Chapter 5 expands on the work presented in Chapter 4 by demonstrating the applicability of ML models for classifying the viability of pharmaceutical formulations for continuous DC, based on their powder flow via the flow function coefficient. More than 100 formulations were included in this model. The particle size and particle shape of the active pharmaceutical ingredients (APIs), the flow function coefficient of the APIs, and the concentration of the components of the formulations were used to build the training dataset. The ML models were evaluated using different sampling techniques, such as bootstrap sampling and 10-fold cross-validation, achieving a precision of 90%.

Furthermore, Chapter 6 presents a comparison of two data-driven modelling approaches to predict powder flow: a Random Forest (RF) model and a Convolutional Neural Network (CNN) model. A total of 98 powders covering a wide range of particle sizes and shapes were assessed using static image analysis. The RF model was trained on the tabular data (particle size, aspect ratio, and circularity descriptors), and the CNN model was trained on the composite images.
Both datasets were extracted from the same characterisation instrument. The data were split into training, testing, and validation sets, and the validation results were used to compare the performance of the two approaches. The two algorithms performed similarly, with the RF model and the CNN model both achieving an accuracy of 55%.

Finally, other particle and bulk properties, i.e., bulk density, surface area, and surface energy, and their impact on the manufacturability and bioavailability of the drug product are explored in Chapter 7. The bulk density models achieved a high performance of 82%, the surface area models achieved a performance of 80%, and the surface energy models achieved a performance of 60%. The results of the models presented in this chapter pave the way to unified guidelines for moving towards end-to-end continuous manufacturing by linking the manufacturability requirements and the bioavailability requirements.
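
The flow function coefficient (FFC) used as the prediction target throughout can be illustrated with a minimal sketch. FFC is the ratio of the major principal consolidation stress to the unconfined yield strength measured in a shear cell test. The class boundaries below follow the commonly cited Jenike convention and are an assumption here, since the abstract does not state how FFC values were binned. The function names and example values are hypothetical, and the precision helper only illustrates the kind of metric reported for the classification models.

```python
# Illustrative sketch only: the class boundaries follow the commonly cited
# Jenike convention and are an assumption, not the binning stated in the thesis.
from bisect import bisect_right

# FFC boundaries and the class labels for the intervals they delimit.
FFC_BOUNDS = [1.0, 2.0, 4.0, 10.0]
FFC_CLASSES = [
    "not flowing",    # FFC < 1
    "very cohesive",  # 1 <= FFC < 2
    "cohesive",       # 2 <= FFC < 4
    "easy flowing",   # 4 <= FFC < 10
    "free flowing",   # FFC >= 10
]


def flow_function_coefficient(consolidation_stress_kpa, unconfined_yield_strength_kpa):
    """Return FFC = major principal consolidation stress / unconfined yield strength."""
    if unconfined_yield_strength_kpa <= 0:
        raise ValueError("unconfined yield strength must be positive")
    return consolidation_stress_kpa / unconfined_yield_strength_kpa


def flow_class(ffc):
    """Map an FFC value to a flow class label using FFC_BOUNDS."""
    return FFC_CLASSES[bisect_right(FFC_BOUNDS, ffc)]


def precision(y_true, y_pred, positive):
    """Fraction of samples predicted as `positive` whose true label is `positive`."""
    true_labels_of_predicted = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not true_labels_of_predicted:
        return 0.0
    hits = sum(t == positive for t in true_labels_of_predicted)
    return hits / len(true_labels_of_predicted)


if __name__ == "__main__":
    # Hypothetical shear cell reading: 9 kPa consolidation, 3 kPa yield strength.
    ffc = flow_function_coefficient(9.0, 3.0)
    print(ffc, flow_class(ffc))
```

In a classification setting, labels produced by a mapping like `flow_class` would serve as the targets, and a metric such as `precision` would be computed on held-out folds or external data.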
Date of Award: 27 Jul 2023
Original language: English
Awarding Institution
  • University of Strathclyde
Sponsors: EPSRC (Engineering and Physical Sciences Research Council)
Supervisors: Alastair Florence & Cameron Brown
