Over recent years, deep learning (DL) has become more widely used in cancer diagnostics. However, DL often requires large training datasets to prevent overfitting, and such datasets can be difficult and expensive to acquire. Data augmentation is a method for generating new data points with which to train DL models. In this study, we use attenuated total reflectance Fourier-transform infrared (ATR-FTIR) spectra of dried serum samples from patients and compare non-generative data augmentation methods with Wasserstein generative adversarial networks (WGANs) in their ability to improve the performance of a convolutional neural network (CNN) in differentiating between pancreatic cancer and non-cancer samples in a total cohort of 625 patients. The results show that WGAN-augmented spectra improve CNN performance more than non-generatively augmented spectra. Compared with a model that utilised no augmented spectra, adding WGAN-augmented spectra to a CNN with the same architecture and parameters increased the area under the receiver operating characteristic curve (AUC) from 0.661 to 0.757, a relative increase of approximately 15% in diagnostic performance. In a separate test on a colorectal cancer dataset, data augmentation using a WGAN increased the AUC from 0.905 to 0.955. These results demonstrate the impact that data augmentation can have on DL performance for cancer diagnosis when the amount of real data available for model training is limited.
- deep learning (DL)
- cancer diagnostics
- data augmentation
- generative adversarial networks (GANs)
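
The relative gains implied by the AUC values quoted in the abstract can be checked with a few lines of Python. This is a minimal sketch for illustration only: the helper name `relative_increase` and the dictionary `reported_auc` are our own hypothetical choices and are not part of the study's code. The only inputs are the four AUC values stated in the abstract.

```python
def relative_increase(before: float, after: float) -> float:
    """Return the relative change from `before` to `after` as a percentage."""
    return (after - before) / before * 100.0


# AUC values quoted in the abstract: (without, with) WGAN-augmented spectra.
# The dictionary keys are illustrative labels, not dataset names from the study.
reported_auc = {
    "pancreatic": (0.661, 0.757),
    "colorectal": (0.905, 0.955),
}

for cohort, (baseline, augmented) in reported_auc.items():
    gain = relative_increase(baseline, augmented)
    print(f"{cohort}: {baseline:.3f} -> {augmented:.3f} ({gain:.1f}% relative increase)")
```

For the pancreatic cohort this gives about 14.5%, which rounds to the 15% quoted in the abstract. The same calculation for the colorectal dataset gives about 5.5%. Both figures are relative to the baseline AUC, not absolute differences in AUC.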