Abstract
Introduction: Missing PM2.5 observations in environmental monitoring systems, caused by sensor malfunctions, communication failures, maintenance issues, and coverage gaps, compromise public health assessments and evidence-based air quality policymaking. Reliable imputation strategies are therefore essential to preserve data integrity and analytical validity.
Methods: This study evaluated five imputation techniques, Bayesian Regression (BR), K-Nearest Neighbors (KNN), missForest, Predictive Mean Matching (PMM), and Random Forest (RF), using daily PM2.5 measurements collected between May 2019 and December 2024 from monitoring stations in Islamabad, Karachi, Lahore, and Peshawar, Pakistan. Three missing data mechanisms, missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR), were simulated at missing rates ranging from 5% to 25%. Model performance was assessed using Root Mean Square Error (RMSE) and Mean Absolute Error (MAE).
Results: Imputation under the MAR mechanism consistently yielded lower error values as missingness increased. Across all mechanisms and missing rates, missForest and KNN demonstrated superior performance. Notably, missForest achieved the lowest RMSE and MAE values overall and effectively preserved the temporal structure, range, and variability of the PM2.5 series.
Discussion: The findings suggest that machine-learning-based approaches, particularly missForest, provide robust and reliable imputation for PM2.5 datasets with varying missingness patterns. These results support the use of missForest as a preferred method for handling incomplete air quality data in similar monitoring contexts, thereby strengthening the reliability of environmental health analyses and air quality policy development.
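The evaluation design described in the Methods can be illustrated with a minimal sketch: hide a known fraction of values, impute them, and score the imputed values against the hidden truth with RMSE and MAE. Everything in it is an assumption made for illustration and not the study's code or data: the synthetic sinusoidal series, the helper names (`mask_mcar`, `impute_mean`, `rmse`, `mae`), the 5-point step between missing rates, and the mean-fill imputer, which stands in for the five methods compared in the study. Only the MCAR mechanism is shown; MAR and MNAR would make the masking probability depend on observed or unobserved values, respectively.

```python
import math
import random


def mask_mcar(values, rate, seed=0):
    """Hide a fraction `rate` of entries uniformly at random (MCAR)."""
    rng = random.Random(seed)
    n_missing = round(rate * len(values))
    hidden = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in hidden else v for i, v in enumerate(values)]
    return masked, sorted(hidden)


def impute_mean(masked):
    """Placeholder imputer (assumption): fill gaps with the observed mean."""
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in masked]


def rmse(truth, estimate, idx):
    """Root Mean Square Error over the hidden positions only."""
    return math.sqrt(sum((truth[i] - estimate[i]) ** 2 for i in idx) / len(idx))


def mae(truth, estimate, idx):
    """Mean Absolute Error over the hidden positions only."""
    return sum(abs(truth[i] - estimate[i]) for i in idx) / len(idx)


# Synthetic two-year daily series with a seasonal cycle (NOT the study's data).
series = [40 + 15 * math.sin(2 * math.pi * d / 365) for d in range(730)]

# Missing rates from 5% to 25%; the 5-point step is an assumption.
results = {}
for rate in (0.05, 0.10, 0.15, 0.20, 0.25):
    masked, hidden = mask_mcar(series, rate, seed=42)
    filled = impute_mean(masked)
    results[rate] = (rmse(series, filled, hidden), mae(series, filled, hidden))

for rate, (r, m) in results.items():
    print(f"rate={rate:.2f}  RMSE={r:.3f}  MAE={m:.3f}")
```

Scoring only the hidden positions keeps the metrics from being diluted by values that were never missing. Swapping `impute_mean` for another imputer leaves the rest of the loop unchanged.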
| Original language | English |
|---|---|
| Article number | 1775982 |
| Number of pages | 15 |
| Journal | Frontiers in Environmental Science |
| Volume | 14 |
| DOIs | |
| Publication status | Published - 19 Feb 2026 |
Keywords
- air quality monitoring
- machine learning
- missForest
- Pakistan
- PM2.5 missing data imputation
Article title: Development of a novel imputation framework for PM2.5 particle data in Pakistani cities using machine learning and statistical techniques