Details

Title

Preparation of water quality dataset using outlier detection and imputation methods

Journal title

Archives of Environmental Protection

Yearbook

2026

Volume

52

Issue

1

Authors

Affiliation

YETİK, Mehmet Kazim: Karabük University, Turkey

Keywords

water quality; outlier detection; missing data imputation; boxplot; normalization; Mean Squared Error (MSE)

Divisions of PAS

Technical Sciences

Coverage

3-15

Publisher

Polish Academy of Sciences

Date

25.02.2026

Type

Article

Identifier

DOI: 10.24425/aep.2026.158379

Abstracting & Indexing

Archives of Environmental Protection is covered by the following services:


AGRICOLA (National Agricultural Library)

Arianta

Baidu

BazTech

BIOSIS Citation Index

CABI

CAS

DOAJ

EBSCO

Engineering Village

GeoRef

Google Scholar

Index Copernicus

Journal Citation Reports™

Journal TOCs

KESLI-NDSL

Naviga

ProQuest

SCOPUS

Reaxys

Ulrich's Periodicals Directory

WorldCat

Web of Science
