Self-optimization of training dataset improves forecasting of cyanobacterial bloom by machine learning

Jayun Kim, Woosik Jung, Jusuk An, Hyun Je Oh, Joonhong Park

Research output: Contribution to journal › Article › peer-review


Data-driven model (DDM) prediction of aquatic ecological responses, such as cyanobacterial harmful algal blooms (CyanoHABs), is critically influenced by the choice of training dataset. However, a systematic method for choosing the optimal training dataset that accounts for data history has not yet been developed. A comprehensive procedure built around an algorithm that selects the optimal training dataset automatically would allow a DDM to improve its own performance. In this study, a novel algorithm was developed to self-generate possible training dataset candidates from the available input and output variable data and to self-choose the optimal training dataset that maximizes CyanoHAB forecasting performance. Nine years of meteorological and water quality data (input) and CyanoHAB data (output) from a site on the Nakdong River, South Korea, were acquired and pretreated via an automated process. An artificial neural network (ANN) was chosen from among the DDM candidates by first-cut training and validation using the entire collected dataset. Optimal training datasets for the ANN were self-selected from among the self-generated candidates by systematically simulating performance across 46 periods and 40 sizes (number of data elements) of the generated training datasets. The best-performing models were screened to identify the candidate models. The best performance corresponded to 6–7 years of training data (∼18% lower error) for forecasting lead times (FLTs) of 1–28 d. After the hyperparameters of the screened model candidates were fine-tuned, the best-performing model (7 years of data with a 14 d FLT) was self-determined by comparing the forecasts with unseen CyanoHAB events. The self-determined model could reasonably predict CyanoHABs occurring in Korean waters (cyanobacteria cells/mL ≥ 1000). Thus, the proposed method of self-optimizing the training dataset effectively improved the predictive accuracy and operational efficiency of DDM prediction of CyanoHABs.
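The core idea in the abstract (generate candidate training datasets that differ in period and size, train on each, and keep the one that forecasts best) can be illustrated with a minimal Python sketch. This is not the authors' implementation. A one-variable least-squares line stands in for the ANN, the data are synthetic, and every name here (`lagged_pairs`, `fit_line`, `candidate_windows`, `select_training_window`, the `sizes` and `step` parameters) is a hypothetical chosen for illustration.

```python
import math


def lagged_pairs(series, lead):
    """Pair each value with the value `lead` steps later (the forecasting lead time)."""
    return [(series[t], series[t + lead]) for t in range(len(series) - lead)]


def fit_line(samples):
    """Least-squares fit of y = a + b*x; a simple stand-in for the ANN (assumption)."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:  # constant input: fall back to predicting the mean
        return mean_y, 0.0
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    return mean_y - slope * mean_x, slope


def rmse(model, samples):
    """Root-mean-square error of the fitted line on (x, y) samples."""
    a, b = model
    return math.sqrt(sum((a + b * x - y) ** 2 for x, y in samples) / len(samples))


def candidate_windows(n_samples, sizes, step):
    """Yield (start, size) for every contiguous window of each requested size."""
    for size in sizes:
        for start in range(0, n_samples - size + 1, step):
            yield start, size


def select_training_window(train_pool, validation, sizes, step):
    """Train on every candidate window and return the one with the lowest validation RMSE."""
    results = []
    for start, size in candidate_windows(len(train_pool), sizes, step):
        model = fit_line(train_pool[start:start + size])
        results.append((rmse(model, validation), start, size))
    return min(results), results


# Synthetic data: the input-output relationship changes at t = 50, so older
# records mislead a model that is asked to forecast recent conditions.
pool = [(t % 10, 2 * (t % 10) if t < 50 else 5 * (t % 10) + 1) for t in range(100)]
validation = [(x, 5 * x + 1) for x in range(10)]

best, results = select_training_window(pool, validation, sizes=[20, 40], step=10)
print("best (rmse, start, size):", best)
```

In this toy setting the selected window lies entirely in the recent regime, which mirrors the abstract's observation that the most useful training set is not necessarily the full record. A real application would replace `fit_line` with the chosen DDM, build samples for each lead time with something like `lagged_pairs`, and evaluate on held-out bloom events.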

Original language: English
Article number: 161398
Journal: Science of the Total Environment
Publication status: Published - 2023 Mar 25

Bibliographical note

Publisher Copyright:
© 2023

All Science Journal Classification (ASJC) codes

  • Environmental Engineering
  • Environmental Chemistry
  • Waste Management and Disposal
  • Pollution

