Astronomy and Computing 48 (2024) 100857
Full length article
Predicting sunspot number from topological features in spectral images I:
Machine learning approach
D. Sierra-Porta a,∗, M. Tarazona-Alvarado b, D.D. Herrera Acevedo c
a Universidad Tecnológica de Bolívar, Facultad de Ciencias Básicas, Parque Industrial y Tecnológico Carlos Vélez Pombo Km 1 Vía Turbaco, Cartagena de Indias, 130010, Bolívar, Colombia
b Universidad Industrial de Santander, Escuela de Física, Car 27 #9, Bucaramanga, 680001, Santander, Colombia
c Universidad Tecnológica de Bolívar, Facultad de Ingeniería, Parque Industrial y Tecnológico Carlos Vélez Pombo Km 1 Vía Turbaco, Cartagena de Indias, 130010, Bolívar, Colombia
ARTICLE INFO
Keywords: Machine learning; Sunspots prediction; Spectral images; Sun’s dynamics; Fractal features
ABSTRACT
This study presents an advanced machine learning approach to predict the number of sunspots using a
comprehensive dataset derived from solar images provided by the Solar and Heliospheric Observatory (SOHO).
The dataset encompasses various spectral bands, capturing the complex dynamics of solar activity and
facilitating interdisciplinary analyses with other solar phenomena. We employed five machine learning models:
Random Forest Regressor, Gradient Boosting Regressor, Extra Trees Regressor, Ada Boost Regressor, and
Hist Gradient Boosting Regressor, to predict sunspot numbers. These models utilized four key heliospheric
variables, namely Proton Density, Temperature, Bulk Flow Speed and Interplanetary Magnetic Field (IMF),
alongside 14 newly introduced topological variables. These topological features were extracted from solar
images using different filters, including HMIIGR, HMIMAG, EIT171, EIT195, EIT284, and EIT304. In total,
60 models were constructed, both incorporating and excluding the topological variables. Our analysis reveals
that models incorporating the topological variables achieved significantly higher accuracy, with the r2-score
improving from approximately 0.30 to 0.93 on average. The Extra Trees Regressor (ET) emerged as the
best-performing model, demonstrating superior predictive capabilities across all datasets. These results underscore
the potential of combining machine learning models with additional topological features from spectral analysis,
offering deeper insights into the complex dynamics of solar activity and enhancing the precision of sunspot
number predictions. This approach provides a novel methodology for improving space weather forecasting and
contributes to a more comprehensive understanding of solar-terrestrial interactions.
∗ Corresponding author.
E-mail address: dporta@utb.edu.co (D. Sierra-Porta).
https://doi.org/10.1016/j.ascom.2024.100857
Received 12 March 2024; Accepted 15 July 2024
Available online 19 July 2024
2213-1337/© 2024 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
The investigation into solar activity and its repercussions on Earth’s
climate (Solomon et al., 2019; Le Mouël et al., 2019; Floyd et al.,
2002; Zhang et al., 2021; Singh and Bhargawa, 2020) and technology
(Spencer et al., 2019) has been a focal point of extensive research
for many years. Sunspots, characterized by regions of reduced brightness
on the Sun’s surface (Tlatov, 2022; Nandy, 2021), serve as a
crucial indicator of solar activity. They are associated with various
phenomena such as solar flares, coronal mass ejections, and geomagnetic
storms (Gour et al., 2021; Cliver et al., 2022; Alexakis and
Mavromichalaki, 2019). Accurately forecasting the quantity of sunspots
is, therefore, essential for understanding and mitigating the impacts of
solar activity on our planet.
Predicting the number of sunspots is crucial for several reasons.
Sunspots are key indicators of solar activity and are instrumental in
determining solar cycles, including their maxima, minima, and overall
duration. Understanding solar cycles allows scientists to predict space
weather events, which can have significant impacts on Earth’s climate
and technological systems (National Research Council and Division
on Engineering and Physical Sciences and Space Studies Board and
Committee on the Societal and Economic Impacts of Severe Space
Weather Events and A Workshop, 2009; National Research Council and
Division on Engineering and Physical Sciences and Aeronautics and
Space Engineering Board and Space Studies Board and Committee on
a Decadal Strategy for Solar and Space Physics (Heliophysics), 2013).
For instance, high solar activity can lead to geomagnetic storms that
disrupt communication and navigation systems, damage satellites, and
affect power grids. Additionally, long-term variations in solar activity
have been linked to climate patterns on Earth (Drews et al., 2022;
Tsiropoula, 2003), such as temperature fluctuations and atmospheric
circulation changes. Accurate sunspot predictions contribute to better
preparedness for these events, thereby mitigating their adverse effects
on society and infrastructure.
In astronomy, predicting solar cycles is also crucial for the planning
of surveys and nighttime observation schedules to minimize inter-
ference from cosmic rays and scattered plasma in the interplanetary
medium (Grauer and Grauer, 2021; Barentine, 2022). Accurate predic-
tions enable astronomers to select optimal observation times, reducing
noise and enhancing data quality. Additionally, understanding solar
cycles aids in developing resilience and adaptive strategies for ex-
treme events that impact terrestrial technological systems, ensuring
the continued functionality of communication, navigation, and power
infrastructures during periods of intense solar activity.
Recent work has shown significant improvement in the use of machine
learning models (Khan et al., 2020; Xiao, 2021), as well as deep
learning methods (Pala and Atici, 2019; Prasad et al., 2023; Li et al.,
2021), to predict sunspot numbers from an array of solar parameters,
including proton density, temperature, field magnetic average (FMA),
and bulk flow speed. Traditionally, these models
rely on datasets provided by space missions that measure various
characteristics of the photosphere, heliosphere, magnetic fields, and
chromosphere. Notable examples include NASA’s OMNIWEB services
(https://omniweb.gsfc.nasa.gov/ow.html) and spacecraft missions such
as NASA’s Advanced Composition Explorer (ACE) (Stone et al., 1998;
Hill et al., 2020), the Wind mission (Wilson et al., 2021), and NOAA’s
Deep Space Climate Observatory (DSCOVR) mission (Burt and Smith,
2012; Marshak et al., 2018), situated at Lagrange point L1. These
missions routinely measure the Interplanetary Magnetic Field (IMF) and
other aspects of solar dynamics.
For instance, Dani and Sulistiani (2019) aimed to predict the peak
time and value of Solar Cycle 25 using four different machine learn-
ing regression methods: Linear Regression (LR), Random Forest (RF),
Radial Basis Function (RBF), and Support Vector Machine (SVM). This
study utilized monthly mean sunspot number data from 1856 to June
2018 (solar cycles 10–24) provided by the World Data Center SILSO.
The RF prediction suggested a lower maximum with a well-defined
double-peak, and all methods predicted Solar Cycle 25 to commence
in late 2019 or early 2020.
Similarly, Mahdi et al. (2019) explored the relationship between
sunspot counts and coronal mass ejections (CMEs) using various clas-
sification algorithms, including decision tree, nearest neighbor, sup-
port vector machine, discriminant, ensembles, and logistic regression.
They found that the Ensemble Bagged Tree model exhibited the best
performance with an accuracy of 90.8%.
Moreover, Dang et al. (2022) introduced a novel ensemble model,
XGBoost-DL, which integrates deep learning models with XGBoost. This
model demonstrated exceptional forecasting performance, surpassing
other models with a Root Mean Square Error (RMSE) of 25.70 and
Mean Absolute Error (MAE) of 19.82, highlighting the efficacy of
ensemble models in sunspot number prediction.
Recently, Tarazona-Alvarado and Sierra-Porta (2023) developed a
comprehensive dataset based on solar images from the Solar and He-
liospheric Observatory (SOHO). This multidisciplinary effort resulted
in a robust methodology for calculating spectral parameters and rel-
evant features from SOHO images, extracting 14 topological features
and fractal metrics correlated with sunspot numbers. These metrics
include entropy, mean intensity, standard deviation, skewness, kurtosis,
relative smoothness, uniformity, fractal dimension, Tamura contrast,
Tamura directionality, Tamura coarseness, Tamura linelikeness,
Tamura regularity, and Tamura roughness.
This study aims to develop robust regression models to predict
sunspot numbers by incorporating independent features such as proton
temperature, wind speed flow, proton density, and the Interplanetary
Magnetic Field (IMF), along with exogenous topological variables de-
rived from spectral solar images. These features, extracted from the
SOHO dataset, provide crucial information obtained through advanced
image processing techniques. Such techniques, relevant in computer vi-
sion and image processing, provide quantitative information about the
structure, content, and visual characteristics of solar images, thus facil-
itating tasks like classification, detection, and segmentation (Tamura
et al., 1978; Amadasun and King, 1989; Wu and Chen, 1992). We
aim to determine the best approach for predicting sunspot numbers by
comparing the performance of different machine learning models using
these diverse features obtained from various SOHO spectral filters.
Our dataset supports the training of five machine learning regres-
sion models: Random Forest Regressor, Gradient Boosting Regressor,
Extra Trees Regressor, Ada Boost Regressor, and Hist Gradient Boosting
Regressor. By leveraging this dataset, we compare the effectiveness of
these models in accurately predicting sunspot numbers. The objective
is to evaluate how the inclusion of topological variables from SOHO
images enhances the predictive accuracy of these machine learning
models.
The primary objectives of this research are: (1) to assess the per-
formance of various machine learning regression models in predicting
sunspot numbers, considering both traditional solar parameters and
novel topological attributes; and (2) to investigate the added value
of incorporating these new variables derived from SOHO images in
improving prediction accuracy. This study highlights the potential of
advanced machine learning approaches to provide high-precision pre-
dictions of sunspot numbers, thereby offering deeper insights into the
dynamics of solar activity.
While Mahdi et al. (2019) focused on predicting CME initiation
using traditional classification algorithms based on sunspot number, our
research takes a different approach by incorporating topological
features derived from solar images. These features, not considered in
previous studies, capture intricate patterns of solar activity, providing
additional context and potentially enhancing the predictive accuracy of
our regression models for sunspot numbers. Likewise, while XGBoost-DL
(Dang et al., 2022) relies on the power of deep learning and ensemble
techniques, our study instead draws on these image-derived topological
variables.
2. Data, methods and techniques
2.1. Data acquisition, preparation and description
The data used in this study come from three primary sources: OMNI-
WEB (https://omniweb.gsfc.nasa.gov/ow.html), the Royal Observatory
of Belgium’s Solar Influences Data Analysis Center (SILSO) (https://
www.sidc.be/SILSO/datafiles), and a dataset from Mendeley Data and
Data in Brief (https://data.mendeley.com/datasets/5gh3xbvc92/1).
The dataset from OMNIWEB provides four main variables: wind flow
speed (km/s), proton density (n/cc), proton temperature (K), and
Interplanetary Magnetic Field (IMF) (nT). The Royal Observatory of
Belgium (SILSO) dataset offers the corresponding sunspot numbers with
daily resolution. Meanwhile, the Mendeley Data and Data in Brief dataset
provides 14 topological and spectral variables. For a detailed
description of how the data are constructed, refer to Sierra Porta and
Tarazona-Alvarado (2023) and Tarazona-Alvarado and Sierra-Porta (2023).
These variables are generated from SOHO images taken with different solar
filters: HMIIGR and HMIMAG (Helioseismic and Magnetic Imager
intensitygram and magnetogram, respectively), and EIT171, EIT195,
EIT284, and EIT304 (Extreme Ultraviolet Imaging Telescope).
For the new variables and features, we curated a diverse and com-
prehensive dataset derived from time series, encompassing data from
six distinct filters of the SOHO cameras. These datasets cover various
resolutions and frequencies, providing a detailed and temporally rich
view of solar activity. Noteworthy are the HMIIGR and HMIMAG filters,
which operate in the visible and near-infrared spectrum, respectively,
capturing images every hour and a half (Schou et al., 2012; Scherrer
et al., 2012). Additionally, the EIT filters (EIT171, EIT195, EIT284, and
EIT304) in the extreme ultraviolet region capture two images per day
at different wavelengths (Delaboudiniere et al., 1995; Kohl et al., 1995)
(see Fig. 1).
Fig. 1. Examples of solar images analyzed in our study as of June 26, 2023. The top images show the Sun captured by the SOHO EIT304 filter (https://soho.nascom.nasa.gov/data/REPROCESSING/Completed/2023/eit304/20230626/20230626_1319_eit304_1024.jpg) and the SOHO EIT195 filter (https://soho.nascom.nasa.gov/data/REPROCESSING/Completed/2023/eit195/20230626/20230626_1313_eit195_1024.jpg), which highlight the extreme ultraviolet emission at wavelengths of 304 Å and 195 Å, respectively, revealing the structure of the solar chromosphere, transition region, hot corona, and solar flares. The bottom images are samples from the HMIIGR (https://soho.nascom.nasa.gov/data/REPROCESSING/Completed/2023/hmiigr/20230626/20230626_1330_hmiigr_1024.jpg) and HMIMAG (https://soho.nascom.nasa.gov/data/REPROCESSING/Completed/2023/hmimag/20230626/20230626_1330_hmimag_1024.jpg) filters, which capture intensitygrams and magnetograms of the solar surface. These images are the source of the topological features used in our machine learning models to predict sunspot numbers.
This wide range of capture frequencies facilitates an in-depth ex-
ploration of solar activity patterns across various temporal scales.
Furthermore, the computation of 14 key parameters in the images,
including entropy, mean intensity, standard deviation, skewness,
kurtosis, relative smoothness, uniformity, fractal dimension, and various
Tamura metrics, adds significant depth to our dataset (Tamura et al.,
1978; Amadasun and King, 1989; Wu and Chen, 1992). These param-
eters offer a comprehensive characterization of solar image properties,
encompassing texture, complexity, and regularity.
Entropy represents the randomness and diversity of the image,
calculated using Shannon’s entropy formula. Mean intensity is the
average of all pixel intensities in the image, providing an overall
measure of brightness. Standard deviation indicates the dispersion of
pixel intensities with respect to the mean intensity. Skewness measures
the asymmetry of the intensity distribution in the image. Kurtosis
quantifies the shape of the intensity distribution, indicating its peakedness.
Uniformity indicates how evenly the pixel intensities are distributed in
the image. Relative smoothness provides a measure of the smoothness
or roughness of the image. Tamura Contrast evaluates the contrast
characteristics of the image based on standard deviation and kurtosis.
Tamura Directionality quantifies the predominant direction of features
within the image. Tamura Coarseness measures the texture or granularity
of the image based on the coefficients of the wavelet transform.
Tamura Linelikeness measures the presence and prevalence of linear
patterns in the image. Tamura Regularity quantifies the uniformity
and repeatability of patterns present in the image. Tamura Roughness
evaluates the degree of irregularity or rough texture present in the
image.
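To make the first-order descriptors above concrete, the following is a minimal sketch of how they could be computed from the gray-level histogram of an image. It assumes the common textbook definitions (Shannon entropy in bits, uniformity as the sum of squared histogram probabilities, relative smoothness as 1 - 1/(1 + normalized variance)); the function name `first_order_features` and the `levels` parameter are hypothetical, and the definitions used to build the published dataset may differ in detail.

```python
import math
from collections import Counter


def first_order_features(pixels, levels=256):
    """Histogram-based statistics of a flat sequence of integer gray levels.

    Illustrative definitions only; not necessarily those used for the dataset.
    """
    n = len(pixels)
    counts = Counter(pixels)
    # Normalized histogram p(g): fraction of pixels at each gray level g.
    p = {g: c / n for g, c in counts.items()}

    mean = sum(g * pg for g, pg in p.items())
    var = sum((g - mean) ** 2 * pg for g, pg in p.items())
    std = math.sqrt(var)

    # Standardized third and fourth moments (set to 0.0 for a flat image).
    m3 = sum((g - mean) ** 3 * pg for g, pg in p.items())
    m4 = sum((g - mean) ** 4 * pg for g, pg in p.items())
    skewness = m3 / std ** 3 if std > 0 else 0.0
    kurtosis = m4 / std ** 4 if std > 0 else 0.0

    # Shannon entropy in bits and histogram uniformity (energy).
    entropy = -sum(pg * math.log2(pg) for pg in p.values())
    uniformity = sum(pg ** 2 for pg in p.values())

    # Relative smoothness, with the variance normalized to [0, 1].
    var_norm = var / (levels - 1) ** 2
    smoothness = 1 - 1 / (1 + var_norm)

    return {
        "entropy": entropy,
        "mean": mean,
        "std": std,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "uniformity": uniformity,
        "smoothness": smoothness,
    }


# Toy "image": half of the pixels black, half white.
features = first_order_features([0] * 8 + [255] * 8)
print(features)
```

A two-level image like this one has one bit of entropy and zero skewness, which gives a quick sanity check before applying the function to real pixel arrays flattened to a list.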
The data mining process (see Fig. 2) involves organizing, cleaning,
and arranging the three different datasets. This process includes elimi-
nating missing data and merging the datasets based on the temporal
alignment of events. Because the sunspot number has daily resolution,
daily averages of the heliospheric data are used, and the topological
features are resampled to daily values using exponentially weighted
moving (EWM) functions, which smooth the data while giving more weight
to recent observations. This produces six datasets (one per SOHO image
filter) with the same variables and structure, covering the period from
the beginning of 2011 to mid-2023.
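The preparation steps described above (daily averaging, exponentially weighted smoothing, and merging on the date) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: the helper names are hypothetical, the smoothing factor `alpha` is not given in the text, and the recursive form of the exponentially weighted mean is assumed.

```python
from collections import defaultdict
from datetime import datetime


def daily_mean(samples):
    """Average (timestamp, value) samples into one value per calendar day."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts.date()].append(value)
    return {day: sum(v) / len(v) for day, v in sorted(buckets.items())}


def ewm_mean(values, alpha):
    """Recursive exponentially weighted mean.

    y_0 = x_0 and y_t = (1 - alpha) * y_{t-1} + alpha * x_t, so recent
    observations receive more weight than older ones.
    """
    smoothed = []
    for x in values:
        prev = smoothed[-1] if smoothed else x
        smoothed.append((1 - alpha) * prev + alpha * x)
    return smoothed


def merge_on_date(*tables):
    """Inner-join several {date: value} tables, dropping days missing anywhere."""
    common = set(tables[0]).intersection(*tables[1:])
    return {day: tuple(t[day] for t in tables) for day in sorted(common)}


# Toy example: three image-derived entropy values over two days.
entropy_samples = [
    (datetime(2023, 6, 26, 1, 0), 2.0),
    (datetime(2023, 6, 26, 13, 0), 4.0),
    (datetime(2023, 6, 27, 1, 0), 6.0),
]
entropy_daily = daily_mean(entropy_samples)
entropy_smooth = dict(zip(entropy_daily, ewm_mean(list(entropy_daily.values()), alpha=0.5)))
sunspots = {day: 100 + i for i, day in enumerate(entropy_daily)}  # placeholder labels
print(merge_on_date(entropy_smooth, sunspots))
```

In practice a dataframe library would handle these steps more compactly; the explicit version is shown only to make the order of operations visible.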
To ensure the robustness of the predictive models, a thorough
collinearity analysis was conducted. This analysis identified significant
correlations between variables, which were then used to guide feature
selection and engineering processes. The details and results of the
correlation analysis are presented in the Results section.
For the moment, Fig. 3 illustrates the temporal behavior of the
sunspot number (upper-left panel) along with various variables re-
lated to heliosphere dynamics and topological features. Specifically, it
shows the time series for sunspot number, entropy, fractal dimension,
Tamura’s coarseness, Tamura’s uniformity, bulk flow speed, proton
density, proton temperature, and field magnetic average (FMA). These
variables provide insights into the intricate dynamics of solar activity
over the observed period.
After this process we obtain six datasets, one for each type of image
(HMIIGR, HMIMAG, EIT171, EIT195, EIT284 and EIT304). Each dataset
contains 6 heliospheric variables, 14 topological variables and the
sunspot number, which is the variable to be predicted.
2.2. Methods: Machine learning regressors
The main objective of machine learning (ML) techniques is to
develop models capable of performing tasks such as prediction or
estimation. In regression, the goal is to predict continuous values rather
than classify data into predefined classes. When developing regression
models using ML techniques, both training errors (errors on the training
data) and generalization errors (expected errors on the testing data)
can occur. A good regression model should fit the training set well and
accurately predict new, unseen data. Overfitting, where test error rates
increase while training error rates decrease, is related to model com-
plexity and should be minimized to achieve the lowest generalization
error. The bias–variance decomposition formally analyzes the expected
generalization error in terms of a bias component (systematic error of
the model) and a variance component (sensitivity to training set
fluctuations); for squared-error loss, the expected error is the sum of
the squared bias, the variance, and an irreducible noise term.
The primary objective of our study is to enhance the predictive
accuracy of sunspot numbers by incorporating both heliospheric and
topological features derived from solar images. Given the complex
and dynamic nature of solar activity, it is essential to employ ro-
bust machine learning methodologies that can effectively capture and
model these intricacies. Traditional approaches relying solely on he-
liospheric variables may fall short in capturing the full spectrum of
influences on sunspot numbers. Therefore, our approach leverages
advanced machine learning algorithms and a comprehensive dataset
that includes additional topological features from SOHO images. By
comparing models that use only heliospheric variables with those
that also incorporate topological features, we aim to demonstrate the
added predictive power of these features. This dual-scenario approach,
applied across six different types of solar images, allows for a thorough
evaluation of the effectiveness of topological information in improving
sunspot prediction models.
Fig. 2. Overview of the methodological framework for data acquisition, preparation, and model training. The process begins with data acquisition, involving the collection of topological and spectral features from solar images (F1) and solar wind features from the OMNIWeb database (F2). Sunspot number labels are sourced from the SILSO database (L). The data preparation phase includes preprocessing steps such as collinearity analysis, feature engineering, and the selection of appropriate machine learning models.
Fig. 3. Temporal behavior of the sunspot number (upper-left panel), as well as various variables related to heliosphere dynamics and topological features calculated from the SOHO EIT171 filter images.
To develop regression models, we created a total of 60 models.
These consist of 5 regressors × 6 image types × 2 scenarios. The two sce-
narios include: one using only heliospheric variables, and another using
both heliospheric and topological features derived from SOHO images.
This approach allows us to compare the predictive power of models and
demonstrate how topological features can enhance traditional models.
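The count of 60 models follows directly from the experimental design. As a small sketch, the full grid of experiments can be enumerated as below; the short labels are the abbreviations used in this paper, while the scenario names are hypothetical placeholders.

```python
from itertools import product

regressors = ["RF", "GB", "ET", "AB", "XGB"]
filters = ["HMIIGR", "HMIMAG", "EIT171", "EIT195", "EIT284", "EIT304"]
scenarios = ["heliospheric", "heliospheric+topological"]  # placeholder names

# One experiment per (regressor, image filter, feature scenario) combination.
experiments = list(product(regressors, filters, scenarios))
print(len(experiments))  # 5 * 6 * 2 = 60
```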
The data were first split into training and testing sets, and k-fold
cross-validation (Schaffer, 1993; Tougui et al., 2021; Ramezan et al.,
2019) was combined with hyperparameter tuning (Yang and Shami,
2020; Weerts et al., 2020). The training set was used to train the
models, while the testing set was used to evaluate their performance.
The machine learning models used in this work are the following.
The Random Forest Regressor (RF) (Breiman, 2001) is an ensemble
method that constructs multiple decision trees during training and
outputs the mean prediction of the individual trees, thereby reducing
variance and overfitting. This robustness makes RF suitable for han-
dling the noisy, high-dimensional nature and nonlinear behavior of
solar activity data. The 11-year solar cycle leads to periodic variations
in sunspot numbers, which often exhibit near-normal distributions but
with noticeable skewness and asymmetry. The Gradient Boosting Re-
gressor (GB) (Friedman, 2001) builds an ensemble of weak prediction
models sequentially, with each new tree correcting the errors of the
previous ones. This approach is particularly effective for capturing the
complex, nonlinear relationships in solar activity data. Similarly, the
Extra Trees Regressor (ET) (Geurts et al., 2006) introduces additional
randomness during tree construction, improving model diversity and
performance in datasets with small or noisy characteristics, such as
those found in sunspot observations.
The AdaBoost Regressor (AB) (Freund and Schapire, 1997) enhances
prediction accuracy by focusing on reducing variance and assigning
more weight to incorrectly predicted instances in each iteration. This
makes AB resistant to overfitting and suitable for the complex distribu-
tions observed in sunspot data. Lastly, the Extreme Gradient Boosting
Regressor (XGB) (Chen and Guestrin, 2016) sequentially trains weak
models to correct previous errors, optimizing a specific loss function
to create robust predictions. XGB’s flexibility and ability to handle
high-dimensional data with numerous features make it particularly
effective for capturing the intricacies of sunspot number variations over
the 11-year solar cycle, accommodating the inherent skewness and
asymmetry.
For the RF, GB, ET, AB, and XGB models, different values for
the number of trees (n_estimators), learning rate (learning_rate), and
maximum tree depth (max_depth) were explored using GridSearchCV
and hyperparameter tuning with a cross-validation strategy (Schaffer,
1993). The parameters evaluated included max_depth: [5, 10, 20, None],
max_features: [5, 10, 20, ‘auto’, ‘sqrt’, ‘log2’, None], n_estimators: [30,
50, 100, 200, 300], learning_rate: [0.01, 0.05, 0.1, 0.2, 0.5], max_iter:
[50, 100, 150].
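A grid search of this kind can be sketched with scikit-learn as follows. This is an illustration only: the data are synthetic, the grid is a small subset of the values listed above, and the choice of five folds is an assumption, since the number of folds is not stated. Some listed values, such as 'auto' for max_features, may not be accepted by recent scikit-learn releases, so they are left out here.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a feature table (200 days, 6 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# Reduced grid, taken from the ranges reported in the text.
param_grid = {
    "n_estimators": [30, 50],
    "max_depth": [5, None],
    "max_features": ["sqrt", None],
}

search = GridSearchCV(
    ExtraTreesRegressor(random_state=0),
    param_grid,
    cv=5,          # assumed number of folds
    scoring="r2",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same pattern applies to the other regressors by swapping the estimator and restricting the grid to the parameters that estimator accepts (for example, learning_rate for the boosting models).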
The cross-validation strategy was used to select the best models
in each case, allowing evaluation of each model’s performance on an
independent validation set and optimization of its parameters through
grid search. Once the best estimators and parameters were determined
for each model, their performance was evaluated on the independent
test set.
For the design, implementation, and application of our regression
models, each dataset was split into three distinct subsets: a training
set, a testing set, and a validation set. The training set comprises
70% of the complete data, selected entirely at random. The testing
set consists of the remaining 30% of the data. Additionally, a third
validation set was created by randomly selecting 70% of the original
data, using a different seed to ensure that this set includes elements
from both the training and testing sets. This approach allows for robust
model evaluation and ensures that the models are not overfitted to
a particular subset of data, providing a comprehensive assessment of
their predictive performance.
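The three-subset scheme described above can be sketched with the standard library alone. The helper name `split_indices` and the seeds are hypothetical; only the 70/30 proportions and the use of a different seed for the validation draw come from the text.

```python
import random


def split_indices(n, train_frac=0.7, seed=0):
    """Shuffle the indices 0..n-1 and cut them into two disjoint parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_frac * n))
    return idx[:cut], idx[cut:]


n = 1000
train, test = split_indices(n, seed=42)      # 70% training, 30% testing
validation, _ = split_indices(n, seed=7)     # second 70% draw, different seed

# The validation draw mixes rows from both the training and testing sets.
overlap_train = len(set(validation) & set(train))
overlap_test = len(set(validation) & set(test))
print(len(train), len(test), len(validation), overlap_train, overlap_test)
```

Because the validation draw shares rows with the training set, scores obtained on it are not a fully independent estimate of generalization; the held-out 30% testing set plays that role.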
2.3. Performance metrics
Once a regression model is obtained using one or more ML tech-
niques, it is important to estimate the model’s performance. The perfor-
mance analysis of each proposed model is measured in terms of r2-score
(r2), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE),
Explained Variance Score (EVS), and Mean Tweedie Deviance (MTD).
The r2-score measures the proportion of the variance in the dependent
variable that is predictable from the independent variable(s). Its
maximum is 1, indicating a perfect fit between the predicted and actual
values; a value of 0 indicates that the model explains none of the
variability in the data, and negative values can occur when the model
performs worse than predicting the mean. RMSE is a metric that
measures the average distance between the predicted and actual values
of the dependent variable. It is calculated by taking the square root of
the mean of the squared differences between the predicted and actual
values.
The EVS measures the proportion of the variance in the dependent
variable explained by the independent variable(s). Like the r2-score,
its maximum is 1 for a perfect fit, and a value of 0 indicates that the
model explains none of the variability in the data; unlike the r2-score,
it does not penalize a constant offset in the predictions. The MTD
metric measures the deviation of
the predicted values from the true values of the dependent variable.
It is calculated by taking the mean of the Tweedie deviances, which
measure the difference between the predicted and actual values. In
regression model evaluation, a lower MTD value indicates better model
performance, as it signifies that the predictions are closer to the true
values. However, the optimal MTD value may vary depending on the
specific context and application.
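For reference, the metrics above can be written out in a few lines. The sketch below assumes the usual definitions and, for the Tweedie deviance, a power parameter of 0, in which case the unit deviance reduces to the squared error; the power actually used in the study is not stated, and the function name is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """r2, MAE, RMSE, EVS and mean Tweedie deviance (power = 0) from two lists."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares

    mean_res = sum(residuals) / n
    var_res = sum((r - mean_res) ** 2 for r in residuals) / n
    var_true = ss_tot / n

    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
        "evs": 1 - var_res / var_true,
        "mtd_power0": ss_res / n,  # equals the mean squared error for power = 0
    }


# Predictions that are off by a constant offset of 1.
metrics = regression_metrics([1, 2, 3, 4], [2, 3, 4, 5])
print(metrics)
```

In this toy case the EVS is 1 while the r2-score is only 0.2, which illustrates why the two are reported separately: a systematic offset lowers the r2-score but leaves the explained variance untouched.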
The predictive accuracy of the model is computed from the testing
set, which provides an estimation of the generalization errors. To obtain
reliable results regarding the predictive performance of a regression
model, it is crucial that training and testing samples are sufficiently
large and independent, with known labels for the testing sets. Common
methods for evaluating the performance of a regression model by split-
ting the initial labeled data into subsets include: (i) Holdout Method,
(ii) Random Sampling, (iii) Cross-Validation, and (iv) Bootstrap.
In our study, we employed the random sampling method with a
bootstrap option. This approach is similar to the Holdout method,
where data samples are partitioned into two separate sets: the training
set and the test set. However, random sampling involves repeating this
partitioning process multiple times, with training and test instances
selected randomly each time to better estimate accuracy. By using the
bootstrap option, samples are selected with replacement, meaning that
after being chosen for training, they are returned to the entire data
set. This method allows for a comprehensive evaluation of the model’s
predictive performance by ensuring that the model is tested on a variety
of data subsets.
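A minimal sketch of repeated random sampling with a bootstrap option is given below. It assumes that the rows never drawn into a bootstrap training sample (the out-of-bag rows) serve as the test instances for that repetition; the text does not specify this detail, and the function name, number of repetitions, and seed are hypothetical.

```python
import random


def bootstrap_splits(n, repeats=5, seed=0):
    """Yield (train, test) index lists for repeated bootstrap sampling."""
    rng = random.Random(seed)
    for _ in range(repeats):
        # Training indices are drawn with replacement, so repeats can occur.
        train = [rng.randrange(n) for _ in range(n)]
        chosen = set(train)
        # Rows never drawn form the out-of-bag test set for this repetition.
        test = [i for i in range(n) if i not in chosen]
        yield train, test


splits = list(bootstrap_splits(100, repeats=3, seed=1))
for train, test in splits:
    print(len(train), len(set(train)), len(test))
```

Averaging a metric over the repetitions gives a more stable estimate than a single holdout partition.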
Table 1
Summary of variables retained after applying the collinearity removal algorithm. The N_T column gives the number of topological variables retained out of the initial 14, while the R% column gives the percentage reduction in the number of variables with respect to the original 20.
Dataset | Retained variables | N_T | R%
HMIIGR | FMA, Bulk Flow speed, Proton Density, Temperature, entropy, standard deviation, fractal dimension, Tamura regularity | 4 | 60
HMIMAG | FMA, Bulk Flow speed, Proton Density, Temperature, entropy, standard deviation, fractal dimension, Tamura roughness | 4 | 60
EIT171 | FMA, Bulk Flow speed, Proton Density, Temperature, standard deviation, relative smoothness, fractal dimension, Tamura directionality, Tamura linelikeness, Tamura regularity, Tamura roughness | 7 | 45
EIT195 | FMA, Bulk Flow speed, Proton Density, Temperature, standard deviation, kurtosis, relative smoothness, fractal dimension, Tamura roughness, Tamura coarseness, Tamura directionality | 7 | 45
EIT284 | FMA, Bulk Flow speed, Proton Density, Temperature, standard deviation, skewness, fractal dimension, Tamura directionality, Tamura linelikeness, Tamura regularity, Tamura roughness | 7 | 45
EIT304 | FMA, Bulk Flow speed, Proton Density, Temperature, relative smoothness, fractal dimension, Tamura directionality, Tamura regularity, Tamura linelikeness, Tamura roughness | 6 | 50
3. Results of ML applications in sunspot
In regression analysis, multicollinearity among independent vari-
ables can lead to unstable estimates and reduce the interpretability
of the model. To address this issue, we implemented a strategy to
identify and remove highly collinear variables based on a specified
correlation threshold. This process involves calculating the correlation
matrix for all independent variables and excluding those with corre-
lation coefficients exceeding the threshold of 0.7. By doing so, we
aim to ensure that the remaining variables provide unique and non-
redundant information, improving the model’s stability and predictive
power. The Python collinearity package (https://pypi.org/project/collinearity/)
was used for this task.
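The idea of threshold-based collinearity removal can be illustrated with a simple greedy filter. This sketch is not the algorithm of the package cited above, which may rank and select features differently; it only shows how a correlation threshold of 0.7 leads to dropping redundant columns. The function names and toy data are hypothetical, and the code assumes no column is constant.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither column is constant


def drop_collinear(columns, threshold=0.7):
    """Keep a column only if |r| <= threshold against every column kept so far."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy table: "b" is almost a multiple of "a", "c" is only weakly related.
table = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
}
print(drop_collinear(table))
```

With a greedy filter the result depends on column order, so which member of a correlated pair survives is a design choice rather than a property of the data.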
For instance, Fig. 4 presents a heatmap of the correlation matrix for
the EIT171 dataset, illustrating the relationships between all variables
before collinearity removal.
The initial dataset comprised 20 independent variables: 4 helio-
spheric variables (FMA, Bulk Flow speed, Proton Density, Temperature)
and 14 topological variables (entropy, mean intensity, standard de-
viation, skewness, kurtosis, relative smoothness, uniformity, fractal
dimension, Tamura contrast, Tamura directionality, Tamura coarseness,
Tamura linelikeness, Tamura regularity, Tamura roughness). The
target variable to be predicted was the sunspot number (SSN).
After applying the collinearity removal process with a threshold
of 0.7, the number of variables in each dataset was significantly
reduced. Table 1 lists the variables retained after elimination due to
collinearity.
Table 2 presents the results of the model evaluation metrics for the
various datasets. The numbers inside parentheses in each cell represent
the metrics for the sunspot regression using flow speed, temperature,
density, and IMF as predictor variables, along with the 14 topological
variables. In contrast, the numbers without parentheses reflect the
metrics when using only flow speed, temperature, density, and IMF as
predictor variables.

Fig. 4. Heatmap of the correlation matrix for the EIT171 dataset. Highly correlated variables (correlation coefficient greater than 0.7) were removed to mitigate multicollinearity.

Table 2
Performance metrics of the five machine learning algorithms for predicting sunspot number based on the six datasets with the same characteristics and variables for each of the
types of filter wavelengths to obtain the SOHO images. The metrics include r2-score, MAE, RMSE and MTD, and are shown for each of the six datasets: EIT171, EIT195, EIT284,
EIT304, HMIIGR, and HMIMAG. The values in parentheses represent the corresponding metric for scenario 2, that is, including topological variables, in comparison with scenario
1 (only heliospheric variables, without parentheses). The best performance for each metric is highlighted in bold.

Metric    Regr.  EIT171            EIT195            EIT284            EIT304            HMIIGR            HMIMAG
r2-score  AB     0.11(0.74)        0.12(0.74)        0.11(0.78)        0.12(0.68)        0.12(0.42)        0.12(0.82)
r2-score  ET     0.31(0.95)        0.31(0.95)        0.31(0.96)        0.31(0.96)        0.32(0.94)        0.32(0.97)
r2-score  GB     0.25(0.85)        0.25(0.86)        0.25(0.91)        0.25(0.93)        0.25(0.93)        0.25(0.93)
r2-score  RF     0.19(0.93)        0.19(0.93)        0.19(0.94)        0.19(0.94)        0.19(0.89)        0.19(0.95)
r2-score  XGB    0.24(0.87)        0.24(0.91)        0.24(0.93)        0.24(0.94)        0.24(0.93)        0.24(0.92)
MAE       AB     38.83(19.84)      38.99(19.44)      39.02(18.01)      38.76(21.79)      39.2(31.76)       39.2(17.62)
MAE       ET     33.53(3.97)       33.58(3.79)       33.67(3.28)       33.55(3.32)       33.94(5.12)       33.94(3.38)
MAE       GB     35.26(14.26)      35.26(13.64)      35.04(10.99)      35.18(8.11)       35.28(5.95)       35.28(11.48)
MAE       RF     36.77(8.59)       36.74(8.23)       36.68(7.3)        36.82(7.47)       36.89(10.1)       36.89(7.41)
MAE       XGB    35.4(12.81)       35.31(10.73)      35.18(8.42)       35.37(7.48)       35.43(5.86)       35.43(9.85)
RMSE      AB     46.92(25.45)      46.8(25.38)       46.89(23.41)      46.87(28.41)      46.81(38.0)       46.81(22.08)
RMSE      ET     41.36(11.04)      41.42(10.75)      41.52(9.5)        41.31(9.7)        41.69(12.52)      41.69(9.14)
RMSE      GB     43.24(19.36)      43.3(18.72)       43.07(15.51)      43.13(12.79)      43.12(13.51)      43.12(15.5)
RMSE      RF     44.83(13.36)      44.89(13.03)      44.78(11.94)      44.94(12.22)      44.89(15.49)      44.89(11.25)
RMSE      XGB    43.44(17.65)      43.43(15.5)       43.32(13.02)      43.4(12.49)       43.36(13.37)      43.36(13.71)
MTD       AB     2201.73(647.88)   2190.15(644.32)   2199.09(548.1)    2196.37(806.89)   2191.3(1443.75)   2191.3(487.56)
MTD       ET     1710.36(121.85)   1715.53(115.57)   1723.62(90.2)     1706.13(94.06)    1738.46(156.87)   1738.46(83.6)
MTD       GB     1870.07(374.91)   1874.81(350.37)   1854.97(240.69)   1860.33(163.51)   1859.64(182.47)   1859.64(240.18)
MTD       RF     2010.02(178.6)    2015.09(169.74)   2005.46(142.64)   2019.29(149.31)   2015.42(239.8)    2015.42(126.67)
MTD       XGB    1887.13(311.52)   1885.74(240.37)   1877.06(169.51)   1883.5(155.89)    1880.17(178.79)   1880.17(187.99)
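For readers who want to reproduce the kind of metrics reported in Table 2, a minimal sketch with scikit-learn follows. The helper name `regression_report` and the toy arrays are assumptions made for this example. The Tweedie power is also an assumption: with `power=0` the mean Tweedie deviance equals the mean squared error, which would be consistent with the MTD entries in Table 2 lying close to the square of the RMSE (for instance 46.92 squared is about 2201.5, against a tabulated 2201.73).

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_tweedie_deviance,
    r2_score,
)


def regression_report(y_true, y_pred):
    """Return r2, MAE, RMSE and mean Tweedie deviance (power=0, an assumed setting)."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "r2": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": float(np.sqrt(mse)),
        "MTD": mean_tweedie_deviance(y_true, y_pred, power=0),
    }


# Toy sunspot-like values (hypothetical, for illustration only).
y_true = np.array([10.0, 50.0, 120.0, 80.0])
y_pred = np.array([12.0, 45.0, 110.0, 90.0])
report = regression_report(y_true, y_pred)
print(report)
```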
The results in Table 2 clearly indicate that the models incorporating
the topological variables retained after the collinearity analysis
perform better than those using only the heliospheric variables.
Additionally, the Extra Trees Regressor achieves the best scores on
every dataset, and the Random Forest Regressor ranks second on most of
them; the exception is HMIIGR, where GB and XGB reach a higher r2-score
(0.93) than RF (0.89).
Taking into account the results summarized in Table 2, the Extra Trees
Regressor (ETR) is the regression model that best predicts the sunspot
number for all datasets considered. For the HMIIGR image dataset, the
most important variables are entropy (27.04%), standard deviation
(24.14%), Taruma regularity (23.03%), fractal dimension (8.59%), and
FMA (5.56%), which together cover about 88% of the total importance of
the model. Only one heliospheric variable appears among these five, and
its contribution is small; the predictive power of the new model comes
mainly from the topological variables.
Similarly, for the best regression model using the HMIMAG image
features dataset, the most important variables are fractal dimension
(40.98%), standard deviation (23.99%), entropy (19.41%), and Taruma
roughness (6.41%). These variables cover about 91% of the total
importance of the model. This result suggests that topological features
extracted from HMIMAG images significantly impact model performance,
providing new insights and potential for improved prediction accuracy
compared to previous models using only heliospheric variables.
A similar conclusion is reached in the analyses performed on the other
types of images, namely EIT171, EIT195, EIT284, and EIT304. The
heliospheric variables, obtained from the original OMNIWEB data,
account for a minimal proportion of the model’s importance, generally
less than 10%. This finding reinforces the idea that features extracted
from image topology are critical for improving predictive models. The
consistency of these results suggests that the inclusion of the new
topological variables brings a significant improvement in the
predictive ability of the models, regardless of the type of image
analyzed. The variable importances for all datasets are shown in
Table 3.
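The importance percentages discussed above are of the kind exposed by tree ensembles in scikit-learn. The sketch below shows one way such percentages could be obtained from an `ExtraTreesRegressor` via its impurity-based `feature_importances_` attribute. Whether the authors used this exact attribute is an assumption, and the data here are synthetic, with a target deliberately dominated by one feature.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic features (hypothetical); column names merely echo those in the text.
rng = np.random.default_rng(1)
n = 400
X = pd.DataFrame({
    "entropy": rng.normal(size=n),
    "fractal_dimension": rng.normal(size=n),
    "FMA": rng.normal(size=n),
})
# Target constructed so that "entropy" carries most of the signal.
y = 5.0 * X["entropy"] + 1.0 * X["fractal_dimension"] + 0.1 * rng.normal(size=n)

model = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; scale to percent and sort.
importance_pct = (
    pd.Series(model.feature_importances_, index=X.columns)
    .mul(100.0)
    .sort_values(ascending=False)
)
print(importance_pct.round(2))
```

Impurity-based importances can be biased toward features with many distinct values, so they are best read as a ranking aid rather than as exact contributions.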
In our study, the use of topological variables significantly improved
the performance of the machine learning models. Specifically, the
r2-score of the best model (ET), averaged over the six datasets in
Table 2, was approximately 0.31 without topological variables, while
the inclusion of topological variables increased it to over 0.94. Over
the same rows of Table 2, the average RMSE decreased from about 41.5 to
about 10.4 when topological variables were included. The other
regressors follow the same trend, with lower absolute scores.
Finally, we found that the Extra Trees Regressor (ET) was the best
model among all the models considered in this study, based on all the
control metrics used.

Table 3
Percentage of importance of the variables that contribute most to the ETR model for all the datasets considered in this study. The first 8 most important variables are shown in
order of highest contribution.

HMIIGR                 Importance (%)  HMIMAG                 Importance (%)  EIT171                 Importance (%)
entropy                27.04           fractal_dimension      40.98           standard_deviation     34.73
standard_deviation     24.14           standard_deviation     23.99           relative_smoothness    15.27
taruma_regularity      23.03           entropy                19.41           taruma_directionality  13.26
fractal_dimension      8.59            taruma_roughness       6.41            taruma_regularity      10.04
FMA                    5.56            FMA                    2.50            fractal_dimension      8.09
Proton Density         4.21            Proton Density         2.40            taruma_linelikeness    6.61
Bulk Flow speed        4.09            Bulk Flow speed        2.22            taruma_roughness       3.65
Temperature            3.31            Temperature            2.04            FMA                    2.35

EIT195                 Importance (%)  EIT284                 Importance (%)  EIT304                 Importance (%)
standard_deviation     34.47           standard_deviation     54.10           relative_smoothness    49.26
relative_smoothness    21.74           taruma_linelikeness    12.44           taruma_regularity      11.74
taruma_coarseness      19.27           skewness               7.57            fractal_dimension      9.98
taruma_roughness       4.70            fractal_dimension      6.96            taruma_roughness       8.23
kurtosis               4.34            taruma_regularity      5.44            taruma_linelikeness    6.81
taruma_directionality  3.88            taruma_roughness       4.40            taruma_directionality  5.22
fractal_dimension      3.06            taruma_directionality  2.79            Bulk Flow speed        2.33
FMA                    2.56            Bulk Flow speed        1.63            FMA                    2.32

Fig. 5. Visual inspection of predicted SSN versus SSN with original measured data. Each panel shows the comparison of data and prediction using only OMNI variables (right)
and also using topological variables (left) for each type of imagery used.
Fig. 5 visually presents the SSN prediction performance of all the
models and the data obtained from each type of filter used on the Sun in
the SOHO images. The ET model, as shown in Table 2, is confirmed to
be the best regression model, demonstrating superior performance and
results. It is evident that when topological variables are not used, the
models fail to adequately predict the sunspot number. However, with
the inclusion of topological variables, the predictions are well-adjusted
and accurate.
Additionally, we can observe that the ET model outperforms all
other models for both high and low sunspot numbers. However, it is
also noted that for all models, there is higher dispersion at lower SSN
values and very high prediction accuracy for higher SSN values.
In other words, the ETR model consistently achieves better metrics
than the other regressors, particularly when topological variables
are included. This conclusion is further reinforced by examining the
disaggregated root mean squared error for the training, test, total,
and validation sets (see Table 4).
The inclusion of topological variables not only improves prediction
accuracy but also shows no evident overfitting. For example, in the
HMIIGR dataset, the training and testing sets have approximately the
same RMSE in both scenarios, with and without topological variables.
This indicates that the model generalizes well to unseen data. The
validation-set RMSE, which is close to the training and testing values
(Table 4), further supports the robustness of the model. Similar trends
are observed in the other datasets, where, with topological variables,
the validation-set RMSE remains low and comparable to the training and
test set RMSE values.
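The overfitting check described above amounts to comparing the RMSE on disjoint subsets of the data. A minimal sketch follows, assuming a 60/20/20 random split into training, testing and validation sets. The split proportions, the hyperparameters and the synthetic data are all assumptions, since the exact partition is not restated in this section.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the real feature table.
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 4))
y = 30.0 + 10.0 * X[:, 0] - 5.0 * X[:, 1] + rng.normal(scale=2.0, size=600)

# Hold out 20% for validation, then split the rest 75/25 into train/test.
X_work, X_val, y_work, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_work, y_work, test_size=0.25, random_state=0
)

model = ExtraTreesRegressor(n_estimators=200, min_samples_leaf=5, random_state=0)
model.fit(X_train, y_train)


def rmse(X_part, y_part):
    """Root mean squared error of the fitted model on one subset."""
    return float(np.sqrt(mean_squared_error(y_part, model.predict(X_part))))


rmse_by_split = {
    "training": rmse(X_train, y_train),
    "testing": rmse(X_test, y_test),
    "validation": rmse(X_val, y_val),
}
print(rmse_by_split)
```

A training RMSE far below the testing and validation values would point to overfitting; similar values across the three subsets are what Table 4 reports.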
Table 4
Disaggregated root mean squared error metric for the ET model across different datasets and evaluation sets.

Scenario                              Dataset     HMIIGR  HMIMAG  EIT171  EIT195  EIT284  EIT304
Only Heliospheric variables           Training    41.39   41.39   40.87   41.15   40.84   41.25
Only Heliospheric variables           Testing     42.61   42.61   43.03   41.98   42.67   42.29
Only Heliospheric variables           All         41.69   41.69   41.42   41.36   41.31   41.52
Only Heliospheric variables           Validation  41.95   41.95   41.21   41.26   41.11   41.41
Heliospheric + Topological variables  Training    8.93    12.5    10.48   10.85   9.17    9.52
Heliospheric + Topological variables  Testing     9.76    12.59   11.52   11.59   10.42   10.21
Heliospheric + Topological variables  All         9.14    12.52   10.75   11.04   9.5     9.7
Heliospheric + Topological variables  Validation  9.29    12.77   10.66   10.99   9.38    9.58

This study demonstrates that the inclusion of topological variables
from solar images significantly enhances the performance of sunspot
prediction models. While the XGBoost-DL model proposed by other
researchers (Dang et al., 2022) uses a two-level nonlinear ensemble
method to combine deep learning models, achieving an RMSE of 25.70
and an MAE of 19.82, our approach with the Extra Trees Regressor
(ET) and other models incorporates topological features. These features
capture detailed solar activity patterns, which heliospheric variables
alone might miss. The key difference in our study lies in the use of
these unique topological variables, which provide additional context
and improve model accuracy by accounting for the complex structures
observed in solar images.
Incorporating topological variables such as entropy into regression
prediction models can significantly enhance their predictive capabil-
ities by capturing the underlying complexity and randomness within
the data. Entropy measures the degree of disorder or uncertainty in a
dataset, making it a valuable predictor for models dealing with dynamic
systems like solar activity. For example, Shannon entropy has been
effectively used in molecular property predictions (Guha and Velegol,
2023), demonstrating that entropy-based descriptors can reduce pre-
diction errors and improve model accuracy by capturing intricate data
patterns. This suggests that entropy can similarly enhance solar activity
models by accounting for the chaotic and unpredictable nature of solar
phenomena.
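To make the entropy descriptor concrete, the sketch below computes the Shannon entropy, in bits, of the grey-level histogram of an 8-bit image. This is one common definition; the exact variant used to build the dataset is not specified here, so the binning and the function name `shannon_entropy` are assumptions.

```python
import numpy as np


def shannon_entropy(image: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy (bits) of the grey-level histogram of an 8-bit image."""
    hist, _ = np.histogram(image, bins=bins, range=(0, 256))
    p = hist[hist > 0] / hist.sum()          # probabilities of the occupied grey levels
    return float(np.sum(p * np.log2(1.0 / p)))


# A uniform image carries no uncertainty; random noise approaches the 8-bit maximum.
flat = np.full((64, 64), 128, dtype=np.uint8)
noisy = np.random.default_rng(3).integers(0, 256, size=(64, 64), dtype=np.uint8)

print(shannon_entropy(flat), shannon_entropy(noisy))
```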
Fractal dimension is another topological feature that quantifies the
complexity of structures within an image, offering a deeper understand-
ing of spatial patterns. Fractal analysis has been widely used in various
scientific fields to identify self-similar patterns that traditional metrics
might miss. For instance, fractal dimensions have been used to improve
the prediction accuracy of molecular properties in cheminformatics,
highlighting their potential to capture complex relationships within
data (Kozak and Juszczuk, 2023). By integrating fractal dimension into
solar activity models, we can better represent the intricate structures of
sunspots and other solar features, leading to more precise predictions.
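One standard estimator of fractal dimension is box counting: the image is covered with boxes of decreasing size s, the number N(s) of boxes containing foreground pixels is counted, and the dimension is the slope of log N(s) against log(1/s). The sketch below assumes a square binary mask with a power-of-two side and at least one foreground pixel. It illustrates the general technique and is not necessarily the estimator used for the dataset.

```python
import numpy as np


def box_counting_dimension(mask: np.ndarray) -> float:
    """Estimate the box-counting dimension of a square, non-empty binary mask.

    Assumes mask.shape == (n, n) with n a power of two.
    """
    side = mask.shape[0]
    sizes, counts = [], []
    size = side // 2
    while size >= 1:
        # Tile the mask into (size x size) boxes and count boxes with any foreground pixel.
        blocks = mask.reshape(side // size, size, side // size, size)
        occupied = int(blocks.any(axis=(1, 3)).sum())
        sizes.append(size)
        counts.append(occupied)
        size //= 2
    # The slope of log N(s) versus log(1/s) is the dimension estimate.
    slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
    return float(slope)


filled = np.ones((64, 64), dtype=bool)       # a filled square is two-dimensional
line = np.zeros((64, 64), dtype=bool)
line[32, :] = True                           # a straight line is one-dimensional

print(box_counting_dimension(filled), box_counting_dimension(line))
```

Real solar images would first need thresholding or edge detection to obtain a binary mask, and that choice affects the estimate.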
Tamura texture features (Tamura et al., 1978), which appear under the
spelling "Taruma" in the variable names of this study, include
contrast, directionality, coarseness, linelikeness, regularity, and
roughness. They provide a detailed characterization of texture within
images, capturing fine details and variations essential for
understanding solar dynamics. These features have proven effective in
various machine learning applications, such as texture classification
and image analysis (Aggarwal and Kumar, 2021; Salem and Abdelkrim,
2020). In the context of solar image analysis, Tamura features can
highlight critical patterns and structures that correlate with solar
activity, improving the model’s ability to predict sunspot numbers
accurately.
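As an example of how one of these texture descriptors is computed, the sketch below implements the contrast feature as defined by Tamura et al. (1978): the standard deviation of the grey levels divided by the fourth root of their kurtosis. Whether the "Taruma contrast" column of the dataset follows exactly this definition is an assumption. The other five features (coarseness, directionality, linelikeness, regularity, roughness) require neighbourhood operations and are not shown.

```python
import numpy as np


def tamura_contrast(image: np.ndarray) -> float:
    """Tamura contrast: sigma / kurtosis**(1/4), with kurtosis = mu4 / sigma**4."""
    pixels = image.astype(np.float64).ravel()
    variance = pixels.var()
    if variance == 0.0:
        return 0.0                            # a uniform image has no contrast
    mu4 = np.mean((pixels - pixels.mean()) ** 4)
    kurtosis = mu4 / variance**2
    return float(np.sqrt(variance) / kurtosis**0.25)


flat = np.full((8, 8), 100, dtype=np.uint8)
checkerboard = (np.indices((8, 8)).sum(axis=0) % 2 * 255).astype(np.uint8)

print(tamura_contrast(flat), tamura_contrast(checkerboard))
```

For the 0/255 checkerboard the kurtosis is 1, so the contrast equals the standard deviation, 127.5.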
Overall, the integration of these topological variables enhances the
robustness and accuracy of regression prediction models by providing
a richer, multi-dimensional view of the data. Studies have shown that
combining multiple types of descriptors, such as Shannon entropy
and fractal dimensions, can significantly improve the performance of
machine learning models in different applications (Kozak and Juszczuk,
2023; Saroughi et al., 2024; Keller, 2019; Guha and Velegol, 2023).
By incorporating these advanced features, our models achieve higher
predictive accuracy, as evidenced by the substantial improvement in
r2-scores and reduction in mean squared errors. This approach not
only advances our understanding of solar dynamics but also sets a
new standard for predictive modeling in space weather forecasting,
highlighting the potential for further improvements through the use
of sophisticated data descriptors. This indicates that these topological
features capture essential non-linear patterns and relationships in the
solar images, which are not adequately represented by heliospheric
variables alone.
Future work should explore the application of deep learning models,
such as Convolutional Neural Networks (CNNs) and Recurrent Neural
Networks (RNNs), to further enhance the prediction capabilities. These
models are well-suited to capture complex spatial and temporal depen-
dencies in data, respectively. For instance, CNNs can be employed to
automatically extract high-level features from solar images, while RNNs
or Long Short-Term Memory (LSTM) networks can model temporal
sequences of sunspot activity, potentially leading to even more accurate
and robust forecasts (Chollet, 2021).
Advanced Feature Engineering
Another promising direction is to refine and expand the set of topo-
logical features. Higher-order statistics, advanced texture descriptors
like Local Binary Patterns (LBP), and wavelet transform coefficients
could provide additional valuable insights. These features can capture
more intricate details of the solar images’ texture and structure, con-
tributing to a more comprehensive understanding of solar dynamics.
Research in image processing and computer vision suggests that these
advanced features often enhance the performance of machine learning
models in various applications (Chollet, 2021; He et al., 2016; Guo
et al., 2022).
Developing hybrid models that combine different machine learning
techniques can also be beneficial. For instance, ensemble methods
that integrate predictions from traditional machine learning models
and deep learning models could leverage the strengths of both ap-
proaches. Techniques like stacking or blending, where multiple models
are trained and their outputs combined, could provide a more robust
and accurate prediction system. This approach has been shown to
improve model performance in various predictive tasks (De Alwis and
Samadi, 2024).
Temporal Dynamics and Sequence Modeling
Incorporating models designed to handle temporal dynamics, such
as Temporal Convolutional Networks (TCNs) or Transformer models,
could further improve the predictive accuracy of sunspot numbers.
These models are particularly adept at capturing long-term dependen-
cies and trends, which are critical in forecasting sunspot activity. The
Transformer model, with its attention mechanisms, has been especially
successful in various sequence modeling tasks and could be adapted for
sunspot prediction.
Moreover, improving the interpretability of the models is crucial for
gaining insights into the underlying physical processes of solar activity.
Techniques such as SHAP (SHapley Additive exPlanations) and LIME
(Local Interpretable Model-agnostic Explanations) can help elucidate
the contributions of individual features to the model’s predictions. This
understanding can guide further feature engineering efforts and provide
valuable information to solar physicists (Wang et al., 2021).
Lastly, integrating topological variables with other types of solar
data, such as solar flare occurrences and coronal mass ejections, could
provide a more holistic view of solar activity. This multimodal ap-
proach can enhance the models’ ability to predict not only sunspot
numbers but also other related phenomena, thereby contributing to a
comprehensive space weather forecasting system (Benz, 2017).
4. Conclusions and future directions
Based on the findings of this study, the following conclusions can
be drawn:
The study demonstrates that the inclusion of topological variables,
in addition to the four primary heliospheric variables (solar wind flow
speed, proton density, proton temperature, and IMF), significantly
enhances the performance of regression models. In general, the r2-score
improves from an average of about 0.3 to over 0.9 (Table 2), and the
variance explained by the models increases from approximately 0.3 to
0.93. The most important variables for the regression models include
IMF, flow speed, proton density, proton temperature, entropy, standard
deviation, fractal dimension, Taruma roughness, and Taruma regularity
for HMIIGR and HMIMAG images. For EIT171, EIT195, EIT284, and EIT304
images, the significant variables also include relative smoothness,
Taruma directionality, Taruma linelikeness, kurtosis, and Taruma
coarseness.
The Extra Trees Regressor (ET) emerges as the best model among all
those evaluated in this study, achieving r2-scores between 0.94 and
0.97 across all types of supporting images when topological variables
are included (Table 2).
The evaluation of model performance was based on four metrics: r2-
score, root mean squared error (RMSE), explained variance score (EVS),
and mean Tweedie deviance (MTD). Cross-validation and grid search
were employed for hyperparameter optimization. The results indicate
that model performance varies depending on the type of data used, with
certain models performing better than others for specific datasets.
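A minimal sketch of cross-validated grid search for an Extra Trees Regressor is given below. The parameter grid, the five-fold shuffled split and the synthetic data are assumptions chosen for illustration; they are not the settings used in this study.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data (hypothetical).
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
y = 20.0 + 8.0 * X[:, 0] + 3.0 * X[:, 1] ** 2 + rng.normal(scale=1.0, size=300)

# Small illustrative grid; every combination is scored with 5-fold cross-validation.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 8],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    ExtraTreesRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",   # scikit-learn maximizes, hence the sign
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_              # flip the sign back to a plain RMSE
print(best_params, round(best_rmse, 3))
```

For strongly autocorrelated series, an ordered scheme such as scikit-learn's `TimeSeriesSplit` may give a more conservative estimate than shuffled folds.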
In conclusion, this study provides valuable insights into the use of
machine learning algorithms for predicting sunspot numbers based on
heliospheric dynamics data and topological and spectral variables. The
results confirm that the inclusion of topological variables significantly
improves model performance, with the Extra Trees Regressor being
the most effective model among those considered. The study also
emphasizes the importance of using appropriate evaluation metrics
and hyperparameter optimization techniques to develop accurate and
reliable models.
Future research directions include exploring the use of Long Short-
Term Memory (LSTM) networks for sunspot forecasting and prediction.
The authors plan to use sunspot time series data to train and test LSTM
network models. The objective is to achieve a deeper understanding of
sunspot forecasting and to evaluate the effectiveness of LSTM networks
in this context. The process will involve the general steps of time series
analysis, data preparation, and model training using appropriate LSTM
network configurations.
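The data preparation step mentioned above usually means turning the sunspot series into overlapping input windows with a target value some steps ahead. The sketch below shows this windowing in NumPy, producing arrays in the (samples, time steps, features) layout that recurrent layers commonly expect. The function name `make_windows`, the lookback of 3 and the stand-in series are assumptions, and no LSTM model is trained here.

```python
import numpy as np


def make_windows(series, lookback: int, horizon: int = 1):
    """Turn a 1-D series into (samples, lookback, 1) inputs and (samples,) targets."""
    series = np.asarray(series, dtype=np.float64)
    n_samples = len(series) - lookback - horizon + 1
    if n_samples <= 0:
        raise ValueError("series too short for the requested lookback and horizon")
    # Each input holds `lookback` consecutive values ...
    X = np.stack([series[i : i + lookback] for i in range(n_samples)])
    # ... and each target is the value `horizon` steps after the end of its window.
    first_target = lookback + horizon - 1
    y = series[first_target : first_target + n_samples]
    return X[..., np.newaxis], y


ssn = np.arange(10, dtype=float)             # stand-in for a sunspot-number series
X_seq, y_seq = make_windows(ssn, lookback=3)
print(X_seq.shape, y_seq.shape)
```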
CRediT authorship contribution statement
D. Sierra-Porta: Writing – review & editing, Writing – original
draft, Visualization, Validation, Supervision, Software, Project admin-
istration, Methodology, Investigation, Formal analysis, Data curation,
Conceptualization. M. Tarazona-Alvarado: Writing – original draft,
Software, Formal analysis, Data curation. D.D. Herrera Acevedo: Soft-
ware, Methodology, Formal analysis, Data curation.
Declaration of competing interest
The authors declare the following financial interests/personal rela-
tionships which may be considered as potential competing interests:
David Sierra Porta reports equipment, drugs, or supplies was provided
by Universidad Tecnológica de Bolívar. If there are other authors,
they declare that they have no known competing financial interests or
personal relationships that could have appeared to influence the work
reported in this paper.
Data availability
The data are already available online, and we share the links to the
datasets used in this paper.
Acknowledgments
DSP is grateful to the Universidad Tecnológica de Bolívar for the
support received during the research process. While the research did
not receive any financial contribution from internal or external fund-
ing sources, the Research Direction provided logistical support and
computational equipment for the development of this study.
References
Aggarwal, A., Kumar, M., 2021. Image surface texture analysis and classification using
deep learning. Multimedia Tools Appl. 80 (1), 1289–1309. doi:10.1007/s11042-
020-09520-2.
Alexakis, P., Mavromichalaki, H., 2019. Statistical analysis of interplanetary coronal
mass ejections and their geoeffectiveness during the solar cycles 23 and 24.
Astrophys. Space Sci. 364 (11), 187. doi:10.1007/s10509-019-3677-y.
Amadasun, M., King, R., 1989. Textural features corresponding to textural properties.
IEEE Trans. Syst. Man Cybern. 19 (5), 1264–1274. doi:10.1109/21.44046.
Barentine, J.C., 2022. Night sky brightness measurement, quality assessment and
monitoring. Nat. Astron. 6 (10), 1120–1132. doi:10.1038/s41550-022-01756-2.
Benz, A.O., 2017. Flare observations. Living Rev. Solar Phys. 14, 1–59. doi:10.1007/
s41116-016-0004-3.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32. doi:10.1023/A:
1010933404324.
Burt, J., Smith, B., 2012. Deep space climate observatory: The DSCOVR mission.
In: 2012 IEEE Aerospace Conference. IEEE, pp. 1–13. doi:10.1109/AERO.2012.
6187025.
Chen, T., Guestrin, C., 2016. XGBoost: A scalable tree boosting system. In: Proceedings
of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining. pp. 785–794. doi:10.1145/2939672.2939785.
Chollet, F., 2021. Deep learning with Python. Simon and Schuster, ISBN:
9781617294433.
Cliver, E.W., Pötzi, W., Veronig, A.M., 2022. Large sunspot groups and great magnetic
storms: magnetic suppression of CMEs. Astrophys. J. 938 (2), 136. doi:10.3847/
1538-4357/ac847d.
Dang, Y., Chen, Z., Li, H., Shu, H., 2022. A comparative study of non-deep learning,
deep learning, and ensemble learning methods for sunspot number prediction. Appl.
Artif. Intell. 36 (1), 2074129. doi:10.1080/08839514.2022.2074129.
Dani, T., Sulistiani, S., 2019. Prediction of maximum amplitude of solar cycle 25 using
machine learning. In: J. Phys. Conf. Series. 1231, (1), IOP Publishing, 012022.
doi:10.1088/1742-6596/1231/1/012022.
De Alwis, T.P., Samadi, S.Y., 2024. Stacking-based neural network for nonlinear time
series analysis. Stat. Methods Appl. 1–24. doi:10.1007/s10260-024-00746-0.
Delaboudiniere, J.-P., Artzner, G., Brunaud, J., Gabriel, A.H., Hochedez, J.-F., Millier, F.,
Song, X., Au, B., Dere, K., Howard, R.A., et al., 1995. EIT: extreme-ultraviolet
imaging telescope for the SOHO mission. SOHO Mission 291–312. doi:10.1007/
978-94-009-0191-9_8.
Drews, A., Huo, W., Matthes, K., Kodera, K., Kruschke, T., 2022. The sun’s role in
decadal climate predictability in the north atlantic. Atmos. Chem. Phys. 22 (12),
7893–7904. doi:10.5194/acp-22-7893-2022.
Floyd, L., Tobiska, W.K., Cebula, R.P., 2002. Solar UV irradiance, its variation, and its
relevance to the earth. Adv. Space Res. 29 (10), 1427–1440. doi:10.1016/S0273-
1177(02)00202-8.
Freund, Y., Schapire, R.E., 1997. A decision-theoretic generalization of on-line learning
and an application to boosting. J. Comput. Syst. Sci. 55 (1), 119–139. doi:10.1006/
jcss.1997.1504.
Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine.
Ann. Statist. 1189–1232, https://www.jstor.org/stable/2699986.
Geurts, P., Ernst, D., Wehenkel, L., 2006. Extremely randomized trees. Mach. Learn.
63, 3–42. doi:10.1007/s10994-006-6226-1.
Gour, P.S., Singh, N.P., Soni, S., Saini, S.M., 2021. Observation of coronal mass ejections
in association with sun spot number and solar flares. In: IOP Conference Series:
Materials Science and Engineering, Vol. 1120, No. 1. IOP Publishing, 012020.
doi:10.1088/1757-899X/1120/1/012020.
Grauer, A.D., Grauer, P.A., 2021. Linking solar minimum, space weather, and night sky
brightness. Sci. Rep. 11 (1), 23893. doi:10.1038/s41598-021-02365-1.
Guha, R., Velegol, D., 2023. Harnessing Shannon entropy-based descriptors in machine
learning models to enhance the prediction accuracy of molecular properties. J.
Cheminformat. 15 (1), 54. doi:10.1186/s13321-023-00712-0.
Guo, M.-H., Xu, T.-X., Liu, J.-J., Liu, Z.-N., Jiang, P.-T., Mu, T.-J., Zhang, S.-H.,
Martin, R.R., Cheng, M.-M., Hu, S.-M., 2022. Attention mechanisms in computer
vision: A survey. Comput. Visual Media 8 (3), 331–368. doi:10.1007/s41095-022-
0271-y.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recog-
nition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. pp. 770–778. doi:10.1109/CVPR.2016.90.
Hill, M., Allen, R., Kollmann, P., Brown, L., Decker, R., McNutt, R., Krimigis, S.,
Andrews, G., Bagenal, F., Clark, G., et al., 2020. Influence of solar disturbances
on galactic cosmic rays in the solar wind, heliosheath, and local interstellar
medium: Advanced composition explorer, new horizons, and voyager observations.
Astrophys. J. 905 (1), 69. doi:10.3847/1538-4357/abb408.
Keller, K., 2019. Entropy measures for data analysis: Theory, algorithms and
applications. Entropy 21 (10), 935. doi:10.3390/e23111496.
Khan, T., Arafat, F., Mojumdar, M.U., Rajbongshi, A., Siddiquee, S.M.T.,
Chakraborty, N.R., 2020. A machine learning approach for predicting the
sunspot of solar cycle. In: 2020 11th International Conference on Computing,
Communication and Networking Technologies. ICCCNT, IEEE, pp. 1–4.
doi:10.1109/ICCCNT49239.2020.9225427.
Kohl, J.L., Esser, R., Gardner, L.D., Habbal, S., Daigneau, P.S., Dennis, E., Nystrom, G.,
Panasyuk, A., Raymond, J., Smith, P., et al., 1995. The ultraviolet coronagraph
spectrometer for the solar and heliospheric observatory. SOHO Mission 313–356.
doi:10.1007/978-94-009-0191-9_9.
Kozak, J., Juszczuk, P., 2023. Entropy in Real-World Datasets and Its Impact on Ma-
chine Learning. MDPI-Multidisciplinary Digital Publishing Institute, doi:10.3390/
books978-3-0365-7849-1.
Le Mouël, J.-L., Lopes, F., Courtillot, V., 2019. A solar signature in many climate
indices. J. Geophys. Res.: Atmos. 124 (5), 2600–2619. doi:10.1029/2018JD028939.
Li, Q., Wan, M., Zeng, S.-G., Zheng, S., Deng, L.-H., 2021. Predicting the 25th solar cycle
using deep learning methods based on sunspot area data. Res. Astron. Astrophys.
21 (7), 184. doi:10.1088/1674-4527/21/7/184.
Mahdi, M.M., Tipu, M.A.N., Halder, C., Rahman, K.F., 2019. Comparative analysis
of prediction of coronal mass ejections (CME) based on sunspot activities using
various machine learning models. In: 2019 International Conference on Robotics,
Electrical and Signal Processing Techniques. ICREST, IEEE, pp. 588–591. doi:
10.1109/ICREST.2019.8644272.
Marshak, A., Herman, J., Adam, S., Karin, B., Carn, S., Cede, A., Geogdzhayev, I.,
Huang, D., Huang, L.-K., Knyazikhin, Y., et al., 2018. Earth observations from
DSCOVR EPIC instrument. Bull. Am. Meteorol. Soc. 99 (9), 1829–1850. doi:10.
1175/BAMS-D-17-0223.1.
Nandy, D., 2021. Progress in solar cycle predictions: Sunspot cycles 24–25 in per-
spective: Invited review. Sol. Phys. 296 (3), 54. doi:10.1007/s11207-021-01797-
2.
National Research Council and Division on Engineering and Physical Sciences and
Aeronautics and Space Engineering Board and Space Studies Board and Committee
on a Decadal Strategy for Solar and Space Physics (Heliophysics), 2013. Solar and
space physics: A science for a technological society. National Academies Press,
doi:10.17226/13060.
National Research Council and Division on Engineering and Physical Sciences and
Space Studies Board and Committee on the Societal and Economic Impacts of
Severe Space Weather Events and A Workshop, 2009. Severe space weather
events: Understanding societal and economic impacts: A workshop report. National
Academies Press, doi:10.17226/12507.
Pala, Z., Atici, R., 2019. Forecasting sunspot time series using deep learning methods.
Sol. Phys. 294 (5), 50. doi:10.1007/s11207-019-1434-6.
Prasad, A., Roy, S., Sarkar, A., Panja, S.C., Patra, S.N., 2023. An improved prediction
of solar cycle 25 using deep learning based neural network. Sol. Phys. 298 (3), 50.
doi:10.1007/s11207-023-02129-2.
Ramezan, C.A., Warner, T.A., Maxwell, A.E., 2019. Evaluation of sampling and
cross-validation tuning strategies for regional-scale machine learning classification.
Remote Sens. 11 (2), 185. doi:10.3390/rs11020185.
Salem, Y.B., Abdelkrim, M.N., 2020. Texture classification of fabric defects using
machine learning. Int. J. Electr. Comput. Eng. 10 (4), 4390. doi:10.11591/ijece.
v10i4.pp4390-4399.
Saroughi, M., Mirzania, E., Achite, M., Katipoğlu, O.M., Ehteram, M., 2024. Shannon
entropy of performance metrics to choose the best novel hybrid algorithm to predict
groundwater level (case study: Tabriz plain, Iran). Environ. Monit. Assess. 196 (3),
1–20. doi:10.1007/s10661-024-12357-z.
Schaffer, C., 1993. Selecting a classification method by cross-validation. Mach. Learn.
13, 135–143. doi:10.1007/BF00993106.
Scherrer, P.H., Schou, J., Bush, R., Kosovichev, A., Bogart, R., Hoeksema, J., Liu, Y.,
Duvall, T., Zhao, J., Title, A., et al., 2012. The helioseismic and magnetic imager
(HMI) investigation for the solar dynamics observatory (SDO). Sol. Phys. 275,
207–227. doi:10.1007/s11207-011-9834-2.
Schou, J., Scherrer, P.H., Bush, R.I., Wachter, R., Couvidat, S., Rabello-Soares, M.C.,
Bogart, R., Hoeksema, J., Liu, Y., Duvall, T., et al., 2012. Design and ground
calibration of the helioseismic and magnetic imager (HMI) instrument on the solar
dynamics observatory (SDO). Sol. Phys. 275, 229–259. doi:10.1007/s11207-011-
9842-2.
Sierra Porta, D., Tarazona-Alvarado, M., 2023. Dataset: Sun dynamics from topological
features extraction. Mendeley Data, V1, doi:10.17632/5gh3xbvc92.1.
Singh, A., Bhargawa, A., 2020. Ascendancy of solar variability on terrestrial climate:
A review. J. Basic Appl. Sci 16, 105–130. doi:10.29169/1927-5129.2020.16.14.
Solomon, S.C., Liu, H.-L., Marsh, D.R., McInerney, J.M., Qian, L., Vitt, F.M., 2019.
Whole atmosphere climate change: Dependence on solar activity. J. Geophys. Res.
Space Phys. 124 (5), 3799–3809. doi:10.1029/2019JA026678.
Spencer, D.A., Johnson, L., Long, A.C., 2019. Solar sailing technology challenges.
Aerosp. Sci. Technol. 93, 105276. doi:10.1016/j.ast.2019.07.009.
Stone, E.C., Frandsen, A., Mewaldt, R., Christian, E., Margolies, D., Ormes, J., Snow, F.,
1998. The advanced composition explorer. Space Sci. Rev. 86, 1–22. doi:10.1023/A:
1005082526237.
Tamura, H., Mori, S., Yamawaki, T., 1978. Textural features corresponding to visual
perception. IEEE Trans. Syst. Man Cybern. 8 (6), 460–473. doi:10.1109/TSMC.1978.
4309999.
Tarazona-Alvarado, M., Sierra-Porta, D., 2023. Dataset for sun dynamics from
topological features. Data Brief 51, 109728. doi:10.1016/j.dib.2023.109728.
Tlatov, A.G., 2022. The shape of sunspots and solar activity cycles. Sol. Phys. 297 (8),
110. doi:10.1007/s11207-022-02045-x.
Tougui, I., Jilbab, A., El Mhamdi, J., 2021. Impact of the choice of cross-validation
techniques on the results of machine learning-based diagnostic applications. Healthc.
Informat. Res. 27 (3), 189–199. doi:10.4258/hir.2021.27.3.189.
Tsiropoula, G., 2003. Signatures of solar activity variability in meteorological
parameters. J. Atmospheric Solar-Terrestrial Phys. 65 (4), 469–482.
doi:10.1016/S1364-6826(02)00295-X.
Wang, J., Wiens, J., Lundberg, S., 2021. Shapley flow: A graph-based approach to
interpreting model predictions. In: International Conference on Artificial
Intelligence and Statistics. PMLR, pp. 721–729,
https://ui.adsabs.harvard.edu/link_gateway/2020arXiv201014592W/doi:10.48550/arXiv.2010.14592.
Weerts, H.J., Mueller, A.C., Vanschoren, J., 2020. Importance of tuning
hyperparameters of machine learning algorithms. doi:10.48550/arXiv.2007.07588, arXiv
preprint arXiv:2007.07588.
Wilson, III, L.B., Brosius, A.L., Gopalswamy, N., Nieves-Chinchilla, T., Szabo, A.,
Hurley, K., Phan, T., Kasper, J.C., Lugaz, N., Richardson, I.G., et al., 2021. A quarter
century of wind spacecraft discoveries. Rev. Geophys. doi:10.1029/2020RG000714.
Wu, C.-M., Chen, Y.-C., 1992. Statistical feature matrix for texture analysis. CVGIP,
Graph. Models Image Process. 54 (5), 407–419. doi:10.1016/1049-9652(92)90025-S.
Xiao, Z., 2021. A review of machine learning methods applied in sunspot prediction.
In: 2021 International Conference on Networking, Communications and Information
Technology. NetCIT, IEEE, pp. 158–161. doi:10.1109/NetCIT54147.2021.00039.
Yang, L., Shami, A., 2020. On hyperparameter optimization of machine learning
algorithms: Theory and practice. Neurocomputing 415, 295–316. doi:10.1016/j.neucom.2020.07.061.
Zhang, J., Temmer, M., Gopalswamy, N., Malandraki, O., Nitta, N.V., Patsourakos, S.,
Shen, F., Vršnak, B., Wang, Y., Webb, D., et al., 2021. Earth-affecting solar
transients: A review of progresses in solar cycle 24. Progr. Earth Planetary Sci.
8, 1–102. doi:10.1186/s40645-021-00426-7.