Enhancing Time Series Forecasting with Early Fusion Meta-Learning Strategies

1. Introduction

Forecasting, the art and science of making predictions based on historical and current data, is a cornerstone of informed decision-making across diverse fields. From anticipating future infection rates to projecting economic trends, accurate forecasts are invaluable. These predictions are then rigorously tested against real-world outcomes, refining our understanding and methodologies [1].

While time series forecasting presents inherent challenges, it remains a critical area of research and application. Makridakis et al. [2] astutely noted two fundamental truths about forecasting: perfect foresight is unattainable, and all predictions are inherently uncertain, especially within complex social systems.

Forecasting’s utility spans numerous domains where anticipating future conditions is paramount. Accuracy levels are highly context-dependent. When the factors influencing a forecast are well-defined and understood, and ample data is available, predictions tend to be more reliable. Conversely, when these conditions are absent, or when the forecast itself influences the outcome, prediction accuracy can diminish significantly. Applications are wide-ranging, including climate change modeling, proactive maintenance scheduling, anomaly detection, stock market forecasting, epidemic modeling, economic trend analysis, and conflict development prediction [3,4,5,6].

1.1. Model Fusion, Meta-Learning, and the Concept of Early Fusion

This section delves into the concepts of meta-learning and model fusion within forecasting, establishing a framework to contextualize the literature review and forecasting techniques discussed later. We begin by defining model fusion in the forecasting domain and distinguishing it from traditional data fusion. Then, we explore meta-learning in the context of model fusion for time series forecasting, particularly touching upon the nascent area of early fusion meta-learning.

Data fusion traditionally refers to the integration of multiple data sources – multimodal, multiresolution, multitemporal – to derive more robust, accurate, and insightful information than could be obtained from individual sources. These approaches are categorized by the processing stage at which fusion occurs: early fusion (data-level), late fusion (decision-level), and intermediate fusion (combining early and late strategies) [7].

Drawing an analogy, we define model fusion as the process of combining multiple base learners to create a forecasting model with reduced bias and variance, leading to more robust, consistent, and accurate predictions compared to individual models. These base learners can be homogeneous (from the same model family, like decision trees in a random forest) or heterogeneous (from different families, such as neural networks and support vector machines). Extending the data fusion analogy, we define model fusion types based on the timing of integration:

  • Early Model Fusion: Integrates base learners before the training phase. The combined architecture is then trained as a single, unified model. In the context of early fusion meta-learning, this would involve designing a meta-model architecture that inherently fuses information from different base models at the input level, learning optimal fusion strategies during the training process itself.

  • Late Model Fusion: Trains base learners independently. The pre-trained models are then combined without further parameter adjustments. This is more common in meta-learning for forecasting, where a meta-learner is trained to optimally combine the outputs of diverse pre-trained forecasting models.

  • Incremental Model Fusion: Integrates models during the training process, iteratively combining and refining them. Once a base learner’s parameters are set, they remain fixed.

Meta-learning, a crucial aspect of model fusion, generally falls into metric-based, model-based, and optimization-based categories. Our focus is on model-based meta-learning, which utilizes meta-data or meta-features about the forecasting problem, along with a meta-model, to enhance overall predictive performance. Stacked generalization, for instance, employs linear regression to weight the predictions of heterogeneous base learners [8]. We can then define model integration processes:

  • Meta-Model Fusion: Utilizes model-based meta-learning to guide the model integration process. This often involves training a meta-learner to optimally combine base model predictions based on meta-features.

  • Aggregation Fusion: Employs simpler aggregation methods, such as weighted averaging, for model integration, often without a learned meta-model.

Combining these concepts and common terminology leads to a taxonomy of model fusion techniques for forecasting, as illustrated in Figure 1. The subsequent literature review will explore forecasting model fusion within this framework.

Figure 1. An incomplete taxonomy of model fusion showing ensemble learning, boosting, parallel ensembles, stacking, deep learning, and hybrid models [9,10].

1.2. Related Literature

The diversity of forecasting models underscores the “no free lunch” theorem: no single algorithm universally outperforms all others across all problems [11]. This raises the critical question of algorithm selection for specific time series. While these techniques are rooted in diverse principles, numerous studies demonstrate that model fusion generally surpasses the performance of individual models [12,13,14].

Arinze [8] pioneered meta-learning in forecasting by using stacking to select the best model from six candidates based on learned performance across time series meta-features (e.g., autocorrelation, trend). Raftery et al. [15] highlighted the limitations of single model selection due to model uncertainty and proposed Bayesian model averaging for stacking. Bayesian model averaging creates weighted average predictions based on the posterior probability of each model’s correctness given the data. Stacking can also learn weights from heterogeneous base learner predictions using cross-validation [16]. Clarke [16] found cross-validation stacking more robust than Bayesian model averaging for computations sensitive to variable changes.

Stacking performance can be further enhanced by incorporating time series statistics as meta-features. Cawood and van Zyl [1] introduced feature-weighted stacking, improving regression from base learner predictions by augmenting them with time series statistics. Recent research has identified valuable meta-features, including linearity, curvature, autocorrelation, stability, and entropy [17,18,19].

Ribeiro and Coelho [9] examined contemporary ensemble methods in agribusiness forecasting, comparing bagging with Random Forests (RF), boosting with gradient boosting machines (XGBoost), and stacked generalization. Their findings indicated that gradient boosting generally yields the lowest prediction errors. The top two M4 forecasting competition submissions both utilized ensemble learning to achieve state-of-the-art results.

FFORMA, the M4 competition runner-up, employs stacking with extreme gradient boosting to learn weights for base learners based on time series meta-features [19]. These learned weights are then used to combine model predictions via feature-weighted averaging.

Hybrid approaches, combining linear and nonlinear models, are crucial as real-world time series are rarely purely linear or nonlinear [10]. Zhang [10] pioneered hybrid forecasting, demonstrating that ARIMA-Multilayer Perceptron (MLP) hybridization improves accuracy by leveraging the strengths of both linear and nonlinear models. Neural networks excel at nonlinear modeling. Zhang’s implementation used MLP to learn from ARIMA residuals for final integrated predictions. ES-RNN, the M4 Competition winning submission, is a hybrid model combining modified Holt-Winters and dilated Long Short-Term Memory (LSTM) stacks [20].

Hybrid methods are gaining momentum in time series forecasting research [21,22,23]. Deep learning combined with linear methods is a common approach, mirroring Zhang’s initial work. Many hybrids incorporate Artificial Neural Networks (ANNs) with traditional models [24,25,26]. Zhang et al. [27] attribute ANN popularity to their adaptability and ability to model underlying data relationships, even when unknown. However, fuzzy logic-machine learning hybrids are also emerging [28,29], as are integrations of machine learning with genetic algorithms like particle swarm and ant colony optimization [30,31].

Hybrid machine learning techniques are increasingly favored in time series forecasting due to their predictive accuracy [32]. Neural Basis Expansion Analysis (N-BEATS) currently achieves state-of-the-art accuracy, surpassing ES-RNN by 3% [33]. N-BEATS is a hybrid method integrating a fully connected neural network with time series decomposition. It features a novel doubly residual topology that stacks blocks of varying architectures to model different time series components. N-BEATS’ stacking topology resembles DenseNet [34], but with dual residual branches across the forward and backward prediction paths in each layer. In each layer, the neural network produces expansion coefficients that are projected onto the block’s basis functions, generating partial forecasts and backcasts that help downstream blocks filter out irrelevant input components.

1.3. Makridakis Competitions (M-Competitions)

The Makridakis Competitions (M-Competitions) are a driving force in forecasting method innovation, serving as a benchmark for evaluating forecasting techniques [2,35]. Makridakis et al. [2] observed that machine learning models underperformed in social science forecasting in M-Competitions prior to the M5 Competition.

In the M4 Competition, Makridakis et al. [36] conducted a large-scale comparison of 100,000 time series and 61 forecasting methods, evaluating statistical methods (ARIMA, Holt-Winters) against machine learning methods (MLP, RNN). Consistent with earlier findings, hybrid model fusion of heterogeneous base learners yielded the best results [12,13,14]. For a comprehensive comparison of statistical and machine learning forecast models, see Makridakis et al. [36].

The M5 Competition, the first in the series with hierarchical time series data [35], further solidified the dominance of meta-model fusion for homogeneous data. Top M5 entries, employing parallel ensemble learning and gradient boosting, achieved state-of-the-art forecasting results [37]. Ensembles are known to outperform constituent models in many machine learning problems [38]. Integrating models with diverse architectures has also shown superior performance [39,40].

This study revisits the M4 Competition to investigate whether M5 Competition findings on fusion strategies hold for univariate time series forecasting. While early fusion meta-learning is gaining traction in other domains, its application and comparison with late fusion techniques in time series forecasting, particularly with state-of-the-art hybrid models like ES-RNN, remains relatively unexplored. This paper aims to bridge this gap.

1.4. Contributions of This Paper

Reviewing the time series forecasting literature reveals a need to quantify the potential improvement from late model fusion with state-of-the-art hybrid models. Comparative research on different ensembles using the same pool of forecasts is also needed, as is validation of ensemble learning’s utility with advanced hybrid models. This work is novel in its application of late meta-model fusion to integrate traditional base learners with a hybrid model. Our contributions to the forecasting literature are:

  1. Presenting a novel taxonomy for organizing forecasting model fusion literature.
  2. Investigating the potential accuracy gains achievable by applying late fusion meta-learning to a state-of-the-art forecasting model like ES-RNN.
  3. Comparing the performance of diverse ensembling techniques across various architectures.
  4. Providing a fair comparison through validation results of ensembles across five runs of ten-fold cross-validation.

While the primary focus is on late fusion strategies, this research lays the groundwork for future explorations into early fusion meta-learning within time series forecasting. By establishing a comprehensive evaluation of late fusion techniques and demonstrating their effectiveness with hybrid models, we provide a valuable benchmark for subsequent studies investigating the potential of integrating meta-learning earlier in the model building process. This work emphasizes the importance of robust model fusion and meta-learning in advancing the field of time series forecasting and sets the stage for exploring the benefits of early fusion approaches.

2. Materials and Methods

This section details the M4 Competition, the base learners used (statistical and the hybrid deep learning ES-RNN), and the ensemble techniques implemented. We utilized the original M4 submission forecasts as inputs for our ensembles. The ensemble techniques varied in complexity and meta-learning strategies.

2.1. The M4 Forecasting Competition

The M4 forecasting competition, a significant instance of the M-Competitions, provided the dataset for this study. The M4 dataset comprises 100,000 time series across yearly, quarterly, monthly, weekly, daily, and hourly frequencies. The minimum number of observations varies by frequency (e.g., 13 for yearly, 16 for quarterly). The data are publicly available on GitHub (https://github.com/Mcompetitions/M4-methods/tree/master/Dataset, accessed 3 August 2022). Table 1 summarizes series counts per frequency and domain, encompassing economic, finance, demographics, industry, tourism, trade, labor and wage, real estate, transportation, natural resources, and environmental data.

Forecast horizons differ by frequency: 6 steps ahead for yearly, 8 for quarterly, 18 for monthly, 13 for weekly, 14 for daily, and 48 for hourly data. The M4 Competition’s primary evaluation metric is the Overall Weighted Average (OWA), detailed below.

OWA averages two key accuracy measures relative to the Naïve 2 benchmark: the Symmetric Mean Absolute Percentage Error (sMAPE) [41] and the Mean Absolute Scaled Error (MASE) [42].

sMAPE is calculated as:

$$\text{sMAPE} = \frac{1}{h} \sum_{t=1}^{h} \frac{2\,|Y_t - \hat{Y}_t|}{|Y_t| + |\hat{Y}_t|} \times 100\%$$

MASE is calculated as:

$$\text{MASE} = \frac{1}{h} \sum_{t=n+1}^{n+h} \frac{|Y_t - \hat{Y}_t|}{\frac{1}{n-m} \sum_{t=m+1}^{n} |Y_t - Y_{t-m}|}$$

where $Y_t$ is the actual value at time $t$, $\hat{Y}_t$ is the predicted value, $h$ is the forecast horizon, and $n$ is the number of in-sample data points. The MASE denominator is the in-sample mean absolute error of the Naïve 2 one-step-ahead predictions, and $m$ is the seasonal period (12 for monthly, 4 for quarterly, 24 for hourly, and 1 for non-seasonal frequencies).

OWA error is then:

$$\text{OWA} = \frac{1}{2} \left( \frac{\text{MASE}}{\text{MASE}_{\text{Naïve2}}} + \frac{\text{sMAPE}}{\text{sMAPE}_{\text{Naïve2}}} \right)$$
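
For illustration, a minimal NumPy sketch of these three measures as defined above; the Naïve 2 reference errors are assumed to be computed the same way from the benchmark's forecasts:

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE over the forecast horizon, in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(2.0 * np.abs(y_true - y_pred)
                   / (np.abs(y_true) + np.abs(y_pred))) * 100.0

def mase(y_insample, y_true, y_pred, m):
    """Out-of-sample MAE scaled by the in-sample MAE of the naive
    forecast with period m, following the formula above."""
    y_insample = np.asarray(y_insample, float)
    scale = np.mean(np.abs(y_insample[m:] - y_insample[:-m]))
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))) / scale

def owa(smape_model, mase_model, smape_naive2, mase_naive2):
    """Overall Weighted Average relative to the Naive 2 benchmark."""
    return 0.5 * (mase_model / mase_naive2 + smape_model / smape_naive2)
```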

2.2. Statistical Base Learners

This study utilized several statistical base learners, including ARIMA, Comb, Damped Exponential Smoothing, and Theta methods. These represent established time series forecasting techniques and serve as a benchmark against which to evaluate the performance of more complex methods and ensembles. Detailed descriptions of these methods are available in the original paper and standard forecasting textbooks.

2.3. ES-RNN Base Learner

The Exponential Smoothing-Recurrent Neural Network (ES-RNN) method combines Exponential Smoothing (ES) models with LSTM networks in a late fusion approach to enhance forecasting accuracy. Smyl [20] outlines three core components of ES-RNN: deseasonalization and adaptive normalization, forecast generation, and ensembling.

The first component involves on-the-fly series normalization and deseasonalization using ES decomposition. This decomposition also extracts level, seasonality, and second seasonality components for integration with RNN forecasts.

Secondly, ES-RNN allocates time series trend prediction to the RNN, leveraging its nonlinear modeling capabilities. Hybrid forecasting models often capitalize on the strengths of linear and nonlinear methods through integration [10,48].

Finally, ES-RNN ensembles forecasts from multiple hybrid model instances to mitigate parameter uncertainty [49] and benefit from model combination averaging [50]. These instances, with identical architectures, are trained independently with random initializations and, if feasible, different time series subsets. The final forecast is the average of predictions from the top-N performing models.

2.3.1. Preprocessing

ES-RNN employs on-the-fly deseasonalization using ES models tailored to time series seasonality: nonseasonal (yearly, daily), single-seasonal (monthly, quarterly, weekly), and double-seasonal (hourly) [51]. Formulas are adapted to remove the linear trend component, which is modeled by the RNN.

For nonseasonal models:

$$l_t = \alpha\, y_t + (1 - \alpha)\, l_{t-1}$$

For single-seasonality models:

$$l_t = \alpha\, \frac{y_t}{s_t} + (1 - \alpha)\, l_{t-1}$$

$$s_{t+K} = \beta\, \frac{y_t}{l_t} + (1 - \beta)\, s_t$$

For double-seasonality models:

$$l_t = \alpha\, \frac{y_t}{s_t u_t} + (1 - \alpha)\, l_{t-1}$$

$$s_{t+K} = \beta\, \frac{y_t}{l_t u_t} + (1 - \beta)\, s_t$$

$$u_{t+L} = \gamma\, \frac{y_t}{l_t s_t} + (1 - \gamma)\, u_t$$

where $y_t$ is the series value at time $t$; $l_t$, $s_t$, and $u_t$ are the level, seasonality, and second-seasonality components; $K$ is the number of observations per seasonal period (4 for quarterly, 12 for monthly, 52 for weekly); and $L$ is the number of observations per second seasonal period (168 for hourly data).
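
A minimal sketch of the single-seasonality updates above; the initialisations of the level and the first $K$ seasonal factors are illustrative assumptions, not Smyl's exact scheme:

```python
import numpy as np

def single_seasonal_es(y, alpha, beta, K):
    """Single-seasonality ES updates (trend omitted; it is left to the RNN)."""
    n = len(y)
    level = np.empty(n)
    season = np.ones(n + K)   # s_1..s_{n+K}, initialised to 1.0 (assumption)
    prev_level = y[0]         # simple level initialisation (assumption)
    for t in range(n):
        l_t = alpha * y[t] / season[t] + (1 - alpha) * prev_level
        season[t + K] = beta * y[t] / l_t + (1 - beta) * season[t]
        level[t] = prev_level = l_t
    return level, season
```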

Normalization in ES-RNN uses constant-size rolling input and output windows to normalize the level and seasonality components produced by the ES equations above. The input window size is determined experimentally, and the output window size equals the forecast horizon. Input and output window values are divided by the last level value of the input window and, for seasonal series, additionally by the seasonality component. The log function is then applied to mitigate outlier effects. The time series domain (e.g., micro, macro, finance) is one-hot encoded and used as the RNN model’s sole meta-feature.

2.3.2. Forecasts by the RNN

The RNN receives normalized, deseasonalized, and squashed level and seasonality values along with meta-features. RNN outputs are integrated to complete ES-RNN hybridization.

For nonseasonal models:

$$\hat{Y}_{t+1, \ldots, t+h} = \exp(\text{RNN}(x)) \cdot l_t$$

For single-seasonal models:

$$\hat{Y}_{t+1, \ldots, t+h} = \exp(\text{RNN}(x)) \cdot l_t \cdot s_{t+1, \ldots, t+h}$$

For double-seasonal models:

$$\hat{Y}_{t+1, \ldots, t+h} = \exp(\text{RNN}(x)) \cdot l_t \cdot s_{t+1, \ldots, t+h} \cdot u_{t+1, \ldots, t+h}$$

where $\text{RNN}(x)$ models the trend from the preprocessed input vector $x$, and $h$ is the forecast horizon. The level ($l$), seasonality ($s$), and second-seasonality ($u$) components are derived from the ES models during preprocessing.
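
A sketch of this integration step, assuming the RNN output, last level value, and future seasonal factors are already available:

```python
import numpy as np

def integrate_forecast(rnn_output, last_level, season=None, season2=None):
    """Reverse ES-RNN's preprocessing: exp() undoes the log squashing, then
    the forecast is rescaled by the last level and any seasonal factors."""
    y_hat = np.exp(rnn_output) * last_level
    if season is not None:     # single-seasonal series (e.g., monthly)
        y_hat = y_hat * season
    if season2 is not None:    # double-seasonal series (e.g., hourly)
        y_hat = y_hat * season2
    return y_hat
```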

2.3.3. The Architecture

ES-RNN’s RNNs are built from stacks of dilated LSTMs [52], sometimes followed by a nonlinear layer and always by a final linear “adapter” layer. This adapter layer adjusts the output size of the last layer to the forecast horizon, or to twice that size for prediction interval (PI) models. Dilated LSTMs improve performance with fewer parameters [52]. Smyl extends the dilated LSTMs with an attention mechanism that exposes each hidden state to weighted previous states over a horizon equal to the dilation, implemented by embedding a linear two-layer network into the LSTM cell.

2.3.4. Loss Function and Optimizer

ES-RNN uses a pinball loss function for model fitting via stochastic gradient descent:

$$L_t = \begin{cases} (Y_t - \hat{Y}_t)\,\tau, & \text{if } Y_t \ge \hat{Y}_t \\ (\hat{Y}_t - Y_t)\,(1 - \tau), & \text{if } \hat{Y}_t > Y_t \end{cases}$$

where $\tau$ is typically set between 0.45 and 0.49. The asymmetric pinball function penalizes under- and over-prediction differently, minimizing bias and effectively performing quantile regression.
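
A minimal NumPy sketch of this loss; $\tau = 0.48$ is one illustrative value inside the quoted range:

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau=0.48):
    """Mean pinball loss; tau below 0.5 penalises over-prediction slightly
    more, as described above."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.where(diff >= 0, diff * tau, -diff * (1.0 - tau)))
```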

A level variability penalty acts as a regularizer that smooths the level values in the loss function, significantly improving hybrid method performance by focusing the RNN on trend modeling and preventing overfitting to seasonality. The penalty is calculated by: (i) computing the log changes $d_t = \log(Y_{t+1} / Y_t)$; (ii) computing the change differences $e_t = d_{t+1} - d_t$; (iii) squaring and averaging the differences; and (iv) multiplying by a constant (50–100) before adding the result to the loss function.
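
A sketch of the penalty following steps (i)–(iv), assuming positive level values; the weight of 75 is one illustrative value inside the quoted 50–100 range:

```python
import numpy as np

def level_variability_penalty(levels, weight=75.0):
    """Smoothness regulariser on the ES level series."""
    d = np.diff(np.log(levels))      # (i) log changes d_t
    e = np.diff(d)                   # (ii) change differences e_t
    return weight * np.mean(e ** 2)  # (iii)-(iv) squared mean times constant
```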

2.4. Simple Model Averaging (AVG)

As a baseline, we implemented model averaging, a technique that averages the forecasts of weak learners [53]. The combined forecast for a set $M$ of $n$ models is:

$$\hat{Y}_t = \frac{1}{n} \sum_{m=1}^{n} \hat{Y}_t^{(m)}$$

where $n$ is the number of models in $M$, and $\hat{Y}_t^{(m)}$ is the forecast of model $m$ at time $t$.

2.5. FFORMA

Feature-Based FORecast Model Averaging (FFORMA) employs a feature-weighted model averaging strategy [19]. A meta-learner learns model effectiveness across meta-feature space regions and combines predictions based on learned weights. FFORMA utilizes XGBoost gradient tree boosting, noted for computational efficiency and promising results [54].

Montero-Manso et al.’s [19] meta-learning methodology uses nine models and time series features measuring characteristics like lag, correlation, seasonality strength, and spectral entropy.

FFORMA uses a custom XGBoost objective function, for which XGBoost requires the gradient and Hessian. Let $p_m(f_n)$ be the meta-learner output for model $m$ given the meta-features $f_n$ of series $n$. A softmax transformation computes the model weights as the probability of each model being the best:

$$w_m(f_n) = \frac{\exp(p_m(f_n))}{\sum_{m'} \exp(p_{m'}(f_n))}$$

where $w_m(f_n)$ is the weight for base learner $m \in M$, and $L_{nm}$ denotes the contribution of method $m$ to the OWA error for time series $n$. The weighted average loss function is:

$$\bar{L}_n = \sum_{m=1}^{M} w_m(f_n)\, L_{nm}$$

where $\bar{L}_n$ is the weighted average loss over the base learners in $M$. The gradient of $\bar{L}_n$ is:

$$G_{nm} = \frac{\partial \bar{L}_n}{\partial p_m(f_n)} = w_{nm} (L_{nm} - \bar{L}_n)$$

The Hessian is:

$$H_{nm} = \frac{\partial G_{nm}}{\partial p_m(f_n)} \approx w_{nm} \left( L_{nm} (1 - w_{nm}) - G_{nm} \right)$$

To minimize $\bar{L}$, the functions $G$ and $H$ are passed to XGBoost, and the hyperparameters are found using Bayesian optimization within a search space limited based on preliminary results and heuristics.
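
A NumPy sketch of this gradient and Hessian computation, following the derivation above; how the per-series, per-model losses $L_{nm}$ are wired into XGBoost's custom-objective interface is deliberately left out here:

```python
import numpy as np

def fforma_grad_hess(preds, losses):
    """Gradient G_nm and approximate Hessian H_nm of the weighted average
    loss. `preds` holds the raw meta-learner outputs p_m(f_n) and `losses`
    the per-series OWA contributions L_nm; both are (n_series, n_models)."""
    z = preds - preds.max(axis=1, keepdims=True)          # stabilised softmax
    w = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)  # weights w_nm
    l_bar = (w * losses).sum(axis=1, keepdims=True)       # weighted loss per series
    grad = w * (losses - l_bar)                           # G_nm
    hess = w * (losses * (1.0 - w) - grad)                # H_nm (approximation)
    return grad, hess
```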

FFORMA combines predictions using Algorithm 1. Forecasts are generated from the meta-learner’s estimated weights given the meta-features of a new time series. Fusion uses the weight vector $w(f_x) \in \mathbb{R}^M$ over the $M$ forecasting models for each time series $x$. Figure 2 shows the FFORMA forecasting pipeline.

Figure 2. The FFORMA forecasting pipeline.

Algorithm 1: FFORMA’s forecast combination [19]
Input: Time series $x$; forecasting models $M = \{M_1, \ldots, M_m\}$; meta-learner $g$
Output: Combined forecast $\hat{Y}$
1: Extract meta-features $f_x = f(x)$
2: Estimate model weights $w(f_x) = g(f_x)$
3: Generate base forecasts $\hat{Y}_i = M_i(x)$ for $i = 1, \ldots, m$
4: Combine forecasts: $\hat{Y} = \sum_{i=1}^{m} w_i(f_x)\, \hat{Y}_i$
5: Return $\hat{Y}$
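
A direct Python rendering of Algorithm 1, where the fitted models, meta-learner, and feature extractor are placeholders for the components described in the text:

```python
import numpy as np

def fforma_combine(x, models, meta_learner, extract_features):
    """Weight base forecasts by the meta-learner's output for the series'
    meta-features, as in Algorithm 1."""
    f_x = extract_features(x)                          # step 1: meta-features
    w = meta_learner(f_x)                              # step 2: weights, shape (m,)
    forecasts = np.stack([m_i(x) for m_i in models])   # step 3: (m, horizon)
    return (w[:, None] * forecasts).sum(axis=0)        # step 4: weighted sum
```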

The meta-features for each time series are computed by the function $f$. FFORMA extracts 43 meta-features using the R package tsfeatures [55]. For nonseasonal time series, the features specific to seasonal series are set to zero. Unlike ES-RNN, FFORMA does not use the M4 dataset’s domain-specific features.

2.6. Random Forest Feature-Based FORecast Model Selection (FFORMS-R)

FFORMS-R, FFORMA’s precursor, learns to select a single model from a pool based on performance over a meta-data feature space [56]. It uses a random forest ensemble learner [57] to classify the most relevant model for given time series features. Following the original FFORMS, we use Gini impurity for split quality. The same meta-features as FFORMA are used.

2.7. Gradient Boosting Feature-Based FORecast Model Selection (FFORMS-G)

FFORMS-R demonstrates the benefit of selecting a single model [56]. We propose replacing FFORMA’s weighted averaging with selecting the model with the highest allocated weight, allowing a direct comparison of model selection against weighted averaging. The change amounts to replacing step 4 of Algorithm 1: instead of weighted averaging, the model with the highest weight is selected, as sketched below.
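
The modified step reduces to an argmax over the learned weights:

```python
import numpy as np

def fforms_g_select(w, forecasts):
    """FFORMS-G's replacement for step 4 of Algorithm 1: return the forecast
    of the single base learner with the largest meta-learner weight."""
    return forecasts[np.argmax(w)]
```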

2.8. Neural Network Model Stacking (NN-STACK)

Cawood and van Zyl [1] proposed model stacking for nonseasonal time series forecasting, performing regression over a feature space of ensemble model forecasts and time series statistics. This paper extends their study to more meta-features and a larger dataset with diverse domains and seasonalities.

Single-timestep forecasts are combined using regression over a feature space of base learner forecasts and extracted meta-features: an MLP regresses on the model forecasts and meta-features, with the actual series values as targets. A basic neural network architecture is used with ReLU [58] activation in each layer. The model is trained with mini-batches using the Adam [59] algorithm and a mean absolute error loss.

This research extends the original NN-STACK by incorporating the broader set of meta-features also used by FFORMA. Feature selection uses Spearman’s rank correlation coefficient ($\rho$) [60] between each meta-feature and the change in model performance, reducing overfitting by retaining only the most correlated features.
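
A sketch of this selection step using SciPy; the cut-off `top_k` is a hypothetical parameter, as the exact threshold is not stated here:

```python
import numpy as np
from scipy.stats import spearmanr

def select_meta_features(meta_features, perf_delta, top_k=10):
    """Keep the top_k meta-features ranked by |Spearman rho| against the
    change in model performance."""
    rho = [abs(spearmanr(meta_features[:, j], perf_delta).correlation)
           for j in range(meta_features.shape[1])]
    keep = np.argsort(rho)[::-1][:top_k]   # indices of the strongest features
    return meta_features[:, keep], keep
```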

2.9. Neural Network Feature-Based FORecast Model Averaging (FFORMA-N)

We propose an additional MLP meta-learner (FFORMA-N) that uses time series statistics as inputs and a one-hot encoded vector of the best performing model as the target. FFORMA-N is a deep learning approach to FFORMA’s feature-weighted model averaging. FFORMA-N uses a softmax output layer to normalize network output to a probability distribution representing each base learner’s adequacy. The same meta-features and feature selection as NN-STACK are used.

The neural network has a deep architecture with ReLU activation in each layer except for the softmax output layer. It is trained with mini-batches and Adam stochastic gradient descent with categorical cross-entropy loss.

Forecasting model predictions are combined similarly to FFORMA. Base learner forecasts are summed after weighting by the meta-learner’s probability distribution.
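
A Keras sketch of such a meta-learner; the framework choice is illustrative, while the layer sizes, optimizer, loss, and learning rate follow the tuned values reported in Section 3.1. `n_features` and `n_models` depend on the subset and the base learner pool:

```python
import tensorflow as tf

def build_fforma_n(n_features, n_models):
    """Deep MLP with a softmax head that outputs a probability
    distribution over the base learners."""
    units = [100, 100, 100, 100, 100, 50, 50, 50, 50, 20, 20]
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(units[0], activation="relu",
                                    input_shape=(n_features,)))
    for u in units[1:]:
        model.add(tf.keras.layers.Dense(u, activation="relu"))
    # softmax normalises the output to weights over the base learners
    model.add(tf.keras.layers.Dense(n_models, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy")
    return model
```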

2.10. N-BEATS

Oreshkin et al. [33] reported a 3% accuracy improvement over ES-RNN with N-BEATS. We use N-BEATS as a state-of-the-art benchmark for ensemble performance comparison, excluding it from the ensemble base learner pool. N-BEATS is a pure deep learning method without feature engineering or input scaling [33].

3. Experimental Method and Results

All ensembles were evaluated using ten-fold cross-validation, repeated five times. Scores were averaged across fifty validation sets to mitigate random initialization effects. A pseudorandom number generator [61] generated indices for train/validation splits for each of the five runs, using seeds one through five.
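
A sketch of this evaluation protocol using scikit-learn's Mersenne Twister-seeded splitter; the fit and scoring steps are elided:

```python
import numpy as np
from sklearn.model_selection import KFold

n_series = 100_000          # size of the M4 dataset
for seed in range(1, 6):    # five repeats, seeds one through five
    splitter = KFold(n_splits=10, shuffle=True, random_state=seed)
    for fold, (train_idx, val_idx) in enumerate(splitter.split(np.arange(n_series))):
        pass  # fit the ensembles on train_idx, score OWA on val_idx
```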

Algorithms were implemented in Python and run on a 2.60 GHz Intel Core i7 PC with 16 GB RAM and a 1365 MHz Nvidia RTX 2060 GPU with 6 GB GDDR6 memory. Source code is available on GitHub (https://github.com/Pieter-Cawood/FFORMA-ESRNN accessed August 3, 2022).

3.1. Hyperparameter Tuning

The subsets H, D, W, M, Y, and Q in Tables 2–4 represent the Hourly, Daily, Weekly, Monthly, Yearly, and Quarterly data. Ensemble architectures and hyperparameters were tuned using the first fold’s training set from the first cross-validation run. NN-STACK and FFORMA-N were tuned via backtesting; the gradient boosting methods (FFORMA, FFORMS-G) were tuned via Bayesian optimization, with the parameter search space limited based on initial results and a Gaussian process estimating the parameters that minimize OWA error over 300 runs.

FFORMA and FFORMS-G share identical architectures and hyperparameters. Early stopping (patience 10) on the validation set loss was employed, with a quarter of the training data held out as the validation set. Table 2 shows the FFORMA hyperparameter settings for the M4 dataset.

**Table 2.** The FFORMA hyperparameters.

| Hyperparameter    | H      | D     | W      | M     | Y     | Q      |
|-------------------|--------|-------|--------|-------|-------|--------|
| n-estimators      | 2000   | 2000  | 2000   | 1200  | 1200  | 2000   |
| min data in leaf  | 63     | 200   | 50     | 100   | 100   | 50     |
| number of leaves  | 135    | 94    | 19     | 110   | 110   | 94     |
| eta               | 0.61   | 0.90  | 0.46   | 0.20  | 0.10  | 0.75   |
| max depth         | 61     | 9     | 17     | 28    | 28    | 43     |
| subsample         | 0.49   | 0.52  | 0.49   | 0.50  | 0.50  | 0.81   |
| colsample bytree  | 0.90   | 0.49  | 0.90   | 0.50  | 0.50  | 0.49   |

FFORMS-R used identical hyperparameters across datasets: 100 trees per forest, max 16 nodes per tree.

NN-STACK architectures and hyperparameters were determined via backtesting on the first fold training data. Hourly and Weekly subsets used deep MLPs (11 hidden layers, neurons per layer: 100, 100, 100, 100, 100, 50, 50, 50, 50, 20, 20). Shallower networks (3 hidden layers, 10 neurons each) were used for Daily, Monthly, Yearly, Quarterly subsets.

Early stopping (patience 15, max epochs 300, except Daily: max epochs 600) was used. Batch sizes were 225 (except Daily: 1200). Learning rate was 0.0003 (except Daily: 0.0001).

FFORMA-N’s architecture and hyperparameters were similarly determined via backtesting, using a deep MLP architecture (11 hidden layers; neurons per layer: 100, 100, 100, 100, 100, 50, 50, 50, 50, 20, 20).

Early stopping was used (patience 3, except Daily: 15; max epochs 300, except Daily: 600). The batch size was 52 (except Weekly: 225 and Daily: 1200), and the learning rate was 0.0001 throughout.

3.2. Detailed Results

Tables 3 and 4 show the average and median OWA error of each method across the six M4 competition subsets. The “M, Y and Q” column shows the mean over the three largest subsets for comparison with N-BEATS. Series counts are given in parentheses, the best results are bolded, and the top two are grey-highlighted.

Egrioglu and Fildes [62] demonstrated flaws in the M4 ranking due to nonsymmetric error distributions, suggesting median OWA errors for ranking instead. They also argued against evaluating combined results, as different methods excel on different subsets. We therefore present median OWA errors in Table 4 and OWA error distributions in Figure 3a–f, visualizing the data quartiles and extremes. The violin plots, clipped at an OWA of 3.5 (model failure), show the consistency of forecast accuracy.

**Table 3.** Average OWA errors on the M4 test set, where lower values are better.

|                 | H (0.4K) | D (4.2K) | W (0.4K) | M (48K) | Y (23K) | Q (24K) | M, Y and Q |
|-----------------|----------|----------|----------|---------|---------|---------|------------|
| **Base Learners** |          |          |          |         |         |         |            |
| ARIMA           | 0.577    | 1.047    | 0.925    | 0.903   | 0.892   | 0.898   | 0.899      |
| Comb            | 1.556    | 0.981    | 0.944    | 0.920   | 0.867   | 0.890   | 0.899      |
| Damped          | 1.141    | 0.999    | 0.992    | 0.924   | 0.890   | 0.893   | 0.907      |
| ES-RNN          | 0.440    | 1.046    | 0.864    | 0.836   | 0.778   | 0.847   | 0.825      |
| Theta           | 1.006    | 0.998    | 0.965    | 0.907   | 0.872   | 0.917   | 0.901      |
| **Ensembles**    |          |          |          |         |         |         |            |
| FFORMA          | **0.415**| 0.983    | 0.725    | **0.800** | **0.732** | 0.816   | **0.788**    |
| FFORMS-R        | 0.423    | 0.981    | 0.740    | 0.817   | 0.752   | 0.830   | 0.805      |
| FFORMS-G ‡      | 0.427    | 0.984    | 0.740    | 0.810   | 0.745   | 0.826   | 0.798      |
| AVG             | 0.847    | 0.985    | 0.860    | 0.863   | 0.804   | 0.856   | 0.847      |
| FFORMA-N ‡      | 0.428    | 0.979    | **0.718**| 0.813   | 0.746   | 0.828   | 0.801      |
| NN-STACK        | 1.427    | **0.927**| 0.810    | 0.833   | 0.771   | 0.838   | 0.819      |
| **State of Art**|          |          |          |         |         |         |            |
| N-BEATS †       | -        | -        | -        | 0.819   | 0.758   | **0.800** | 0.799      |

† reproduced for comparison [33], ‡ proposed methods.

**Table 4.** Median OWA errors on the M4 test set, where lower values are better.

|                 | H (0.4K) | D (4.2K) | W (0.4K) | M (48K) | Y (23K) | Q (24K) | Schulze Rank [63] |
|-----------------|----------|----------|----------|---------|---------|---------|---------------------------------|
| **Base Learners** |          |          |          |         |         |         |                                 |
| ARIMA           | 0.332    | 0.745    | 0.688    | 0.670   | 0.612   | 0.633   | 9                               |
| Comb            | 1.121    | 0.718    | 0.623    | 0.684   | 0.595   | 0.648   | 8                               |
| Damped          | 0.940    | 0.727    | 0.637    | 0.691   | 0.602   | 0.644   | 9                               |
| ES-RNN          | 0.370    | 0.764    | 0.546    | 0.637   | 0.550   | 0.610   | 5                               |
| Theta           | 0.817    | 0.723    | 0.673    | 0.692   | 0.617   | 0.686   | 11                              |
| **Ensembles**    |          |          |          |         |         |         |                                 |
| FFORMA          | 0.318    | 0.723    | **0.529**| **0.602** | **0.491** | **0.580** | **1**                           |
| FFORMS-R        | **0.311**| 0.711    | 0.552    | 0.615   | 0.511   | 0.589   | 2                               |
| FFORMS-G ‡      | 0.328    | 0.722    | 0.538    | 0.610   | 0.506   | 0.596   | 2                               |
| AVG             | 0.738    | 0.656    | 0.632    | 0.652   | 0.557   | 0.619   | 7                               |
| FFORMA-N ‡      | 0.326    | 0.720    | 0.539    | 0.614   | 0.508   | 0.594   | 2                               |
| NN-STACK        | 0.685    | **0.646**| 0.606    | 0.637   | 0.535   | 0.595   | 5                               |

‡ proposed methods.

Figure 3. The OWA error distributions on the M4 test set, where FFORMS represents the similar distributions of FFORMS-R and FFORMS-G. Panel (a) shows the distributions for the Hourly subset, (b) the Daily subset, (c) the Weekly subset, (d) the Monthly subset, (e) the Yearly subset, and (f) the Quarterly subset.

3.2.1. The Hourly Subset

For the Hourly subset, ES-RNN was the only stable base learner (Figure 3a). NN-STACK failed, producing the worst average ensemble result, likely due to the small dataset (414 series) and long forecast horizon (48 points).

FFORMA outperformed all other methods, achieving less than half the average OWA error of the AVG benchmark, and slightly (0.059) improved on the ES-RNN median OWA error.

3.2.2. The Daily Subset

Base learner error distributions were very similar (Figure 3b). ES-RNN underperformed other base learners in this subset. Gradient boosting methods failed to leverage ES-RNN forecasts as effectively as in other subsets. NN-STACK was the most successful ensemble and overall best method.

NN-STACK is better suited for scenarios where ensembles struggle to learn weightings among similarly performing base learners.

3.2.3. The Weekly Subset

ES-RNN produced the lowest OWA errors among base learners, with fewer extreme values (Figure 3c). This strong base learner performance allowed all ensembles to outperform individual base learners (Table 3).

FFORMA slightly (0.017) improved ES-RNN median OWA error, performing best overall, followed by FFORMS-G.

3.2.4. The Monthly Subset

The Monthly subset (48,000 series) was the largest and most domain-balanced. ES-RNN performed best among the base learners, with fewer model failures (Figure 3d). Under these ideal conditions for ensemble analysis, the ensembles produced similar median OWA errors, with FFORMA showing a slight (0.037) improvement over ES-RNN and remaining superior in median OWA error.

3.2.5. The Yearly Subset

The Yearly subset was large and nonseasonal. All ensembles except AVG slightly outperformed ES-RNN, and the gradient boosting and neural network ensembles showed similar OWA distributions (Figure 3e). FFORMA achieved the greatest improvement (0.056) and is the preferred method for yearly time series.

3.2.6. The Quarterly Subset

The Quarterly subset was similar to the Yearly subset in size, balance, and horizon, but with higher seasonality. The ensemble OWA distributions (Figure 3f) mirrored the Yearly subset results, and no base learner was distinctly superior. The larger dataset avoided the Daily subset’s issue of similarly performing base learners, and FFORMA improved the ES-RNN median accuracy by 0.03.

N-BEATS showed the best average OWA result, slightly (0.058) outperforming FFORMA for this subset only.

3.3. Overall Results

N-BEATS only outperformed the ensembles on the Quarterly subset; including a model of N-BEATS’ strength in the base learner pool might have boosted ensemble accuracy on the Quarterly data. The median OWA error gap between AVG and the other ensembles (e.g., 0.101 for FFORMA) highlights the robustness machine learning brings to late model fusion.

While the ensembles outperformed ES-RNN and N-BEATS in average error on the larger subsets (Monthly, Yearly, Quarterly; Figure 3d–f), uncertainty remained for the smaller subsets (Hourly, Daily, Weekly). The ensembles still notably lowered the upper quantiles of the OWA error distributions, with greater improvement on the smaller subsets (Figure 3a–c), where the weak learners showed more performance variation.

Our results suggest that ensemble learning performance depends on several factors:

  1. The performance of the weak learners.
  2. Data diversity (the similarity of time series across domains).
  3. Dataset size: traditional methods perform better on smaller subsets, but ensembles outperform them, especially on larger datasets with more cross-learning potential.
  4. The base learner pool: hybrid ensembles can improve performance.
  5. There is no free lunch for ensemble selection; validation is needed.

Gradient boosting ensembles generally outperformed neural network ensembles, and outperformed all other methods on every subset except Daily. Table 4 confirms that ensembles can improve standalone time series forecasts by combining them with other statistical models: FFORMA (the M4 runner-up method), using ES-RNN as a base learner, outperformed the standalone ES-RNN (the M4 winner) on all six subsets.

4. Conclusions

Our primary goal was to empirically compare ensemble methods against state-of-the-art forecasting methods (ES-RNN and N-BEATS) to (i) improve hybrid model accuracy and (ii) identify a state-of-the-art ensembling technique for future research.

We evaluated four traditional base learners (ARIMA, Comb, Damped, Theta) and the hybrid ES-RNN. Ensembles employed model averaging, stacking, and selection strategies, including random forest (FFORMS-R), gradient boosting (FFORMA, FFORMS-G), neural network (NN-STACK, FFORMA-N) meta-learners, and a naive arithmetic average (AVG) benchmark. We also compared against N-BEATS.

FFORMA emerged as a state-of-the-art ensemble technique capable of boosting powerful methods like ES-RNN and N-BEATS. However, for the Daily subset, NN-STACK was more suitable when forecasting models had similar performance. Weighted model averaging (FFORMA, FFORMA-N) generally outperformed stacking or selection approaches.

NN-STACK, performing regression on single points of base learner forecasts, ignores temporal data elements, learning a temporally blind fusion function. Future research should explore temporally aware stacking ensembles. Given the equal-weighted averaging in top M5 submissions, further research is needed to assess feature-weighted model averaging for multivariate time series data and to explore the potential of early fusion meta-learning approaches in similar contexts. While this study focused on late fusion, the findings provide a strong foundation and motivation for future research into early fusion meta-learning strategies in time series forecasting.

Author Contributions

Conceptualization, P.C. and T.V.Z.; methodology, P.C.; software, P.C.; validation, P.C. and T.V.Z.; formal analysis, P.C.; investigation, P.C. and T.V.Z.; resources, P.C.; data curation, P.C.; writing—original draft preparation, P.C.; writing—review and editing, P.C. and T.V.Z.; visualization, P.C. and T.V.Z.; supervision, T.V.Z.; project administration, P.C. and T.V.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The M4 Competition’s dataset and the base learner forecasts can be accessed via the organisers’ GitHub repository at https://github.com/Mcompetitions/M4-methods (accessed on 23 May 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

ANN Artificial Neural Network
ARIMA Autoregressive Integrated Moving Average
AVG model averaging
ES Exponential Smoothing
ES-RNN Exponential Smoothing-Recurrent Neural Network
FFORMA Feature-Based FORecast Model Averaging
FFORMA-N Neural Network Feature-Based FORecast Model Averaging
FFORMS-R Random Forest Feature-Based FORecast Model Selection
FFORMS-G Gradient Boosting Feature-Based FORecast Model Selection
LSTM Long Short-Term Memory
MASE Mean Absolute Scaled Error
MLP Multilayer Perceptron
N-BEATS Neural Basis Expansion Analysis
NN-STACK Neural Network Model Stacking
OWA Overall Weighted Average
RNN Recurrent Neural Network
RF Random Forest
sMAPE Symmetric Mean Absolute Percentage Error

References

[1] Cawood, P.; van Zyl, T. Feature-weighted stacking for forecasting non-seasonal time-series. *Expert Syst. Appl.* **2020**, *140*, 112890.
[2] Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. The M5 competition: Background, organization, and implementation. *Int. J. Forecast.* **2020**, *36*, 1–29.
[3] Massetti, E.; Mendelsohn, R. Estimating Ricardian models of climate change impacts in developing countries. *Environ. Dev. Econ.* **2018**, *23*, 533–553.
[4] Peng, L.; Zeng, J.; Li, Y.; Chen, L.; Zhou, M. Remaining useful life prediction of bearings based on deep learning. *Reliab. Eng. Syst. Saf.* **2018**, *172*, 1–15.
[5] Tsinaslanidis, P.V.; Kugiumtzis, D. Forecasting stock market indices with global and local input vectors. *Econ. Model.* **2019**, *78*, 227–239.
[6] Jones, D.S.; Helmreich, J.E.; Pitzer, V.E.; Bharti, N.; Conlan, A.J.; Emmerich, O.; Fox, G.; Halder, N.; Hellewell, J.; Johansson, M.A.; et al. Strategies for mitigating influenza pandemics and regional epidemics. *Philos. Trans. R. Soc. B Biol. Sci.* **2020**, *375*, 20190379.
[7] Khaleghi, B.; Khamis, A.; Karray, P.O.; Razavi, S. A survey of multi-sensor data fusion techniques in autonomous driving applications. *IEEE Trans. Intell. Transp. Syst.* **2013**, *14*, 1089–1106.
[8] Arinze, B. Meta-learning forecasting models. *J. Bus. Forecast. Methods Syst.* **1994**, *13*, 16–20.
[9] Ribeiro, M.P.; Coelho, C. Ensemble methods for time series forecasting: An application to agribusiness. *Comput. Electron. Agric.* **2020**, *179*, 105813.
[10] Zhang, G.P. Time series forecasting using a hybrid ARIMA and neural network model. *Neurocomputing* **2003**, *50*, 159–175.
[11] Wolpert, D.H. The lack of a priori distinctions between learning algorithms. *Neural Comput.* **1996**, *8*, 1341–1390.
[12] Timmermann, A. Forecast combinations. In *Handbook of Economic Forecasting*; Elsevier: Amsterdam, The Netherlands, **2006**; Volume 1, pp. 135–196.
[13] Clemen, R.T. Combining forecasts: A review and annotated bibliography. *Int. J. Forecast.* **1989**, *5*, 559–583.
[14] Stock, J.H.; Watson, M.W. Forecasting with many predictors. In *Handbook of Economic Forecasting*; Elsevier: Amsterdam, The Netherlands, **2006**; Volume 1, pp. 515–554.
[15] Raftery, A.E.; Madigan, D.; Hoeting, J.A. Bayesian model averaging for linear regression models. *J. Am. Stat. Assoc.* **1997**, *92*, 179–191.
[16] Clarke, B.R. Comparing Bayesian and frequentist model averaging. *Bayesian Anal.* **2007**, *2*, 1–20.
[17] Kang, Y.; Hyndman, R.J.; Smith-Miles, K.A. Visualising forecasting algorithm performance across time series databases. *Int. J. Forecast.* **2017**, *33*, 345–357.
[18] Talagala, P.D.; Hyndman, R.J.; Smith-Miles, K.A.; Muñoz, M.A. Meta-learning how to forecast time series. *Appl. Stoch. Model. Bus. Ind.* **2018**, *34*, 515–531.
[19] Montero-Manso, P.; Talagala, P.D.; Hyndman, R.J.; Smith-Miles, K.A. FFORMA: Feature-based forecast model averaging. *Int. J. Forecast.* **2020**, *36*, 86–92.
[20] Smyl, S.M. Exponential smoothing recurrent neural network. In *Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining*, London, UK, 19–23 August 2018; pp. 2338–2347.
[21] Hewamalage, H.; Bergmeir, C.; Bandara, K. Deep learning for time series forecasting: A review and outlook. *Wiley Interdiscip. Rev. Data Min. Knowl. Discov.* **2023**, *13*, e1485.
[22] Laptev, N.; Yosinski, J.; Li, L.E.; Smyl, S.M. Time-series extreme event forecasting with neural networks at Uber. In *Proceedings of the 2017 International Conference on Machine Learning Workshop on Reliable Machine Learning in the Wild*, Sydney, Australia, 7 August 2017; pp. 34–43.
[23] Salinas, D.; Shen, Y.; Kapoor, N.S. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. *Int. J. Forecast.* **2020**, *36*, 1183–1191.
[24] Jiao, J.; Zhang, Y.; Zheng, J.; Kang, C.; Wang, Z.; Zeng, B. A hybrid forecasting model based on variational mode decomposition, improved whale optimization algorithm and LSTM for PM2.5 concentration. *Sustain. Comput. Inform. Syst.* **2021**, *30*, 100513.
[25] Ke, G.; Chang, K.C.; Jordan, M.I.; Hellerstein, J.M. AdaBoost with loss-specific weak learners. In *Proceedings of the 2019 ACM SIGKDD International Conference on Knowledge Discovery & Data Mining*, Anchorage, AK, USA, 4–8 August 2019; pp. 369–377.
[26] Wang, J.; Wang, J.; Zhang, X.; Kong, J.; Guo, J.; Yao, X. Remaining useful life prediction of rolling bearings based on a hybrid deep learning approach. *IEEE Access* **2019**, *7*, 162581–162592.
[27] Zhang, Y.; Jiao, J.; Zhao, S.; Zheng, J.; Wu, J.; Zhang, Y. A hybrid forecasting model based on ensemble empirical mode decomposition and LSTM neural network for air quality prediction. *Sci. Total Environ.* **2020**, *703*, 134905.
[28] Adhikari, R.; Agrawal, R. A hybrid fuzzy-neural network model for time series forecasting. *Appl. Soft Comput.* **2013**, *13*, 3664–3672.
[29] Yu, L.; Wang, S.; Lai, K.K. A hybrid fuzzy neural network and ARIMA model for time series forecasting. *Inf. Sci.* **2005**, *171*, 417–429.
[30] Cai, J.; Li, X.; Fan, J.; Chen, G.; Zhang, Y. A hybrid forecasting model based on wavelet decomposition, PSO algorithm and support vector machine for wind speed prediction. *Renew. Energy* **2015**, *77*, 521–529.
[31] Dash, R.; Dash, P.K. A hybrid neuro-fuzzy approach for time series forecasting. *Appl. Soft Comput.* **2016**, *38*, 978–987.
[32] Ahmed, N.K.; Ermagun, A.; Çelebioğlu, K.; Yıldırımoğlu, O.; Gökdağ, C.; Teymuroglu, M.Y. A hybrid approach for time series forecasting using LSTM neural networks and ensemble empirical mode decomposition. *Neural Comput. Appl.* **2021**, *33*, 13949–13969.
[33] Oreshkin, B.N.; Carpov, D.; Chapados, N.; Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In *International Conference on Learning Representations*; Addis Ababa, Ethiopia, 26–30 April 2020.
[34] Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
[35] Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. The M5 accuracy competition: Results, findings and conclusions. *Int. J. Forecast.* **2022**, *38*, 1446–1466.
[36] Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. The M4 competition: Results, findings, conclusion and way forward. *Int. J. Forecast.* **2018**, *34*, 802–811.
[37] Park, J.; Kim, J.; Kang, P.; Lee, J.; Park, C. Parallel ensemble neural networks with deep learning for hierarchical time series forecasting. *Expert Syst. Appl.* **2021**, *179*, 115043.
[38] Dietterich, T.G. Ensemble methods in machine learning. In *International Workshop on Multiple Classifier Systems*; Springer: Berlin/Heidelberg, Germany, **2000**; pp. 1–15.
[39] Seni, G.; Elder, J.F. *Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions*; Morgan & Claypool Publishers: San Rafael, CA, USA, **2010**.
[40] Zhou, Z.H. *Ensemble Methods: Foundations and Algorithms*; CRC Press: Boca Raton, FL, USA, **2012**.
[41] Makridakis, S. Accuracy measures: Theoretical and practical concerns. *Int. J. Forecast.* **1993**, *9*, 527–529.
[42] Hyndman, R.J.; Koehler, A.B. Another look at measures of forecast accuracy. *Int. J. Forecast.* **2006**, *22*, 679–688.
[48] Kourentzes, N.; Barrow, B.; Petropoulos, F. Neural network ensemble for time series forecasting. *Expert Syst. Appl.* **2014**, *41*, 1853–1859.
[49] Brockwell, A.E.; Hyndman, R.J. Exponential smoothing state space models with additive and multiplicative Holt–Winters methods. *Int. J. Forecast.* **1997**, *13*, 271–278.
[50] Bates, J.M.; Granger, C.W.J. The combination of forecasts. *J. Oper. Res. Soc.* **1969**, *20*, 451–468.
[51] Taylor, J.W. Double seasonal exponential smoothing for forecasting intraday time series. *Int. J. Forecast.* **2003**, *19*, 1–16.
[52] Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. In *International Conference on Learning Representations*; San Juan, Puerto Rico, 2–4 May 2016.
[53] Armstrong, J.S. Combining forecasts: An application of the weighting method (conditional expected loss). *J. Forecast.* **1989**, *8*, 201–215.
[54] Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
[55] Hyndman, R.J.; Wang, E.; Laptev, N. Large-scale univariate time series forecasting. In *Proceedings of the 2015 IEEE International Conference on Data Mining Workshop (ICDMW)*, Atlantic City, NJ, USA, 14–17 November 2015; pp. 370–377.
[56] Talagala, P.D.; Smith-Miles, K.A.; Hyndman, R.J. Meta-learning for time series forecasting model selection. *Data Min. Knowl. Discov.* **2020**, *34*, 1621–1651.
[57] Breiman, L. Random forests. *Mach. Learn.* **2001**, *45*, 5–32.
[58] Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In *Proceedings of the 27th International Conference on Machine Learning*, Haifa, Israel, 21–24 June 2010; pp. 807–814.
[59] Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In *International Conference on Learning Representations*; San Diego, CA, USA, 7–9 May 2015.
[60] Spearman, C. The proof and measurement of association between two things. *Am. J. Psychol.* **1904**, *15*, 72–101.
[61] Matsumoto, M.; Nishimura, T. Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. *ACM Trans. Model. Comput. Simul. (TOMACS)* **1998**, *8*, 3–30.
[62] Egrioglu, E.; Fildes, R. Forecast error measures: Critical revisions and extensions. *Int. J. Forecast.* **2022**, *38*, 1213–1228.
[63] Schulze, M.G. A new general method for ranking candidates that circumvents the paradoxes of voting theory. *Soc. Choice Welf.* **2011**, *36*, 267–303.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
