In this article, we investigate the effectiveness of the OP-LSTM (Operation-LSTM) architecture for meta-learning. Our investigation addresses critical questions about its performance in few-shot learning scenarios. We conduct a series of experiments on both synthetic and real-world datasets to rigorously evaluate OP-LSTM against established meta-learning techniques.
Our experimental framework utilizes few-shot sine wave regression as an illustrative task to provide intuitive insights, alongside widely recognized few-shot image classification benchmarks including Omniglot, miniImageNet, and CUB datasets. We establish a comparative baseline by employing prominent meta-learning algorithms such as Model-Agnostic Meta-Learning (MAML), Prototypical Networks, Subspace Adaptation Prior (SAP), and Warp-MAML. MAML and Prototypical Networks are chosen for their popularity and their conceptual relation to OP-LSTM, as they can be approximated by OP-LSTM, allowing us to examine the expressive advantages of OP-LSTM. SAP and Warp-MAML represent state-of-the-art gradient-based meta-learning methods, offering a benchmark to assess OP-LSTM’s competitive standing. It’s important to note that OP-LSTM is designed to be complementary to methods like Warp-MAML and could potentially enhance them, but exploring this synergistic potential is reserved for future research.
All experiments are executed on a single GPU (PNY GeForce RTX 2080TI) under a consistent computational budget of 2 days per run. Each experiment is repeated three times with different random seeds to ensure robustness and account for variability in neural network initialization, training tasks, validation tasks, and testing tasks. It is crucial to emphasize that our primary objective is not to achieve state-of-the-art performance records. Instead, we aim to determine if a standard LSTM is a viable approach for few-shot learning in contemporary benchmarks and, more critically, whether OP-LSTM offers tangible improvements over standard LSTM, MAML, and Prototypical Networks.
We explore two primary experimental settings: sine wave regression and few-shot image classification, each designed to probe different aspects of meta-learning capabilities.
Sine wave regression was initially introduced as a meta-learning challenge to evaluate the adaptability of algorithms to new tasks with limited data. In this setup, each task \(\mathcal{T}_j\) is defined by a unique sine wave function \(s_j(x) = A_j \cdot \sin(x - p_j)\), where the amplitude \(A_j\) and phase \(p_j\) are randomly sampled for each task from the intervals \([0.1, 5.0]\) and \([0, \pi]\), respectively. The learning objective is to accurately predict the output y for a given input x of a new task, based on a small support set of k examples. Performance is then assessed on a query set of 50 input-output pairs. For our implementation of the plain LSTM, we use a multi-layer LSTM architecture trained using Backpropagation Through Time (BPTT) with the Adam optimizer. During meta-training, the LSTM is exposed to 70,000 training tasks. Meta-validation is conducted every 2,500 tasks on a set of 1,000 tasks. The model that achieves the best validation performance is then evaluated on 2,000 meta-test tasks to gauge its generalization capability.
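To make the task distribution concrete, the following is a minimal Python sketch of a sine wave task sampler consistent with the description above; the uniform sampling of amplitude and phase, the input range of [-5, 5], and all function names are assumptions for illustration rather than the exact implementation used in our experiments.

```python
# Minimal sketch of a sine wave task sampler (assumed implementation;
# uniform sampling and the input range [-5, 5] are our assumptions).
import numpy as np

def sample_sine_task(k_support, k_query=50, x_range=(-5.0, 5.0), rng=np.random):
    """Sample one few-shot regression task T_j with y = A_j * sin(x - p_j)."""
    amplitude = rng.uniform(0.1, 5.0)   # A_j sampled from [0.1, 5.0]
    phase = rng.uniform(0.0, np.pi)     # p_j sampled from [0, pi]

    def sine(x):
        return amplitude * np.sin(x - phase)

    # Support set: k labelled examples; query set: 50 examples for evaluation.
    x_support = rng.uniform(*x_range, size=(k_support, 1))
    x_query = rng.uniform(*x_range, size=(k_query, 1))
    return (x_support, sine(x_support)), (x_query, sine(x_query))

# Example: one 5-shot task
(support_x, support_y), (query_x, query_y) = sample_sine_task(k_support=5)
```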
Few-shot image classification experiments involve training all methods for 80,000 episodes on training tasks, with meta-validation performed every 2,500 episodes to monitor progress and select the optimal model. The final evaluation is carried out on 600 held-out test tasks. Each task includes a support set with a varying number of examples per class (ranging from 1 to 10 shots) and a query set of 15 examples per class. To ensure statistical reliability, each experiment is repeated 3 times with different random seeds, affecting weight initialization and task sampling, while maintaining consistent class splits across training, validation, and testing sets. For the Omniglot dataset, we employ a fully-connected neural network as the base-learner for both MAML and OP-LSTM, consistent with prior work. This network comprises 4 fully-connected blocks with decreasing dimensions (256-128-64-64), each block including a linear layer, BatchNorm, and ReLU activation. In the OP-LSTM configuration, every layer of the base-learner network is replaced with an OP-LSTM block. The plain LSTM approach uses a standard LSTM as its base-learner. For MAML, we adopt the best hyperparameters reported in the original MAML paper. Hyperparameter tuning for LSTM and OP-LSTM was performed using random search and grid search, respectively, to optimize their performance. It’s important to acknowledge that direct comparisons with MAML and Prototypical Networks should be interpreted cautiously due to computational constraints preventing hyperparameter optimization under identical conditions for all methods.
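For concreteness, the PyTorch sketch below shows a base-learner of the kind described above (4 fully-connected blocks of 256-128-64-64 units, each consisting of a linear layer, BatchNorm, and ReLU). The flattened 784-dimensional Omniglot input and the final 5-way linear classifier are assumptions, and all class and variable names are ours.

```python
# Hedged sketch of the fully-connected Omniglot base-learner described above.
# The 784-dim input (flattened 28x28 images) and the 5-way output layer are
# assumptions, not stated verbatim in the text.
import torch.nn as nn

def fc_block(in_dim, out_dim):
    # One block: linear layer, BatchNorm, ReLU activation.
    return nn.Sequential(nn.Linear(in_dim, out_dim),
                         nn.BatchNorm1d(out_dim),
                         nn.ReLU())

class FCBaseLearner(nn.Module):
    def __init__(self, in_dim=784, n_classes=5):
        super().__init__()
        dims = [in_dim, 256, 128, 64, 64]
        self.features = nn.Sequential(*[fc_block(d_in, d_out)
                                        for d_in, d_out in zip(dims[:-1], dims[1:])])
        self.classifier = nn.Linear(dims[-1], n_classes)  # output layer

    def forward(self, x):            # x: (batch, in_dim) flattened images
        return self.classifier(self.features(x))
```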
For experiments on miniImageNet and CUB datasets, we uniformly adopt the Conv-4 base-learner network across all methods. This network architecture consists of 4 convolutional blocks, each with 64 feature maps generated by (3 times 3) kernels, followed by BatchNorm and ReLU nonlinearity. For predictions, MAML uses a linear output layer. The plain LSTM operates on flattened features extracted from the convolutional layers because LSTMs are not computationally scalable for direct image input. OP-LSTM is applied as an OP-LSTM block on these flattened convolutional features. Critically, OP-LSTM is utilized only in the final layer due to current limitations in backpropagating messages through max-pooling layers.
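The Conv-4 backbone can be sketched as follows; the 2×2 max-pooling after each block (referenced above in connection with OP-LSTM's limitation) and the 3-channel input are standard choices for this backbone, included here as assumptions.

```python
# Hedged sketch of the Conv-4 backbone: 4 blocks of 64 feature maps with
# 3x3 kernels, followed by BatchNorm and ReLU. The per-block 2x2 max-pooling
# and the 3-channel input are assumed (standard for this backbone).
import torch.nn as nn

def conv_block(in_channels, out_channels=64):
    return nn.Sequential(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
                         nn.BatchNorm2d(out_channels),
                         nn.ReLU(),
                         nn.MaxPool2d(2))  # assumed pooling placement

class Conv4Backbone(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        self.blocks = nn.Sequential(conv_block(in_channels),
                                    conv_block(64), conv_block(64), conv_block(64))

    def forward(self, x):                  # x: (batch, C, H, W)
        feats = self.blocks(x)
        # Flattened features are what the linear head (MAML), the plain LSTM,
        # or the final OP-LSTM block operate on.
        return feats.flatten(start_dim=1)
```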
Our performance evaluation begins with within-domain assessments, where test tasks are drawn from the same dataset as the training tasks (but with unseen classes). Subsequently, we extend our analysis to cross-domain performance, training models on one dataset and evaluating them on another. Specifically, we examine the scenarios of miniImageNet to CUB transfer (training on miniImageNet, testing on CUB) and the reverse transfer (CUB to miniImageNet).
5.1 Permutation Invariance for the Plain LSTM
We first investigate whether processing the support data sequentially or as a set affects the performance of the plain LSTM. This is crucial for understanding whether the inherent order-sensitivity of standard LSTMs hampers meta-learning. We compare processing the support data as a sequence \((\textbf{x}_1, \textbf{y}_1), \ldots, (\textbf{x}_k, \textbf{y}_k)\) against batch processing, which treats the support set as an unordered set \(\{(\textbf{x}_1, \textbf{y}_1), \ldots, (\textbf{x}_k, \textbf{y}_k)\}\). Experiments are conducted on few-shot sine wave regression and few-shot Omniglot classification. For sine wave regression, each task includes 50 query examples, while Omniglot tasks have 10 query examples per class. The sequentially processed LSTM was tuned via random search (details in the appendix). We compare its performance to a batched LSTM with identical hyperparameters to evaluate whether permutation invariance enhances performance and training stability. Training stability is assessed by computing the confidence interval over the mean performances of the 3 runs, rather than over all concatenated performances, which is consistent with the subsequent experiments and with common practice in the literature.
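To illustrate the difference between the two processing modes, here is a minimal PyTorch sketch; aggregating the per-example hidden states by mean-pooling in the batched variant is our assumption of one way to obtain permutation invariance, not necessarily the exact mechanism used in our implementation.

```python
# Minimal sketch contrasting sequential and batched processing of the support
# set with a plain LSTM. Mean-pooling in the batched variant is an assumed
# symmetric aggregation that yields permutation invariance.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=2, hidden_size=64, batch_first=True)

# Support set of k (x, y) pairs, each encoded as a 2-dim vector.
k = 5
support = torch.randn(k, 2)

# Sequential: one ordered sequence of length k -> final state depends on order.
_, (h_seq, _) = lstm(support.unsqueeze(0))      # input shape (1, k, 2)
state_sequential = h_seq[-1, 0]                 # (hidden_size,)

# Batched: each example is a length-1 sequence processed independently,
# then aggregated with a symmetric operation (here: mean) -> order-invariant.
_, (h_batch, _) = lstm(support.unsqueeze(1))    # input shape (k, 1, 2)
state_batched = h_batch[-1].mean(dim=0)         # (hidden_size,)
```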
Fig. 5 Average performance of the plain LSTM with sequential versus batched processing of the support data on sine wave regression (left; MSE, lower is better) and Omniglot classification (right; accuracy, higher is better), for varying numbers of training examples per task. Results are averaged over 3 runs (each with 600 meta-test tasks); shaded regions show 95% confidence intervals and scatter marks indicate the average performance of individual runs. Batch processing performs on par with or better than sequential processing and stabilizes training across runs.
The results for sine wave regression (left subfigure) show that the batched LSTM performs on par with or better than the sequential LSTM, achieving an equal or lower MSE, and that both approaches improve as the support set grows. A similar but more pronounced trend holds for Omniglot classification (right subfigure), where the batched LSTM clearly surpasses the sequential LSTM across all numbers of training examples per class. Surprisingly, the sequential LSTM does not improve on Omniglot as more examples become available. We attribute this to training instability, reflected in its wide confidence intervals: some runs of the sequential LSTM fail to learn effectively and yield near-random performance, while others only start learning after a burn-in period and fail to converge within 80,000 meta-iterations (detailed learning curves are shown in Appendix B.2). The batched LSTM exhibits no such instability, suggesting that batching not only improves performance but also substantially stabilizes training. Note that although the confidence interval of the sequential LSTM sometimes extends above the mean performance of the batched LSTM, this is an artifact of the symmetric intervals; in no run does the sequential LSTM outperform the batched LSTM. Overall, enforcing permutation invariance through batching appears to be a useful inductive bias for few-shot learning, and we therefore use the batched LSTM in all subsequent experiments.
5.2 Performance Comparison on Few-Shot Sine Wave Regression
Next, we compare the performance of batched plain LSTM, our proposed OP-LSTM, and MAML on few-shot sine wave regression. To ensure a fair comparison with MAML, we adopted the same hyperparameter tuning approach as used for plain LSTM in the previous section for 5-shot sine wave regression. We started with a default base-learner architecture (two hidden layers, 40 ReLU nodes, and a 1-node output layer). We then explored architectures with varying parameter counts to ensure that parameter expressivity did not limit MAML’s performance. The same base-learner architecture was used for OP-LSTM without further tuning.
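As a point of reference, the default base-learner for MAML on this task can be sketched as follows; the scalar input dimension is an assumption consistent with the sine wave task, and the class name is ours.

```python
# Hedged sketch of the default base-learner used for MAML on sine wave
# regression: two hidden layers of 40 ReLU units and a single output node.
import torch.nn as nn

class SineRegressor(nn.Module):
    def __init__(self, hidden=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),       # first hidden layer
            nn.Linear(hidden, hidden), nn.ReLU(),  # second hidden layer
            nn.Linear(hidden, 1))                  # scalar output y

    def forward(self, x):                          # x: (batch, 1)
        return self.net(x)
```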
Table 1 Average test MSE on few-shot sine wave regression
| Method | 5-shot | 10-shot | 20-shot |
|---|---|---|---|
| MAML | 0.189 ± 0.018 | 0.154 ± 0.012 | 0.131 ± 0.010 |
| LSTM (batched) | 0.132 ± 0.009 | 0.111 ± 0.008 | 0.095 ± 0.006 |
| OP-LSTM | 0.142 ± 0.011 | 0.103 ± 0.007 | 0.086 ± 0.005 |
Table 1 presents the test performances on sine wave regression. MAML, despite having a comparable parameter count (models with more parameters performed worse), is outperformed by both LSTM and OP-LSTM. This suggests that LSTM and OP-LSTM are more effective at discovering efficient learning algorithms for sine wave tasks. When comparing LSTM and OP-LSTM, plain LSTM achieves the best performance in the 5-shot setting, while OP-LSTM outperforms LSTM in 10-shot and 20-shot scenarios.
5.3 Performance Comparison on Few-Shot Image Classification
Within-domain Performance We now examine the within-domain performance of OP-LSTM and plain LSTM on few-shot image classification tasks: Omniglot, miniImageNet, and CUB. Table 2 shows the results for Omniglot. Note that plain LSTM has significantly more parameters due to its multi-layered fully-connected architecture with large hidden dimensions, which was found to yield optimal validation performance. Despite this, plain LSTM (with batching) performs poorly compared to other methods, even with its higher parameter count and theoretical capacity to learn any learning algorithm. This indicates that optimizing plain LSTM to find effective learning algorithms for complex few-shot image classification is challenging. In contrast, OP-LSTM, which decouples the learning process from input representation, achieves competitive performance against MAML and ProtoNet in both 1-shot and 5-shot settings, using fewer parameters than plain LSTM.
Table 2 The mean test accuracy (%) on 5-way Omniglot classification across 3 different runs
| Method | 1-shot | 5-shot |
|---|---|---|
| MAML | 98.7 ± 0.1 | 99.9 ± 0.0 |
| ProtoNet | 98.4 ± 0.1 | 99.7 ± 0.0 |
| LSTM (batched) | 93.7 ± 0.9 | 94.4 ± 0.5 |
| OP-LSTM | 98.8 ± 0.1 | 99.8 ± 0.0 |
Table 3 presents the results for miniImageNet and CUB. Again, plain LSTM uses more parameters due to its large fully-connected layers, chosen for optimal validation performance, and it operates on the same Conv-4 backbone representations as the other methods. Nevertheless, plain LSTM performs at chance level, reinforcing that discovering an effective learning algorithm is too difficult for this approach. Conversely, OP-LSTM demonstrates competitive or superior performance compared to all baselines on both miniImageNet and CUB, regardless of the number of shots. This highlights the advantage of separating input representation from the learning mechanism.
Table 3 Meta-test accuracy scores on 5-way miniImageNet and CUB classification over 3 runs
| Dataset | Method | 1-shot | 5-shot |
|---|---|---|---|
| miniImageNet | MAML | 49.9 ± 0.5 | 63.1 ± 0.9 |
| miniImageNet | ProtoNet | 49.4 ± 0.6 | 66.1 ± 0.5 |
| miniImageNet | LSTM (batched) | 20.1 ± 0.2 | 20.3 ± 0.2 |
| miniImageNet | OP-LSTM | 52.3 ± 0.6 | 67.6 ± 0.7 |
| CUB | MAML | 58.4 ± 0.6 | 72.1 ± 0.7 |
| CUB | ProtoNet | 55.5 ± 0.7 | 72.5 ± 0.6 |
| CUB | LSTM (batched) | 20.1 ± 0.1 | 20.1 ± 0.1 |
| CUB | OP-LSTM | 61.2 ± 0.8 | 75.2 ± 0.5 |
Cross-domain Performance We further assess the cross-domain performance of LSTM and OP-LSTM, where test tasks originate from a different dataset than the training tasks. We evaluate two scenarios: training on miniImageNet and testing on CUB (MIN \(\rightarrow\) CUB), and vice versa (CUB \(\rightarrow\) MIN). Table 4 shows the results. Plain LSTM again fails to exceed random classification performance, while OP-LSTM performs well in both cross-domain scenarios, demonstrating its robustness under these challenging conditions.
Table 4 Average cross-domain meta-test accuracy scores over 5 runs using a Conv-4 backbone
| Scenario | LSTM (batched) | OP-LSTM |
|---|---|---|
| MIN \(\rightarrow\) CUB | 19.9 ± 0.1 | 43.8 ± 0.8 |
| CUB \(\rightarrow\) MIN | 20.0 ± 0.1 | 44.1 ± 0.9 |
5.4 Analysis of the Learned Weight Updates
Finally, we analyze how OP-LSTM updates the weights of the base-learner network. We measure the cosine similarity and Euclidean distance between the OP-LSTM updates and the updates produced by gradient descent or Prototypical Networks. Let \(\textbf{H}^{(L)}_0\) be the initial weight matrix of the final classifier layer. The OP-LSTM update direction after T updates is \(\Delta_{OP} = \vec{\textbf{H}}^{(L)}_T - \vec{\textbf{H}}^{(L)}_0\), where \(\vec{\textbf{H}}\) denotes the vectorization of \(\textbf{H}\). Analogously, for the weight matrices obtained from nearest-prototype classification (\(\textbf{H}^{(L)}_{Proto}\)) and from gradient descent (\(\textbf{H}^{(L)}_{GD}\)), the update directions are \(\Delta_{Proto} = \vec{\textbf{H}}^{(L)}_{Proto} - \vec{\textbf{H}}^{(L)}_0\) and \(\Delta_{GD} = \vec{\textbf{H}}^{(L)}_{GD} - \vec{\textbf{H}}^{(L)}_0\), where gradient descent uses a learning rate of 0.01 for T steps. We compute the Euclidean distance and cosine similarity between \(\Delta_{OP}\) and each of \(\Delta_{Proto}\) and \(\Delta_{GD}\). Measurements are taken on validation tasks every 2,500 episodes and averaged over 3 runs.
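The comparison can be sketched as follows; the helper names and the placeholder weight matrices are hypothetical, and only the vectorize-subtract-compare logic reflects the procedure described above.

```python
# Hedged sketch of the update-direction comparison. H0 is the initial
# classifier weight matrix; h_op and h_ref stand in for the weights after
# T OP-LSTM updates and after the reference update (gradient descent or
# nearest-prototype classification), respectively.
import torch
import torch.nn.functional as F

def update_direction(h_after, h_before):
    """Vectorize the weight matrices and take their difference."""
    return (h_after - h_before).flatten()

def compare_updates(h0, h_op, h_ref):
    delta_op = update_direction(h_op, h0)    # OP-LSTM update direction
    delta_ref = update_direction(h_ref, h0)  # reference update direction
    cos = F.cosine_similarity(delta_op, delta_ref, dim=0)  # direction only
    dist = torch.norm(delta_op - delta_ref)                # scale-sensitive
    return cos.item(), dist.item()

# Example with random placeholder matrices (5 classes, 64-dim features).
h0, h_op, h_gd = torch.randn(5, 64), torch.randn(5, 64), torch.randn(5, 64)
cos_gd, dist_gd = compare_updates(h0, h_op, h_gd)
```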
Fig. 6 The average cosine similarity (left) and Euclidean distance (right) between the OP-LSTM classifier weight update direction and the prototype-based and gradient-based update directions over the course of training on 5-way 1-shot miniImageNet classification. Each point on the x-axis corresponds to a validation step (every 2,500 episodes). Results are averaged over 3 runs; shaded regions show 95% confidence intervals (often smaller than the symbols). The OP-LSTM updates become increasingly similar to both gradient descent and prototype-based updates over time, as reflected by the increasing cosine similarity.
Fig. 6 displays the results. The cosine similarity between the OP-LSTM updates and both the gradient descent and prototype-based updates increases with training time. OP-LSTM quickly learns to update its weights in directions similar to gradient descent; the similarity then dips slightly before increasing again, potentially as the model also incorporates prototype-based updates. The Euclidean distance to the prototype-based updates follows a consistent pattern, decreasing over time, whereas the distance to the gradient-descent updates increases slightly, likely because this measure is sensitive to the scale and magnitude of the updates. Cosine similarity is therefore the more informative measure of directional similarity, as it abstracts away from vector magnitude.
In conclusion, our experiments demonstrate that OP-LSTM is a highly competitive method for few-shot learning. It consistently outperforms plain LSTM and shows comparable or superior performance to established methods like MAML and Prototypical Networks, particularly in complex within-domain and challenging cross-domain scenarios. The analysis of weight updates reveals that OP-LSTM dynamically adjusts its learning strategy, aligning its updates with both gradient-based and prototype-based approaches, showcasing its adaptability and effectiveness in meta-learning. This highlights the potential of OP-LSTM as a robust and versatile architecture for advancing few-shot learning research and applications.