A Comprehensive Survey of Few-Shot Learning: Principles, Techniques, and Applications

3.1 Problem Definition

In the realm of artificial intelligence, machine learning stands as a cornerstone, enabling systems to learn from data and improve performance without explicit programming. Within machine learning, a significant paradigm shift is occurring with the rise of few-shot learning (FSL). This approach directly addresses a critical limitation of traditional machine learning: the need for vast amounts of labeled data. To fully appreciate the significance of FSL, it’s crucial to understand its context within broader machine learning and its relevance to specific applications like Facial Expression Recognition (FER).

3.1.1 FER

Facial Expression Recognition (FER) is a pivotal task within computer vision, driven by the fundamental role of facial expressions in human communication. Expressions are not merely superficial displays; they are powerful, natural, and universal signals for conveying and understanding human emotion [18–20]. This inherent communicative power makes FER invaluable in Human-Computer Interaction (HCI) systems. Imagine e-learning platforms providing personalized feedback based on student engagement, driver fatigue surveillance systems enhancing road safety, or robots interacting more naturally with humans, all powered by robust FER.

The progress in FER has been significantly fueled by the availability of large-scale datasets, enabling the development of automated facial expression analysis systems applicable across diverse scenarios. Pioneering work by Tian et al. [21] highlighted the universality of facial expressions across cultures, categorizing them into six basic emotions: anger, disgust, fear, happiness, sadness, and surprise. Subsequently, a neutral emotion was included, establishing a common set of seven emotion labels for classification tasks. Contempt was later added to the repertoire of basic emotions [22]. While these six (or seven, or eight) basic emotions have been central to FER research [23], the field is expanding to recognize the complexity of human emotion. Researchers are increasingly exploring compound emotion recognition, acknowledging that emotional states are often blends of multiple emotions rather than singular, basic categories [24–28].

A typical FER system operates through three key modules: face detection, feature extraction, and classification. Face detection algorithms, including MTCNN [29], Dlib [30], Retinaface [31], and FAN [32], are employed to locate and align faces within images. Subsequently, feature extractors capture salient expression-related features, which are then fed into classifiers to categorize expressions into predefined emotion labels. Historically, texture and facial shape were paramount in FER. Early methods utilized techniques like Histogram of Oriented Gradients (HOG) [33], Gabor wavelets [34], Local Binary Patterns (LBP) [35], Local Ternary Patterns (LTP) [36], and Non-negative Matrix Factorization (NMF) [37]. These methods were often evaluated on controlled laboratory datasets like CK+, MMI [38], Oulu-CASIA [39], CFEE [14], and other constrained datasets [40–41]. Classifiers like Support Vector Machines (SVM) [42] and AdaBoost [43] were commonly used to categorize these extracted features. The combination of LBP for robust feature extraction under varying lighting and SVM for effective classification became a popular and well-studied approach.
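The LBP feature-extraction step described above can be sketched as follows. This is a minimal illustration of the basic 3x3 LBP operator in plain Python, not the implementation used in the cited works; the function names `lbp_codes` and `lbp_histogram` and the toy `patch` values are assumptions made for this example. Each interior pixel is compared with its eight neighbours, and the resulting 8-bit codes are pooled into a histogram that a classifier such as an SVM could consume.

```python
def lbp_codes(image):
    """Return 8-bit LBP codes for every interior pixel of a 2D grayscale image.

    `image` is a list of equal-length rows of numeric intensities.
    """
    # Neighbour offsets (row, col), clockwise from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    height, width = len(image), len(image[0])
    codes = []
    for r in range(1, height - 1):
        row_codes = []
        for c in range(1, width - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbour is at least as bright as the centre.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            row_codes.append(code)
        codes.append(row_codes)
    return codes


def lbp_histogram(image):
    """Normalised 256-bin histogram of LBP codes, usable as a feature vector."""
    hist = [0.0] * 256
    count = 0
    for row_codes in lbp_codes(image):
        for code in row_codes:
            hist[code] += 1.0
            count += 1
    return [h / count for h in hist] if count else hist


# Toy 4x4 patch (hypothetical values) and a uniformly brighter copy of it.
patch = [
    [10, 20, 30, 40],
    [20, 50, 10, 30],
    [30, 10, 60, 20],
    [40, 30, 20, 10],
]
brighter = [[value + 25 for value in row] for row in patch]

features = lbp_histogram(patch)
same_codes = lbp_codes(patch) == lbp_codes(brighter)
```

Because each code depends only on the ordering of a pixel and its neighbours, adding a constant brightness offset leaves the codes unchanged (ignoring clipping at the intensity limits), which is one way to read the lighting robustness mentioned above.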

The emergence of large-scale facial datasets from the internet, spurred by challenges like EmotiW and datasets such as RAF-DB, AffectNet, and EmotioNet, shifted the landscape of FER. The rise of Convolutional Neural Networks (CNNs) [44], capable of extracting deeper, more spatially informed features, led to their widespread adoption in FER. CNN-based models demonstrated superior performance on these larger, more diverse datasets. Further advancements in FER incorporated techniques like network ensembles, cascade networks, and Generative Adversarial Network (GAN)-based models, continually pushing the boundaries of accuracy and robustness.

Fig. 2 General pipeline of a facial expression recognition task. This diagram illustrates the typical stages involved in FER, from preprocessing and feature learning to final classification.

Modern FER systems typically follow a three-stage pipeline: pre-processing, feature learning, and classification (Fig. 2). Pre-processing steps often include face alignment, data augmentation to increase dataset diversity, and normalization to standardize input data. However, the necessity of normalization can vary across models. The feature learning stage leverages deep learning models, particularly CNNs, Deep CNNs, and Recurrent Neural Networks (RNNs), to automatically extract relevant features while preserving crucial spatial information. The final classification stage maps these extracted features to emotion categories. Recent trends favor end-to-end learning, integrating feature extraction and classification into a unified deep network, streamlining the pipeline from pre-processing to emotion prediction. Loss functions are incorporated in the final stage to guide and regularize the learning process, optimizing the network for accurate emotion recognition.

Despite significant progress, challenges remain in FER, especially concerning data limitations. While deep learning models excel with large datasets like ImageNet [45] (approximately 1.2 million images), FER datasets like RAF-DB (around 30,000 images) and AffectNet (around 450,000 images), though substantial, can still lead to overfitting. Overfitting occurs when models learn training data too well, compromising their ability to generalize to new, unseen images. The scarcity of facial expression data is further compounded by privacy concerns, making the acquisition of large, diverse datasets difficult. Pre-trained models, trained on massive datasets like ImageNet, offer a potential solution. However, ImageNet primarily contains coarse-grained images (broad categories like dogs, cats, objects), whereas facial expression images are fine-grained (subtle variations within emotions). Fine-grained datasets, while having less overall variance than coarse-grained ones, may not benefit optimally from pre-training on coarse-grained data. Therefore, maximizing the utility of available FER datasets becomes crucial, and few-shot learning emerges as a powerful strategy to address these data scarcity challenges.

3.1.2 FSL

The advent of massive datasets like ImageNet and advancements in deep learning architectures, such as ResNet [46] and LSTM networks [47], have propelled machine learning to remarkable levels of performance. Yet, a fundamental challenge persists: the degradation of model performance when training with limited data. With few samples, models tend to overfit and converge prematurely, hindering their generalization capability. Deep learning models, while powerful, often become increasingly complex to handle high-dimensional data, exacerbating the issue of data dependency.

Early explorations into few-shot learning were inspired by human learning processes [48–49], particularly the ability of humans to learn new concepts from very few examples [50]. Fei-Fei et al. [48] investigated learning from a single example to infer other instances of the same category, emphasizing the role of prior knowledge in enabling generalization to new categories [48–52]. Fink [49] focused on class similarity, proposing that by minimizing within-class distances and maximizing between-class distances, effective learning could be achieved even with just one example per class, thus enabling efficient learning from scarce data.

Few-shot learning (FSL) is fundamentally about developing algorithms that can generalize effectively from limited data. Since the seminal works of Fink and Fei-Fei et al., FSL has become a vibrant research area. The term “n-way k-shot learning” has become standard for describing FSL setups [53–55]. The core objective of FSL is to maximize learning efficiency when data is scarce. Wang et al. [56] define FSL as a machine learning paradigm where only a small number of examples provide the supervised information for learning a target task. FSL has found applications across diverse fields, including computer vision, robotics, and natural language processing (NLP).

In computer vision, benchmark datasets like Omniglot [57] and miniImageNet [58] have become crucial for evaluating FSL models. These models have achieved impressive accuracy on these benchmarks and have been successfully applied to various computer vision tasks: image generation [59–60], object detection [61–63], object tracking [64], image classification [65–67], semantic segmentation [68–70], image retrieval [71], motion prediction [72], video classification [73], and 3D object reconstruction [74]. Furthermore, FSL principles are being applied to robotics, aiming to develop systems that learn and generalize from minimal interaction, mirroring human-like learning. In NLP, FSL is making strides in tasks like text classification [75–76], parsing [77], and translation [78]. The introduction of specialized benchmark datasets for FSL in NLP, such as the few-shot relation classification dataset FewRel [79], is further stimulating progress in this domain.

The application of FSL is expanding across diverse machine learning tasks, and its specific implementation is often tailored to address the unique challenges of each problem. Wang et al. [56] outlined three primary scenarios where FSL proves particularly valuable:

Scenario 1: Mimicking Human Learning: This scenario draws direct inspiration from human cognition. FSL models are designed to leverage pre-existing knowledge and relationships as prior information to facilitate the learning, generation, classification, and detection of novel data samples, mirroring how humans apply prior experience to understand new concepts.
Scenario 2: Learning from Rare Events: In situations where certain data categories are inherently rare, FSL provides a mechanism to learn effectively. Similar data from related categories can serve as prior knowledge, enabling models to generalize even with limited examples of the rare target category.
Scenario 3: Overcoming Data Acquisition and Computational Bottlenecks: FSL offers a practical approach to reduce the dependence on massive labeled datasets. By utilizing a small set of labeled examples for each target class, potentially combined with unlabeled data from other classes or pre-trained models, FSL can significantly reduce data gathering efforts and computational costs associated with training large models.

FSL is often categorized based on the number of examples available per class. The “n-way k-shot” terminology precisely defines the learning setup: “n” represents the number of classes the model must distinguish, and “k” indicates the number of labeled samples per class. One-shot learning [48] refers to learning from a single example per class (k=1), while zero-shot learning [80] aims to classify unseen data categories without any training examples (k=0), relying on prior knowledge or semantic descriptions of the categories.
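To make the n-way k-shot setup concrete, the sketch below draws one training episode from a labeled dataset: n classes are chosen, and each contributes k support samples plus a few query samples for evaluation. The function name `sample_episode`, the `n_query` parameter, and the toy label list are assumptions introduced for illustration; they do not come from a specific library or from the cited works.

```python
import random
from collections import defaultdict


def sample_episode(labels, n_way, k_shot, n_query, rng=None):
    """Sample one n-way k-shot episode.

    `labels[i]` is the class of sample i. Returns (support, query),
    two lists of (sample_index, label) pairs with no shared indices.
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # A class qualifies only if it can fill both its support and query slots.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= k_shot + n_query]
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + n_query samples")

    support, query = [], []
    for cls in rng.sample(eligible, n_way):
        picked = rng.sample(by_class[cls], k_shot + n_query)
        support += [(i, cls) for i in picked[:k_shot]]
        query += [(i, cls) for i in picked[k_shot:]]
    return support, query


# Toy dataset: 10 samples for each of seven expression labels.
expressions = ["anger", "disgust", "fear", "happiness",
               "sadness", "surprise", "neutral"]
labels = [name for name in expressions for _ in range(10)]

# A 5-way 1-shot episode with 3 query samples per class.
support, query = sample_episode(labels, n_way=5, k_shot=1, n_query=3,
                                rng=random.Random(0))
```

With these settings the support set holds 5 samples (one per class) and the query set holds 15. Setting `k_shot=1` gives the one-shot case; the zero-shot case has no support samples at all and needs side information about the classes, so it falls outside this sampler.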

Because FSL approaches are tailored to specific domains and challenges, various techniques have been developed to enhance their performance. Data augmentation, model architecture innovations, and algorithmic refinements are key areas of focus. Overfitting and poor generalization are inherent risks in FSL due to the scarcity of training data. Leveraging prior knowledge is a central strategy to mitigate these risks and improve FSL effectiveness. Enhancing FSL through prior knowledge involves addressing challenges across three dimensions: data, model, and algorithm.

Data augmentation becomes critical in FSL to artificially expand the limited training set and introduce variability. Unlike supervised learning with abundant, perfectly annotated data, FSL necessitates augmenting the few available samples to extract more information and potentially leverage weakly labeled or unlabeled data as prior knowledge. Labeling strategies [81–83] have been developed to utilize weakly labeled and unlabeled datasets. Techniques using similar datasets [84–86], such as Generative Adversarial Networks (GANs) [85], can generate synthetic data to augment the training set.
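As a point of reference, the simplest form of augmentation applies label-preserving transformations to each available sample. The sketch below is an assumption-laden illustration, not one of the cited methods (which rely on labeling strategies or generative models): it mirrors a grayscale face image horizontally and shifts its brightness by a random amount. The names `horizontal_flip`, `jitter_brightness`, and `augment`, and the default `max_delta`, are hypothetical.

```python
import random


def horizontal_flip(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def jitter_brightness(image, max_delta, rng):
    """Add one random offset to all pixels, clipped to the range [0, 255]."""
    delta = rng.uniform(-max_delta, max_delta)
    return [[min(255.0, max(0.0, pixel + delta)) for pixel in row]
            for row in image]


def augment(image, copies, rng=None, max_delta=20.0):
    """Create `copies` randomly transformed variants of one image."""
    rng = rng or random.Random()
    variants = []
    for _ in range(copies):
        # Flip with probability 0.5, otherwise keep a plain copy.
        if rng.random() < 0.5:
            candidate = horizontal_flip(image)
        else:
            candidate = [list(row) for row in image]
        variants.append(jitter_brightness(candidate, max_delta, rng))
    return variants


# One hypothetical 2x3 grayscale sample expanded into four variants.
sample = [[10, 120, 250], [30, 60, 90]]
variants = augment(sample, copies=4, rng=random.Random(0))
```

Every variant keeps the label of the original image, so a single labeled sample yields several training inputs. Which transformations preserve an expression label is a judgment call; mirroring is assumed safe here.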

Model refinement in FSL encompasses strategies like multi-task learning and embedding learning. Multi-task learning [87] leverages inductive transfer, improving generalization by incorporating domain-specific information from related tasks as inductive bias. In FSL, multi-task learning [88–89] often involves parameter sharing and tying to reduce the hypothesis space, enhancing accuracy with limited data. Researchers like Motiian et al. and Hu et al. [88–90] have explored parameter tying as a regularization technique, penalizing deviations between parameters across different training batches [91–92].

Embedding learning, conversely, focuses on reducing data dimensionality, leading to a smaller hypothesis space and requiring fewer training samples. It learns from prior knowledge by extracting relevant features and obtaining additional information from the training data [93–96]. Task-invariant embedding models, which utilize similarity metrics to compare embeddings of training batches and test samples, have gained prominence. Prototypical networks [55], matching networks [97], and relation networks [67] have marked significant advancements in FSL, and are notably used in FER tasks within the FSL paradigm. Hybrid embedding models, designed to adapt to new tasks by learning task-specific information from training datasets, further enhance flexibility [98–101]. Additionally, incorporating external memory into models has been explored to improve FSL performance [54, 102].
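The metric-based idea behind prototypical networks can be sketched in a few lines. The example assumes the embeddings have already been produced by some trained network, which is omitted here; it shows only the classification rule, in which each class is represented by the mean of its support embeddings and a query is assigned to the nearest prototype under squared Euclidean distance. All function names and the two-dimensional toy embeddings are assumptions for illustration.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_prototypes(support_embeddings, support_labels):
    """Map each class label to the mean of its support embeddings."""
    grouped = {}
    for embedding, label in zip(support_embeddings, support_labels):
        grouped.setdefault(label, []).append(embedding)
    return {label: mean_vector(embs) for label, embs in grouped.items()}


def classify(query_embedding, prototypes):
    """Return (label, probabilities) from a softmax over negative distances."""
    class_names = list(prototypes)
    logits = [-squared_distance(query_embedding, prototypes[c])
              for c in class_names]
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    probs = {c: e / total for c, e in zip(class_names, exps)}
    return max(probs, key=probs.get), probs


# A 2-way 2-shot toy episode with hypothetical 2D embeddings.
support_embeddings = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
support_labels = ["happiness", "happiness", "sadness", "sadness"]

prototypes = build_prototypes(support_embeddings, support_labels)
predicted, probabilities = classify([0.7, 0.3], prototypes)
```

Because the prototypes are recomputed from whatever support set is supplied, the same embedding function can be reused for new classes without retraining a classifier head, which is what makes this family of models task-invariant.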

From an algorithmic perspective, research focuses on adapting search strategies based on prior knowledge. Refinements include fine-tuning parameters, aggregating parameter sets, and incorporating new parameters into existing models. Meta-learning, which aims to learn how to learn, plays a crucial role in optimizing parameters or learning optimizers themselves [103–104], enabling models to efficiently identify optimal hypotheses even with limited data.
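One way to picture learning an initialization that adapts quickly is the toy sketch below. The source does not name a specific algorithm; Reptile, a first-order meta-learning method, is swapped in here because it is short. The one-parameter model y = w * x, the task family defined by `train_slopes`, and all hyperparameter values are assumptions chosen for illustration.

```python
import random


def inner_adapt(w, xs, ys, lr=0.1, steps=5):
    """Take a few gradient steps on mean squared error for the model y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w


def reptile(task_slopes, meta_lr=0.1, meta_steps=200, k_shot=5, seed=0):
    """Learn an initial weight that adapts quickly to tasks y = slope * x."""
    rng = random.Random(seed)
    w_meta = 0.0
    for _ in range(meta_steps):
        # Sample a task and a small k-shot training set for it.
        slope = rng.choice(task_slopes)
        xs = [rng.uniform(-1.0, 1.0) for _ in range(k_shot)]
        ys = [slope * x for x in xs]
        # Adapt to the task, then move the initialization toward the result.
        w_task = inner_adapt(w_meta, xs, ys)
        w_meta += meta_lr * (w_task - w_meta)
    return w_meta


# Hypothetical family of related tasks with similar slopes.
train_slopes = [2.5, 2.8, 3.0, 3.2, 3.5]
w_meta = reptile(train_slopes)

# Adapt to an unseen task from five examples, with and without meta-learning.
new_slope = 3.3
xs = [-0.8, -0.3, 0.1, 0.6, 0.9]
ys = [new_slope * x for x in xs]
err_meta = abs(inner_adapt(w_meta, xs, ys) - new_slope)
err_zero = abs(inner_adapt(0.0, xs, ys) - new_slope)
```

After the same five adaptation steps on the same five examples, the meta-learned starting point ends closer to the new task's slope than a start from zero does. Here the prior knowledge is simply a better place to begin the search.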

3.2 FER Issues

As highlighted in Section 3.1.1, current FER research faces limitations, prompting the development of diverse methodologies to address them. This section reviews recent FER research achieving state-of-the-art performance in tackling challenges related to facial expression datasets and current FER methods. While CNN-based models have proven effective in extracting features from both controlled lab and real-world “in-the-wild” datasets, and are widely used in FER, deep learning does not invariably guarantee optimal performance for all FER problems. Therefore, this section delves into the inherent issues of deep learning-based FER, concerning both data and methodological aspects, to elucidate the rationale behind employing FSL in FER. Tables 1 and 2 summarize data-related and method-related FER issues, respectively.

3.2.1 Data

This section focuses on the inherent challenges within facial expression datasets, particularly “in-the-wild” datasets, which are prone to noise, occlusions, and variations in age and pose. Recent research addressing these data-related issues and strategies for enhancing FER performance are discussed.

Table 1 Various issues existing in facial datasets

| Issue | Description | Datasets Primarily Affected | Proposed Solutions