Contrastive Learning: 10 Frameworks Revolutionizing Self-Supervised Learning

Contrastive Learning has emerged as a powerhouse in the field of machine learning, particularly for self-supervised learning. This approach enables models to learn powerful representations from unlabeled data by contrasting similar and dissimilar data points. In essence, it teaches models to understand what makes data instances alike and different without explicit labels. This article dives into ten influential contrastive learning frameworks that have significantly advanced the field, offering a comprehensive overview for educators and learners alike at learns.edu.vn.

1. SimCLR: Simple Framework for Contrastive Learning of Visual Representations

SimCLR, a groundbreaking model developed at Google Brain and detailed in this paper, provides a straightforward yet highly effective framework for contrastive learning, specifically for visual representations. It was designed to tackle both Self-Supervised and Semi-Supervised Learning challenges through the lens of contrastive methods.

The core principle of SimCLR is to maximize the agreement between different augmented views of the same data sample within a learned latent space. This is achieved using a contrastive loss function that encourages representations of augmented versions of the same image to be close together, while pushing representations of augmentations from different images far apart. The architecture of SimCLR is illustrated below.

[Figure: The SimCLR framework. Source: paper.]

The SimCLR framework is composed of four key modules:

  1. Data Augmentation Module: This module is crucial for creating diverse views of each input image. For a given image, it randomly applies a sequence of transformations to generate two distinct, yet related versions (denoted as “x_i” and “x_j” in the diagram). These paired views form the “positive pairs” in contrastive learning. SimCLR employs a combination of augmentations, including random cropping and resizing with random flips, color distortions, and Gaussian blur. The research indicated that random cropping and color distortion are particularly vital for achieving optimal performance.

  2. Neural Network Base Encoder: Represented as “f(.)” in the architecture, this encoder extracts feature vectors from the augmented images. While SimCLR is flexible and can accommodate various network architectures, the authors opted for ResNet models for their simplicity and proven effectiveness in image representation learning. Features are extracted after the final average pooling layer of a ResNet-50 model.

  3. Projection Head: A small neural network, “g(.)”, known as the projection head, maps the features from the encoder into a lower-dimensional latent space. This projection is essential for applying the contrastive loss effectively. SimCLR uses a simple Multilayer Perceptron (MLP) with a single hidden layer for this projection. The researchers found that performing the contrastive loss in this projected space yields superior results compared to applying it directly to the features extracted from ResNet-50.

  4. Contrastive Loss Function: SimCLR utilizes the NT-Xent loss function, a normalized temperature-scaled cross entropy loss. This loss function quantifies the similarity between positive pairs (augmented views of the same image) and dissimilarity between negative pairs (augmented views of different images) in the latent space, driving the learning process.
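
To make the loss concrete, here is a minimal PyTorch sketch of an NT-Xent computation for a batch of positive pairs. It illustrates the idea rather than reproducing the official SimCLR implementation; the function name and default temperature are our own choices.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_i, z_j, temperature=0.5):
    """NT-Xent loss for a batch of N positive pairs (z_i[k], z_j[k])."""
    n = z_i.shape[0]
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)   # (2N, d), unit-norm rows
    sim = z @ z.t() / temperature                          # cosine similarity matrix (2N, 2N)
    # A sample must not be contrasted with itself
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    # The positive of row k is row k + N, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

Each embedding's only positive is the other augmented view of the same image; every other embedding in the batch acts as a negative.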


SimCLR achieved state-of-the-art results at the time of its publication, setting a new benchmark for self-supervised visual representation learning. Its simplicity and effectiveness have made it a foundational framework, inspiring numerous subsequent contrastive learning models.

2. NNCLR: Leveraging Nearest Neighbors for Contrastive Learning

While many contrastive learning methods, like SimCLR, rely on treating different augmentations of the same image as positive pairs, Nearest-Neighbor Contrastive Learning (NNCLR), introduced in this paper, takes a different approach. NNCLR explores using positives from other instances within the dataset. The core idea is to utilize different images that are semantically similar, rather than solely relying on augmented versions of the same image.

The NNCLR framework is visualized below.

[Figure: The NNCLR framework. Source: paper.]

NNCLR enhances the diversity of positive pairs by sampling nearest neighbors from the dataset directly in the latent space. These nearest neighbors, representing semantically similar images, are then treated as positive samples during contrastive learning. This method broadens the scope of positive examples, potentially leading to more robust and generalizable representations.


Similar to SimCLR, NNCLR also employs the InfoNCE loss function. However, in NNCLR, the positive sample is defined as the nearest neighbor of the anchor image in the latent space. The specific loss function used in NNCLR is defined as:

L_i = −log [ exp(NN(z_i, Q) · z_i+ / τ) / Σ_k exp(NN(z_i, Q) · z_k+ / τ) ]

(Here “τ” denotes a softmax temperature, and the sum in the denominator runs over the samples in the mini-batch.)

In this formula, “Q” represents the support set from which neighbors are selected, and “NN” denotes the nearest neighbor function. By incorporating nearest neighbors as positives, NNCLR encourages the model to learn representations that group semantically similar instances together, even if they are not simply augmented versions of each other.
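
The following PyTorch sketch illustrates the nearest-neighbor lookup and the resulting loss under the assumptions above (cosine similarity for the neighbor search, in-batch negatives). The function names and the support-set handling are simplified stand-ins; the paper maintains the support set as a queue of past embeddings.

```python
import torch
import torch.nn.functional as F

def nearest_neighbor(z, support_set):
    """For each embedding in z, return its most similar element of the support set Q."""
    z = F.normalize(z, dim=1)
    support = F.normalize(support_set, dim=1)
    idx = (z @ support.t()).argmax(dim=1)     # index of the nearest neighbor per sample
    return support[idx]                       # (N, d) neighbors used as positives

def nnclr_loss(z_i, z_j, support_set, temperature=0.1):
    """InfoNCE loss where the positive of view j is the nearest neighbor of view i in Q."""
    nn_i = nearest_neighbor(z_i, support_set)
    z_j = F.normalize(z_j, dim=1)
    logits = nn_i @ z_j.t() / temperature     # (N, N); off-diagonal entries are negatives
    targets = torch.arange(z_i.shape[0], device=z_i.device)
    return F.cross_entropy(logits, targets)
```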

3. ORE: Contrastive Learning for Open World Object Detection

Joseph et al. introduced the task of Open World Object Detection, along with a model called ORE (Open World Object Detector), in this research paper. ORE tackles two key problems:

  1. Unknown Object Identification: Identifying objects that the model has not been explicitly trained to recognize, categorizing them as “unknown” without requiring explicit supervision.
  2. Incremental Learning: Gradually learning to recognize these “unknown” categories as labels become available, without forgetting previously learned classes.

The ORE framework aims to create a more adaptable and intelligent object detection system. An overview of how it operates follows.

The ORE framework tackles the problem of incremental object detection, enabling the detection of previously unseen objects in images. As more information about these initially unknown classes becomes available, the system integrates this new knowledge into its existing understanding. This capability is essential for real-world applications where the set of objects a system might encounter is not fixed and predefined.

Supervising the clustering of unknown objects is crucial in ORE. However, manually annotating even a small fraction of the potentially infinite set of unknown classes is impractical. To overcome this, ORE uses an auto-labeling mechanism based on the Region Proposal Network (RPN): proposals that receive high objectness scores but do not overlap any ground-truth box of a known class are pseudo-labeled as unknown instances.

Contrastive clustering then enforces separation in the latent space: ORE maintains a prototype vector for each class and uses a contrastive loss that pulls instance features toward their own class prototype while pushing them away from the prototypes of other classes. This separation of known and auto-labeled unknown instances in the latent space helps the energy-based classification head distinguish known from unknown objects. To mitigate catastrophic forgetting of previously learned classes during incremental learning, ORE employs a replay mechanism: a small set of examples from older classes is “replayed” during each training iteration, so the model retains its knowledge of earlier categories. ORE has demonstrated superior performance compared to several state-of-the-art methods in open-world object detection.
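
A rough sketch of this prototype-based contrastive clustering term is given below; the margin value, tensor shapes, and function name are illustrative choices rather than the paper's exact implementation.

```python
import torch

def contrastive_clustering_loss(features, labels, prototypes, margin=10.0):
    """Pull each feature toward its class prototype; push it at least `margin`
    away from every other class prototype (hinge-style)."""
    dists = torch.cdist(features, prototypes)             # (N, C) Euclidean distances
    n, c = dists.shape
    own = torch.zeros(n, c, dtype=torch.bool, device=features.device)
    own[torch.arange(n), labels] = True
    pull = dists[own]                                      # distance to own class prototype
    push = torch.clamp(margin - dists[~own], min=0.0)      # hinge on all other prototypes
    return pull.mean() + push.mean()
```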


4. CURL: Contrastive Learning for Reinforcement Learning

CURL, short for Contrastive Unsupervised Representations for Reinforcement Learning (RL), was introduced in this paper. CURL integrates contrastive representation learning directly with the RL objective. In CURL, representation learning is framed as an auxiliary task that can be seamlessly integrated with any model-free RL algorithm, enhancing sample efficiency and learning robustness.

CURL employs contrastive learning by maximizing agreement between augmented versions of the same observation, where each observation is a stack of temporally consecutive frames. By performing contrastive learning concurrently with an off-policy RL algorithm, CURL significantly improves sample efficiency over previous pixel-based RL methods.

The design of CURL emphasizes simplicity and reproducibility, minimizing architectural overhead and learning complexity. This is crucial for practical application in RL, where computational efficiency is often paramount.

The contrastive learning objective in CURL operates within the same latent space and architecture typically used for model-free RL. This seamless integration within the existing RL training pipeline eliminates the need for extensive hyperparameter tuning and simplifies implementation.
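
A minimal sketch of CURL's bilinear contrastive head is shown below, assuming encoded anchor and positive observations are already available; the class name is ours, and the RL loss itself is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CURLContrastiveHead(nn.Module):
    """Bilinear similarity head in the spirit of CURL (a sketch, not the official code)."""

    def __init__(self, z_dim):
        super().__init__()
        self.W = nn.Parameter(torch.rand(z_dim, z_dim))

    def loss(self, z_anchor, z_pos):
        # Bilinear similarity q^T W k between each anchor and every key in the batch
        logits = z_anchor @ (self.W @ z_pos.t())                   # (N, N); diagonal = positives
        logits = logits - logits.max(dim=1, keepdim=True).values   # numerical stability
        targets = torch.arange(z_anchor.shape[0], device=z_anchor.device)
        return F.cross_entropy(logits, targets)
```

In practice this auxiliary loss is simply added to the RL objective and optimized jointly.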

5. PCRL: Preservational Contrastive Representation Learning for Medical Imaging

Preservational Contrastive Representation Learning (PCRL), detailed in this paper, is specifically designed for learning self-supervised representations in the domain of medical imaging. Medical imaging often requires highly sensitive and information-rich representations, and PCRL aims to address this need. The overview of the PCRL model is shown below.

[Figure: Overview of the PCRL model. Source: paper.]

While traditional contrastive learning focuses on learning invariant representations by contrasting image pairs, implicitly preserving maximal information, PCRL argues for the benefit of explicitly preserving more information in addition to the contrastive loss.

The authors posit that simply adding a direct reconstruction branch to restore original inputs does not significantly enhance learned representations. To overcome this limitation, PCRL reconstructs diverse contexts using representations learned from the contrastive loss. This approach forces the representations to retain more detailed and nuanced information.

PCRL implements two key modules to facilitate diverse context restoration:

  1. Transformation-conditioned Attention: This module enables the reconstruction of diverse contexts by conditioning the reconstruction process on the specific augmentations applied.
  2. Cross-model Mixup: This module shuffles feature representations to create more diverse restoration targets, further encouraging information preservation.

PCRL utilizes a triple encoder, single decoder architecture for self-supervised learning. The encoders learn representations through contrastive loss, while the decoder is tasked with reconstructing diverse contexts, ensuring information richness in the learned representations, crucial for the detail-sensitive nature of medical image analysis.

6. SwAV: Swapping Assignments for Contrastive Unsupervised Visual Learning

Swapping Assignments between multiple Views (SwAV), developed by Facebook AI Research and presented in this paper, is an unsupervised contrastive clustering approach. SwAV leverages the benefits of contrastive methods but uniquely avoids direct pairwise feature comparisons.

Instead of directly comparing features, SwAV simultaneously performs data clustering and enforces consistency between cluster assignments generated for different augmented views of the same image. In simpler terms, SwAV employs a “swapped” prediction mechanism: it predicts the cluster assignment (code) of one view from the representation of another view of the same image.

The architecture of SwAV is depicted below.

[Figure: The SwAV architecture. Source: paper.]

SwAV introduces a Multi-Crop augmentation strategy to generate multiple views of the same image without a significant increase in computational cost. This involves creating two standard high-resolution views (e.g., 224×224 pixels) and several lower-resolution views (e.g., 96×96 pixels). The inclusion of low-resolution views allows the model to train with a richer set of image samples while maintaining computational efficiency.

The base encoder in SwAV is typically a ResNet backbone (with varying depths). A key innovation of SwAV is its online cluster assignment process using mini-batches. Traditional clustering methods often operate offline, requiring the entire dataset to be processed at once. SwAV’s online approach makes it more scalable and adaptable to large datasets.

The swapped loss function used in SwAV is defined as:

L(z_t, z_s) = l(z_t, q_s) + l(z_s, q_t)
Here, “z_t” and “z_s” represent features extracted from two different views of the same image, and “q_t” and “q_s” are their corresponding intermediate codes (cluster assignments), obtained by matching features to a set of prototype vectors. “l(z, q)” measures the compatibility between features “z” and a code “q.” SwAV achieved impressive results, demonstrating the effectiveness of its swapped prediction and multi-crop strategies.
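
A compact sketch of the swapped prediction objective is given below, assuming the codes q_t and q_s have already been computed (in SwAV they are obtained online with the Sinkhorn-Knopp algorithm, which is omitted here).

```python
import torch
import torch.nn.functional as F

def swav_subloss(z, q, prototypes, temperature=0.1):
    """l(z, q): cross-entropy between code q and the softmax of prototype scores for z."""
    z = F.normalize(z, dim=1)
    c = F.normalize(prototypes, dim=1)
    p = F.softmax(z @ c.t() / temperature, dim=1)   # predicted assignment from features
    return -(q * torch.log(p)).sum(dim=1).mean()

def swav_loss(z_t, z_s, q_t, q_s, prototypes):
    """Swapped prediction: predict the code of one view from the features of the other."""
    return swav_subloss(z_t, q_s, prototypes) + swav_subloss(z_s, q_t, prototypes)
```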


7. MoCo: Momentum Contrast for Unsupervised Visual Representation Learning

Momentum Contrast (MoCo) is a self-supervised learning algorithm that utilizes a contrastive loss, as described in this influential paper. MoCo addresses a key challenge in contrastive learning: the need for a large and consistent negative sample set.


Contrastive loss methods can be viewed as building dynamic dictionaries. In this analogy, “keys” (tokens) in the dictionary are data samples (e.g., images or image patches) and are represented by an encoder network. Unsupervised learning in this context trains encoders to perform dictionary look-up: an encoded “query” sample should be similar to its matching key and dissimilar to other keys in the dictionary. Learning is formulated as minimizing a contrastive loss function.

To implement this efficiently with a large negative sample set, MoCo employs a queue of mini-batches encoded by a “momentum encoder” network. As a new mini-batch is processed, its encodings are added to the queue, and the oldest encodings are removed. This queue mechanism decouples the dictionary size from the batch size, allowing for a much larger dictionary of negative samples to be used for contrastive learning. The momentum encoder is updated as a moving average of the main encoder, ensuring consistency of the keys in the dynamic dictionary.
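
The two mechanisms described above — the momentum update and the rolling queue of keys — can be sketched in a few lines of PyTorch, following the structure of the paper's published pseudocode. The queue is assumed to be a tensor of shape (dim, K) with K divisible by the batch size, and queue_ptr a one-element tensor.

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """Key encoder = exponential moving average of the query encoder."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data = m * p_k.data + (1.0 - m) * p_q.data

@torch.no_grad()
def dequeue_and_enqueue(queue, queue_ptr, keys):
    """Replace the oldest keys in the queue with the newest mini-batch of keys."""
    batch_size = keys.shape[0]
    ptr = int(queue_ptr)
    queue[:, ptr:ptr + batch_size] = keys.t()           # overwrite the oldest slots
    queue_ptr[0] = (ptr + batch_size) % queue.shape[1]  # advance the circular pointer
```

Because the large momentum coefficient makes the key encoder change slowly, keys stored many iterations ago remain consistent with freshly encoded queries.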

8. Supervised Contrastive Segmentation: Pixel-Wise Contrastive Learning for Semantic Segmentation

Traditional semantic segmentation methods often focus on “local” context, i.e., relationships between pixels within a single image. These methods may overlook the “global” context of the training data – the rich semantic relationships between pixels across different images.

The fully-supervised contrastive segmentation framework, presented in this paper, provides a solution by applying contrastive learning in a supervised setting for semantic segmentation. It enforces pixel embeddings belonging to the same semantic class to be more similar to each other than to embeddings of pixels from different classes.

The overview diagram of this method is shown below.

[Figure: Overview of the pixel-wise contrastive segmentation framework. Source: paper.]

This pixel-wise contrastive learning approach for semantic segmentation shifts the conventional image-wise training strategy to an inter-image, pixel-to-pixel paradigm. It learns a well-structured pixel semantic embedding space by fully exploiting global semantic similarities among labeled pixels across the entire dataset.

To effectively explore the large visual data space and facilitate pixel-to-region contrast, the authors introduce a region memory module. This module, combined with pixel-to-pixel contrast computation, effectively leverages semantic correlations both among individual pixels and between pixels and broader semantic regions.

The model employs a weighted average of the pixel-wise cross-entropy loss and the supervised NCE loss. This combined loss function provides superior clustering results compared to using cross-entropy loss alone, leading to improved semantic segmentation performance.
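
As a hedged sketch of how the combined objective might look in a single training step: the model interface, the pixel contrastive term, and the weighting factor lambda_c are illustrative placeholders rather than the paper's exact setup, which additionally samples anchor pixels and draws negatives from the region memory.

```python
import torch.nn.functional as F

def segmentation_step(images, masks, model, pixel_contrast_loss, lambda_c=0.1):
    """One training step: pixel-wise cross-entropy plus a weighted contrastive term."""
    logits, pixel_embeddings = model(images)              # per-pixel class scores and embeddings
    ce = F.cross_entropy(logits, masks)                   # standard cross-entropy over pixels
    nce = pixel_contrast_loss(pixel_embeddings, masks)    # supervised contrast over sampled pixels
    return ce + lambda_c * nce
```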


The quantitative results obtained by this method demonstrate significant improvements in semantic segmentation accuracy.


9. PCL: Prototypical Contrastive Learning Bridging Clustering and Contrastive Learning

Prototypical Contrastive Learning (PCL), detailed in this paper, is an unsupervised representation learning method that effectively bridges the gap between contrastive learning and clustering techniques. PCL learns robust low-level features for instance discrimination while simultaneously encoding semantic structures revealed by clustering into the learned embedding space.

The training framework for the PCL model is shown below.

[Figure: The PCL training framework. Source: paper.]

PCL reframes instance discrimination methods within the Expectation-Maximization (EM) framework. It introduces prototypes as latent variables to facilitate maximum-likelihood estimation of network parameters. The training process iteratively performs two steps:

  1. E-step (Expectation): Finding the distribution of prototypes through clustering the data representations.
  2. M-step (Maximization): Optimizing the network parameters using contrastive learning, guided by the prototype distributions from the E-step.

PCL also introduces ProtoNCE, a contrastive loss function that generalizes the widely used InfoNCE loss. ProtoNCE contrasts each sample against cluster prototypes (at several clustering granularities) and dynamically estimates the concentration of the feature distribution around each prototype, using it as a per-prototype temperature. It also retains a standard InfoNCE term, which can be viewed as a special case in which every instance acts as its own prototype.
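
The prototype term of ProtoNCE can be sketched as follows for a single clustering granularity, assuming the prototypes, cluster assignments, and per-cluster concentration estimates come from the preceding E-step; the full loss averages such terms over several clusterings and adds the standard InfoNCE term.

```python
import torch
import torch.nn.functional as F

def proto_nce_term(z, prototypes, assignments, phi):
    """One prototype term of ProtoNCE for a single clustering granularity.

    z:           (N, d) L2-normalized instance embeddings
    prototypes:  (K, d) L2-normalized cluster centroids from the E-step
    assignments: (N,)   cluster index of each instance
    phi:         (K,)   per-cluster concentration estimates (per-prototype temperature)
    """
    logits = (z @ prototypes.t()) / phi[None, :]   # scale each prototype by its concentration
    return F.cross_entropy(logits, assignments)
```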

PCL has demonstrated superior performance compared to instance-wise contrastive learning methods across multiple benchmarks, particularly showing significant improvements in low-resource transfer learning scenarios. PCL also yields improved clustering results, highlighting its ability to learn semantically meaningful representations.


10. SSCL: Self-Supervised Contrastive Learning for Aspect Detection

The Self-Supervised Contrastive Learning (SSCL) framework, detailed in this paper, addresses the aspect detection problem in natural language processing. Aspect detection involves identifying interpretable aspects and extracting aspect-specific segments (like sentences) from online reviews or text documents.

Traditional deep learning-based topic models, especially aspect-based autoencoders, often suffer from issues like extracting noisy aspects and struggling to accurately map model-discovered aspects to aspects of actual interest. SSCL aims to overcome these challenges.

To address these issues, SSCL proposes a framework comprising a Smooth Self Attention (SSA) model combined with a high-resolution selective mapping (HRSMap) method. The overview of the SSCL method is shown below.

[Figure: Overview of the SSCL framework. Source: paper.]

SSCL constructs two types of representations for each review segment in a corpus: (i) word embeddings and (ii) aspect embeddings. A contrastive learning mechanism is then employed to map aspect embeddings into the word embedding space. In the diagram, “alpha” represents smooth self-attention parameters, while “beta” represents soft-labels (probability distribution) over model-inferred aspects for a review segment.
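
As a rough illustration of the general idea (not the paper's exact objective), the sketch below contrasts each segment's attention-weighted word representation with its aspect-based reconstruction, using the other segments in the batch as negatives; all names and shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def aspect_contrastive_loss(word_repr, aspect_probs, aspect_embeddings, temperature=0.1):
    """Contrast each segment's word-based representation with its aspect-based reconstruction.

    word_repr:         (B, d) attention-weighted word embedding of each segment ("alpha" side)
    aspect_probs:      (B, K) soft labels over model-inferred aspects ("beta" side)
    aspect_embeddings: (K, d) learnable aspect embedding matrix
    """
    recon = aspect_probs @ aspect_embeddings          # (B, d) reconstruction in word space
    z_w = F.normalize(word_repr, dim=1)
    z_a = F.normalize(recon, dim=1)
    logits = z_w @ z_a.t() / temperature              # other segments act as in-batch negatives
    targets = torch.arange(word_repr.shape[0], device=word_repr.device)
    return F.cross_entropy(logits, targets)
```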

Selective mapping in HRSMap ensures that the model avoids mapping noisy or irrelevant aspects to gold-standard aspects. For aspect mapping, SSCL utilizes a high-resolution mapping approach, where the number of model-inferred aspects is significantly larger (at least three times more) than the number of gold-standard aspects. This over-parameterization provides better coverage and granularity in aspect representation. This high-resolution mapping is pictorially represented below.

[Figure: High-resolution selective mapping (HRSMap). Source: paper.]

Applications of Contrastive Learning

Contrastive learning has proven to be a versatile technique with broad applicability across various domains. Here are some of the most prominent applications:

Semi-supervised Learning: Leveraging Unlabeled Data

Acquiring large amounts of labeled data can be a significant bottleneck, particularly in specialized fields like astronomy, remote sensing, and biomedical engineering. In many cases, datasets are predominantly unlabeled, with only a small fraction of samples annotated.

Semi-Supervised Learning aims to effectively utilize both unlabeled and labeled data to train models. Research in 2020, as highlighted in this paper, demonstrated that deeper and wider self-supervised models are powerful semi-supervised learners.

The typical approach involves pre-training a model in an unsupervised manner using the abundant unlabeled data. Subsequently, the model is fine-tuned using the limited labeled samples available. This pre-training stage, often leveraging contrastive learning, allows the model to learn generalizable representations from the vast unlabeled data, which are then refined for specific tasks using labeled data.


After pre-training and fine-tuning a large deep network, the model can be further distilled into a smaller, more efficient network using “Knowledge Distillation.” This process leverages the unlabeled examples a second time, but in a task-specific way, with minimal loss in classification accuracy.
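
The distillation step can be sketched as a soft cross-entropy between the fine-tuned teacher's predictions and the student's predictions on unlabeled images; this is a generic formulation, with the temperature treated as an assumed hyperparameter.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft cross-entropy between teacher and student predictions on unlabeled data."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=1)
    return -(teacher_probs * student_log_probs).sum(dim=1).mean()
```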

Contrastive learning has significantly boosted the performance of semi-supervised learning. For instance, researchers achieved a 10x improvement in label efficiency on ImageNet using only 1% of the labels (fewer than 13 labeled images per class) compared to the previous state of the art. With 10% of the labels, they even surpassed standard supervised models trained with the full set of labels.


Supervised Learning: Enhancing Performance with Contrastive Loss

Contrastive learning is increasingly being applied in fully-supervised settings, moving beyond its initial focus on self-supervision. In fully-supervised learning, class labels are readily available, enabling more effective formulation of contrastive loss functions. Positive pairs are no longer limited to augmented versions of the same sample; they can be any sample belonging to the same class.

This research paper bridges the gap between self-supervised and fully supervised learning by introducing the SupCon (Supervised Contrastive Learning) loss function. SupCon encourages embeddings from the same class to be closer together in the embedding space, while pushing embeddings from different classes further apart.

SupCon accommodates multiple positives per anchor: any sample from the same class can serve as a positive, which simplifies positive-pair selection, avoids false negatives, and yields a more diverse, semantically relevant set of positive examples. Unlike self-supervised contrastive methods, where labels are used only for the downstream task, SupCon lets label information participate directly in representation learning.
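
A minimal PyTorch sketch of a SupCon-style loss is given below; it follows the "multiple positives per anchor" formulation described above, though the official implementation also handles multiple augmented views per sample and other details omitted here.

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss: every sample sharing the anchor's label is a positive."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature
    logits = sim - sim.max(dim=1, keepdim=True).values            # numerical stability
    n = z.shape[0]
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask  # same-class pairs, excluding self
    exp_logits = torch.exp(logits).masked_fill(self_mask, 0.0)
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True))
    pos_counts = pos_mask.sum(dim=1)
    mean_log_prob_pos = (pos_mask.float() * log_prob).sum(dim=1) / pos_counts.clamp(min=1)
    return -(mean_log_prob_pos[pos_counts > 0]).mean()           # skip anchors with no positive
```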

Models trained with SupCon have demonstrated robustness to image corruptions and variations in hyperparameters.


Natural Language Processing (NLP): Learning Sentence Embeddings

Contrastive learning has found valuable applications in Natural Language Processing (NLP), as exemplified by models like SimCSE. In NLP, contrastive learning aims to learn embedding spaces where semantically similar sentences are located close to each other, while dissimilar sentences are far apart.

However, text augmentation, crucial for creating positive pairs in contrastive learning, is more challenging in NLP than in computer vision. Maintaining the meaning of a sentence while applying augmentations is essential. Several text augmentation techniques are used in contrastive NLP:

  1. Back-Translation: This technique generates augmented sentences by translating a sentence to another language (e.g., English to Japanese) and then back-translating it to the original language (Japanese back to English). CERT is a framework that utilizes back-translation for text augmentation.

  2. Lexical Edits: This approach applies simple sets of operations to a sentence to create augmentations:

    • Random Insertion: Inserting synonyms of non-stop words at random positions.
    • Random Swap: Randomly swapping words within a sentence.
    • Random Deletion: Randomly deleting words with a certain probability.
    • Synonym Replacement: Replacing words with their synonyms.
  3. Cutoff: Proposed in this paper, cutoff techniques operate on sentence embeddings. Once a sentence is embedded into a vector representation, strategies like Feature Cutoff (removing features), Token Cutoff (removing token information), or Span Cutoff (removing continuous text chunks) are applied.

  4. Dropout: As implemented in SimCSE, dropout leverages the inherent dropout mechanism in Transformer networks. By feeding the same input sentence to the encoder twice with different dropout masks, different representations are generated, serving as positive pairs for contrastive learning.
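
The dropout trick can be sketched with the Hugging Face transformers library as follows. The model name is an arbitrary example, and real SimCSE adds a pooling MLP and trains at scale, so treat this purely as an illustration of how two dropout masks yield a positive pair.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer  # assumes the Hugging Face transformers library

def simcse_dropout_loss(sentences, model, tokenizer, temperature=0.05):
    """Encode each sentence twice with dropout active; the two encodings form a positive pair."""
    model.train()                                        # keep dropout enabled
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    z1 = model(**batch).last_hidden_state[:, 0]          # [CLS] embedding, first forward pass
    z2 = model(**batch).last_hidden_state[:, 0]          # second pass -> a different dropout mask
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                   # other sentences act as in-batch negatives
    targets = torch.arange(len(sentences))
    return F.cross_entropy(logits, targets)

# Usage (illustrative): model = AutoModel.from_pretrained("bert-base-uncased")
#                       tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```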

Computer Vision: Expanding Applications Beyond Representation Learning

Contrastive learning has been most extensively explored and applied in Computer Vision. We’ve already discussed several image-based contrastive learning frameworks. Beyond these, contrastive learning is driving innovation in various computer vision tasks:

  1. Video Sequence Prediction: VideoMoCo uses contrastive learning for unsupervised video representation learning. It employs temporally adversarial examples for augmentation, dropping frames from sequences to create positive and negative pairs.

  2. Object Detection: DetCo is a self-supervised contrastive approach for object detection. It uses contrastive learning between global images and local patches, combined with multi-level supervision of intermediate representations.

  3. Semantic Segmentation: Contrastive learning for semantic image segmentation has been explored in this paper. It utilizes a supervised contrastive loss for pre-training models, followed by traditional cross-entropy fine-tuning for segmentation.

  4. Remote Sensing: GLCNet employs a self-supervised pre-training and supervised fine-tuning approach for remote sensing image segmentation. It utilizes both global image-level representations via a “global style contrastive learning module” and local region representations using a “local features matching contrastive learning module.”

  5. Perceptual Audio Similarity: CDPAM is an unsupervised contrastive learning method for classifying audio samples based on perceptual similarity. The model is fine-tuned using human judgments to improve generalization across diverse audio perturbations.

Contrastive Learning: A Powerful Tool for Representation Learning – Summary

Contrastive Learning has rapidly become a dominant technique in self-supervised learning and a valuable enhancement for supervised learning approaches. Its core principle of contrasting data samples to learn representations has proven highly effective. By pushing representations of similar samples closer and dissimilar samples farther apart in an embedding space, contrastive learning enables models to understand data relationships without extensive labeled data.

Various strategies exist for generating positive contrast samples, from augmenting entire samples to creating subsamples. Contrastive learning-based methods have significantly improved performance in Semi-Supervised Learning and Representation Learning tasks across diverse domains.

This article has explored ten popular contrastive learning frameworks and discussed the diverse loss functions and architectural innovations developed within this field. We have also highlighted numerous applications of contrastive learning, spanning computer vision, NLP, audio processing, and beyond.

Ongoing research continues to explore ways to minimize supervision in contrastive learning while matching or exceeding traditional supervised learning. Contrastive learning is a key driver in the advancement of deep learning, offering powerful tools for representation learning and pushing the boundaries of what is possible across AI applications. As deep learning remains the go-to technique for complex tasks, particularly in computer vision and NLP, contrastive learning will play an increasingly important role in overcoming data limitations and improving model performance.
