Artificial intelligence has witnessed remarkable advancements, particularly in creating systems adept at learning from vast datasets of meticulously labeled information. This supervised learning paradigm has proven highly effective for developing specialized models that excel at the specific tasks they are trained on. However, relying solely on supervised learning presents limitations for the future progress of AI.
Supervised learning becomes a bottleneck when aiming to develop more versatile, generalist AI models capable of handling diverse tasks and acquiring new skills without requiring massive amounts of labeled data. The reality is that labeling every aspect of the world is simply not feasible. Furthermore, certain tasks inherently lack sufficient labeled data, such as building translation systems for less common languages. For AI to truly advance and approach human-level intelligence, it needs to develop a more profound and nuanced understanding of the world, extending beyond the confines of explicit training datasets.
Human infants learn about the world primarily through observation, developing generalized predictive models about objects and their behaviors by grasping fundamental concepts like object permanence and gravity. As we mature, we continue to learn by observing, interacting with the world, and refining our understanding through trial and error, forming hypotheses to explain how our actions influence our surroundings.
It’s hypothesized that this generalized world knowledge, often termed “common sense,” forms the bedrock of biological intelligence in both humans and animals. This seemingly innate common sense, something we take for granted, has remained a significant challenge in AI research since its inception. In a sense, common sense represents the “dark matter” of artificial intelligence – a crucial but elusive component.
Common sense empowers humans to learn new skills efficiently, without needing extensive instruction for each individual task. For instance, after seeing just a few drawings of cows, young children can typically recognize any cow they encounter. In contrast, AI systems trained through supervised learning require numerous examples of cow images and may still struggle to identify cows in unfamiliar contexts, like lying on a beach. Consider driving: humans can learn to drive a car with approximately 20 hours of practice and minimal supervision, while achieving fully autonomous driving remains a challenge for even the most sophisticated AI, despite being trained on thousands of hours of human driving data. The key difference lies in humans’ reliance on pre-existing background knowledge about how the world operates.
So, how can we equip machines with this same capability?
Self-supervised learning (SSL) emerges as a highly promising avenue for imbuing AI systems with this essential background knowledge and approximating a form of common sense.
Self-supervised learning enables AI models to learn from significantly larger datasets, potentially orders of magnitude greater than what’s feasible with labeled data. This capability is crucial for recognizing and understanding subtle patterns and less common representations of the world. Self-supervised learning has already revolutionized the field of natural language processing (NLP), driving advancements in models like the Collobert-Weston 2008 model, Word2Vec, GloVe, fastText, and more recently, BERT, RoBERTa, XLM-R, among others. Systems pre-trained using self-supervision consistently outperform those trained solely with supervised methods.
Our recent research project, SEER, leverages SwAV and similar techniques to pre-train a large network on a billion unlabeled images. This has yielded top-tier accuracy across a diverse range of computer vision tasks. This progress underscores that self-supervised learning can excel in computer vision tasks within complex, real-world environments.
Today, we delve into the reasons why self-supervised learning holds the key to unlocking the “dark matter” of intelligence and represents the next major frontier in AI. We will also explore what we believe are the most promising new directions: energy-based models for prediction under uncertainty, joint embedding methods, and latent-variable architectures for self-supervised learning and reasoning in AI systems.
Self-Supervised Learning: Learning to Predict
At its core, self-supervised learning derives its supervisory signals directly from the data itself, often by exploiting the inherent structure within the data. The fundamental principle of self-supervised learning is to train a model to predict any unobserved or hidden part (or attribute) of the input based on the observed or unhidden parts. For example, a common technique in NLP involves masking parts of a sentence and training the model to predict the missing words from the remaining context. Similarly, in video analysis, a model can be trained to predict future or past frames (hidden data) from the current frame (observed data). Because self-supervised learning capitalizes on the intrinsic structure of the data, it can utilize a wide array of supervisory signals across different modalities (e.g., video and audio) and across massive datasets – all without the need for explicit labels.
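To make this concrete, here is a minimal sketch, in PyTorch, of how a supervisory signal can be derived from unlabeled video: the current frames serve as the observed input and the next frame as the hidden target, with no labels involved. The model, layer sizes, and variable names below are purely illustrative and not taken from any particular system.

```python
# Minimal sketch (PyTorch): deriving a supervisory signal from unlabeled video.
# The "label" is simply the next frame, taken from the data itself.
import torch
import torch.nn as nn

class FramePredictor(nn.Module):
    """Illustrative model that predicts the next frame from the current one."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, kernel_size=3, padding=1),
        )

    def forward(self, frame):
        return self.net(frame)

model = FramePredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

video = torch.rand(16, 3, 64, 64)          # unlabeled frames: (time, channels, H, W)
observed, hidden = video[:-1], video[1:]   # observed frames and their hidden "targets"

prediction = model(observed)
loss = nn.functional.mse_loss(prediction, hidden)  # predict the hidden part of the input
loss.backward()
optimizer.step()
```

Note that a single mean-squared-error prediction like this cannot express uncertainty over the many plausible futures of a video, a limitation discussed in the sections below.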
The term “self-supervised learning” has gained wider acceptance than the previously used term “unsupervised learning” precisely because of these supervisory signals. “Unsupervised learning” is considered an ill-defined and somewhat misleading term, as it implies learning without any supervision whatsoever. In reality, self-supervised learning is not entirely unsupervised; it leverages significantly more feedback signals than traditional supervised and even reinforcement learning methods.
NLP vs. Computer Vision: Divergent Paths in Self-Supervised Learning
Self-supervised learning has profoundly impacted NLP, enabling the training of powerful models like BERT, RoBERTa, and XLM-R on vast amounts of unlabeled text data. These models are first pre-trained in a self-supervised manner and then fine-tuned for specific downstream tasks, such as text classification. During the self-supervised pre-training phase in NLP, the system is presented with short text segments (typically around 1,000 words) where some words have been deliberately masked or replaced. The model’s objective is to predict these masked or replaced words. Through this process, the system learns to represent the meaning of the text in a way that allows it to accurately fill in the “correct” words – those that contextually fit.
Predicting missing parts of the input is a standard and effective pre-training task in SSL. To complete a sentence like “The (blank) chases the (blank) in the savanna,” the model must learn that predators like lions or cheetahs might chase prey like antelope or wildebeests, but that domestic cats chase mice indoors, not in the savanna. As a consequence of this training, the system develops an understanding of word meaning, syntactic roles, and the overall semantic context of texts.
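As an illustration of this kind of contextual prediction, a pretrained masked language model can be queried directly; the short example below uses the third-party Hugging Face transformers library and its publicly released bert-base-uncased checkpoint, and the exact candidates and scores it returns will vary.

```python
# Illustration: querying a pretrained masked language model for the words
# that best fit a masked position, given the surrounding context.
# Requires the third-party Hugging Face `transformers` package.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill_mask("The [MASK] chases the antelope in the savanna."):
    print(f"{candidate['token_str']:>12}  score={candidate['score']:.3f}")
```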
However, directly applying these successful NLP techniques to domains like computer vision (CV) has proven challenging. Despite promising early results, SSL has not yet achieved the same transformative improvements in CV as seen in NLP – though this is rapidly changing.
A primary reason for this disparity lies in the increased difficulty of representing uncertainty in predictions for images compared to words. When a missing word is not definitively predictable (e.g., is it “lion” or “cheetah”?), the NLP system can assign a probability or score to each word in its vocabulary. High scores are given to likely words like “lion” and “cheetah,” while low scores are assigned to less probable words.
Training self-supervised models at this scale also necessitated architectures that are computationally efficient in terms of runtime and memory usage, without sacrificing accuracy. Fortunately, recent architectural innovations, such as the RegNet model family developed by FAIR, have emerged to meet these demands. RegNets are ConvNets designed to scale to billions or even trillions of parameters and can be optimized for different runtime and memory constraints.
However, representing uncertainty efficiently when predicting missing frames in a video or missing patches in an image presents a significant hurdle. We cannot simply list all possible video frames and assign scores, as the possibilities are virtually infinite. While this challenge has historically limited the performance gains from SSL in computer vision, new techniques like SwAV are starting to overcome these limitations and achieve record-breaking accuracy in vision tasks. The SEER system, utilizing a large convolutional network trained on billions of examples, exemplifies this progress.
Modeling Prediction Uncertainty: Energy-Based Models
To better grasp this challenge, we need to examine how prediction uncertainty is modeled in NLP versus CV. In NLP, predicting missing words involves calculating a prediction score for each word in the vocabulary. While the vocabulary is extensive and predicting a missing word involves uncertainty, it’s feasible to generate a list of all possible words along with probability estimates for their appearance in that location. Typical machine learning systems accomplish this by framing the prediction as a classification problem and using a “softmax layer” to transform raw scores into a probability distribution across all words. This technique effectively represents prediction uncertainty as a probability distribution across a finite set of outcomes.
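The sketch below shows this mechanism on a tiny, made-up vocabulary: raw per-word scores (logits) are passed through a softmax to produce a probability distribution over the finite set of candidate words.

```python
# Sketch: turning raw prediction scores over a (tiny, invented) vocabulary
# into a probability distribution with a softmax layer.
import torch

vocabulary = ["lion", "cheetah", "mouse", "refrigerator"]
logits = torch.tensor([4.0, 3.5, 0.5, -2.0])   # raw scores produced by the model

probabilities = torch.softmax(logits, dim=0)   # normalizes the scores to sum to 1
for word, p in zip(vocabulary, probabilities):
    print(f"{word:>12}: {p.item():.3f}")
```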
Conversely, in CV and related domains such as speech, the analogous task of predicting “missing” video frames, image patches, or speech segments involves predicting high-dimensional continuous objects rather than discrete outcomes. There is an infinite spectrum of plausible video frames that could follow a given clip. Explicitly representing all possible video frames and assigning prediction scores becomes impossible. In fact, we may never develop techniques to represent suitable probability distributions across high-dimensional continuous spaces like the space of all possible video frames.
This seems like a fundamentally intractable problem.
A Unified Perspective: Self-Supervised Learning as Energy-Based Modeling
However, we can conceptualize SSL within the unified framework of energy-based models (EBMs). An EBM is a trainable system that, given two inputs, x and y, quantifies their incompatibility. For instance, x could be a short video clip, and y another proposed video clip. The EBM would assess how well y serves as a continuation for x. To express this incompatibility, the EBM outputs a single numerical value called “energy.” Low energy signifies compatibility between x and y, while high energy indicates incompatibility.
Training an EBM involves two key steps: (1) presenting it with examples of compatible (x, y) pairs and training it to produce low energy for these pairs, and (2) ensuring that for a given x, incompatible y values result in higher energy than compatible y values. The first step is straightforward, but the second step presents the core challenge.
In the context of image recognition, an EBM might take two images, x and y, as inputs. If x and y are slightly altered versions of the same underlying image, the model is trained to output low energy. For example, x could be a photograph of a car, and y could be a photo of the same car taken from a slightly different angle, at a different time of day, resulting in shifts, rotations, size variations, and subtle changes in color and shadows.
Joint Embedding and Siamese Networks
A particularly well-suited deep learning architecture for implementing EBMs is the Siamese network or joint embedding architecture. The foundational ideas trace back to research from Geoff Hinton’s and Yann LeCun’s labs in the early 1990s and mid-2000s. While initially overlooked, this approach has experienced a resurgence since late 2019. A joint embedding architecture consists of two identical (or nearly identical) copies of the same neural network. One network processes input x, and the other processes input y. These networks generate output vectors called “embeddings,” which represent x and y in a lower-dimensional space. A third module, connecting the networks at their outputs, calculates the energy as the distance between these two embedding vectors. When the model is presented with distorted versions of the same image, the network parameters are adjusted to bring their output embeddings closer together. This ensures that the network produces similar representations (embeddings) for the same object, regardless of variations in viewpoint or appearance.
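A minimal sketch of this idea, assuming a PyTorch setup with an illustrative encoder, is given below: the same network embeds both inputs, and the energy is simply the distance between the two embeddings.

```python
# Sketch of a joint embedding (Siamese) energy-based model: one shared encoder
# embeds both inputs, and the energy is the distance between the embeddings.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Illustrative image encoder; any backbone could stand in here."""
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, dim),
        )

    def forward(self, image):
        return self.backbone(image)

encoder = Encoder()  # the *same* network (shared weights) processes x and y

def energy(x, y):
    """Low energy when x and y are views of the same underlying content."""
    hx, hy = encoder(x), encoder(y)
    return ((hx - hy) ** 2).sum(dim=1)  # squared Euclidean distance

x = torch.rand(8, 3, 64, 64)           # a batch of images
y = x + 0.05 * torch.randn_like(x)     # slightly distorted versions of the same images
print(energy(x, y))                    # training drives these values down
```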
The challenge lies in ensuring that the networks produce high energy (i.e., dissimilar embedding vectors) when x and y are different images. Without a specific mechanism to enforce this, the two networks could trivially ignore their inputs and always produce identical output embeddings. This undesirable outcome is known as a “collapse.” When a collapse occurs, the energy for non-matching x and y pairs is not higher than for matching pairs, rendering the model ineffective.
Two primary categories of techniques exist to prevent collapse: contrastive methods and regularization methods.
Contrastive Energy-Based SSL
Contrastive methods are based on the straightforward concept of creating pairs of x and y that are intentionally incompatible and adjusting the model’s parameters to increase the energy for these incompatible pairs.
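One common way to express this, sketched below with toy embedding vectors standing in for encoder outputs, is a margin-based loss that lowers the energy of compatible pairs while pushing the energy of deliberately mismatched pairs above a margin; the function names and margin value are illustrative.

```python
# Sketch of a margin-based contrastive objective: pull down the energy of
# compatible pairs and push the energy of mismatched pairs above a margin.
import torch
import torch.nn.functional as F

def pair_energy(hx, hy):
    """Energy as squared distance between two embedding vectors."""
    return ((hx - hy) ** 2).sum(dim=1)

def contrastive_loss(hx, hy_pos, hy_neg, margin=1.0):
    e_pos = pair_energy(hx, hy_pos)   # compatible pair: energy should become low
    e_neg = pair_energy(hx, hy_neg)   # mismatched pair: energy pushed above the margin
    return (e_pos + F.relu(margin - e_neg)).mean()

# Toy embeddings standing in for encoder outputs.
hx = torch.randn(8, 128)
hy_pos = hx + 0.1 * torch.randn_like(hx)   # embedding of a distorted view of the same image
hy_neg = hy_pos[torch.randperm(8)]         # embedding of a different, mismatched image
print(contrastive_loss(hx, hy_pos, hy_neg))
```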
The technique of masking or substituting words in NLP systems falls under the umbrella of contrastive methods, although it typically employs a predictive architecture rather than a joint embedding architecture. In this predictive approach, the model directly generates a prediction for y. Starting with a complete text segment y, it is corrupted, for instance, by masking some words to generate the observed input x. This corrupted input is fed into a large neural network trained to reconstruct the original, uncorrupted text y. An uncorrupted input is reconstructed essentially as itself, giving a low reconstruction error, while a corrupted input is reconstructed as its uncorrupted version, which differs from the corrupted input and therefore yields a larger reconstruction error. If we interpret this reconstruction error as energy, it aligns with the desired property: low energy for “clean” text and higher energy for “corrupted” text.
This general approach of training a model to restore a corrupted input is known as a denoising auto-encoder. While early forms of this idea emerged in the 1980s, it was revitalized in 2008 by Pascal Vincent and colleagues at the University of Montréal, introduced to NLP by Collobert and Weston, and popularized by the BERT paper.
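A minimal sketch of a denoising auto-encoder, assuming flattened image vectors and an arbitrary small architecture, looks like this; reconstruction error plays the role of the energy described above.

```python
# Minimal denoising auto-encoder sketch: corrupt the input with noise and train
# the network to reconstruct the original; reconstruction error acts as the energy.
import torch
import torch.nn as nn

autoencoder = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),   # encoder
    nn.Linear(128, 784),              # decoder
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

y = torch.rand(32, 784)              # clean inputs (e.g., flattened images)
x = y + 0.3 * torch.randn_like(y)    # corrupted versions of the same inputs

reconstruction = autoencoder(x)
loss = nn.functional.mse_loss(reconstruction, y)  # restore the uncorrupted input
loss.backward()
optimizer.step()
```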
As previously mentioned, predictive architectures of this type typically produce a single prediction for a given input. To handle situations with multiple plausible outcomes, the prediction is not a single choice of words but rather a score for every word in the vocabulary at each missing word position.
However, directly applying this trick to images is problematic because we cannot enumerate all possible images. Is there a solution? While a definitive solution remains elusive, latent-variable predictive architectures offer a promising direction.
Latent-variable predictive models incorporate an additional input variable (z), termed “latent” because its value is never directly observed during training. In a well-trained model, as the latent variable z varies across a defined set, the output prediction varies across the set of plausible predictions compatible with the input x.
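The toy sketch below, with invented dimensions and module names, shows the basic shape of such a model: the prediction depends on both the observed input x and the latent variable z, so sweeping z over a set of values sweeps the output over a set of plausible predictions for the same x.

```python
# Toy latent-variable predictive model: the output depends on both the observed
# input x and a latent variable z, so varying z yields different plausible predictions.
import torch
import torch.nn as nn

class LatentPredictor(nn.Module):
    def __init__(self, x_dim=64, z_dim=8, y_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, 128), nn.ReLU(),
            nn.Linear(128, y_dim),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1))

model = LatentPredictor()
x = torch.randn(1, 64)            # one observed input
for _ in range(3):                # three draws of the latent variable ...
    z = torch.randn(1, 8)
    y_hat = model(x, z)           # ... give three different candidate predictions
    print(y_hat[0, :4])
```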
Latent-variable models can be trained using contrastive methods. A prominent example is the Generative Adversarial Network (GAN). The discriminator (or critic) in a GAN can be interpreted as computing an energy that indicates how realistic an input y appears. The generator network is trained to generate “contrastive” samples that the discriminator is trained to assign high energy to.
However, contrastive methods suffer from a significant drawback: training inefficiency. In high-dimensional spaces like images, there are countless ways images can differ. Discovering a set of contrastive images that adequately captures all these differences is a computationally intensive and nearly impossible task. To paraphrase Tolstoy’s Anna Karenina, “Happy families are all alike; every unhappy family is unhappy in its own way.” This principle seems to apply to families of high-dimensional objects as well.
Is it possible to ensure that incompatible pairs have higher energy than compatible pairs without explicitly increasing the energy of numerous incompatible pairs?
Non-Contrastive Energy-Based SSL
Non-contrastive methods applied to joint embedding architectures represent perhaps the most active and promising area of research in SSL for computer vision today. This domain is still largely uncharted, but initial results are highly encouraging.
Non-contrastive methods for joint embedding encompass techniques like DeeperCluster, ClusterFit, MoCo-v2, SwAV, SimSiam, Barlow Twins, BYOL from DeepMind, and others. These methods employ various strategies, such as computing virtual target embeddings for groups of similar images (DeeperCluster, SwAV, SimSiam) or introducing subtle differences between the two joint embedding architectures through architectural variations or parameter vectors (BYOL, MoCo). Barlow Twins focuses on minimizing redundancy between individual components of the embedding vectors.
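As one concrete instance, the sketch below follows the redundancy-reduction objective described in the Barlow Twins paper: the cross-correlation matrix between the embeddings of two distorted views is pushed toward the identity, so corresponding dimensions agree while different dimensions are decorrelated. The normalization and the off-diagonal weight are simplified here for illustration.

```python
# Sketch of the Barlow Twins redundancy-reduction objective: drive the
# cross-correlation matrix between two views' embeddings toward the identity.
import torch

def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3):
    n, d = z_a.shape
    # Normalize each embedding dimension across the batch.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)

    c = (z_a.T @ z_b) / n                                        # d x d cross-correlation
    on_diag = ((torch.diagonal(c) - 1) ** 2).sum()               # invariance term
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()   # redundancy-reduction term
    return on_diag + lambda_offdiag * off_diag

z_a = torch.randn(256, 128)   # embeddings of view A (stand-ins for encoder outputs)
z_b = torch.randn(256, 128)   # embeddings of view B
print(barlow_twins_loss(z_a, z_b))
```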
In the long term, developing non-contrastive methods for latent-variable predictive models may prove to be a more effective alternative. The primary obstacle lies in minimizing the capacity of the latent variable. The range over which the latent variable can vary directly influences the range of outputs that can achieve low energy. By constraining this range, we can automatically shape the energy function in the desired way.
A successful example of such a method is the Variational Auto-Encoder (VAE), where the latent variable is made “fuzzy” to limit its capacity. However, VAEs have not yet demonstrated the ability to produce representations that are sufficiently effective for downstream visual tasks. Sparse modeling offers another example, but its application has been limited to simpler architectures. Currently, a perfect recipe for limiting the capacity of latent variables remains elusive.
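A minimal VAE-style sketch of this capacity-limiting idea is shown below, with invented layer sizes: the latent variable is sampled with noise (the reparameterization trick), and a KL penalty keeps it close to a standard normal prior, bounding how much information it can carry about the input.

```python
# Minimal VAE-style sketch: the latent variable is made "fuzzy" by sampling it
# with noise, and a KL penalty toward a standard normal prior limits its capacity.
import torch
import torch.nn as nn

encoder = nn.Linear(784, 2 * 16)     # outputs the mean and log-variance of z
decoder = nn.Linear(16, 784)

y = torch.rand(32, 784)              # inputs to reconstruct
mu, log_var = encoder(y).chunk(2, dim=-1)

z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterization trick
reconstruction = decoder(z)

recon_loss = nn.functional.mse_loss(reconstruction, y)
kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1).mean()
loss = recon_loss + kl               # the KL term penalizes precise, information-rich latents
loss.backward()
```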
The challenge for the coming years is to devise non-contrastive methods for latent-variable energy-based models that can successfully generate high-quality representations for images, videos, speech, and other signals, and achieve top-tier performance in downstream supervised tasks, all while minimizing the reliance on large amounts of labeled data.
Advancing Self-Supervised Learning for Computer Vision
Recently, we created and open-sourced SEER, a novel billion-parameter self-supervised CV model that learns efficiently from complex, high-dimensional image data. SEER is built upon the SwAV method applied to a convolutional network architecture (ConvNet) and can be trained using vast quantities of random images without any metadata or annotations. The ConvNet’s large capacity enables it to capture and learn a wide range of visual concepts from this extensive and complex data. After pre-training on a billion random, unlabeled, and uncurated public Instagram images and subsequent supervised fine-tuning on ImageNet, SEER surpassed the best existing self-supervised systems, achieving 84.2 percent top-1 accuracy on ImageNet.
These results convincingly demonstrate that self-supervised learning is ushering in a paradigm shift in computer vision.
Self-Supervised Learning in Action at Facebook
At Facebook, we are not only pushing the boundaries of self-supervised learning techniques across diverse domains through fundamental, open scientific research, but we are also actively deploying this cutting-edge work in production to rapidly enhance the accuracy of content understanding systems that help maintain safety and security on our platforms.
Self-supervision research, exemplified by our pre-trained language model XLM, is accelerating critical applications at Facebook today – including the proactive detection of hate speech. Furthermore, we have deployed XLM-R, a model leveraging our RoBERTa architecture, to improve our hate speech classifiers across multiple languages on Facebook and Instagram. This advancement enables hate speech detection even in languages with limited training data.
We are encouraged by the significant strides made in self-supervision in recent years, although much progress is still needed to fully unlock the “dark matter” of AI intelligence. Self-supervision represents a crucial step toward achieving human-level intelligence, but it is undoubtedly one of many steps on a long journey. Long-term progress will be built incrementally. This is why we are committed to collaborating with the broader AI community to realize our shared goal of building machines with human-level intelligence. Our research is publicly available and published at leading conferences. We have also organized workshops and released open-source libraries to accelerate research in this vital area.