
Deep Learning for AI: Advancements, Challenges, and Future Directions


Figure: the architecture of a deep neural network, showing an input layer, multiple hidden layers, and an output layer, illustrating the hierarchical extraction of increasingly abstract features.

Artificial neural networks, inspired by the human brain, operate on the principle that intelligence arises from interconnected networks of simple, non-linear neurons. These neurons learn by adjusting the strength of their connections. A key question in computational science is how these networks can learn complex internal representations necessary for sophisticated tasks like object recognition or language understanding. Deep learning seeks to answer this question by employing multiple layers of activity vectors as representations. It learns the connection strengths that produce these vectors by following the stochastic gradient of an objective function, which assesses the network’s performance. The effectiveness of this conceptually straightforward approach, especially when applied to large datasets with substantial computational power, is remarkable. Depth appears to be a critical factor, as shallow networks often underperform compared to their deeper counterparts in complex AI tasks.
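
To make the recipe concrete, here is a minimal sketch (using PyTorch, and not taken from the article) of the idea described above: a network of several non-linear layers whose connection strengths are adjusted by following the stochastic gradient of an objective function. The dataset, layer sizes, and learning rate are purely illustrative.

```python
# A minimal sketch of training a deep network by stochastic gradient descent.
# Synthetic data, sizes, and the learning rate are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1024, 20)                     # synthetic inputs
y = (X.sum(dim=1) > 0).long()                 # synthetic binary labels

model = nn.Sequential(                        # several layers of non-linear units
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),
)
objective = nn.CrossEntropyLoss()             # measures how well the network performs
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(10):
    for i in range(0, len(X), 32):            # mini-batches give stochastic gradients
        xb, yb = X[i:i+32], y[i:i+32]
        loss = objective(model(xb), yb)
        optimizer.zero_grad()
        loss.backward()                       # gradient of the objective w.r.t. the weights
        optimizer.step()                      # adjust connection strengths
```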

Several years ago, we provided an overview of the fundamental concepts and significant achievements of deep learning. Here, we revisit the origins of deep learning, highlight recent progress, and discuss future challenges in the field. These challenges include learning with minimal or no external supervision, adapting to test examples from different distributions than training data, and applying deep learning to tasks that require deliberate, step-by-step human thought – what Daniel Kahneman refers to as “system 2” tasks. These contrast with “system 1” tasks like object recognition or immediate natural language understanding, which feel effortless and intuitive, and are areas where deep learning for AI has already shown great promise.

From Hand-Coded Symbolic Expressions to Learned Distributed Representations

There are two distinct paradigms in the realm of Artificial Intelligence. The logic-inspired paradigm emphasizes sequential reasoning as the core of intelligence. It aims to implement reasoning in computers using manually designed inference rules operating on symbolic expressions that formalize knowledge. Conversely, the brain-inspired paradigm considers learning representations from data as the essence of intelligence. This approach focuses on implementing learning by manually designing or evolving rules that modify connection strengths within simulated networks of artificial neurons, which is the foundation for deep learning for AI.

In the logic-inspired paradigm, a symbol’s meaning is derived from its relationships with other symbols, represented by symbolic expressions or relational graphs. In contrast, the brain-inspired paradigm converts external symbols into internal vectors of neural activity, which possess a rich similarity structure. These activity vectors can model the inherent structure within symbol strings. This is achieved by learning appropriate activity vectors for each symbol and non-linear transformations that allow for the completion of missing elements in a symbol string. This was initially demonstrated by Rumelhart et al. using toy data and later by Bengio et al. with real sentences. A notable recent example is BERT, which leverages self-attention to dynamically connect groups of units, as elaborated later.
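
As a toy illustration of the idea above (in the spirit of the neural language models cited, not the authors' code): each symbol is given a learned activity vector, and a non-linear transformation predicts a missing symbol from its context. The vocabulary size, dimensions, and model shape below are invented for illustration.

```python
# Sketch: learned activity vectors per symbol plus a non-linear transformation
# that completes a missing element of a symbol string. Sizes are illustrative.
import torch
import torch.nn as nn

vocab_size, dim, context = 1000, 64, 3

class SymbolCompleter(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)     # learned vector for each symbol
        self.mlp = nn.Sequential(
            nn.Linear(context * dim, 128), nn.Tanh(),
            nn.Linear(128, vocab_size),                # scores for the missing symbol
        )

    def forward(self, context_ids):                    # (batch, context) integer ids
        vectors = self.embed(context_ids)              # (batch, context, dim)
        return self.mlp(vectors.flatten(start_dim=1))  # logits over the vocabulary

model = SymbolCompleter()
logits = model(torch.randint(0, vocab_size, (8, context)))
print(logits.shape)   # torch.Size([8, 1000])
```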

The primary advantage of using neural activity vectors to represent concepts and weight matrices to capture relationships is the automatic generalization they facilitate. If concepts like “Tuesday” and “Thursday” are represented by similar vectors, they will have comparable causal effects on other neural activity vectors. This similarity enables analogical reasoning, suggesting that intuitive analogical reasoning is our primary mode of thought, with logical sequential reasoning being a later, more developed cognitive function. This has significant implications for how we approach deep learning for AI, especially in tasks requiring reasoning and generalization.

The Rise of Deep Learning

Deep learning revitalized neural network research in the early 2000s by introducing crucial elements that simplified the training of deeper networks. The advent of GPUs and the availability of large datasets were pivotal in enabling deep learning advancements. These were further amplified by the development of open-source, adaptable software platforms with automatic differentiation, such as Theano, Torch, Caffe, TensorFlow, and PyTorch. These platforms streamlined the training of complex deep networks and facilitated the reuse of cutting-edge models and their components. The increased depth of these networks allowed for more complex non-linearities, leading to surprisingly effective outcomes in perception tasks, a key area for deep learning for AI applications.

Why Depth? The idea that deeper neural networks could be more powerful predates modern deep learning techniques. However, a series of advancements in architecture and training procedures ushered in the remarkable progress associated with the rise of deep learning. The power of depth isn’t merely about increasing the number of parameters. Deep networks often generalize better than shallow networks with a comparable number of parameters. Practical applications confirm this; for example, ResNet-50, a popular convolutional network architecture for computer vision, features 50 layers. Other beneficial components include image deformations, dropout, and batch normalization.
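
Two of the ingredients mentioned above, residual connections and batch normalization, can be sketched in a few lines. The block below is a simplified illustration, not the exact ResNet-50 block.

```python
# An illustrative residual block: a skip connection and batch normalization,
# two of the components that make very deep networks trainable.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # skip connection: the block learns a residual

block = ResidualBlock(16)
print(block(torch.randn(1, 16, 32, 32)).shape)   # torch.Size([1, 16, 32, 32])
```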

The effectiveness of deep networks is attributed to their ability to exploit compositionality. Features from one layer are combined in various ways to create more abstract features in subsequent layers.

This type of compositionality is particularly effective for perception tasks and aligns with evidence suggesting its use in biological perceptual systems. This inherent hierarchical feature learning is a cornerstone of deep learning for AI in areas like image and speech recognition.

Unsupervised Pre-training. When labeled training examples are scarce relative to the complexity of the neural network, leveraging alternative information sources to pre-train feature detectors becomes crucial. This involves creating layers of feature detectors before fine-tuning them with limited labeled data. In transfer learning, this source is another supervised learning task with abundant labels. However, it’s also possible to pre-train feature detectors without labels by stacking auto-encoders.

This process begins by learning a layer of feature detectors that reconstruct the input. Then, a second layer is trained to reconstruct the activities of the first layer, and so on for several hidden layers. After this unsupervised pre-training, the network is used to predict labels from the activities in the final hidden layer, and errors are backpropagated through all layers to fine-tune the initially discovered feature detectors. While pre-training might capture irrelevant structures, it transforms the input into a representation that simplifies classification, which is beneficial when computation is inexpensive and labeled data is costly.
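
A condensed sketch of this greedy layer-wise procedure is shown below (illustrative sizes and training lengths, not the original recipe's exact hyper-parameters): each layer first learns to reconstruct the output of the layer below, and the stack is then fine-tuned with the scarce labeled data.

```python
# Sketch of stacked auto-encoder pre-training followed by supervised fine-tuning.
import torch
import torch.nn as nn

def pretrain_layer(data, in_dim, hidden_dim, epochs=5):
    encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
    decoder = nn.Linear(hidden_dim, in_dim)
    opt = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=0.1)
    for _ in range(epochs):
        recon = decoder(encoder(data))
        loss = nn.functional.mse_loss(recon, data)   # reconstruct the layer below
        opt.zero_grad(); loss.backward(); opt.step()
    return encoder

X = torch.rand(512, 100)                             # unlabeled data
enc1 = pretrain_layer(X, 100, 64)                    # first layer of feature detectors
h1 = enc1(X).detach()
enc2 = pretrain_layer(h1, 64, 32)                    # second layer, trained on the first layer's codes

# Fine-tuning: stack the pre-trained encoders, add a classifier on top, and
# backpropagate through the whole network using the limited labeled data.
classifier = nn.Sequential(enc1, enc2, nn.Linear(32, 10))
```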

Unsupervised pre-training not only enhances generalization but also initializes weights in a way that facilitates fine-tuning deep neural networks with backpropagation. Historically, pre-training was vital for overcoming the perception that deep networks were difficult to train. However, the advent of rectified linear units (ReLUs) and residual connections has lessened its importance in optimization. Nevertheless, pre-training’s impact on generalization remains significant. It enables the training of very large models using vast amounts of unlabeled data, particularly in natural language processing, where large corpora are readily available. The principle of pre-training and fine-tuning has become an essential tool in deep learning for AI, especially in transfer learning and meta-learning scenarios.

The Mysterious Success of Rectified Linear Units. Early deep networks often used logistic sigmoid or hyperbolic tangent non-linearities, often with unsupervised pre-training. While rectified linear units (ReLUs) had been hypothesized in neuroscience and used in some RBM and convolutional neural network variants, their true potential was unexpectedly revealed. It was discovered that rectifying non-linearities (ReLUs and their variants) simplified the training of deep networks using backpropagation and stochastic gradient descent, eliminating the need for layer-wise pre-training. This technical advancement was instrumental in deep learning’s superior performance in object recognition compared to previous methods, marking a significant milestone in deep learning for AI.
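
The unit itself is almost trivially simple, which makes its effect on training all the more striking. A minimal illustration (toy input, PyTorch for convenience) of the rectified linear unit and its non-saturating gradient for positive inputs:

```python
# ReLU(x) = max(0, x): the gradient is 1 wherever the unit is active,
# so it does not saturate for positive inputs the way a sigmoid does.
import torch

x = torch.linspace(-3, 3, 7, requires_grad=True)
relu_out = torch.relu(x)
relu_out.sum().backward()
print(relu_out.detach())   # tensor([0., 0., 0., 0., 1., 2., 3.])
print(x.grad)              # tensor([0., 0., 0., 0., 1., 1., 1.])
```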

Breakthroughs in Speech and Object Recognition. An acoustic model converts sound waves into probability distributions over phoneme fragments. Early efforts demonstrated the potential of neural networks for acoustic modeling, but it was in 2009 that pre-trained deep neural networks, utilizing GPUs, slightly outperformed state-of-the-art methods on the TIMIT dataset. This result renewed interest in neural networks among leading speech research groups. By 2010, deep networks had been shown to surpass state-of-the-art large-vocabulary speech recognition systems without requiring speaker-dependent training. Google’s subsequent deployment of a production version in 2012 significantly improved voice search on Android, showcasing the disruptive power of deep learning for AI in real-world applications.

Around the same time, deep learning achieved a dramatic victory in the 2012 ImageNet competition, nearly halving the error rate for recognizing a thousand object classes in natural images. Key factors in this breakthrough were the large-scale labeled image dataset collected by Fei-Fei Li and collaborators and Alex Krizhevsky’s efficient use of multiple GPUs. Modern hardware, including GPUs, favors large mini-batches to amortize memory access costs across many weight uses. However, pure online stochastic gradient descent, which uses each weight once, can converge faster, suggesting potential future hardware optimizations.

The winning deep convolutional neural network incorporated innovations like ReLUs for faster learning and dropout for overfitting prevention. However, it was fundamentally a feed-forward convolutional neural network, building on years of development by Yann LeCun and collaborators. The computer vision community’s response to this breakthrough was transformative. Recognizing the clear superiority of convolutional neural networks, the community rapidly shifted from hand-engineered approaches to deep learning, marking a paradigm shift in computer vision and deep learning for AI.

Recent Advances

Here, we selectively explore recent advances in deep learning, acknowledging that we are omitting many important areas like deep reinforcement learning, graph neural networks, and meta-learning.

Soft Attention and the Transformer Architecture. A significant development in deep learning, especially for sequential processing, is the use of multiplicative interactions, particularly soft attention. This addition to the neural network toolbox is transformative, shifting neural networks from purely vector transformation machines to architectures that can dynamically select inputs and store information in differentiable associative memories. These architectures can effectively operate on various data structures, including sets and graphs, expanding the applicability of deep learning for AI.

Soft attention allows modules within a layer to dynamically choose which vectors from the previous layer to combine for output computation. This enables outputs to be independent of input order (treating inputs as sets) or to utilize relationships between inputs (treating them as graphs).

The transformer architecture, now dominant in many applications, stacks multiple layers of “self-attention” modules. In each module, scalar products compute the match between a query vector and the key vectors of other modules in the same layer. These matches are normalized to sum to 1, and the resulting coefficients weight a convex combination of the value vectors produced by the previous layer’s modules. The resulting vector becomes the input for the next stage. Modules can be multi-headed: each head computes its own query, key, and value vectors, so a module can draw on several inputs, each selected differently from the previous stage. The order and number of modules are flexible, enabling operations on sets of vectors rather than on single vectors, as in traditional neural networks. For instance, in language translation, a system can focus on the corresponding words in the input sentence when producing an output word, regardless of their position. While multiplicative gating is not new, its recent form in attention mechanisms has become mainstream. Attention mechanisms can also dynamically route information through selected modules and combine them in novel ways, improving out-of-distribution generalization, a crucial aspect of robust deep learning for AI.
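
A minimal sketch of that self-attention computation (single head, illustrative dimensions, not a full transformer layer): scalar products between a query and all keys are normalized to sum to 1, and the resulting coefficients form a convex combination of the value vectors.

```python
# Single-head self-attention over a set of activity vectors.
import math
import torch
import torch.nn as nn

dim, n_tokens = 64, 10
to_q, to_k, to_v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)

x = torch.randn(n_tokens, dim)                  # activity vectors from the previous layer
q, k, v = to_q(x), to_k(x), to_v(x)

scores = q @ k.t() / math.sqrt(dim)             # match between each query and every key
weights = torch.softmax(scores, dim=-1)         # normalized: each row sums to 1
output = weights @ v                            # convex combination of value vectors
print(output.shape)                             # torch.Size([10, 64])
```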

We believe that deep networks excel because they exploit a particular form of compositionality in which features in one layer are combined in many different ways to create more abstract features in the next layer.

Transformers have dramatically improved performance, revolutionizing natural language processing and are now widely used in industry. These systems are pre-trained in a self-supervised manner to predict missing words in text segments, demonstrating the power of self-supervision in deep learning for AI.

Surprisingly, transformers have also been used to symbolically solve integral and differential equations. A promising trend uses transformers atop convolutional networks for state-of-the-art object detection and localization in images. The transformer performs differentiable post-processing and object-based reasoning, enabling end-to-end system training, highlighting the versatility of transformers in deep learning for AI across different domains.

Unsupervised and Self-Supervised Learning. Supervised learning, while successful, typically demands large amounts of human-labeled data. Similarly, reward-based reinforcement learning requires extensive interactions. These methods often yield task-specific, brittle systems that struggle outside their training domain. Reducing the need for labeled samples or interactions and enhancing out-of-domain robustness is vital for applications like low-resource language translation, medical image analysis, autonomous driving, and content filtering.

Humans and animals learn vast amounts of background knowledge about the world primarily through observation, in a task-independent manner. This knowledge underpins common sense and enables rapid learning of complex tasks, like driving. A key question for the future of AI is understanding how humans learn so much from observation alone, and how deep learning for AI can replicate this.

A key question for the future of AI is how do humans learn so much from observation alone?

In supervised learning, a label from N categories conveys at most log₂(N) bits of information. Reward-based reinforcement learning similarly conveys limited information. In contrast, audio, images, and video are high-bandwidth modalities implicitly conveying rich information about world structure. This motivates self-supervised learning, which involves prediction or reconstruction to “fill in the blanks” by predicting masked or corrupted data portions. Self-supervised learning has been successful in training transformers to extract context-dependent word meaning vectors, which are highly effective for downstream tasks, showcasing its importance in advancing deep learning for AI.
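
The “fill in the blanks” objective can be sketched schematically as follows (this is not any particular model's training code; the token ids, mask id, masking rate, and encoder sizes are invented for illustration): a fraction of the tokens is corrupted, and the network is trained to predict the original tokens at exactly those positions.

```python
# Schematic masked-prediction (self-supervised) objective for text.
import torch
import torch.nn as nn

vocab_size, dim, mask_id = 1000, 64, 0

embed = nn.Embedding(vocab_size, dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2)
to_vocab = nn.Linear(dim, vocab_size)

tokens = torch.randint(1, vocab_size, (8, 16))        # a batch of token sequences
masked = tokens.clone()
mask = torch.rand(tokens.shape) < 0.15                # corrupt roughly 15% of positions
masked[mask] = mask_id

logits = to_vocab(encoder(embed(masked)))             # a prediction for every position
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])   # score only the masked positions
loss.backward()
```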

For text, transformers predict missing words from a discrete set. However, in high-dimensional continuous domains like video, the set of plausible continuations is vast and complex. Representing the distribution of plausible continuations remains a significant challenge in deep learning for AI.

Contrastive Learning. One approach to tackle this is through latent variable models that assign an energy (a measure of incompatibility) to video examples and potential continuations.

Given an input video X and a continuation Y, we aim for a model to indicate Y's compatibility with X using an energy function E(X, Y). Low values indicate compatibility, high values indicate incompatibility.

E(X, Y) can be computed by a deep neural network trained contrastively. For a given X, it learns to assign low energy to Y values compatible with X (e.g., (X, Y) pairs from training data) and high energy to incompatible Y values. Inference for a given X involves finding or sampling Y values that minimize E(X, Y). This energy-based approach to representing the dependence of Y on X allows modeling of diverse, multi-modal plausible continuations, a key capability for deep learning for AI in complex, real-world scenarios.
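
A schematic sketch of such contrastive training is shown below. The margin-based loss and the synthetic stand-ins for X, compatible continuations, and negatives are illustrative choices, not something prescribed by the article.

```python
# Contrastive training of an energy function E(X, Y): push energy down on
# compatible pairs and up (to a margin) on incompatible pairs.
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    def __init__(self, x_dim=32, y_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim + y_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)   # scalar energy E(X, Y)

energy = EnergyNet()
opt = torch.optim.Adam(energy.parameters(), lr=1e-3)
margin = 1.0

x = torch.randn(64, 32)                       # stand-in for observed inputs X
y_pos = x + 0.1 * torch.randn(64, 32)         # compatible continuations (from the data)
y_neg = torch.randn(64, 32)                   # "negative" continuations

e_pos = energy(x, y_pos)
e_neg = energy(x, y_neg)
loss = e_pos.mean() + torch.relu(margin - e_neg).mean()   # lower E(X, Y+), raise E(X, Y-)
opt.zero_grad(); loss.backward(); opt.step()
```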

Figure: a video clip X with two possible continuations Y; the compatible continuation aligns with the clip's context while the incompatible one does not, illustrating the role of the energy function in contrastive learning.

The main challenge in contrastive learning is selecting good “negative” samples – Y values whose energy should be increased. When the set of negative examples is small, all can be considered, as with softmax. In this case, contrastive learning resembles standard supervised or self-supervised learning over a finite discrete set of symbols. However, in high-dimensional real-valued spaces, there are numerous ways a vector could differ from Y. Model improvement requires focusing on Y values that should have high energy but currently have low energy. Early methods for negative sample selection used Monte-Carlo methods, such as contrastive divergence for restricted Boltzmann machines and noise-contrastive estimation.

Generative Adversarial Networks (GANs) train a generative neural network to produce contrastive samples by applying a neural network to latent samples from a known distribution (e.g., Gaussian). The generator learns to produce outputs to which the model assigns low energy. It does this using backpropagation to get the gradient of the energy function with respect to its output. The generator and model are trained simultaneously: the model aims to give low energy to training samples and high energy to generated contrastive samples.
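
A compact sketch of this idea in the energy framing is given below. The network sizes and the exact losses are illustrative assumptions rather than a reference GAN implementation: the generator maps Gaussian latents to samples the model currently assigns low energy, while the model learns to raise the energy of those generated samples and lower it on real data.

```python
# GAN-style contrastive sample generation, sketched in the energy framing.
import torch
import torch.nn as nn

data_dim, latent_dim = 32, 16
model = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))      # energy E(Y)
generator = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))

opt_model = torch.optim.Adam(model.parameters(), lr=1e-3)
opt_gen = torch.optim.Adam(generator.parameters(), lr=1e-3)

real = torch.randn(64, data_dim)                     # stand-in for training data

# Model step: low energy on real samples, high energy on generated ones.
fake = generator(torch.randn(64, latent_dim)).detach()
model_loss = model(real).mean() - model(fake).mean()
opt_model.zero_grad(); model_loss.backward(); opt_model.step()

# Generator step: the gradient of the energy flows back through the generated
# samples, teaching the generator to produce outputs with low energy.
fake = generator(torch.randn(64, latent_dim))
gen_loss = model(fake).mean()
opt_gen.zero_grad(); gen_loss.backward(); opt_gen.step()
```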

Figure: contrastive learning with negative samples, showing an input X, a compatible sample Y+ (low energy), and an incompatible sample Y- whose energy should be pushed up.

While GANs can be tricky to optimize, adversarial training ideas have been highly productive, yielding impressive results in image synthesis and opening new applications in content creation, domain adaptation, and style transfer. These advancements highlight the broad impact of contrastive learning in deep learning for AI.

Making Representations Agree Using Contrastive Learning. Contrastive learning offers a way to discover effective feature vectors without pixel reconstruction or generation. The core idea is to train a feed-forward neural network to produce highly similar output vectors for different crops of the same image or different views of the same object, but dissimilar vectors for crops from different images or views of different objects. The squared distance between output vectors serves as an energy function, lowered for compatible pairs and raised for incompatible pairs.

Recent research using convolutional networks to extract agreeing representations has shown promising results in visual feature learning. Positive pairs consist of different distorted versions of the same image (cropping, scaling, rotation, color shifts, blurring, etc.). Negative pairs are similarly distorted versions of different images, possibly selected via hard negative mining or simply all distorted versions of other images in a minibatch. The hidden activity vector of a higher-level network layer is then used as input to a supervised linear classifier. This Siamese network approach has achieved excellent results on standard image recognition benchmarks. Recently, two Siamese network approaches, SwAV and BYOL, have eliminated the need for contrastive samples. SwAV quantizes one network’s output to train the other, while BYOL smooths the weight trajectory of one network, apparently preventing collapse. These methods represent significant progress in self-supervised representation learning within deep learning for AI.
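
A schematic Siamese-network sketch of this idea follows: one encoder processes two distorted views of each image, the squared distance between output vectors acts as the energy, and it is lowered for views of the same image and raised (up to a margin) for views of different images. The encoder, the noise-based augmentation, and the loss below are simplified placeholders for the methods cited above.

```python
# Siamese contrastive representation learning with a squared-distance energy.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 256), nn.ReLU(),
                        nn.Linear(256, 128))
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
margin = 1.0

def augment(images):
    # Placeholder for cropping / color shifts / blurring; here just additive noise.
    return images + 0.1 * torch.randn_like(images)

images = torch.rand(64, 3, 32, 32)
z1, z2 = encoder(augment(images)), encoder(augment(images))      # two views of the same images
z_other = encoder(augment(images.roll(1, dims=0)))               # views of different images

pos_energy = ((z1 - z2) ** 2).sum(dim=1)                         # pull representations together
neg_energy = ((z1 - z_other) ** 2).sum(dim=1)                    # push representations apart
loss = pos_energy.mean() + torch.relu(margin - neg_energy).mean()
opt.zero_grad(); loss.backward(); opt.step()
```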

Variational Auto-Encoders. A popular self-supervised learning method is the Variational Auto-Encoder (VAE). It comprises an encoder network mapping images to a latent code space and a decoder network generating images from latent codes. VAEs limit latent code information capacity by adding Gaussian noise to the encoder output before passing it to the decoder. This is analogous to packing noisy spheres into a minimum-radius larger sphere. Information capacity is limited by the number of noisy spheres that fit. Noisy spheres repel each other because good reconstruction requires minimal overlap between codes for different samples. Mathematically, the system minimizes a free energy obtained by marginalizing the latent code over the noise distribution. However, minimizing this free energy directly is intractable, necessitating variational approximation methods from statistical physics to minimize an upper bound of the free energy. VAEs provide a powerful framework for probabilistic modeling and representation learning in deep learning for AI.
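
A compact sketch corresponding to that description is given below (the architecture, the Gaussian prior, and the mean-squared reconstruction term are illustrative choices): the encoder's output is perturbed with Gaussian noise before decoding, and training minimizes a variational bound that combines reconstruction error with a KL term limiting the information carried by the latent code.

```python
# Minimal variational auto-encoder training step.
import torch
import torch.nn as nn

x_dim, z_dim = 784, 20
enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, 2 * z_dim))
dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.rand(64, x_dim)                         # stand-in for a batch of images
mu, log_var = enc(x).chunk(2, dim=-1)             # mean and log-variance of the latent code
z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # add Gaussian noise to the code
recon = dec(z)

recon_loss = nn.functional.mse_loss(recon, x, reduction="sum") / x.size(0)
kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(dim=1).mean()
loss = recon_loss + kl                            # variational upper bound on the free energy
opt.zero_grad(); loss.backward(); opt.step()
```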

The Future of Deep Learning

The performance of deep learning systems often improves dramatically with scale. More data and computation generally lead to better results. GPT-3, a language model with 175 billion parameters, generates noticeably better text than GPT-2 with 1.5 billion parameters. Chatbots like Meena and BlenderBot also improve with size. Significant effort is directed towards scaling up, which will enhance existing systems. However, fundamental limitations of current deep learning cannot be overcome by scaling alone, pointing to future research directions in deep learning for AI.

Comparing human learning abilities with current AI highlights several areas for improvement:

  1. Supervised learning needs too much labeled data, and model-free reinforcement learning requires excessive trials. Humans generalize well with less experience.
  2. Current systems are less robust to distribution changes than humans, who adapt quickly with few examples.
  3. Deep learning excels at perception (system 1 tasks). Applying it to system 2 tasks requiring deliberate steps is a nascent but exciting area.

What Needs to Be Improved. Machine learning theory has long focused on the iid assumption – that test cases come from the same distribution as training examples. This assumption is often unrealistic in the real world, considering non-stationarities from agent actions or the expanding knowledge of a learning agent. Consequently, AI system performance often suffers when transitioning from lab to real-world deployment.

Achieving greater robustness to distribution changes (out-of-distribution generalization) is a specific instance of the broader goal of reducing sample complexity (examples needed for good generalization) when facing new tasks, as in transfer and lifelong learning, or distribution or state-reward relationship changes. Current supervised learning systems require far more examples than humans for new tasks, and model-free reinforcement learning is even more data-intensive, as each rewarded trial provides less information than a labeled example. Humans can generalize differently and more powerfully than iid generalization, interpreting novel concept combinations even if unlikely in the training distribution, as long as they respect learned high-level syntactic and semantic patterns. Recent studies clarify how different neural network architectures perform in systematic generalization. Designing future machine learning systems with enhanced out-of-distribution generalization and faster adaptation is a crucial direction for deep learning for AI.

From Homogeneous Layers to Groups of Neurons That Represent Entities. Neuroscience suggests that neuron groups (hyper-columns) are tightly connected and may represent higher-level vector-valued units, sending sets of coordinated values rather than just scalars. This idea underlies capsule architectures and is inherent in soft-attention mechanisms, where each set element is associated with a vector from which key and value (and sometimes query) vectors are derived. These vector-level units can represent object detection and attributes (like pose in capsules). Recent computer vision research explores convolutional network extensions where the top level represents detected candidate objects, and transformer-like architectures operate on these candidates. Neural networks assigning intrinsic frames of reference to objects and recognizing them via part-geometry relationships should be more resilient to adversarial attacks, which exploit the difference between human and neural network object recognition.

Multiple Time Scales of Adaptation. Most neural networks have two timescales: slow weight adaptation and rapid activity adaptation with each input. Adding rapidly adapting and decaying “fast weights” introduces new computational abilities, particularly high-capacity, short-term memory. This enables true recursion, where neurons can be reused in recursive calls because their activity vectors in higher-level calls can be reconstructed using fast weights. Multiple timescales also arise in meta-learning. Exploring and implementing multiple timescales is an important frontier in deep learning for AI, particularly for tasks requiring memory and hierarchical processing.

Higher-Level Cognition. When facing new challenges, like driving in unfamiliar traffic or imagining lunar driving, we leverage existing knowledge and skills, recombining them dynamically. This systematic generalization enables human adaptation to contexts unlikely in our training distribution. Practice then fine-tunes and compiles these skills, reducing conscious attention needed. How can neural networks gain this ability to adapt quickly to new settings by reusing existing knowledge and minimizing interference with known skills? Transformers and Recurrent Independent Mechanisms are initial steps in this direction.

Implicit (system 1) processing appears to guide search and planning at higher (system 2) levels, possibly akin to value functions guiding Monte-Carlo tree search in AlphaGo. How can system 1 networks guide system 2 search and planning?

Machine learning research relies on inductive biases or priors to encourage learning in directions aligned with world assumptions. System 2 processing and cognitive neuroscience theories suggest inductive biases and architectures that could be exploited in novel deep learning systems. How do we design deep learning architectures and training frameworks incorporating such inductive biases to enable more human-like reasoning and problem-solving in AI?

Young children’s ability to perform causal discovery suggests it’s a fundamental brain property. Recent work indicates that optimizing out-of-distribution generalization under interventional changes can train neural networks to discover causal dependencies or variables. How should we structure and train neural networks to capture these underlying causal properties of the world, enabling more robust and interpretable deep learning for AI?

How do these open questions relate to 20th-century symbolic AI? Symbolic AI aimed at system 2 abilities like reasoning, knowledge factorization for recombination in computational steps, and manipulation of abstract variables, types, and instances. We aim to design neural networks that achieve these while retaining deep learning’s strengths: efficient large-scale learning using differentiable computation and gradient-based adaptation, grounding high-level concepts in perception and action, handling uncertain data, and using distributed representations. The future of deep learning for AI lies in bridging the gap between connectionist and symbolic approaches to create more robust, adaptable, and intelligent systems.

Yoshua Bengio, Yann LeCun, and Geoffrey Hinton, the authors of the article.
