Deep Learning 2.0: Elevating Machine Learning to the Meta-Level

Deep Learning (DL) has fundamentally transformed how machines learn from raw data like images, text, and speech. Its power lies in replacing traditional, hand-engineered features with features that are jointly learned and optimized for specific tasks. Building upon this revolution, this article introduces Deep Learning 2.0 (DL 2.0), a paradigm shift that extends this joint learning principle to the meta-level. DL 2.0 aims to automate and optimize elements of the deep learning pipeline that are currently handcrafted, such as neural architectures, initializations, training procedures, hyperparameters, and self-supervised learning strategies.

Deep Learning 2.0 represents a significant leap forward because:

  • It mirrors the success of deep learning by substituting manual, handcrafted solutions with automated, learned approaches, but now at a higher, meta-level.
  • It enables a new level of automated customization tailored precisely to the nuances of each specific task.
  • It facilitates the automatic navigation of meta-level trade-offs to meet user-defined objectives, including algorithmic fairness, interpretability, uncertainty calibration, robustness, and energy efficiency. This capability paves the way for Trustworthy AI by design, setting a new benchmark for deep learning applications.

From Standard Deep Learning to Deep Learning 2.0

To understand the significance of Deep Learning 2.0, it’s crucial to revisit the factors that propelled deep learning to the forefront of machine learning. Before the deep learning revolution, the conventional approach involved domain experts meticulously crafting features relevant to the data. These engineered features then served as input for traditional machine learning algorithms like XGBoost, as illustrated in the top part of the figure below.

Feature engineering was often a laborious and iterative process, demanding considerable expertise and time. Deep learning emerged as a disruptive force by demonstrating the capability to learn features directly from raw data. This breakthrough effectively eliminated the need for manual feature engineering, streamlining the machine learning workflow and democratizing access to advanced AI capabilities (as depicted in the bottom part of the figure above).

This shift not only reduced manual effort but also yielded superior results by enabling end-to-end joint optimization of feature extraction and classification. This success story of deep learning underscores a recurring theme in the history of AI: manually designed components are consistently being superseded by automatically generated counterparts that deliver enhanced performance [Clune, 2019, Sutton, 2019].

Deep Learning 2.0 aims to replicate this transformative success at the meta-level. While deep learning eliminated manual feature engineering, it introduced a new bottleneck: manual architecture and hyperparameter engineering by deep learning specialists. This meta-level engineering, much like feature engineering before it, is a time-consuming, trial-and-error driven process. Deep Learning 2.0 proposes to bypass this manual intervention, mirroring the advantages that deep learning brought to traditional machine learning. The transition from “Deep Learning 1.0” to Deep Learning 2.0 is visualized in the figure below.

Deep Learning 2.0 can be implemented by integrating an AutoML (Automated Machine Learning) component that conducts meta-level learning and optimization on top of a standard deep learning system. However, the scope of Deep Learning 2.0 extends beyond mere automation. It also addresses the practical need to incorporate crucial additional objectives. The vision for Deep Learning 2.0 includes allowing domain experts to define their specific objectives for a given application. For instance, policymakers could specify the relevant algorithmic fairness criteria for a particular use case. A multi-objective AutoML component can then identify deep learning system configurations that represent a Pareto front of optimal solutions. This approach ensures that the system directly optimizes for user-defined objectives, fostering trustworthy AI by design. Furthermore, Deep Learning 2.0 is envisioned to empower deep learning experts by automating routine tasks while allowing for expert guidance, enhancing their productivity and efficiency. This collaborative approach is illustrated in the figure below.
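
As a purely illustrative sketch of this workflow, consider how a domain expert might declare objectives while a hypothetical multi-objective AutoML component searches a meta-level space of pipeline configurations. The Objective class, the search-space dictionary, and the multi_objective_search call below are made-up names used only to make the idea concrete, not an existing API.

```python
from dataclasses import dataclass

# Hypothetical objective declaration: the domain expert states *what* matters,
# not *how* the deep learning pipeline should be configured.
@dataclass
class Objective:
    name: str        # e.g. "validation_error" or "demographic_parity_gap"
    minimize: bool   # direction of optimization

objectives = [
    Objective("validation_error", minimize=True),
    Objective("demographic_parity_gap", minimize=True),   # fairness criterion chosen by policymakers
    Objective("energy_per_inference_joules", minimize=True),
]

# Hypothetical meta-level search space over the deep learning pipeline.
search_space = {
    "architecture":  ["resnet18", "resnet50", "vit_small"],
    "learning_rate": (1e-4, 1e-1),    # log-uniform range
    "weight_decay":  (1e-6, 1e-2),
    "augmentation":  ["none", "basic", "randaugment"],
}

# A multi-objective AutoML component (assumed here, not an existing API) would
# return a Pareto front of configurations for the stakeholders to choose from:
# pareto_front = multi_objective_search(search_space, objectives, budget_gpu_hours=100)
```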

The Three Pillars of Deep Learning 2.0

The realization of Deep Learning 2.0’s full potential hinges on the convergence of three key pillars:

Pillar 1: Joint Optimization of the Deep Learning Pipeline

The success of deep learning was fundamentally driven by the automatic learning of features from raw data. Deep learning models learn increasingly abstract representations of data through successive layers, with all components optimized jointly in an end-to-end manner. Consider computer vision as an example: before deep learning, researchers tackled feature learning in a fragmented approach, with separate efforts focused on edge detection, contour learning, object part recognition, and finally, object classification based on these parts. In stark contrast, deep learning optimizes all these stages concurrently by adjusting the weights of connections across all layers to minimize a unified loss function. This holistic optimization enables deep learning to, without explicit instructions, learn edge detectors in initial layers that are optimally suited for combination into contour detectors in subsequent layers, which are further refined to learn object parts, ultimately leading to robust object classification. This entire process is end-to-end and jointly orchestrated.
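
A minimal PyTorch sketch (with random tensors standing in for real image data) makes this concrete: a single loss, backpropagated through every layer, jointly adjusts the low-level, mid-level, and classification parameters in one step.

```python
import torch
import torch.nn as nn

# A tiny convolutional network: early layers tend to learn edge-like filters,
# later layers more abstract representations, all driven by one loss.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # low-level features
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # mid-level features
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),                                        # object classification
)

criterion = nn.CrossEntropyLoss()                             # one unified loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Stand-in data; in practice this would be a real image dataset.
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 10, (8,))

# A single end-to-end training step: gradients flow through *all* layers jointly.
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```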

Currently, at the meta-level of deep learning, we face a situation analogous to the pre-deep learning era at the base level. Researchers often work in silos, focusing on isolated components of the deep learning pipeline, such as architectures, optimizers, regularization techniques, self-supervised learning methods, and hyperparameters. Deep Learning 2.0 advocates for the joint optimization of these meta-level components to maximize the performance of the inner deep learning system.

Pillar 2: Efficiency of Meta-Optimization

Deep learning’s widespread adoption and impact are largely attributable to the efficiency of stochastic gradient descent (SGD) in optimizing models with millions or even billions of parameters. Similarly, the success of Deep Learning 2.0 will depend heavily on the efficiency of meta-level optimization processes. While hyperparameter sweeps and random search are commonly used, they are insufficient for effectively exploring the complex, joint search space of meta-level decisions.

The target is to develop meta-optimization methods that incur only a 3x-5x overhead compared to standard deep learning with fixed meta-level settings. More sophisticated techniques than grid search and random search are emerging, but significant research is still needed to achieve this efficiency goal. Promising approaches include:

  • Gradient-based meta-optimization: Leveraging gradients to optimize meta-parameters directly, offering a more efficient search than black-box methods.
  • Multi-fidelity optimization: Utilizing low-fidelity approximations (e.g., smaller datasets, simpler models) to rapidly evaluate and prune unpromising configurations before investing in expensive, high-fidelity evaluations; a bare-bones sketch follows this list.
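
To make the multi-fidelity idea concrete, here is a bare-bones successive-halving sketch in plain Python; evaluate is a placeholder for training a configuration at a given budget (e.g., a number of epochs or a dataset fraction), and the candidate configurations are drawn from a made-up search space.

```python
import random

def evaluate(config, budget):
    """Placeholder: train `config` for `budget` units (epochs, data fraction, ...)
    and return a validation error. Faked here with random noise."""
    return random.random() / budget   # stand-in for an actual training run

def successive_halving(configs, min_budget=1, max_budget=27, eta=3):
    """Evaluate many configurations at a cheap fidelity, keep the best 1/eta,
    and re-evaluate the survivors at an eta-times larger budget."""
    budget = min_budget
    while budget <= max_budget and len(configs) > 1:
        scored = sorted(configs, key=lambda c: evaluate(c, budget))   # lower error is better
        configs = scored[: max(1, len(scored) // eta)]
        budget *= eta
    return configs[0]

# Hypothetical candidates drawn from a simple hyperparameter search space.
candidates = [{"lr": 10 ** random.uniform(-4, -1),
               "weight_decay": 10 ** random.uniform(-6, -2)} for _ in range(27)]
print("selected configuration:", successive_halving(candidates))
```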

Furthermore, substantial improvements in tooling are essential. Analogous to the development of deep learning frameworks like TensorFlow and PyTorch, robust and user-friendly tools are needed to make sophisticated meta-level optimization accessible and commonplace for deep learning researchers.

Pillar 3: Direct Alignment with User Objectives through Multi-Objective Optimization

In contrast to standard deep learning, Deep Learning 2.0 aims to reintegrate the domain expert into the AI development process by enabling them to specify their objectives directly. The DL 2.0 system can then automatically optimize for these objectives and present a Pareto front of non-dominated solutions, allowing domain experts to select the best trade-off based on their priorities. These user objectives can encompass various aspects, such as algorithmic fairness, defined according to the specific context of the application. Recognizing that “fairness” is context-dependent, Deep Learning 2.0 allows for nuanced and application-specific fairness considerations. The responsibility of determining the best solution shifts from deep learning engineers to ethicists, policymakers, and other stakeholders who can define the objectives without needing deep technical expertise in AI. These stakeholders can then audit the proposed solutions and select the most appropriate one for the given situation. This direct alignment of optimization with user objectives is central to achieving Trustworthy AI by Design.
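
The Pareto front used here is easy to state in code: a configuration is kept if no other configuration is at least as good on every objective and strictly better on at least one. The sketch below, with illustrative objective names and values (all objectives minimized), filters a set of evaluated configurations down to exactly that non-dominated set.

```python
def dominates(a, b):
    """`a` dominates `b` if it is no worse on every objective and strictly
    better on at least one (all objectives are minimized here)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(evaluations):
    """Keep only the non-dominated (configuration, objective-vector) pairs."""
    return [(cfg, obj) for cfg, obj in evaluations
            if not any(dominates(other, obj) for _, other in evaluations)]

# Illustrative objective vectors: (validation error, fairness gap, energy in J).
evaluations = [
    ("config_A", (0.12, 0.08, 3.1)),
    ("config_B", (0.10, 0.15, 2.9)),   # best error, but the largest fairness gap
    ("config_C", (0.11, 0.09, 3.5)),   # a middle-ground trade-off
    ("config_D", (0.14, 0.20, 4.0)),   # dominated by config_A on every objective
]

for cfg, obj in pareto_front(evaluations):
    print(cfg, obj)   # stakeholders choose their preferred trade-off from this front
```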

To further enhance trust, Deep Learning 2.0 should also generate reports detailing the information gathered during the automated search process, explaining the impact of different choices, and providing insights into the model selection process. Unlike manual tuning, Deep Learning 2.0 provides reproducible results, and these reports enhance explainability and auditability. This transparency will bolster trust in Deep Learning 2.0 and facilitate compliance with potential future regulations mandating explanations of model selection processes.

Relationship to Other Trends in AI

Relationship to Foundation Models

Foundation models, large pretrained models adaptable to numerous downstream tasks, represent a significant trend in deep learning. This paradigm has been prevalent in computer vision with models pretrained on ImageNet for nearly a decade and has been amplified by generative pretraining in natural language processing, exemplified by models like GPT-3 [GPT-3]. The ease of accessing and fine-tuning pretrained models raises the question of how Deep Learning 2.0 relates to this trend.

The relationship is twofold. First, many aspects of utilizing pretrained models fall under the scope of Deep Learning 2.0. Whether fine-tuning or prompt tuning, numerous meta-level choices need to be made. Fine-tuning involves selecting the training pipeline, including optimizer, learning rate schedule, weight decay, data augmentation, and regularization hyperparameters. Similarly, prompt tuning introduces meta-level choices related to prompt length, initialization, and ensembling strategies.
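
To give a sense of how many such choices accumulate, here is a hypothetical fine-tuning search space written as a plain Python dictionary; the names and ranges are illustrative, not prescribed by any particular library or model.

```python
# Hypothetical meta-level search space for fine-tuning a pretrained model;
# each entry is a decision that is typically hand-picked today.
finetuning_search_space = {
    "optimizer":         ["sgd", "adamw"],
    "learning_rate":     (1e-5, 1e-2),      # log-uniform range
    "lr_schedule":       ["constant", "cosine", "linear_warmup_cosine"],
    "weight_decay":      (1e-6, 1e-1),
    "data_augmentation": ["none", "flip_crop", "randaugment", "mixup"],
    "dropout":           (0.0, 0.5),
    "layers_to_freeze":  [0, 2, 4, "all_but_head"],
    "epochs":            (1, 30),
}

# Prompt tuning introduces its own meta-level decisions (again, illustrative names).
prompt_tuning_search_space = {
    "prompt_length":          (1, 100),
    "prompt_initialization":  ["random", "sampled_vocab", "class_label_text"],
    "ensemble_size":          (1, 8),
}
```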

Second, and more importantly, the creation of high-quality pretrained models is itself a Deep Learning 2.0 problem. Developing effective pretrained models requires careful optimization of various components of the deep learning pipeline. This problem directly embodies the three pillars of DL 2.0:

  1. Joint optimization of the entire DL pipeline: Pretraining involves numerous interacting degrees of freedom, including neural architecture, optimization and regularization pipelines, self-supervised learning strategies, and their respective hyperparameters. Given the immense computational cost of training foundation models, comprehensive optimization of these components and their interactions remains largely unexplored, leaving significant performance gains untapped.
  2. Efficiency: With training costs reaching millions of dollars for individual foundation models, efficiency is paramount even for basic hyperparameter tuning, let alone joint optimization of multiple components. The high cost necessitates tuning hyperparameters on downscaled models or using cheaper training pipelines. Parameterizations that maintain optima across scales [parameterizations for stable optima], automated multi-fidelity optimization techniques, and gradient-based optimization offer promising avenues for improvement.
  3. Multi-objective optimization for trustworthy design: Foundation models are designed for broad applicability, necessitating multi-objective optimization to ensure performance across diverse downstream tasks. Furthermore, addressing bias, racism, and unfairness requires optimizing for multiple fairness metrics. Parameterizing the training data selection pipeline to balance accuracy with fairness is a crucial research direction for developing trustworthy foundation models.

Relationship to AI-Generating Agents

Jeff Clune’s work on AI-Generating Algorithms (AI-GAs) [Clune, 2019] has been a significant source of inspiration for Deep Learning 2.0. Clune argues that throughout AI history, manually designed components have been consistently replaced by automatically generated, better-performing alternatives. Deep Learning 2.0 aligns with this perspective, advocating for increased research focus on meta-level automation to maximize progress in AI. However, Deep Learning 2.0 and AI-GAs diverge in their emphasis on specific meta-level challenges. While both frameworks propose three pillars, their focus differs.

Clune’s AI-GA pillars prioritize individual components of the deep learning pipeline: (1) meta-learning architectures (neural architecture search – NAS) and (2) meta-learning the learning algorithms themselves (e.g., learned optimizers and learned weight initializations, as in MAML). These areas have become highly active research fields, as evidenced by the surge in NAS publications (over 1000 papers in the last two years, as shown below). While acknowledging the importance of these fields, Deep Learning 2.0 emphasizes their joint optimization and integration with other crucial aspects, such as SSL pipelines and hyperparameters, which are not explicitly addressed in Clune’s AI-GA framework [Clune, 2019]. This holistic joint optimization of the entire DL pipeline constitutes the First Pillar of Deep Learning 2.0.

The Third Pillar of AI-GAs focuses on automatically generating effective learning environments, categorized into target-task and open-ended approaches. The target-task approach, aiming to optimize for a specific objective function, shares some overlap with the joint optimization pillar of Deep Learning 2.0, particularly in the context of curriculum learning. The open-ended approach, exploring co-evolution of learning environments and agents, resembles Darwinian evolution. While intriguing, Deep Learning 2.0 adopts a more pragmatic and target-task-oriented approach. DL 2.0 prioritizes efficiency (Second Pillar, emphasizing sustainability) and achieving trustworthy AI by design through multi-objective AutoML (Third Pillar). These critical elements, less prominent in the AI-GA vision, deserve greater attention from the deep learning community.

Deep Learning 2.0 also strengthens the foundation for AI-GAs. The success of AI-GAs hinges on outperforming manual AI design. However, if AI-GAs require orders of magnitude more computation than single model runs (as early evolutionary NAS methods did), their competitiveness against human experts is questionable. Deep Learning 2.0, with its emphasis on efficiency (Second Pillar), aims to make AI-GAs significantly faster and a more viable alternative to manual design.

Relationship to Human-level AI

As Deep Learning 2.0 enhances the power of deep learning, its relationship to strong AI (artificial general intelligence – AGI) becomes relevant. While the ethical implications of pursuing strong AI are complex, achieving AGI likely requires AI systems capable of self-assessment and self-improvement. For deep learning-based AI, the multi-objective AutoML component of Deep Learning 2.0 provides a natural mechanism for self-improvement. In essence, “Strong AI is unlikely to emerge if humans must constantly fine-tune hyperparameters.”

Strong AI is also a stated goal of Clune’s AI-GAs [2019]. Deep Learning 2.0’s potential to accelerate AI-GAs and improve foundation models indirectly contributes to the pursuit of strong AI. The focus on user objectives and multi-objective optimization in DL 2.0 raises hopes that it can contribute to developing strong AI that aligns with and benefits human objectives.

A recurring question is whether meta-level learning should be extended further (meta-meta-level learning). Humans learn and “learn to learn,” but “learning to learn to learn” has not been necessary. Similarly, in AutoML, hyperparameter optimizers themselves tend to be robust under their default settings, so the meta-meta-level rarely needs tuning. While meta-meta-level optimization might have niche applications, it is unlikely to be broadly essential.

Deep Learning 2.0 is Long Underway

Deep Learning 2.0 builds upon a substantial body of existing research. Meta-level optimization, hyperparameter optimization, meta-learning, and NAS have been active areas for years. While joint optimization of multiple meta-level components (Pillar One) is less explored, early works like Auto-Meta and MetaNAS (jointly optimizing architecture and initial weights), Auto-PyTorch and AutoHAS (jointly optimizing architecture and hyperparameters), SEARL (jointly optimizing architecture and hyperparameters for RL agents), and regularization cocktails (jointly optimizing regularization choices) demonstrate the feasibility and benefits of this approach. These works pave the way for increased research on joint optimization within the DL pipeline.
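
In the spirit of these joint approaches (without reproducing any particular method), a minimal sketch of searching over architecture and training hyperparameters together might look as follows; train_and_evaluate is a placeholder for an actual training run, and the search space is made up for illustration.

```python
import random

def sample_joint_config():
    """Sample architecture and training hyperparameters from one joint space,
    so the search can exploit interactions between them."""
    return {
        "num_layers":    random.choice([2, 4, 8]),
        "width":         random.choice([64, 128, 256]),
        "activation":    random.choice(["relu", "gelu"]),
        "learning_rate": 10 ** random.uniform(-4, -1),
        "weight_decay":  10 ** random.uniform(-6, -2),
        "augmentation":  random.choice(["none", "basic", "randaugment"]),
    }

def train_and_evaluate(config):
    """Placeholder for building the model defined by `config`, training it,
    and returning a validation error; faked here with a random number."""
    return random.random()

best_config, best_error = None, float("inf")
for _ in range(50):                        # plain random search over the joint space
    config = sample_joint_config()
    error = train_and_evaluate(config)
    if error < best_error:
        best_config, best_error = config, error

print("best joint configuration:", best_config)
```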

Pillar Two, efficient meta-level optimization, has been extensively researched in hyperparameter optimization and NAS. Techniques like transfer learning, online optimization, gradient-based optimization, multi-fidelity optimization, and expert knowledge integration offer promising avenues for achieving efficiency in Deep Learning 2.0.

Pillar Three, multi-objective optimization, also benefits from existing research. Hardware-aware NAS, a significant driver in NAS, already incorporates secondary objectives like latency, parameter count, and energy consumption. Furthermore, research explores fairness as a secondary objective in hyperparameter optimization. Connecting these approaches to deep learning faces no fundamental obstacles.

NAS itself can be viewed as an early instantiation of Deep Learning 2.0, exhibiting elements of joint optimization (NAS+X), efficient gradient-based methods, and multi-objective optimization. While further development is needed across all pillars, the existing components suggest that Deep Learning 2.0 is a tangible and achievable goal.

Deep Learning 2.0 is Destined to Succeed

Deep Learning 2.0 is poised for success because it offers compelling advantages to various stakeholders: domain experts, DL experts, policymakers, and industry.

DL 2.0 from a Domain Expert Perspective

Domain experts, once central to traditional machine learning through feature engineering, have been somewhat displaced by deep learning. Deep Learning 2.0 re-empowers domain experts by allowing them to directly influence AI systems through objective specification. This user-centric approach fosters trust and is likely to drive greater acceptance of DL 2.0 compared to standard deep learning.

Moreover, Deep Learning 2.0 democratizes access to deep learning, enabling users without deep learning expertise to effectively utilize these powerful techniques without relying on scarce and expensive DL specialists. This enhanced usability will contribute to the widespread adoption of Deep Learning 2.0.

DL 2.0 from a DL Expert Perspective

Deep learning experts often resort to manual hyperparameter tuning due to limitations of automated methods (speed, inability to integrate expert knowledge). Manual tuning, while providing intuition, is time-consuming and tedious. Deep Learning 2.0 aims to alleviate this burden by offering efficient, automated optimization. DL 2.0 promises to be (1) efficient (Pillar Two), (2) capable of incorporating expert knowledge (e.g., through expert-defined priors [expert-defined priors]), and (3) transparent, providing reports to build intuition and trust. DL experts are likely to embrace DL 2.0 to enhance their productivity and effectiveness.

DL 2.0 from a Policy Maker’s Perspective

Deep Learning 2.0’s alignment with user objectives, reproducibility, and auditability are highly appealing to policymakers. The automated reporting capabilities of DL 2.0 systems can facilitate compliance with potential regulations mandating explainability and auditability in machine learning model development. Legislation may even favor or require the use of Deep Learning 2.0 in certain contexts.

DL 2.0 from an Industry Perspective

Deep Learning 2.0 is poised to drive substantial growth in the deep learning market due to:

  1. Reduced reliance on DL experts, saving labor, time, and costs.
  2. Potential for improved performance compared to standard deep learning.
  3. Ability to incorporate guidance from DL experts.
  4. Enhanced trustworthiness of deep learning systems.
  5. Broader applicability and pervasiveness than standard deep learning.

Take-aways

Deep Learning 2.0 represents the natural evolution of deep learning, mirroring its predecessor’s success in automating previously manual steps, but now at the meta-level. It is attractive to industry due to its potential to save resources, improve results, democratize deep learning, enhance trustworthiness, and become even more ubiquitous than standard deep learning.

Deep Learning 2.0 is built upon three pillars: (1) joint optimization of the entire DL pipeline, (2) efficient meta-optimization, and (3) direct alignment with user objectives through multi-objective optimization. To accelerate progress in deep learning, the community is strongly encouraged to prioritize research and development efforts focused on these three pillars.
