Time is Precious: Self-Supervised Learning Beyond Images – Exploring Video for Robust Visual Representations
30th September, 09:00 to 13:00 CEST, Amber 7 + 8, MiCo Milano
Tutorial Overview
Self-Supervised Learning (SSL) has revolutionized neural network pretraining by enabling models to learn from vast amounts of unlabeled data, surpassing the limitations of labeled datasets. This progress has made it possible to train on billions of images and achieve robust performance without expensive annotations. However, current state-of-the-art (SoTA) models typically derive representations from single, static images, missing the temporal context inherent in visual data. Learning from such disjointed snapshots limits a model's ability to understand the dynamic world. This limitation is especially apparent in recent SSL techniques, which are predominantly trained on meticulously curated, object-centric datasets like ImageNet; scaling single-image techniques to larger, less-curated datasets such as Instagram-1B has not yielded significant performance gains. The inherent constraint of a single image is its static nature: it cannot offer new perspectives of an object or anticipate events unfolding within a scene.
This tutorial pivots to leveraging the rich information in video frames to learn more robust visual representations. While image-based pretraining, exemplified by SimCLR, has dominated recent work, pretraining models on videos has a longer history. This session revisits both foundational and contemporary works that pretrain image encoders on videos for diverse pretext tasks, including egomotion prediction, active recognition, and dense prediction. We will cover practical implementation details relevant to practitioners and highlight connections to contemporary research such as VITO, TimeTuning, DoRA, and V-JEPA. Finally, we will place this approach within broader trends that aim to mimic the human visual system, such as learning from continuous video streams and from longitudinal audio-visual headcam recordings of young children.
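To ground one of these pretext tasks, here is a minimal sketch of an egomotion-prediction objective in the spirit of early work in this line (e.g., "Learning to See by Moving"): a shared encoder embeds two frames of a video, and a small head classifies the discretized camera motion between them. The class name `EgomotionNet`, the `num_motion_bins` parameter, and the encoder interface are illustrative assumptions, not code from any of the cited works.

```python
import torch
import torch.nn as nn

class EgomotionNet(nn.Module):
    """Siamese pretext model in the spirit of egomotion prediction:
    a shared encoder embeds two frames, and a linear head predicts the
    camera motion (discretized into bins) relating them."""

    def __init__(self, encoder: nn.Module, feat_dim: int, num_motion_bins: int):
        super().__init__()
        self.encoder = encoder  # shared image encoder, e.g. a small CNN or ViT
        self.head = nn.Linear(2 * feat_dim, num_motion_bins)

    def forward(self, frame_a: torch.Tensor, frame_b: torch.Tensor) -> torch.Tensor:
        fa = self.encoder(frame_a)  # (B, feat_dim)
        fb = self.encoder(frame_b)  # (B, feat_dim)
        return self.head(torch.cat([fa, fb], dim=1))  # (B, num_motion_bins) logits

# Training then reduces to classification: labels are ground-truth motion bins
# derived from the camera trajectory (or odometry), with a standard
# cross-entropy loss on the logits above.
```

The appeal of this family of objectives is that the supervisory signal (how the camera moved) comes for free from the video itself, yet solving the task forces the encoder to learn viewpoint-sensitive visual features.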
The Rising Motivation for Video-Based Self-Supervised Learning
Advances over the last 6-12 months signal a significant paradigm shift: SSL models pretrained on videos are demonstrating superior performance compared to their image-based counterparts. This progress underscores the growing importance of video-based SSL and makes a tutorial on its past and present developments timely for both newcomers and experienced researchers in the field. This tutorial aims to address critical questions, including:
- Efficacy with Limited Data: Can we achieve strong image encoders using high-quality videos, even with limited datasets?
- The Role of Augmentations: Do synthetic augmentations remain necessary? How can we effectively utilize the natural augmentations inherent in videos? (A minimal sketch follows this list.)
- Continuous Learning Paradigms: Can we develop learning systems that mimic continuous human visual learning from streams of video data?
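As a hedged illustration of the augmentation question above, the sketch below swaps the synthetic augmentations of a SimCLR-style objective for a natural one: two frames sampled from the same clip at nearby timestamps form the positive pair of an InfoNCE loss. The function name `temporal_info_nce` and the `encoder` interface are assumptions for illustration; this is not a reference implementation of any method discussed in the tutorial.

```python
import torch
import torch.nn.functional as F

def temporal_info_nce(encoder, frames_t, frames_t_plus_dt, temperature=0.1):
    """InfoNCE loss where the positive pair is two frames from the same
    clip at nearby timestamps (a natural augmentation), rather than two
    synthetic augmentations of one static image."""
    z1 = F.normalize(encoder(frames_t), dim=1)          # (B, D)
    z2 = F.normalize(encoder(frames_t_plus_dt), dim=1)  # (B, D)
    logits = z1 @ z2.t() / temperature                  # (B, B) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    # Row i's positive is the time-shifted frame from the same clip i;
    # frames from the other clips in the batch serve as negatives.
    return F.cross_entropy(logits, labels)
```

Under this view, the temporal gap between the two frames plays a role analogous to augmentation strength: larger gaps expose the encoder to viewpoint changes and object deformations that synthetic crops and color jitter cannot produce.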
Expert Speakers Leading the Tutorial
This tutorial features a distinguished panel of researchers at the forefront of self-supervised learning and computer vision:
- Shashanka Venkataramanan (INRIA): Expert in self-supervised learning, focusing on image representation learning from videos and data augmentation methods.
- Mohammadreza Salehi (University of Amsterdam): Specializes in representation learning, particularly learning image representations from videos, and machine learning safety.
- Yuki M. Asano (University of Amsterdam): Assistant Professor and QUVA Lab leader, renowned for research in self-supervised learning, multi-modal learning, and video-based SSL methods such as TimeTuning and DoRA.
- João Carreira (Google DeepMind): Research Scientist working on learning from continuous video streams and massively parallel video models.
- Ishan Misra (GenAI, Meta): Research Scientist working on generative video models such as Emu Video and FlowVid.
- Emin Orhan (Independent Researcher): Studies self-supervised learning from the perspective of a developing child, using longitudinal audio-visual headcam recordings.
Detailed Tutorial Schedule
The tutorial is structured to provide a comprehensive understanding of video-based self-supervised learning, from foundational concepts to cutting-edge applications.
| Title | Speaker | Slides | Talk |
|---|---|---|---|
| Introduction | Mohammadreza Salehi | Slides | Talk |
| Part (1): Learning image encoders from videos – Prior works | Shashanka Venkataramanan | Slides | Talk |
| Part (2): New Vision Foundation Models from Video(s): 1-video pretraining, tracking image-patches | Yuki M. Asano | Slides | Talk |
| Coffee Break | | | |
| Applications (1): Learning from one continuous stream: single-stream continual learning, massively parallel video models, perceivers | João Carreira | Slides | Talk |
| Applications (2): What makes generative video models tick? Emu Video (text-to-video), FlowVid (video-to-video), factorizing text-to-video generation, efficiency | Ishan Misra | | |
| Applications (3): SSL from the perspective of a developing child – Audio-visual dataset, development of early word learning, learning from children | Emin Orhan | Slides | Talk |
| Conclusion, Open Problems & Final remarks | Yuki M. Asano | | |
About the Tutorial Speakers
Learn more about the experts who will be guiding you through the exciting landscape of video-based self-supervised learning:
Shashanka Venkataramanan (INRIA)
Shashanka is a PhD candidate at INRIA, France, specializing in self-supervised learning with a focus on video-based image representation and data augmentation techniques. He has a strong background in organizing deep learning workshops covering diverse topics such as diffusion models and adversarial attacks.
Mohammadreza Salehi (University of Amsterdam)
Mohammadreza is pursuing his PhD at the QUVA Lab, University of Amsterdam, researching representation learning with an emphasis on video-derived image representations and machine learning safety.
Yuki Asano (University of Amsterdam)
Yuki M. Asano leads the Qualcomm-UvA (QUVA) lab and is an Assistant Professor at the University of Amsterdam. His research spans self-supervised learning, multi-modal learning, and augmentations, contributing significantly to video-based SSL through works like TimeTuning and DoRA. He is actively involved in the computer vision community as an Area Chair for major conferences and as a workshop organizer.
João Carreira (Google DeepMind)
João Carreira is a Research Scientist at Google DeepMind, where his work includes learning from continuous video streams, single-stream continual learning, massively parallel video models, and Perceiver architectures.
Ishan Misra (GenAI, Meta)
Ishan Misra is a Research Scientist at GenAI, Meta. His research spans self-supervised learning and generative video models, including Emu Video (text-to-video) and FlowVid (video-to-video).
Emin Orhan (Independent Researcher)
Emin Orhan is an independent researcher studying self-supervised learning through the lens of child development, including learning visual representations from longitudinal audio-visual headcam recordings of young children.
This tutorial offers a valuable opportunity to gain insights into the future of self-supervised learning and its application to image representation through the power of video. Join us at ECCV 2024 to explore the exciting potential of learning image representations beyond the limitations of static images.