Enhancing Video Understanding with Contrastive Learning and Large Language Models

Video-language pre-training has significantly advanced video understanding tasks. However, previous efforts have concentrated primarily on pairings of short-form videos and sentences, leaving long-form video-language pre-training largely unexplored. Learning directly from long-form videos and their accompanying language benefits many video-language tasks, but it raises two challenges: modeling long-range temporal relationships and managing the computational cost of processing extended video sequences.

To address these challenges, we introduce LF-VILA (Long-Form VIdeo-LAnguage pre-training model) and train it on a large-scale dataset of long-form videos and paragraphs. LF-VILA incorporates two mechanisms to capture rich temporal dynamics and align videos and language efficiently. First, the Multimodal Temporal Contrastive (MTC) loss learns temporal relationships across modalities by encouraging fine-grained alignment between long-form videos and paragraphs; a minimal sketch of such a contrastive objective is given below. Second, the Hierarchical Temporal Window Attention (HTWA) mechanism captures long-range dependencies within videos while reducing the computational cost of the Transformer; a simplified windowed-attention layer is also sketched below.
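To make the MTC idea concrete, here is a minimal sketch of a symmetric, segment-level contrastive objective, assuming a video encoder and a text encoder already produce one embedding per video segment and per sentence, with temporally aligned segment-sentence pairs serving as positives. The function name, tensor shapes, and temperature value are illustrative assumptions, not the exact LF-VILA formulation.

```python
# Sketch of a multimodal temporal contrastive objective (illustrative, not
# the exact LF-VILA loss). Inputs are assumed to be per-segment video
# embeddings and per-sentence text embeddings of the same dimension.
import torch
import torch.nn.functional as F


def multimodal_temporal_contrastive_loss(video_seg_emb, text_sent_emb, temperature=0.07):
    """Symmetric InfoNCE loss between aligned video segments and sentences.

    video_seg_emb: (N, D) embeddings of N video segments in temporal order
    text_sent_emb: (N, D) embeddings of the N sentences aligned with them
    """
    # L2-normalize so dot products become cosine similarities.
    v = F.normalize(video_seg_emb, dim=-1)
    t = F.normalize(text_sent_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares segment i with sentence j.
    logits = v @ t.t() / temperature

    # Diagonal pairs are the temporally aligned positives; all other
    # segment-sentence combinations act as negatives.
    targets = torch.arange(v.size(0), device=v.device)

    loss_v2t = F.cross_entropy(logits, targets)      # match each segment to its sentence
    loss_t2v = F.cross_entropy(logits.t(), targets)  # match each sentence to its segment
    return 0.5 * (loss_v2t + loss_t2v)


# Example usage with random features for 8 aligned segment-sentence pairs.
if __name__ == "__main__":
    video_feats = torch.randn(8, 256)
    text_feats = torch.randn(8, 256)
    print(multimodal_temporal_contrastive_loss(video_feats, text_feats).item())
```

Similarly, the following is a simplified temporal window attention layer: frame tokens attend only within non-overlapping temporal windows, so attention cost scales with the window size rather than the full sequence length, and stacking layers with progressively larger windows approximates the hierarchical design. The class name and parameters are illustrative, not the published implementation.

```python
# Sketch of temporal window attention (illustrative, not the LF-VILA code).
# Frame tokens are assumed to be in temporal order; T must be divisible by
# the window size in this simplified version (pad in practice).
import torch
import torch.nn as nn


class TemporalWindowAttention(nn.Module):
    """Self-attention restricted to non-overlapping temporal windows."""

    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, D) frame tokens in temporal order.
        B, T, D = x.shape
        w = self.window_size
        # Group frames into windows so each token only attends within its window.
        x = x.reshape(B * T // w, w, D)
        out, _ = self.attn(x, x, x)
        return out.reshape(B, T, D)


# Stacking layers with growing windows mimics the hierarchical idea:
# early layers model short-range motion, later layers cover longer spans.
if __name__ == "__main__":
    tokens = torch.randn(2, 32, 64)  # 2 videos, 32 frame tokens, dim 64
    for win in (4, 8, 16):           # progressively larger temporal windows
        tokens = TemporalWindowAttention(dim=64, num_heads=4, window_size=win)(tokens)
    print(tokens.shape)  # torch.Size([2, 32, 64])
```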

Empirical evaluations demonstrate the effectiveness of LF-VILA. Fine-tuning the pre-trained model on seven downstream long-form video-language understanding tasks, including paragraph-to-video retrieval and long-form video question answering, yields new state-of-the-art results. Notably, LF-VILA achieves a 16.1% relative improvement on ActivityNet paragraph-to-video retrieval and a 2.4% improvement on How2QA. Our code, dataset, and pre-trained models are publicly available at https://github.com/microsoft/XPretrain.
