Finding video parts with natural language

Abstract

The increasing number of videos has motivated the development of content-based video retrieval (CBVR) methods, which search for videos whose content is relevant to a query. Since most existing datasets for this task provide short video clips capturing a single activity, previous methods have focused on such short clips. However, most real-world videos are longer and edited: they may consist of multiple clips and contain diverse content within a single video, so previous methods may fail on them. In this paper, we propose a new video retrieval task that aims to handle such multi-clip videos. The task is to find query-relevant parts of a video consisting of multiple clips, which we call fine-grained video retrieval (FGVR). For this new task, we build datasets from existing video-description datasets: we synthesize multi-clip video and query pairs by augmenting these datasets, which yields large-scale training and evaluation data. We introduce several deep neural network-based baselines and a training scheme that uses the synthesized video-query pairs. We evaluate the baselines on two datasets built from YouTube and movie datasets, respectively, and present preliminary results.
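
To make the data-synthesis step concrete, the following is a minimal sketch of one way multi-clip video and query pairs could be assembled from a clip-level video-description dataset, assuming each source clip comes with a single description. The names (`Clip`, `synthesize_example`, `num_clips`) are hypothetical and are not from the paper; the actual synthesis procedure may differ.

```python
# Minimal sketch: build a multi-clip video by concatenating sampled clips,
# use one clip's description as the query, and label that clip's span.
import random
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Clip:
    frames: list          # placeholder for decoded frames or features
    description: str      # natural-language description of the clip

def synthesize_example(clips: List[Clip], num_clips: int = 5,
                       seed: int = None) -> Tuple[List[Clip], str, Tuple[int, int]]:
    """Sample clips, concatenate them into one multi-clip video, and
    return (video, query, label) where label is the clip-index range
    relevant to the query."""
    rng = random.Random(seed)
    sampled = rng.sample(clips, num_clips)
    target_idx = rng.randrange(num_clips)
    query = sampled[target_idx].description
    # Here the relevant "part" is exactly the target clip; a real pipeline
    # could merge adjacent clips drawn from the same source video.
    label = (target_idx, target_idx + 1)
    return sampled, query, label
```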