Video understanding is a challenging task that involves analyzing both spatial and temporal information in videos. It has many applications, such as semantic content analysis and robot perception. However, current methods for video understanding are computationally expensive, largely because they densely tokenize every frame of the video.
In our research, we introduce a technique called “Sparse Video Tubes” that makes video understanding more efficient. We start from a Vision Transformer (ViT), a model originally designed for images, and turn it into a video backbone using sparse video tubes. These tubes are learnable kernels that sparsely sample the video, which reduces the model’s computational requirements.
Our approach can process both images and videos, allowing us to leverage both data sources during training. It also allows the model to serve as either an image or a video backbone, depending on the input.
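One simple way to make a single backbone handle both inputs, sketched below under the assumption that an image is treated as a one-frame video (the function name is illustrative, not from the paper):

```python
def canonical_shape(shape):
    """Treat an image as a single-frame video so one backbone handles both.

    (H, W) image shapes gain a temporal axis of length 1; (T, H, W)
    video shapes pass through unchanged.
    """
    if len(shape) == 2:           # (H, W) image
        return (1,) + tuple(shape)  # -> (T=1, H, W)
    return tuple(shape)           # already (T, H, W) video

print(canonical_shape((224, 224)))       # (1, 224, 224)
print(canonical_shape((32, 224, 224)))   # (32, 224, 224)
```

With this convention, the same tokenizer and encoder run on both kinds of input; only the tubes whose temporal extent fits the input contribute tokens.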
We demonstrate that our model scales well and achieves state-of-the-art results on several video classification benchmarks. Sparse video tubes let us represent the visual information in videos efficiently while sharing the same token representation with image inputs.
The sparse tube ViT model uses a standard ViT backbone with Transformer layers to process video information. Instead of densely tokenizing the video, we sample it sparsely using video tubes: learnable 3D kernels of varying shapes and strides that extract spatio-temporal patches from the video. Because sparse sampling produces far fewer tokens, global self-attention over all of them remains affordable, which improves the model’s performance.
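To make the savings concrete, the sketch below counts the tokens produced by dense tokenization versus a pair of sparse tubes on a 32-frame clip. The specific tube shapes, strides, and offsets here are illustrative assumptions, not the paper's exact configuration:

```python
def axis_positions(size, kernel, stride, offset=0):
    """Number of valid tube placements along one axis."""
    span = size - offset - kernel
    return 0 if span < 0 else span // stride + 1

def tube_tokens(video_shape, kernel, stride, offset=(0, 0, 0)):
    """Tokens produced by one tube over a (T, H, W) input."""
    t, h, w = [axis_positions(s, k, st, o)
               for s, k, st, o in zip(video_shape, kernel, stride, offset)]
    return t * h * w

video = (32, 224, 224)  # frames, height, width

# Dense 2x16x16 tokenization, as in a standard video ViT:
dense = tube_tokens(video, (2, 16, 16), (2, 16, 16))   # 16 * 14 * 14 = 3136

# Two illustrative sparse tubes: a 2D-style 1x16x16 patch tube with a
# large temporal stride, plus an 8x8x8 tube with large spatial strides.
sparse = (tube_tokens(video, (1, 16, 16), (32, 16, 16))
          + tube_tokens(video, (8, 8, 8), (16, 32, 32)))  # 196 + 98 = 294

print(dense, sparse)  # 3136 294
```

Roughly a tenfold reduction in token count in this toy setup, which is what makes global self-attention over all tokens tractable. Note also that on a single image (T=1), the 8x8x8 tube fits nowhere and only the 1x16x16 tube fires, recovering standard image-ViT tokenization.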
Each tube is applied at multiple locations in the video, with the number of placements depending on the input size. A fixed position embedding is added to capture the global spatio-temporal location of each tube. The tube features are then concatenated and processed by a standard ViT encoder. Finally, an attention pooling step compresses the tokens into a single representation for classification.
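The final pooling step can be sketched as single-query attention over the encoder's output tokens. This is a simplified, single-head sketch with no learned projections; the query vector and the omission of projections are assumptions for illustration:

```python
import math

def attention_pool(tokens, query):
    """Compress a list of token vectors into one vector with a single
    attention query: softmax(q . k / sqrt(d)) weights, then a weighted
    average of the tokens."""
    d = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d)
              for tok in tokens]
    m = max(scores)                          # subtract max for stability
    exp = [math.exp(s - m) for s in scores]
    z = sum(exp)
    weights = [e / z for e in exp]
    # Weighted average of tokens -> single representation
    return [sum(w * tok[i] for w, tok in zip(weights, tokens))
            for i in range(d)]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = attention_pool(tokens, query=[1.0, 0.0])
```

Tokens that align with the query receive higher weights, so the pooled vector leans toward them; in the real model the query is learned and the pooled vector feeds the classification head.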
Our sparse tube ViT model allows video models to scale efficiently by reusing pre-trained image backbones. This makes it easier to train large video models without starting from scratch.
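One common way to reuse image weights, sketched below, is to inflate a pretrained 2D patch-embedding kernel into a 3D tube kernel. Placing the 2D kernel at the central temporal slice and zero-filling the rest is shown as an illustrative assumption, not necessarily the paper's exact initialization recipe:

```python
def inflate_2d_kernel(kernel_2d, depth):
    """Inflate a 2D patch-embedding kernel (list of rows) into a 3D tube
    kernel of the given temporal depth: the 2D kernel sits at the central
    slice and all other slices are zero, so at initialization the tube
    reproduces the image model's response on the center frame."""
    h, w = len(kernel_2d), len(kernel_2d[0])
    center = depth // 2
    return [[row[:] for row in kernel_2d] if t == center
            else [[0.0] * w for _ in range(h)]
            for t in range(depth)]

patch = [[0.5, -0.5], [1.0, 0.0]]      # toy 2x2 pretrained kernel
tube = inflate_2d_kernel(patch, depth=3)
```

Starting from such an initialization, the non-central slices can then learn temporal structure during video fine-tuning instead of being trained from scratch.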
We have evaluated our model on several datasets and compared it to prior methods. Our approach outperforms previous methods, achieving state-of-the-art results. We have also visualized the learned tube kernels, showing that they detect low-level features like edges and colors in images, and capture basic shapes and how they change over time in videos.
In conclusion, our research presents a more efficient and effective method for video understanding using sparse video tubes. It improves the performance of video models and allows for the seamless processing of image and video inputs.