Introducing a Unified Vision Transformer: SAM-CLIP
In 2023, a paper accepted at the UniReps Workshop at NeurIPS introduced a simple recipe for efficiently merging publicly available vision foundation models (VFMs) into a single unified model that absorbs their expertise.
What are Vision Foundation Models (VFMs)?
Vision foundation models, such as CLIP and the Segment Anything Model (SAM), have distinct capabilities that stem from their pre-training objectives. CLIP excels at semantic understanding, while SAM specializes in spatial understanding for segmentation.
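To make the distinction concrete, the sketch below queries each model through its typical interface: CLIP answers "what is in the image?" via image-text similarity, while SAM answers "where is the object?" from a point prompt. It assumes the Hugging Face `transformers` library; the checkpoint names, image path, and prompt coordinates are illustrative, not prescriptive.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, SamModel, SamProcessor

image = Image.open("street_scene.jpg").convert("RGB")  # hypothetical input image

# --- CLIP: semantic understanding (image-text alignment) ---
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
labels = ["a photo of a car", "a photo of a bicycle", "a photo of a pedestrian"]
inputs = clip_proc(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = clip(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))  # answers: what is in the image?

# --- SAM: spatial understanding (promptable segmentation) ---
sam = SamModel.from_pretrained("facebook/sam-vit-base")
sam_proc = SamProcessor.from_pretrained("facebook/sam-vit-base")
point = [[[450, 600]]]  # a single click on the object we want segmented
inputs = sam_proc(image, input_points=point, return_tensors="pt")
with torch.no_grad():
    outputs = sam(**inputs)
masks = sam_proc.image_processor.post_process_masks(
    outputs.pred_masks, inputs["original_sizes"], inputs["reshaped_input_sizes"]
)
print(masks[0].shape)  # answers: where is the object? (binary masks, no class labels)
```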
The Integration of SAM and CLIP
By applying a method that integrates multi-task learning, continual learning, and distillation, the authors merged SAM and CLIP into SAM-CLIP: a unified model that combines their capabilities in a single vision transformer. SAM-CLIP reduces storage and compute costs at inference time, making it well suited for edge-device applications. It also establishes new state-of-the-art results on five zero-shot semantic segmentation benchmarks, outperforming previous models designed specifically for that task by a large margin.
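The recipe was not released as a single API call, but the core idea can be sketched in a few lines: a shared image encoder with two lightweight heads, where each head is distilled against its frozen teacher on that teacher's kind of data, so acquiring CLIP's semantics does not erase SAM's segmentation skill. The modules and loss choices below (cosine distillation for the CLIP head, an MSE feature loss for the SAM head, a weighting factor `lam`) are illustrative assumptions, not the authors' exact training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SamClipStudent(nn.Module):
    """Shared backbone with one head per absorbed teacher (simplified sketch)."""
    def __init__(self, shared_encoder: nn.Module, sam_head: nn.Module, clip_head: nn.Module):
        super().__init__()
        self.encoder = shared_encoder   # ViT backbone shared by both tasks
        self.sam_head = sam_head        # produces spatial mask features
        self.clip_head = clip_head      # pools features into a CLIP-style embedding

    def forward(self, images):
        feats = self.encoder(images)
        return self.sam_head(feats), self.clip_head(feats)

def multitask_distillation_step(student, frozen_sam, frozen_clip,
                                sam_batch, clip_batch, lam=1.0):
    """One multi-task distillation step on a mix of SAM-style and CLIP-style data."""
    # CLIP distillation: pull the student's image embedding toward the teacher's.
    _, student_emb = student(clip_batch)
    with torch.no_grad():
        teacher_emb = frozen_clip(clip_batch)
    clip_loss = 1 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()

    # SAM rehearsal (continual learning): match the teacher's mask features
    # so segmentation ability is not forgotten while the CLIP head is learned.
    student_masks, _ = student(sam_batch)
    with torch.no_grad():
        teacher_masks = frozen_sam(sam_batch)
    sam_loss = F.mse_loss(student_masks, teacher_masks)

    return clip_loss + lam * sam_loss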
As a merger of the two VFMs, SAM-CLIP not only retains the foundational strengths of SAM and CLIP, but also gains synergistic capabilities, most notably zero-shot semantic segmentation, where it achieves outstanding results.
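One plausible way such a unified backbone enables zero-shot semantic segmentation is to score dense, CLIP-aligned patch embeddings against text embeddings of the class names and upsample the result to a per-pixel class map. The sketch below assumes hypothetical `encode_patches` / `encode_text` methods and a tensor image input; it illustrates the mechanism, not the released model's API.

```python
import torch
import torch.nn.functional as F

def zero_shot_segment(model, image, class_names, patch_grid=(64, 64)):
    """Assign each pixel the class whose text embedding best matches its patch."""
    patch_emb = model.encode_patches(image)     # (H*W, D) dense, CLIP-aligned features
    text_emb = model.encode_text(class_names)   # (C, D) one embedding per class name
    patch_emb = F.normalize(patch_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = patch_emb @ text_emb.T             # (H*W, C) cosine similarities
    h, w = patch_grid
    logits = logits.T.reshape(1, len(class_names), h, w)
    logits = F.interpolate(logits, size=image.shape[-2:],
                           mode="bilinear", align_corners=False)
    return logits.argmax(dim=1)                 # (1, H_img, W_img) per-pixel class map
```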