Researchers from The University of Hong Kong, TikTok, Zhejiang Lab, and Zhejiang University have developed a foundation model for Monocular Depth Estimation (MDE). The model leverages large-scale unlabeled images, which are cheap to obtain, diverse, and easy to annotate automatically, and trains on both labeled and unlabeled data to estimate depth more accurately. The researchers collected 1.5 million labeled images from 6 public datasets and designed a data engine that automatically generates depth annotations for unlabeled images. The resulting model outperforms prior models in depth estimation and, when its encoder is transferred, in semantic segmentation tasks, showing its potential for use in a wide range of visual perception systems.
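To make that pipeline concrete, below is a minimal, hypothetical sketch (in PyTorch, not the authors' code) of the labeled-plus-pseudo-labeled training idea: a teacher trained on labeled images annotates the unlabeled ones, and a student is then trained on the union of real and pseudo depth labels. The tiny network and random tensors are placeholders standing in for a real MDE model and the actual image corpora.

```python
# Hedged sketch of training with labeled data plus teacher-generated pseudo labels.
import torch
import torch.nn as nn

def make_toy_mde() -> nn.Module:
    # Stand-in for a real depth network (e.g. an encoder-decoder on a pre-trained backbone).
    return nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(8, 1, 3, padding=1))

labeled_imgs = torch.rand(16, 3, 64, 64)     # images with ground-truth depth
labeled_depth = torch.rand(16, 1, 64, 64)
unlabeled_imgs = torch.rand(64, 3, 64, 64)   # cheap, diverse images without depth

# 1) Train a teacher on the labeled set (one optimization step shown for brevity).
teacher = make_toy_mde()
opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
loss = nn.functional.l1_loss(teacher(labeled_imgs), labeled_depth)
opt.zero_grad(); loss.backward(); opt.step()

# 2) "Data engine" step: the frozen teacher annotates the unlabeled images.
teacher.eval()
with torch.no_grad():
    pseudo_depth = teacher(unlabeled_imgs)

# 3) Train a student on real labels plus pseudo labels.
student = make_toy_mde()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
imgs = torch.cat([labeled_imgs, unlabeled_imgs])
targets = torch.cat([labeled_depth, pseudo_depth])
loss = nn.functional.l1_loss(student(imgs), targets)
opt.zero_grad(); loss.backward(); opt.step()
```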
The researchers named their model "Depth Anything," and it significantly outperforms the latest MiDaS model across diverse scenes and unseen datasets. Depth Anything relies on cheap and diverse unlabeled images, making it a robust solution for MDE. Two strategies make this scale-up effective: the optimization target on unlabeled images is made more challenging, and rich semantic priors from pre-trained encoders are preserved through auxiliary supervision. Together these lead to better performance and strong zero-shot estimation capabilities.
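The following sketch illustrates those two ideas under simple assumptions, and is not the paper's exact formulation: the student predicts the teacher's pseudo depth from a strongly perturbed view of an unlabeled image (a harder target), and an auxiliary feature-alignment term keeps the student's features close to a frozen pre-trained encoder so semantic priors are retained. The perturbation, loss weight, and toy networks here are illustrative placeholders.

```python
# Hedged sketch: hard-target supervision on perturbed unlabeled images
# plus feature alignment to a frozen pre-trained encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

def strong_perturb(x: torch.Tensor) -> torch.Tensor:
    # Placeholder for strong augmentations (e.g. color jitter, CutMix).
    return torch.clamp(x + 0.1 * torch.randn_like(x), 0.0, 1.0)

class Student(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(8, 1, 3, padding=1)
    def forward(self, x):
        feats = self.encoder(x)
        return self.head(feats), feats

# Frozen encoder standing in for a pre-trained model such as DINOv2.
frozen_encoder = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
for p in frozen_encoder.parameters():
    p.requires_grad_(False)

student = Student()
unlabeled = torch.rand(8, 3, 64, 64)
pseudo_depth = torch.rand(8, 1, 64, 64)   # produced by the teacher, as in the previous sketch

# Harder optimization target: predict pseudo depth from the perturbed view.
pred, feats = student(strong_perturb(unlabeled))
depth_loss = F.l1_loss(pred, pseudo_depth)

# Auxiliary term: keep student features close to the frozen encoder's features.
with torch.no_grad():
    ref_feats = frozen_encoder(unlabeled)
align_loss = 1.0 - F.cosine_similarity(feats, ref_feats, dim=1).mean()

total_loss = depth_loss + 0.1 * align_loss   # illustrative loss weight
total_loss.backward()
```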
For those interested in learning more, the research paper and model code can be accessed via the links provided.