Introducing AIM: A New Frontier for Training Large-Scale Vision Models
AIM is a collection of vision models pre-trained with an autoregressive objective. These models are inspired by Large Language Models (LLMs) and exhibit similar scaling properties: the quality of the learned visual features scales with both model capacity and the quantity of pre-training data, and the value of the pre-training objective correlates with the model's performance on downstream tasks.
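To make the objective concrete, here is a minimal PyTorch sketch of autoregressive image pre-training: an image is split into a raster-ordered sequence of patches, and a causally masked Transformer regresses each next patch's pixel values with an MSE loss. The module name, patch size, and depth below are illustrative assumptions, not the paper's exact configuration; the full AIM recipe includes details (such as per-patch pixel normalization and prefix attention) that are omitted here.

```python
import torch
import torch.nn as nn


class CausalPatchAR(nn.Module):
    """Toy autoregressive image model: a causally masked Transformer over
    raster-ordered image patches, trained to regress the next patch's pixels."""

    def __init__(self, image_size=224, patch_size=14, dim=512, depth=6, heads=8):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (image_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_emb = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 3 * patch_size ** 2)  # predicts next patch's pixels

    def forward(self, images):
        b, p = images.shape[0], self.patch_size
        # Regression targets: the image as a raster-ordered sequence of flat patches.
        tgt = images.unfold(2, p, p).unfold(3, p, p)               # B,3,H/p,W/p,p,p
        tgt = tgt.permute(0, 2, 3, 1, 4, 5).reshape(b, self.num_patches, -1)
        # Input sequence: patch embeddings plus learned positions.
        x = self.patch_embed(images).flatten(2).transpose(1, 2) + self.pos_emb
        # Causal mask: patch t attends only to patches 1..t.
        mask = torch.triu(torch.full((self.num_patches, self.num_patches),
                                     float("-inf"), device=images.device), diagonal=1)
        h = self.blocks(x, mask=mask)
        # The prediction at position t is scored against patch t+1 (shift by one).
        pred = self.head(h[:, :-1])
        return nn.functional.mse_loss(pred, tgt[:, 1:])


model = CausalPatchAR()
loss = model(torch.randn(2, 3, 224, 224))  # dummy batch; loss is a scalar
```

Because the loss is a simple next-patch regression with causal masking, the training loop mirrors an LLM's next-token prediction, which is what allows the same scaling recipe to carry over without image-specific stabilization tricks.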
Practical Implication of AIM
A 7 billion parameter AIM pre-trained on 2 billion images achieves 84.0% top-1 accuracy on ImageNet-1k with a frozen trunk. Even at this scale, we observe no sign of saturation in performance, suggesting that AIM represents a new frontier for training large-scale vision models. The pre-training of AIM is similar to the pre-training of LLMs and does not require any image-specific strategy to stabilize training at scale.
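As a rough illustration of the frozen-trunk protocol behind that number, the sketch below trains only a probe head on top of frozen backbone features. Here `backbone` and `train_loader` are hypothetical placeholders, and a plain linear head stands in for the paper's probing setup for brevity.

```python
import torch
import torch.nn as nn


def train_frozen_trunk_probe(backbone, train_loader, feat_dim, num_classes=1000,
                             epochs=10, lr=1e-3):
    """Frozen-trunk evaluation sketch: the backbone is never updated;
    only a small classification head is trained on its features."""
    backbone.eval()
    for param in backbone.parameters():
        param.requires_grad_(False)          # the trunk stays frozen throughout
    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = backbone(images)     # no gradients flow into the trunk
            loss = nn.functional.cross_entropy(probe(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```

The point of this protocol is that classification accuracy then measures the quality of the pre-trained features themselves, rather than what fine-tuning can recover.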
Key Points to Remember
AIM is a collection of vision models pre-trained with an autoregressive objective, similar to Large Language Models (LLMs).
The performance of the visual features scales with both the model capacity and the quantity of data.
The value of the objective function correlates with the performance of the model on downstream tasks.
Pre-training a 7 billion parameter AIM on 2 billion images achieved 84.0% top-1 accuracy on ImageNet-1k with a frozen trunk.
No sign of saturation in performance was observed, suggesting a new frontier for training large-scale vision models.
The pre-training of AIM does not require any image-specific strategy to stabilize training at scale.