The Vision Transformer (ViT) is gaining popularity as a replacement for convolution-based neural networks in computer vision, offering simplicity, flexibility, and scalability. The model works by dividing an image into patches and projecting each patch into a token. In the standard pipeline, input images are first resized to a fixed square resolution and then split into patches.
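To make that concrete, here is a minimal NumPy sketch of the patchify-and-project step; the function name, the shapes, and the random stand-in for the learned projection are illustrative assumptions, not code from any of the papers discussed here:

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Assumes H and W are divisible by patch_size, as in a standard ViT
    pipeline where inputs are first resized to a fixed square resolution.
    """
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    return patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)

# Toy example: a 224x224 RGB image with 16x16 patches yields 196 tokens.
rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
patches = patchify(image, patch_size=16)      # (196, 768)
projection = rng.standard_normal((768, 768))  # stand-in for the learned projection
tokens = patches @ projection                 # (196, 768) token embeddings
```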
Recent studies have explored variations of this model. FlexiViT, for example, supports a range of sequence lengths by accommodating multiple patch sizes within a single architecture. Pix2Struct takes an alternative approach to patching that preserves the aspect ratio, which is useful for tasks such as chart and document comprehension.
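A rough way to see the FlexiViT idea: the sequence length a ViT produces is a simple function of the patch size, so varying the patch size varies the compute spent on the same image. The helper below is hypothetical arithmetic, not FlexiViT's code:

```python
def sequence_length(height: int, width: int, patch_size: int) -> int:
    """Number of tokens a ViT produces for a given input and patch size."""
    return (height // patch_size) * (width // patch_size)

# Smaller patches mean longer sequences (and more compute) for one image size.
for p in (8, 16, 32):
    print(p, sequence_length(224, 224, p))  # 8 -> 784, 16 -> 196, 32 -> 49
```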
NaViT is another alternative, developed by Google researchers. It uses a technique called Patch n’ Pack, which supports variable resolutions while preserving the aspect ratio. Inspired by the “example packing” used in natural language processing, it packs patches from multiple images into a single sequence. NaViT performs well across a wide range of resolutions, making it adaptable to a variety of tasks.
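The sketch below is a greatly simplified, first-fit version of this packing idea in NumPy: token sequences of different lengths (from images of different resolutions) are concatenated into fixed-length rows, with segment ids recording which image each token belongs to. The function and its details are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def pack_examples(token_seqs, max_len: int, pad_value: float = 0.0):
    """Greedily pack variable-length token sequences into fixed-length rows.

    Each row holds the tokens of one or more images plus padding; segment
    ids record which image each token came from (0 marks padding).
    Assumes every sequence fits within max_len on its own.
    """
    rows, segs = [], []
    cur, cur_seg, next_id = [], [], 1
    for seq in token_seqs:
        if len(cur) + len(seq) > max_len:  # current row is full: flush it
            pad = max_len - len(cur)
            rows.append(np.pad(np.asarray(cur, dtype=np.float32), (0, pad),
                               constant_values=pad_value))
            segs.append(cur_seg + [0] * pad)
            cur, cur_seg = [], []
        cur.extend(seq)
        cur_seg.extend([next_id] * len(seq))
        next_id += 1
    if cur:
        pad = max_len - len(cur)
        rows.append(np.pad(np.asarray(cur, dtype=np.float32), (0, pad),
                           constant_values=pad_value))
        segs.append(cur_seg + [0] * pad)
    return np.stack(rows), np.array(segs)

# Three "images" whose patch counts differ because their resolutions differ.
seqs = [np.ones(5), 2 * np.ones(3), 3 * np.ones(6)]
tokens, segment_ids = pack_examples(seqs, max_len=9)
print(segment_ids)
# [[1 1 1 1 1 2 2 2 0]
#  [3 3 3 3 3 3 0 0 0]]
```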
One key advantage of NaViT is its computational efficiency during both pre-training and fine-tuning, and it allows a smooth trade-off between performance and inference cost: a single NaViT model can be applied across different resolutions.
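A back-of-the-envelope way to see that trade-off: token count grows with resolution, and self-attention cost grows roughly quadratically in token count, so running the same model at a lower resolution is much cheaper. This is illustrative arithmetic with a 16-pixel patch size assumed:

```python
def token_count(side: int, patch_size: int = 16) -> int:
    """Tokens produced for a square input of the given side length."""
    return (side // patch_size) ** 2

# One model, three inference budgets: halving the resolution roughly
# quarters the tokens, and attention cost shrinks roughly quadratically.
for side in (512, 256, 128):
    n = token_count(side)
    print(f"{side}x{side}: {n:4d} tokens, {n * n:>9,} attention pairs")
```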
Traditionally, computer vision pipelines rely on fixed batch sizes and geometries, which usually means resizing or padding images to a predetermined shape. NaViT, by contrast, supports flexible batch shapes, enabling aspect-ratio-preserving resolution sampling and variable token-dropping rates. This opens up possibilities for adaptive computation and for new algorithms that improve training and inference efficiency.
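The sketch below shows one plausible form these two knobs could take: scaling an image to a pixel budget while keeping its aspect ratio, and randomly dropping a fraction of its tokens. Both helpers are hypothetical illustrations in the spirit of the paper, not its actual code:

```python
import numpy as np

def sample_resolution(orig_h: int, orig_w: int, target_area: int,
                      patch_size: int = 16):
    """Scale an image to a pixel budget while preserving its aspect ratio.

    Height and width are scaled by the same factor, then snapped to
    multiples of the patch size.
    """
    scale = (target_area / (orig_h * orig_w)) ** 0.5
    h = max(patch_size, round(orig_h * scale / patch_size) * patch_size)
    w = max(patch_size, round(orig_w * scale / patch_size) * patch_size)
    return h, w

def drop_tokens(tokens: np.ndarray, drop_rate: float,
                rng: np.random.Generator) -> np.ndarray:
    """Randomly keep a subset of tokens (NaViT-style token dropping)."""
    n = tokens.shape[0]
    keep = rng.permutation(n)[: max(1, int(n * (1.0 - drop_rate)))]
    return tokens[np.sort(keep)]

rng = np.random.default_rng(0)
h, w = sample_resolution(480, 640, target_area=224 * 224)   # 4:3 photo, ~224^2 budget
tokens = rng.standard_normal(((h // 16) * (w // 16), 768))  # (192, 768) for 192x256
kept = drop_tokens(tokens, drop_rate=0.3, rng=rng)          # ~70% of tokens survive
```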
While NaViT is built on the original ViT, other variants could in principle be used, as long as they process a sequence of patches. Patch n’ Pack is a straightforward application of sequence packing to vision transformers. It significantly improves training efficiency and enables flexible NaViT models that can be easily adapted to new tasks.
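One detail that packing makes necessary: tokens from different images sharing a packed sequence must not attend to each other. A common way to enforce this, sketched below using the segment ids from the packing example above, is a block-diagonal attention mask; this is an illustrative construction, and the paper additionally pools each image's tokens separately:

```python
import numpy as np

def packing_attention_mask(segment_ids: np.ndarray) -> np.ndarray:
    """Boolean self-attention mask for one packed sequence.

    segment_ids: (seq_len,), one id per packed image, 0 for padding.
    Returns (seq_len, seq_len); True means "token i may attend to token j".
    """
    same_image = segment_ids[:, None] == segment_ids[None, :]
    not_padding = (segment_ids != 0)[:, None] & (segment_ids != 0)[None, :]
    return same_image & not_padding

seg = np.array([1, 1, 1, 2, 2, 0])  # two images packed, one padding slot
print(packing_attention_mask(seg).astype(int))
# [[1 1 1 0 0 0]
#  [1 1 1 0 0 0]
#  [1 1 1 0 0 0]
#  [0 0 0 1 1 0]
#  [0 0 0 1 1 0]
#  [0 0 0 0 0 0]]
```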
Overall, NaViT is a step in the right direction for ViTs, breaking away from the conventional fixed-resolution input and modeling pipeline of computer vision models. It paves the way for further research into adaptive computation and related algorithmic improvements.