Introducing RMT: A Powerful Vision Backbone Combining RetNet and Transformer
The Transformer, originally developed for NLP, has also proven effective in computer vision. Meanwhile, the NLP community has shown growing interest in the Retentive Network (RetNet) as a potential replacement for the Transformer. Building on this, Chinese researchers have proposed RMT, a vision backbone that combines RetNet and Transformer. RMT adapts RetNet’s explicit decay to the vision backbone, giving the model prior knowledge about spatial distances. This lets the model precisely control how much of the image each token can perceive. To reduce the computational cost of global modeling, RMT also decomposes the modeling process along the image’s two coordinate axes.
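The idea of an explicit, distance-based decay can be illustrated with a small mask over image tokens. The sketch below is a minimal illustration, not the authors' implementation: it assumes the decay depends on the Manhattan distance between token positions and uses a single rate `gamma`. The function name `manhattan_decay_mask` and the 3x3 grid are hypothetical choices made for this example.

```python
def manhattan_decay_mask(height, width, gamma):
    """Build a (H*W) x (H*W) mask where mask[n][m] = gamma ** distance(n, m).

    Assumption for illustration: distance is the Manhattan distance between
    the (row, column) positions of tokens n and m on the image grid.
    """
    # Tokens are flattened row by row, so token index = row * width + column.
    coords = [(row, col) for row in range(height) for col in range(width)]
    return [
        [gamma ** (abs(rn - rm) + abs(cn - cm)) for (rm, cm) in coords]
        for (rn, cn) in coords
    ]


mask = manhattan_decay_mask(3, 3, 0.5)
center = 4  # token at row 1, column 1 of the 3x3 grid

# Weights the center token would apply to every position, reshaped to 3x3.
center_view = [mask[center][row * 3:(row + 1) * 3] for row in range(3)]
for row in center_view:
    print(row)
```

A mask like this would be multiplied element-wise with the attention weights, so nearby tokens keep most of their weight while distant tokens are damped. A smaller `gamma` narrows the region each token effectively sees.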
RMT’s Impressive Performance in Computer Vision Tasks
Extensive experiments have demonstrated that RMT excels in various computer vision tasks. For example, with only 4.5 GFLOPs, RMT achieves 84.1% top-1 accuracy on ImageNet-1k. Compared with models of similar size trained with the same strategy, RMT consistently achieves the highest top-1 accuracy. It also outperforms existing vision backbones in object detection, instance segmentation, and semantic segmentation.
The Contributions of RMT
- RMT incorporates spatial prior knowledge about distances into vision models, bringing retention from RetNet to the two-dimensional setting with Retentive Self-Attention (ReSA).
- RMT simplifies computation by decomposing ReSA along the two image axes. This strategy significantly reduces the required computational effort without compromising the model’s performance.
- Extensive testing confirms RMT’s superior performance, particularly in downstream tasks such as object detection and instance segmentation.
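To see why decomposing along the two image axes can save computation, the sketch below makes two simple checks. It is an illustration under stated assumptions, not the ReSA implementation: it assumes a Manhattan-distance decay with a single rate `gamma`, and the helper names (`axis_decay_mask`, `full_decay`, `pair_counts`) are hypothetical. Under that assumption, the 2D decay factorizes into a per-row term times a per-column term, and attending along rows and columns touches far fewer token pairs than attending over all pairs.

```python
import math


def axis_decay_mask(length, gamma):
    """1D decay mask along one image axis: mask[i][j] = gamma ** |i - j|."""
    return [[gamma ** abs(i - j) for j in range(length)] for i in range(length)]


def full_decay(width, gamma, n, m):
    """2D decay between flattened tokens n and m (Manhattan distance assumed)."""
    rn, cn = divmod(n, width)
    rm, cm = divmod(m, width)
    return gamma ** (abs(rn - rm) + abs(cn - cm))


def pair_counts(height, width):
    """Token pairs scored by full attention vs. row-plus-column attention."""
    tokens = height * width
    full = tokens * tokens              # every token attends to every token
    axial = tokens * (height + width)   # each token attends along its column and its row
    return full, axial


H, W, GAMMA = 4, 6, 0.5
mask_h = axis_decay_mask(H, GAMMA)
mask_w = axis_decay_mask(W, GAMMA)

# The 2D decay equals the product of the vertical and horizontal 1D decays.
factorizes = all(
    math.isclose(full_decay(W, GAMMA, n, m), mask_h[n // W][m // W] * mask_w[n % W][m % W])
    for n in range(H * W)
    for m in range(H * W)
)

full_pairs, axial_pairs = pair_counts(H, W)
print(factorizes, full_pairs, axial_pairs)
```

The gap between the two pair counts grows with resolution, since the full count scales with the square of the number of tokens while the axis-wise count scales with the number of tokens times the sum of the two side lengths.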
RMT: A Powerful Vision Backbone with RetNet and Transformer
In summary, RMT combines the strengths of RetNet and Transformer to create a powerful vision backbone. It introduces spatial prior knowledge to vision models through an explicit decay related to distance. Its new mechanism, ReSA, carries RetNet’s retention over to two-dimensional images. To reduce computational cost, ReSA is decomposed along the two image axes. Extensive experiments have shown that RMT is effective, with particularly significant advantages in object detection.
For more information, you can read the paper. All credit for this research goes to the respective researchers.