Utilizing Unlabeled Videos with Video PreTraining for AI Models
The internet holds a vast number of videos we can learn from: tutorials on making impressive presentations, recordings of artists creating stunning artwork, and footage of skilled gamers building elaborate virtual structures. However, these videos only record what happened, not precisely how it was achieved. In other words, you won’t know the exact sequence of mouse movements and key presses used.
This lack of “action labels” is a challenge when we try to build large-scale AI models in these domains the way language models like GPT were built. In language, the “action labels” are simply the next words in a sentence, so raw text already contains its own training targets. Video has no such built-in labels, and that gap has to be closed before it can be used in the same way.
To make use of the large amount of unlabeled video available online, we have introduced Video PreTraining (VPT), a simple semi-supervised imitation learning method. We start by gathering a small dataset from contractors, recording not only their video but also the actions they perform, including mouse movements and keypresses. We then use this data to train an inverse dynamics model (IDM), which predicts the action taken at each step in the video.
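As a rough illustration of the IDM step, here is a minimal sketch in a toy setting. Everything in it is an assumption made for illustration: a "frame" is just an integer cursor position, the actions are the strings "left", "right", and "stay", and the IDM is a lookup table rather than a neural network. The names `train_idm` and `label_video` are hypothetical and not part of VPT.

```python
from collections import Counter, defaultdict


def train_idm(recordings):
    """Fit a lookup-table IDM from contractor recordings.

    Each recording is (frames, actions) with len(frames) == len(actions) + 1:
    action i moves the toy world from frames[i] to frames[i + 1].
    """
    counts = defaultdict(Counter)
    for frames, actions in recordings:
        for before, after, action in zip(frames, frames[1:], actions):
            # The IDM sees the frame before AND after the action.
            counts[after - before][action] += 1
    # For each observed frame change, keep the most frequent action.
    return {change: c.most_common(1)[0][0] for change, c in counts.items()}


def label_video(idm, frames):
    """Predict one action per step of a video that has no action labels."""
    return [idm.get(after - before) for before, after in zip(frames, frames[1:])]


# Small labeled dataset: video frames plus the recorded actions.
contractor_data = [
    ([0, 1, 2, 2, 1], ["right", "right", "stay", "left"]),
    ([5, 4, 4, 5], ["left", "stay", "right"]),
]
idm = train_idm(contractor_data)

# A new video with frames only; the IDM recovers the action at each step.
predicted = label_video(idm, [3, 3, 4, 5, 4])
```

In this toy world the frame change determines the action exactly, which is why a handful of labeled steps is enough. A real IDM works on pixels and would need a learned model, but the input/output contract is the same: frames in, one action per step out.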
What makes the IDM particularly useful is that it can use both past and future frames to infer the action at each step. This makes its task much easier, and therefore far less data-hungry, than behavioral cloning, which must predict actions from past video frames alone and so has to work out what the person intends to do and how they will do it. Once the IDM is trained on the small contractor dataset, we use it to label a much larger dataset of online videos, and then learn to act from those labeled videos through behavioral cloning.
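The labeling and behavioral cloning steps can be sketched in the same kind of toy setting. This is an illustration under stated assumptions, not the actual VPT implementation: frames are integer positions, `idm_stand_in` is a hand-written placeholder for an already trained IDM, and the behavioral cloning policy is a lookup table keyed on the current frame only. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def idm_stand_in(before, after):
    """Hypothetical stand-in for a trained IDM: it sees the frame before AND after."""
    change = after - before
    if change > 0:
        return "right"
    if change < 0:
        return "left"
    return "stay"


def pseudo_label(videos):
    """Turn action-free videos into (current_frame, action) training pairs."""
    pairs = []
    for frames in videos:
        for before, after in zip(frames, frames[1:]):
            pairs.append((before, idm_stand_in(before, after)))
    return pairs


def train_bc_policy(pairs):
    """Behavioral cloning as a lookup table: current frame -> most common action.

    Unlike the IDM, the policy never sees the next frame, so it has to
    capture what the demonstrators were trying to do.
    """
    counts = defaultdict(Counter)
    for frame, action in pairs:
        counts[frame][action] += 1
    return {frame: c.most_common(1)[0][0] for frame, c in counts.items()}


# Unlabeled "online videos": every demonstrator walks toward position 0 and stops.
web_videos = [[3, 2, 1, 0, 0], [-2, -1, 0, 0], [1, 0, 0]]
policy = train_bc_policy(pseudo_label(web_videos))
```

The asymmetry is the point of the sketch: the stand-in IDM only has to explain a transition it has already seen, while the cloned policy has to choose an action before the next frame exists. Here `policy` ends up mapping positive positions to "left", negative positions to "right", and 0 to "stay", which is the demonstrators' shared intent.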
In conclusion, Video PreTraining lets us put the wealth of unlabeled video on the internet to use as training data. This opens up new possibilities for developing large-scale AI models in domains where action labels are missing.