Artificial intelligence (AI) models have recently made significant advancements. Thanks to the open-source movement, programmers can now combine different open-source models to create innovative applications.
One such advancement is Stable Diffusion, which generates photorealistic images and other styles from text prompts. However, these models are typically large and computationally demanding, so web applications that use them usually offload inference to GPU servers. Additionally, most workloads require specific GPU families to run the popular deep-learning frameworks.
The Machine Learning Compilation (MLC) team aims to change this situation and bring more diversity to the ecosystem. They believe that moving computation to the client can deliver several benefits, such as lower costs for service providers, better personalized experiences, and improved security.
According to the team, ML models should be able to run without relying on GPU-accelerated Python frameworks. AI frameworks depend heavily on hardware vendors' optimized compute libraries, so to get the best performance, unique variants must be generated for each client's infrastructure.
The team has developed Web Stable Diffusion, which runs the full stable diffusion pipeline directly in the browser on the client's GPU. All computation is handled locally within the browser, with no server interaction. This groundbreaking approach is the world's first browser-based stable diffusion.
The key technologies used in this solution include PyTorch, Hugging Face diffusers and tokenizers, Rust, WebAssembly (Wasm), and WebGPU. Apache TVM Unity serves as the foundation for constructing the main flow.
The team has utilized Runway's Stable Diffusion v1.5 model from the Hugging Face diffusers library.
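For reference, here is a minimal sketch of loading that same model family with the diffusers Python API; runwayml/stable-diffusion-v1-5 is the public model ID of the checkpoint in question.

```python
# A minimal sketch: loading Runway's Stable Diffusion v1.5 with diffusers.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# The pipeline bundles the components that get compiled separately:
# pipe.text_encoder (CLIP text encoder), pipe.unet (denoiser), pipe.vae (image decoder).
```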
The model components are captured into a TVM IRModule using TorchDynamo and Torch FX. TVM then generates executable code for each function in the IRModule, enabling deployment in any environment that can run the TVM minimum runtime, including JavaScript.
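The exact tracing entry points vary across TVM Unity revisions, but a hedged sketch of the FX-to-IRModule capture might look like the following; TinyNet is a toy stand-in for a real component such as the UNet, and from_fx is one plausible frontend in TVM Unity.

```python
# A hedged sketch of capturing a PyTorch module into a TVM IRModule via Torch FX.
import torch
from tvm.relax.frontend.torch import from_fx

class TinyNet(torch.nn.Module):  # toy stand-in for a real model component
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(64, 64)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        return self.relu(self.linear(x))

# Trace the module into a torch.fx GraphModule, then translate it to Relax.
graph_module = torch.fx.symbolic_trace(TinyNet())
mod = from_fx(graph_module, [((1, 64), "float32")])  # IRModule with a "main" function
mod.show()
```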
TensorIR and MetaSchedule are used to automatically generate efficient code, which is then fine-tuned locally to produce optimized GPU shaders. The tuning records are saved in a database, allowing future builds to reproduce optimized code without further fine-tuning.
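As a rough illustration, and assuming the TVM Unity MetaSchedule interface (tune_tir/compile_tir; the matmul workload, trial budget, and CPU target here are all illustrative), the tuning step looks something like this:

```python
# A hedged sketch of auto-tuning a TensorIR function with MetaSchedule.
import tvm
from tvm import meta_schedule as ms
from tvm.script import tir as T

@T.prim_func
def matmul(a: T.handle, b: T.handle, c: T.handle):
    A = T.match_buffer(a, (128, 128), "float32")
    B = T.match_buffer(b, (128, 128), "float32")
    C = T.match_buffer(c, (128, 128), "float32")
    for i, j, k in T.grid(128, 128, 128):
        with T.block("C"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]

target = tvm.target.Target("llvm -num-cores 4")  # a GPU target would be used in practice
# Tuning records land in work_dir, so later builds can reuse them without re-tuning.
database = ms.tune_tir(mod=matmul, target=target,
                       max_trials_global=64, work_dir="./tune_logs")
sch = ms.tir_integration.compile_tir(database, matmul, target)  # best schedule found
```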
Static memory planning optimizations are applied to reuse memory across multiple layers. The TVM web runtime, built with Emscripten and TypeScript, facilitates deployment of the generated modules.
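A hedged sketch of what such a build step could look like, assuming TVM Unity's StaticPlanBlockMemory pass and the webgpu target name (the actual build scripts may differ):

```python
# A hedged sketch: static memory planning, then compilation for a WebGPU target.
import tvm
from tvm import relax

def build_for_web(mod: tvm.IRModule):
    # Plan allocations statically so one memory block is reused across layers.
    mod = relax.transform.StaticPlanBlockMemory()(mod)
    # GPU shaders target WebGPU; host code targets wasm (linked via Emscripten).
    target = tvm.target.Target("webgpu", host="llvm -mtriple=wasm32-unknown-unknown-wasm")
    return relax.build(mod, target)
```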
The team also utilizes the wasm port of Hugging Face's Rust tokenizers library.
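The wasm port exposes the same interface as the tokenizers Python package, so the equivalent step can be sketched in Python; the CLIP tokenizer ID below is an assumption for illustration.

```python
# A minimal sketch of the tokenization step, using the Python tokenizers package.
from tokenizers import Tokenizer

# Illustrative: a CLIP tokenizer like the one used by stable diffusion's text encoder.
tokenizer = Tokenizer.from_pretrained("openai/clip-vit-base-patch32")
ids = tokenizer.encode("a photograph of an astronaut riding a horse").ids
print(ids)  # token IDs fed to the text encoder
```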
Apart from the final step of building a JavaScript app to tie everything together, the entire workflow is done in Python. This interactive, Python-first development flow makes it straightforward to bring in new models.
The open-source community plays a crucial role in making all of this possible. The team relies on TVM Unity, the latest addition to the Apache TVM project, which enables a Python-first interactive MLC development experience and the rapid composition of novel ecosystem solutions.
For more information, visit the tool and GitHub links provided. Credit for this research goes to the researchers involved in the project.