Running Llama 13B and Openchat 13B Models on a Single GPU: A Step-by-Step Guide

Connor A.

August 15, 2023

How to Run Llama 13B and Openchat 13B Models on a Single GPU

When testing and experimenting with large language models such as Llama 13B and Openchat 13B, we often run into resource limitations. Platforms like Google Colab Pro make it possible to test models of up to 7B parameters, but what options do we have when we want to experiment with larger models, such as 13B?

In this blog post, we will show you how to run Llama 13B and Openchat 13B models on a single GPU, using the T4 GPU available in Google Colab Pro with 25 GB of system RAM. Follow these steps to get started.

Step 1: Install the Requirements
First, install the required packages. Make sure you have the latest version of the bitsandbytes library (0.39.0 at the time of writing).

```
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install sentencepiece
```
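After the install cell finishes, it can be worth confirming which versions were actually picked up, since 4-bit loading depends on a recent bitsandbytes. The following is a minimal sketch that uses only the Python standard library. The helper names `installed_version` and `version_tuple` are illustrative, and the 0.39.0 threshold is simply the version mentioned above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '0.39.0' into (0, 39, 0); stops at the first non-numeric part (e.g. 'dev0')."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


# Report what is installed for each package from the install cell.
for name in ("bitsandbytes", "transformers", "peft", "accelerate", "sentencepiece"):
    print(name, installed_version(name) or "not installed")

# Warn if bitsandbytes is older than the version this guide was written against.
bnb = installed_version("bitsandbytes")
if bnb is not None and version_tuple(bnb) < (0, 39, 0):
    print("bitsandbytes is older than 0.39.0; re-run the install cell")
```

In Colab, a runtime restart may be needed before freshly upgraded packages are picked up by `import`.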

Step 2: Use the Quantization Technique
Our approach relies on quantization, using the bitsandbytes integration in the transformers library. It supports 4-bit quantization in different variants, such as NF4 or pure FP4, and the compute dtype can be set to float16, bfloat16, or float32.

To make matrix multiplication and training more efficient, we recommend a 16-bit compute dtype. The BitsAndBytesConfig class, recently added to transformers, lets you change these parameters as needed.

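```
import torch
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the weights in 4-bit and run computations in a 16-bit dtype.
# float16 or float32 can be chosen as the compute dtype instead.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```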
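The NF4 variant mentioned above can be selected explicitly. The sketch below is one possible variation, not the configuration used in the rest of this post. It assumes torch, transformers, and bitsandbytes are installed; `build_nf4_config` is a hypothetical helper name, while `bnb_4bit_quant_type` and `bnb_4bit_use_double_quant` are existing `BitsAndBytesConfig` parameters.

```python
def build_nf4_config():
    """Build a 4-bit NF4 config (assumes torch, transformers and bitsandbytes are installed)."""
    # Imported inside the function so that defining the helper does not
    # itself require the libraries to be present.
    import torch
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # "fp4" is the other 4-bit variant
        bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
        bnb_4bit_compute_dtype=torch.float16,   # a 16-bit compute dtype; bfloat16 is also valid
    )
```

To try it, pass `quantization_config=build_nf4_config()` when loading the model in Step 3. Whether NF4 gives better output than FP4 for a given model is something to check empirically.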

Step 3: Load the Tokenizer and Model
In this step, we load the tokenizer and the model. This example uses the Openchat model, but you can use any 13B model available on the Hugging Face Hub. If you want to use the Llama 13B model (here, the OpenLLaMA reproduction), simply change model_id to “openlm-research/open_llama_13b” and run the steps below.

```
model_id = "openchat/openchat_8192"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The quantization config from Step 2 makes the weights load in 4-bit.
model_bf16 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
```

Step 4: Test the Model
Once the model is loaded, it’s time to test it. You can provide any input of your choice and adjust the “max_new_tokens” parameter to generate the desired number of tokens.

```
text = "Q: What is the largest animal?\nA:"
device = "cuda:0"
# Tokenize the prompt and move the tensors to the GPU.
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model_bf16.generate(**inputs, max_new_tokens=35)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
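The decoded text includes the prompt, and the model may continue past the first answer into a new question before it reaches the token limit. A small, purely illustrative helper can trim the output to the first answer, assuming the Q:/A: prompt format used above. `extract_answer` is a hypothetical name, not part of transformers, and the sample string is made up for demonstration rather than real model output.

```python
def extract_answer(decoded, answer_marker="A:", question_marker="Q:"):
    """Return the text after the first answer marker, cut at the next question marker."""
    # Everything after the first "A:" is the model's continuation.
    _, found, rest = decoded.partition(answer_marker)
    if not found:
        # No answer marker: return the text as-is, minus surrounding whitespace.
        return decoded.strip()
    # Drop any follow-on question the model may have started.
    answer = rest.split(question_marker, 1)[0]
    return answer.strip()


# Made-up example string in the same format as the prompt above.
sample = "Q: What is the largest animal?\nA: The blue whale.\nQ: What is the smallest?"
print(extract_answer(sample))
```

With a real run, you would pass the string returned by `tokenizer.decode(...)` instead of `sample`.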

You can use this quantization technique to run other 13B models on a single GPU, such as the T4 in Google Colab Pro.
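A rough back-of-the-envelope calculation shows why 4-bit loading makes the difference. The sketch below counts only the memory needed for the weights; it ignores activations, the generation cache, and quantization overhead, so real usage will be higher. The 16 GB figure is the nominal memory of a T4 and is an assumption about your runtime, and `weight_memory_gb` is an illustrative helper name.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_13b = 13e9
gpu_memory_gb = 16  # assumption: nominal memory of a T4

for label, bits in (("float32", 32), ("float16/bfloat16", 16), ("4-bit", 4)):
    needed = weight_memory_gb(params_13b, bits)
    verdict = "fits" if needed < gpu_memory_gb else "does not fit"
    print(f"{label:>17}: {needed:5.1f} GB of weights, {verdict} in {gpu_memory_gb} GB")
```

At 16 bits per parameter, the weights of a 13B model alone need about 26 GB, while at 4 bits they need about 6.5 GB, which leaves headroom on a 16 GB card.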
