How to Use Gemma LLM?

Introduction

Large language models (LLMs) are increasingly becoming powerful tools for understanding and generating human language. These models have achieved state-of-the-art results on different natural language processing tasks, including text summarization, machine translation, question answering, and dialogue generation. LLMs have even shown promise in more specialized domains, like healthcare, finance, and law.

Google has been at the forefront of LLM research and development, releasing a series of open models that have pushed the boundaries of what is possible with this technology. These models include BERT, T5, and T5X, which have been widely adopted by researchers and practitioners alike. In this Guide, we introduce Gemma, a new family of open LLMs developed by Google.

Learning Objectives

Understand Gemma’s architecture and key features.
Explore Gemma’s training process and techniques.
Evaluate Gemma’s performance across NLP benchmarks.
Learn to use Gemma for inference tasks.
Recognize the importance of responsible deployment for Gemma.

This article was published as a part of the Data Science Blogathon.

What is Gemma?

Gemma is a family of open language models based on Google’s Gemini models, trained on up to 6T tokens of text. These are considered to be the lighter versions of Gemini models. The Gemma family consists of two sizes: a 7 billion parameter model for efficient deployment on GPU and TPU, and a 2 billion parameter model for CPU and on-device applications. Gemma exhibits strong generalist capabilities in text domains and state-of-the-art understanding and reasoning skills at scale. It achieves better performance compared to other open models of similar or larger scales across different domains, including question answering, commonsense reasoning, mathematics and science, and coding. For both the models, the pre-trained, finetune checkpoints and open-source codebase for inference and serving are released by the Google Team.

Gemma builds upon recent advancements in sequence models, transformers, deep learning, and large-scale training in a distributed manner. It continues Google’s history of releasing open models and ecosystems, following Word2Vec, Transformer, BERT, T5, and T5X. The responsible release of Gemma aims to improve the safety of frontier models, provide equitable access to this technology, give the path to rigorous evaluation and analysis of current techniques, and foster the development of future innovations. However, thorough safety testing specific to each Use Case is crucial before deploying or using Gemma.

Gemma – Model Architecture

Gemma follows the architecture of a decoder-only transformer that was introduced way back in 2017. Both the Gamma 2B and the 7B models have a vocabulary size of 256k. Both models even have a context length of 8192 tokens. The Gemma even includes the recent advancements made in the transformers’ architecture including:

Multi-Query Attention: The 7B model uses multi-head attention, while the 2B model implements multi-query attention (with num_kv_heads=1). This choice is based on performance improvements that were shown at each scale through ablation studies.
RoPE Embeddings: Instead of absolute positional embeddings, both models employ rotary positional embeddings in each layer. Additionally, embedding sharing across inputs and outputs minimizes model size.
GeGLU Activations: The regular ReLU activation function is replaced by the GeGLU activation function, giving good performance.
Normalizer Location: Gemma deviates from the goto practice by normalizing both the input and output of each transformer sub-layer, using RMSNorm for the normalization method.

How was Gemma Trained?

Gemma 2B and 7B models were trained on 2T and 6T tokens, respectively, of primarily-English data sourced from Web Docs, mathematics, and code. Unlike Gemini models, which include multimodal elements and are optimized for multilingual tasks, Gemma models focus is on processing English text. The training data underwent a careful filtering process to remove Unwanted or Unsafe Content, including personal information and sensitive data. This filtering involved both heuristic methods and model-based classifiers to ensure the quality and safety of the dataset.

Gemma 2B and 7B models underwent supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to further refine their performance. The supervised fine-tuning involved a mix of text-only, English-only synthetic, and human-generated prompt-response pairs. Data mixtures for fine-tuning were carefully selected based on LM-based side-by-side evaluations, with different Prompt sets designed to highlight specific capabilities like the instruction following, factuality, creativity, and safety.

Even, synthetic data underwent several stages of filtering to remove examples containing personal information or toxic outputs, following the approach established by Gemini for improving model performance without compromising safety. Finally, reinforcement learning from human feedback involved collecting pairs of preferences from human raters and training a reward function under the Bradley-Terry model. This function was then optimized using a type of REINFORCE to further refine the models’ performance and mitigate potential issues like reward hacking.

Also Watch this Video of Google Gemma Tutorial and How to use:

Benchmarks and Performance Metrics

Looking at the results, Gemma outperforms Mistral on five out of six benchmarks, with the sole exception being HellaSwag, where they get similar accuracy. This dominance is clearly evident in tasks like ARC-c and TruthfulQA, where Gemma surpasses Mistral by nearly 2% and 2.5% in accuracy and F1 score, respectively. Even on MMLU, where Perplexity scores are lower is better, Gemma achieves a prominently lower Perplexity, indicating a better grip of language patterns. These results solidify Gemma’s position in being a powerful language model, capable of handling complex NLP tasks with good accuracy and efficiency.

Getting Started with Gemma

In this section, we will get started with Gemma. We will be working with Google Colab because it comes with a free GPU. Before we get started, we need to accept Google’s Terms and Conditions to download the model.

Step 1: Opening Gemma

Click on this link to go to Gemma on HuggingFace. You will be presented with something like the below:

Step 2: Click on Acknowledge License

If you click on Acknowledge License , then you will see a page as below.

Click on Authorize. Done we are now ready to download the model. Before, let’s generate a new HuggingFace Token. For this, you can go to the HuggingFace Settings and Generate a new Token, this token will be useful because we need it to authorize inside Google Colab to download the Google Gemma Large Language Model.

Step 3: Installing Libraries

To get started, we first need to install the following libraries.

!pip install -U accelerate bitsandbytes transformers huggingface_hub

accelerate: Allows distributed training and mixed-precision training for faster and more efficient model training. The accelerate library even helps for faster inference of the Large Language Models.
bitsandbytes: Allows quantization of model weights to 4-bit or 8-bit precision, reducing memory footprint and computation requirements. Because we are dealing with a 7Billion Parameter model, which requires around 30-40GB of GPU VRAM, we need to quantize it to fit in the Colab GPU.
transformers: Provide pre-trained language models, tokenizers, and training tools for natural language processing tasks. We work with this library to download the Gemma model and start inferring it.
huggingface_hub: Facilitates access to the Hugging Face Hub, a platform for sharing and seeing language models and datasets. We need this library to login to huggingface so that we can verify that we are authorized to download the Google Gemma Large Language Model

The -U option after the install indicates that we are fetching the latest updated versions of all the libraries.

Step 4: Typing Important Command

Now, type the below command

!huggingface-cli login

The above command will ask you to provide the HuggingFace Token, which we can get from the HuggingFace website. Give this token and press the enter button and you will receive a Login Successful message. Now let’s move on to coding

# Import necessary classes for model loading and quantization
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Configure model quantization to 4-bit for memory and computation efficiency
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# Load the tokenizer for the Gemma 7B Italian model
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")

# Load the Gemma 7B Italian model itself, with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it",
                                             quantization_config=quantization_config)

AutoTokenizer: This class dynamically loads the pre-trained tokenizer associated with the given model, ensuring compatibility and avoiding manual config.
AutoModelForCausalLM: Similar to the tokenizer, this class automatically loads the pre-trained Causal Language Model architecture based on the provided model identifier.
quantization_config = BitsAndBytesConfig(load_in_4bit=True): This line creates a config object for quantization, telling that the model’s weights should be pushed in 4-bit precision instead of the original 32-bit. This to a great extent reduces memory consumption and potentially speeds up computations, making the model more efficient for resource-constrained environments.
tokenizer = AutoTokenizer.from_pretrained(“google/gemma-7b-it”): This line loads the pre-trained tokenizer specifically designed for the “google/gemma-7b-it” model. This tokenizer knows how to break down text into separate Tokens that the model can understand and process.
model = AutoModelForCausalLM.from_pretrained(“google/gemma-7b-it”, quantization_config=quantization_config): This line loads the actual “google/gemma-7b-it” model, but with the crucial addition of the quantization_config object. This ensures that the model weights are created in the 4-bit format that we have discussed earlier, adding the benefits of quantization.

Our Gemma Large Language Model is downloaded, converted into a 4-bit quantized model, and loaded into the GPU.

Step 5: Inferencing the model

Now let’s try inferencing the model.

# Define input text:
input_text = "List the key points about Responsible AI"

# Tokenize the input text:
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

# Generate text using the model:
outputs = model.generate(
    **input_ids,  # Pass tokenized input as keyword argument
    max_length=512,  # Limit output length to 512 tokens
)

# Decode the generated text:
print(tokenizer.decode(outputs[0]))

Define Input Text: The code starts by assigning the Prompt “List the key aspects of Responsible AI” to the input_text variable.
Tokenize Input: The tokenizer object associated with the downloaded model is used to convert the text into numerical tokens that the model can understand. The return_tensors=”pt” line tells about the conversion to a PyTorch tensor for efficient GPU processing. The resulting tensor of token IDs is then moved to the GPU using to(“cuda”) if available.
Generate Text: The model.generate function is called with the tokenized input (input_ids) and a maximum output length of 512 tokens. This instructs the model to generate text based on the provided Prompt, respecting the given length limit.
Decode and Convert: The generated text, represented in the format of a sequence of token IDs, is decoded back into human-readable text using the tokenizer.decode function. Finally, the decoded text is printed out.

Step 6: Response Generation

Running the code has generated the following response

The model has generated a fair response to the query provided. It has highlighted all the key aspects that go into creating a Responsible AI. This is really a relevant and accurate answer to the question asked. Let’s the AI by asking a common sense question.

input_text = "How many eggs can a Whale lay in its lifetime?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids,max_length=512)
print(tokenizer.decode(outputs[0]))

input_text = "How many smartphones can a human eat ?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids,max_length=512)
print(tokenizer.decode(outputs[0]))

So far, so good. The model possess good common sense abilities. It is able to identify what’s wrong in the sentence and output the same, which is seen in the pics above. Let’s try asking some math questions.

input_text = "I have 3 apples and 2 oranges. I ate 2 organes. How many apples do I have?"
input_ids = tokenizer(input_text, return_tensors="pt").to('cuda')
outputs = model.generate(**input_ids,max_new_tokens=512)
print(tokenizer.decode(outputs[0]))

Seems like the model struggled to answer this simple tricky math question. Let’s try do some Prompt Engineering here. Let’s add additional info in the Prompt and run it like the below:

input_text = "I have 3 apples and 2 oranges. \
I ate 2 oranges. How many apples do I have? \
Think Step by Step. For each step, re-evaluate your answer"
input_ids = tokenizer(input_text, return_tensors="pt").to('cuda')
outputs = model.generate(**input_ids,max_new_tokens=512)
print(tokenizer.decode(outputs[0]))

Wow, a simple tweak in the Prompt and the model answered correctly. It began thinking incrementally that is step by step. And for each step, it starts re-evaluating its answer, if it’s right or wrong. And finally, it has steered to the right answer. Let’s try asking the model to write a simple Hello World program in Python.

input_text = "Write a hello world program"
input_ids = tokenizer(input_text, return_tensors="pt").to('cuda')
outputs = model.generate(**input_ids,max_new_tokens=512)
print(tokenizer.decode(outputs[0]))

Conclusion

Gemma, Google’s latest addition to its suite of open language models, presents advancement in the field of natural language processing. With its strong generalist capabilities and state-of-the-art understanding and reasoning skills, Gemma outperforms other open models across different domains including question answering, commonsense reasoning, mathematics and science, and coding tasks. Built upon recent advancements in sequence models, transformers, and large-scale training techniques, Gemma provides improved performance and efficiency, making it a powerful tool for researchers and practitioners alike. However, responsible deployment and thorough safety testing specific to each problem are compulsory before integrating Gemma into production systems.

Key Takeaways

Gemma is a family of open language models developed by Google, based on the Gemini models but lighter in scale.
It comes in two sizes: a 7 billion parameter model for GPU and TPU deployment, and a 2 billion parameter model for CPU and on-device applications.
Gemma exhibits strong generalist capabilities and excels in different domains including question answering, commonsense reasoning, mathematics and science, and coding.
The model architecture includes advancements like multi-query attention, RoPE embeddings, GeGLU activations, and RMSNorm for normalization.
Training data for Gemma underwent filtering to ensure quality, and models underwent supervised fine-tuning and reinforcement learning from human feedback.
Performance benchmarks show Gemma’s superiority over other models, mainly in tasks like ARC-c and TruthfulQA.
Getting started with Gemma involves installing necessary libraries, logging into Hugging Face, and loading the model for inference.
Gemma shows impressive capabilities in generating text, answering questions, and even writing simple programming tasks.

Frequently Asked Questions

Q1. What is Gemma?

A. Gemma is a family of open language models developed by Google, providing strong generalist capabilities and state-of-the-art understanding and reasoning skills in different domains.

Q2. How does Gemma differ from old Google models like BERT and T5?

A. Gemma builds upon recent advancements in sequence models, transformers, and large-scale training, providing improved performance and efficiency compared to old models.

Q3. What training data was used for Gemma?

A. Gemma models were trained on primarily English data sourced from Web Docs mathematics, and code, with careful filtering to remove Unwanted or Unsafe Content.

Q4. How can I get started with using Gemma?

A. You can start using Gemma by installing the necessary libraries, logging into Hugging Face, and loading the model for inference in platforms like Google Colab.

Q5. What performance benchmarks have shown Gemma’s superiority?

A. Benchmarks comparing Gemma with other models, like the Mistral, across different NLP tasks showcase Gemma’s impressive capabilities, mainly in tasks like ARC-c and TruthfulQA.

Q6. Does Gemma support multilingual tasks like the Gemini models?

A. No, Gemma models are mainly trained on processing English text and do not include multimodal elements or support multilingual tasks like the Gemini models.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.