2023 was the year of Large Language Models and open source. Many startups and companies open-sourced their models and weights to compete with proprietary LLMs such as ChatGPT and Claude. Some of the important companies and open-source models of 2023 were:
- Meta (LLaMA, Llama 2)
- TII (Falcon 7B, 40B, 180B)
- Mistral (Mistral 7B, Mixtral 8x7B)
However, a 7B model, while relatively easy and cheap to deploy, is usually not up to par with much bigger models such as 70B ones. The strongest open-source contender in this weight class was Mistral 7B, which outperformed many larger models.
Comparison of Mistral-7B | Source: Mistral.ai
However, these small models still do not respond well to natural prompts and require careful prompt engineering.
Zephyr 7B is a model created by the HuggingFace H4 (Helpful, Honest, Harmless, Huggy) team whose main goal was to create a smaller language model that is aligned with user intent and outperforms even bigger models.
Zephyr is an aligned version of Mistral-7B mainly created with the power of Distillation, and is comparable to 70B models in academic and conversational benchmarks.
Key Features
The reason behind Zephyr's outstanding performance is the following key techniques used by the H4 team:
- Self-Instruct data creation & DSFT (Distilled Supervised Fine-Tuning)
- Feedback collection
- DDPO (Distilled Direct Preference Optimization) of the DSFT model
Self-Instruct Data Creation & DSFT
Traditionally, Supervised Fine-Tuning (SFT) is performed on a Large Language Model with high-quality instruction-completion pairs. Constructing this data is costly and requires human supervision (Chung et al., 2022; Sanh et al., 2021).
One of the interesting approaches here is to use a teacher model (an already trained LLM) to generate the instructions and responses. This distillation technique was first used for Alpaca (Taori et al., 2023), which showed that a small model can approach the behavior of much larger models through Distilled Supervised Fine-Tuning.
Self-Instruct pipeline | Source: Self-Instruct paper
The H4 team followed this approach, using a teacher model to construct a high-quality supervised (instruction, completion) dataset that was then used for DSFT. (Training a student model on teacher-generated instructions and completions is the form of distillation known as DSFT: Distilled Supervised Fine-Tuning.)
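As a rough illustration of this step (not the exact H4 pipeline; the teacher checkpoint and seed instructions below are placeholders), a teacher model can be prompted with seed instructions and its completions collected into an (instruction, completion) dataset for DSFT:

import torch
from transformers import pipeline

# Hypothetical teacher model chosen purely for illustration; the actual
# dataset behind Zephyr's DSFT stage was generated with a much stronger
# proprietary teacher model.
teacher = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

seed_instructions = [
    "Explain the difference between supervised and unsupervised learning.",
    "Write a short poem about model distillation.",
]

# Collect (instruction, completion) pairs from the teacher for DSFT.
dsft_pairs = []
for instruction in seed_instructions:
    # generated_text echoes the prompt; a real pipeline would strip it
    completion = teacher(instruction, max_new_tokens=256, do_sample=True)[0]["generated_text"]
    dsft_pairs.append({"instruction": instruction, "completion": completion})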
Feedback Collection
Large Language Models are typically aligned with the help of Reinforcement Learning from Human Feedback (RLHF). Zephyr instead uses feedback from a stronger teacher model (such as GPT-4) to align the model with user intent, following the UltraFeedback approach.
UltraFeedback construction process | Source: UltraFeedback paper
The way it works is that each prompt from the SFT stage is passed to 4 different models (e.g., Claude, Llama, Falcon), and each of the 4 responses to that single prompt is scored by GPT-4. We now have a dataset of an input (x), the highest-scoring completion (yw), and a randomly chosen lower-scoring completion (yl), i.e., a triplet of (x, yw, yl).
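A minimal sketch of this triplet construction, assuming a stand-in judge function in place of GPT-4 and dummy responses from the four models:

import random

def judge_score(prompt, response):
    # Stand-in for the GPT-4 judge used in UltraFeedback; in the real
    # pipeline this would return GPT-4's quality rating for the pair.
    return random.random()

def build_triplet(prompt, responses):
    # responses: completions of the same prompt from the 4 different models
    scored = sorted(((judge_score(prompt, r), r) for r in responses), reverse=True)
    y_w = scored[0][1]                  # highest-scoring completion
    y_l = random.choice(scored[1:])[1]  # a random lower-scoring completion
    return {"prompt": prompt, "chosen": y_w, "rejected": y_l}

triplet = build_triplet(
    "Explain gradient descent in one sentence.",
    ["answer from model A", "answer from model B", "answer from model C", "answer from model D"],
)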
Preference Optimization
The goal of this last step is to have the model prefer yw (the highest-scoring completion) over yl (the lower-scoring completion). This is done using DPO (Direct Preference Optimization), which is simpler than full RLHF yet performs comparably or better. The approach in this case is known as dDPO because it uses a distilled preference dataset generated with the help of a teacher model.
The overall algorithm looks somewhat like this:
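The objective referred to as Eq 1 below is the DPO loss with the dSFT model as the frozen reference policy; reconstructed roughly from the Zephyr paper, it reads:

$$
\pi_\theta \;=\; \arg\max_{\pi}\;
\mathbb{E}_{(x,\, y_w,\, y_l)}\,
\log \sigma\!\left(
\beta \log \frac{\pi(y_w \mid x)}{\pi_{\mathrm{dSFT}}(y_w \mid x)}
\;-\;
\beta \log \frac{\pi(y_l \mid x)}{\pi_{\mathrm{dSFT}}(y_l \mid x)}
\right)
$$

Here σ is the sigmoid function, β is a hyperparameter that controls how far the policy can drift from the dSFT reference, and (x, yw, yl) are the triplets collected in the feedback step.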
In practice this translates into the following steps:
- Compute the probability for (x, yw) and (x, yl) from the dSFT model (forward-only).
- Compute the probability for (x, yw) and (x, yl) from the dDPO model.
- Compute Eq 1 and backpropagate to update; repeat.
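Here is a minimal PyTorch-style sketch of these three steps, with dummy sequence log-probabilities standing in for real model forward passes (in practice this is roughly what trl's DPOTrainer implements):

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Eq 1: negative log-sigmoid of the difference in implicit rewards,
    # where each reward is a beta-scaled log-ratio against the dSFT reference.
    chosen_reward = beta * (policy_logp_w - ref_logp_w)
    rejected_reward = beta * (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Dummy per-example sequence log-probabilities (sum of token log-likelihoods
# of the completion given the prompt) standing in for real model outputs.
batch = 4
ref_logp_w, ref_logp_l = torch.randn(batch), torch.randn(batch)   # step 1: frozen dSFT model, forward-only
policy_logp_w = torch.randn(batch, requires_grad=True)            # step 2: trainable dDPO policy
policy_logp_l = torch.randn(batch, requires_grad=True)

loss = dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l)  # step 3: Eq 1
loss.backward()                                                         # backpropagate and repeat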
The base model Zephyr built on is Mistral-7B, which was the state-of-the-art open-source model at the time of release. The team used the TRL library for fine-tuning and alignment, with DeepSpeed ZeRO-3 and FlashAttention-2 to optimize and speed up training and fully utilize the GPUs. The models were trained with the AdamW optimizer and no weight decay. All experiments were run on 16 A100s using bfloat16 precision and typically took 2–4 hours to complete. You can refer to the original paper for in-depth details on the training procedure of Zephyr.
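As a hedged sketch of what such a training configuration might look like with the Hugging Face stack (the hyperparameters, output directory, and DeepSpeed config path below are illustrative, not the paper's exact values):

from transformers import TrainingArguments

# Illustrative settings mirroring the description above: bfloat16 precision,
# AdamW with no weight decay, DeepSpeed ZeRO-3 (config path is a placeholder).
training_args = TrainingArguments(
    output_dir="zephyr-7b-dpo",
    bf16=True,
    optim="adamw_torch",
    weight_decay=0.0,
    per_device_train_batch_size=2,
    num_train_epochs=1,
    logging_steps=10,
    # deepspeed="ds_config_zero3.json",  # enable ZeRO-3 by pointing to a DeepSpeed config
)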
By combining these techniques, the Zephyr team matched the performance of 40B-scale models with just 7B parameters on academic benchmarks, and of 70B-scale models on chat benchmarks.
Zephyr models are publicly available on Hugging Face and can be used like any other language model.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-alpha",  # can also use the beta model
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# We use the tokenizer's chat template to format each message
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
Output:
<|system|>
You are a friendly chatbot who always responds in the style of a pirate.
<|user|>
How many helicopters can a human eat in one sitting?
<|assistant|>
Ah, me hearty matey! But yer question be a puzzler! A human cannot eat a helicopter in one sitting, as helicopters are not edible. They be made of metal, plastic, and other materials, not food!
Zephyr-7B is a small model that demonstrates the power of distilling alignment from a larger LLM into a smaller one. The resulting model, Zephyr-7B, based on Mistral-7B, set a new state of the art for 7B-parameter chat models and even outperformed Llama 2-Chat-70B on MT-Bench.
References
- Zephyr: Direct Distillation of LM Alignment (https://arxiv.org/abs/2310.16944)
- HuggingFace Zephyr blog (https://huggingface.co/blog/Isamu136/understanding-zephyr)
- Self-Instruct paper (https://arxiv.org/abs/2212.10560)
- UltraFeedback paper (https://arxiv.org/abs/2310.01377)
Ahmad Anis is a passionate machine learning engineer and researcher currently working at redbuffer.ai. Beyond his day job, Ahmad actively engages with the machine learning community. He serves as a regional lead for Cohere for AI, a nonprofit dedicated to open science, and is an AWS community builder. Ahmad is an active contributor at Stack Overflow, where he has 2300+ points. He has contributed to many famous open-source projects, including Shap-E by OpenAI.