A comparison of the accuracy and response time of these two models in a RAG question-answering setup.

Generated using Canva as prompted by author

With the introduction of Mistral 7B, an open-source language model from the French startup Mistral, the impressive performance demonstrated by proprietary models such as ChatGPT and claude.ai became available to the open-source community as well. Its quantized versions have since been shown to maintain strong performance, making it feasible to run the model on resource-constrained systems.

Even though the 2-bit quantized Mistral 7B model passed the accuracy test with flying colors in our earlier study, it took around 2 minutes on average to answer a question on a Mac. Enter TinyLlama [1], a compact 1.1B-parameter language model pretrained on 3 trillion tokens with the same architecture and tokenizer as Llama 2, aimed at more resource-constrained environments.

In this article, we will compare the question-answering accuracy and response time of quantized Mistral 7B and quantized TinyLlama 1.1B in an ensemble Retrieval-Augmented Generation (RAG) setup.

Contents
Enabling Technologies
System Architecture
Environment Setup
Implementation
Results and Discussions
Final Thoughts

This test will be conducted on a MacBook Air M1 with 8GB of RAM. Because of its limited compute and memory resources, we adopt quantized versions of these LLMs. In essence, quantization represents the model’s parameters with fewer bits, effectively compressing the model. This compression reduces memory usage, speeds up execution, and improves energy efficiency, at the cost of some accuracy. We will be using the 2-bit quantized Mistral 7B Instruct and the 5-bit quantized TinyLlama 1.1B Chat models in the GGUF format for this study. GGUF is a binary format designed for fast loading and saving of models. To load such a GGUF model, we will use the llama-cpp-python library, a Python binding for llama.cpp.
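As a minimal sketch of how a GGUF model can be loaded with llama-cpp-python, consider the snippet below; the file path, context size, and generation parameters are illustrative assumptions, not the exact settings used in this study.

```python
from llama_cpp import Llama

# Hypothetical local path to a downloaded GGUF file (e.g. a 2-bit Mistral 7B Instruct quant).
llm = Llama(
    model_path="./models/mistral-7b-instruct.Q2_K.gguf",
    n_ctx=2048,      # context window size
    n_threads=4,     # CPU threads to use on the M1
    verbose=False,
)

# Simple completion call; a RAG pipeline would prepend retrieved context to the prompt.
response = llm(
    "Q: What is Retrieval-Augmented Generation? A:",
    max_tokens=128,
    stop=["Q:"],
)
print(response["choices"][0]["text"])
```

The same loading pattern applies to the TinyLlama 1.1B Chat GGUF file; only the model path changes.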