Whisper is an open-source speech-to-text model released by OpenAI. It comes in five sizes (tiny, base, small, medium, and large), with English-only variants of all but the largest, so you can choose the accuracy-efficiency tradeoff that suits your application. Whisper is an end-to-end encoder-decoder transformer: input audio is split into 30-second chunks, and each chunk is converted into a log-Mel spectrogram before being passed to the encoder. The network is trained on multiple speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.
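To make the 30-second chunking concrete, here is a rough sketch of the arithmetic involved. The constants (16 kHz sample rate, 160-sample hop, 80 Mel bins) are taken from the original Whisper reference implementation rather than from this project, and the helper `split_into_chunks` is a hypothetical name used only for illustration. Whisper's real transcription loop advances its window based on predicted timestamps, so treat this fixed split as a simplification.

```python
# Assumed constants from the original Whisper release (not from this project).
SAMPLE_RATE = 16_000   # Hz; audio is resampled to 16 kHz before processing
CHUNK_SECONDS = 30     # length of each input window
HOP_LENGTH = 160       # samples between spectrogram frames (10 ms)
N_MELS = 80            # Mel frequency bins per frame

CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS     # samples per 30-second window
FRAMES_PER_CHUNK = CHUNK_SAMPLES // HOP_LENGTH  # spectrogram frames per window


def split_into_chunks(samples):
    """Split a sequence of audio samples into 30-second windows.

    The final window is zero-padded so every chunk has the same length.
    This is a simplified, fixed-stride illustration (hypothetical helper).
    """
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = list(samples[start:start + CHUNK_SAMPLES])
        chunk.extend([0.0] * (CHUNK_SAMPLES - len(chunk)))  # pad the tail
        chunks.append(chunk)
    return chunks
```

Under these assumptions, each chunk becomes a spectrogram of shape `(N_MELS, FRAMES_PER_CHUNK)`, and a 45-second recording yields two chunks, the second half-filled with silence.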
For this project, two walkie-talkie buttons are available to the user: one sends their general English-language questions to the bot through the lighter, faster “base” model, and the other deploys the larger “medium” multilingual model, which can distinguish between dozens of languages and transcribes statements accurately when they are pronounced correctly. In the context of language learning, this pushes the user to focus closely on their pronunciation, which can accelerate the learning process. A chart of the available Whisper models is shown below:
There are several highly useful open-source language model interfaces, each catering to different use cases with varying levels of complexity for setup and use. Among the most widely known are the oobabooga text-gen webui, which arguably offers the most flexibility and under-the-hood control; llama.cpp, which originally focused on optimized deployment of quantized models on smaller CPU-only devices but has since expanded to serve other hardware types; and the streamlined interface chosen for this project (built on top of llama.cpp): Ollama.
Ollama focuses on simplicity and efficiency. It runs in the background and can serve multiple models simultaneously on small hardware, quickly shifting models in and out of memory as needed to serve requests. Rather than focusing on lower-level tools like fine-tuning, Ollama excels at simple installation, an efficient runtime, a broad spread of ready-to-use models, and tools for importing pretrained model weights. This focus makes Ollama a natural choice for the LLM interface in a project like LingoNaut: the user does not need to remember to close their session to free up resources, because Ollama manages this automatically in the background when the app is not in use. Further, the ready access to performant, quantized models in the library suits frictionless development of LLM applications like LingoNaut.
While Ollama is not technically built for Windows, it is easy for Windows users to install it on Windows Subsystem for Linux (WSL), then communicate with the server from their Windows applications. With WSL installed, open a Linux terminal and enter the one-liner Ollama installation command. Once the installation finishes, simply run “ollama serve” in the Linux terminal, and you can then communicate with your Ollama server from any Python script on your Windows machine.
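As a minimal sketch of what talking to the server from a Python script can look like, the snippet below posts a prompt to Ollama's REST endpoint using only the standard library. It assumes the server is reachable at the default address `http://localhost:11434` (which relies on WSL forwarding localhost to Windows) and uses `"mistral"` purely as a placeholder model name; substitute whichever model you have pulled. The function names `build_request` and `ask_ollama` are hypothetical helpers, not part of any library.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default port on this machine.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="mistral"):
    """Build the HTTP request for a single, non-streaming completion.

    "mistral" is only a placeholder; use a model you have already pulled.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(prompt, model="mistral", timeout=120):
    """Send the prompt to the server and return the generated text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]  # the full reply when streaming is disabled
```

With the server running, something like `print(ask_ollama("How do I say 'good morning' in Spanish?"))` should print the model's answer. Setting `"stream"` to `False` keeps the example simple; a voice assistant might prefer streaming so that speech synthesis can begin before the full reply arrives.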
Coqui.ai 🐸 TTS
TTS is a fully loaded text-to-speech library available for non-commercial use, with paid commercial licenses available. The library has become notably popular, with 3k forks and 26.6k stars on GitHub at the time of this writing, and it’s clear why: it works like the Ollama of the text-to-speech space, providing a unified interface to a diverse array of performant models that cover a variety of use cases (for example, a multi-speaker, multilingual model for this project). It also offers exciting features such as voice cloning, along with controls over the speed and emotional tone of the generated speech.
The TTS library provides an extensive selection of text-to-speech models, including the Fairseq models from Facebook Research’s Massively Multilingual Speech (MMS) project. For LingoNaut, the Coqui.ai team’s own XTTS model turned out to be the right choice, as it generates high-quality speech in multiple languages seamlessly. Although the model does have a “language” input parameter, I found that even leaving this set to “en” for English and simply passing text in other languages still results in faithful multilingual generation with mostly correct pronunciations.
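A hedged sketch of how this might look with the library's Python API follows. The model identifier `tts_models/multilingual/multi-dataset/xtts_v2` is an assumption (the text above only says "XTTS" without a version), `speaker_wav` is a short reference recording of the voice to imitate, and `build_tts_kwargs` and `speak_to_file` are hypothetical helper names. The `TTS` import is deferred so the module loads even where the package is not installed.

```python
# Assumption: the XTTS v2 checkpoint; the article does not name a version.
XTTS_MODEL = "tts_models/multilingual/multi-dataset/xtts_v2"


def build_tts_kwargs(text, speaker_wav, out_path, language="en"):
    """Collect the arguments passed to the synthesis call.

    language defaults to "en", mirroring the observation that non-English
    text is still pronounced mostly correctly with this setting.
    """
    return {
        "text": text,
        "speaker_wav": speaker_wav,  # reference clip of the target voice
        "language": language,
        "file_path": out_path,
    }


def speak_to_file(text, speaker_wav, out_path="reply.wav", language="en"):
    """Synthesize `text` to a WAV file and return the output path."""
    from TTS.api import TTS  # lazy import: requires the Coqui TTS package

    tts = TTS(XTTS_MODEL)  # downloads the model weights on first use
    tts.tts_to_file(**build_tts_kwargs(text, speaker_wav, out_path, language))
    return out_path
```

Loading the model on every call is slow, so a real application would likely create the `TTS` object once at startup and reuse it for each reply.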