A few personal lessons learned from developing LLM applications

Source: DALL·E 3, prompted with “Operationalizing LLMs, watercolor”

It’s been fun posting articles exploring new Large Language Model (LLM) techniques and libraries as they emerge, but most of my time has been spent behind the scenes working on the operationalization of LLM solutions. Many organizations are working on this right now, so I thought I’d share a few quick thoughts about my journey so far.

It’s beguilingly easy to throw up a quick demo to showcase some of the amazing capabilities of LLMs, but anybody who is tasked with putting them in front of users with the hope of having a discernible impact soon realizes there’s a lot of work required to tame them. Below are some of the key areas that most organizations might need to consider.

Some of the key areas that should be considered before launching applications that use Large Language Models (LLMs).

The list isn’t exhaustive (see also Kaddour et al. 2023), and which of the above apply to your application will of course vary, but even solving for safety, performance, and cost can be a daunting prospect.

So what can we do about it?

There is much concern about the safe use of LLMs, and quite right too. Trained on human output, they suffer from many of the less favorable aspects of the human condition, and being so convincing in their responses raises new issues around safety. However, the risk profile is not the same for all cases; some applications are much safer than others. Asking an LLM to provide answers directly from its training data offers more potential for hallucination and bias than a low-level technical use of an LLM to predict metadata. This is an obvious distinction, but worth considering for anybody about to build LLM solutions: starting with low-risk applications is a sensible first step and reduces the amount of work required for launch.

How LLMs are used influences how risky it is to use them

We live in incredibly exciting times with so many rapid advances in AI coming out each week, but it sure makes building a roadmap difficult! Several times in the last year a new vendor feature, open-source model, or Python package has been released which has changed the landscape significantly. Figuring out which techniques, frameworks, and models to use such that LLM applications maintain value over time is challenging. There is no point in building something fabulous only to have its capabilities natively supported for free or at very low cost in the next 6 months.

Another key consideration is to ask whether an LLM is actually the best tool for the job. With all of the excitement in the last year, it’s easy to get swept away and “LLM the heck” out of everything. As with any new technology, using it just for the sake of using it is often a big mistake, and as the LLM hype settles we may find our snazzy app becomes obsolete with real-world usage.

That said, there is no doubt that LLMs can offer some incredible capabilities, so if you are forging ahead, here are some ideas that might help …

In web design there is the concept of mobile-first: develop web applications that work on less functional phones and tablets first, then figure out how to make things work nicely on more flexible desktop browsers. Doing things this way around can sometimes be easier than the converse. A similar idea can be applied to LLM applications: where possible, try to develop them so that they work with cheaper, faster, less powerful models from the outset, such as GPT-3.5-turbo instead of GPT-4. These models are a fraction of the cost and will often force the design process towards more elegant solutions that break the problem down into simpler parts, with less reliance on monolithic, lengthy prompts to expensive and slow models.

Of course, this isn’t always feasible and those advanced LLMs exist for a reason, but many key functions can be supported with less powerful LLMs, such as simple intent classification, planning, and memory operations. It may also be the case that careful design of your workflows can open the possibility of different streams, where some use less powerful LLMs and others more powerful ones (I’ll be doing a later blog post on this).

Down the road when those more advanced LLMs become cheaper and faster, you can then swap out the more basic LLMs and your application may magically improve with very little effort!
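One way to keep that later swap cheap is to name models in a single place, keyed by task, rather than scattering model names through the code. The following is only a minimal sketch: the task names, the `MODEL_FOR_TASK` table, and the `model_for` helper are hypothetical, and the assignment of tasks to models is an assumption for illustration.

```python
# Hypothetical routing table: each task names the cheapest model assumed to handle it.
MODEL_FOR_TASK = {
    "intent_classification": "gpt-3.5-turbo",
    "planning": "gpt-3.5-turbo",
    "memory": "gpt-3.5-turbo",
    "final_answer": "gpt-4",  # assumed to need the more capable model
}

DEFAULT_MODEL = "gpt-3.5-turbo"


def model_for(task: str) -> str:
    """Return the model configured for a task, falling back to the cheap default."""
    return MODEL_FOR_TASK.get(task, DEFAULT_MODEL)
```

Upgrading a task to a newer model then becomes a one-line change to the table, with no changes to the calling code.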

It is a good software engineering approach to use a generic interface where possible. For LLMs, this can mean using a service or Python module that presents a fixed interface that can interact with multiple LLM providers. A great example is LangChain, which offers integration with a wide range of LLMs. By using LangChain to communicate with LLMs from the outset, rather than native LLM APIs, we can swap out different models in the future with minimal effort.

Another example of this is to use AutoGen for agents, even if using OpenAI Assistants. That way, as other native agents become available, your application can be adjusted more easily than if you had built a whole process around OpenAI’s native implementation.
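The underlying idea can be sketched without any particular library. The example below does not use the LangChain or AutoGen APIs; `ChatModel`, `EchoModel`, `UppercaseModel`, and `summarize` are hypothetical names, and the two backends are stand-ins where a real implementation would call a provider SDK.

```python
from typing import Protocol


class ChatModel(Protocol):
    """Hypothetical minimal interface that the application codes against."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in backend for local testing; a real one would call a provider SDK."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class UppercaseModel:
    """A second stand-in backend, to show swapping without touching callers."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def summarize(model: ChatModel, text: str) -> str:
    """Application code depends only on the interface, not on a vendor."""
    return model.complete(f"Summarize: {text}")
```

Because `summarize` only knows about `complete`, changing providers means writing one new backend class rather than editing every call site.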

A common pattern with LLM development is to break down the workflow into a chain of conditional steps using frameworks such as promptflow. Chains are well-defined, so we know, more or less, what’s going to happen in our application. They are a great place to start and have a high degree of transparency and reproducibility. However, they don’t support fringe cases well. That’s where groups of autonomous LLM agents can work well, as they are able to iterate towards a solution and recover from errors (most of the time). The issue with these is that, for now at least, agents can be a bit slow due to their iterative nature, expensive due to LLM token usage, and have a tendency to be a bit wild at times and fail spectacularly. They are likely the future of LLM applications though, so it’s a good idea to prepare even if not using them in your application right now. By building your workflow as a modular chain, you are in fact doing just that! Individual nodes in the workflow can be swapped out to use agents later, providing the best of both worlds when needed.

It should be noted there are some limitations with this approach; for example, streaming of the LLM response becomes more complicated. Depending on your use case, though, the benefits may outweigh these challenges.

Linking together steps in an LLM workflow with Promptflow. This has several advantages, one being that steps can be swapped out with more advanced techniques in the future.
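The swappable-node idea can be illustrated in plain Python. This is not promptflow’s API, only a sketch under simple assumptions: every step is a function from a state dictionary to a new state dictionary, and all step names (`classify_intent`, `answer`, `agent_answer`) and their placeholder logic are hypothetical.

```python
from typing import Callable, Dict, List

State = Dict[str, str]
Step = Callable[[State], State]


def classify_intent(state: State) -> State:
    # Placeholder rule; in practice this node might call a small LLM.
    intent = "question" if state["input"].strip().endswith("?") else "statement"
    return {**state, "intent": intent}


def answer(state: State) -> State:
    # Placeholder response step; could later be replaced by an agent-backed step.
    return {**state, "output": f"[{state['intent']}] {state['input']}"}


def agent_answer(state: State) -> State:
    # Hypothetical drop-in replacement with the same signature as `answer`.
    return {**state, "output": f"agent handled: {state['input']}"}


def run_chain(steps: List[Step], state: State) -> State:
    """Run each step in order, passing the accumulated state along."""
    for step in steps:
        state = step(state)
    return state
```

Swapping `answer` for `agent_answer` in the list passed to `run_chain` changes one node while the rest of the chain stays as it was.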

It is truly amazing to watch AutoGen agents and OpenAI Assistants generating code and automatically debugging it to solve tasks; to me it feels like the future. It also opens up amazing opportunities such as LLMs as Tool Makers (LATM, Cai et al. 2023), where your application can generate its own tools. That said, from my personal experience so far, code generation can be a bit wild. Yes, it’s possible to optimize prompts and implement a validation framework, but even if that generated code runs perfectly, is it right when solving new tasks? I have come across many cases where it isn’t, and the errors are often quite subtle to catch: the wrong scale on a graph, summing across the wrong elements in an array, or retrieving slightly the wrong data from an API. I think this will change as LLMs and frameworks advance, but right now I would be very cautious about letting LLMs generate code on the fly in production, and would instead opt for some human-in-the-loop review.

There are of course many use cases that absolutely require an LLM. But to ease into things, it might make sense to choose applications where the LLM adds value to the process rather than being the process. Imagine a web app that presents data to a user and is already useful in its own right. That application could be enhanced with LLM features for finding and summarizing that data. By placing slightly less emphasis on using LLMs, the application is less exposed to issues arising from LLM performance. Stating the obvious of course, but it’s easy to dive into generative AI without first taking baby steps.

Prompting LLMs incurs costs and can result in a poor user experience as users wait for slow responses. In many cases, the prompt is similar or identical to one previously made, so it’s useful to be able to remember past activity for reuse without having to call the LLM again. Some great packages exist, such as MemGPT and GPTCache, which use document embedding vector stores to persist ‘memories’. This is the same technology used for common RAG document retrieval; memories are just chunked documents. The slight difference is that frameworks like MemGPT do some clever things to use an LLM to self-manage memories.
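The simplest form of this reuse can be sketched with the standard library alone. Note that this is a deliberately reduced version of the idea: it only catches prompts that are identical after normalizing case and whitespace, whereas the packages mentioned above use embeddings to match prompts that are merely similar. `PromptCache` and its methods are hypothetical names, and `llm` is assumed to be any callable that takes a prompt and returns text.

```python
import hashlib
from typing import Callable, Dict


class PromptCache:
    """Exact-match cache keyed on a normalized prompt (no embeddings involved)."""

    def __init__(self, llm: Callable[[str], str]) -> None:
        self._llm = llm
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Collapse whitespace and case so trivially different prompts share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def ask(self, prompt: str) -> str:
        """Return a cached response if one exists; otherwise call the LLM and store it."""
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._llm(prompt)
        self._store[key] = response
        return response
```

The `hits` and `misses` counters give a rough view of how much repeated traffic the application sees, which can help decide whether a similarity-based cache is worth adding.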

You may find, however, that due to a specific use case you need some form of custom memory management. In this scenario, it’s sometimes useful to be able to view and manipulate memory records without having to write code. A powerful tool for this is pgvector, which combines vector store capabilities with the Postgres relational database for querying, making it easy to understand the metadata stored with memories.

At the end of the day, whether your application uses LLMs or not, it is still a software application and so will benefit from standard engineering techniques. One obvious approach is to adopt test-driven development. This is especially important with LLMs provided by vendors, to control for the fact that the performance of those LLMs may vary over time, something you will need to quantify for any production application. Several validation frameworks exist; again, promptflow offers some straightforward validation tools and has native support in Azure AI Studio. There are other testing frameworks out there; the point is to use one from the start for a strong foundation in validation.

That said, it should be noted that LLMs are not deterministic, providing slightly different results each time depending on the use case. This has an interesting effect on tests in that the expected result isn’t set in stone. For example, testing that a summarization task is working as required can be challenging because the summary will vary slightly each time. In these cases, it’s often useful to use another LLM to evaluate the application LLM’s output. Metrics such as Groundedness, Relevance, Coherence, Fluency, GPT Similarity, and ADA Similarity can be applied; see, for example, Azure AI Studio’s implementation.
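A test for non-deterministic output can assert on a score rather than on an exact string. The sketch below shows only the shape of such a test: `passes_evaluation` is a hypothetical helper, the 0.7 threshold is arbitrary, and `word_overlap_judge` is a toy stand-in for the evaluating LLM, not an implementation of any of the metrics named above.

```python
from typing import Callable


def passes_evaluation(
    source: str,
    summary: str,
    judge: Callable[[str, str], float],
    threshold: float = 0.7,  # arbitrary cut-off, chosen for illustration
) -> bool:
    """Accept the output if the judge's score meets the threshold."""
    return judge(source, summary) >= threshold


def word_overlap_judge(source: str, summary: str) -> float:
    """Toy stand-in for an LLM judge: fraction of summary words found in the source."""
    source_words = set(source.lower().split())
    summary_words = summary.lower().split()
    if not summary_words:
        return 0.0
    return sum(word in source_words for word in summary_words) / len(summary_words)
```

In a real test suite the `judge` argument would wrap a call to an evaluating model, and the threshold would be tuned against examples that humans have already rated.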

Once you have a set of amazing tests that confirm your application is working as expected, you can incorporate them into a DevOps pipeline, for example running them in GitHub Actions before your application is deployed.

No one size fits all of course, but for smaller organizations implementing LLM applications, developing every aspect of the solution may be a challenge. It might make sense to focus on the business logic and work closely with your users while using enterprise tools for areas such as LLM safety rather than developing them yourself. For example, Azure AI Studio has some great features that enable various safety checks on LLMs with a click of a button, as well as easy deployment to API endpoints with integrated monitoring and safety. Other vendors such as Google have similar offerings.

There is of course a cost associated with features like this, but it may be well worth it as developing them is a significant undertaking.

Azure AI Content Safety Studio is a great example of a cloud vendor solution to ensure your LLM application is safe, with no associated development effort

LLMs are far from perfect, even the most powerful ones, so any application using them must have a human in the loop to ensure things are working as expected. For this to be effective, all interactions with your LLM application must be logged and monitoring tools must be in place. This is of course no different from any well-managed production application; the difference is that new types of monitoring are needed to capture performance and safety issues.
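Logging every interaction can start as a thin wrapper around whatever function calls the model. This is a minimal sketch under stated assumptions: `logged` is a hypothetical helper, `llm` is any callable from prompt to response, and the in-memory `sink` list stands in for a log file or monitoring service.

```python
import json
import time
from typing import Callable, List


def logged(llm: Callable[[str], str], sink: List[str]) -> Callable[[str], str]:
    """Wrap an LLM callable so every call appends one JSON record to `sink`."""

    def wrapper(prompt: str) -> str:
        start = time.perf_counter()
        response = llm(prompt)
        record = {
            "timestamp": time.time(),
            "prompt": prompt,
            "response": response,
            "latency_seconds": time.perf_counter() - start,
        }
        sink.append(json.dumps(record))
        return response

    return wrapper
```

One JSON record per call keeps the log easy to filter later, for example to pull out slow responses or to sample outputs for human review.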

Another key role humans can play is to correct and improve the LLM application when it makes mistakes. As mentioned above, the ability to view the application’s memory can help, especially if the human can make adjustments to the memory, working with the LLM to provide end-users with the best experience. Feeding this modified data back into prompt tuning or LLM fine-tuning can be a powerful tool in improving the application.

The above thoughts are by no means exhaustive for operationalizing LLMs and may not apply to every scenario, but I hope they might be useful for some. We are all on an amazing journey right now!

Challenges and Applications of Large Language Models, Kaddour et al., 2023

Large Language Models as Tool Makers, Cai et al., 2023

Unless otherwise noted, all images are by the author
