AI adoption in the enterprise is no longer just a trend; it’s a necessity for staying competitive. Generative AI applications are now reaching production for use cases including customer support, content generation, summarization, and more. Klarna is a prominent example: its AI assistant in production handles more than two-thirds of its support volume, doing the work of 700 full-time agents.
Gaining the required confidence to deploy these apps at scale can be challenging, and structured evaluation has gained recognition as a key requirement on the path from science experiment to customer value. Evaluation frameworks can play a critical role in this journey by allowing developers to run experiments faster and gain systematic validation for production readiness. Connecting such an evaluation framework with a scaled observability platform brings confidence in production. Let’s explore five practical steps to move LLM applications from early prototypes to scaled, production applications.
1. Identify success metrics
To advance an AI project, we first need to decide which metrics will define its success. Commonly used metrics today include accuracy and response style, and these are judged, at least initially, by human evaluators: either the developers themselves or subject matter experts.
Once we’ve established the success metrics for the AI application and scored the first version, we can identify early failure modes. We can start to answer questions like, “For which user queries is my application missing information that it needs to complete the answer?” and “When asked off-topic questions, does the AI stay in its lane?”
This is the point where AI projects often stall. With a round of human evaluation between each iteration, progress on these metrics can be slow. Developers also often lack the tracing required to find the root cause of failures. At this stage, it can be useful to turn to an AI evaluation tool to address these challenges.
2. Validate AI evals
In many ways, AI evaluation tools themselves are simple AI applications. These evaluation tools take the input, output, or intermediate traces, and provide a performance score. To gain trust in the application, we must first trust the evaluation process.
We can gain a measure of confidence from benchmarks of these evaluators on public data, but for many organizations the evaluation required is nuanced. To gain confidence in a specific domain, it’s often useful to establish alignment between the AI evaluation tool and human subject matter experts (SMEs). In some cases, this alignment process leads to further tuning of the evaluators themselves.
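One way to quantify this alignment is to have SMEs and the AI evaluator label the same sample of responses and then measure how often they agree. The sketch below is a minimal illustration, not a prescribed method: it assumes categorical labels such as "pass" and "fail", and the function names are hypothetical. It reports raw agreement alongside Cohen's kappa, which corrects for the agreement two raters would reach by chance.

```python
from collections import Counter


def agreement_rate(human_labels, evaluator_labels):
    """Fraction of items on which the human and the AI evaluator agree."""
    if len(human_labels) != len(evaluator_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(h == e for h, e in zip(human_labels, evaluator_labels))
    return matches / len(human_labels)


def cohens_kappa(human_labels, evaluator_labels):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    observed = agreement_rate(human_labels, evaluator_labels)
    n = len(human_labels)
    human_counts = Counter(human_labels)
    evaluator_counts = Counter(evaluator_labels)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own observed rates.
    expected = sum(
        human_counts[label] * evaluator_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative labels only (not real data).
sme_labels = ["pass", "pass", "fail", "fail", "pass", "fail"]
ai_labels = ["pass", "pass", "fail", "pass", "pass", "fail"]
print("agreement:", agreement_rate(sme_labels, ai_labels))
print("kappa:", cohens_kappa(sme_labels, ai_labels))
```

A low kappa on a domain-specific sample is a signal that the evaluator may need further tuning before its scores are trusted.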
3. Accelerate experimentation
Once trust has been established in an AI evaluator, experimentation can be dramatically accelerated. Developers can quickly adjust data inputs, try different retrievers, and experiment with different prompting strategies, and then get fast feedback through their evaluation metrics.
For example, a company using the TruLens open-source testing and evaluation tool improved app metrics, such as relevance and groundedness, by up to 50% and reduced iteration time from 2 weeks to 2 hours.
This accelerated experimentation cycle lets teams reach confidence that the AI app is ready for production use much more quickly.
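As a rough illustration of this feedback loop, the sketch below ranks application variants by their average evaluation scores. It assumes each variant has already been run over a shared test set and scored by the evaluators; the variant names, metric names, and the `summarize_experiments` helper are hypothetical and not part of any specific tool.

```python
from statistics import mean


def summarize_experiments(results):
    """Rank variants by overall mean score.

    `results` maps a variant name to a list of per-example records,
    where each record maps a metric name to a score in [0, 1].
    Returns (variant, per-metric means, overall mean) tuples, best first.
    """
    summary = []
    for variant, records in results.items():
        if not records:
            raise ValueError(f"variant {variant!r} has no scored examples")
        metric_names = sorted({name for record in records for name in record})
        per_metric = {
            name: mean(r[name] for r in records if name in r)
            for name in metric_names
        }
        overall = mean(per_metric.values())
        summary.append((variant, per_metric, overall))
    return sorted(summary, key=lambda row: row[2], reverse=True)


# Illustrative scores only (not real data).
results = {
    "baseline_prompt": [
        {"relevance": 0.5, "groundedness": 0.5},
        {"relevance": 0.75, "groundedness": 0.25},
    ],
    "new_retriever": [
        {"relevance": 0.75, "groundedness": 0.75},
        {"relevance": 1.0, "groundedness": 0.5},
    ],
}
for variant, per_metric, overall in summarize_experiments(results):
    print(variant, per_metric, round(overall, 3))
```

Averaging metrics into one number is a simplification; in practice a team may prefer to compare metrics individually so that a gain in one does not hide a regression in another.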
4. Monitor evaluations at production scale
Deploying AI models into production is just the beginning. To maintain confidence in the AI app’s quality, we can continuously monitor its performance. Just as in the development phase, AI evaluations can be leveraged to assess this performance over time.
Real-time insights into performance across the evaluation suite make it possible to quickly identify the root cause of an issue and how to fix it. After making the required adjustments, new versions can be tested and evaluated in shadow or canary modes before full deployment.
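A simple form of this monitoring is to track a rolling average of an evaluation score and flag when it drops below an agreed level. The sketch below assumes one score per production response; the `RollingScoreMonitor` class, window size, and threshold are hypothetical choices for illustration, and a real deployment would typically send these scores to an observability platform instead.

```python
from collections import deque


class RollingScoreMonitor:
    """Tracks the mean of the most recent evaluation scores."""

    def __init__(self, window_size, alert_threshold):
        if window_size < 1:
            raise ValueError("window_size must be at least 1")
        self.alert_threshold = alert_threshold
        # deque with maxlen drops the oldest score automatically.
        self.scores = deque(maxlen=window_size)

    def rolling_mean(self):
        if not self.scores:
            raise ValueError("no scores recorded yet")
        return sum(self.scores) / len(self.scores)

    def is_degraded(self):
        # Only alert once the window is full, to avoid noisy early alerts.
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and self.rolling_mean() < self.alert_threshold

    def record(self, score):
        """Add a score and report whether quality now looks degraded."""
        self.scores.append(score)
        return self.is_degraded()


# Illustrative stream of groundedness scores (not real data).
monitor = RollingScoreMonitor(window_size=3, alert_threshold=0.5)
for score in [1.0, 1.0, 1.0, 0.0, 0.0]:
    if monitor.record(score):
        print("alert: rolling mean is", round(monitor.rolling_mean(), 3))
```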
5. To minimize risk, add guardrails and lower them over time
For high-stakes applications, such as those in healthcare or finance, minimizing risk to the enterprise is critical. In these cases, AI evaluations can be moved into the production path to guard against low-quality responses. While this can come at the cost of increased latency, checking the AI’s response against established thresholds helps ensure that only high-quality responses reach the end user. Many of the first enterprise use cases to bring AI to production take this approach. Combined with production monitoring, these guardrails can be lowered over time as the application improves.
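The guardrail pattern described here can be sketched as a check that runs between the application and the user. The example below is a minimal illustration under stated assumptions: each evaluator is a hypothetical callable that returns a score in [0, 1], and the threshold values and fallback message are placeholders. In a real system the evaluators would be calls to an evaluation tool, which is where the added latency comes from.

```python
def guarded_response(response, evaluators, thresholds, fallback):
    """Return the response only if every evaluator meets its threshold.

    `evaluators` maps a metric name to a callable scoring the response.
    `thresholds` maps the same metric names to minimum acceptable scores.
    Returns (text shown to the user, name of the failed metric or None).
    """
    for name, evaluate in evaluators.items():
        score = evaluate(response)
        if score < thresholds[name]:
            return fallback, name
    return response, None


# Hypothetical stand-in evaluator: a real one would call an evaluation model.
def toy_groundedness(text):
    return 0.9 if "according to the policy document" in text else 0.2


evaluators = {"groundedness": toy_groundedness}
thresholds = {"groundedness": 0.7}  # can be lowered as monitoring shows improvement
fallback = "I'm not able to answer that reliably. Let me connect you to an agent."

print(guarded_response(
    "Refunds take 5 days, according to the policy document.",
    evaluators, thresholds, fallback,
))
print(guarded_response("Refunds are instant.", evaluators, thresholds, fallback))
```

Keeping thresholds in configuration, separate from the evaluators, makes it straightforward to relax them gradually as production monitoring builds confidence.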
The mindset of continuous improvement
Accelerating AI adoption in the enterprise requires a pragmatic approach and a commitment to ongoing improvement. By following these five steps, enterprises still stuck in the experimentation phase can find the path to validation and unlock the benefits of AI that await in production.
About the Author
Josh Reini, Developer Relations Data Scientist, TruEra. Josh is the founding Developer Relations Data Scientist at TruEra where he is responsible for growing a thriving community of AI Quality practitioners. In this pursuit, Josh evangelizes best practices in trustworthy machine learning through the development of new applications and extensions, the creation of educational content and working directly with data scientists to implement these practices into their machine learning systems.
Prior to TruEra, Josh delivered end-to-end data and machine learning solutions to clients including the Department of State and the Walter Reed National Military Medical Center. During his time at Walter Reed, he was published in the Journal of Telemedicine and e-Health as the lead statistician for a clinical trial involving a novel heart rate device. Josh also worked in product management at Geico and has a Master’s degree in Economics from the University of Georgia.