Lessons learned bringing LLM-based products to production
What happens when you take a working chatbot that’s already serving thousands of customers a day in four different languages, and try to deliver an even better experience using Large Language Models? Good question.
It’s well known that evaluating and comparing LLMs is tricky. Benchmark datasets can be hard to come by, and metrics such as BLEU are imperfect. But much of that discussion stays academic: how are industry data teams actually tackling these issues when bringing LLMs into production projects?
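To make the BLEU point concrete, here’s a minimal sketch (using the sacrebleu package, with made-up customer-support strings) of how an n-gram overlap metric can penalise a perfectly good chatbot response while rewarding a misleading one:

```python
# Illustration only: why n-gram overlap metrics such as BLEU can mislead
# when scoring open-ended chatbot responses.
# Requires: pip install sacrebleu
import sacrebleu

reference = ["You can reset your password from the account settings page."]

# A helpful paraphrase that shares few exact n-grams with the reference...
paraphrase = "Go to your account settings and choose 'Reset password'."
# ...and a response that copies the reference's wording but gives the wrong answer.
lookalike = "You can reset your password from the login error page."

print(sacrebleu.sentence_bleu(paraphrase, reference).score)  # typically scores low, despite being correct
print(sacrebleu.sentence_bleu(lookalike, reference).score)   # typically scores much higher, despite being wrong
```

The same weakness applies to most surface-level text metrics, which is part of why evaluating conversational LLM output is hard in the first place.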
In my work as a Conversational AI Engineer, I’m doing exactly that. And that’s how I ended up centre-stage at a recent data science conference, giving the (optimistically titled) talk, “No baseline? No benchmarks? No biggie!” Today’s post is a recap of that talk, covering:
- The challenges of evaluating an evolving, LLM-powered PoC against a working chatbot
- How we’re using different types of testing at different stages of the PoC-to-production process
- Practical pros and cons of different test types