Lessons from agile, experimental chatbot development


Lessons learned bringing LLM-based products to production

A photo of me (Katherine Munro) on stage presenting this article as a talk. You can watch or listen to the recording here.

What happens when you take a working chatbot that’s already serving thousands of customers a day in four different languages, and try to deliver an even better experience using Large Language Models? Good question.

It’s well known that evaluating and comparing LLMs is tricky. Benchmark datasets can be hard to come by, and metrics such as BLEU are imperfect. But those are largely academic concerns: How are industry data teams tackling these issues when incorporating LLMs into production projects?
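To make the point about BLEU concrete, here’s a minimal sketch (using NLTK’s sentence-level BLEU and an invented customer-service example) of how an n-gram overlap metric can give a near-zero score to a perfectly valid chatbot answer, simply because it uses different words than the reference:

```python
# A minimal sketch of why n-gram overlap metrics like BLEU can be misleading
# for chatbots: a correct paraphrase shares few n-grams with the reference.
# (Hypothetical example; requires `pip install nltk`.)
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "your parcel will arrive tomorrow before noon".split()
candidate = "the package should be delivered by midday tomorrow".split()

# Smoothing avoids hard zeros when higher-order n-grams have no overlap.
smoothie = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smoothie)

print(f"BLEU: {score:.3f}")  # close to 0, even though the answer is fine
```

Semantically the two answers say the same thing, but they share only a single word, so the score collapses. That’s exactly the kind of blind spot that makes automatic comparison of two conversational systems so tricky.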

In my work as a Conversational AI Engineer, I’m doing exactly that. And that’s how I ended up centre-stage at a recent data science conference, giving the (optimistically titled) talk, “No baseline? No benchmarks? No biggie!” Today’s post is a recap of that talk, featuring:

  • The challenges of evaluating an evolving, LLM-powered PoC against a working chatbot
  • How we’re using different types of testing at different stages of the PoC-to-production process
  • Practical pros and cons of different test types
