We Need to Raise the Bar for AI Product Managers
by Julia Winn, August 2024


How to Stop Blaming the ‘Model’ and Start Building Successful AI Products

Image generated by the author using Midjourney

Product managers are responsible for deciding what to build and owning the outcomes of their decisions. This applies to all types of products, including those powered by AI. However, for the last decade it’s been common practice for PMs to treat AI models like black boxes, deflecting responsibility for poor outcomes onto model developers.

PM: I don’t know why the model is doing that, ask the model developer.

This behavior makes about as much sense as blaming the designer for bad signup numbers after a site redesign. Tech companies assume PMs working on consumer products have the intuition to make informed decisions about design changes and take ownership of the results.

So why is this hands-off approach to AI the norm?

The problem: PMs are incentivized to keep their distance from the model development process.

The alternative is a hands-on approach, where the PM engages directly with how the model is built and evaluated. This more rigorous approach is what helps ensure models land successfully and deliver the best experience to users.

A hands-on approach requires:

  • More technical knowledge and understanding.
  • Taking on more risk and responsibility for any known issues or trade-offs present at launch.
  • 2–3X more time and effort: creating eval data sets to systematically measure model behavior can take anywhere from hours to weeks (a rough sketch of what an eval set can look like follows below).

Not sure what an eval is? Check out my post on What Exactly Is an “Eval” and Why Should Product Managers Care?.
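To make the idea concrete, here is a minimal sketch of what a few eval cases might look like for the bundle example discussed later in this post. The personas, fields, and expectation labels are all illustrative assumptions, not a prescribed schema.

```python
# Illustrative only: a handful of eval cases pairing a sample user with the
# properties we expect their recommendations to satisfy. Real eval sets are
# larger and curated together with the model developer.
eval_cases = [
    {
        "persona": {"stage": "expecting", "income_bracket": "low"},
        "expectations": ["stage-appropriate items", "no premium-only picks"],
    },
    {
        "persona": {"stage": "newborn", "income_bracket": "high"},
        "expectations": ["stage-appropriate items", "no duplicate durables"],
    },
    {
        "persona": {"stage": "toddler", "income_bracket": "middle"},
        "expectations": ["cohesive style and price range"],
    },
]

for case in eval_cases:
    print(case["persona"], "->", case["expectations"])
```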

Nine times out of ten, when a model launch falls flat, a hands-off approach was employed. This is less the case at large companies with a long history of deploying AI in products, like Netflix, Google, Meta and Amazon, but this article isn’t for them.

However, overcoming the inertia of the hands-off approach can be challenging. This is especially true when company leadership doesn’t expect anything more, and a PM might even face pushback for “slowing down” the development cycle when adopting hands-on practices.

Imagine a PM at a marketplace like Amazon tasked with developing a product bundle recommendation system for parents. Consider the two approaches.

Hands-off AI PM — Model Requirements

Goal: Grow purchases.

Evaluation: Whatever the model developer thinks is best.

Metrics: Use an A/B test; roll out to 100% of users if there is any statistically significant improvement in purchase rate.
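For reference, that rollout rule usually reduces to comparing purchase rates between the control and treatment arms. Here is a minimal sketch of a two-proportion z-test with made-up numbers (the figures and the 0.05 threshold are assumptions for illustration):

```python
from math import sqrt

from scipy.stats import norm

# Hypothetical A/B results: purchases and users in each arm.
control_purchases, control_users = 4_800, 100_000
treatment_purchases, treatment_users = 5_050, 100_000

p_control = control_purchases / control_users
p_treatment = treatment_purchases / treatment_users

# Pooled two-proportion z-test.
pooled = (control_purchases + treatment_purchases) / (control_users + treatment_users)
std_err = sqrt(pooled * (1 - pooled) * (1 / control_users + 1 / treatment_users))
z = (p_treatment - p_control) / std_err
p_value = 2 * norm.sf(abs(z))  # two-sided

print(f"lift: {p_treatment - p_control:+.4f}, z: {z:.2f}, p-value: {p_value:.4f}")
# Hands-off criterion: roll out to 100% if the lift is positive and p_value < 0.05.
```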

Hands-on AI PM — Model Requirements

Goal: Help parents discover quality products they didn’t realize they needed, making their parenting journey easier.

Metrics: The primary metric is driving purchases of products for parents of young children. Secondary, longer-term metrics we will monitor are the repeat purchase rate from brands first discovered in the bundle and brand diversity in the marketplace over time.

Evaluation: In addition to running an A/B test, our offline evaluation set will look at recommendations for multiple sample users from key stages of parenthood (prioritize expecting, newborn, older baby, toddler, young kid) and four income brackets. If we see any surprises here (ex: low-income parents being recommended the most expensive products), we need to look more closely at the training data and model design.

In our eval set we will consider the following (a rough sketch of these checks in code follows the list):

  • Personalization — look at how many users are getting the same products. We expect differences across income and child age groups.
  • Avoid redundancy — penalize duplicative recommendations for durables (crib, bottle warmer) if one is already in the bundle or the user has already purchased this type of item from us (do not penalize consumables like diapers or collectibles like toys).
  • Coherence — products from different stages shouldn’t be combined (ex: a baby bottle and 2-year-old clothes).
  • Cohesion — avoid mixing wildly different products, ex: super expensive handmade wooden toys with very cheap plastic ones, or loud prints with licensed characters alongside muted pastels.
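Here is a rough sketch of how these checks could be scripted over sample recommendations. The bundle format, category names, and thresholds are all assumptions for illustration; the real checks would be defined together with the model developer.

```python
from collections import Counter

# Hypothetical sample output: persona -> list of (product_id, category, stage, price).
sample_bundles = {
    ("expecting", "low_income"): [
        ("p1", "crib", "expecting", 120.0),
        ("p2", "diapers", "newborn", 25.0),
    ],
    ("toddler", "high_income"): [
        ("p3", "wooden_toy", "toddler", 80.0),
        ("p4", "plastic_toy", "toddler", 5.0),
    ],
}

DURABLE_CATEGORIES = {"crib", "bottle_warmer"}  # consumables and collectibles are exempt

def personalization_check(bundles):
    """Share of products recommended to more than one persona (lower = more personalized)."""
    counts = Counter(pid for items in bundles.values() for pid, *_ in items)
    return sum(1 for c in counts.values() if c > 1) / max(len(counts), 1)

def redundancy_flags(items, already_owned):
    """Flag duplicate durables within a bundle, or durables the user already owns."""
    flags, seen = [], set()
    for pid, category, _stage, _price in items:
        if category in DURABLE_CATEGORIES and (category in seen or category in already_owned):
            flags.append(pid)
        seen.add(category)
    return flags

def coherence_flag(items):
    """Flag bundles that mix products from different parenting stages."""
    return len({stage for _pid, _cat, stage, _price in items}) > 1

def cohesion_flag(items, max_price_ratio=10):
    """Flag bundles whose price spread suggests wildly mismatched products."""
    prices = [price for *_rest, price in items]
    return max(prices) / max(min(prices), 0.01) > max_price_ratio

for persona, items in sample_bundles.items():
    print(persona, redundancy_flags(items, already_owned=set()),
          coherence_flag(items), cohesion_flag(items))
print("share of repeated products:", personalization_check(sample_bundles))
```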

Possible drivers of secondary goals

  • Consider experimenting with a bonus weight for repeat purchase products. Even if we sell slightly fewer bundles upfront, that’s a good tradeoff if it means the people who do buy are more likely to purchase more products in the future.
  • To support marketplace health longer term, we don’t want to bias towards just bestsellers. While upholding quality checks, aim for at least 10% of recommendations to include a brand that isn’t the #1 in its category. If this isn’t happening from the start, the model might be defaulting to “lowest common denominator” behavior and is likely not doing proper personalization (a rough sketch of both ideas follows below).
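Both ideas can be prototyped cheaply outside the model itself. Here is a rough sketch, where the candidate format, the bonus weight, and the 10% guardrail are illustrative assumptions:

```python
# Hypothetical candidates: (product_id, brand, relevance_score, repeat_purchase_rate, is_top_brand).
candidates = [
    ("p1", "BrandA", 0.92, 0.10, True),
    ("p2", "BrandB", 0.88, 0.35, False),
    ("p3", "BrandC", 0.85, 0.05, True),
    ("p4", "BrandD", 0.80, 0.40, False),
]

REPEAT_BONUS = 0.2  # experimental weight to tune via A/B testing (assumption)

def rerank(cands):
    """Boost products with strong repeat-purchase behavior before picking the bundle."""
    return sorted(cands, key=lambda c: c[2] + REPEAT_BONUS * c[3], reverse=True)

def non_top_brand_share(recommendations):
    """Share of recommended products whose brand is not the #1 in its category."""
    return sum(1 for c in recommendations if not c[4]) / len(recommendations)

bundle = rerank(candidates)[:3]
print([pid for pid, *_ in bundle], "non-top-brand share:", non_top_brand_share(bundle))
# Guardrail from the requirements above: investigate if this share stays below ~10%.
```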

Hands-on AI Product Management — Model Developer Collaboration

The specific model architecture should be decided by the model developer, but the PM should have a strong say in:

  • What the model is optimizing for (this should go one or two levels deeper than “more purchases” or “more clicks”).
  • How the model performance will be evaluated.
  • What examples are used for evaluation.

The hands-on approach is objectively so much more work! And this is assuming the PM is even brought into the process of model development in the first place. Sometimes the model developer has good PM instincts and can account for user experience in the model design. However, a company should never count on this; in practice, a UX-savvy model developer is a one-in-a-thousand unicorn.

Plus, the hands-off approach might still kind of work some of the time. In practice, however, this usually results in:

  • Suboptimal model performance, possibly killing the project (ex: execs conclude bundles were just a bad idea).
  • Missed opportunities for significant improvements (ex: a 3% uplift instead of 15%).
  • Unmonitored long-term effects on the ecosystem (ex: small brands leave the platform, increasing dependency on a few large players).

In addition to being more work up front, the hands-on approach can radically change the process of product reviews.

Hands-off AI PM Product Review

Leader: Bundles for parents seems like a great idea. Let’s see how it performs in the A/B test.

Hands-on AI PM Product Review

Leader: I read your proposal. What’s wrong with only suggesting bestsellers if those are the best products? Shouldn’t we be doing what’s in the user’s best interest?

[half an hour of debate later]

PM: As you can see, it’s unlikely that the bestseller is actually best for everyone. Take diapers as an example. Lower-income parents should know about the Amazon brand of diapers that’s half the price of the bestseller. High-income parents should know about the new expensive brand richer customers love because it feels like a cloud. Plus, if we always favor the existing winners in a category, newer but better products will struggle to emerge over the long term.

Leader: Okay. I just want to make sure we aren’t accidentally suggesting a bad product. What quality control metrics do you propose to make sure this doesn’t happen?

Model developer: To ensure only high quality products are shown, we are using the following signals…

The Hidden Costs of Hands-Off AI Product Management

The contrasting scenarios above illustrate a critical juncture in AI product management. While the hands-on PM successfully navigated a challenging conversation, this approach isn’t without its risks. Many PMs, faced with the pressure to deliver quickly, might opt for the path of least resistance.

After all, the hands-off approach promises smoother product reviews, quicker approvals, and a convenient scapegoat (the model developer) if things go awry. However, this short-term ease comes at a steep long-term cost, both to the product and the organization as a whole.

When PMs step back from engaging deeply with AI development, obvious issues and crucial trade-offs remain hidden, leading to several significant consequences, including:

  1. Misaligned Objectives: Without PM insight into user needs and business goals, model developers may optimize for easily measurable metrics (like click-through rates) rather than true user value.
  2. Unintended Ecosystem Effects: Models optimized in isolation can have far-reaching consequences. For instance, always recommending bestseller products could gradually push smaller brands out of the marketplace, reducing diversity and potentially harming long-term platform health.
  3. Diffusion of Responsibility: When decisions are left “up to the model,” it creates a dangerous accountability vacuum. PMs and leaders can’t be held responsible for outcomes they never explicitly considered or approved. This lack of clear ownership can lead to a culture where no one feels empowered to address issues proactively, potentially allowing small problems to snowball into major crises.
  4. Perpetuation of Subpar Models: Without close examination of model shortcomings through a product lens, the highest-impact improvements can’t be identified and prioritized. Acknowledging and owning these shortcomings is necessary for the team to make the right trade-off decisions at launch. Without this, underperforming models will become the norm. This cycle of avoidance stunts model evolution and wastes AI’s potential to drive real user and business value.

The first step a PM can take to become more hands-on? Ask your model developer how you can help with the eval! There are so many great free tools to help with this process, like promptfoo (a favorite of Shopify’s CEO).
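One low-effort way to start is to hand the model developer concrete cases in a format any eval tool can ingest. Here is a minimal sketch; the column names and file name are assumptions for illustration, not a schema required by promptfoo or any other tool.

```python
import csv

# A PM can contribute eval cases without touching the model: describe the input
# scenario and what a good recommendation should (and should not) contain.
cases = [
    {
        "scenario": "expecting parent, low income bracket",
        "should_contain": "budget-friendly essentials for late pregnancy",
        "should_not_contain": "premium-only products",
    },
    {
        "scenario": "parent of a toddler who already owns a crib",
        "should_contain": "toddler-stage items",
        "should_not_contain": "another crib",
    },
]

# Write the cases to a CSV the model developer can load into their eval tooling.
with open("bundle_eval_cases.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["scenario", "should_contain", "should_not_contain"])
    writer.writeheader()
    writer.writerows(cases)
```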

Product leadership has a critical role in elevating the standards for AI products. Just as UI changes undergo multiple reviews, AI models demand equal, if not greater, scrutiny given their far-reaching impact on user experience and long-term product outcomes.

The first step towards fostering deeper PM engagement with model development is holding PMs accountable for understanding what they are shipping.

Ask questions like:

  • What eval methodology are you using? How did you source the examples? Can I see the sample results?
  • What use cases do you feel are most important to support with this first version? Will we have to make any trade-offs to facilitate this?

Be thoughtful about what kinds of evals are used where:

  • For a model deployed on a high-stakes surface, consider making eval sets a requirement. Pair this with rigorous post-launch impact and behavior analysis as far down the funnel as possible.
  • For a model deployed on a lower-stakes surface, consider allowing a quicker first launch with a less rigorous evaluation, but push for rapid post-launch iteration once data is collected about user behavior.
  • Investigate feedback loops in model training and scoring, ensuring human oversight beyond mere precision/recall metrics.

And remember, iteration is key. The initial model shipped should rarely be the final one. Make sure resources are available for follow-up work.

Ultimately, the widespread adoption of AI brings both immense promise and significant changes to what product ownership entails. To fully realize its potential, we must move beyond the hands-off approach that has too often led to suboptimal outcomes. Product leaders play a pivotal role in this shift. By demanding a deeper understanding of AI models from PMs and fostering a culture of accountability, we can ensure that AI products are thoughtfully designed, rigorously tested, and truly beneficial to users. This requires upskilling for many teams, but the resources are readily available. The future of AI depends on it.
