Machine Learning
Part 1: What is MLOps
In recent years, the tech industry has rapidly adopted AI-driven innovation. Across sectors such as entertainment, HR, legal, finance, and agriculture, Artificial Intelligence and Machine Learning (AI/ML) algorithms have become pivotal drivers of business growth. Moreover, recent breakthroughs in Generative AI (GenAI) have expanded the spectrum of use cases, ranging from smarter chatbots to enhanced content and code creation tools such as ChatGPT.
In 2023, GenAI serving endpoints grew by 500%. However, it remains an open secret within the AI/ML community that only a fraction of these AI/ML solutions successfully transition to production systems, with surveys estimating a success rate of just 54%. In essence, teams often develop a "Proof of Concept" (PoC) to demonstrate the potential business value of an AI/ML solution, yet frequently fail to fully operationalise that value. The primary reasons these AI/ML PoCs fall short of reaching production systems include:
- Scalability: While a solution may demonstrate the desired output on a small or clean dataset (e.g. those freely available on platforms like Kaggle), it often fails to perform well on larger "real-life" datasets.
- Governance: Due to inadequate monitoring and poor adherence to software best practices, AI/ML solutions often get lost or abandoned, particularly as team members move to different organisations.
- Reproducibility: Although an AI/ML algorithm's output may be reproducible locally, it frequently cannot be replicated across team members or in production, leading to decreased confidence in the solution.
- Over-reliance on Vendors: Developing AI/ML algorithms within third-party platforms that don't align with the organisation's production environment poses integration challenges. In some cases, such integration might not even be feasible, leading to a loss of invested resources.
To address the challenges above, MLOps has emerged as the approach to streamline the process of taking AI/ML algorithms (e.g. in the form of PoCs) to production systems [1].
In the last section, I outlined why many AI/ML solutions fail to provide business value. On the bright side, organisations that have embraced MLOps are experiencing improvements in efficiency and delivery. For example, by leveraging MLOps, Uber has been able to empower "a better customer experience, helping prevent safety incidents" while supporting "a large volume of model deployments on a daily basis" [2].
As pictured in the diagram above, MLOps encompasses different areas of a business: AI/ML, DevOps and Data Engineering. Integrating these three technical areas constitutes the main challenge in MLOps. For example, in addition to version-controlling code and data (as DevOps and DataOps approaches do), AI/ML solutions require version control of the AI/ML algorithms themselves.
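To make that last point concrete, here is a minimal, illustrative sketch of model versioning using MLflow (the tool explored in the next part of this series). The dataset and parameter values are toy placeholders rather than a prescribed setup.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for a real, version-controlled training dataset.
X_train, y_train = make_classification(n_samples=500, n_features=10, random_state=42)

with mlflow.start_run():
    # Record the exact configuration behind this model version.
    mlflow.log_param("n_estimators", 100)

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Log a training metric and the trained model artifact itself, so this
    # model version can be tracked alongside code and data versions.
    mlflow.log_metric("train_accuracy", model.score(X_train, y_train))
    mlflow.sklearn.log_model(model, "model")
```

Each run captures the parameters, metrics and model artifact together, which is what makes a given model version traceable and reproducible later on.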
A crucial concept that drives MLOps into action is the AI/ML Algorithm Lifecycle (or AI/ML Lifecycle). Looking at the diagram above, the AI/ML Lifecycle is a "divide and conquer" framework that puts MLOps into practice by breaking the process into a series of actionable steps. These steps can then be implemented by Data, DevOps and ML Engineers. Each of the steps required to achieve AI/ML deployment is outlined below (see diagram above):
- Business Goal: Arguably the most important step, here we formulate the business problem and identify the metrics we will use to assess business value. Most PoCs fail because a business goal is not precisely formulated.
- ML Problem Framing: In this step, we translate the business language, criteria and objectives into their technical AI/ML counterparts. We have to justify why an AI/ML algorithm is the best solution for this use case and what AI/ML architecture will be optimal to provide business value (e.g. supervised vs. unsupervised learning, LLMs, etc.).
- Data Processing: In the words of Amazon's CTO Werner Vogels: "if you don't have good data, you don't have good AI" [4]. This step entails collecting and cleaning the data so that it's ready to train the AI/ML algorithm.
- Model Development: In this step, we train the AI/ML algorithm and evaluate it to understand how well it will do in production, fine-tuning it as needed. For the interested reader, I've written a more detailed description of these steps in this blog post, and a minimal code sketch follows this list.
- Deployment and Monitoring: Finally, the model is automatically deployed and monitored in production (ideally through a DevOps pipeline), making predictions and providing business value for the organisation.
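To tie the lifecycle steps together, below is a minimal sketch in Python/scikit-learn covering data processing, model development and a simple deployment gate. The toy dataset and the 0.9 accuracy threshold are hypothetical placeholders; a real pipeline would use the organisation's own data and the metric agreed in the Business Goal and ML Problem Framing steps.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data Processing: load a toy dataset and drop rows with missing values
# (a stand-in for a real, version-controlled data pipeline).
df = load_breast_cancer(as_frame=True).frame.dropna()
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model Development: train a simple model and evaluate it on held-out data.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

# Deployment gate: only promote the model if it meets the (hypothetical)
# metric threshold agreed in the Business Goal step.
if accuracy >= 0.9:
    print(f"Accuracy {accuracy:.3f} meets the threshold; promote to deployment.")
else:
    print(f"Accuracy {accuracy:.3f} is below the threshold; iterate on earlier steps.")
```

In practice, the "promote to deployment" branch would hand the model over to an automated DevOps pipeline that serves and monitors it, as described in the Deployment and Monitoring step above.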
The AI/ML lifecycle is an iterative process, allowing for back-and-forth development between steps. In reality, we can break up some of these steps further. However, this picture provides a laypersonâs overview of how organisations can deliver business value with an AI/ML solution.
Lastly, it is crucial to point out one of the most significant challenges in deploying AI/ML solutions: doing so ethically. Although the AI/ML Lifecycle does not directly tackle ethics and bias, ongoing efforts are addressing this challenge both in industry and academia. In summary, organisations need to be accountable and proactive, driving ethical efforts from moral values and organisational culture. If you're interested in reading further (including tips for organisations to become more data ethics-driven), I've written on "Fairness in AI/ML" at length in this Medium article.
In this blog post, I have introduced and motivated MLOps as an approach to streamlining the business value that AI/ML can provide to organisations. The goal of MLOps is to provide frameworks and patterns to leverage the AI/ML lifecycle, thereby maximising business value. In the next part of this blog series, I'll dive into the implementation of MLOps through frameworks and tools, primarily exploring MLflow.