Data science is not a one-time project but a continuous process that provides value to the company. Through constant retraining and refinement, our model can keep meeting business needs.
Standard processes must be followed to ensure the data science project continues to provide value. This is where the Data Science lifecycle process helps our work. By taking a systematic approach to our project, we can maintain the highest standard for our machine learning model.
So, what is this Data Science lifecycle process, and how will it help our work? Let’s explore it together.
Data Science Lifecycle Process
The Data Science lifecycle process is a structured series of phases that guides data scientists in building machine learning models and analytics solutions.
Several frameworks exist for the Data Science lifecycle process, but I am fond of the cross-industry standard process for data mining, or CRISP-DM.
CRISP-DM can be described as a standard framework for data science projects. It was first developed in 1999 and has since been used in many successful industry applications of data science.
In general, the framework defines the data science lifecycle process as the following:
- Business Understanding
- Data Understanding
- Data Preparation
- Modelling
- Evaluation
- Deployment
If we visualize the process, it looks similar to the image below.
The CRISP-DM process itself is not strictly sequential. This means we can move back and forth between phases, as the arrows in the image show. The outer circle also represents the framework’s cyclic nature.
CRISP-DM is not a one-time process; every cycle is a new learning experience, and we can apply what we learn to the next.
Let’s break down each phase and how you can apply it to your projects.
Business Understanding
In any data science project, you should start with a business understanding, as this is the project’s foundation.
This phase has several critical tasks: defining the business question and objectives by identifying the specific issue from a business perspective, assessing the situation, and creating the project plan.
First, we must always define the project’s business question and objective. What do we need to solve from the business perspective, and what are the business success criteria (Key Performance Indicators, or KPIs)? We need to answer these questions by discussing them with our business counterparts.
Success criteria might include model metrics, delivery timelines, or any other measurable outcome. What is essential is that they are logically sound and suited to the business needs.
Lastly, develop a detailed plan for each project phase and the tools you will use. If possible, assess the available resources, project requirements, risks, and the cost-benefit of the project. Getting as much detail as possible is important for building a solid foundation for our project.
Data Understanding
The next phase is data understanding. Here, we analyze and evaluate the data that will support solving the business problem.
This phase has essential parts, including data collection, description, exploration, and quality verification.
Data collection involves understanding where and how we can acquire the data important to our project. We already have the business foundation and know what data to use. Still, sometimes the data isn’t available for various reasons: it may not have been collected in the data warehouse yet, or the required data may be locked behind regulations. Either way, we need to work with the data we have.
Data description, exploration, and quality verification become important once we have the data. Even though the data is in hand, it’s essential to understand it and ensure it is useful for our project. Examining the data format, describing relationships between fields, visualizing distributions, and assessing missing values are some of the methods we should use to understand the dataset.
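The checks above can be sketched with a few lines of pandas. This is a minimal illustration on a toy, hypothetical dataset, not a prescription for any particular project:

```python
import pandas as pd

# A toy customer dataset standing in for real project data (hypothetical values).
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [25, 41, None, 33],
    "monthly_spend": [120.0, 80.5, 95.0, None],
})

# Examine the data format: column types and basic descriptive statistics.
print(df.dtypes)
print(df.describe())

# Assess missing values per column before deciding how to handle them.
missing = df.isna().sum()
print(missing)
```

Even this quick pass tells us which columns need imputation or further investigation before modelling.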
Data Preparation
The next step is to prepare our dataset, once we understand our data and are sure it can be processed for our project.
In this phase, we prepare the dataset with the modelling steps in mind. It includes activities such as data selection, cleaning, integration, formatting, and feature engineering.
When we talk about data selection, it should always reflect the business question and the modelling we want to do. When we filter out specific data, make sure there is a valid explanation so we don’t accidentally drop essential records.
The cleaning process should follow the same principle to avoid garbage in, garbage out: if we want the right outputs, we must not feed the model wrong data. Cleaning also covers data formatting, where the dataset standard should be followed thoroughly throughout the process.
Data preparation also includes feature engineering and integrating data from multiple datasets. Feature engineering is where we derive new features, deemed necessary for modelling, from the existing ones. Integration, on the other hand, combines data from multiple datasets. Both are important aspects of data preparation that we should not miss.
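The preparation steps above can be illustrated in one small pandas sketch. The tables, the cleaning rule, and the derived feature are all hypothetical examples, assuming a simple customer-transactions scenario:

```python
import pandas as pd

# Two hypothetical source tables: transactions and customer profiles.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [50.0, 70.0, -5.0, 30.0],  # the negative amount violates our rule
})
profiles = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["retail", "retail", "corporate"],
})

# Cleaning: drop rows that violate a documented business rule (amount must be positive).
clean = transactions[transactions["amount"] > 0]

# Integration: combine the two datasets on their shared key.
merged = clean.merge(profiles, on="customer_id", how="left")

# Feature engineering: derive a per-customer total spend from the existing columns.
features = merged.groupby("customer_id", as_index=False)["amount"].sum()
features = features.rename(columns={"amount": "total_spend"})
print(features)
```

Note that the filtering rule is explicit and commented, which matches the principle of always having a valid explanation for dropped data.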
Modelling
This is the phase many data people love, as it’s the most exciting one. However, it can be considerably shorter than the other phases, since modelling mostly focuses on developing the machine learning object. It is nonetheless as important as the others, because the model will become the tool that answers our business problem.
It starts with model selection, where we need to decide which algorithm is suitable for our business problem. From there, we also want to design our testing strategy to validate model performance, with techniques like train-test splitting, cross-validation, and appropriate technical metrics. Choose the ones suited to solving our business problem.
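A minimal sketch of that validation design, using scikit-learn with synthetic data in place of a real prepared dataset (the model choice and split sizes here are illustrative assumptions, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic data standing in for the prepared project dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Hold out a test set so the final check uses unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Validate the candidate model with 5-fold cross-validation on the training set.
model = LogisticRegression()
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy:", cv_scores.mean())

# Fit on the full training set and check held-out performance.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Comparing several candidate algorithms under the same cross-validation setup keeps the selection fair.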
During model development, we also need to manage our resources well. Some models take longer and need more memory to train, so experimenting with them costs more. Development should also consider questions such as “Is the model I’m developing feasible for the business?” and “Are the resources needed to develop this model too costly?” The answers will be important for managing resources.
In the real world, we don’t need to achieve perfection. A good-enough model is often sufficient, as the data science lifecycle will improve it in future iterations. Even a model that is perfect now can degrade over time and will need recalibration.
Evaluation
The evaluation phase is different from technical model evaluation. This phase focuses more on the business indicators from the model’s standpoint and on deciding what to do next.
Evaluate the model against the business success criteria and assess whether those criteria will be met. Thoroughly explain why the model will help the business, and avoid too much technical jargon so you can communicate easily with non-technical people.
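One way to frame this step is to translate the technical metric into the business KPI agreed in the business understanding phase. All numbers below are hypothetical, purely to show the shape of the check:

```python
# Hypothetical success criteria agreed with the business counterparts.
kpi_threshold = 0.80                 # e.g. "catch at least 80% of churners"
estimated_savings_per_catch = 25.0   # assumed value of each correct prediction

# Metrics coming out of the modelling phase (illustrative numbers).
model_recall = 0.84
predicted_catches = 1_000

# Evaluation: check the business criterion, not just the technical one.
meets_kpi = model_recall >= kpi_threshold
estimated_value = predicted_catches * model_recall * estimated_savings_per_catch
print(f"Meets KPI: {meets_kpi}, estimated value: {estimated_value:.2f}")
```

Expressing the result as an estimated value, rather than a raw recall score, is often what makes the conversation with stakeholders work.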
Review the work process as well and evaluate the project as a whole. Ask questions such as “Is there anything missing?”, “Do we need more time?”, and “How is the rest of the execution progressing?”, as the answers will help us decide our next steps. Reviewing mistakes is also part of the process and will help future iterations.
Deployment
There is a saying: “You might have the best model in the world, but it is useless if it does not make it into production.” Our model only provides value once it is deployed and its output can be accessed.
The deployment phase involves planning and documenting how the model will be deployed and how its results will be presented or delivered. It includes establishing a monitoring and maintenance plan to ensure model quality over time so the model will keep providing value to the business.
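A monitoring plan can start very simply: record a baseline at deployment time and flag the model when live quality degrades past a tolerance. The numbers and the retraining rule below are hypothetical illustrations of the idea:

```python
# A minimal monitoring sketch: compare live performance against the
# baseline recorded at deployment time (all numbers are hypothetical).
baseline_accuracy = 0.84
alert_tolerance = 0.05  # assumed acceptable drop before we retrain

def needs_retraining(live_accuracy: float) -> bool:
    """Flag the model for retraining when quality degrades past tolerance."""
    return live_accuracy < baseline_accuracy - alert_tolerance

print(needs_retraining(0.83))  # small dip, still within tolerance
print(needs_retraining(0.70))  # clear degradation, trigger maintenance
```

In practice this check would run on a schedule against fresh labelled data, feeding directly back into the cyclic nature of the lifecycle.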
Finally, this phase concludes the project with a final report or presentation to the business stakeholders, where we review the whole project together. Try to get as much feedback as possible to improve what we are lacking and to determine whether the project will require frequent maintenance.
The project might end with the deployment phase, but the lifecycle is a continuous cycle. When you develop your data science project, make sure you think long-term rather than of a one-time project (unless that is what you want).
Conclusion
A data science project is a continuous effort if we want to keep getting value from the model. To standardize the process, we can rely on the data science lifecycle process. In this article, we discussed the CRISP-DM framework for that lifecycle.
The lifecycle can be divided into six phases:
- Business Understanding
- Data Understanding
- Data Preparation
- Modelling
- Evaluation
- Deployment
The process itself is a continuous cycle in which we review the project and learn from our mistakes to improve our models. Each phase is treated differently, but all are equally important to the project’s success.
I hope this has helped!
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.