For many beginners in data science, Python is the first language they learn and explore. It is versatile and powerful for data manipulation, which makes it attractive to many data professionals.
However, a basic introduction is not enough once you move into real-world projects. Problems get more complicated, beginner-level code stops being sufficient, and your scripts can become inefficient and hard to communicate.
If you want to move past the beginner level, picking up a new library or a more complex algorithm is not the answer. You need to structure your Python code differently by incorporating programming patterns into your workflows.
In other words, transitioning from a beginner to an intermediate-level data scientist is not about writing more basic code; it is about writing smarter, more structured code. The goal at the intermediate level is to shift towards code that is modular and scalable, something that stays readable and maintainable while handling more complex workflows.
In this article, we will discuss several Python programming patterns that can enhance your code and take you to the intermediate level, starting with the data pipeline pattern. Let’s get into it.
Data Pipeline Pattern
When working with data, we handle many manipulation tasks, such as cleaning, preprocessing, and feature engineering. In a beginner-level script, these tasks end up scattered across the notebook or repeated multiple times within the script.
If you continue this approach, debugging becomes exhausting, technical debt piles up, and collaborating with others gets harder. This is why we need a pipeline pattern to improve the workflow.
A pipeline organizes the steps the data must go through into a sequence. We define the stages, and each stage is responsible for one specific action, such as data loading, cleaning, scaling, or model training. The data flows systematically from one step to the next.
Let’s take an example with a Pandas DataFrame. We will simulate a small sales dataset.
import pandas as pd
import numpy as np

# Simulated sales data with a few missing values
data = {
    'sales': [1000, 1500, np.nan, 2000, 2500],
    'quantity': [50, 60, 70, np.nan, 90],
    'product': ['A', 'B', 'C', 'D', 'E']
}
example_df = pd.DataFrame(data)
As a data scientist, there are common data processing steps you perform again and again. For example, here are three of the most common: data loading, cleaning, and feature engineering.
# Step 1: Load Data
def load_data(df: pd.DataFrame) -> pd.DataFrame:
    return df

# Step 2: Data Cleaning
def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    missing_before = df.isnull().sum().sum()
    df_cleaned = df.dropna().reset_index(drop=True)
    missing_after = df_cleaned.isnull().sum().sum()
    # Report how many missing values were removed
    print(f"Removed {missing_before - missing_after} missing values.")
    return df_cleaned

# Step 3: Feature Engineering
def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    if 'sales' in df.columns and 'quantity' in df.columns:
        df['avg_price'] = df['sales'] / df['quantity']
    return df
By defining each data processing stage as a reusable function, you can run it as many times as you need without rewriting the logic each time.
However, to make the structure neater, you should chain the steps into a pipeline. Let’s create a pipeline execution function for the functions we defined above.
def execute_pipeline(df: pd.DataFrame, steps: list) -> pd.DataFrame:
    for step in steps:
        df = step(df)
    return df

pipeline_steps = [
    load_data,
    clean_data,
    engineer_features
]

final_df = execute_pipeline(example_df, pipeline_steps)
That’s all there is to it; with a simple pipeline function, you get reusable code that is easy to read.
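To see the reusability in practice, here is a small sketch that adds a hypothetical filter_large_orders step (the name and threshold are illustrative, not part of the original example) and reruns the same pipeline with a modified step list.
# Hypothetical extra step: keep only rows above an illustrative sales threshold
def filter_large_orders(df: pd.DataFrame) -> pd.DataFrame:
    return df[df['sales'] >= 1500].reset_index(drop=True)

# Reuse the same execute_pipeline function with a different list of steps
extended_steps = [load_data, clean_data, filter_large_orders, engineer_features]
extended_df = execute_pipeline(example_df, extended_steps)
Because each stage is just a function, adding, swapping, or reordering steps only requires editing the list.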
You can also use Scikit-Learn to build the data processing and modelling pipeline. Here is conceptual code you can use.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# X and y are placeholders for your feature matrix and target
pipeline.fit(X, y)
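Since the snippet above is conceptual, X and y are not defined. As a minimal, self-contained sketch, you could generate a synthetic dataset (purely illustrative, using scikit-learn's make_classification) to see the pipeline fit and score end to end.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic classification data purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the full pipeline and evaluate it on the held-out split
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))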
Get comfortable with pipelines, as they will be useful in almost every data science task you take on.
Factory Pattern
A key difference between beginner and intermediate proficiency in programming is how much code you reuse rather than rewrite. This is harder in data science, where workflows grow more complex and experiments are frequent.
When experimenting with data and models, we often need to initialize and switch between different models and datasets. Writing separate code for each combination quickly becomes tedious and messy.
This is where the factory pattern proves useful. The pattern centralizes the creation of objects, such as machine learning models, so we can produce different objects simply by adjusting parameters rather than writing separate code for each one.
By using the factory pattern, we centralize object creation, make experimentation easier, and keep the code modular.
Let’s look at a code example. Here is a factory for creating machine learning models.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

def model_factory(model_type: str, **kwargs):
    # Map a string identifier to the model class it creates
    models = {
        'logistic_regression': LogisticRegression,
        'decision_tree': DecisionTreeClassifier,
        'random_forest': RandomForestClassifier,
        'gradient_boosting': GradientBoostingClassifier
    }
    if model_type not in models:
        raise ValueError(f"Unsupported model type '{model_type}'. Supported types: {list(models.keys())}")
    return models[model_type](**kwargs)
We only need one function that creates a machine learning model based on the parameters we pass in.
We can then call that function whenever we want to create a model.
logreg_model = model_factory(
    model_type="logistic_regression",
    solver="liblinear",
    random_state=42
)
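One place the factory shines is in comparing several models under the same settings. As a rough sketch, assuming a feature matrix X and target y are already available (for example, the synthetic data from the earlier pipeline sketch), you could loop over the supported model types and evaluate each with cross-validation.
from sklearn.model_selection import cross_val_score

# Evaluate every model the factory supports with the same setup
for name in ['logistic_regression', 'decision_tree', 'random_forest', 'gradient_boosting']:
    model = model_factory(name, random_state=42)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")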
Create a factory for any experiment you know will become messy without one.
Decorator Pattern
As your data workflow grows, you often want extra insight into it, such as logging. A common habit is to sprinkle that code, like logger calls, around every execution in the script, but this leads to redundancy and poor readability.
This is why we use the decorator pattern. A decorator is a special type of function that wraps another function, allowing us to add functionality around the wrapped function’s execution.
Using decorators, we can apply the same behaviour consistently across functions while keeping our code clean. We can also easily remove them when we no longer need them, without touching the core logic.
For example, let’s create a decorator that logs how long a function takes to execute.
import functools
import time

def log_and_time(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        func_name = func.__name__
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        duration = end_time - start_time
        print(f"'{func_name}' completed in {duration:.4f} seconds.\n")
        return result
    return wrapper
Next, let’s say we have a function that generates simulated data. We will apply the decorator we just created to this new function.
@log_and_time
def load_simulated_data(n_rows=10000):
    np.random.seed(42)
    df = pd.DataFrame({
        'age': np.random.randint(18, 70, size=n_rows),
        'income': np.random.normal(50000, 15000, size=n_rows),
        'credit_score': np.random.randint(300, 850, size=n_rows),
        'loan_amount': np.random.normal(20000, 5000, size=n_rows),
        'defaulted': np.random.binomial(1, 0.2, size=n_rows)
    })
    # Randomly set 5% of the values in selected columns to NaN
    for col in ['income', 'credit_score']:
        df.loc[df.sample(frac=0.05).index, col] = np.nan
    return df
Now, let’s execute the function we just created.
df_loaded = load_simulated_data()
You will see an execution time similar to the output below.
'load_simulated_data' completed in 0.0707 seconds.
The decorator works as intended, picking up the function name and reporting its execution time.
Use the decorator pattern for any job you want to wrap so you can collect information about its execution.
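For instance, here is a sketch of a second, hypothetical decorator that reports the shape of any DataFrame a function returns. Because decorators compose, you can stack it with log_and_time on the same function.
def log_shape(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        # Only report the shape when the function returns a DataFrame
        if isinstance(result, pd.DataFrame):
            print(f"'{func.__name__}' returned a DataFrame with shape {result.shape}")
        return result
    return wrapper

@log_and_time
@log_shape
def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna().reset_index(drop=True)

cleaned_df = drop_missing(df_loaded)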
Conclusion
Going from beginner to intermediate data scientist means using Python beyond the introductory level and managing your data workflows more neatly and efficiently. In this article, we explored several Python patterns that help take you to that level: the data pipeline pattern, the factory pattern, and the decorator pattern.
I hope this has helped!
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.