Building a Regression Model to Predict Delivery Durations: A Practical Guide


Data Preparation & Exploratory Analysis

Now that we’ve outlined our approach, let’s take a look at our data and what kind of features we’re working with.
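If you're following along, a snippet like the following will load the dataset and surface the summary described below (the CSV path and filename are assumptions; point them at wherever your copy of the data lives):

import pandas as pd

# load the delivery data (path/filename is an assumption; adjust to your local copy)
df = pd.read_csv('datasets/historical_data.csv')

# non-null counts & dtypes for each feature
df.info()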

From the above, we see our data contains ~197,000 deliveries, with a variety of numeric & non-numeric features. None of the features are missing a large percentage of values (lowest non-null count ~181,000), so we likely won’t have to worry about dropping any features entirely.

Let’s check if our data contains any duplicated deliveries, and if there are any observations that we cannot compute delivery time for.

print(f"Number of duplicates: {df.duplicated().sum()} \n")

print(pd.DataFrame({'Missing Count': df[['created_at', 'actual_delivery_time']].isna().sum()}))

We see that all the deliveries are unique. However, 7 deliveries are missing a value for ‘actual_delivery_time’, which means we won’t be able to compute the delivery duration for those orders. Since there are only a handful of these, we’ll remove them from our data.
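A minimal sketch of that removal step:

# drop the rows where the delivery completion timestamp is missing
df = df.dropna(subset=['actual_delivery_time'])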

Now, let’s create our prediction target. We want to predict the delivery duration (in seconds), which is the elapsed time between when the customer placed the order (‘created_at’) and when they received the order (‘actual_delivery_time’).

# convert columns to datetime 
df['created_at'] = pd.to_datetime(df['created_at'], utc=True)
df['actual_delivery_time'] = pd.to_datetime(df['actual_delivery_time'], utc=True)

# create prediction target
df['seconds_to_delivery'] = (df['actual_delivery_time'] - df['created_at']).dt.total_seconds()

The last thing we’ll do before splitting our data into train/test is check for missing values. We already viewed the non-null counts for each feature above, but let’s view the proportions to get a better picture.
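A quick sketch of that check:

# proportion of missing values per feature, sorted for readability
print(pd.DataFrame({'Missing Proportion': df.isna().mean().sort_values(ascending=False)}))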

We see that the market features (‘total_onshift_dashers’, ‘total_busy_dashers’, ‘total_outstanding_orders’) have the highest percentage of missing values (~8% each). The feature with the second-highest missing rate is ‘store_primary_category’ (~2%). All other features are missing < 1% of values.

Since none of the features have a high missing count, we won’t remove any of them. Later on, we will look at the feature distributions to help us decide how to appropriately deal with missing observations for each feature.

But first, let’s split our data into train/test. We will proceed with an 80/20 split, and we’ll write this test data to a separate file which we won’t touch until evaluating our final model.

from sklearn.model_selection import train_test_split
import os

# shuffle
df = df.sample(frac=1, random_state=42)
df = df.reset_index(drop=True)

# split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# write test data to separate file
directory = 'datasets'
file_name = 'test_data.csv'
file_path = os.path.join(directory, file_name)
os.makedirs(directory, exist_ok=True)
test_df.to_csv(file_path, index=False)

Now, let’s dive into the specifics of our train data. We’ll establish our numeric & categorical features, to make it clear which columns are being referenced in later exploratory steps.

categorical_feats = [
    'market_id',
    'store_id',
    'store_primary_category',
    'order_protocol'
]

numeric_feats = [
    'total_items',
    'subtotal',
    'num_distinct_items',
    'min_item_price',
    'max_item_price',
    'total_onshift_dashers',
    'total_busy_dashers',
    'total_outstanding_orders',
    'estimated_order_place_duration',
    'estimated_store_to_consumer_driving_duration'
]

Let’s revisit the categorical features with missing values (‘market_id’, ‘store_primary_category’, ‘order_protocol’). Since there was little missing data among those features (< 3%), we will simply impute those missing values with an “unknown” category.

  • This way, we won’t have to remove data from other features.
  • Perhaps the absence of feature values holds some predictive power for delivery duration, i.e., these features may not be missing at random.
  • Additionally, we will add this imputation step to our preprocessing pipeline during modeling, so that we won’t have to manually duplicate this work on our test set.
missing_cols_categorical = ['market_id', 'store_primary_category', 'order_protocol']

train_df[missing_cols_categorical] = train_df[missing_cols_categorical].fillna("unknown")
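As a preview of that pipeline step, here’s a minimal sketch of the same constant-fill imputation expressed with scikit-learn (the exact pipeline we assemble during modeling may look different):

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# mirrors the fillna("unknown") step above, but as a reusable pipeline component
categorical_imputer = ColumnTransformer(
    transformers=[(
        'impute_unknown',
        SimpleImputer(strategy='constant', fill_value='unknown'),
        missing_cols_categorical
    )],
    remainder='passthrough'
)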

Let’s look at our categorical features.

pd.DataFrame({'Cardinality': train_df[categorical_feats].nunique()}).rename_axis('Feature')

Since ‘market_id’ & ‘order_protocol’ have low cardinality, we can visualize their distributions easily. On the other hand, ‘store_id’ & ‘store_primary_category’ are high cardinality features. We’ll take a deeper look at those later.

import seaborn as sns
import matplotlib.pyplot as plt

categorical_feats_subset = [
    'market_id',
    'order_protocol'
]

# Set up the grid
fig, axes = plt.subplots(1, len(categorical_feats_subset), figsize=(13, 5), sharey=True)

# Create barplots for each variable
for i, col in enumerate(categorical_feats_subset):
    sns.countplot(x=col, data=train_df, ax=axes[i])
    axes[i].set_title(f"Frequencies: {col}")

# Adjust layout
plt.tight_layout()
plt.show()

Some key things to note:

  • ~70% of orders placed have ‘market_id’ of 1, 2, 4
  • < 1% of orders have ‘order_protocol’ of 6 or 7

Unfortunately, we don’t have any additional information about these variables, such as which ‘market_id’ values correspond to which cities/locations, or what each ‘order_protocol’ number represents. At this point, requesting additional data on these variables may be a good idea, as it could help us investigate delivery duration trends across broader region/location categorizations.

Let’s look at our higher cardinality categorical features. Perhaps each ‘store_primary_category’ has an associated ‘store_id’ range? If so, we may not need ‘store_id’, as ‘store_primary_category’ would already encapsulate a lot of the information about the store being ordered from.

store_info = train_df[['store_id', 'store_primary_category']]

store_info.groupby('store_primary_category')['store_id'].agg(['min', 'max'])

Clearly not the case: we see that ‘store_id’ ranges overlap across levels of ‘store_primary_category’.

A quick look at the distinct values and associated frequencies for ‘store_id’ & ‘store_primary_category’ shows that these features have high cardinality and are sparsely distributed. High cardinality categorical features can be problematic in regression tasks, particularly for algorithms that require solely numeric inputs: once encoded, they can enlarge the feature space drastically, making the available data sparse and weakening the model’s ability to generalize to new observations in that feature space. For a better & more thorough explanation of the phenomenon, you can read more about it here.
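To make that concrete, a quick back-of-the-envelope check (a sketch) shows how many columns a naive one-hot encoding of ‘store_id’ alone would introduce:

# each distinct store_id would become its own indicator column if one-hot encoded
n_stores = train_df['store_id'].nunique()
print(f"One-hot encoding store_id would add {n_stores} columns for {len(train_df)} rows.")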

Let’s get a sense of how sparsely distributed these features are.

store_id_values = train_df['store_id'].value_counts()

# Plot the histogram
plt.figure(figsize=(8, 5))
plt.bar(store_id_values.index, store_id_values.values, color='skyblue')

# Add titles and labels
plt.title('Value Counts: store_id', fontsize=14)
plt.xlabel('store_id', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()

We see that a handful of stores have hundreds of orders, but the majority have far fewer than 100.

To handle the high cardinality of ‘store_id’, we’ll create another feature, ‘store_id_freq’, that groups the ‘store_id’ values by frequency.

  • We’ll group the ‘store_id’ values into five different percentile bins shown below.
  • ‘store_id_freq’ will have much lower cardinality than ‘store_id’, but will retain relevant information regarding the popularity of the store the delivery was ordered from.
  • For more inspiration behind this logic, check out this thread.
import numpy as np

def encode_frequency(freq, percentiles) -> str:
    if freq < percentiles[0]:
        return '[0-50)'
    elif freq < percentiles[1]:
        return '[50-75)'
    elif freq < percentiles[2]:
        return '[75-90)'
    elif freq < percentiles[3]:
        return '[90-99)'
    else:
        return '99+'

value_counts = train_df['store_id'].value_counts()
percentiles = np.percentile(value_counts, [50, 75, 90, 99])

# apply encode_frequency to each store_id based on their number of orders
train_df['store_id_freq'] = train_df['store_id'].apply(lambda x: encode_frequency(value_counts[x], percentiles))

pd.DataFrame({'Count':train_df['store_id_freq'].value_counts()}).rename_axis('Frequency Bin')

Our encoding shows that ~60,000 deliveries were ordered from stores categorized in the 90–99th percentile of popularity, whereas ~12,000 deliveries came from stores in the 0–50th percentile.

Now that we’ve attempted to capture relevant ‘store_id’ information in a lower dimension, let’s try to do something similar with ‘store_primary_category’.

Let’s look at the most popular ‘store_primary_category’ levels.
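Something like the following will surface them (a sketch; the cutoff of 15 is arbitrary):

# most frequent store categories in the training data
print(train_df['store_primary_category'].value_counts().head(15))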

A quick look shows us that many of these ‘store_primary_category’ levels are not mutually exclusive (e.g., ‘american’ & ‘burger’). Further investigation reveals many more examples of this kind of overlap.

So, let’s try to map these distinct store categories into a few basic, all-encompassing groups.

store_category_map = {
    'american': ['american', 'burger', 'sandwich', 'barbeque'],
    'asian': ['asian', 'chinese', 'japanese', 'indian', 'thai', 'vietnamese', 'dim-sum', 'korean',
              'sushi', 'bubble-tea', 'malaysian', 'singaporean', 'indonesian', 'russian'],
    'mexican': ['mexican'],
    'italian': ['italian', 'pizza'],
}

def map_to_category_type(category: str) -> str:
    # return the broad group a store category belongs to, defaulting to "other"
    for category_type, categories in store_category_map.items():
        if category in categories:
            return category_type
    return "other"

train_df['store_category_type'] = train_df['store_primary_category'].apply(map_to_category_type)

value_counts = train_df['store_category_type'].value_counts()

# Plot pie chart
plt.figure(figsize=(6, 6))
value_counts.plot.pie(autopct='%1.1f%%', startangle=90, cmap='viridis', labels=value_counts.index)
plt.title('Category Distribution')
plt.ylabel('') # Hide y-axis label for aesthetics
plt.show()

This grouping is admittedly crude, and there may well be a better way to group these store categories. For simplicity, let’s proceed with it for now.

We’ve done a good deal of investigation into our categorical features. Let’s look at the distributions for our numeric features.

# Create grid for boxplots
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 15)) # Adjust figure size
axes = axes.flatten() # Flatten the 5x2 axes into a 1D array for easier iteration

# Generate boxplots for each numeric feature
for i, column in enumerate(numeric_feats):
    sns.boxplot(y=train_df[column], ax=axes[i])
    axes[i].set_title(f"Boxplot for {column}")
    axes[i].set_ylabel(column)

# Remove any unused subplots (if any)
for i in range(len(numeric_feats), len(axes)):
    fig.delaxes(axes[i])

# Adjust layout for better spacing
plt.tight_layout()
plt.show()

Boxplots for a subset of our numeric features

Many of the distributions appear more right-skewed than they actually are due to the presence of extreme outliers.
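If we want a numeric check, pandas can compute skewness directly (a sketch; the extreme outliers will inflate these values):

# skewness per numeric feature; large positive values indicate a long right tail
print(train_df[numeric_feats].skew().sort_values(ascending=False))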

In particular, there is an order with 400+ items, which is strange given that the next-largest order contains fewer than 100 items.

Let’s look more into that 400+ item order.

train_df[train_df['total_items']==train_df['total_items'].max()]
