How to Combine Pandas, NumPy, and Scikit-learn Seamlessly

Integrating Pandas, NumPy, and scikit-learn in a Machine Learning Workflow
Image by Author | ChatGPT

Introduction

Machine learning workflows require several distinct steps — from loading and preparing data to creating and evaluating models. Python offers specialized libraries that excel at each step: Pandas handles data manipulation, NumPy provides mathematical operations, and scikit-learn delivers machine learning algorithms. While each is valuable independently, their true strength emerges when they work together.

In this tutorial, you’ll discover how to integrate these three libraries in a cohesive workflow to build effective machine learning solutions. You’ll work with a concrete compressive strength dataset to predict strength based on various ingredients — an engineering problem that demonstrates practical applications of machine learning.

By the end of this tutorial, you’ll understand:

- How these three libraries complement each other in data science workflows
- The specific roles each library plays in different stages of analysis
- How to move data smoothly between libraries while preserving important information
- Techniques for creating an integrated pipeline from raw data to predictions

Prerequisites

Before diving into this tutorial, you should have:

- Python 3.6 or newer installed on your system
- Basic familiarity with Python syntax and programming concepts
- A working installation of the following libraries:
  - Pandas (1.0.0 or newer)
  - NumPy (1.18.0 or newer)
  - scikit-learn (0.22.0 or newer)
  - Matplotlib (3.1.0 or newer) for visualizations

If you need to install these packages, you can do so using pip:

pip install pandas numpy scikit-learn matplotlib

This tutorial assumes you have some basic understanding of machine learning concepts like regression, training/testing splits, and model evaluation. However, we’ll explain key concepts as we progress, so even if you’re relatively new to machine learning, you should be able to follow along.

For those who want to brush up on the individual libraries before combining them, these resources may help:

The Data Science Pipeline

In data science projects, we typically follow a sequential workflow where data flows through different stages of processing. Each of our three libraries serves a specific purpose in this pipeline:

Pandas acts as our initial data handler, excelling at:
- Reading data from various sources (CSV, Excel, SQL)
- Exploring and summarizing dataset characteristics
- Cleaning messy data and handling missing values
- Transforming and reshaping data structures

NumPy functions as our numerical computation engine:
- Providing efficient array operations
- Enabling vectorized mathematical operations
- Supporting scientific computing functions
- Offering linear algebra operations

scikit-learn serves as our modeling toolkit:
- Preprocessing data with consistent APIs
- Building machine learning models
- Evaluating model performance
- Creating prediction pipelines

The elegance of this trio lies in their compatibility. Pandas DataFrames can be easily converted to NumPy arrays, which are the standard input format for scikit-learn models. This seamless data flow allows us to transition between descriptive analysis, numerical computation, and predictive modeling without friction.
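To make this hand-off concrete, here is a minimal sketch of the three stages on a tiny, made-up table. The column names `a`, `b`, and `y` and all values are assumptions chosen so the result is easy to check; they are not part of the concrete dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: y is exactly 2*a + 3*b, so the fit is easy to verify.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 1.0, 4.0, 3.0, 6.0],
})
df["y"] = 2 * df["a"] + 3 * df["b"]

# Pandas: select columns by name.
features = df[["a", "b"]]
target = df["y"]

# NumPy: the same data as plain arrays.
X = features.to_numpy()
y = target.to_numpy()

# scikit-learn: fit a model on the arrays.
model = LinearRegression().fit(X, y)

print(type(features).__name__, "->", type(X).__name__)
print("Coefficients:", np.round(model.coef_, 6))
```

Because the target was built as an exact linear combination, the fitted coefficients should come out very close to 2 and 3, which is a quick sanity check that nothing was lost on the way from DataFrame to array to model.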

Loading and Exploring Data with Pandas

Let’s begin by loading our concrete compressive strength dataset using Pandas. This dataset contains information about concrete mixtures and their resulting strength measurements.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset (reading .xls files with Pandas requires the xlrd package)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls"
concrete_data = pd.read_excel(url)

# Display the first few rows and check for missing values
print(concrete_data.head())
print(f"Dataset shape: {concrete_data.shape}")
print(f"Missing values: {concrete_data.isnull().sum().sum()}")

When you run this code, you’ll see the first five rows of our dataset with columns representing different concrete ingredients and the resulting compressive strength:

Sample rows from the concrete compressive strength dataset

Dataset shape: (1030, 9)
Missing values: 0

The dataset contains 1030 samples with 8 features that influence concrete strength. The target variable is the concrete compressive strength measured in megapascals (MPa).

Let’s visualize the relationship between cement (a primary ingredient) and compressive strength:

# Plot the first column (cement) against the last column (compressive strength)
plt.figure(figsize=(10, 6))
plt.scatter(concrete_data.iloc[:, 0], concrete_data.iloc[:, -1])
plt.xlabel('Cement (kg/m³)')
plt.ylabel('Compressive Strength (MPa)')
plt.title('Cement vs. Compressive Strength')
plt.grid(True)
plt.show()

Scatter plot showing the relationship between cement content and concrete compressive strength

This scatter plot shows a positive correlation between cement content and compressive strength, which aligns with engineering knowledge.

We can also use Pandas to create a correlation matrix to identify relationships between variables:

# Calculate correlation matrix
correlation_matrix = concrete_data.corr()

# Display the correlation of each feature with the target variable (last column)
print("Correlation with Compressive Strength:")
print(correlation_matrix.iloc[-1, :-1].sort_values(ascending=False))

Correlation coefficients for each concrete ingredient with compressive strength

This analysis reveals which ingredients have the strongest relationships with concrete strength. Understanding these relationships will help us interpret our machine learning models later. Pandas makes these initial data exploration steps straightforward, allowing us to quickly gain insights before moving to more advanced analysis.
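As a small cross-check of how Pandas and NumPy agree here, the sketch below computes the same Pearson coefficient both ways on a made-up table. The column names and values are illustrative assumptions, not rows from the concrete dataset, and the columns are selected by label rather than by position.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the concrete dataset.
toy = pd.DataFrame({
    "cement": [200.0, 250.0, 300.0, 350.0, 400.0],
    "water": [190.0, 185.0, 180.0, 170.0, 165.0],
    "strength": [20.0, 27.0, 33.0, 42.0, 47.0],
})

# Pandas: Pearson correlation of every feature with the target column.
corr_with_target = toy.corr()["strength"].drop("strength")

# NumPy: the same coefficient for one feature, computed from raw arrays.
r_cement = np.corrcoef(toy["cement"].to_numpy(), toy["strength"].to_numpy())[0, 1]

print(corr_with_target.sort_values(ascending=False))
print(f"np.corrcoef for cement: {r_cement:.4f}")
```

Selecting the target by name is a little more robust than `iloc[-1, :-1]` if the column order ever changes, at the cost of hard-coding the column label.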

Data Preparation and Transformation

After exploring our dataset, let’s prepare it for machine learning by transforming our Pandas DataFrame into NumPy arrays suitable for scikit-learn models.

# Split the data into features (X) and target variable (y)
X = concrete_data.iloc[:, :-1]  # Features: all columns except the last
y = concrete_data.iloc[:, -1]   # Target: only the last column

# Here we transition from Pandas to NumPy by converting DataFrames to arrays
# This is a key integration point between the two libraries
X_array = X.values  # Pandas DataFrame → NumPy array
y_array = y.values  # Pandas Series → NumPy array
print(f"Type before conversion: {type(X)}")
print(f"Type after conversion: {type(X_array)}")

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_array, y_array, test_size=0.2, random_state=42)
print(f"Training set shape: {X_train.shape}")

The output would be:

Type before conversion: <class 'pandas.core.frame.DataFrame'>
Type after conversion: <class 'numpy.ndarray'>
Training set shape: (824, 8)

This section highlights the first key integration point in our workflow: how Pandas DataFrames can be converted to NumPy arrays using the .values attribute. While scikit-learn can actually work directly with Pandas DataFrames (it will convert them internally), understanding this transition helps illustrate how these libraries were designed to work together. The NumPy array format is the ‘common language’ that enables efficient numerical computations and allows for seamless integration with scikit-learn’s algorithms.

Building Machine Learning Models with scikit-learn

Now let’s build and evaluate machine learning models using our processed data:

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Train a linear regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Train a random forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
lr_predictions = lr_model.predict(X_test)
rf_predictions = rf_model.predict(X_test)

# Evaluate models
models = ["Linear Regression", "Random Forest"]
predictions = [lr_predictions, rf_predictions]
for model_name, pred in zip(models, predictions):
    mse = mean_squared_error(y_test, pred)
    r2 = r2_score(y_test, pred)
    print(f"{model_name}:")
    print(f"  Mean Squared Error: {mse:.2f}")
    print(f"  R² Score: {r2:.2f}")

The above block of code should output:

Linear Regression:
  Mean Squared Error: 95.98
  R² Score: 0.63
Random Forest:
  Mean Squared Error: 30.36
  R² Score: 0.88

This section highlights the second key integration point: feeding NumPy arrays directly into scikit-learn models. Notice how scikit-learn’s consistent API seamlessly accepts our NumPy arrays without requiring any further conversion. This integration enables us to switch between different machine learning algorithms (like linear regression and random forest) while using the same preprocessed data.

The results show a significant performance difference between the two models. The Random Forest achieves an R² score of 0.88, much better than Linear Regression’s 0.63, and reduces the mean squared error by more than two-thirds (from 95.98 to 30.36). This substantial improvement suggests that the relationship between concrete ingredients and strength is non-linear, which the Random Forest can capture but the Linear Regression cannot.

The ability to quickly compare different algorithms is a major advantage of scikit-learn’s unified interface – we can change models with just a few lines of code while keeping the rest of our workflow intact. This flexibility is made possible by the seamless integration between NumPy arrays and scikit-learn’s algorithms.
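One way to lean on that unified interface is to keep candidate estimators in a dictionary and score each with the same call. The sketch below does this with 5-fold cross-validation instead of a single train/test split. The data is synthetic and deliberately non-linear (an assumption for illustration, not the concrete dataset), and the names `candidates`, `X_demo`, and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic non-linear data, used only to illustrate the comparison loop.
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = X_demo[:, 0] ** 2 + np.sin(X_demo[:, 1]) + rng.normal(0, 0.1, size=300)

# Every estimator exposes the same fit/predict interface,
# so swapping models means changing one dictionary entry.
candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="r2")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean R² = {scores.mean():.2f} (std {scores.std():.2f})")
```

Averaging over several folds gives a steadier comparison than one split, although the exact scores will still depend on the data and the random seed.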

Case Study: Adding Domain Knowledge

Finally, let’s improve our model by incorporating domain knowledge about concrete:

# Use NumPy's efficient arithmetic operations to create domain-specific features
cement_water_ratio = X_train[:, 0] / X_train[:, 3]  # Cement / Water ratio
cement_water_ratio_test = X_test[:, 0] / X_test[:, 3]

# Add this new feature to our feature matrices using NumPy's array manipulation
X_train_enhanced = np.column_stack((X_train, cement_water_ratio))
X_test_enhanced = np.column_stack((X_test, cement_water_ratio_test))

# Train a model with the enhanced features
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X_train_enhanced, y_train)
predictions = model.predict(X_test_enhanced)
print("Model with domain knowledge:")
print(f"  R² Score: {r2_score(y_test, predictions):.2f}")

# Visualize results: points on the dashed red diagonal are perfect predictions
plt.figure(figsize=(8, 6))
plt.scatter(y_test, predictions, alpha=0.5)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'r--')
plt.xlabel('Actual Strength (MPa)')
plt.ylabel('Predicted Strength (MPa)')
plt.title('Predicted vs Actual Concrete Strength')
plt.grid(True)
plt.show()

This final example demonstrates the full power of integrating all three libraries. We start with data prepared using Pandas, then leverage NumPy’s vectorized operations to efficiently create a domain-specific feature (cement-to-water ratio) that engineers recognize as important for concrete strength. NumPy’s array manipulation functions like column_stack allow us to seamlessly combine our original features with this new engineered feature.

Model with domain knowledge:
  R² Score: 0.89

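Our enhanced model achieves an R² score of 0.89, slightly above the Random Forest model’s 0.88. Keep in mind that this comparison changes two things at once, the added feature and the algorithm (gradient boosting instead of a random forest), so the gain cannot be attributed to the cement-to-water ratio alone. The visualization shows a strong correlation between predicted and actual strength values across the entire range, with points clustering closely around the reference diagonal line.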

Scatter plot comparing predicted and actual concrete strength values
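The cement-to-water ratio above is a plain element-wise division, which would produce `inf` for any row with zero water. The sketch below is one possible way to guard against that with `np.divide` and its `out` and `where` arguments. The helper `add_ratio_feature`, its parameters, and the sample rows are hypothetical and only illustrate the idea.

```python
import numpy as np

def add_ratio_feature(X, numerator_col, denominator_col, fill_value=np.nan):
    """Return a copy of X with one extra column holding
    X[:, numerator_col] / X[:, denominator_col].

    Rows whose denominator is zero receive fill_value instead of inf.
    """
    X = np.asarray(X, dtype=float)
    numerator = X[:, numerator_col]
    denominator = X[:, denominator_col]
    ratio = np.full(numerator.shape, fill_value, dtype=float)
    # Divide only where the denominator is non-zero; other entries keep fill_value.
    np.divide(numerator, denominator, out=ratio, where=denominator != 0)
    return np.column_stack((X, ratio))

# Hypothetical rows laid out as [cement, slag, fly ash, water].
X_small = np.array([
    [300.0, 0.0, 0.0, 150.0],
    [200.0, 50.0, 0.0, 200.0],
    [250.0, 0.0, 20.0, 0.0],  # zero water: the ratio is undefined
])
X_small_enhanced = add_ratio_feature(X_small, numerator_col=0, denominator_col=3)
print(X_small_enhanced[:, -1])
```

Many estimators do not accept NaN values, so rows flagged this way would still need to be imputed or dropped before fitting.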

This complete workflow—from Pandas to NumPy to scikit-learn—demonstrates why these libraries form the foundation of so many data science projects. Each library excels at specific tasks: Pandas for data handling, NumPy for numerical operations, and scikit-learn for machine learning. When combined, they create a powerful toolkit that allows data scientists to quickly iterate from raw data to accurate predictions.

By understanding how these libraries work together and where they integrate, you can build more efficient and effective machine learning solutions. The addition of domain knowledge through feature engineering further shows how human expertise combined with these tools can lead to superior results.

Extensions and Summary

In this tutorial, we’ve explored how to combine Pandas, NumPy, and scikit-learn to create an effective machine learning workflow:

- We used Pandas to load, explore, and clean our concrete dataset
- We leveraged NumPy for efficient numerical operations and feature transformations
- We built predictive models with scikit-learn’s consistent API

This integration allows us to harness the strengths of each library: Pandas for data manipulation, NumPy for numerical computations, and scikit-learn for machine learning algorithms.

To extend this workflow further, consider exploring:

- scikit-learn’s Pipeline API for streamlined workflows
- Feature selection techniques to identify the most important concrete ingredients
- Ensemble techniques like Random Forest, which we demonstrated
- Cross-validation methods to ensure model robustness
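As a starting point for the first and last of these ideas, the sketch below wraps the ratio feature and a gradient boosting model in a scikit-learn `Pipeline` and scores it with 5-fold cross-validation. The helper `append_ratio`, the step names, and the synthetic data are assumptions for illustration; on the real dataset you would pass the feature array and target from the earlier sections instead.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def append_ratio(X):
    """Append column 0 divided by column 3 (cement / water in the tutorial's column order)."""
    X = np.asarray(X, dtype=float)
    return np.column_stack((X, X[:, 0] / X[:, 3]))

# The feature step runs inside the pipeline, so it is applied
# identically to every training and validation fold.
pipeline = Pipeline([
    ("ratio", FunctionTransformer(append_ratio)),
    ("model", GradientBoostingRegressor(n_estimators=100, random_state=42)),
])

# Synthetic stand-in data with 8 positive-valued features (not the real dataset).
rng = np.random.default_rng(0)
X_synth = rng.uniform(100, 500, size=(200, 8))
y_synth = 20 * X_synth[:, 0] / X_synth[:, 3] + rng.normal(0, 1, size=200)

pipeline_scores = cross_val_score(pipeline, X_synth, y_synth, cv=5, scoring="r2")
print(f"Mean cross-validated R²: {pipeline_scores.mean():.2f}")
```

Keeping the feature engineering inside the pipeline means there is a single object to fit, evaluate, and reuse for new data, rather than separate train and test transformations to keep in sync.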

By learning how these libraries work together, you’ll be able to tackle a wide range of data science and machine learning problems efficiently.
