Building A Simple Linear Regression Model With Scikit-Learn

by Tanisha.Digital | Gen AI Adventures | January 2025

Linear regression is one of the simplest and most widely used machine learning algorithms for predicting a continuous target variable. In this guide, we’ll walk through the basics of building a linear regression model using Scikit-Learn, a powerful Python library for machine learning.

Scikit-Learn provides an end-to-end framework to implement a machine learning pipeline, which includes steps such as splitting datasets, preprocessing data, selecting models, and evaluating results. Let’s dive into these steps in the context of creating a linear regression model.

Preprocessing > Model Selection > Splitting Dataset > Build Model > Evaluation
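
These steps can also be chained into a single estimator with Scikit-Learn's Pipeline class. Here is a minimal sketch of the model we will build in this guide (it assumes the X_train and y_train arrays created in the dataset-splitting step below):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

# Chain scaling and regression into one estimator.
# X_train and y_train are created in the dataset-splitting step later on.
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("regressor", LinearRegression()),
])
pipe.fit(X_train, y_train)

The sections below walk through each of these stages one at a time.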

Preprocessing ensures that your data is clean and ready for modeling. Common techniques include:

  1. Normalizing and Scaling: Ensures features are on the same scale.
  2. Encoding Categorical Variables: Converts categorical data into numerical format (see the encoding sketch after the scaling snippet below).

For scaling:

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training data only, then apply the same
# transformation to the test data to avoid data leakage
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
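
For encoding categorical variables, a one-hot encoder is a common choice. Below is a minimal sketch, assuming a hypothetical categorical column named "city" in a DataFrame-style X_train:

from sklearn.preprocessing import OneHotEncoder

# One-hot encode a hypothetical "city" column; handle_unknown="ignore"
# keeps transform from failing on categories unseen during training
encoder = OneHotEncoder(handle_unknown="ignore")
city_train_encoded = encoder.fit_transform(X_train[["city"]])
city_test_encoded = encoder.transform(X_test[["city"]])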

For more details, check out this post on pre-processing data for machine learning.

Model selection is a critical step in building an effective machine learning pipeline. It involves choosing the right type of model for your problem, validating the model’s performance, and optimizing its parameters. Here’s how to approach it:

Types of Models

The choice of a model depends on the nature of your target variable:

  • Regression Models: Used when the target variable is continuous. For example, predicting house prices. Common models include Linear Regression, Ridge Regression, and Decision Trees.
  • Classification Models: Used when the target variable is categorical. For example, predicting whether an email is spam or not. Examples include Logistic Regression, Support Vector Machines, and Random Forests.

How to Select a Model

Selecting the right model involves understanding the problem you’re solving:

  1. Data Characteristics: If you have a small dataset, simpler models like Linear Regression or Logistic Regression might work best. For larger datasets, you can experiment with more complex models like Random Forests or Gradient Boosting.
  2. Problem Complexity: If the relationships in your data are linear, Linear Regression or Logistic Regression might suffice. For non-linear relationships, consider models like Decision Trees or Neural Networks.
  3. Interpretability: If explainability is important, simpler models like Linear Regression or Decision Trees are preferable.
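
One practical way to compare candidate models against these criteria is cross-validation, which was mentioned above as part of validating performance. Here is a small sketch, assuming X and y are already loaded, that scores Linear Regression and Ridge Regression side by side:

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Compare two candidate regression models with 5-fold cross-validation,
# using R-squared as the scoring metric (X and y are assumed to exist)
for candidate in (LinearRegression(), Ridge(alpha=1.0)):
    scores = cross_val_score(candidate, X, y, cv=5, scoring="r2")
    print(type(candidate).__name__, scores.mean())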

Before building the model, divide your data into training, testing, and (optionally) validation sets.

  • Training Set: Used to train the model.
  • Testing Set: Evaluates the model’s performance.
  • Validation Set (optional): Fine-tunes the model parameters.

To split the dataset:

from sklearn.model_selection import train_test_split 

# Example split: 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Here, X represents the independent variables (features), and y is the dependent variable (target).
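
If you also want the optional validation set, a common pattern is a second call to train_test_split on the training portion. A sketch, assuming the 80/20 split above and aiming for roughly 60/20/20 overall:

# Carve a validation set out of the training data (0.25 of 80% = 20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)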

Scikit-Learn simplifies the process of building machine learning models with pre-built classes. To build a linear regression model:

  1. Import the Class: Import the LinearRegression class.
  2. Create an Instance: Initialize the model.
  3. Fit the Model: Train the model on the training dataset.

Example:

from sklearn.linear_model import LinearRegression  

# Create an instance of the Linear Regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train_scaled, y_train)
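
Once the model is fitted, the learned coefficients and intercept are available as attributes and are worth a quick look:

# The learned slope for each feature and the intercept term
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)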

After training the model, evaluate its performance using appropriate metrics: mean squared error or R-squared for regression, or accuracy for classification. Here’s a complete guide to evaluation metrics for regression and classification models.

For linear regression, R-squared is often used to assess how well the model explains the variance in the target variable. Example:

from sklearn.metrics import r2_score  

# Predict on the test set
y_pred = model.predict(X_test_scaled)

# Evaluate the model
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2}")
