
Hello everyone! I hope you’re all doing well. Lately, I’ve been working on some university projects and continuing my learning journey. Today, I’d like to talk to you about Scikit-learn (also known as Sklearn), a powerful machine learning library in Python.
For those who might not be deeply familiar with mathematics or want to implement machine learning more easily, Scikit-learn is an excellent choice. It simplifies the process of building and applying machine learning models and is highly efficient for most tasks.
I’m preparing a presentation about Scikit-learn for my university, and I’d like to share it with you first to get some feedback, hehe. So without further ado, let’s dive in!
I’ve been using Scikit-learn for a while now. I first used it during my first semester for our multi-domain price estimator project, where we implemented linear regression and other algorithms. Most of us had no prior knowledge of these concepts, but our group leader, Mudassir, knew a bit more than us. Mudassir, being the nerdy and smart one, had a solid grasp of math. Despite this, he chose to use Scikit-learn, which made things a lot easier for us.
Even with Scikit-learn, it took me a considerable amount of time to understand everything because it was my first time doing machine learning. Before that, I had only heard about it. The process involved reading data with Pandas, preprocessing it, using train-test split, and then applying the model. Initially, it was all over my head, especially since I came from a pre-med background (a running joke among my friends 🤡). But I kept trying to understand it, and finally, I got it.
Here are the results of our models for each category:
- Cars: Approximately 65% accuracy
- Mobiles: Around 70% accuracy
- Laptops: About 60% accuracy
- Home: Approximately 65% or lower accuracy
I know these results aren’t stellar, but it was our first attempt, so we were all quite happy. Despite messing up our presentation and ending up in third place, the experience was fun (even though we were a bit scattered). Thanks to Mudassir, we gained some valuable insights.
Okay, enough about that! Let’s get back to the main topic. But I guess that was kind of related too, hehe.
Scikit-learn (imported in code as sklearn) is a machine learning library built for Python (apologies to enthusiasts of other programming languages, as this one is Python-exclusive). It covers a variety of machine learning algorithms and techniques, along with tools for data preprocessing and data splitting, and it works with your own custom datasets too.
It is widely favored for its user-friendly nature and comprehensive documentation, making it accessible even to those new to machine learning.
Before delving deeper into Scikit-learn, let’s briefly discuss machine learning for those who may not be familiar with it. Machine learning is a subfield of artificial intelligence (AI) that focuses on teaching machines to perform tasks or make predictions based on data. Instead of following hand-written rules, the computer learns patterns from examples and then applies them to new data, which makes it useful for automating repetitive tasks and for problems that involve heavy computation.
There are three main types of machine learning:
- Supervised Learning: Involves learning with labeled data. The algorithm learns from input-output pairs to predict outcomes for new, unseen data.
- Unsupervised Learning: Involves learning without labeled data. The algorithm discovers patterns and structures in data without explicit guidance.
- Reinforcement Learning: Involves learning by interaction with an environment. The algorithm learns to make decisions by receiving feedback and rewards, similar to how humans and animals learn.
Reinforcement learning is similar to how babies learn by exploring and interacting with their surroundings.
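To make the difference between the first two types a bit more concrete, here is a minimal sketch on the built-in Iris dataset. The choice of LogisticRegression and KMeans is my own assumption for illustration, not something from the project described earlier. Scikit-learn does not ship reinforcement learning algorithms, so that type is left out.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: the model sees both the inputs (X) and the labels (y).
supervised = LogisticRegression(max_iter=1000)
supervised.fit(X, y)
supervised_pred = supervised.predict(X[:5])

# Unsupervised: the model sees only the inputs and groups them on its own.
unsupervised = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = unsupervised.fit_predict(X)

print("Predicted labels:", supervised_pred)
print("First cluster ids:", cluster_ids[:5])
```

Notice that KMeans never receives y. The cluster ids it returns are just group numbers, so they will not necessarily match the real label numbers.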
OK, I hope that gave you a good short explanation of machine learning. Let’s go further!
Now, the question arises: why should we use Scikit-learn, and is it worth it? In my opinion, yes, it is. Many people find it highly beneficial because it’s a versatile tool that’s easy to use, efficient, reliable, and offers a wide range of algorithms. From simple linear regression to neural networks in the form of multi-layer perceptrons (although I personally haven’t used neural networks in Scikit-learn, as I prefer TensorFlow for that), Scikit-learn covers a broad spectrum of machine learning tasks.
Additionally, Scikit-learn is well-documented, making it easier for users to understand and implement.
So, if you’re considering which tool to use for your machine learning projects, Scikit-learn is definitely worth considering.
Now, let me provide you with a quick guide on how to use it, along with some easy-to-implement code.
Before using Scikit-learn, you need to install it. Python is a prerequisite, so make sure it is installed on your system first.
To install Scikit-learn, run the following command in your VS Code terminal or command line:
pip install scikit-learn
Note that the package name on PyPI is scikit-learn. The shortcut pip install sklearn that I used to type is deprecated and no longer works, even though the import name in code is still sklearn.
pip (short for “Pip Installs Packages”) is Python’s package manager, and it comes bundled with recent Python installers.
Once installed, you’re ready to start using Scikit-learn for your machine learning projects.
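A quick way to check that the installation worked is to import the package and print its version. This is only a small sanity check, and the version number you see will depend on your setup.

```python
import sklearn

# If this prints a version number without an ImportError,
# Scikit-learn is installed in the current Python environment.
print(sklearn.__version__)
```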
After installing Scikit-learn, you can start coding. First, you need to import Scikit-learn into your code. Instead of importing the whole package with just one line (import sklearn), I prefer to import only the specific modules I need. Here’s an example:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
Let’s break down what each line does:
- Line 1: Imports the datasets module from Scikit-learn. This allows us to access built-in datasets for practice or testing.
- Line 2: Imports the train_test_split function from the model_selection module. This function is crucial for splitting our data into training and testing sets. It ensures that our model is trained on one set of data and tested on another, which is essential for evaluating its performance on new data.
- Line 3: Imports the SVC (Support Vector Classifier) algorithm from the svm module. Here, SVC is used for classification tasks. Scikit-learn also provides algorithms for regression tasks and other machine learning tasks.
- Line 4: Imports the accuracy_score function from the metrics module. This function calculates the accuracy of our model predictions, helping us understand how well our model performs on the test data.
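Since regression came up above, here is a minimal sketch of what a regression model looks like in Scikit-learn. The data is made up by me so that it follows y = 3x + 2 exactly, and LinearRegression is my own pick for the example. The point is that the fit() and predict() pattern is the same as for classifiers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 3x + 2 exactly (for illustration only)
X = np.arange(10, dtype=float).reshape(-1, 1)  # one feature, ten samples
y = 3 * X.ravel() + 2

reg = LinearRegression()
reg.fit(X, y)

# Predict the target for a new, unseen input
prediction = reg.predict([[10.0]])
print("slope:", reg.coef_[0])
print("intercept:", reg.intercept_)
print("prediction for x=10:", prediction[0])
```

Because the data is perfectly linear, the learned slope and intercept should come out very close to 3 and 2.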
It’s important to note that when using your own data, you’ll typically need to preprocess it. This involves tasks like reading the data with Pandas, handling missing values, and scaling numerical features. These steps ensure that your data is ready for training and testing with machine learning models.
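Here is a rough sketch of those preprocessing steps. The column names and values are hypothetical, and I build the DataFrame by hand so the example runs on its own; with real data you would typically start from something like pd.read_csv on your own file. SimpleImputer and StandardScaler are just one reasonable choice for filling gaps and scaling.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical data with missing values (NaN)
df = pd.DataFrame({
    "price": [100.0, 150.0, np.nan, 200.0],
    "age_years": [1.0, 3.0, 2.0, np.nan],
})

# Fill each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(df)

# Scale each column to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled = scaler.fit_transform(filled)

print(filled)
print(scaled)
```

In a real project, it is safer to fit the imputer and scaler on the training set only and then apply them to the test set, so that no information from the test data leaks into training.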
Remember, learning machine learning is a journey of continuous improvement. Take small steps every day and gradually build your understanding and skills.
Here’s the code example. I’ve included comments to make it easier to understand. We’re using a small, preprocessed default dataset for simplicity. Remember, collecting and preprocessing data is a critical part of machine learning, and Scikit-learn can help with that too. Let me explain this code step by step:
# Import necessary modules and functions
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Load the Iris dataset from Scikit-learn and assign it to the variable 'iris'
iris = datasets.load_iris()
# Split the dataset into features (X) and targets (y)
X, y = iris.data, iris.target
# Split X and y into training and testing sets, with 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the Support Vector Classifier (SVC) model
clf = SVC()
# Train the model using the training sets
clf.fit(X_train, y_train)
# Predict the labels of test data
y_pred = clf.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
# Print the accuracy score
print(f"Accuracy: {accuracy:.2f}")
Explanation:
- iris = datasets.load_iris(): We load the Iris dataset from Scikit-learn and assign it to the variable iris.
- X, y = iris.data, iris.target: We split the dataset into X (features) and y (targets). X contains the data we use to predict y, which contains the labels we want to predict.
- train_test_split(X, y, test_size=0.2, random_state=42): We split X and y into training (X_train, y_train) and testing (X_test, y_test) sets. Here, test_size=0.2 indicates that 20% of the data is reserved for testing, while the rest is used for training. random_state=42 ensures reproducibility by setting a random seed.
- clf = SVC(): We initialize a Support Vector Classifier (SVC) model named clf.
- clf.fit(X_train, y_train): We train the model (clf) using the training data (X_train, y_train) with the fit() method. This step is where the model learns from the data.
- y_pred = clf.predict(X_test): We use the trained model (clf) to predict the labels (y_pred) of the test data (X_test).
- accuracy = accuracy_score(y_test, y_pred): We calculate the accuracy of our model’s predictions using accuracy_score(), comparing the predicted labels (y_pred) with the actual labels (y_test).
- print(f"Accuracy: {accuracy:.2f}"): Finally, we print the accuracy score to evaluate how well our model performed on the test data.
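If you want to see what test_size and random_state actually do, a small check like the one below can help. It is only a sketch that reuses the same Iris split and prints the sizes of the resulting sets, then repeats the split with the same seed to show that it comes out identical.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Iris has 150 samples, so test_size=0.2 keeps 30 for testing and 120 for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)

# Splitting again with the same random_state gives the same test set
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = (X_test == X_test_again).all()
print("Same split both times:", same_split)
```

If you remove random_state, the split changes on every run, and your accuracy can change a little with it.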
Accuracy is the fraction of predictions that are correct, so it always falls between 0.00 and 1.00, and higher values indicate better performance. A score of 1.00 (or 100%) means every test sample was classified correctly, which is often achievable with small, clean, and straightforward datasets like the Iris dataset but is rare on messy real-world data.
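Accuracy is simply the number of correct predictions divided by the total number of predictions. The sketch below uses made-up labels (not results from any real model) to compute it by hand and compare it with accuracy_score.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions: 4 of the 5 predictions match
y_true = np.array([0, 1, 2, 2, 1])
y_hat = np.array([0, 1, 2, 1, 1])

# Accuracy by hand: correct predictions divided by total predictions
manual = (y_true == y_hat).sum() / len(y_true)

# The same number from Scikit-learn
library = accuracy_score(y_true, y_hat)

print(f"Manual: {manual:.2f}, accuracy_score: {library:.2f}")
```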
I hope this explanation helps you understand how this code works and its components!
It’s quite evident that Scikit-learn has several advantages:
- Ease of Use: Scikit-learn is known for its user-friendly interface, making it easy for beginners to start with machine learning projects.
- Comprehensive Tools: It offers a wide range of machine learning algorithms and utilities, covering tasks from preprocessing data to model evaluation.
- Strong Documentation: Scikit-learn provides well-documented resources that help users understand its functionalities and use them effectively.
These strengths make Scikit-learn a preferred choice for many developers and data scientists when implementing machine learning solutions.
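As a small illustration of how the preprocessing and evaluation tools fit together, here is a sketch that chains a scaler and a classifier into one pipeline and scores it with 5-fold cross-validation. The specific combination (StandardScaler, SVC, cv=5) is my own assumption for the example, not a recommendation for every dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Scaling and the classifier are bundled so they are always applied together
model = make_pipeline(StandardScaler(), SVC())

# Train and evaluate on 5 different train/test splits
scores = cross_val_score(model, X, y, cv=5)

print("Accuracy per fold:", scores)
print(f"Mean accuracy: {scores.mean():.2f}")
```

Cross-validation gives a more stable estimate than a single train-test split, because every sample gets used for testing exactly once.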
Like everything else, Scikit-learn also has its disadvantages:
- Limited to Classical Machine Learning: Scikit-learn primarily focuses on traditional machine learning algorithms. It may not have the latest advancements in deep learning or other cutting-edge techniques.
- Performance Limitations: While Scikit-learn is efficient for many tasks, its performance may not scale well with extremely large datasets or complex models compared to specialized libraries.
- Python Dependency: Being a Python library, Scikit-learn’s functionalities and optimizations are inherently tied to Python. This can sometimes limit integration with other programming languages or environments.
These factors are important to consider when choosing tools for machine learning projects, especially if you require capabilities beyond traditional machine learning algorithms.
Well, that’s all for now. I hope you found this helpful. If you did, please consider supporting me. Thank you for reading and for your support!
There’s still much more to learn, but I hope this gave you a good understanding. I recommend checking out the Scikit-learn documentation for more resources. Feel free to ask any questions you may have. Once again, thank you, and see you next time! 😊
Regards, Abdul Rauf Jatoi