Data science was originally known as statistical analysis before it got its name, as that was the primary method for extracting information from data. With recent advances in technology, machine learning models have been introduced, expanding our ability to analyze and understand data. There are many machine learning models available, but you don’t need to learn them all. The most important thing is to understand how these tools can help you.
In this 7-part crash course, you will learn through examples how to carry out a data science project using the most common machine learning models. This mini-course focuses on the core aspects of data science. It assumes that you have already gathered and prepared the data for use. This course is designed for practitioners who are comfortable with Python programming and eager to learn about common data science tools such as pandas and scikit-learn. While a machine learning engineer’s goal is to develop models, a data scientist’s objective is to interpret data using machine learning models as tools. You will explore how these tools can assist in deriving meaningful insights and making data-driven decisions.
Let’s get started!

Next-Level Data Science (7-day Mini-Course)
Photo by geraldo stanislas. Some rights reserved.
Who Is This Mini-Course For?
Before we begin, let’s make sure you’re in the right place. The list below provides general guidelines on who this course is designed for. Don’t worry if you don’t match these points exactly—you might just need to brush up on certain areas to keep up.
- Developers with some coding experience. You should be comfortable writing Python code and setting up your development environment (a prerequisite). You don’t need to be an expert coder, but you should be able to install packages and write scripts without hesitation.
- Developers with basic machine learning knowledge. You should have a general understanding of machine learning models and feel comfortable using them. You don’t need to be an expert in every model, but you should be able to recognize their strengths and weaknesses.
- Developers familiar with data science tools. If you’ve used Jupyter notebooks or worked with data in Python, that’s a plus. Libraries like pandas can make handling data easier. You don’t need to be an expert in any specific library, but you should be comfortable using different tools and writing code to manipulate data.
This mini-course is not a textbook on data science. Instead, it serves as a project-based guide that takes you step by step from a developer with minimal experience to one who can confidently demonstrate how to complete a data science project.
Mini-Course Overview
This mini-course is divided into 7 parts.
Each lesson was designed to take the average developer about 30 minutes. You might finish some much sooner; for others, you may choose to go deeper and spend more time.
You can complete each part as quickly or as slowly as you like. A comfortable schedule may be to complete one lesson per day over seven days. Highly recommended.
The topics you will cover over the next 7 lessons are as follows:
- Lesson 1: Getting the Data
- Lesson 2: Find the Numeric Columns for Linear Regression
- Lesson 3: Performing Linear Regression
- Lesson 4: Interpreting Factors
- Lesson 5: Feature Selection
- Lesson 6: Decision Tree
- Lesson 7: Random Forest and Probability
This is going to be a lot of fun.
You’ll have to do some work, though: a little reading, research and programming. You want to learn how to finish a data science project, right?
Post your results in the comments; I’ll cheer you on!
Hang in there; don’t give up.
Lesson 01: Getting the Data
The dataset we will use for this mini-course is the “All Countries Dataset” available on Kaggle.
This dataset describes almost all countries’ demographic, economic, geographic, health, and political data. The most well-known dataset of this type would be the CIA World Fact Book. Scraping the World Fact Book should give you more comprehensive and up-to-date data. However, using this dataset in CSV format saves you the trouble of building a web scraper.
After downloading the dataset from Kaggle (you may need to sign up for an account to do so), you will find the CSV file `All Countries.csv`. Let’s check this dataset with pandas.
```python
import pandas as pd

df = pd.read_csv("All Countries.csv")
df.info()
```
The above code will print a table to the screen, like the following:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194 entries, 0 to 193
Data columns (total 64 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   country             194 non-null    object
 1   country_long        194 non-null    object
 2   currency            194 non-null    object
 3   capital_city        194 non-null    object
 4   region              194 non-null    object
 5   continent           194 non-null    object
 6   demonym             194 non-null    object
 7   latitude            194 non-null    float64
 8   longitude           194 non-null    float64
 9   agricultural_land   193 non-null    float64
...
 62  political_leader    187 non-null    object
 63  title               187 non-null    object
dtypes: float64(48), int64(6), object(10)
memory usage: 97.1+ KB
```
In the above, you see the basic information of the dataset. For example, at the top, you can see that there are 194 entries (rows) in this CSV file. The table also tells you there are 64 columns (indexed 0 to 63). Some columns are numeric, such as latitude, and some are not, such as capital_city. The data type “object” in pandas usually means a string type. You can also see that some values are missing: for example, agricultural_land has only 193 non-null values out of 194 entries, meaning one row is missing a value in this column.
Let’s look at the dataset in more detail, for example by taking the first five rows as a sample:
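The code for this step is not reproduced above; a minimal sketch using the standard pandas `head()` call (the same function referenced in the task below) would be:

```python
# Show the first five rows of the DataFrame
print(df.head(5))
```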
This will show you the first five rows of the dataset in a tabular form.
Your Task
This is the basic exploration of a dataset. But using the `head()` function may not always be appropriate (e.g., when the input data are sorted). There is also a `tail()` function for a similar purpose. However, running `df.sample(5)` is usually more helpful, as it randomly samples 5 rows. Try this function. Also, as you can see from the above output, the columns are clipped to the screen width. How would you modify the above code to show all columns from the sample?
Hint: There is a `to_string()` function in pandas, and you can also adjust the general print option `display.max_columns`.
In the next lesson, you will see how to prepare your data for linear regression.
Lesson 02: Find the Numeric Columns for Linear Regression
Let’s jump into one of the most basic tasks: predicting a country’s GDP based on other factors using linear regression. But before you use the data, it is important to ensure that no bad data is involved. For example, if you’re using linear regression, all numbers must be valid so that addition and multiplication are possible. This means `NaN` (“not a number”) or infinity should not be present. Often, `NaN` is used to denote a missing value.
Filling in missing values in a dataset is easy. For example, in pandas, you can replace all missing values (`NaN`) with zero:
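The original code block is not reproduced here; a minimal sketch using pandas’ standard `fillna()` call would be:

```python
# Replace every missing value in the DataFrame with zero
df = df.fillna(0)
```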
But why use zero? The best value to fill in depends on the column. Sometimes, a predefined value is appropriate, while other times, using the average of other non-missing values is a better approach.
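As a purely illustrative example (not from the original), you could fill a numeric column with the mean of its non-missing values:

```python
# Hypothetical illustration: fill missing agricultural_land values with the column mean
df["agricultural_land"] = df["agricultural_land"].fillna(df["agricultural_land"].mean())
```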
Another strategy is to ignore any columns with missing values. You can identify columns with missing values by counting the number of null or `NaN` entries:
```python
print(df.isnull().sum().sort_values(ascending=False).to_string())
```
The above prints the following:
```
internally_displaced_persons        121
central_government_debt_pct_gdp     74
hiv_incidence                        61
energy_imports_pct                   56
...
urban_population_under_5m             0
rural_land                            0
urban_land                            0
country                               0
```
To list all columns with no missing values, you can filter for rows where the count of missing values is zero:
```python
df_null_count = df.isnull().sum().sort_values(ascending=False)
print(df_null_count[df_null_count == 0].index)
```
For linear regression, the data must be numeric. You can identify numerical columns by using the `describe()` method, which computes basic statistics for numerical columns:
```python
print(df.describe().columns)
```
By combining these steps, you can list all columns that are both numerical and have no missing values using a `set` intersection:
```python
print(list(set(df.describe().columns) & set(df_null_count[df_null_count == 0].index)))
```
Your Task
Looking at the set of columns above, GDP is missing. This makes sense if you check the data in the CSV file—there’s one country without GDP data. Can you identify which country that is using pandas?
Next, let’s find the columns with three or fewer missing values and then remove the countries that have missing values in any of those columns. How can you do this in Python? There should be a simple way to shortlist the pandas DataFrame in just a few lines of code.
In the next lesson, you’ll run linear regression using the numeric columns you shortlisted above.
Lesson 03: Performing Linear Regression
Let’s start with the DataFrame. We will identify the numeric columns that have three or fewer missing values in the entire dataset:
```python
df_null_count = df.isnull().sum().sort_values(ascending=False)
good_cols = list(set(df_null_count[df_null_count <= 3].index) & set(df.describe().columns))
print(good_cols)

df_cleaned = df.dropna(axis="index", how="any", subset=good_cols).copy()
print(df_cleaned)
```
Now, let’s focus on the columns listed in `good_cols`. How well do you think population can predict GDP? After all, a country with a larger population should generally have a higher GDP.
To find out, we can use scikit-learn to build a linear model:
```python
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)
X = df_cleaned[["population"]]
Y = df_cleaned["gdp"]
model.fit(X, Y)
print(model.coef_)
print(model.intercept_)
print(model.score(X, Y))
```
The score printed by the last line is the $R^2$ of the linear regression model. The best possible score is 1.0, while a value of 0.0 indicates that the predictor `X` is independent of `Y`. In this case, we obtained an $R^2$ score of 0.34, which is not very high.
Let’s try adding more columns to `X` to see if additional predictors improve the model:
```python
model = LinearRegression(fit_intercept=True)
X = df_cleaned[["population", "rural_population", "median_age", "life_expectancy"]]
Y = df_cleaned["gdp"]
model.fit(X, Y)
print(model.coef_)
print(model.intercept_)
print(model.score(X, Y))
```
The $R^2$ score increased to 0.66, which is a significant improvement. You can also examine the coefficients of the linear regression model. The rural population has a negative coefficient, meaning that a higher rural population is associated with a lower GDP.
Your Task
Nothing is stopping you from using all numerical columns in the DataFrame for linear regression. Can you try that? What is the $R^2$ score in this case? Which factors are positively correlated with GDP? How can you determine that?
In the next lesson, you will learn how to interpret the coefficients of the linear regression model.
Lesson 04: Interpreting Factors
Let’s try running a linear regression for life expectancy using all relevant factors. To determine which columns are usable, we identify those with no missing values:
```python
df_null_count = df.isnull().sum().sort_values(ascending=False)
good_cols = list(set(df_null_count[df_null_count <= 3].index) & set(df.describe().columns))
print(good_cols)
```
This shows that the columns used are the following:
```
['renewable_energy_consumption_pct', 'rural_land', 'urban_population_under_5m',
 'women_parliament_seats_pct', 'electricity_access_pct', 'gdp', 'rural_population',
 'birth_rate', 'population_female', 'fertility_rate', 'urban_land',
 'nitrous_oxide_emissions', 'press', 'democracy_score', 'life_expectancy',
 'urban_population', 'agricultural_land', 'longitude', 'methane_emissions',
 'population', 'internet_pct', 'population_male', 'hospital_beds', 'land_area',
 'median_age', 'net_migration', 'latitude', 'death_rate', 'forest_area',
 'co2_emissions']
```
Next, we define the predictors as all columns except the target variable (life expectancy):
```python
# Fit a fresh linear regression model, as in the previous lesson
model = LinearRegression(fit_intercept=True)
X = df_cleaned[[x for x in good_cols if x != "life_expectancy"]]
Y = df_cleaned["life_expectancy"]
model.fit(X, Y)
print(model.coef_)
print(model.intercept_)
print(model.score(X, Y))
```
To easily match each coefficient to its corresponding column, we print them together:
```python
for col, coef in zip(X.columns, model.coef_):
    print("%s: %.3e" % (col, coef))
```
Some factors have negative coefficients, meaning they negatively impact life expectancy. For example, a higher death rate decreases life expectancy, which aligns with expectations. Some coefficients are extremely small; for instance, `net_migration` is on the order of $10^{-6}$, meaning it effectively has no impact on the target variable.
Your Task
Since some features have no effect, why not remove them from the regression? How can you do this automatically? Hint: Write a loop that iteratively adds the “best feature” and compares the increase in the $R^2$ score.
In the next lesson, you’ll learn how to automatically find the best subset of features.
Lesson 05: Feature Selection
In the previous lesson, you predicted life expectancy using all available factors. Now, let’s refine the regression model to make it more “explainable.” Specifically, let’s identify the top five factors affecting life expectancy.
There are many ways to select features, and sequential feature selection is one of the simplest to understand. It uses a greedy algorithm to evaluate all possible combinations until the target number of features is selected. Let’s try it out:
```python
from sklearn.feature_selection import SequentialFeatureSelector

# Initializing the Linear Regression model
model = LinearRegression(fit_intercept=True)

# Perform Sequential Feature Selector
sfs = SequentialFeatureSelector(model, n_features_to_select=5)
X = df_cleaned[[x for x in good_cols if x != "life_expectancy"]]
Y = df_cleaned["life_expectancy"]
sfs.fit(X, Y)  # uses a default of cv=5
selected_feature = list(X.columns[sfs.get_support()])
print("Feature selected for highest predictability:", selected_feature)
```
These are the five best features selected by the sequential feature selector. Let’s now build the model again and examine the coefficients:
```python
model = LinearRegression(fit_intercept=True)
X = df_cleaned[selected_feature]
Y = df_cleaned["life_expectancy"]
model.fit(X, Y)
print(model.score(X, Y))
for col, coef in zip(X.columns, model.coef_):
    print("%s: %.3e" % (col, coef))
print("Intercept:", model.intercept_)
```
This produces the following results:
```
0.9248375749867905
electricity_access_pct: 3.798e-02
birth_rate: 1.319e-01
press: 3.290e-01
median_age: 9.035e-01
death_rate: -1.118e+00
Intercept: 51.251243580962864
```
This indicates that life expectancy increases with access to electricity and median age. Intuitively, a country with a high life expectancy will have a higher median age. However, this also highlights a weakness of regression models: they cannot detect “data leakage,” where certain predictors may be redundant or misleading, ultimately making the model less useful.
This is the art of data science—carefully and intelligently cleaning the input data before running an algorithm to avoid the “garbage in, garbage out” problem.
Next, let’s convert GDP, land area, and some other columns into their “per capita” versions and rerun the feature selection:
```python
per_capita = ["gdp", "land_area", "forest_area", "rural_land", "agricultural_land",
              "urban_land", "population_male", "population_female",
              "urban_population", "rural_population"]
for col in per_capita:
    df_cleaned[col] = df_cleaned[col] / df_cleaned["population"]

col_to_use = per_capita + [
    "nitrous_oxide_emissions", "methane_emissions", "fertility_rate",
    "hospital_beds", "internet_pct", "democracy_score", "co2_emissions",
    "women_parliament_seats_pct", "press", "electricity_access_pct",
    "renewable_energy_consumption_pct"]

model = LinearRegression(fit_intercept=True)
sfs = SequentialFeatureSelector(model, n_features_to_select=6)
X = df_cleaned[col_to_use]
Y = df_cleaned["life_expectancy"]
sfs.fit(X, Y)  # uses a default of cv=5
selected_feature = list(X.columns[sfs.get_support()])
print("Feature selected for highest predictability:", selected_feature)
```
Now, let’s check the regression coefficients using the selected features. Running the previous code will yield:
```
0.7854421025889131
gdp: 1.076e-04
forest_area: -2.357e+01
fertility_rate: -2.155e+00
internet_pct: 3.464e-02
press: 3.032e-01
electricity_access_pct: 6.548e-02
Intercept: 66.44197315903226
```
This suggests that GDP per capita is the strongest predictor of life expectancy, which makes sense—richer countries tend to have better healthcare. Interestingly, forest area appears to have a negative correlation with life expectancy, possibly indicating an association with urbanization. Press freedom, internet access, and electricity access are all positively correlated with life expectancy, as they reflect the level of societal development.
Your Task
This lesson demonstrates that data science is not purely mechanical—it requires intuition to preprocess data effectively for better model performance. One aspect we did not address here is normalizing the data before regression. For instance, GDP per capita is measured in dollars, whereas other factors are in percentage terms, which can exaggerate coefficient disparities.
Can you try rescaling these factors and rerunning the code above? Does it change the selected feature set? Does it affect the $R^2$ score of the regression model?
In the next lesson, you will learn about decision trees.
Lesson 06: Decision Tree
If linear regression is the first model a data scientist tries for any task, then a decision tree would be the second. It is another simple and easy-to-understand model. However, it is more effective for a different type of problem: classification.
Let’s explore whether countries in the Northern and Southern Hemispheres differ. First, we need to create a label in the dataset:
```python
df_cleaned["north"] = df_cleaned["latitude"] > 0
```
Now, let’s train a simple decision tree model as a classifier for this new column, using the selected features from the previous lesson. In scikit-learn, the syntax is almost identical to that of linear regression:
```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=3)
X = df_cleaned[col_to_use]
Y = df_cleaned["north"]
model.fit(X, Y)
print(model.score(X, Y))
```
The score for a decision tree classifier represents the mean accuracy. Before analyzing this accuracy, let’s check how many countries are in this dataset:
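The counting code is not reproduced above; a minimal sketch using pandas’ `value_counts()`, consistent with the output shown below, would be:

```python
# Count how many countries fall on each side of the equator
print(df_cleaned["north"].value_counts())
```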
This produces the following output:
```
north
True     147
False     40
Name: count, dtype: int64
```
If there were an equal number of countries in the Northern and Southern Hemispheres, a random guess would yield 50% accuracy. However, since the data is imbalanced, a model that always predicts the Northern Hemisphere would already be correct for 147 out of 187 countries, or about 78% of the time. This means the model above performs only slightly better than that naive baseline.
That doesn’t mean the model is useless. In this case, we used it to demonstrate that the selected features are not strong predictors for classifying a country. In other words, based on these features alone, there is no significant difference between countries in the Northern and Southern Hemispheres.
Your Task
You can visualize the decision tree to see which factors it uses. While scikit-learn has a built-in plotting function, the Python module dtreeviz
provides a better visualization. Try running the code below. What factors does the model use?
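The original code block is missing here. The sketch below uses scikit-learn’s built-in `plot_tree()` instead of `dtreeviz` (whose API varies between versions), so treat it as an assumption rather than the original code:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Draw the fitted decision tree with the original feature names
plt.figure(figsize=(14, 8))
plot_tree(model, feature_names=list(X.columns), class_names=["south", "north"], filled=True)
plt.show()
```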
In the next lesson, you will expand the decision tree into a random forest.
Lesson 07: Random Forest and Probability
If you have tried a decision tree, you can extend it into a forest to improve accuracy. There are several ways to build a forest from individual trees. For example, you can train multiple trees using a resampled dataset (i.e., selecting a random subset of rows for each tree). Another approach is to train trees using a random subset of features (i.e., columns).
Building a random forest is straightforward if you don’t require extensive fine-tuning:
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=5, max_depth=3)
X = df_cleaned[col_to_use]
Y = df_cleaned["north"]
model.fit(X, Y)
print(model.score(X, Y))
```
Using several trees instead of one slightly reduces accuracy here. This is inherent to random forests, as they do not use all the data for training, meaning there is no guarantee that a random forest will outperform a single decision tree. However, this result aligns with our earlier observation: there is probably not much difference between countries in the Northern and Southern Hemispheres.
Visualizing a random forest requires inspecting each decision tree individually. You can access the trees in the forest using `model.estimators_`.
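As a purely illustrative sketch (not from the original), you could draw each tree in the ensemble with scikit-learn’s `plot_tree()`:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# model.estimators_ is a list of fitted DecisionTreeClassifier objects
for i, tree in enumerate(model.estimators_):
    plt.figure(figsize=(12, 6))
    plot_tree(tree, feature_names=list(X.columns), class_names=["south", "north"], filled=True)
    plt.title(f"Tree {i}")
    plt.show()
```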
The random forest created above is an ensemble of decision trees that “vote” to determine the final result. Scikit-learn also provides an alternative implementation using the gradient boosting algorithm. Understanding the detailed differences is not necessary because the functional syntax remains the same:
```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(n_estimators=5, max_depth=3)
X = df_cleaned[col_to_use]
Y = df_cleaned["north"]
model.fit(X, Y)
print(model.score(X, Y))
```
Although decision trees and random forests are used as classifiers in these tutorials, they do not provide a definitive classification result. In particular, `GradientBoostingClassifier` assumes a numerical output. As a result, the model’s native output is the probability of each predicted class. You can retrieve these probabilities as follows:
```python
print(model.predict_proba(X))
```
This returns a row of probabilities for each input row. Typically, you are interested in the class with the highest probability, which you can obtain using `predict()`:
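The corresponding code block is not shown above; the standard call would simply be:

```python
# Predicted class (True for Northern Hemisphere) for each country
print(model.predict(X))
```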
You can also determine how confident the model is, on average, in predicting whether a country belongs to the Northern or Southern Hemisphere by calculating the average probability of its predictions:
```python
import numpy as np

print(np.mean(model.predict_proba(X)[range(len(X)), model.predict(X).astype(int)]))
```
The above code extracts the predicted class from the model, matches it with the corresponding probability, and computes the average. This provides evidence that the model does not distinguish between the Northern and Southern Hemispheres, as the computed value is no better than a random guess.
Your Task
Scikit-learn is not the primary library for gradient boosting classifiers. A more common choice is XGBoost. How would you rewrite the classifier above using XGBoost? How should you set the `n_estimators` and `max_depth` hyperparameters in the case of XGBoost?
This was the final lesson.
The End! (Look How Far You Have Come)
You made it. Well done!
Take a moment and look back at how far you have come.
- You discovered how scikit-learn can help you finish a data science project.
- You learned how to use machine learning models to interpret data.
- You experimented with linear regression and decision tree models, and saw how simple models like these are still useful.
Don’t make light of this; you have come a long way in a short time. This is just the beginning of your data science journey. Keep practicing and developing your skills.
Summary
How did you do with the mini-course?
Did you enjoy this crash course?
Do you have any questions? Were there any sticking points?
Let me know. Leave a comment below.