Assalamu’alaikum Warahmatullahi Wabarakatuh
Hello friends!
In this discussion, I will build classification models for the Sloan Digital Sky Survey DR14 data. Come on, check out the following article!
There are many classification models in machine learning; this time we will discuss four of them (a short scikit-learn sketch follows the list), namely:
- Logistic Regression: Logistic regression is a statistical model used to estimate the probability that an instance belongs to one class or another. Despite the word "regression" in its name, logistic regression is actually a classification tool. It is typically used when the target variable is binary (two classes), and it can be extended to more than two classes through multinomial or one-vs-rest formulations. Logistic regression uses the logistic (or sigmoid) function to map inputs to values between 0 and 1, interpreted as the probability of a particular class. It is one of the most commonly used classification algorithms because of its ease of interpretation and effectiveness in many cases.
- Decision Trees: Decision trees are predictive models that use a tree structure to represent decisions and their consequences. Each node in the tree represents a decision based on a feature of the data, and each branch represents the outcome of that decision. Decision trees split the data into smaller subsets based on the given features, with the goal of maximizing class homogeneity within each subset. They are easy to interpret, require relatively little data preparation, and can handle both categorical and numerical data.
- Support Vector Machines (SVM): Support Vector Machines (SVM) are a classification model that separates two classes by finding the hyperplane that maximizes the margin between them. A hyperplane is a flat subspace with one dimension fewer than the feature space (a line in two dimensions, a plane in three). SVMs can also perform non-linear classification by using kernel functions, which map the data into higher dimensions where it can be separated linearly. SVMs are effective in high-dimensional spaces and, thanks to margin maximization, tend to resist overfitting.
- Neural Networks: Neural Networks are classification models inspired by the structure of the human brain. They consist of many neurons (information processing units) connected in layers. There are input layers, hidden layers, and output layers. Each connection between neurons has weights that can be adjusted during the training process. Neural networks can learn complex patterns in data and can be used for classification, regression, pattern recognition, and other tasks. With various architectures such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), they have demonstrated excellent performance in many applications including image recognition, natural language processing, and more.
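To make these descriptions concrete, here is a minimal, illustrative scikit-learn sketch; the toy dataset and parameter values are assumptions for demonstration only and are not part of the SDSS analysis below.

# Illustrative sketch: instantiating and fitting the four model families in scikit-learn.
# The toy dataset is a stand-in; the real SDSS data is loaded later in this article.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "Neural Network (MLP)": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000),
}

for name, model in models.items():
    model.fit(X_toy, y_toy)                 # train on the toy data
    print(name, model.score(X_toy, y_toy))  # training accuracy, for illustration only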
The Sloan Digital Sky Survey DR14 dataset is a public collection of space observations, consisting of 10,000 observations taken by SDSS. Each observation is described by 17 different feature columns and 1 class column that identifies whether the celestial object is a star, galaxy, or quasar. This data can be accessed through the website https://www.kaggle.com/datasets/lucidlenn/sloan-digital-sky-survey/data.
The objective of this analysis is to develop a classification model capable of accurately identifying the categories of celestial objects based on the available features. By leveraging this data, it is hoped that we can gain a deeper understanding of the structure and characteristics of various celestial objects.
#Import data
Before creating and running the data import syntax, make sure the file containing the Sloan Digital Sky Survey DR14 data in CSV format has been uploaded to Google Colab.
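As an illustration, assuming you are working in a Google Colab notebook, one simple way to upload the file before reading it is the files helper (the file name Data_1.csv used below is taken from the read_csv call that follows):

# Assumes Google Colab: open a file picker and upload the CSV into the session.
from google.colab import files
uploaded = files.upload()          # choose Data_1.csv from your computer
print(list(uploaded.keys()))       # confirm the uploaded file name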
import pandas as pd

# Read the CSV file
data_dig = pd.read_csv("Data_1.csv")
data_dig.head()
   objid            ra          dec        u         g         r         i  \
0  1.237650e+18  183.531326  0.089693  19.47406  17.04240  15.94699  15.50342
1  1.237650e+18  183.598370  0.135285  18.66280  17.21449  16.67637  16.48922
2  1.237650e+18  183.680207  0.126185  19.38298  18.19169  17.47428  17.08732
3  1.237650e+18  183.870529  0.049911  17.76536  16.60272  16.16116  15.98233
4  1.237650e+18  183.883288  0.102557  17.55025  16.26342  16.43869  16.55492
   z         run  rerun  camcol  field  specobjid     class   redshift   plate  \
0  15.22531  752  301    4       267    3.722360e+18  STAR   -0.000009   3306
1  16.39150  752  301    4       267    3.638140e+17  STAR   -0.000055    323
2  16.80125  752  301    4       268    3.232740e+17  GALAXY  0.123111    287
3  15.90438  752  301    4       269    3.722370e+18  STAR   -0.000111   3306
4  16.61326  752  301    4       269    3.722370e+18  STAR    0.000590   3306
   mjd    fiberid
0  54922  491
1  51615  541
2  52023  513
3  54922  510
4  54922  512
#Data Information
data_dig.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 objid 10000 non-null float64
1 ra 10000 non-null float64
2 dec 10000 non-null float64
3 u 10000 non-null float64
4 g 10000 non-null float64
5 r 10000 non-null float64
6 i 10000 non-null float64
7 z 10000 non-null float64
8 run 10000 non-null int64
9 rerun 10000 non-null int64
10 camcol 10000 non-null int64
11 field 10000 non-null int64
12 specobjid 10000 non-null float64
13 class 10000 non-null object
14 redshift 10000 non-null float64
15 plate 10000 non-null int64
16 mjd 10000 non-null int64
17 fiberid 10000 non-null int64
dtypes: float64(10), int64(7), object(1)
#Data Cleaning
data_dig.dropna(inplace=True)
data_dig.drop_duplicates(inplace=True)
In the data cleaning process, essential steps such as removing rows with missing values and eliminating duplicate rows are crucial. The command data_dig.dropna(inplace=True) removes rows with missing values from the data_dig DataFrame, which helps ensure the consistency and accuracy of the data used in the analysis. The command data_dig.drop_duplicates(inplace=True) then identifies and removes duplicate rows. Together, these steps ensure that each row in the DataFrame represents a unique observation, leading to more reliable and accurate analysis results.
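As a small follow-up sketch (purely illustrative), the result of the cleaning step can be verified directly on the same data_dig DataFrame:

# Verify the cleaning step: remaining missing values and duplicates should be zero.
print(data_dig.isna().sum().sum())      # total remaining missing values
print(data_dig.duplicated().sum())      # remaining fully duplicated rows
print(data_dig.shape)                   # rows and columns left after cleaning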
#Data Splitting
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data_dig, test_size=0.1, random_state=42)
print(train_data.shape)
print(test_data.shape)
(9000, 18)
(1000, 18)
The above syntax uses the train_test_split function from the sklearn.model_selection module to split the dataset into training and testing data. The test_size=0.1 argument allocates 10% of the data for testing, while the remaining 90% is used for training. The random_state=42 argument sets the seed of the random number generator so that the split can be reproduced consistently. The results of train_test_split are stored in the train_data and test_data variables, each containing a subset of the original dataset for training and testing the classification model, respectively. Checking the shape of train_data and test_data confirms that the split was done correctly and in the intended proportions.
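Since the three classes (STAR, GALAXY, QSO) are unlikely to be equally frequent, a stratified split is a common variant; the following is only a sketch of that alternative, not the split actually used above.

# Optional variant: stratify on the class column so the 90/10 split keeps the
# STAR/GALAXY/QSO proportions roughly the same in both subsets (illustrative only).
from sklearn.model_selection import train_test_split

train_data_s, test_data_s = train_test_split(
    data_dig,
    test_size=0.1,
    random_state=42,
    stratify=data_dig["class"],
)
print(train_data_s["class"].value_counts(normalize=True))
print(test_data_s["class"].value_counts(normalize=True))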
X = data_dig.drop(columns=['class'])
print(X)
Y = data_dig[['class']]
print(Y)
objid ra dec u g r \
0 1.237650e+18 183.531326 0.089693 19.47406 17.04240 15.94699
1 1.237650e+18 183.598370 0.135285 18.66280 17.21449 16.67637
2 1.237650e+18 183.680207 0.126185 19.38298 18.19169 17.47428
3 1.237650e+18 183.870529 0.049911 17.76536 16.60272 16.16116
4 1.237650e+18 183.883288 0.102557 17.55025 16.26342 16.43869
... ... ... ... ... ... ...
9995 1.237650e+18 131.316413 51.539547 18.81777 17.47053 16.91508
9996 1.237650e+18 131.306083 51.671341 18.27255 17.43849 17.07692
9997 1.237650e+18 131.552562 51.666986 18.75818 17.77784 17.51872
9998 1.237650e+18 131.477151 51.753068 18.88287 17.91068 17.53152
9999 1.237650e+18 131.665012 51.805307 19.27586 17.37829 16.30542
i z run rerun camcol field specobjid redshift \
0 15.50342 15.22531 752 301 4 267 3.722360e+18 -0.000009
1 16.48922 16.39150 752 301 4 267 3.638140e+17 -0.000055
2 17.08732 16.80125 752 301 4 268 3.232740e+17 0.123111
3 15.98233 15.90438 752 301 4 269 3.722370e+18 -0.000111
4 16.55492 16.61326 752 301 4 269 3.722370e+18 0.000590
... ... ... ... ... ... ... ... ...
9995 16.68305 16.50570 1345 301 3 161 5.033450e+17 0.027583
9996 16.71661 16.69897 1345 301 3 162 5.033400e+17 0.117772
9997 17.43302 17.42048 1345 301 3 162 8.222620e+18 -0.000402
9998 17.36284 17.13988 1345 301 3 163 5.033400e+17 0.014019
9999 15.83548 15.50588 1345 301 3 163 5.033410e+17 0.118417
plate mjd fiberid
0 3306 54922 491
1 323 51615 541
2 287 52023 513
3 3306 54922 510
4 3306 54922 512
... ... ... ...
9995 447 51877 246
9996 447 51877 228
9997 7303 57013 622
9998 447 51877 229
9999 447 51877 233
[10000 rows x 17 columns]
class
0 STAR
1 STAR
2 GALAXY
3 STAR
4 STAR
... ...
9995 GALAXY
9996 GALAXY
9997 STAR
9998 GALAXY
9999 GALAXY
[10000 rows x 1 columns]
The provided syntax involves data preparation steps using the pandas library. In the first part, the drop method is applied to the data_dig DataFrame to create a new DataFrame X that excludes the 'class' column, effectively separating the feature variables from the target variable. This step is crucial for machine learning tasks because it separates the independent variables used for prediction from the dependent variable to be predicted; the resulting DataFrame X contains only the features. In the second part, another DataFrame Y is created by selecting only the 'class' column from the original data_dig DataFrame using double square brackets. This DataFrame Y contains the target variable, i.e. the label corresponding to each observation in the dataset. Separating the features and the target variable in this way makes the subsequent modeling, manipulation, and analysis of the data easier.
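Note that the double square brackets make Y a one-column DataFrame, whereas scikit-learn generally expects a one-dimensional target; this is why a DataConversionWarning appears in the next section. A hedged alternative sketch that avoids the warning:

# Alternative sketch: keep the target as a 1-D Series instead of a one-column DataFrame.
X = data_dig.drop(columns=["class"])
y = data_dig["class"]            # single brackets return a pandas Series
print(X.shape, y.shape)          # e.g. (10000, 17) and (10000,)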
#Model Selection
from sklearn.linear_model import LogisticRegression
model_logreg = LogisticRegression()
model_logreg.fit(X, Y)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=2):
ABNORMAL_TERMINATION_IN_LNSRCH.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
LogisticRegression()
The syntax model_logreg.fit(X, Y) trains the logistic regression model (model_logreg) using the feature data (X) and the target data (Y). Here, X is a DataFrame containing the features used for prediction or classification, while Y is a DataFrame containing the target variable the model should predict.
When fit is called, the logistic regression model learns from the provided data (X) and the relationship between these features and the target (Y). This process adjusts the model parameters to fit the training data so that the model can make good predictions or classifications. Once training is complete, the model is ready to be used for predictions or classification on new, unseen data.
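The ConvergenceWarning above suggests scaling the data or increasing max_iter. Below is a hedged sketch of one way to do that with a scikit-learn Pipeline; it is an alternative setup, not the run whose output is shown here.

# Sketch: standardize the features and raise max_iter to help lbfgs converge.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

logreg_scaled = make_pipeline(
    StandardScaler(),                      # standardize each feature
    LogisticRegression(max_iter=1000),     # more iterations than the default 100
)
logreg_scaled.fit(X, Y.values.ravel())     # ravel() avoids the column-vector warning
print(logreg_scaled.score(X, Y.values.ravel()))  # training accuracy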
model_logreg.predict(X)
array(['GALAXY', 'GALAXY', 'GALAXY', ..., 'GALAXY', 'GALAXY', 'GALAXY'],
dtype=object)
from sklearn import metrics

y_pred = model_logreg.predict(X)
print(metrics.accuracy_score(Y, y_pred))
0.4998
Choosing logistic regression for the Sloan Digital Sky Survey (SDSS) Data Release 14 (DR14) data is a reasonable option because the target variable is categorical (STAR, GALAXY, or QSO); scikit-learn's LogisticRegression handles such multi-class targets through a multinomial or one-vs-rest scheme. The advantages of logistic regression include probability predictions, easy interpretation of the influence of the independent variables on the target, and relative simplicity of implementation. By using logistic regression, we can better understand the relationship between input and output variables, enabling more accurate decisions based on the model predictions.
from sklearn.neighbors import KNeighborsClassifier
model_knn = KNeighborsClassifier(n_neighbors=7)
model_knn.fit(X,Y)
y_pred = model_knn.predict(X)
print(metrics.accuracy_score(Y, y_pred))
0.8188
The selection of the K-Nearest Neighbors (KNN) model for the Sloan Digital Sky Survey (SDSS) Data Release 14 (DR14) is based on several considerations about the nature of astronomical data. First, astronomical data often exhibit spatial structure in which celestial objects lie close together or form clusters. KNN handles such structure well because it uses information from nearby neighbors to make predictions, making it suitable for classifying celestial objects based on their patterns in feature space. Second, KNN is a non-parametric model, so it makes no assumptions about the distribution of the data; this suits astronomical data, whose distributions can be complex and hard to predict. In addition, KNN results are relatively easy to interpret, since each prediction is simply the majority label among the nearest neighbors.
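Because KNN bases every prediction on distances to neighboring points, the choice of n_neighbors matters (and feature scaling can help as well). A hedged sketch of picking k by cross-validated accuracy; the candidate values are assumptions:

# Sketch: compare a few values of n_neighbors with 5-fold cross-validation.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

for k in [3, 5, 7, 9, 11]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn_k, X, Y.values.ravel(), cv=5, scoring="accuracy")
    print(k, round(scores.mean(), 4))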
#Model Training and Evaluation
from sklearn import metrics
y_pred = model_logreg.predict(X)
print(metrics.accuracy_score(Y, y_pred))
0.4998
The accuracy metric measures how accurately a classification model classifies data instances overall. It is calculated by dividing the total number of correct predictions (both positive and negative) by the total number of data instances.
Accuracy is useful for the Sloan Digital Sky Survey (SDSS) data because it gives a general overview of how well the model identifies celestial objects across the whole dataset. However, when one class (for example QSO) is much rarer than the others, overall accuracy alone can be misleading, so it is worth complementing it with per-class metrics such as precision, recall, or a confusion matrix.
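Both accuracy values above were computed on the same data the models were trained on, which tends to be optimistic. The following is a hedged sketch of evaluating on a held-out split instead, together with per-class metrics; the split settings are illustrative.

# Sketch: evaluate on held-out data and inspect the per-class breakdown.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

X_tr, X_te, y_tr, y_te = train_test_split(
    X, Y.values.ravel(), test_size=0.2, random_state=42, stratify=Y.values.ravel()
)
knn_eval = KNeighborsClassifier(n_neighbors=7).fit(X_tr, y_tr)
y_hat = knn_eval.predict(X_te)
print(accuracy_score(y_te, y_hat))          # overall accuracy on unseen data
print(confusion_matrix(y_te, y_hat))        # rows: true class, columns: predicted class
print(classification_report(y_te, y_hat))   # precision, recall, F1 per class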
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=20, random_state=4)
print(X_train)
print(X_test)
print(y_train)
print(y_test)
objid ra dec u g r \
5511 1.237650e+18 170.091640 -0.045379 19.17184 17.53762 16.68857
7360 1.237650e+18 183.430645 -0.220543 17.99249 15.92869 15.11373
8286 1.237650e+18 183.834767 -0.586356 19.34394 18.25068 17.69389
3450 1.237650e+18 180.551014 -1.675324 19.57907 17.72984 16.65834
4435 1.237650e+18 160.457025 0.313447 19.11273 17.50094 16.68041
... ... ... ... ... ... ...
456 1.237650e+18 162.727961 -0.970882 19.49037 18.21750 17.42162
6017 1.237650e+18 128.491980 49.335406 18.16746 16.84846 16.22980
709 1.237650e+18 149.278393 1.165870 19.36334 18.06556 17.58404
8366 1.237650e+18 241.825469 51.400206 19.56648 17.68892 16.74212
1146 1.237650e+18 165.303222 -0.957026 17.96716 16.85820 16.49776
i z run rerun camcol field specobjid redshift \
5511 16.25410 15.89447 756 301 3 362 3.143000e+17 0.063396
7360 14.80830 14.63797 752 301 3 266 3.722270e+18 0.000090
8286 17.28113 17.14116 756 301 2 454 3.231620e+17 0.142846
3450 16.22477 15.84225 1140 301 5 185 3.728200e+17 0.131723
4435 16.22015 15.87364 756 301 4 298 3.097190e+17 0.059994
... ... ... ... ... ... ... ... ...
456 16.97952 16.68882 756 301 1 313 3.108150e+17 0.115637
6017 15.82586 15.54955 1331 301 3 167 4.989300e+17 0.053524
709 17.25350 17.09801 756 301 6 223 5.630080e+17 0.063258
8366 16.26115 15.88871 1345 301 6 572 6.982160e+17 0.102676
1146 16.34852 16.30215 756 301 1 330 3.119080e+17 -0.000160
plate mjd fiberid
5511 279 51984 634
7360 3306 54922 152
8286 287 52023 105
3450 331 52368 534
4435 275 51910 352
... ... ... ...
456 276 51909 242
6017 443 51873 568
709 500 51994 212
8366 620 52375 575
1146 277 51908 121
[9980 rows x 17 columns]
objid ra dec u g r \
1603 1.237650e+18 25.782278 13.296953 19.22461 17.64948 17.09655
8713 1.237650e+18 186.319094 1.006614 17.26722 16.33852 16.06820
4561 1.237650e+18 141.576582 0.472581 18.84490 17.35174 16.60627
6600 1.237650e+18 190.433912 -1.808007 16.42360 14.95384 14.40461
2558 1.237650e+18 171.912222 -1.428248 18.11480 16.31868 15.50009
7642 1.237650e+18 201.803287 67.509934 17.81444 16.22030 15.59226
8912 1.237650e+18 118.838142 43.802934 19.24615 17.50161 16.69884
3319 1.237650e+18 225.249333 -0.743244 19.47440 18.27215 18.26853
6852 1.237650e+18 195.638943 -3.338934 18.66313 17.81916 17.34327
1366 1.237650e+18 195.714613 0.332823 19.49343 19.06980 18.38652
3123 1.237650e+18 176.311159 64.590017 19.30013 17.51643 16.51510
262 1.237650e+18 206.374167 -0.781673 18.28101 17.18856 16.70175
4951 1.237650e+18 180.880583 -0.892455 15.96642 14.78763 14.26892
1314 1.237650e+18 216.159999 -0.731850 19.39956 17.76706 16.83544
7132 1.237650e+18 122.608079 48.440013 18.59266 17.60255 17.19375
9038 1.237650e+18 135.784022 54.959513 19.24032 18.38765 18.14323
5253 1.237650e+18 128.715020 54.054004 17.96559 16.19799 15.50962
596 1.237650e+18 145.292821 0.376326 19.31302 18.13527 17.67289
7811 1.237650e+18 200.537006 66.181104 17.01481 15.40643 14.87614
187 1.237650e+18 221.277738 0.494732 19.25605 18.11174 17.34868
i z run rerun camcol field specobjid redshift \
1603 16.91681 16.85408 1035 301 2 133 2.137030e+18 -0.000013
8713 15.94947 15.91341 752 301 6 286 4.331520e+18 0.000286
4561 16.12300 15.83033 1239 301 5 142 5.348820e+17 0.070166
6600 14.21981 14.15317 1140 301 5 251 3.261830e+18 -0.000031
2558 15.16404 14.95651 1231 301 6 18 3.640080e+18 0.000147
7642 15.33663 15.21829 1350 301 6 412 2.752940e+18 -0.000170
8912 16.31026 15.98532 1350 301 3 93 4.909270e+17 0.046553
3319 18.30815 18.37781 752 301 2 546 3.731250e+18 -0.000357
6852 17.04178 16.85804 1140 301 1 285 3.817360e+17 0.047103
1366 18.19380 17.66139 745 301 4 247 3.300620e+17 0.330422
3123 16.00140 15.59774 1302 301 3 352 6.733630e+17 0.065156
262 16.33173 16.14626 752 301 2 420 3.378520e+17 0.088136
4951 14.05750 13.94386 756 301 1 435 3.256160e+18 0.000230
1314 16.39579 16.07255 752 301 2 485 3.434530e+17 0.102965
7132 17.04646 16.96286 1350 301 4 129 8.238230e+18 -0.000478
9038 18.04648 18.05438 1345 301 4 190 6.449340e+18 0.000415
5253 15.24411 15.10445 1350 301 5 174 2.607700e+18 -0.000104
596 17.49996 17.44187 756 301 4 197 2.995850e+17 0.000213
7811 14.71505 14.69119 1412 301 2 210 2.752900e+18 -0.000038
187 16.93481 16.68558 752 301 5 519 3.469060e+17 0.109808
plate mjd fiberid
1603 1898 53260 254
8713 3847 55588 667
4561 475 51965 289
6600 2897 54585 373
2558 3233 54891 172
7642 2445 54573 415
8912 436 51883 127
3319 3314 54970 53
6852 339 51692 204
1366 293 51689 632
3123 598 52316 271
262 300 51943 299
4951 2892 54552 214
1314 305 51613 196
7132 7317 56992 81
9038 5728 56334 685
5253 2316 53757 423
596 266 51630 346
7811 2445 54573 275
187 308 51662 469
class
5511 GALAXY
7360 STAR
8286 GALAXY
3450 GALAXY
4435 GALAXY
... ...
456 GALAXY
6017 GALAXY
709 GALAXY
8366 GALAXY
1146 STAR
[9980 rows x 1 columns]
class
1603 STAR
8713 STAR
4561 GALAXY
6600 STAR
2558 STAR
7642 STAR
8912 GALAXY
3319 STAR
6852 GALAXY
1366 QSO
3123 GALAXY
262 GALAXY
4951 STAR
1314 GALAXY
7132 STAR
9038 STAR
5253 STAR
596 STAR
7811 STAR
187 GALAXY
#Model Optimization
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Model initialization
rf_classifier = RandomForestClassifier()
# List of parameters to optimize
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5, 10]
}
# Initialize Grid Search with grid model and parameters
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5)
# Train Grid Search to find the best combination of parameters
grid_search.fit(X_train, y_train)
# Gets the best parameters found
best_params = grid_search.best_params_
print(best_params)
{'max_depth': None, 'min_samples_split': 5, 'n_estimators': 50}
The best parameters found by the grid search are max_depth of None, min_samples_split of 5, and n_estimators of 50. A max_depth of None lets the trees in the random forest grow until their leaves are pure (or until min_samples_split stops further splits), which helps capture complex patterns in the data but carries a risk of overfitting if left unchecked. With min_samples_split set to 5, a node is only split when it contains at least 5 samples, which regularizes the trees slightly compared with the default of 2. Using 50 trees in the ensemble (n_estimators) still provides diversity in the predictions while keeping computational cost and training time modest. Together, these settings balance model complexity and predictive performance, so the random forest can generalize to unseen data while still capturing intricate patterns in the dataset.
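Once GridSearchCV has refit the best model, a natural follow-up (sketched here under the assumption that the X_test and y_test split created earlier is still available) is to check how the tuned model performs on the held-out data:

# Sketch: evaluate the tuned random forest on the held-out test split.
best_rf = grid_search.best_estimator_                          # refit with the best parameters
print("Best CV accuracy:", grid_search.best_score_)            # mean cross-validated accuracy
print("Test accuracy:", best_rf.score(X_test, y_test.values.ravel()))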
#Implementation of Cross-Validation
from sklearn.model_selection import cross_val_score

# cross_val_score performs the 10-fold splitting of X and Y itself,
# which is why we pass the full X and Y instead of X_train and y_train
scores = cross_val_score(model_knn, X, Y.values.ravel(), cv=10, scoring='accuracy')

# Display the mean cross-validation score
print("Cross validation score:", scores.mean())
Cross validation score: 0.7225999999999999
K-fold cross-validation is a crucial step in classifier development because it provides a more accurate estimate of how well the model will perform on unseen data. In a classification context, k-fold cross-validation helps guard against overfitting by testing the model on a different subset of the data in each fold, so the model must perform well not only on the training data but also on data it has not seen. It also reduces the variability of the performance estimate by averaging the results of the folds, giving a more stable evaluation. Because every observation is used for both training and validation across the folds, the data is used efficiently, which improves the accuracy of the performance estimate. Overall, k-fold cross-validation helps select the best model and tune its parameters more effectively, producing a classification model that generalizes better to new data.
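The same cross_val_score pattern extends naturally to comparing several candidate models under identical folds; the following is a hedged sketch whose model settings are illustrative:

# Sketch: compare several models with the same 10-fold cross-validation protocol.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=7),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, Y.values.ravel(), cv=10, scoring="accuracy")
    print(name, round(scores.mean(), 4))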
#Prediction and Interpretation
Before importing the new data, prepare a CSV file containing an observation with the same 17 feature columns as the training data, and upload it into Google Colab.
#New data
import pandas as pd

# Read the CSV file
DataSloanNew = pd.read_csv("/content/Data1b.csv")

# Display the first few rows
print(DataSloanNew.head())
The above syntax reads the CSV file "Data1b.csv" using the pandas library in Python. The file is read with the pd.read_csv() function, which converts the data from the CSV file into a DataFrame stored in the variable DataSloanNew. The first few rows of the DataFrame are then printed with the .head() method to give an initial overview of the structure and content of the data.
   objid            ra          dec        u         g         r         i  \
0  1.530000e+18  123.394529  0.032589  12.74638  11.73648  17.82639  19.16254
   z         run  rerun  camcol  field  specobjid     redshift   plate  mjd    \
0  10.82628  356  206    6       172    2.170000e+19  -0.000007  2836   35682
   fiberid
0  357
# Logistic regression prediction
prediksi_logler = model_logreg.predict(DataSloanNew)
print(prediksi_logler)

# K-nearest neighbors prediction
predictions = model_knn.predict(DataSloanNew)
print(predictions)
The provided syntax comprises two sections for making predictions on the new data represented by DataSloanNew using two different machine learning models. In the first section, a logistic regression prediction is made by calling the predict() method on the logistic regression model stored in model_logreg; the resulting prediction is stored in the variable prediksi_logler and printed to the console. In the second section, a K-Nearest Neighbors (KNN) prediction is made by calling the predict() method on the KNN model stored in model_knn; the prediction is stored in the variable predictions and printed to the console. These steps allow the two models' outputs on the new data to be compared.
['GALAXY']
['STAR']
The outputs ['GALAXY'] and ['STAR'] are the class predictions for the single new observation. Each value in square brackets ([]) is the prediction of one model: the logistic regression model predicts the class 'GALAXY' for the new data, while the KNN model predicts the class 'STAR'. The two models disagree here, which is a useful reminder to rely on the model that performed better during evaluation.
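When two models disagree like this, it can help to look at the predicted class probabilities rather than only the hard labels. A small hedged sketch using the models trained above (both classifiers expose predict_proba):

# Sketch: class-membership probabilities for the new observation.
proba_logreg = model_logreg.predict_proba(DataSloanNew)
proba_knn = model_knn.predict_proba(DataSloanNew)
print(model_logreg.classes_)   # class order for the probability columns
print(proba_logreg)            # logistic regression probabilities
print(proba_knn)               # KNN probabilities (fractions of the 7 neighbors)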