Mastering the Foundations: Week 0 of Kifya AIM and 10 Academy’s Artificial Intelligence Program | by Bisrat Kebere | Aug, 2024


MoonLight Energy Solutions tasked us with analyzing solar radiation data to identify high-potential regions for solar installation. The objective was to deliver actionable insights that would align with the company’s long-term sustainability goals. This week’s challenge was a comprehensive introduction to the disciplines of Data Engineering, Financial Analytics, and Machine Learning Engineering, with a specific emphasis on resilience, proactive learning, and teamwork.

The dataset provided contained a variety of environmental measurements, including solar radiation, air temperature, relative humidity, barometric pressure, precipitation, wind speed, and more. The first task was to clean the data and perform exploratory data analysis (EDA).

Data Cleaning

To ensure the data was ready for analysis, I created a function to handle missing values, replace invalid entries, and remove duplicates. This step was crucial for maintaining the integrity of the dataset and avoiding skewed results.

import pandas as pd
import numpy as np

def clean_data(data):
    # Replace infinite values with NaN
    data.replace([np.inf, -np.inf], np.nan, inplace=True)

    # Coerce invalid timestamps to NaT (the datetime equivalent of NaN)
    data['Timestamp'] = pd.to_datetime(data['Timestamp'], errors='coerce')

    # Drop rows whose 'Timestamp' could not be parsed
    data.dropna(subset=['Timestamp'], inplace=True)

    # Remove exact duplicate rows
    data.drop_duplicates(inplace=True)

    # Impute missing values in numeric columns with the column mean
    numeric_columns = data.select_dtypes(include=[np.number]).columns
    data[numeric_columns] = data[numeric_columns].fillna(data[numeric_columns].mean())
    return data
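
Before running a cleaning step like this, it can help to count how many problems there are to fix. The sketch below is one way to do that. The helper name `quality_report` and the tiny synthetic frame are assumptions for illustration and are not part of the original project or dataset.

```python
import numpy as np
import pandas as pd


def quality_report(df, timestamp_col="Timestamp"):
    """Count the issues a cleaning step is meant to fix (hypothetical helper)."""
    numeric = df.select_dtypes(include=[np.number])
    # Parse timestamps without modifying the input; unparseable values become NaT
    parsed = pd.to_datetime(df[timestamp_col], errors="coerce")
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "infinite_values": int(np.isinf(numeric.to_numpy()).sum()),
        "missing_numeric": int(numeric.isna().sum().sum()),
        "bad_timestamps": int(parsed.isna().sum()),
    }


# Tiny synthetic frame for illustration only (not the MoonLight data)
sample = pd.DataFrame({
    "Timestamp": ["2021-08-09 00:01", "2021-08-09 00:02", "not a date", "2021-08-09 00:02"],
    "GHI": [1.0, np.inf, 3.0, np.inf],
    "Tamb": [26.2, np.nan, 26.0, np.nan],
})
report = quality_report(sample)
print(report)
```

Running the same report again after cleaning gives a quick check that each count has dropped to zero.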

Exploratory Data Analysis (EDA)

The EDA was essential for uncovering patterns, trends, and potential anomalies in the data. Here’s a breakdown of the analyses performed:

1. Summary Statistics

I calculated the mean, median, standard deviation, and other key statistics for each numeric column. This provided a solid understanding of the data’s distribution and helped identify outliers and unusual entries.

# Calculate summary statistics
summary_stats = data.describe()
print("Summary Statistics:")
print(summary_stats)
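
Summary statistics only hint at outliers. One common way to flag them explicitly is a z-score rule, sketched below. This is not necessarily the method used in the original analysis; the function name `zscore_outliers`, the threshold of 3, and the synthetic data are assumptions.

```python
import numpy as np
import pandas as pd


def zscore_outliers(df, columns, threshold=3.0):
    """Return a boolean mask of rows where any listed column lies more than
    `threshold` standard deviations from that column's mean."""
    values = df[columns]
    z = (values - values.mean()) / values.std(ddof=0)
    return (z.abs() > threshold).any(axis=1)


# Synthetic data for illustration, with one injected spike in GHI
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "GHI": rng.normal(200, 10, 500),
    "Tamb": rng.normal(25, 1, 500),
})
sample.loc[0, "GHI"] = 1000.0

mask = zscore_outliers(sample, ["GHI", "Tamb"])
print(sample[mask])
```

Because the mean and standard deviation are themselves affected by extreme values, this rule is a first pass rather than a definitive test.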

2. Time Series Analysis

By plotting time-series data for Global Horizontal Irradiance (GHI), Direct Normal Irradiance (DNI), Diffuse Horizontal Irradiance (DHI), and ambient temperature (Tamb), I was able to observe daily and monthly patterns. This analysis highlighted fluctuations in solar irradiance and temperature, which are critical for judging when, and how reliably, a site can generate solar power.

import matplotlib.pyplot as plt
import seaborn as sns

# Set 'Timestamp' as the index for time-series analysis
data.set_index('Timestamp', inplace=True)
# Plot time-series trends
plt.figure(figsize=(12, 6))
sns.lineplot(data=data[['GHI', 'DNI', 'DHI', 'Tamb']])
plt.title("Time-series trends")
plt.xlabel("Time")
plt.ylabel("Value")
plt.show()
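
A raw line plot of every reading can hide the daily and monthly patterns mentioned above. One way to surface them is to average by hour of day and by month, as sketched here. The helper name `mean_profiles` and the synthetic irradiance curve are assumptions; the only requirement is a DataFrame indexed by timestamp.

```python
import numpy as np
import pandas as pd


def mean_profiles(df, column="GHI"):
    """Return (hourly, monthly) mean profiles of `column` for a frame
    that has a DatetimeIndex."""
    hourly = df[column].groupby(df.index.hour).mean()
    monthly = df[column].groupby(df.index.month).mean()
    return hourly, monthly


# Synthetic hourly data: 60 days of a clipped sine curve peaking at noon
index = pd.date_range("2022-01-01", periods=24 * 60, freq=pd.Timedelta(hours=1))
hours = index.hour.to_numpy()
ghi = np.clip(np.sin((hours - 6) / 12 * np.pi), 0, None) * 800
sample = pd.DataFrame({"GHI": ghi}, index=index)

hourly, monthly = mean_profiles(sample)
print(hourly)
print(monthly)
```

The hourly profile shows the typical shape of a day, while the monthly profile shows seasonal drift; both are easier to compare across sites than the raw series.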

3. Correlation Analysis

Using a heatmap, I visualized the pairwise correlations between the numeric variables, including the solar radiation components, the temperature measures, and the wind readings. This analysis revealed relationships that could affect solar energy efficiency, such as the correlation between wind conditions and solar irradiance.

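# Calculate correlation matrix (numeric columns only; recent pandas versions
# raise an error if non-numeric columns are included)
correlation_matrix = data.corr(numeric_only=True)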

# Plot heatmap of correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.show()
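
A heatmap with many variables can be hard to read at a glance. The sketch below lists the most strongly correlated pairs from a correlation matrix. The helper name `strongest_pairs` and the synthetic columns are assumptions for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def strongest_pairs(corr, top_n=5):
    """List variable pairs from a correlation matrix, ordered by |r|."""
    # Visit each unordered pair once, skipping the diagonal
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=["var_a", "var_b", "r"]).dropna(subset=["r"])
    order = pairs["r"].abs().sort_values(ascending=False).index
    return pairs.loc[order].head(top_n).reset_index(drop=True)


# Synthetic data: DNI is built from GHI, WS is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
sample = pd.DataFrame({
    "GHI": x,
    "DNI": 2 * x + rng.normal(scale=0.1, size=200),
    "WS": rng.normal(size=200),
})

top = strongest_pairs(sample.corr())
print(top)
```

Sorting by absolute value keeps strong negative correlations near the top, which a quick visual scan of the heatmap can miss.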

4. Wind and Temperature Analysis

I conducted a detailed analysis of wind speed, direction, and their variability using polar plots. I also examined how relative humidity might influence temperature readings and solar radiation. These insights are valuable for understanding how environmental factors affect solar panel performance.

# Plot wind speed and direction using a scatter plot
plt.figure(figsize=(12, 6))
sns.scatterplot(data=data, x='WD', y='WS', hue='WSgust', palette='viridis')
plt.title("Wind Speed vs. Direction")
plt.xlabel("Wind Direction (°)")
plt.ylabel("Wind Speed (m/s)")
plt.show()
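
The text above mentions polar plots, while the snippet draws a Cartesian scatter plot. A polar version could look roughly like the sketch below. It assumes `WD` is in degrees measured clockwise from north and `WS` is in m/s; the function name `plot_wind_polar` and the synthetic data are hypothetical.

```python
import matplotlib

# Agg renders off-screen so the sketch also runs without a display;
# drop this line for interactive use.
matplotlib.use("Agg")

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_wind_polar(df, direction_col="WD", speed_col="WS"):
    """Scatter wind speed (radius) against wind direction (angle) on polar axes."""
    fig = plt.figure(figsize=(6, 6))
    ax = fig.add_subplot(projection="polar")
    # Compass convention (assumed): 0 degrees at the top, angles increase clockwise
    ax.set_theta_zero_location("N")
    ax.set_theta_direction(-1)
    theta = np.deg2rad(df[direction_col].to_numpy())
    speed = df[speed_col].to_numpy()
    points = ax.scatter(theta, speed, c=speed, cmap="viridis", s=10)
    fig.colorbar(points, ax=ax, label="Wind speed (m/s)")
    ax.set_title("Wind speed by direction")
    return ax


# Synthetic wind readings for illustration only
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "WD": rng.uniform(0, 360, 300),
    "WS": rng.gamma(2.0, 1.5, 300),
})
ax = plot_wind_polar(sample)
ax.figure.savefig("wind_polar.png")
```

On polar axes the prevailing wind directions appear as clusters around the compass, which is harder to see when direction is laid out along a straight x-axis that breaks at 0°/360°.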

Version control was an integral part of this week’s challenge. I set up a GitHub repository to manage the project, ensuring that all code and analyses were properly versioned and documented.

# Initialize a new Git repository
git init

# Create a new branch for the task
git checkout -b task-1

# Add and commit changes
git add .
git commit -m "Initial data cleaning and EDA"

# Push changes to the remote repository
# (assumes a remote named "origin" has already been added with git remote add)
git push origin task-1

As a bonus task, I developed an interactive dashboard using Streamlit. This dashboard allows users to visualize the data insights dynamically, making the analysis accessible and easy to interpret for non-technical stakeholders.

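import matplotlib.pyplot as plt
import seaborn as sns
import streamlit as st

# `data` and `correlation_matrix` are the cleaned DataFrame and the
# correlation matrix computed in the earlier steps

st.title("Solar Radiation Data Dashboard")

# Sidebar for variable selection
variables = st.sidebar.multiselect("Select variables to plot:", list(data.columns))

# Time-series plot
if variables:
    st.line_chart(data[variables])

# Show correlation matrix
if st.checkbox("Show Correlation Matrix"):
    # Draw the heatmap on an explicit figure and pass that figure to Streamlit
    fig, ax = plt.subplots(figsize=(10, 8))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", ax=ax)
    st.pyplot(fig)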
