As both a weather enthusiast and a data engineer, I’ve always been fascinated by the challenge of hyperlocal weather prediction. While national weather services excel at regional forecasting, predicting temperatures for specific locations remains complex due to microclimate effects. I decided to tackle this challenge by building a machine learning model that combines upper air observations with local weather station data to predict daily high temperatures for my location in Atco, NJ.
The model isn’t just an academic exercise; it’s running on a server, making daily predictions. Every morning at 9am, a Docker container on my server executes the prediction pipeline. The operational code is available in my repo, where main.py contains the Python code that fetches the 12Z data, transforms it, and feeds it into the model to predict the high temperature for the current day. The prediction is persisted to S3. The pipeline also retrieves the high temperature observed by my weather station for the previous day so that we can see how well the model performed.
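For context, the daily job is essentially a thin driver around the steps described in the rest of this post. Here is a minimal sketch of what that flow could look like; build_feature_row, the pre-loaded model, and the bucket and key names are illustrative placeholders, not the actual contents of main.py:
import json
from datetime import date

import boto3

def run_daily_prediction():
    today = date.today()
    # Hypothetical helper: assemble 12Z soundings + station obs into one feature row
    features = build_feature_row(today)
    # Hypothetical pre-loaded ExtraTrees model
    predicted_high = float(model.predict(features)[0])

    # Persist the prediction to S3 (bucket and key are placeholders)
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="my-weather-predictions",
        Key=f"predictions/{today.isoformat()}.json",
        Body=json.dumps({"date": today.isoformat(), "predicted_high_f": predicted_high}),
    )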
The core of my solution leverages two key data sources:
- Upper air soundings from surrounding National Weather Service (NWS) stations. Each atmospheric level (e.g., 1000mb, 925mb) provides different insights into weather patterns. I use NWS soundings from key levels that capture boundary layer effects and synoptic-scale influences.
- Local weather station observations from my personal weather station. Local observations add precision to the model by capturing near-surface conditions unique to Atco, NJ.
The power of this approach lies in combining broad atmospheric data with hyperlocal conditions. Let’s dive into the technical details.
I use the Extra Trees (Extremely Randomized Trees) algorithm for generating the model. It is a powerful choice for meteorological data due to its ensemble nature, which averages multiple decision trees to improve robustness. This model is well-suited for capturing non-linear relationships and interactions between high-dimensional features — ideal for atmospheric data where layers and surface data interplay.
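To make the ensemble idea concrete, here is a small self-contained sketch, on synthetic data rather than the real feature set, showing that the regressor’s prediction is simply the average of its individual trees:
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in for the real sounding/observation features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

reg = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)

# The ensemble prediction is the mean of the individual trees' predictions
per_tree = np.stack([tree.predict(X[:5]) for tree in reg.estimators_])
assert np.allclose(per_tree.mean(axis=0), reg.predict(X[:5]))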
The model ingests data from multiple NWS stations surrounding Atco, including Newport, NC; Greensboro, NC; Buffalo, NY; and others. Below is a glimpse of how I flatten sounding data across multiple stations and pressure levels. The rest of the code is in features.py in this repo.
import pandas as pd

STATIONS = {
    "72305": {"city": "Newport, NC", "station_name": "MHX"},
    "72317": {"city": "Greensboro, NC", "station_name": "GSO"},
    "72528": {"city": "Buffalo, NY", "station_name": "BUF"},
    "72501": {"city": "Upton, NY", "station_name": "OKX"},
    "72403": {"city": "Sterling, VA", "station_name": "IAD"},
    # Additional stations omitted for brevity
}

def consolidate_pressure_levels(df, station, date, sounding_hr):
    # Standard levels sampled from each sounding
    pressure_levels = [1000, 850, 700, 500, 300, 200]
    df['pressure'] = df['pressure'].astype(float)

    # For each target level, locate the row whose observed pressure is closest
    indexes = []
    for p in pressure_levels:
        idx = (df['pressure'] - p).abs().idxmin()
        indexes.append(idx)

    # Flatten each selected level into {field}_{level} columns
    vals = {}
    for p, idx in zip(pressure_levels, indexes):
        row = df.loc[idx]
        for field in FIELDS:
            col = f"{field}_{p}"
            vals[col] = [float(row[field])]
    return pd.DataFrame(vals)
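The per-station frames are then stitched into a single wide row per date, with each column suffixed by the station name so that levels from different stations don’t collide. The exact assembly lives in features.py; the sketch below is my illustration of that step (flatten_stations and soundings_by_station are hypothetical names, not code from the repo):
def flatten_stations(soundings_by_station, date, sounding_hr=12):
    # soundings_by_station: hypothetical mapping of station id -> raw sounding DataFrame
    frames = []
    for station_id, meta in STATIONS.items():
        levels = consolidate_pressure_levels(
            soundings_by_station[station_id], station_id, date, sounding_hr)
        # Suffix with the station name, e.g. rel_humidity_1000 -> rel_humidity_1000_MHX
        levels.columns = [f"{c}_{meta['station_name']}" for c in levels.columns]
        frames.append(levels)
    return pd.concat(frames, axis=1)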
The result of this code is over 500 features for a given date. For example, rel_hum_1000_MHX is the relative humidity at 1000 mb for station MHX, and mix_ratio_850_GSO is the mixing ratio at 850 mb for GSO. The fields below are ingested for each station across the different pressure levels.
FIELDS = ["pressure","height","temp","dew_point","rel_humidity",
"mix_ratio","direction", "knots","theta","theta_e","theta_v"]
What makes this approach interesting is how it captures data at specific pressure levels (1000mb, 850mb, 700mb, etc.) from each station. These levels are crucial as they represent different layers of the atmosphere, each playing a unique role in temperature development.
The secret sauce of this model lies in how it combines upper air data with local 12Z observations from my weather station. Below is the key function that brings in my weather station data via the Weather Underground API, which you can gain access to by registering your station with Weather Underground.
import os
import time
from datetime import datetime

import pandas as pd
import requests

session = requests.Session()

def get_observation_data(df):
    key = os.getenv("WEATHER_API_KEY")
    station = "KNJATCO14"  # my weather station
    url = ("https://api.weather.com/v2/pws/history/hourly"
           "?stationId={station}&format=json&units=m&date={date}&apiKey={key}")
    vals = {'forecast_date': [], 'temp_f_12z': [], 'dew_point_f_12z': [],
            'humidity_12z': [], 'pressure_12z': [], 'pressure_trend_12z': []}
    df2 = df.copy()
    df2['forecast_date'] = pd.to_datetime(df2['forecast_date'])
    for date in df2['forecast_date']:
        time.sleep(2)  # stay well under the API rate limit
        url_date = url.format(station=station, date=date.strftime("%Y%m%d"), key=key)
        try:
            resp = session.get(url_date)
            obs = resp.json()['observations']
        except Exception:
            continue
        if obs is None or len(obs) == 0:
            continue
        # Find the hourly observation closest to 12Z for this date
        dt_12 = datetime(date.year, date.month, date.day, 12, 0, 0)
        min_sec = None
        min_index = None
        for i, o in enumerate(obs):
            obs_dt = datetime.strptime(o["obsTimeUtc"], "%Y-%m-%dT%H:%M:%SZ")
            seconds_diff = abs((obs_dt - dt_12).total_seconds())
            if min_sec is None or seconds_diff < min_sec:
                min_sec = seconds_diff
                min_index = i
        print(obs[min_index]["obsTimeUtc"])
        # Convert the metric readings to Fahrenheit
        temp_c = float(obs[min_index]['metric']['tempHigh'])
        temp_f = round(temp_c * (9 / 5) + 32, 1)
        dew_point_c = float(obs[min_index]['metric']['dewptHigh'])
        dew_point_f = round(dew_point_c * (9 / 5) + 32, 1)
        vals['forecast_date'].append(date)
        vals['temp_f_12z'].append(temp_f)
        vals['dew_point_f_12z'].append(dew_point_f)
        vals['pressure_12z'].append(obs[min_index]["metric"]["pressureMax"])
        vals['humidity_12z'].append(obs[min_index]["humidityAvg"])
        vals['pressure_trend_12z'].append(obs[min_index]["metric"]["pressureTrend"])
    return pd.DataFrame(vals)
Here we retrieve 12Z observations such as temperature, dew point, pressure, pressure trend, and humidity, and add them to the list of features. The full code can be found in labels.py in my repo.
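As a quick usage sketch (the dates here are arbitrary), the function takes a frame of forecast dates and returns one row of 12Z observations per date:
df_dates = pd.DataFrame({'forecast_date': ['2024-11-01', '2024-11-02', '2024-11-03']})
df_12zobs = get_observation_data(df_dates)
print(df_12zobs.head())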
This dataset is then merged with the sounding data as shown below. The result is one record per forecast date. Combined, there are over 500 features from the 12Z soundings (metric and pressure-level combinations) and the 12Z observations.
def merge_feature_data(df_12zsoundings, df_12zobs):
    # Normalize both date columns to plain dates, then inner-join on forecast_date
    df_12zobs['forecast_date'] = pd.to_datetime(df_12zobs['forecast_date']).dt.date
    df_12zsoundings['forecast_date'] = pd.to_datetime(df_12zsoundings['forecast_date']).dt.date
    df = df_12zsoundings.merge(df_12zobs, on='forecast_date', how='inner')
    return df
For prediction, I chose the ExtraTreesRegressor model, with hyperparameters tuned using Optuna. Below is the optimization objective. The full code can again be found in training.py in my repo.
import optuna
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error

def objective(trial):
    # X_train_scaled, y_train, X_test_scaled, y_test come from the enclosing scope
    param = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 20),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None]),
        'bootstrap': trial.suggest_categorical('bootstrap', [True, False])
    }
    reg = ExtraTreesRegressor(**param)
    reg.fit(X_train_scaled, y_train)
    y_pred = reg.predict(X_test_scaled)
    return mean_squared_error(y_test, y_pred)
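To run the optimization, you then create a study that minimizes the objective and refit a final model with the best parameters. A minimal sketch, with an arbitrary trial count:
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)

best_model = ExtraTreesRegressor(**study.best_params)
best_model.fit(X_train_scaled, y_train)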
What makes this project particularly interesting is how it combines meteorological principles with machine learning:
Multi-level Atmospheric Analysis: By sampling data at different pressure levels, the model can capture various atmospheric phenomena that influence surface temperatures:
- 850mb level often indicates warm/cold air advection
- 700mb level can signal incoming weather systems
- 500mb level helps identify larger weather patterns
Morning Initialization: Using 12Z (morning) soundings and observations provides a crucial baseline for temperature prediction, as morning conditions strongly influence daily maximum temperatures.
Spatial Intelligence: By incorporating data from surrounding stations, the model develops an understanding of regional weather patterns and their local impacts.
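One way to sanity-check these claims against the trained model is to inspect its feature importances; a short sketch, where best_model comes from the tuning step above and feature_names stands in for the merged feature columns:
import pandas as pd

# feature_names: hypothetical list of the merged feature column names
importances = pd.Series(best_model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(20))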
The model’s performance has been consistently strong, demonstrating its capability for accurate local temperature prediction. Here is a glimpse of predicted vs. observed highs.
The model generally predicts lower than the observed high. Interestingly, on 11/7/2024 my location here in NJ had several concurrent wildfires and a massive amount of smoke covering a wide area. On that day, the predicted high temperature ended up being higher than the observed high temperature. My assumption is that the broad smoke coverage reduced daytime heating, causing the observed high to come in below the prediction. As stated, this is outside the norm of how the model has been performing. Wildfires are rare events that are not directly captured by weather observation data, as their occurrence depends on a complex interplay of environmental factors beyond standard meteorological measurements. So it makes sense that the model deviated from its usual pattern of predicting below the observed.
While the current implementation shows promising results, several enhancements could further improve its accuracy:
- Incorporate derived parameters such as lifted index (LI), Convective Available Potential Energy (CAPE), and solar radiation.
- Add temporal features to capture seasonal patterns and trends (see the sketch after this list)
- Implement ensemble methods combining multiple prediction approaches
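For the temporal features in particular, a cheap starting point is encoding day of year as a sine/cosine pair so the model sees seasonality as a smooth cycle. A sketch, with illustrative column names:
import numpy as np
import pandas as pd

def add_seasonal_features(df):
    # Put day of year on the unit circle so Dec 31 and Jan 1 end up adjacent
    doy = pd.to_datetime(df['forecast_date']).dt.dayofyear
    df['doy_sin'] = np.sin(2 * np.pi * doy / 365.25)
    df['doy_cos'] = np.cos(2 * np.pi * doy / 365.25)
    return df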
This project combines high-resolution atmospheric soundings, local observations, and a robust machine learning pipeline to deliver daily high-temperature forecasts with precision. By leveraging Docker and automation, the model runs reliably each morning at 9am, integrating fresh data to provide up-to-date predictions. This setup serves as a scalable framework for localized weather forecasting, demonstrating how data science can be harnessed to enhance our understanding of atmospheric dynamics and improve predictive accuracy.
Again, the code is available in my repo, and I welcome contributions and suggestions for improvements. Whether you’re a meteorologist interested in machine learning or a data scientist curious about weather prediction, I hope this project provides valuable insights into both domains.