## Project Overview
A rigorous academic research project forecasting long-term global atmospheric CO₂ concentrations. The study uniquely integrates the Southern Oscillation Index (SOI) — a measure of El Niño and La Niña climate patterns — as an exogenous variable to capture complex macroclimatic interactions.
The project's contribution lies not only in forecast accuracy but in systematic benchmarking that reveals when deep learning genuinely outperforms classical statistics, and when it does not.
## Research Questions
- Can integrating SOI (El Niño/La Niña data) improve CO₂ forecasting accuracy?
- Do hybrid SARIMA-LSTM architectures consistently outperform pure models?
- When do traditional statistical methods outperform deep learning — and why?
## Data Engineering
### Data Sources
| Dataset | Source | Frequency |
|---|---|---|
| Global CO₂ Concentration | NOAA Global Monitoring Lab | Monthly |
| Southern Oscillation Index (SOI) | NOAA Climate.gov | Monthly |
### Exploratory Data Analysis
- Augmented Dickey-Fuller (ADF) test — assessed stationarity of the series and the differencing order d
- ACF / PACF analysis — guided selection of the SARIMA orders (p, d, q)(P, D, Q, s)
- Seasonal decomposition into trend, seasonal, and residual components
## Model Architectures
### Statistical Baselines
- `SARIMA(p,d,q)(P,D,Q,s)` — univariate
- `SARIMAX(p,d,q)(P,D,Q,s)` + SOI exogenous — multivariate

→ Orders optimized via grid search minimizing AIC
### Deep Learning
- Univariate LSTM — CO₂ sequence only
- Multivariate LSTM — CO₂ + SOI features

→ Hyperparameters tuned automatically with Optuna (100 trials)
### Hybrid Pipelines
- SARIMA residuals → LSTM
- SARIMAX residuals → LSTM

→ Tests whether deep learning can extract nonlinear patterns left in the SARIMA residuals
## Results
| Model | RMSE | Notes |
|---|---|---|
| SARIMA | 0.9927 | Baseline |
| SARIMAX + SOI | 0.8843 | SOI helps |
| Univariate LSTM | 0.7201 | DL competitive |
| Multivariate LSTM | 0.5513 | Best: 44% lower RMSE than SARIMA |
| Hybrid SARIMA-LSTM | 0.9134 | Degraded vs pure LSTM |
## Key Analytical Insight
The hybrid model degraded performance — and we analyzed exactly why.
After SARIMA successfully extracted the linear trend and seasonality, the remaining residuals behaved statistically close to white noise (confirmed via Ljung-Box test). The LSTM found no meaningful patterns to learn from, adding only noise.
The takeaway is nuanced: hybrid models are beneficial only when the SARIMA residuals retain genuine nonlinear structure. In this dataset, SARIMA's linear extraction was effective enough that the LSTM stage had nothing left to model.
This kind of interpretability analysis distinguishes strong engineers from ones who just run models.
## Engineering Highlights
- Optuna automated hyperparameter tuning over 100 trials (LSTM layers, units, dropout, learning rate, sequence length)
- Rigorous walk-forward validation to prevent data leakage in a time-series setting
- Clean modular codebase: each model architecture in its own Python module
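The walk-forward validation mentioned above can be sketched as a minimal expanding-window splitter. The fold sizes here are assumptions; the project's exact settings are not stated in this README.

```python
# Minimal expanding-window walk-forward splitter (illustrative sizes only).
def walk_forward_splits(n_obs, initial_train, horizon):
    """Yield (train_indices, test_indices): train on [0, end), test on the
    next `horizon` points, then grow the training window and repeat."""
    end = initial_train
    while end + horizon <= n_obs:
        yield list(range(end)), list(range(end, end + horizon))
        end += horizon


splits = list(walk_forward_splits(n_obs=60, initial_train=36, horizon=6))
for train, test in splits:
    print(f"train: 0..{train[-1]}  test: {test[0]}..{test[-1]}")
```

Because every test index lies strictly after its training window, no future observation can leak into model fitting.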
## Technology Stack
- Deep Learning: TensorFlow, Keras
- Statistical: Statsmodels (SARIMA/SARIMAX)
- Optimization: Optuna (hyperparameter tuning)
- ML: Scikit-learn
- Data: Pandas, NumPy
- Visualization: Matplotlib, Seaborn