## Project Overview
A rigorous academic research project forecasting long-term global atmospheric CO₂ concentrations. The study uniquely integrates the Southern Oscillation Index (SOI) — a measure of El Niño and La Niña climate patterns — as an exogenous variable to capture complex macroclimatic interactions.
The project's contribution lies not only in forecast accuracy but in systematic benchmarking that reveals when deep learning genuinely outperforms classical statistics, and when it does not.
## Research Questions
- Can integrating SOI (El Niño/La Niña data) improve CO₂ forecasting accuracy?
- Do hybrid SARIMA-LSTM architectures consistently outperform pure models?
- When do traditional statistical methods outperform deep learning — and why?
## Data Engineering
### Data Sources
| Dataset | Source | Frequency |
|---|---|---|
| Global CO₂ Concentration | NOAA Global Monitoring Lab | Monthly |
| Southern Oscillation Index (SOI) | NOAA Climate.gov | Monthly |
### Exploratory Data Analysis
- Augmented Dickey-Fuller (ADF) test — assessed stationarity of the series and the differencing order d
- ACF / PACF analysis — guided selection of the SARIMA orders (p, d, q)(P, D, Q, s)
- Seasonal decomposition into trend, seasonal, and residual components
## Model Architectures
### Statistical Baselines
- `SARIMA(p,d,q)(P,D,Q,s)` — univariate
- `SARIMAX(p,d,q)(P,D,Q,s)` + SOI exogenous — multivariate

→ Orders optimized via grid search minimizing AIC
### Deep Learning
- Univariate LSTM — CO₂ sequence only
- Multivariate LSTM — CO₂ + SOI features

→ Hyperparameters tuned automatically with Optuna (100 trials)
### Hybrid Pipelines
- SARIMA residuals → LSTM
- SARIMAX residuals → LSTM

→ Tests whether deep learning can extract nonlinear patterns left in the SARIMA residuals
## Results
| Model | RMSE | Notes |
|---|---|---|
| SARIMA | 0.9927 | Baseline |
| SARIMAX + SOI | 0.8843 | SOI helps |
| Univariate LSTM | 0.7201 | DL competitive |
| Multivariate LSTM | 0.5513 | Best: 44% lower RMSE than SARIMA |
| Hybrid SARIMA-LSTM | 0.9134 | Degraded vs pure LSTM |
## Key Analytical Insight
The hybrid model degraded performance — and we analyzed exactly why.
After SARIMA successfully extracted the linear trend and seasonality, the remaining residuals behaved statistically close to white noise (confirmed via Ljung-Box test). The LSTM found no meaningful patterns to learn from, adding only noise.
The takeaway is nuanced: hybrid models are beneficial only when the SARIMA residuals retain genuine nonlinear structure. In this dataset, SARIMA's linear extraction was effective enough that the LSTM stage had nothing left to model.
This kind of interpretability analysis distinguishes strong engineers from ones who just run models.
## Engineering Highlights
- Optuna automated hyperparameter tuning over 100 trials (LSTM layers, units, dropout, learning rate, sequence length)
- Rigorous walk-forward validation to prevent data leakage in a time-series setting
- Clean modular codebase: each model architecture in its own Python module
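The walk-forward validation mentioned above can be sketched as a minimal expanding-window splitter. The fold sizes here are assumptions; the project's exact settings are not stated in this README.

```python
# Minimal expanding-window walk-forward splitter (illustrative sizes only).
def walk_forward_splits(n_obs, initial_train, horizon):
    """Yield (train_indices, test_indices): train on [0, end), test on the
    next `horizon` points, then grow the training window and repeat."""
    end = initial_train
    while end + horizon <= n_obs:
        yield list(range(end)), list(range(end, end + horizon))
        end += horizon


splits = list(walk_forward_splits(n_obs=60, initial_train=36, horizon=6))
for train, test in splits:
    print(f"train: 0..{train[-1]}  test: {test[0]}..{test[-1]}")
```

Because every test index lies strictly after its training window, no future observation can leak into model fitting.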
## Technology Stack
- Deep Learning: TensorFlow, Keras
- Statistical: Statsmodels (SARIMA/SARIMAX)
- Optimization: Optuna (hyperparameter tuning)
- ML: Scikit-learn
- Data: Pandas, NumPy
- Visualization: Matplotlib, Seaborn