Tags: Time-Series · LSTM · SARIMA · Climate AI · Optuna

Atmospheric CO₂ Concentration Forecasting

Comprehensive time-series study comparing SARIMA, SARIMAX, LSTM, and hybrid architectures for global CO₂ prediction — integrating El Niño/La Niña patterns as exogenous variables.

Role: Team Leader (4 members)
Period: 2025-09 to 2025-11
Status: Completed

Best RMSE: 0.5513
Improvement vs. SARIMA baseline: +44%
Models benchmarked: 5
Team size: 4

Technology Stack

Python · TensorFlow · Keras · Statsmodels · Optuna · Scikit-learn · Pandas · Seaborn

Project Overview

A rigorous academic research project forecasting long-term global atmospheric CO₂ concentrations. The study uniquely integrates the Southern Oscillation Index (SOI) — a measure of El Niño and La Niña climate patterns — as an exogenous variable to capture complex macroclimatic interactions.

This project's contribution is not just the forecast accuracy, but the systematic benchmarking insights that reveal when deep learning truly outperforms classical statistics — and when it doesn't.

Research Questions

  1. Can integrating SOI (El Niño/La Niña data) improve CO₂ forecasting accuracy?
  2. Do hybrid SARIMA-LSTM architectures consistently outperform pure models?
  3. When do traditional statistical methods outperform deep learning — and why?

Data Engineering

Data Sources

Dataset                          | Source                     | Frequency
Global CO₂ Concentration         | NOAA Global Monitoring Lab | Monthly
Southern Oscillation Index (SOI) | NOAA Climate.gov           | Monthly

Exploratory Data Analysis

  • Augmented Dickey-Fuller (ADF) Test — verified sequence stationarity conditions
  • ACF / PACF Analysis — determined optimal SARIMA parameters (p, d, q)(P, D, Q, s)
  • Decomposition into trend, seasonality, and residual components

Model Architectures

Statistical Baselines

SARIMA(p,d,q)(P,D,Q,s) — univariate
SARIMAX(p,d,q)(P,D,Q,s) + SOI exogenous — multivariate
→ Parameters optimized via Grid Search on AIC criterion

Deep Learning

Univariate LSTM — CO₂ sequence only
Multivariate LSTM — CO₂ + SOI features
→ Hyperparameters auto-tuned via Optuna (100 trials)

Hybrid Pipelines

SARIMA residuals → LSTM
SARIMAX residuals → LSTM
→ Tests whether DL can extract nonlinear patterns from SARIMA residuals

Results

Model              | RMSE   | Notes
SARIMA             | 0.9927 | Baseline
SARIMAX + SOI      | 0.8843 | SOI helps
Univariate LSTM    | 0.7201 | DL competitive
Multivariate LSTM  | 0.5513 | Best; +44% vs. SARIMA
Hybrid SARIMA-LSTM | 0.9134 | Degraded vs. pure LSTM
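The headline "+44% vs. SARIMA" figure follows directly from the two RMSE values above:

```python
# Relative RMSE reduction of the multivariate LSTM over the SARIMA baseline.
sarima_rmse = 0.9927
lstm_rmse = 0.5513

improvement = (sarima_rmse - lstm_rmse) / sarima_rmse
print(f"{improvement:.1%}")  # -> 44.5%, reported as +44%
```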

Key Analytical Insight

The hybrid model degraded performance — and we analyzed exactly why.

After SARIMA extracted the linear trend and seasonality, the remaining residuals were statistically close to white noise (confirmed via the Ljung-Box test). With no meaningful structure left to learn, the LSTM stage added only noise.

This demonstrates a nuanced understanding: hybrid models are only beneficial when SARIMA residuals contain genuine nonlinear structure. In this dataset, SARIMA's linear extraction was too effective for a hybrid to help.

This kind of interpretability analysis distinguishes strong engineers from ones who just run models.

Engineering Highlights

  • Optuna automated hyperparameter tuning over 100 trials (LSTM layers, units, dropout, learning rate, sequence length)
  • Rigorous walk-forward validation to prevent data leakage in time-series context
  • Clean modular codebase: each model architecture in its own Python module
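The walk-forward scheme mentioned above can be sketched in plain NumPy: each fold trains only on the past and tests on the block immediately after it, so no future data leaks backward. The fold sizes here are illustrative.

```python
# Expanding-window walk-forward splits for time-series validation.
import numpy as np

def walk_forward_splits(n: int, initial_train: int, horizon: int):
    """Yield (train_idx, test_idx) pairs with an expanding training window."""
    start = initial_train
    while start + horizon <= n:
        yield np.arange(0, start), np.arange(start, start + horizon)
        start += horizon

for train_idx, test_idx in walk_forward_splits(n=48, initial_train=24, horizon=12):
    assert train_idx[-1] < test_idx[0]  # training data always precedes the test block
    print(f"train [0, {train_idx[-1]}] -> test [{test_idx[0]}, {test_idx[-1]}]")
```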

Technology Stack

  • Deep Learning: TensorFlow, Keras
  • Statistical: Statsmodels (SARIMA/SARIMAX)
  • Optimization: Optuna (hyperparameter tuning)
  • ML: Scikit-learn
  • Data: Pandas, NumPy
  • Visualization: Matplotlib, Seaborn