Computer Vision · Deep Learning · PyTorch · FastAPI · Real-time Systems

Real-time Speech Intelligibility Monitoring System

End-to-end Computer Vision + Deep Learning system that predicts room acoustic quality in real-time using phone camera and custom neural network — eliminating expensive measurement hardware.

Role: Sole Developer (Academic Research)
Period: 2026-01 – 2026-05
Status: Completed

Key Metrics

  • R² Score: 0.985
  • MAE: 0.0035
  • Camera Error Corrected: 100%
  • Data Dependency Reduced: 94%

Technology Stack

Python · PyTorch · OpenCV · XGBoost · FastAPI · Streamlit · NumPy · Scikit-learn

Project Overview

Designed and built an end-to-end system that predicts room acoustic quality in real-time using Computer Vision and Deep Learning — eliminating the need for expensive physical measurement equipment.

A standard phone camera is the only hardware required: the system detects the positions of a sound source and receivers on a physical scale model, then predicts the Speech Transmission Index (STI) in near real time.

Problem Statement

Traditional STI measurement requires specialized acoustic hardware costing thousands of dollars and cannot operate in real-time. This project makes acoustic quality monitoring accessible and instantaneous.

System Architecture

Phone Camera Input
    ↓
[ArUco Marker Detection + Color Segmentation] (OpenCV)
    ↓
[Grid Snapping Algorithm] (100% error correction)
    ↓
[XGBoost Surrogate Model] (5 inputs → 84 ray features)
    ↓
[TwoBranchRayNet] (Custom PyTorch architecture)
    ↓
STI Prediction → IEC 60268-16 Classification
    ↓
[FastAPI Backend + Streamlit Dashboard] (Live Visualization)

Key Technical Achievements

1. Computer Vision Pipeline

  • ArUco marker detection to locate sound source and receiver positions on a physical scale model
  • HSV color-based object segmentation for robust detection under varying light conditions
  • Perspective Transform to map camera coordinates to physical space
  • Custom Grid Snapping Algorithm — corrects 100% of camera hardware error (5–15 cm offset)
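The grid-snapping step can be sketched in a few lines. This is a minimal reconstruction, not the project's actual code: the 0.5 m grid pitch is a hypothetical value chosen so that the stated 5–15 cm camera offset is always less than half a grid cell, which is the condition under which rounding removes the error entirely.

```python
import numpy as np

def snap_to_grid(points_m, pitch_m=0.5):
    """Snap camera-estimated positions (metres) to the nearest grid node.

    Assumes the worst-case camera offset (15 cm here) is smaller than
    half the grid pitch; rounding then recovers the true node exactly,
    which is what the '100% error correction' claim relies on.
    """
    points = np.asarray(points_m, dtype=float)
    return np.round(points / pitch_m) * pitch_m

# A detection 7 cm / 4 cm off the true node (0.5, 1.0) snaps back exactly:
snapped = snap_to_grid([[0.57, 0.96]])  # → [[0.5, 1.0]]
```

The same idea generalizes to any pitch: correction is guaranteed only while the camera error stays below half the pitch, so the grid resolution bounds the error the pipeline can absorb.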

2. Custom Neural Architecture: TwoBranchRayNet

Designed a dual-branch architecture that processes two distinct sets of acoustic features in parallel before merging:

Metric            Value
----------------  --------------------
R² Score          0.985
MAE               0.0035
Inference speed   ~0.9 FPS (CPU-only)
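A plausible shape for such a dual-branch network is sketched below. The branch widths, the 42/42 feature split, and the sigmoid head are assumptions (the sigmoid simply keeps the STI output in its valid [0, 1] range); only the overall pattern, two parallel MLP branches merged before a shared regression head, is taken from the description above.

```python
import torch
import torch.nn as nn

class TwoBranchRayNet(nn.Module):
    """Hypothetical reconstruction: two MLP branches over disjoint
    feature groups, concatenated before a shared regression head."""

    def __init__(self, n_a=42, n_b=42, hidden=64):
        super().__init__()
        self.branch_a = nn.Sequential(nn.Linear(n_a, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        self.branch_b = nn.Sequential(nn.Linear(n_b, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1), nn.Sigmoid())  # STI ∈ [0, 1]

    def forward(self, x_a, x_b):
        # Each branch specializes on its own feature set before merging.
        merged = torch.cat([self.branch_a(x_a), self.branch_b(x_b)], dim=1)
        return self.head(merged)

model = TwoBranchRayNet()
sti = model(torch.randn(4, 42), torch.randn(4, 42))  # shape (4, 1)
```

Processing the two feature groups separately lets each branch learn group-specific representations before the head combines them, which is the usual motivation for this architecture over a single flat MLP.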

3. XGBoost Surrogate Model

Raw ray-tracing simulations require 84 complex features per prediction — expensive to compute in real-time.

Solution: Engineered an XGBoost surrogate model that synthesizes all 84 features from just 5 spatial inputs, achieving:

  • MAE = 0.20 on the reconstructed ray features
  • 94% reduction in raw data dependency
  • Enables stable CPU-only real-time inference
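The surrogate idea, learning a cheap 5-input → 84-output mapping that replaces the ray tracer at inference time, can be sketched as below. Everything here is illustrative: the synthetic data stands in for real ray-tracing outputs, and scikit-learn's GradientBoostingRegressor (wrapped for multi-output) stands in for XGBoost so the sketch has no extra dependencies.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(200, 5))    # 5 spatial inputs (positions)
Y = np.tanh(X @ rng.normal(size=(5, 84)))   # stand-in for 84 ray features

# One boosted regressor per ray feature; XGBoost's XGBRegressor would
# slot in the same way in the real pipeline.
surrogate = MultiOutputRegressor(
    GradientBoostingRegressor(n_estimators=20, max_depth=2))
surrogate.fit(X, Y)

feats = surrogate.predict(X[:1])            # shape (1, 84), ready for the net
```

At inference time only this cheap tree-model evaluation runs, which is why the full pipeline stays stable on a CPU: the expensive ray-tracing simulation is paid once, at training time.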

4. Production Deployment

  • FastAPI REST backend for real-time inference endpoint
  • Streamlit dashboard: live camera feed + STI heatmap + IEC 60268-16 classification
  • Stable quasi-real-time inference (~0.9 FPS) on standard laptop CPU

IEC 60268-16 Classification

The system automatically classifies acoustic quality into international standard categories:

STI Range     Quality     Description
-----------   ---------   -----------------------
0.75 – 1.00   Excellent   Ideal for lectures
0.60 – 0.75   Good        Suitable for most rooms
0.45 – 0.60   Fair        Acceptable
0.30 – 0.45   Poor        Needs improvement
0.00 – 0.30   Bad         Unintelligible

Technology Stack

  • Computer Vision: OpenCV (ArUco, Perspective Transform, HSV segmentation, Morphological operations)
  • Deep Learning: PyTorch (custom TwoBranchRayNet)
  • Machine Learning: XGBoost (Surrogate Model), Scikit-learn
  • Backend: FastAPI (REST API)
  • Frontend: Streamlit (monitoring dashboard)
  • Data: NumPy, Pandas, Matplotlib