Computer Vision · Deep Learning · PyTorch · FastAPI · Real-time Systems

Real-time Speech Intelligibility Monitoring System

End-to-end Computer Vision + Deep Learning system that predicts room acoustic quality in real-time using phone camera and custom neural network — eliminating expensive measurement hardware.

Role: Sole Developer (Academic Research)
Period: 2026-01 – 2026-05
Status: Completed

Key Metrics

  • R² Score: 0.985
  • MAE: 0.0035
  • Camera Error Corrected: 100%
  • Data Dependency Reduced: 94%

Technology Stack

Python · PyTorch · OpenCV · XGBoost · FastAPI · Streamlit · NumPy · Scikit-learn

Project Overview

Designed and built an end-to-end system that predicts room acoustic quality in real-time using Computer Vision and Deep Learning — eliminating the need for expensive physical measurement equipment.

A standard phone camera is the only hardware required: the system detects the positions of a sound source and receivers on a physical scale model, then predicts the Speech Transmission Index (STI) in near real time.

Problem Statement

Traditional STI measurement requires specialized acoustic hardware costing thousands of dollars and cannot operate in real-time. This project makes acoustic quality monitoring accessible and instantaneous.

System Architecture

Phone Camera Input
    ↓
[ArUco Marker Detection + Color Segmentation] (OpenCV)
    ↓
[Grid Snapping Algorithm] (100% error correction)
    ↓
[XGBoost Surrogate Model] (5 inputs → 84 ray features)
    ↓
[TwoBranchRayNet] (Custom PyTorch architecture)
    ↓
STI Prediction → IEC 60268-16 Classification
    ↓
[FastAPI Backend + Streamlit Dashboard] (Live Visualization)

Key Technical Achievements

1. Computer Vision Pipeline

  • ArUco marker detection to locate sound source and receiver positions on a physical scale model
  • HSV color-based object segmentation for robust detection under varying light conditions
  • Perspective Transform to map camera coordinates to physical space
  • Custom Grid Snapping Algorithm — corrects 100% of camera hardware error (5–15 cm offset)
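The grid-snapping step can be sketched in a few lines. This is a minimal reconstruction, not the project's actual code: the 0.5 m grid pitch is a hypothetical value chosen so that the stated 5–15 cm camera offset is always less than half a grid cell, which is the condition under which rounding removes the error entirely.

```python
import numpy as np

def snap_to_grid(points_m, pitch_m=0.5):
    """Snap camera-estimated positions (metres) to the nearest grid node.

    Assumes the worst-case camera offset (15 cm here) is smaller than
    half the grid pitch; rounding then recovers the true node exactly,
    which is what the '100% error correction' claim relies on.
    """
    points = np.asarray(points_m, dtype=float)
    return np.round(points / pitch_m) * pitch_m

# A detection 7 cm / 4 cm off the true node (0.5, 1.0) snaps back exactly:
snapped = snap_to_grid([[0.57, 0.96]])  # → [[0.5, 1.0]]
```

The same idea generalizes to any pitch: correction is guaranteed only while the camera error stays below half the pitch, so the grid resolution bounds the error the pipeline can absorb.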

2. Custom Neural Architecture: TwoBranchRayNet

Designed a dual-branch architecture that processes two distinct sets of acoustic features in parallel before merging:

Metric            Value
----------------  --------------------
R² Score          0.985
MAE               0.0035
Inference speed   ~0.9 FPS (CPU-only)
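A plausible shape for such a dual-branch network is sketched below. The branch widths, the 42/42 feature split, and the sigmoid head are assumptions (the sigmoid simply keeps the STI output in its valid [0, 1] range); only the overall pattern, two parallel MLP branches merged before a shared regression head, is taken from the description above.

```python
import torch
import torch.nn as nn

class TwoBranchRayNet(nn.Module):
    """Hypothetical reconstruction: two MLP branches over disjoint
    feature groups, concatenated before a shared regression head."""

    def __init__(self, n_a=42, n_b=42, hidden=64):
        super().__init__()
        self.branch_a = nn.Sequential(nn.Linear(n_a, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        self.branch_b = nn.Sequential(nn.Linear(n_b, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1), nn.Sigmoid())  # STI ∈ [0, 1]

    def forward(self, x_a, x_b):
        # Each branch specializes on its own feature set before merging.
        merged = torch.cat([self.branch_a(x_a), self.branch_b(x_b)], dim=1)
        return self.head(merged)

model = TwoBranchRayNet()
sti = model(torch.randn(4, 42), torch.randn(4, 42))  # shape (4, 1)
```

Processing the two feature groups separately lets each branch learn group-specific representations before the head combines them, which is the usual motivation for this architecture over a single flat MLP.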

3. XGBoost Surrogate Model

Raw ray-tracing simulations require 84 complex features per prediction — expensive to compute in real-time.

Solution: Engineered an XGBoost surrogate model that synthesizes all 84 features from just 5 spatial inputs, achieving:

  • MAE = 0.20 on the reconstructed ray features
  • 94% reduction in raw data dependency
  • Enables stable CPU-only real-time inference
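The surrogate idea, learning a cheap 5-input → 84-output mapping that replaces the ray tracer at inference time, can be sketched as below. Everything here is illustrative: the synthetic data stands in for real ray-tracing outputs, and scikit-learn's GradientBoostingRegressor (wrapped for multi-output) stands in for XGBoost so the sketch has no extra dependencies.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(200, 5))    # 5 spatial inputs (positions)
Y = np.tanh(X @ rng.normal(size=(5, 84)))   # stand-in for 84 ray features

# One boosted regressor per ray feature; XGBoost's XGBRegressor would
# slot in the same way in the real pipeline.
surrogate = MultiOutputRegressor(
    GradientBoostingRegressor(n_estimators=20, max_depth=2))
surrogate.fit(X, Y)

feats = surrogate.predict(X[:1])            # shape (1, 84), ready for the net
```

At inference time only this cheap tree-model evaluation runs, which is why the full pipeline stays stable on a CPU: the expensive ray-tracing simulation is paid once, at training time.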

4. Production Deployment

  • FastAPI REST backend for real-time inference endpoint
  • Streamlit dashboard: live camera feed + STI heatmap + IEC 60268-16 classification
  • Stable quasi-real-time inference (~0.9 FPS) on standard laptop CPU

IEC 60268-16 Classification

The system automatically classifies acoustic quality into international standard categories:

STI Range     Quality     Description
-----------   ---------   -----------------------
0.75 – 1.00   Excellent   Ideal for lectures
0.60 – 0.75   Good        Suitable for most rooms
0.45 – 0.60   Fair        Acceptable
0.30 – 0.45   Poor        Needs improvement
0.00 – 0.30   Bad         Unintelligible

Technology Stack

  • Computer Vision: OpenCV (ArUco, Perspective Transform, HSV segmentation, Morphological operations)
  • Deep Learning: PyTorch (custom TwoBranchRayNet)
  • Machine Learning: XGBoost (Surrogate Model), Scikit-learn
  • Backend: FastAPI (REST API)
  • Frontend: Streamlit (monitoring dashboard)
  • Data: NumPy, Pandas, Matplotlib