## Project Overview
Designed and built an end-to-end system that predicts room acoustic quality in real time using computer vision and deep learning, eliminating the need for expensive physical measurement equipment.
The only required hardware is a standard phone camera: the system detects the positions of a sound source and its receivers on a physical scale model, then predicts the Speech Transmission Index (STI) in milliseconds.
## Problem Statement
Traditional STI measurement requires specialized acoustic hardware costing thousands of dollars and cannot operate in real time. This project makes acoustic quality monitoring accessible and instantaneous.
## System Architecture
Phone Camera Input
↓
[ArUco Marker Detection + Color Segmentation] (OpenCV)
↓
[Grid Snapping Algorithm] (corrects the 5–15 cm camera offset)
↓
[XGBoost Surrogate Model] (5 inputs → 84 ray features)
↓
[TwoBranchRayNet] (Custom PyTorch architecture)
↓
STI Prediction → IEC 60268-16 Classification
↓
[FastAPI Backend + Streamlit Dashboard] (Live Visualization)
## Key Technical Achievements
### 1. Computer Vision Pipeline
- ArUco marker detection to locate sound source and receiver positions on a physical scale model
- HSV color-based object segmentation for robust detection under varying light conditions
- Perspective Transform to map camera coordinates to physical space
- Custom Grid Snapping Algorithm: snaps detected positions to the nearest grid node, eliminating the full 5–15 cm offset introduced by camera hardware error
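The grid-snapping step can be sketched as follows. The 0.25 m grid pitch and the assumption that detections have already been mapped into physical coordinates via the perspective transform are illustrative, not taken from the project:

```python
# Sketch of the grid-snapping step, assuming the scale model uses a
# uniform grid with a known pitch (0.25 m here is a made-up value).
# Input positions are assumed to already be in physical coordinates,
# i.e. after the perspective transform. Snapping to the nearest grid
# node absorbs the 5-15 cm offset introduced by the camera.

GRID_PITCH_M = 0.25  # hypothetical grid spacing of the scale model

def snap_to_grid(x: float, y: float, pitch: float = GRID_PITCH_M) -> tuple[float, float]:
    """Round a detected (x, y) position to the nearest grid node."""
    return (round(x / pitch) * pitch, round(y / pitch) * pitch)

# A detection off by up to half the pitch lands on the correct node:
print(snap_to_grid(0.52, 0.74))  # -> (0.5, 0.75)
```

Any detection error smaller than half the grid pitch is corrected exactly, which is why a bounded camera offset can be eliminated entirely.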
### 2. Custom Neural Architecture: TwoBranchRayNet
Designed a dual-branch architecture that processes two distinct sets of acoustic features in parallel before merging:
| Metric | Value |
|---|---|
| R² Score | 0.985 |
| MAE | 0.0035 |
| Inference Speed | ~0.9 FPS (CPU-only) |
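A minimal sketch of the dual-branch idea in PyTorch. The layer widths, activations, and the even 42/42 split of the 84 ray features are placeholders, since the actual hyperparameters are not listed here:

```python
import torch
import torch.nn as nn

class TwoBranchRayNet(nn.Module):
    """Sketch of a dual-branch regressor: two feature groups are encoded
    in parallel, then the embeddings are concatenated and regressed to a
    single STI value. Sizes are illustrative, not the project's actual
    hyperparameters."""

    def __init__(self, dim_a: int = 42, dim_b: int = 42, hidden: int = 64):
        super().__init__()
        self.branch_a = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU())
        self.branch_b = nn.Sequential(nn.Linear(dim_b, hidden), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # STI lies in [0, 1]
        )

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # Encode each feature group independently, then merge.
        merged = torch.cat([self.branch_a(feats_a), self.branch_b(feats_b)], dim=-1)
        return self.head(merged)

model = TwoBranchRayNet()
sti = model(torch.randn(8, 42), torch.randn(8, 42))
print(sti.shape)  # torch.Size([8, 1])
```

Keeping the two feature groups in separate branches lets each encoder specialize before the merged head learns their interaction.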
### 3. XGBoost Surrogate Model
Raw ray-tracing simulations require 84 complex features per prediction, which is too expensive to compute in real time.
Solution: engineered an XGBoost surrogate model that reconstructs all 84 ray features from just 5 spatial inputs, achieving:
- MAE = 0.20
- 94% reduction in raw data dependency
- Enables stable CPU-only real-time inference
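The resulting two-stage inference path can be sketched with stand-in models. The real system uses the trained XGBoost regressor and TwoBranchRayNet; the tiling and averaging below are placeholders for those learned mappings:

```python
# Sketch of the two-stage inference path: a cheap surrogate expands the
# 5 spatial inputs into the 84 ray features the network expects, so no
# ray tracing runs at inference time. Both models here are stand-ins.

N_SPATIAL = 5
N_RAY_FEATURES = 84

def surrogate_predict(spatial: list[float]) -> list[float]:
    """Stand-in for the XGBoost surrogate: 5 inputs -> 84 ray features."""
    assert len(spatial) == N_SPATIAL
    # A real model learns this mapping; here we simply tile the inputs.
    return [spatial[i % N_SPATIAL] for i in range(N_RAY_FEATURES)]

def predict_sti(spatial: list[float]) -> float:
    """Full pipeline: spatial inputs -> ray features -> STI in [0, 1]."""
    rays = surrogate_predict(spatial)
    # Stand-in for TwoBranchRayNet: clamp a feature average into [0, 1].
    mean = sum(rays) / len(rays)
    return max(0.0, min(1.0, mean))

features = surrogate_predict([0.5, 1.0, 2.0, 1.5, 3.0])
print(len(features))  # -> 84
```

Because only the 5-input surrogate and a small network run per frame, the whole chain stays fast enough for CPU-only inference.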
### 4. Production Deployment
- FastAPI REST backend for real-time inference endpoint
- Streamlit dashboard: live camera feed + STI heatmap + IEC 60268-16 classification
- Stable quasi-real-time inference (~0.9 FPS) on standard laptop CPU
## IEC 60268-16 Classification
The system automatically classifies acoustic quality into international standard categories:
| STI Range | Quality | Description |
|---|---|---|
| 0.75 – 1.00 | Excellent | Ideal for lectures |
| 0.60 – 0.75 | Good | Suitable for most rooms |
| 0.45 – 0.60 | Fair | Acceptable |
| 0.30 – 0.45 | Poor | Needs improvement |
| 0.00 – 0.30 | Bad | Unintelligible |
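The banding above maps directly to a small lookup. Treating each boundary value as belonging to the upper band is an assumption here, since the ranges in the table overlap at their edges:

```python
def classify_sti(sti: float) -> str:
    """Map an STI value in [0, 1] to its IEC 60268-16 quality band,
    following the table above (boundary values go to the upper band)."""
    if not 0.0 <= sti <= 1.0:
        raise ValueError("STI must lie in [0, 1]")
    bands = [(0.75, "Excellent"), (0.60, "Good"), (0.45, "Fair"), (0.30, "Poor")]
    for lower, label in bands:
        if sti >= lower:
            return label
    return "Bad"

print(classify_sti(0.68))  # -> Good
```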
## Technology Stack
- Computer Vision: OpenCV (ArUco, Perspective Transform, HSV segmentation, Morphological operations)
- Deep Learning: PyTorch (custom TwoBranchRayNet)
- Machine Learning: XGBoost (Surrogate Model), Scikit-learn
- Backend: FastAPI (REST API)
- Frontend: Streamlit (monitoring dashboard)
- Data: NumPy, Pandas, Matplotlib