Project Overview
Universities receive tens of thousands of student reviews per semester, and manual analysis is impractical at that scale. This project delivers an end-to-end NLP pipeline that automatically classifies Vietnamese student feedback into three sentiment polarities: Positive, Neutral, and Negative.
Why This Problem Is Hard
Vietnamese NLP presents unique challenges:
- Compound words — "giảng_viên" (lecturer) must be treated as one token, not two
- Tone-dependent semantics — Vietnamese uses diacritical marks that fundamentally change word meaning
- School-specific vocabulary — domain-specific terminology rarely covered in general models
Solution: Fine-tuning PhoBERT
PhoBERT is a state-of-the-art BERT-based model pre-trained specifically on a large Vietnamese corpus by VinAI Research. Fine-tuning it on domain-specific student feedback data yields strong performance on educational Vietnamese text.
Methodology
1. Data Preprocessing
```python
from pyvi import ViTokenizer

def preprocess_text(text: str) -> str:
    # Word segmentation — critical for Vietnamese
    # e.g., "giảng viên" → "giảng_viên"
    return ViTokenizer.tokenize(text)
```
- Applied `ViTokenizer` for Vietnamese word segmentation
- Handled a massive dataset of student feedback with label mapping
- Dynamic padding + truncation strategy (max length: 128 tokens)
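The label-mapping step above can be sketched as follows. The field names (`text`, `sentiment`) and the label order are illustrative assumptions, not details confirmed by the project:

```python
# Hypothetical label mapping for the three sentiment classes.
LABEL2ID = {"Negative": 0, "Neutral": 1, "Positive": 2}
ID2LABEL = {v: k for k, v in LABEL2ID.items()}

def map_labels(rows: list[dict]) -> list[dict]:
    """Convert string sentiment labels to integer class ids."""
    return [{"text": r["text"], "label": LABEL2ID[r["sentiment"]]}
            for r in rows]

rows = [{"text": "giảng_viên nhiệt_tình", "sentiment": "Positive"}]
print(map_labels(rows))  # [{'text': 'giảng_viên nhiệt_tình', 'label': 2}]
```

The integer ids are what the classification head predicts; the inverse map turns predictions back into human-readable labels at inference time.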
2. Model Architecture
```
PhoBERT-base (vinai/phobert-base)
        ↓
[CLS] Token Representation
        ↓
Classification Head (Linear: 768 → 3)
        ↓
Softmax → {Positive, Neutral, Negative}
```
- Base: `vinai/phobert-base` via `AutoModelForSequenceClassification`
- Training: AdamW optimizer with learning rate warmup
- Hardware: NVIDIA L4 GPU (Google Colab)
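A minimal sketch of the training setup above, using HuggingFace's linear warmup schedule. The learning rate, warmup fraction, and step count are illustrative assumptions, and a small stand-in module is used here so the snippet runs without downloading PhoBERT:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Stand-in module; the project uses
# AutoModelForSequenceClassification.from_pretrained(
#     "vinai/phobert-base", num_labels=3) here.
model = torch.nn.Linear(768, 3)

num_training_steps = 1000  # illustrative: steps_per_epoch * epochs
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # 10% warmup
    num_training_steps=num_training_steps,
)
# The learning rate ramps linearly from 0 during warmup,
# then decays linearly to 0 over the remaining steps.
```

Warmup stabilizes early fine-tuning of large pre-trained transformers, which is why it pairs naturally with AdamW here.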
3. Evaluation Results
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| Positive | 0.94 | 0.96 | 0.95 |
| Neutral | 0.91 | 0.89 | 0.90 |
| Negative | 0.97 | 0.95 | 0.96 |
| Weighted Avg | — | — | 0.94 |
Overall Accuracy: 94.6% on 3,100+ unseen test samples.
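The per-class table above can be produced with scikit-learn's classification report. The arrays below are tiny illustrative stand-ins, not the project's real test-set predictions:

```python
from sklearn.metrics import classification_report, confusion_matrix

LABELS = ["Negative", "Neutral", "Positive"]
y_true = [0, 0, 1, 1, 2, 2]   # illustrative ground-truth class ids
y_pred = [0, 1, 1, 1, 2, 2]   # illustrative model predictions

# Per-class precision / recall / F1, matching the table layout above
print(classification_report(y_true, y_pred, target_names=LABELS))
# Confusion matrix: rows = true class, columns = predicted class
print(confusion_matrix(y_true, y_pred))
```

Passing `output_dict=True` to `classification_report` returns the same metrics as a nested dict, convenient for logging or dashboards.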
Key Design Decision: Prioritizing Negative Detection
The F1-Score of 0.96 for Negative feedback is the most critical metric: missing a genuine complaint costs universities actionable insight. This was achieved by carefully balancing the dataset and tuning the classification threshold.
Production Readiness
The inference pipeline is modular and API-ready:
```python
import torch
from pyvi import ViTokenizer

# `tokenizer` and `model` are the fine-tuned PhoBERT artifacts,
# loaded once at startup.

def predict_sentiment(text: str) -> dict:
    # 1. Vietnamese word segmentation
    segmented = ViTokenizer.tokenize(text)
    # 2. Tokenize for PhoBERT
    inputs = tokenizer(segmented, return_tensors="pt",
                       max_length=128, truncation=True, padding=True)
    # 3. Inference
    with torch.no_grad():
        logits = model(**inputs).logits
    # 4. Return human-readable result
    label = ["Negative", "Neutral", "Positive"][logits.argmax().item()]
    confidence = torch.softmax(logits, dim=1).max().item()
    return {"label": label, "confidence": f"{confidence:.1%}"}
```
Wrappable directly into a FastAPI endpoint for a university analytics dashboard.
Technology Stack
- Deep Learning: PyTorch, HuggingFace `transformers`, `datasets`
- Pre-trained Model: `vinai/phobert-base`
- Vietnamese NLP: `pyvi` (ViTokenizer)
- Evaluation: Scikit-learn (classification report, confusion matrix)
- Data: Pandas, NumPy
- Hardware: NVIDIA L4 GPU (Google Colab)