This project is a simulated industry case study based on a real-world TikTok content moderation scenario, completed as part of the Google Advanced Data Analytics Professional Certificate.
The objective is to support TikTok’s content moderation pipeline by improving how user reports are prioritized. Since verified users are more likely to post opinions rather than claims, I built a logistic regression model to predict verified status using video-level features.
TikTok’s data team is developing a machine learning system to classify claims vs. opinions in videos. A reliable classification system enables TikTok to reduce moderation backlog, prioritize urgent reports, and improve platform safety.
As part of this workflow, I built a logistic regression model to analyze how video characteristics relate to user verification status, providing insights to inform the final classification model.
This project follows the PACE (Plan → Analyze → Construct → Execute) framework, commonly used in industry data science projects.
Dataset:
- 19,382 videos × 12 features
Tools:
- Python
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
Objective:
- Predict whether a video creator is verified using engagement metrics and metadata
- Removed 298 rows with missing values
- Verified absence of duplicates
- Visualized distributions using boxplots
- Applied IQR-based capping to control extreme values
- Only 6.3% of users were verified
- Applied upsampling to create a balanced 50/50 dataset, improving model learning
- Created
text_lengthfeature from video transcription - Checked multicollinearity using correlation heatmaps
- Removed
video_like_countdue to strong correlation (0.86) withvideo_view_count
- Engagement metrics
- Video duration
- Claim status
- Author ban status
- One-hot encoding for categorical variables
- Train-test split: 75% / 25%
- Logistic Regression (
max_iter=800)
| Metric | Value |
|---|---|
| Accuracy | 65% |
| Precision (Not Verified) | 61% |
| Recall (Not Verified) | 84% |
- High recall (84%) ensures most unverified users are correctly identified, which is critical for moderation prioritization.
- Precision is moderate but acceptable given the operational focus on minimizing missed detections.
Each additional second increases the log-odds of verification by 0.009, indicating that longer videos are more likely posted by verified users.
Views, shares, and comments show weak predictive power relative to video duration.
Verified status prediction provides valuable signal for downstream classification of claims vs. opinions, improving moderation efficiency.
This project demonstrates how logistic regression can support content moderation systems by improving understanding of creator behavior. The model’s insights help inform claim vs. opinion classification, enabling more efficient report prioritization and reduced moderation backlog.
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Logistic Regression
- Handling Class Imbalance
- Model Evaluation
- Business-Oriented Data Science