This project automates the extraction, transformation, and loading (ETL) of YouTube video comments in Python. Comments are scraped using the YouTube Data API, saved as a CSV file, and uploaded to an S3 bucket. An Apache Airflow DAG, hosted on an Amazon EC2 instance, orchestrates the entire workflow so that it runs daily without manual intervention.
- Python 3.5+
- Apache Airflow
- AWS account with an S3 bucket
- YouTube Data API key
- Amazon EC2 instance
- Obtained a "Developer Key" to access the YouTube Data API. Google provides detailed documentation for the API (see the client setup sketch below).
- Set up an AWS free tier account and launched an EC2 instance with an appropriate instance type (e.g., t2.micro) and Ubuntu/Debian as the OS.
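  The instance was launched from the AWS console; as a rough scripted alternative, the same launch can be sketched with boto3 (the region, AMI ID, and key pair name below are placeholders):

  ```python
  import boto3

  ec2 = boto3.client("ec2", region_name="us-east-1")

  # Launch a single free-tier instance; ami-xxxxxxxx stands in for an
  # Ubuntu/Debian AMI ID valid in your region.
  response = ec2.run_instances(
      ImageId="ami-xxxxxxxx",
      InstanceType="t2.micro",
      KeyName="my-key-pair",  # existing key pair used for SSH access
      MinCount=1,
      MaxCount=1,
  )
  print(response["Instances"][0]["InstanceId"])
  ```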
- Connected to the EC2 instance and installed all dependencies.
  Commands to install updates and dependencies (the instance runs Ubuntu/Debian, so apt is used; the original yum commands apply to Amazon Linux):
  i. sudo apt update
  ii. sudo apt install python3-pip
  iii. sudo pip3 install apache-airflow
  iv. sudo pip3 install pandas
  v. sudo pip3 install s3fs
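  An optional sanity check that the installs succeeded, run in a Python shell on the instance:

  ```python
  # Confirm the freshly installed packages import cleanly before building the DAG.
  import airflow
  import pandas
  import s3fs

  print(airflow.__version__, pandas.__version__, s3fs.__version__)
  ```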
- Developed code to fetch comments from the YouTube video "Google I/O '24 in under 10 minutes", used pandas to save the data into a CSV file, and wrote it to the S3 bucket (see the sketch below).
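  A condensed sketch of that extract-and-load step, reusing the client from the earlier snippet. The video ID, bucket name, and column names are placeholders for illustration, not the project's exact schema:

  ```python
  import pandas as pd
  from googleapiclient.discovery import build

  API_KEY = "YOUR_DEVELOPER_KEY"
  VIDEO_ID = "VIDEO_ID"  # placeholder for the Google I/O '24 recap video's ID

  youtube = build("youtube", "v3", developerKey=API_KEY)

  comments = []
  request = youtube.commentThreads().list(
      part="snippet", videoId=VIDEO_ID, maxResults=100, textFormat="plainText"
  )
  while request is not None:
      response = request.execute()
      for item in response["items"]:
          top = item["snippet"]["topLevelComment"]["snippet"]
          comments.append(
              {
                  "author": top["authorDisplayName"],
                  "comment": top["textDisplay"],
                  "published_at": top["publishedAt"],
                  "likes": top["likeCount"],
              }
          )
      # list_next() pages through the remaining comment threads.
      request = youtube.commentThreads().list_next(request, response)

  # s3fs lets pandas write straight to S3 via an s3:// URI.
  pd.DataFrame(comments).to_csv(
      "s3://your-bucket-name/youtube_comments.csv", index=False
  )
  ```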
- Created a DAG that runs daily and copied it into the Airflow DAGs directory (a skeleton is sketched below).
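  A skeleton of such a daily DAG for Airflow 2.x; the `dag_id`, start date, and the `youtube_etl` module and `run_youtube_etl` function are assumptions for illustration:

  ```python
  from datetime import datetime

  from airflow import DAG
  from airflow.operators.python import PythonOperator

  # Hypothetical import: the ETL logic from the previous step, packaged as a
  # function in a module placed next to this DAG file.
  from youtube_etl import run_youtube_etl

  with DAG(
      dag_id="youtube_etl_dag",
      start_date=datetime(2024, 1, 1),
      schedule_interval="@daily",  # run once a day
      catchup=False,               # skip backfilling past days
  ) as dag:
      PythonOperator(
          task_id="run_youtube_etl",
          python_callable=run_youtube_etl,
      )
  ```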
- Accessed the Airflow web interface through port 8080 and triggered the DAG.
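  For the interface to be reachable, the Airflow webserver and scheduler must be running on the instance (and port 8080 opened in the EC2 security group). With Airflow 2.x the usual commands are:
  i. airflow webserver -p 8080
  ii. airflow scheduler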
The automated data pipeline successfully scraped the YouTube comments, processed them, and uploaded the results to the S3 bucket.