Game deals analytics is a tool that allows you to download information about video game deals from different online stores.
Once the information has been downloaded, it is transformed according to the business rules and loaded into a Redshift database for later analysis.
The script should extract the data as JSON and parse it into a Python dictionary. The delivery includes the creation of an initial version of the table into which the data will later be loaded.
Create a PySpark job that transforms the data and loads it into a table in Redshift.
Automate the extraction and transformation of data using Airflow.
The API selected for extracting the information is "cheapshark.com"; all the documentation is available here: https://apidocs.cheapshark.com/#b9b738bf-2916-2a13-e40d-d05bccdce2ba
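For reference, a minimal extraction sketch against the deals endpoint described in that documentation could look like the following; the function name, page size, and output handling are illustrative assumptions rather than the project's actual code.

```python
# Illustrative extraction sketch - the real logic lives in scripts/ETL_Game_Deals.py.
import json

import requests

# Deals endpoint as described in the CheapShark documentation linked above.
CHEAPSHARK_DEALS_URL = "https://www.cheapshark.com/api/1.0/deals"


def fetch_deals(page_number=0, page_size=60):
    """Download one page of game deals and return it as a list of Python dictionaries."""
    response = requests.get(
        CHEAPSHARK_DEALS_URL,
        params={"pageNumber": page_number, "pageSize": page_size},
        timeout=30,
    )
    response.raise_for_status()
    # The endpoint answers with a JSON array; .json() parses it into native Python objects.
    return response.json()


if __name__ == "__main__":
    deals = fetch_deals()
    print(f"Downloaded {len(deals)} deals")
    if deals:
        print(json.dumps(deals[0], indent=2))
```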
The files to be deployed are in the project folder; the structure of the folders and files is as follows:

- docker_images/: Contains the Dockerfiles for the Airflow and Spark images.
- docker-compose.yml: Docker Compose file for the Airflow and Spark containers.
- .env: Environment variables file for the Airflow and Spark containers.
- dags/: Contains the DAGs for Airflow.
  - etl_game_deals.py: Main DAG for the ETL process, executed in Airflow to download, transform, and load the data from the API into the Redshift database.
- logs/: Folder with the Airflow logs.
- postgres_data/: Folder with the Postgres data.
- scripts/: Folder with the scripts for the ETL process.
  - postgresql-42.5.2.jar: JAR file for the JDBC connection to Redshift.
  - common.py: Common class for the ETL processes.
  - utils.py: Utility functions for the ETL process.
  - ETL_Game_Deals.py: Script for the ETL process (a simplified sketch follows this list).
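As an illustration of the transform-and-load step handled by ETL_Game_Deals.py, here is a simplified PySpark sketch; the input path, selected columns, and target table name are assumptions, while REDSHIFT_URL and REDSHIFT_SCHEMA come from the .env file described below.

```python
# Simplified PySpark load sketch - illustrative, not the project's actual script.
import os

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl_game_deals").getOrCreate()

# Read the raw JSON extracted from the CheapShark API (the path is an assumption).
raw = spark.read.json("/tmp/raw/game_deals.json")

# Example transformation: keep a few columns and cast prices to numeric types.
deals = (
    raw.select("dealID", "title", "storeID", "salePrice", "normalPrice", "savings")
    .withColumn("salePrice", F.col("salePrice").cast("double"))
    .withColumn("normalPrice", F.col("normalPrice").cast("double"))
    .withColumn("savings", F.col("savings").cast("double"))
)

# Write to Redshift through the PostgreSQL JDBC driver shipped in scripts/.
(
    deals.write.format("jdbc")
    .option("url", os.environ["REDSHIFT_URL"])
    .option("dbtable", f'{os.environ["REDSHIFT_SCHEMA"]}.game_deals')  # table name is an assumption
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save()
)
```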
- Clone the repository.
- Move to the project folder.
- Create a .env file with the following environment variables:
REDSHIFT_HOST=...
REDSHIFT_PORT=5439
REDSHIFT_DB=...
REDSHIFT_USER=...
REDSHIFT_SCHEMA=...
REDSHIFT_PASSWORD=...
REDSHIFT_URL="jdbc:postgresql://${REDSHIFT_HOST}:${REDSHIFT_PORT}/${REDSHIFT_DB}?user=${REDSHIFT_USER}&password=${REDSHIFT_PASSWORD}"
DRIVER_PATH=/tmp/drivers/postgresql-42.5.2.jar
- Run the following command line statement to build the images and start the containers.
docker-compose up -d
- Once the containers are running, open the Airflow web interface at http://localhost:8080/.
- In the Admin -> Connections tab, create a new connection with the following data for Redshift (a snippet showing how a task can read this connection back follows the list):
  - Conn Id: redshift_default
  - Conn Type: Amazon Redshift
  - Host: your Redshift host
  - Database: your Redshift database
  - Schema: your Redshift schema
  - User: your Redshift user
  - Password: your Redshift password
  - Port: 5439
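If you need the same credentials inside a task, they can be read back through the connection id registered above; this is only a hypothetical snippet, not necessarily how scripts/common.py does it.

```python
# Sketch: reading the "redshift_default" connection from inside a DAG task.
from airflow.hooks.base import BaseHook

redshift_conn = BaseHook.get_connection("redshift_default")
# Host, credentials, and port come straight from the connection created above;
# the database/schema values are exposed through the same Connection object.
print(redshift_conn.host, redshift_conn.login, redshift_conn.port)
```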
- In the Admin -> Connections tab, create a new connection with the following data for Spark:
  - Conn Id: spark_default
  - Conn Type: Spark
  - Host: spark://spark
  - Port: 7077
  - Extra: {"queue": "default"}
- In the Admin -> Variables tab, create a new variable with the following data:
  - Key: driver_class_path
  - Value: /tmp/drivers/postgresql-42.5.2.jar
- In the Admin -> Variables tab, create a new variable with the following data:
  - Key: spark_scripts_dir
  - Value: /opt/airflow/scripts
- Execute the etl_game_deals DAG.
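To illustrate how the connections and variables above come together, here is a heavily simplified DAG sketch in the spirit of etl_game_deals; the DAG id, task ids, schedule, and the placeholder extraction callable are assumptions, not the project's actual definitions.

```python
# Simplified sketch of a DAG like etl_game_deals - names and schedule are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


def extract_deals() -> None:
    """Placeholder for the extraction step (see scripts/ETL_Game_Deals.py)."""
    ...


with DAG(
    dag_id="etl_game_deals_sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_deals",
        python_callable=extract_deals,
    )

    transform_and_load = SparkSubmitOperator(
        task_id="transform_and_load",
        conn_id="spark_default",  # Spark connection created above
        application=f'{Variable.get("spark_scripts_dir")}/ETL_Game_Deals.py',
        driver_class_path=Variable.get("driver_class_path"),  # JDBC driver for Redshift
        jars=Variable.get("driver_class_path"),
    )

    extract >> transform_and_load
```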
- Clone the repository.
- Move to the project folder.
- Move to the docker_images folder.
- Run the following command line statements to build the images.
docker build -t airflow:2.1.2 .
docker build -t spark:3.1.2 .
- Update the docker-compose.yml file with the names of the new images.
- Run the following command line statement to start the containers.
docker-compose up -d
The database_scritps folder contains the scripts that create the tables in Redshift into which the information will later be loaded by the transformation scripts.
The database diagram will be as follows:
- Alvaro Garcia - alvarongg