Merge pull request #32 from kevinsunny1996/task/fix_duplicates_and_fi…
…x_ordering

add bridge surrogate keys to fix duplicates
kevinsunny1996 authored Jul 22, 2024
2 parents fce3f9b + 21eff28 commit 5bd19f9
Showing 13 changed files with 70 additions and 21 deletions.
23 changes: 19 additions & 4 deletions README.md
@@ -1,7 +1,7 @@
Overview
========

- The following data pipeline is part of an ongoing Data Analysis project which involves analyzing Gaming Industry data for the last 30 years from `1990-2023`.
+ The following data pipeline is part of an ongoing Data Analysis project which involves analyzing Gaming Industry data for games rated under Metacritic.
In order to showcase data extraction capabilities , data from readymade sources like Kaggle etc are discouraged and dataset is being built from scratch using [RAWG API](https://rawg.io/apidocs).

Why RAWG API?
@@ -10,11 +10,11 @@ Why RAWG API?
- Uses API key , so its easier for people dipping their feet in the world of data to implement extraction logic.
- Has separate links to fetch reddit posts related to a game as well which can be further used for sentiment analysis.

- Using the setup that will be talked about below, we would be fetching close to `250` pages worth of gaming data which is currently present at the time of writing in RAWG API side!!!
+ Using the setup that will be talked about below, we would be fetching close to `400`+ pages worth of gaming data which is currently present at the time of writing in RAWG API side!!!

- There are 5 input files that will be created as part of JSON flattening / normalization to load to respective Bigquery tables resulting in total: `250*5` = `1250` files!!!
+ There are 5 input files that will be created as part of JSON flattening / normalization to load to respective Bigquery tables resulting in total: `400*5` = `2000` files!!!

- Each table will contain on an average `40 rows` of gaming related data so close to `1250*40` = `50000` 50k rows will be loaded onto Bigquery for this project!!!
+ Each table will contain on an average `40 rows` of gaming related data so close to `2000*40` = `80000` 80k rows will be loaded onto Bigquery for this project as an estimate!!!
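
As a quick check, the revised estimate in the updated README lines can be reproduced in a few lines of Python. The variable names are illustrative; the figures (about 400 pages, 5 files per page, roughly 40 rows per file) are the ones quoted in the README text.

```python
# Figures quoted in the updated README; the names are illustrative only.
pages = 400           # approximate number of RAWG API pages
files_per_page = 5    # flattened input files produced per page
rows_per_file = 40    # average rows per file, per the README

total_files = pages * files_per_page
total_rows = total_files * rows_per_file

print(f"{total_files} files, about {total_rows} rows")
```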

Flow Diagram
=============
@@ -84,6 +84,21 @@ This project created using astro cli contains the following parts:
- `remove_extracted_api_parquet_files`: Uses `GCSHook` to list the objects present in GCS and iterates over each item and delete them using GCSHook `delete` method.
- `update_page_number`: Updates the page number airflow variable by 1 so that next run takes into account the next page number results.

+ - #### Transform Section:
+ - These steps are done to make the loaded data ready for use in downstream systems for analytics and machine learning purposes.
+ - The section removes those games that have no Metacritic score and release date as well and the platforms which do not have a release date for the said game.
+ - Additionally, it facilitates creation of dimensional data modelling which are the following:
+ - `fct_games`: Fact table storing factual details regarding a game and related foreign keys to dimension tables.
+ - `bridge_games_genre`: Bridge table to map the Game ID's to the genres it belongs to.
+ - `bridge_games_platforms`: Bridge table to map the Game ID's to the platforms it released.
+ - `bridge_games_publishers`: Bridge table to map the Game ID's to the publishers the game got published under.
+ - `dim_genres`: Dimension table for Genres and stats related to a particular Genre.
+ - `dim_platforms`: Dimension table for Platforms and stats related to a particular Platform.
+ - `dim_publishers`: Dimension table for Publishers and stats related to a particular Publisher.
+ - `dim_ratings`: Dimension table for Ratings and stats related to a particular Rating Category.
+ - `dim_time`: Dimension table for Release Date and stats related to a particular release date for games.
+ - Additionally , the dimension tables are created as views. Bridge and Fact tables are incrementally updated.
+ - Generic tests include checking for not-null and unique column.

- Dockerfile: This file contains a versioned Astro Runtime Docker image that provides a differentiated Airflow experience. If you want to execute other commands or overrides at runtime, specify them here.
- include: This folder contains any additional files that you want to include as part of your project. It is empty by default.
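
The filtering rule described in the Transform Section can be sketched in plain Python. This illustrates the rule only and is not the project's dbt code. The record layout and the helper name `filter_rated_games` are assumptions, while the `'None'` and `'NaT'` sentinel strings mirror the values the models in this commit compare against.

```python
def filter_rated_games(games, platforms):
    """Keep games with a Metacritic score and platform rows with a release date."""
    rated = [g for g in games if g["metacritic_score"] != "None"]
    rated_ids = {g["game_id"] for g in rated}
    kept_platforms = [
        p for p in platforms
        if p["game_id"] in rated_ids and p["released_at"] != "NaT"
    ]
    return rated, kept_platforms


# Hypothetical sample rows.
games = [
    {"game_id": 1, "metacritic_score": "92"},
    {"game_id": 2, "metacritic_score": "None"},
]
platforms = [
    {"game_id": 1, "platform_id": 4, "released_at": "2013-09-17"},
    {"game_id": 1, "platform_id": 7, "released_at": "NaT"},
    {"game_id": 2, "platform_id": 4, "released_at": "2015-01-01"},
]

rated, kept = filter_rated_games(games, platforms)
print(rated)
print(kept)
```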
Original file line number Diff line number Diff line change
@@ -1,14 +1,17 @@
- -- depends_on: {{ ref('stg_games') }}
+ -- depends_on: {{ ref('fct_games') }}

{{config(
- materialized = 'incremental'
+ materialized = 'incremental',
+ unique_key = 'bridge_gg_id',
+ incremental_strategy = 'merge'
)}}

SELECT
+ {{ dbt_utils.generate_surrogate_key(['game_id','genre_id']) }} AS bridge_gg_id,
game_id,
genre_id
FROM {{ ref('stg_genres') }}
{% if is_incremental() %}
WHERE load_date >= (SELECT COALESCE(MAX(load_date), '1900-01-01') FROM {{ this }})
- AND game_id IN (SELECT game_id FROM {{ ref('stg_games') }} WHERE metacritic != 'None')
+ AND game_id IN (SELECT game_id FROM {{ ref('fct_games') }} WHERE metacritic_score != 'None')
{% endif %}
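
The surrogate key added here turns a combination of columns into one deterministic value that the merge can match on. The sketch below is a rough Python analogue, not the macro itself. It assumes that `dbt_utils.generate_surrogate_key` casts each column to a string, substitutes a placeholder for nulls, joins the parts with `-`, and takes an MD5 hash; the placeholder string and the sample IDs are assumptions for illustration.

```python
import hashlib

# Placeholder assumed to stand in for NULL values (illustrative assumption).
NULL_PLACEHOLDER = "_dbt_utils_surrogate_key_null_"


def surrogate_key(*values):
    """Hash a combination of column values into one deterministic key."""
    parts = [NULL_PLACEHOLDER if v is None else str(v) for v in values]
    return hashlib.md5("-".join(parts).encode("utf-8")).hexdigest()


key_a = surrogate_key(3498, 4)  # hypothetical (game_id, genre_id) pair
key_b = surrogate_key(3498, 4)  # the same pair seen again in a later load
key_c = surrogate_key(3498, 5)  # same game, different genre

print(key_a == key_b, key_a == key_c)
```

Because the same pair always hashes to the same key, a repeated pair can be matched against the existing row instead of being inserted a second time.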
Original file line number Diff line number Diff line change
@@ -3,6 +3,12 @@ version: 2
models:
- name: bridge_games_genre
columns:
+ - name: bridge_gg_id
+ tests:
+ - not_null
+ - unique
+ description: Surrogate key to allow only unique entries of combination to be updated in the table.
+
- name: game_id
tests:
- not_null
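
The `not_null` and `unique` tests declared for the surrogate key amount to two simple checks. A rough Python equivalent is sketched below. The helper names `failing_not_null` and `failing_unique` are hypothetical, and each returns the offending rows or values, much as a dbt test passes when its query returns nothing.

```python
from collections import Counter


def failing_not_null(rows, column):
    """Rows where the column is missing or None."""
    return [r for r in rows if r.get(column) is None]


def failing_unique(rows, column):
    """Non-null values that appear more than once in the column."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return sorted(v for v, n in counts.items() if n > 1)


# Hypothetical bridge rows: one duplicated key and one missing key.
rows = [
    {"bridge_gg_id": "a1", "game_id": 1},
    {"bridge_gg_id": "a1", "game_id": 1},
    {"bridge_gg_id": None, "game_id": 2},
    {"bridge_gg_id": "b2", "game_id": 3},
]

null_failures = failing_not_null(rows, "bridge_gg_id")
duplicate_keys = failing_unique(rows, "bridge_gg_id")
print(null_failures, duplicate_keys)
```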
Original file line number Diff line number Diff line change
@@ -1,15 +1,19 @@
- -- depends_on: {{ ref('stg_games') }}
+ -- depends_on: {{ ref('fct_games') }}

{{config(
- materialized = 'incremental'
+ materialized = 'incremental',
+ unique_key = 'bridge_gpl_id',
+ incremental_strategy = 'merge'
)}}

SELECT
+ {{ dbt_utils.generate_surrogate_key(['game_id','platform_id','released_at']) }} AS bridge_gpl_id,
game_id,
platform_id,
released_at
FROM {{ ref('stg_platforms') }}
{% if is_incremental() %}
WHERE load_date >= (SELECT COALESCE(MAX(load_date), '1900-01-01') FROM {{ this }})
- AND game_id IN (SELECT game_id FROM {{ ref('stg_games') }} WHERE metacritic != 'None')
+ AND game_id IN (SELECT game_id FROM {{ ref('fct_games') }} WHERE metacritic_score != 'None')
AND released_at != 'NaT'
{% endif %}
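
Because the incremental filter uses `>=` on `load_date`, rows from the most recent load can be selected again on a later run. The sketch below contrasts the two behaviours this commit switches between: inserting every selected row versus merging on a unique key. It is an illustration only; `append_load`, `merge_load`, and the sample rows are hypothetical.

```python
def append_load(target, batch):
    """Incremental load without a unique key: every selected row is inserted."""
    return target + batch


def merge_load(target, batch, key):
    """Merge on a unique key: matching rows are overwritten, new rows inserted."""
    merged = {row[key]: row for row in target}
    merged.update({row[key]: row for row in batch})
    return list(merged.values())


# The same batch is selected on two consecutive runs.
batch = [
    {"bridge_gpl_id": "k1", "game_id": 1, "platform_id": 4},
    {"bridge_gpl_id": "k2", "game_id": 1, "platform_id": 7},
]

appended = append_load(append_load([], batch), batch)
merged = merge_load(merge_load([], batch, "bridge_gpl_id"), batch, "bridge_gpl_id")
print(len(appended), len(merged))
```

With plain inserts the second run doubles the rows, while the keyed merge leaves one row per surrogate key.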
Original file line number Diff line number Diff line change
@@ -3,6 +3,12 @@ version: 2
models:
- name: bridge_games_platforms
columns:
+ - name: bridge_gpl_id
+ tests:
+ - not_null
+ - unique
+ description: Surrogate key to allow only unique entries of combination to be updated in the table.
+
- name: game_id
tests:
- not_null
Original file line number Diff line number Diff line change
@@ -1,14 +1,17 @@
- -- depends_on: {{ ref('stg_games') }}
+ -- depends_on: {{ ref('fct_games') }}

{{config(
- materialized = 'incremental'
+ materialized = 'incremental',
+ unique_key = 'bridge_gpu_id',
+ incremental_strategy = 'merge'
)}}

SELECT
+ {{ dbt_utils.generate_surrogate_key(['game_id','publisher_id']) }} AS bridge_gpu_id,
game_id,
publisher_id
FROM {{ ref('stg_publishers') }}
{% if is_incremental() %}
WHERE load_date >= (SELECT COALESCE(MAX(load_date), '1900-01-01') FROM {{ this }})
- AND game_id IN (SELECT game_id FROM {{ ref('stg_games') }} WHERE metacritic != 'None')
+ AND game_id IN (SELECT game_id FROM {{ ref('fct_games') }} WHERE metacritic_score != 'None')
{% endif %}
Original file line number Diff line number Diff line change
@@ -3,6 +3,12 @@ version: 2
models:
- name: bridge_games_publishers
columns:
+ - name: bridge_gpu_id
+ tests:
+ - not_null
+ - unique
+ description: Surrogate key to allow only unique entries of combination to be updated in the table.
+
- name: game_id
tests:
- not_null
3 changes: 2 additions & 1 deletion dags/dbt/games_analyzer_rawg_api/models/dim/dim_genres.sql
@@ -1,5 +1,6 @@
--- This model is responsible for creating the genres dimension table.
--- It groups the genres data by genre_id, genre_name, and genre_slug and displays the count of games present for that particular genre.
+ -- depends_on: {{ ref('fct_games') }}

{{config(
materialized = 'view',
@@ -11,6 +12,6 @@ SELECT
genre_name,
COUNT(game_id) AS genre_games_count
FROM {{ ref('stg_genres') }}
- WHERE game_id IN (SELECT game_id FROM {{ ref('stg_games') }} WHERE metacritic != 'None')
+ WHERE game_id IN (SELECT game_id FROM {{ ref('fct_games') }} WHERE metacritic_score != 'None')
GROUP BY 1, 2
ORDER BY 1
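
The aggregation behind `dim_genres` is a filtered count per genre: only rows whose `game_id` belongs to a game with a Metacritic score are counted, grouped by genre. A minimal Python sketch of that logic follows. The helper name `genre_game_counts` and the sample rows are hypothetical.

```python
from collections import Counter


def genre_game_counts(genre_rows, rated_game_ids):
    """Count rated games per (genre_id, genre_name), ordered by genre_id."""
    counts = Counter(
        (r["genre_id"], r["genre_name"])
        for r in genre_rows
        if r["game_id"] in rated_game_ids
    )
    return sorted((gid, name, n) for (gid, name), n in counts.items())


# Hypothetical staging rows; game 3 has no Metacritic score and is skipped.
genre_rows = [
    {"game_id": 1, "genre_id": 4, "genre_name": "Action"},
    {"game_id": 2, "genre_id": 4, "genre_name": "Action"},
    {"game_id": 2, "genre_id": 5, "genre_name": "RPG"},
    {"game_id": 3, "genre_id": 5, "genre_name": "RPG"},
]

result = genre_game_counts(genre_rows, rated_game_ids={1, 2})
print(result)
```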
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
--- This model is responsible for creating the dim_platforms table
--- It groups the platforms data by platform_id, platform_name, and platform_slug and displays the count of games present for that particular platform.
--- The release date of a game for the specific platform will be appended to the bridge table to provide more context.
+ -- depends_on: {{ ref('fct_games') }}
+
{{config(
materialized = 'view',
unique_key = 'id'
@@ -11,6 +13,6 @@ SELECT
platform_name,
COUNT(game_id) AS platform_games_count
FROM {{ ref('stg_platforms') }}
- WHERE game_id IN (SELECT game_id FROM {{ ref('stg_games') }} WHERE metacritic != 'None')
+ WHERE game_id IN (SELECT game_id FROM {{ ref('fct_games') }} WHERE metacritic_score != 'None')
GROUP BY 1, 2
ORDER BY 1
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
--- This model creates dimension table for publishers data.
--- It groups the publishers data by publisher_id, publisher_name, and publisher_slug and displays the count of games present for that particular publisher.
+ -- depends_on: {{ ref('fct_games') }}

{{config(
materialized = 'view',
@@ -11,6 +12,6 @@ SELECT
publisher_name,
COUNT(game_id) AS publishers_game_count
FROM {{ ref('stg_publishers') }}
- WHERE game_id IN (SELECT game_id FROM {{ ref('stg_games') }} WHERE metacritic != 'None')
+ WHERE game_id IN (SELECT game_id FROM {{ ref('fct_games') }} WHERE metacritic_score != 'None')
ORDER BY 1
8 changes: 5 additions & 3 deletions dags/dbt/games_analyzer_rawg_api/models/dim/dim_ratings.sql
@@ -1,6 +1,8 @@
-- models/dim/dim_ratings.sql

-- Define the model using the raw ratings data source
+ -- depends_on: {{ ref('fct_games') }}
+
{{ config(
materialized='view'
) }}
@@ -9,8 +11,8 @@ SELECT
id AS rating_id,
title AS rating_category,
COUNT(ratings_raw.game_id) AS rating_count
- FROM {{ source('rawg_api_raw', 'ratings') }} ratings_raw LEFT JOIN {{ ref('stg_games') }} games_eph
- ON ratings_raw.game_id=games_eph.game_id AND ratings_raw.id=games_eph.rating_top
- WHERE games_eph.metacritic != 'None'
+ FROM {{ source('rawg_api_raw', 'ratings') }} ratings_raw LEFT JOIN {{ ref('fct_games') }} games_fact
+ ON ratings_raw.game_id=games_fact.game_id AND ratings_raw.id=games_fact.rating_id
+ WHERE games_fact.metacritic_score != 'None'
GROUP BY 1, 2
ORDER BY 1
Original file line number Diff line number Diff line change
@@ -19,5 +19,5 @@ models:
description: Number of ratings given by users.

description: |
- This model represents the dimension table for Ratings which include rating id , name , count and percent of games falling in that category.
+ This model represents the dimension table for Ratings which include rating id , name , count of games falling in that category.
2 changes: 1 addition & 1 deletion dags/dbt/games_analyzer_rawg_api/models/fct/fct_games.yml
@@ -38,4 +38,4 @@ models:
description: Metacritic Category based on the metacritic score given to the game.

description: |
- Fact table for data extracted and transformed fdrom RAWG API contains quantitative data about games such as ratings, playtime, and metacritic score.
+ Fact table for data extracted and transformed from RAWG API contains quantitative data about games such as ratings, playtime, and metacritic score.
