Merge pull request #32 from kevinsunny1996/task/fix_duplicates_and_fi…

…x_ordering add bridge surrogate keys to fix duplicates
kevinsunny1996 · Jul 22, 2024 · 5bd19f9
2 parents fce3f9b + 21eff28
commit 5bd19f9

Showing 13 changed files with 70 additions and 21 deletions.
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 Overview
 ========
 
-The following data pipeline is part of an ongoing Data Analysis project which involves analyzing Gaming Industry data for the last 30 years from `1990-2023`.
+The following data pipeline is part of an ongoing Data Analysis project which involves analyzing Gaming Industry data for games rated under Metacritic.
 In order to showcase data extraction capabilities , data from readymade sources like Kaggle etc are discouraged and dataset is being built from scratch using [RAWG API](https://rawg.io/apidocs).
 
 Why RAWG API?
@@ -10,11 +10,11 @@ Why RAWG API?
 - Uses API key , so its easier for people dipping their feet in the world of data to implement extraction logic.
 - Has separate links to fetch reddit posts related to a game as well which can be further used for sentiment analysis.
 
-Using the setup that will be talked about below, we would be fetching close to `250` pages worth of gaming data which is currently present at the time of writing in RAWG API side!!!
+Using the setup that will be talked about below, we would be fetching close to `400`+ pages worth of gaming data which is currently present at the time of writing in RAWG API side!!!
 
-There are 5 input files that will be created as part of JSON flattening / normalization to load to respective Bigquery tables resulting in total: `250*5` = `1250` files!!!
+There are 5 input files that will be created as part of JSON flattening / normalization to load to respective Bigquery tables resulting in total: `400*5` = `2000` files!!!
 
-Each table will contain on an average `40 rows` of gaming related data so close to `1250*40` = `50000` 50k rows will be loaded onto Bigquery for this project!!! 
+Each table will contain on an average `40 rows` of gaming related data so close to `2000*40` = `80000` 80k rows will be loaded onto Bigquery for this project as an estimate!!! 
 
 Flow Diagram
 =============
@@ -84,6 +84,21 @@ This project created using astro cli contains the following parts:
                     - `remove_extracted_api_parquet_files`: Uses `GCSHook` to list the objects present in GCS and iterates over each item and delete them using GCSHook `delete` method.
                     - `update_page_number`: Updates the page number airflow variable by 1 so that next run takes into account the next page number results.
 
+            - #### Transform Section:
+                - These steps are done to make the loaded data ready for use in downstream systems for analytics and machine learning purposes.
+                - The section removes those games that have no Metacritic score and release date as well and the platforms which do not have a release date for the said game.
+                - Additionally, it facilitates creation of dimensional data modelling which are the following:
+                    - `fct_games`: Fact table storing factual details regarding a game and related foreign keys to dimension tables.
+                    - `bridge_games_genre`: Bridge table to map the Game ID's to the genres it belongs to.
+                    - `bridge_games_platforms`: Bridge table to map the Game ID's to the platforms it released.
+                    - `bridge_games_publishers`: Bridge table to map the Game ID's to the publishers the game got published under.
+                    - `dim_genres`: Dimension table for Genres and stats related to a particular Genre.
+                    - `dim_platforms`: Dimension table for Platforms and stats related to a particular Platform.
+                    - `dim_publishers`: Dimension table for Publishers and stats related to a particular Publisher.
+                    - `dim_ratings`: Dimension table for Ratings and stats related to a particular Rating Category.
+                    - `dim_time`: Dimension table for Release Date and stats related to a particular release date for games.
+                - Additionally , the dimension tables are created as views. Bridge and Fact tables are incrementally updated.
+                - Generic tests include checking for not-null and unique column.
 
 - Dockerfile: This file contains a versioned Astro Runtime Docker image that provides a differentiated Airflow experience. If you want to execute other commands or overrides at runtime, specify them here.
 - include: This folder contains any additional files that you want to include as part of your project. It is empty by default.

diff --git a/dags/dbt/games_analyzer_rawg_api/models/bridge/bridge_games_genre.sql b/dags/dbt/games_analyzer_rawg_api/models/bridge/bridge_games_genre.sql
@@ -1,14 +1,17 @@
--- depends_on: {{ ref('stg_games') }}
+-- depends_on: {{ ref('fct_games') }}
 
 {{config(
-    materialized = 'incremental'
+    materialized = 'incremental',
+    unique_key = 'bridge_gg_id',
+    incremental_strategy = 'merge'
 )}}
 
 SELECT
+    {{ dbt_utils.generate_surrogate_key(['game_id','genre_id']) }} AS bridge_gg_id,
     game_id,
     genre_id
 FROM {{ ref('stg_genres') }}
 {% if is_incremental() %}
     WHERE load_date >= (SELECT COALESCE(MAX(load_date), '1900-01-01') FROM {{ this }}) 
-    AND game_id IN (SELECT game_id FROM {{ ref('stg_games') }} WHERE metacritic != 'None')     
+    AND game_id IN (SELECT game_id FROM {{ ref('fct_games') }} WHERE metacritic_score != 'None')     
 {% endif %}
diff --git a/dags/dbt/games_analyzer_rawg_api/models/bridge/bridge_games_genre.yml b/dags/dbt/games_analyzer_rawg_api/models/bridge/bridge_games_genre.yml
@@ -3,6 +3,12 @@ version: 2
 models:
   - name: bridge_games_genre
     columns:
+      - name: bridge_gg_id
+        tests:
+          - not_null
+          - unique
+        description: Surrogate key to allow only unique entries of combination to be updated in the table.
+
       - name: game_id
         tests:
           - not_null

diff --git a/dags/dbt/games_analyzer_rawg_api/models/bridge/bridge_games_platforms.sql b/dags/dbt/games_analyzer_rawg_api/models/bridge/bridge_games_platforms.sql
@@ -1,15 +1,19 @@
--- depends_on: {{ ref('stg_games') }}
+-- depends_on: {{ ref('fct_games') }}
 
 {{config(
-    materialized = 'incremental'
+    materialized = 'incremental',
+    unique_key = 'bridge_gpl_id',
+    incremental_strategy = 'merge'
 )}}
 
 SELECT
+    {{ dbt_utils.generate_surrogate_key(['game_id','platform_id','released_at']) }} AS bridge_gpl_id,
     game_id,
     platform_id,
     released_at
 FROM {{ ref('stg_platforms') }}
 {% if is_incremental() %}
     WHERE load_date >= (SELECT COALESCE(MAX(load_date), '1900-01-01') FROM {{ this }}) 
-    AND game_id IN (SELECT game_id FROM {{ ref('stg_games') }} WHERE metacritic != 'None')
+    AND game_id IN (SELECT game_id FROM {{ ref('fct_games') }} WHERE metacritic_score != 'None')
+    AND released_at != 'NaT'
 {% endif %}
diff --git a/dags/dbt/games_analyzer_rawg_api/models/bridge/bridge_games_platforms.yml b/dags/dbt/games_analyzer_rawg_api/models/bridge/bridge_games_platforms.yml
@@ -3,6 +3,12 @@ version: 2
 models:
   - name: bridge_games_platforms
     columns:
+      - name: bridge_gpl_id
+        tests:
+          - not_null
+          - unique
+        description: Surrogate key to allow only unique entries of combination to be updated in the table.
+
       - name: game_id
         tests:
           - not_null

diff --git a/dags/dbt/games_analyzer_rawg_api/models/bridge/bridge_games_publishers.sql b/dags/dbt/games_analyzer_rawg_api/models/bridge/bridge_games_publishers.sql
@@ -1,14 +1,17 @@
--- depends_on: {{ ref('stg_games') }}
+-- depends_on: {{ ref('fct_games') }}
 
 {{config(
-    materialized = 'incremental'
+    materialized = 'incremental',
+    unique_key = 'bridge_gpu_id',
+    incremental_strategy = 'merge'
 )}}
 
 SELECT
+    {{ dbt_utils.generate_surrogate_key(['game_id','publisher_id']) }} AS bridge_gpu_id,
     game_id,
     publisher_id
 FROM {{ ref('stg_publishers') }}
 {% if is_incremental() %}
     WHERE load_date >= (SELECT COALESCE(MAX(load_date), '1900-01-01') FROM {{ this }}) 
-    AND game_id IN (SELECT game_id FROM {{ ref('stg_games') }} WHERE metacritic != 'None')
+    AND game_id IN (SELECT game_id FROM {{ ref('fct_games') }} WHERE metacritic_score != 'None')
 {% endif %}
diff --git a/dags/dbt/games_analyzer_rawg_api/models/bridge/bridge_games_publishers.yml b/dags/dbt/games_analyzer_rawg_api/models/bridge/bridge_games_publishers.yml
@@ -3,6 +3,12 @@ version: 2
 models:
   - name: bridge_games_publishers
     columns:
+      - name: bridge_gpu_id
+        tests:
+          - not_null
+          - unique
+        description: Surrogate key to allow only unique entries of combination to be updated in the table.
+
       - name: game_id
         tests:
           - not_null

diff --git a/dags/dbt/games_analyzer_rawg_api/models/dim/dim_genres.sql b/dags/dbt/games_analyzer_rawg_api/models/dim/dim_genres.sql
@@ -1,5 +1,6 @@
 --- This model is responsible for creating the genres dimension table.
 --- It groups the genres data by genre_id, genre_name, and genre_slug and displays the count of games present for that particular genre.
+-- depends_on: {{ ref('fct_games') }}
 
 {{config(
     materialized = 'view',
@@ -11,6 +12,6 @@ SELECT
     genre_name,
     COUNT(game_id) AS genre_games_count
 FROM {{ ref('stg_genres') }}
-WHERE game_id IN (SELECT game_id FROM {{ ref('stg_games') }} WHERE metacritic != 'None')
+WHERE game_id IN (SELECT game_id FROM {{ ref('fct_games') }} WHERE metacritic_score != 'None')
 GROUP BY 1, 2
 ORDER BY 1
diff --git a/dags/dbt/games_analyzer_rawg_api/models/dim/dim_platforms.sql b/dags/dbt/games_analyzer_rawg_api/models/dim/dim_platforms.sql
@@ -1,6 +1,8 @@
 --- This model is responsible for creating the dim_platforms table
 --- It groups the platforms data by platform_id, platform_name, and platform_slug and displays the count of games present for that particular platform.
 --- The release date of a game for the specific platform will be appended to the bridge table to provide more context.
+-- depends_on: {{ ref('fct_games') }}
+
 {{config(
     materialized = 'view',
     unique_key = 'id'
@@ -11,6 +13,6 @@ SELECT
     platform_name,
     COUNT(game_id) AS platform_games_count
 FROM {{ ref('stg_platforms') }}
-WHERE game_id IN (SELECT game_id FROM {{ ref('stg_games') }} WHERE metacritic != 'None')
+WHERE game_id IN (SELECT game_id FROM {{ ref('fct_games') }} WHERE metacritic_score != 'None')
 GROUP BY 1, 2
 ORDER BY 1
diff --git a/dags/dbt/games_analyzer_rawg_api/models/dim/dim_publishers.sql b/dags/dbt/games_analyzer_rawg_api/models/dim/dim_publishers.sql
@@ -1,5 +1,6 @@
 --- This model creates dimension table for publishers data.
 --- It groups the publishers data by publisher_id, publisher_name, and publisher_slug and displays the count of games present for that particular publisher.
+-- depends_on: {{ ref('fct_games') }}
 
 {{config(
     materialized = 'view',
@@ -11,6 +12,6 @@ SELECT
     publisher_name,
     COUNT(game_id) AS publishers_game_count
 FROM {{ ref('stg_publishers') }}
-WHERE game_id IN (SELECT game_id FROM {{ ref('stg_games') }} WHERE metacritic != 'None')
+WHERE game_id IN (SELECT game_id FROM {{ ref('fct_games') }} WHERE metacritic_score != 'None')
 GROUP BY 1, 2
 ORDER BY 1
diff --git a/dags/dbt/games_analyzer_rawg_api/models/dim/dim_ratings.sql b/dags/dbt/games_analyzer_rawg_api/models/dim/dim_ratings.sql
@@ -1,6 +1,8 @@
 -- models/dim/dim_ratings.sql
 
 -- Define the model using the raw ratings data source
+-- depends_on: {{ ref('fct_games') }}
+
 {{ config(
     materialized='view'
 ) }}
@@ -9,8 +11,8 @@ SELECT
     id AS rating_id,
     title AS rating_category,
     COUNT(ratings_raw.game_id) AS rating_count
-FROM {{ source('rawg_api_raw', 'ratings') }} ratings_raw LEFT JOIN {{ ref('stg_games') }} games_eph
-ON ratings_raw.game_id=games_eph.game_id AND ratings_raw.id=games_eph.rating_top
-WHERE games_eph.metacritic != 'None'
+FROM {{ source('rawg_api_raw', 'ratings') }} ratings_raw LEFT JOIN {{ ref('fct_games') }} games_fact
+ON ratings_raw.game_id=games_fact.game_id AND ratings_raw.id=games_fact.rating_id
+WHERE games_fact.metacritic_score != 'None'
 GROUP BY 1, 2
 ORDER BY 1
diff --git a/dags/dbt/games_analyzer_rawg_api/models/dim/dim_ratings.yml b/dags/dbt/games_analyzer_rawg_api/models/dim/dim_ratings.yml
@@ -19,5 +19,5 @@ models:
         description: Number of ratings given by users.
 
     description: |
-      This model represents the dimension table for Ratings which include rating id , name , count and percent of games falling in that category.
+      This model represents the dimension table for Ratings which include rating id , name , count of games falling in that category.
       
diff --git a/dags/dbt/games_analyzer_rawg_api/models/fct/fct_games.yml b/dags/dbt/games_analyzer_rawg_api/models/fct/fct_games.yml
@@ -38,4 +38,4 @@ models:
         description: Metacritic Category based on the metacritic score given to the game.
 
     description: |
-      Fact table for data extracted and transformed fdrom RAWG API contains quantitative data about games such as ratings, playtime, and metacritic score.
+      Fact table for data extracted and transformed from RAWG API contains quantitative data about games such as ratings, playtime, and metacritic score.
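
Note on the bridge changes above: each bridge model now builds a surrogate key from its natural-key columns and passes it as `unique_key` with `incremental_strategy = 'merge'`, so a re-loaded combination updates the existing row instead of appending a duplicate. The sketch below illustrates that idea in plain Python; it is not the dbt implementation. It assumes `dbt_utils.generate_surrogate_key` behaves as its documentation describes (MD5 over the string-cast columns joined with `-`, with a placeholder for nulls). The helper names `surrogate_key` and `merge_rows`, the `bridge_id` field, and the sample rows are hypothetical.

```python
import hashlib

# Assumption: placeholder dbt_utils documents for NULL inputs.
NULL_PLACEHOLDER = "_dbt_utils_surrogate_key_null_"


def surrogate_key(*values):
    """Hash the key columns into one deterministic id (hypothetical helper)."""
    parts = [NULL_PLACEHOLDER if v is None else str(v) for v in values]
    return hashlib.md5("-".join(parts).encode("utf-8")).hexdigest()


def merge_rows(target, incoming, key_fields):
    """Upsert incoming rows into target, a dict keyed by surrogate key.

    This mimics a merge on unique_key: a matching key overwrites the
    existing row, a new key inserts a row.
    """
    for row in incoming:
        key = surrogate_key(*(row[f] for f in key_fields))
        target[key] = {**row, "bridge_id": key}
    return target


# Hypothetical sample loads; (game_id=1, genre_id=4) arrives twice.
first_load = [{"game_id": 1, "genre_id": 4}, {"game_id": 1, "genre_id": 5}]
second_load = [{"game_id": 1, "genre_id": 4}, {"game_id": 2, "genre_id": 4}]

table = merge_rows({}, first_load, ["game_id", "genre_id"])
table = merge_rows(table, second_load, ["game_id", "genre_id"])

print(len(table))  # 3 rows: the repeated pair is merged, not appended
```

Without the key, the second load would append the repeated pair, which is the duplicate problem this PR targets. The `unique` test added on each surrogate key column in the `.yml` files checks the same property after every run.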