
Data Engineer Workshops

🔖 Workshops from the Road to Data Engineer (R2DE) course by DataTH.com — Data Science ชิลชิล

Note: picture from data.th

🖇️ Week 1: Data Collection with Python

  • MySQL Database connection using PyMySQL
  • REST API data collection using the Requests package
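
The two collection steps above can be sketched as follows. This is a minimal illustration, not the course's actual code: the host, credentials, table name, and API URL are all placeholders, and the third-party packages are imported lazily so the sketch reads on its own.

```python
# Week 1 sketch: pull a table with PyMySQL and supplementary data via a
# REST API with Requests. All connection details below are hypothetical.

def rows_to_dicts(columns, rows):
    """Pair column names with row tuples so downstream code gets dicts."""
    return [dict(zip(columns, row)) for row in rows]

def fetch_orders_from_mysql(host, user, password, db):
    """Read a table with PyMySQL; DictCursor returns rows as dicts."""
    import pymysql  # third-party: pip install pymysql
    conn = pymysql.connect(host=host, user=user, password=password,
                           db=db, charset="utf8mb4",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM orders")  # hypothetical table
            return cur.fetchall()
    finally:
        conn.close()

def fetch_conversion_rates(url):
    """Collect supplementary data from a REST API with Requests."""
    import requests  # third-party: pip install requests
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    orders = fetch_orders_from_mysql("db.example.com", "user", "pw", "shop")
    rates = fetch_conversion_rates("https://api.example.com/rates")
```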

🖇️ Week 2: Data Cleansing with Spark

  • Data Profiling
  • Exploratory Data Analysis (EDA)
  • Data Anomalies Check – syntactic, semantic, missing values, outliers
  • Data Cleansing using Spark SQL
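
A compressed sketch of that cleansing flow is below. The column names (`country`, `price`) and file paths are hypothetical, and PySpark is imported only inside the pipeline function; the semantic-anomaly fix is a plain Python function registered as a Spark SQL UDF.

```python
# Week 2 sketch: profile, then cleanse with Spark SQL. Column names and
# paths are placeholders, not the course's actual dataset.

def normalize_country(raw):
    """Fix a semantic anomaly: map spelling variants to one canonical value."""
    aliases = {"thailand": "Thailand", "th": "Thailand", "tha": "Thailand"}
    key = (raw or "").strip().lower()
    return aliases.get(key, (raw or "").strip())

def run_cleansing(input_path, output_path):
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("cleansing").getOrCreate()
    df = spark.read.csv(input_path, header=True, inferSchema=True)

    # Data profiling / EDA: schema and summary statistics
    df.printSchema()
    df.describe().show()

    # Register the semantic fix as a UDF, then cleanse with Spark SQL
    spark.udf.register("normalize_country", normalize_country, StringType())
    df.createOrReplaceTempView("sales")
    cleaned = spark.sql("""
        SELECT normalize_country(country) AS country,
               CAST(price AS DOUBLE)      AS price
        FROM sales
        WHERE price IS NOT NULL           -- drop missing values
          AND CAST(price AS DOUBLE) > 0   -- drop syntactic anomalies/outliers
    """)
    cleaned.write.mode("overwrite").parquet(output_path)
```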

🖇️ Week 3: Data Lake, Cloud Computing with Google Cloud Platform (GCP)

  • Bash command line in Cloud Shell
  • Create a Bucket and upload data into Google Cloud Storage (the Data Lake), using the gsutil command in Cloud Shell or Python code via the Cloud Storage Python SDK
  • Storage Object Lifecycle
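
The SDK upload and a lifecycle rule can be sketched like this; the bucket and file names are placeholders, and the helper that builds the lifecycle rule dict is an illustration of the rule shape, not an official API.

```python
# Week 3 sketch: upload to the GCS data lake and describe an Object
# Lifecycle rule. Bucket and file names below are hypothetical.

def lifecycle_rule(days, storage_class=None):
    """Build one Object Lifecycle rule: after `days` days, either delete
    the object or move it to a cheaper class (e.g. COLDLINE)."""
    if storage_class:
        action = {"type": "SetStorageClass", "storageClass": storage_class}
    else:
        action = {"type": "Delete"}
    return {"action": action, "condition": {"age": days}}

def upload_to_data_lake(bucket_name, local_path, blob_name):
    """Upload a local file with the google-cloud-storage Python SDK;
    equivalent to: gsutil cp <local_path> gs://<bucket>/<blob_name>"""
    from google.cloud import storage  # pip install google-cloud-storage
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(blob_name).upload_from_filename(local_path)
```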

🖇️ Week 4: (Automated) Data Pipeline Orchestration using Apache Airflow and DAGs (ETL)


  • Create a Google Cloud Composer Cluster for running Apache Airflow
  • Create an Airflow DAG definition file and instantiate the DAG
  • Build tasks with BashOperator and PythonOperator
  • Set up task dependencies
  • Group tasks with TaskGroup inside the DAG
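
Those pieces fit together in one DAG file roughly as below. Task IDs, the schedule, and the transform logic are hypothetical; the Airflow imports are wrapped in try/except only so the sketch can be read outside a Cloud Composer environment.

```python
# Week 4 sketch: a DAG with BashOperator, PythonOperator, a TaskGroup,
# and dependencies. All task names and commands are placeholders.

def transform(**_):
    """python_callable for the PythonOperator (placeholder logic)."""
    return "transformed"

try:
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator
    from airflow.utils.task_group import TaskGroup

    with DAG(dag_id="etl_pipeline",              # DAG instantiation
             start_date=datetime(2024, 1, 1),
             schedule_interval="@daily",
             catchup=False) as dag:

        extract = BashOperator(task_id="extract",
                               bash_command="echo extracting")

        with TaskGroup(group_id="transforms") as transforms:
            t1 = PythonOperator(task_id="clean", python_callable=transform)
            t2 = PythonOperator(task_id="enrich", python_callable=transform)
            t1 >> t2                             # dependency inside the group

        load = BashOperator(task_id="load", bash_command="echo loading")

        extract >> transforms >> load            # top-level dependencies
except ImportError:
    pass  # Airflow not installed; the DAG above is illustrative only
```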

🖇️ Week 5: (Serverless) Data Warehouse with BigQuery

  • Normalization vs. Denormalization (trade-off between storage and the time spent joining tables)
  • Denormalization for Data Warehouses (optimizes performance for frequent querying)
  • Normalization for Databases
  • Columnar Storage / Column-oriented Databases
  • Data Marts for different business units, which together form the Data Warehouse
  • View vs. Materialized View
  • Index & Partitioning
  • Automated data import through BashOperator -> bq load command
  • Automated data import through an Airflow operator -> GCSToBigQueryOperator
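
The normalization-vs-denormalization trade-off at the top of this list can be shown with a toy example, using plain Python dicts as stand-ins for BigQuery tables (the customer and order data are made up):

```python
# Week 5 sketch: pre-joining a fact table with a dimension table, the way
# a denormalized warehouse table repeats dimension columns on every row.

customers = {1: {"name": "Ann", "country": "TH"}}            # normalized dimension
orders = [{"order_id": 10, "customer_id": 1, "total": 9.5}]  # fact table

def denormalize(orders, customers):
    """Join once at load time: later queries skip the join (faster reads),
    at the cost of repeating customer columns on every order row (more storage)."""
    return [{**o, **customers[o["customer_id"]]} for o in orders]

wide = denormalize(orders, customers)
# wide[0] == {"order_id": 10, "customer_id": 1, "total": 9.5,
#             "name": "Ann", "country": "TH"}
```

Keeping the source database normalized while denormalizing the warehouse copy is exactly the split the two bullets above describe.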