thitirat-mnc/Data-Engineer-Workshop-R2DE

Data Engineer Workshops

🔖 Workshops from the Road to Data Engineer (R2DE) course by DataTH.com — Data Science ชิลชิล

> **Note:** picture from data.th

🖇️ Week 1: Data Collection with Python

  • MySQL database connection using PyMySQL
  • REST API data collection using the Requests package
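The two collection steps above can be sketched as follows. The table name `orders`, the credentials, and the API URL are hypothetical placeholders, not the course's actual dataset:

```python
def fetch_orders(host, user, password, database):
    """Pull rows from a MySQL table into a list of dicts via PyMySQL.

    The 'orders' table and connection details are illustrative only.
    """
    import pymysql  # lazy import so the sketch loads without the driver installed
    conn = pymysql.connect(host=host, user=user, password=password,
                           database=database,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM orders")
            return cur.fetchall()
    finally:
        conn.close()

def fetch_json(url):
    """Collect supplementary data from a REST API endpoint."""
    import requests  # lazy import, same reason as above
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()  # fail loudly on HTTP errors
    return resp.json()
```

Returning dict rows (via `DictCursor`) keeps the database output in the same shape as the API's JSON, which simplifies merging the two sources later.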

🖇️ Week 2: Data Cleansing with Spark

  • Data Profiling
  • Exploratory Data Analysis (EDA)
  • Data Anomalies Check – syntactic, semantic, missing values, outliers
  • Data Cleansing using Spark SQL
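The cleansing steps above can be sketched as a single Spark SQL query; the `orders` view and its columns below are hypothetical, chosen only to illustrate each anomaly type:

```python
# Spark SQL cleansing sketch; the 'orders' view and its columns are hypothetical.
CLEANSING_SQL = """
SELECT
    order_id,
    TRIM(LOWER(country))  AS country,   -- syntactic fix: stray case and whitespace
    CAST(price AS DOUBLE) AS price,
    COALESCE(quantity, 0) AS quantity   -- missing values: default to 0
FROM orders
WHERE price IS NOT NULL                 -- drop rows missing a price
  AND price BETWEEN 0 AND 10000        -- semantic check / crude outlier filter
"""

def clean_orders(spark):
    """Run the query against a temp view named 'orders'; returns a DataFrame."""
    return spark.sql(CLEANSING_SQL)

# Typical use inside a PySpark session:
#   df.createOrReplaceTempView("orders")
#   cleaned = clean_orders(spark)
```

Keeping the query in one string makes the profiling-to-cleansing rules easy to review and version alongside the pipeline code.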

🖇️ Week 3: Data Lake, Cloud Computing with Google Cloud Platform (GCP)

  • Bash command line in Cloud Shell
  • Create a bucket and upload data into Google Cloud Storage (the Data Lake), using the gsutil command in Cloud Shell and Python code via the Cloud Storage client library
  • Storage Object Lifecycle
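Both upload paths above can be sketched side by side. The bucket and file names are hypothetical, and the SDK variant assumes the `google-cloud-storage` package plus application credentials:

```python
def gsutil_cp_command(local_path, bucket, dest):
    """Build the equivalent Cloud Shell command for the same upload."""
    return f"gsutil cp {local_path} gs://{bucket}/{dest}"

def upload_to_data_lake(local_path, bucket_name, dest_blob):
    """Upload a local file to a GCS bucket via the Python SDK.

    Requires google-cloud-storage and valid application credentials.
    """
    from google.cloud import storage  # lazy import: sketch loads without the SDK
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(dest_blob)
    blob.upload_from_filename(local_path)
    return f"gs://{bucket_name}/{dest_blob}"

# e.g. gsutil_cp_command("data.csv", "my-bucket", "raw/data.csv")
#      -> "gsutil cp data.csv gs://my-bucket/raw/data.csv"
```

The command-building helper is handy when the upload is scripted from Airflow's BashOperator later in the course.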

🖇️ Week 4: (Automated) Data Pipeline Orchestration using Apache Airflow and DAGs (ETL)

> **Note:** picture from data.th

  • Create a Google Cloud Composer cluster for running Apache Airflow
  • Create an Airflow DAG definition file and instantiate the DAG
  • Build tasks with BashOperator and PythonOperator
  • Set up task dependencies
  • Group related tasks with TaskGroup
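The DAG-building steps above can be sketched as one minimal definition file. All IDs and commands are illustrative, apache-airflow must be installed for `build_dag()` to run, and the schedule argument's name varies across Airflow versions:

```python
from datetime import datetime

def transform():
    """Placeholder body for the PythonOperator task."""
    print("transform step")

def build_dag():
    """Assemble a minimal DAG with a TaskGroup and explicit dependencies."""
    from airflow import DAG  # lazy imports: sketch loads without Airflow installed
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator
    from airflow.utils.task_group import TaskGroup

    with DAG(dag_id="r2de_pipeline",
             start_date=datetime(2024, 1, 1),
             schedule_interval="@daily",  # named 'schedule' in newer Airflow
             catchup=False) as dag:
        with TaskGroup(group_id="extract"):  # groups the two collection tasks
            BashOperator(task_id="get_db", bash_command="echo extract db")
            BashOperator(task_id="get_api", bash_command="echo extract api")
        clean = PythonOperator(task_id="clean", python_callable=transform)
        dag.task_group.get_child_by_label("extract") >> clean  # extract before clean
    return dag
```

On Cloud Composer, a file like this is deployed by copying it into the environment's `dags/` bucket folder.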
> **Note:** picture from data.th

🖇️ Week 5: (Serverless) Data Warehouse with BigQuery

  • Normalization vs. Denormalization (trade-off between storage cost and join time)
  • Denormalization for Data Warehouses (optimizes performance for frequent querying)
  • Normalization for transactional Databases
  • Columnar Storage / Column-oriented Databases
  • Data Marts for different business units, which together form the Data Warehouse
  • View vs. Materialized View
  • Indexing & Partitioning
  • Automated data importing through BashOperator running the bq load command
  • Automated data importing through GCSToBigQueryOperator on Airflow
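The two import routes above can be sketched together. The dataset, table, and bucket names are hypothetical, and the operator variant assumes the Airflow Google provider package is installed:

```python
# bq load variant, suitable for a BashOperator's bash_command.
# Dataset, table, and bucket names below are illustrative placeholders.
BQ_LOAD_CMD = (
    "bq load --source_format=CSV --autodetect "
    "my_dataset.orders gs://my-bucket/cleaned/orders.csv"
)

def build_load_task():
    """GCSToBigQueryOperator variant of the same load step."""
    # Lazy import: sketch loads without apache-airflow-providers-google installed.
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
        GCSToBigQueryOperator,
    )
    return GCSToBigQueryOperator(
        task_id="gcs_to_bq",
        bucket="my-bucket",
        source_objects=["cleaned/orders.csv"],
        destination_project_dataset_table="my_dataset.orders",
        source_format="CSV",
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",  # replace the table on each run
    )
```

The dedicated operator is generally preferable to shelling out to `bq`, since it surfaces load errors directly in the task logs and needs no CLI on the worker.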

About

An 8-week workshop covering database connectivity, data collection, Spark-powered data cleansing, Data Lakes and Cloud Computing, automated data pipelines with Apache Airflow, Data Warehouses, and Data Visualization.
