🔖 Workshops from the Road to Data Engineer (R2DE) course by DataTH.com — Data Science, the chill way
(picture from DataTH.com)
- MySQL database connection using PyMySQL
- REST API data collection using the Requests package
- Data Profiling
- Exploratory Data Analysis (EDA)
- Data Anomalies Check – syntactic, semantic, missing values, outliers
- Data Cleansing using Spark SQL
- Bash command line in Cloud Shell
- Create a bucket and upload data into Google Cloud Storage (Data Lake), using the gsutil command through Cloud Shell and Python code through the Cloud Storage Python SDK
- Storage Object Lifecycle
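The database step above can be sketched with PyMySQL and pandas. The host, credentials, and table name below are illustrative placeholders, not values from the course; `demo()` shows the wiring but is not executed here.

```python
import pandas as pd


def fetch_table(connection, table_name: str) -> pd.DataFrame:
    """Read a whole table into a DataFrame via any DB-API connection."""
    return pd.read_sql(f"SELECT * FROM {table_name}", connection)


def demo():
    """Example wiring (not executed here): connect with PyMySQL and pull a table."""
    import pymysql  # pip install pymysql

    # Hypothetical connection settings -- replace with your own MySQL host/credentials.
    connection = pymysql.connect(
        host="localhost",
        user="r2de",
        password="secret",
        database="shop",
        charset="utf8mb4",
        cursorclass=pymysql.cursors.DictCursor,
    )
    try:
        return fetch_table(connection, "transactions")
    finally:
        connection.close()
```

Because `fetch_table` takes any DB-API connection, the same helper also works against SQLite for local testing.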
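REST API collection with Requests might look like the sketch below; the endpoint URL is a made-up placeholder, and the network call is wrapped in a function rather than run here.

```python
import pandas as pd


def records_to_frame(records) -> pd.DataFrame:
    """Turn a JSON array of objects into a tidy DataFrame."""
    return pd.DataFrame.from_records(records)


def fetch_json(url: str):
    """GET a JSON payload from an API (call with a real endpoint; not executed here)."""
    import requests  # pip install requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()


# Usage sketch (hypothetical URL):
# rates = records_to_frame(fetch_json("https://api.example.com/currency_rates"))
```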
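For the anomaly checks (missing values and outliers), a minimal pandas sketch using the common 1.5 × IQR fence rule:

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.Series:
    """Count missing values per column."""
    return df.isna().sum()


def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values lying outside the 1.5 * IQR fences."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]
```

Syntactic checks (malformed dates, bad encodings) and semantic checks (e.g., negative prices) would be rule-based filters layered on top of these statistical ones.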
(picture from DataTH.com)
- Create a Google Cloud Composer Cluster for running Apache Airflow
- Create an Airflow DAG definition file and instantiate the DAG
- Build tasks with BashOperator and PythonOperator
- Set up task dependencies
- TaskGroup in DAG
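A minimal sketch of what such a DAG definition might contain. The DAG id, schedule, and commands are illustrative placeholders, and the imports sit inside the builder so the sketch parses even without Airflow installed; a real definition file would import and create the DAG at module level so the scheduler can discover it.

```python
def build_dag():
    """Assemble a two-task DAG: a BashOperator feeding a PythonOperator."""
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator

    def transform():
        print("clean and reshape the extracted data here")

    dag = DAG(
        dag_id="r2de_pipeline",          # hypothetical DAG id
        schedule_interval="@daily",
        start_date=datetime(2024, 1, 1),
        catchup=False,
    )
    with dag:
        extract = BashOperator(task_id="extract", bash_command="echo extracting")
        clean = PythonOperator(task_id="clean", python_callable=transform)
        extract >> clean                 # dependency: extract runs before clean
    return dag


# In a real DAG file: dag = build_dag()  (module-level, so Airflow picks it up)
```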
(picture from DataTH.com)
- Normalization VS Denormalization concept (trade-off between storage and the time spent joining tables)
- Denormalization for Data Warehouses (optimize performance for frequent querying)
- Normalization for Databases
- Columnar Storage / Column-oriented Database
- Data Mart for each business unit -> together they form the Data Warehouse
- View VS Materialized View
- Index & Partitioning
- Automatic Data Importing through BashOperator -> `bq load` command
- Automatic Data Importing through an Airflow operator -> GCSToBigQueryOperator
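The storage-versus-join trade-off can be seen concretely with stdlib sqlite3: the normalized form stores each customer name once and pays for a join at query time, while the denormalized form repeats the name on every row to avoid the join. A toy illustration, not course code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized: customer attributes live in one place; orders reference them by id.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.execute("INSERT INTO customers VALUES (1, 'Ann')")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(1, 1, 9.5), (2, 1, 3.0)])

# Answering the question requires a join -- cheaper storage, more work per query.
normalized = cur.execute(
    "SELECT c.name, o.amount FROM orders o "
    "JOIN customers c ON o.customer_id = c.id ORDER BY o.id"
).fetchall()

# Denormalized: the name is repeated on every row -- more storage, no join.
cur.execute("CREATE TABLE orders_wide (id INTEGER PRIMARY KEY, customer_name TEXT, amount REAL)")
cur.executemany("INSERT INTO orders_wide VALUES (?, ?, ?)",
                [(1, 'Ann', 9.5), (2, 'Ann', 3.0)])
denormalized = cur.execute(
    "SELECT customer_name, amount FROM orders_wide ORDER BY id"
).fetchall()

assert normalized == denormalized  # same answer, different storage/query cost
```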
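The two loading routes in the last bullets might be sketched as below. Dataset, bucket, and file names are hypothetical placeholders, and the imports are kept inside the builder so the sketch parses without Airflow and the Google provider installed.

```python
def build_load_tasks(dag):
    """Two ways to load a GCS file into BigQuery from Airflow."""
    from airflow.operators.bash import BashOperator
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
        GCSToBigQueryOperator,
    )

    # Route 1: shell out to the bq CLI (same command you would run in Cloud Shell).
    bq_load = BashOperator(
        task_id="bq_load",
        bash_command=(
            "bq load --source_format=CSV --autodetect "
            "mydataset.transactions gs://my-bucket/cleaned.csv"  # hypothetical names
        ),
        dag=dag,
    )

    # Route 2: the dedicated transfer operator -- typed arguments, no shelling out.
    gcs_to_bq = GCSToBigQueryOperator(
        task_id="gcs_to_bq",
        bucket="my-bucket",                                      # hypothetical bucket
        source_objects=["cleaned.csv"],
        destination_project_dataset_table="mydataset.transactions",
        source_format="CSV",
        autodetect=True,
        dag=dag,
    )
    return bq_load, gcs_to_bq
```

The BashOperator route is the quickest to write; the operator route surfaces its configuration as typed arguments and integrates with Airflow's Google Cloud connections.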