In this project, I build a simple data pipeline following the ELT (extract - load - transform) model using the Brazilian E-Commerce dataset, performing data processing and transformation to serve reporting, in-depth analysis, and decision support for the Data Analyst team.
- Data Source: The project uses the Brazilian E-Commerce public dataset by Olist, downloaded from kaggle.com in `.csv` format.
- 5 of the csv files are loaded into `PostgreSQL`, which is treated as a data source; the remaining 4 csv files are extracted directly.
- Extract Data: Data is extracted with `Polars` as a `DataFrame` from the `PostgreSQL` database and from `CSV` files (a minimal sketch follows this list).
- Load Data: After extracting data from the two sources above, we load it from the `Polars` `DataFrame` into `Snowflake` at the `raw` layer.
- Transform Data: After loading the data, we transform it with `dbt` on `Snowflake` to create `dimension` and `fact` tables in the `staging` layer and calculate aggregates in the `mart` layer.
- Serving: Data is served for `reporting`, `analysis`, and `decision support` using `Metabase` and `Apache Superset`.
- Package and orchestration: The entire project is packaged and orchestrated with `Docker` and `Dagster`.
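As a rough illustration of the extract and load steps, the snippet below reads one table from `PostgreSQL` and one `CSV` file with `Polars`, then writes the result into the Snowflake `raw` schema. The connection URIs, credentials, and table names are placeholders, not the project's actual configuration.

```python
import polars as pl

# Extract: read a source table from PostgreSQL into a Polars DataFrame
# (the connection URI is a placeholder).
orders = pl.read_database_uri(
    query="SELECT * FROM olist_orders_dataset",
    uri="postgresql://admin:admin@localhost:5432/postgres",
)

# Extract: the remaining datasets are read directly from their CSV files.
reviews = pl.read_csv("data/olist_order_reviews_dataset.csv")

# Load: write the DataFrame into the Snowflake `raw` schema
# (the Snowflake URI is a placeholder; credentials normally come from env vars).
orders.write_database(
    table_name="raw.olist_orders_dataset",
    connection="snowflake://user:password@account/ecommerce/raw",
    if_table_exists="replace",
)
```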
- olist_geolocation_dataset: This dataset has information about Brazilian zip codes and their lat/lng coordinates.
- olist_customers_dataset: This dataset has information about the customer and their location.
- olist_order_items_dataset: This dataset includes data about the items purchased within each order.
- olist_order_payments_dataset: This dataset includes data about the orders' payment options.
- olist_order_reviews_dataset: This dataset includes data about the reviews made by the customers.
- olist_orders_dataset: This is the core dataset; from each order you can find all the other information (see the join sketch after this list).
- olist_products_dataset: This dataset includes data about the products sold by Olist.
- olist_sellers_dataset: This dataset includes data about the sellers that fulfilled orders made at Olist.
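To illustrate how the tables relate through the orders dataset, here is a small `Polars` sketch that joins orders with order items and customers to compute revenue per state. The file paths are assumptions; the column names follow the Kaggle schema.

```python
import polars as pl

# The orders table is the hub: order items and customers join to it
# via order_id and customer_id respectively (paths are placeholders).
orders = pl.read_csv("data/olist_orders_dataset.csv")
items = pl.read_csv("data/olist_order_items_dataset.csv")
customers = pl.read_csv("data/olist_customers_dataset.csv")

revenue_by_state = (
    orders
    .join(items, on="order_id")
    .join(customers, on="customer_id")
    .group_by("customer_state")
    .agg(pl.col("price").sum().alias("total_revenue"))
    .sort("total_revenue", descending=True)
)
print(revenue_by_state.head())
```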
The Graph Lineage (Dagster) in this project consists of 4 layers:
- source layer: This layer contains `assets` that collect data from `PostgreSQL` and `CSV` files as `Polars` `DataFrame`s (see the sketch after this list).
- raw layer: This layer contains `assets` that load data from the `Polars` `DataFrame`s into the `Snowflake` warehouse in the `raw` schema.
- staging layer: This layer contains `assets` that handle data transformation from the `raw` schema; the transformed data is put into the `staging` schema.
- mart layer: This layer contains `assets` that compute aggregates from data in the `staging` schema and put the results into the `mart` schema.
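The sketch below shows how such a four-layer chain can be expressed as Dagster `assets`, where each asset depends on the one upstream of it. The asset names, columns, and per-layer logic here are simplified assumptions for illustration; in the project the staging and mart transformations are actually done by `dbt` on `Snowflake`.

```python
import polars as pl
from dagster import asset

@asset(group_name="source_layer")
def source_orders() -> pl.DataFrame:
    # source layer: collect data from PostgreSQL / CSV with Polars (placeholder path)
    return pl.read_csv("data/olist_orders_dataset.csv")

@asset(group_name="raw_layer")
def raw_orders(source_orders: pl.DataFrame) -> pl.DataFrame:
    # raw layer: load the DataFrame into the Snowflake `raw` schema (write omitted here)
    return source_orders

@asset(group_name="staging_layer")
def stg_orders(raw_orders: pl.DataFrame) -> pl.DataFrame:
    # staging layer: clean and filter the raw data
    return raw_orders.filter(pl.col("order_status") == "delivered")

@asset(group_name="mart_layer")
def mart_daily_orders(stg_orders: pl.DataFrame) -> pl.DataFrame:
    # mart layer: aggregate staging data, e.g. delivered orders per day
    return (
        stg_orders
        .group_by(pl.col("order_purchase_timestamp").str.slice(0, 10).alias("order_date"))
        .agg(pl.col("order_id").count().alias("n_orders"))
    )
```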
- PostgreSQL
- Polars
- dbt
- Dagster
- Snowflake
- Docker
- Metabase
- Apache Superset
Here's what you can do with this project:
- You can completely change the logic or create new `assets` in the `data pipeline` as you wish, and perform `aggregate` calculations on the `assets` in the `pipeline` according to your purposes (a sketch follows this list).
- You can also create new `data charts` or change the existing `charts` as you like, with extremely diverse `chart types` on `Metabase` and `Apache Superset`.
- You can also create new dashboards or change my existing `dashboards` as you like.
- Add more `data sources` to increase data richness.
- Consider other `data warehouses` besides `Snowflake`, such as `Amazon Redshift` or `Google BigQuery`.
- Perform more `cleaning` and `optimization` processing of the data.
- Perform more advanced `statistics`, `analysis`, and `calculations`.
- Check out other popular `data orchestration` tools like `Apache Airflow`.
- Separate `dbt` into its own service (a separate `container`) in `docker` when the project expands.
- Learn about `dbt packages` like `dbt-labs/dbt_utils` to help make the `transformation` process faster and more optimal.
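As one example of extending the pipeline, a new mart-layer asset could compute an extra aggregate on top of an existing staging asset. The asset name `stg_payments` is hypothetical; the columns follow the payments dataset schema.

```python
import polars as pl
from dagster import asset

@asset(group_name="mart_layer")
def mart_payment_mix(stg_payments: pl.DataFrame) -> pl.DataFrame:
    # Hypothetical new mart asset: share of orders by payment type,
    # built on top of an assumed upstream staging asset `stg_payments`.
    return (
        stg_payments
        .group_by("payment_type")
        .agg(pl.col("order_id").n_unique().alias("n_orders"))
        .with_columns((pl.col("n_orders") / pl.col("n_orders").sum()).alias("share"))
        .sort("share", descending=True)
    )
```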
To run the project in your local environment, follow these steps:
- Run `git clone https://github.com/longNguyen010203/ECommerce-ELT-Pipeline.git` to clone the repository to your local machine.
- Run `make build` to build the images from the Dockerfile.
- Run `make up` to pull images from Docker Hub and launch the services.
- Run `make psql_create` to create the tables and schema in PostgreSQL.
- Run `make psql_import` to load data from the CSV files into PostgreSQL.
- Open http://localhost:3001 and click the `Materialize all` button to run the pipeline.
- Open https://app.snowflake.com and log in to check and monitor the updated data.
- Open http://localhost:3030 to see the charts and dashboards.