What is it? 💡 To generate insights about life expectancy, we built an ETL pipeline that creates a data warehouse from the APIs of the World Bank, the World Health Organization, and the United Nations. With this clean data, we created stunning dashboards and predictions based on machine learning models. Why did you build this project? 💡 This is the final project for Henry's Data Science Bootcamp.
What was my motivation? 💡 To help reduce the life expectancy gap between developed and underdeveloped countries by three years over the next 10 years! What did you learn? 💡 Agile methodologies (SCRUM) and a GitHub flow to collaborate as a team. The pipeline was built on "ETL with Airflow running as a web service," "Data Analytics with simple notebooks," "Data Lake and Data Warehouse as a Service on Linode Cloud," "Interactive dashboards using PowerBI and Streamlit for ML predictions," and "Docker for development and production environments".
- Life Expectancy
-
We are building a pipeline with Apache Airflow. It begins by Extracting raw data from the World Bank, the World Health Organization, and the United Nations via their respective APIs. Second, we Transform this raw data, using a bucket to save and retrieve it in parquet format. Third, we Load the clean data into a Postgres data warehouse so that it is ready for any PowerBI or Streamlit connection.
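The three stages map naturally onto an Airflow DAG. The sketch below is only illustrative: the DAG id, task names, schedule, and empty callables are our assumptions, not the project's actual DAG code.

```python
# Illustrative skeleton only, not the project's real DAG
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    """Call the World Bank / WHO / UN APIs and save the raw responses."""


def transform():
    """Clean the raw data and write it to the bucket as parquet files."""


def load():
    """Ingest the clean parquet files into the Postgres data warehouse."""


with DAG(
    dag_id="life_expectancy_etl_sketch",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```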
-
In production, it will use an EC2-style instance for the Airflow pipeline, Object Storage (an S3 bucket) for the parquet files, and a relational database for the data warehouse and backups, all on the Linode Cloud platform because it is less expensive and simpler than AWS, Azure, or Google Cloud.
-
In development mode, we use docker-compose to orchestrate the Airflow pipeline and the Postgres database. Because there will be large files, we save them as binary parquet files tracked with Git LFS (Large File Storage). The remote parquet files on GitHub are then used for connections with PowerBI or Streamlit, as in the sketch below.
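For example, a Streamlit app or a notebook could read one of the LFS-tracked parquet files straight from the repository. The URL and file name below are placeholders, and the snippet assumes pyarrow (or fastparquet) is installed:

```python
import io

import pandas as pd
import requests

# Hypothetical raw URL of an LFS-tracked parquet file in this repository
RAW_URL = (
    "https://github.com/<user>/<repo>/raw/main/"
    "data/datos_procesados/indice.parquet"
)

response = requests.get(RAW_URL, timeout=30)  # GitHub redirects to the LFS object
response.raise_for_status()

# Parse the downloaded bytes into a dataframe
df = pd.read_parquet(io.BytesIO(response.content))
print(df.head())
```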
-
In production mode, the EC2 instance will pull any changes from the development environment, save and retrieve data from the S3 bucket manually or automatically, and ingest the clean data into a Postgres data warehouse. PowerBI and Streamlit will be able to access this data warehouse and display dashboards in real time.
-
First, we will select the indices that we will request from the APIs and add them to the corresponding json file.
```
# data/datos_inyectados/naciones_unidas.json
{
  "22": ["mort", "tasa_mortalidad_infantil"],
  "54": ["pop", "densidad_población_por_kilómetro_cuadrado)"],
  "65": ["imigrt", "migración_neta_total"],
  "49": ["pop", "población_total_por_sexo"],
  "60": ["mort", "total_muertes_por_sexo"],
  "53": ["pop", "tasa_bruta_cambio_natural_población"],
  "66": ["imigrt", "tasa_bruta_migración_neta"],
  "72": ["pop", "proporción_sexos_población_total"],
  "1": ["fam", "prevalencia_anticonceptivos_porcentaje"],
  "67": ["pop", "mediana_edad_población"],
  "59": ["mort", "tasa_bruta_mortalidad_por_1000_habitantes"],
  "51": ["pop", "tasa_bruta_variación_total_población"],
  "50": ["pop", "cambio_de_la_población"],
  "41": ["pop", "población_femenina_edad_reproductiva_(15-49 años)"],
  "24": ["mort", "tasa_mortalidad_menores_cinco_años"],
  "52": ["pop", "cambio_natural_población"],
  "19": ["fert", "tasa_fertilidad"],
  "42": ["marstat", "estado_civil_casado_porcentaje"]
}
```
```
# data/datos_inyectados/banco_mundial.json
{
  "SP.DYN.LE00.IN": "esperanza_vida_total",
  "SP.DYN.LE00.FE.IN": "esperanza_vida_mujeres",
  "SP.DYN.LE00.MA.IN": "esperanza_vida_varones",
  "SI.POV.GINI": "índice_gini",
  "SE.XPD.TOTL.GD.ZS": "gasto_púb_educacion_pje",
  "SE.COM.DURS": "duración_educ_obligatoria",
  "NY.GDP.PCAP.CD": "pib_pc_usd_actuales",
  "NY.GDP.MKTP.PP.CD": "pib_ppa_prec_inter",
  "IQ.SCI.OVRL": "capacidad_estadística",
  "SP.POP.TOTL.FE.ZS": "población_mujeres_pje",
  "SP.POP.TOTL.MA.ZS": "población_hombres_pje",
  "NY.GDP.PCAP.PP.CD": "pib_pc_prec_inter",
  "AG.LND.FRST.ZS": "porcentaje_de_bosque",
  "EN.ATM.CO2E.PC": "emisiones_co2",
  "SH.XPD.CHEX.PC.CD": "inversion_salud_percapita",
  "SH.MED.BEDS.ZS": "camas_hospitales_c/1000personas",
  "SP.DYN.IMRT.IN": "mortalidad_infantil_c/1000nacimientos",
  "SH.H2O.BASW.ZS": "acceso_agua_potable(%)",
  "SH.STA.BASS.ZS": "acceso_servicios_sanitarios(%)",
  "SH.STA.SUIC.P5": "tasa_mortalidad_suicidio_c/100.000",
  "SL.UEM.TOTL.ZS": "tasa_desempleo",
  "SP.URB.TOTL.IN.ZS": "tasa_poblacion_urbana",
  "NY.GNP.PCAP.CD": "INB_percapita",
  "PV.EST": "estabilidad_política"
}
```
```
# data/datos_inyectados/mundial_salud.json
{
  "M_Est_cig_curr": "df_OMS_M_Est_cig_curr",
  "NCD_BMI_30A": "df_OMS_NCD_BMI_30A",
  "NUTRITION_ANAEMIA_CHILDREN_PREV": "df_OMS_NUTRITION_ANAEMIA_CHILDREN_PREV",
  "NUTRITION_ANAEMIA_REPRODUCTIVEAGE_PREV": "df_OMS_NUTRITION_ANAEMIA_REPRODUCTIVEAGE_PREV",
  "SA_0000001688": "df_OMS_SA_0000001688"
}
```
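As an illustration of how these mappings can be consumed, the snippet below takes one indicator from banco_mundial.json and requests it from the public World Bank API v2 (the UN and WHO APIs use different formats); the column handling is a sketch, not the project's extraction code.

```python
import json

import pandas as pd
import requests

# Injected mapping: World Bank indicator code -> Spanish column name
with open("data/datos_inyectados/banco_mundial.json", encoding="utf-8") as f:
    indicadores = json.load(f)

codigo, nombre = next(iter(indicadores.items()))  # e.g. "SP.DYN.LE00.IN"

# World Bank API v2: one indicator, all countries, JSON output
url = f"https://api.worldbank.org/v2/country/all/indicator/{codigo}"
resp = requests.get(url, params={"format": "json", "per_page": 20000}, timeout=60)
resp.raise_for_status()

# The payload is [metadata, records]; keep country, year and value
registros = resp.json()[1]
df = pd.json_normalize(registros)[["country.value", "date", "value"]]
df = df.rename(columns={"value": nombre})
print(df.head())
```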
-
So, in this first section, we create the folders called "datos brutos", "datos pre procesados", and "datos procesados" (raw, pre-processed, and processed data), where we will save and retrieve data.
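A folders_creation task can come down to a few lines; the underscored folder names below are assumed to match the task names in the tables that follow, and the base path is an assumption:

```python
from pathlib import Path

# Assumed layout: data/<stage> folders used by the rest of the pipeline
BASE = Path("data")
for carpeta in ("datos_brutos", "datos_pre_procesados", "datos_procesados"):
    (BASE / carpeta).mkdir(parents=True, exist_ok=True)
```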
-
Next, the raw data from the World Bank and United Nations APIs is used to generate a countries dataframe.
-
We then generate raw data for the indices previously chosen in the injected JSON files.
| Development | Production |
| --- | --- |
| folders_creation | folders_creation |
| creation_of_df_contry_from_twb | upload_from_datos_inyectados_to_S3_bucket |
| creation_of_df_contry_from_unpd | creation_of_df_contry_from_twb |
| twb_indice_extraction | creation_of_df_contry_from_unpd |
| unpd_indice_extraction | twb_indice_extraction |
|  | unpd_indice_extraction |
|  | upload_from_datos_brutos_to_S3_bucket |
-
For the transformation stage, we create a table dataframe with all the previously extracted indices.
-
We rename all the columns of the extracted indices with their Spanish names (a pandas sketch of this step follows the task table below).
-
We also create new temporary table dataframes for each table of the database that we will build in the following step.
| Development | Production |
| --- | --- |
| read_and_transformation_for_twb_unpd | read_and_transformation_for_twb_unpd |
| final_transformations_for_twb_unpd | final_transformations_for_twb_unpd |
| rename_country_columns | rename_country_columns |
| create_new_columns_for_temp_tables | create_new_columns_for_temp_tables |
|  | upload_from_datos_pre_procesados_to_S3_bucket |
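A minimal pandas sketch of the renaming step, assuming the pre-processed parquet file still carries the World Bank codes as column names (the file paths are hypothetical):

```python
import json

import pandas as pd

# Injected mapping of indicator codes to Spanish column names
with open("data/datos_inyectados/banco_mundial.json", encoding="utf-8") as f:
    nombres = json.load(f)

# Hypothetical intermediate file produced by the extraction tasks
df = pd.read_parquet("data/datos_pre_procesados/indices_twb.parquet")

# Columns still named by World Bank code get their Spanish names
df = df.rename(columns=nombres)
df.to_parquet("data/datos_pre_procesados/indices_twb_renombrados.parquet", index=False)
```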
-
Finally, we create each table of the project's data warehouse. We chose two fact tables ("ingreso" and "indice") and two dimension tables ("pais" and "nivel").
-
The data is then loaded into a Postgres database (a load sketch follows the task table below).
| Development | Production |
| --- | --- |
| create_df_table_ingreso | create_df_table_ingreso |
| create_df_table_pais | create_df_table_pais |
| create_df_table_nivel | create_df_table_nivel |
| create_df_table_indice | create_df_table_indice |
| remove_temporal_tables | remove_temporal_tables |
| load_to_postgres_db | upload_from_datos_procesados_to_S3_bucket |
|  | load_to_postgres_db_clusters |
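The load step can be sketched with pandas and SQLAlchemy; the connection string placeholders and parquet file names are assumptions (the real credentials come from the .env file described in the deployment section):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; real values come from the .env file
engine = create_engine("postgresql://<user>:<password>@<host>:<port>/<db_name>")

# Hypothetical processed parquet files, one per warehouse table
for tabla in ("pais", "nivel", "ingreso", "indice"):
    df = pd.read_parquet(f"data/datos_procesados/{tabla}.parquet")
    df.to_sql(tabla, engine, if_exists="replace", index=False)
```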
- ToDo
- ToDo
In development
-
We use Git LFS to track, save, and retrieve data from the data/ folder.
-
We use notebooks in the draft/ folder to connect to the API sources, perform exploratory data analysis on the extracted raw data, transform it, and load it into various databases.
-
Extraction
-
Transformation
-
Load
In production
-
We use an EC2-style instance on Linode to run Airflow, connected to an S3 bucket on Linode Object Storage to track, save, and retrieve data; the data warehouse is finally stored in Database Clusters on Linode's platform.
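Because Linode Object Storage is S3-compatible, the bucket can be reached with boto3 by pointing endpoint_url at the Linode endpoint. The environment variable names match the .env keys shown later; the bucket and key names are placeholders:

```python
import os

import boto3

# S3-compatible client for Linode Object Storage
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["AWS_ENDPOINT_URL"],  # e.g. a Linode region endpoint
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

BUCKET = "life-expectancy-bucket"  # hypothetical bucket name

# Upload a processed parquet file and read it back
s3.upload_file("data/datos_procesados/indice.parquet", BUCKET, "datos_procesados/indice.parquet")
s3.download_file(BUCKET, "datos_procesados/indice.parquet", "/tmp/indice.parquet")
```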
-
Extraction
-
Transformation
-
Load
ToDo
The dashboard is built in PowerBI. Some screenshots:
-
This is the data warehouse connection from Linode's Database Clusters to PowerBI.
-
Image 1: Shows the income disparity between countries with low, mid-low, mid-high, and high incomes, and a dynamic map displaying the average life expectancy in each country:
-
Image 2: A record of the behavior of life expectancy at birth according to year and income level:
-
Image 3: A record of the behavior of the 7 factors that have the greatest influence on life expectancy:
-
To experiment with it locally, you must have Docker and docker-compose installed on your computer. You can check whether they are installed with:
```bash
docker --version
docker-compose --version
```
-
If not, consult the Docker documentation for installation instructions specific to your operating system.
-
Next, create the environment variables and folders that will be used during the Airflow workflow:
```bash
mkdir -p ./dags ./logs ./plugins ./data
echo -e "AIRFLOW_UID=$(id -u)" > .env
```
-
On Linux, this code automatically generates the .env file and the folders; alternatively, create the folders manually, rename the file .env.sample to .env, and add the following line:
AIRFLOW_UID=50000
-
Since we use an extended version of Airflow, the following command must be executed to build the extended Airflow Docker image.
docker build . --tag extending_airflow:latest
-
After that, the docker-compose YAML file can be executed without errors by initializing the containers with the following command.
docker-compose up airflow-init
-
This will start the Docker containers with the Airflow service, the scheduler, and a Postgres database to store the data.
-
Finally, we can run it with:
docker-compose up -d
-
We launch Ubuntu 22.04.1 LTS on a Linode instance, which is very similar to an AWS EC2 instance. Say yes to the SSH fingerprint prompt and set the password.
-
Install Docker and docker-compose.
-
Create an SSH key for GitHub:
ssh-keygen -t ed25519 -b 4096
-
We clone the repository and add the corresponding .env file with the S3 bucket and Postgres database keys.
```
# .env
AIRFLOW_UID=1000
AWS_ACCESS_KEY_ID=AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY=AWS_SECRET_ACCESS_KEY
AWS_ENDPOINT_URL=AWS_ENDPOINT_URL
DB_USERNAME=DB_USERNAME
DB_PASSWORD=DB_PASSWORD
DB_HOST=DB_HOST
DB_PORT=DB_PORT
DB_NAME=DB_NAME
```
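docker-compose injects these values into the containers as plain environment variables, so the DAG code can read them with os.environ; locally, python-dotenv can load the same file. A minimal sketch:

```python
import os

from dotenv import load_dotenv  # python-dotenv; only needed outside the containers

load_dotenv()  # read .env from the current directory into os.environ

# Postgres connection string for SQLAlchemy / psycopg2
db_url = (
    f"postgresql://{os.environ['DB_USERNAME']}:{os.environ['DB_PASSWORD']}"
    f"@{os.environ['DB_HOST']}:{os.environ['DB_PORT']}/{os.environ['DB_NAME']}"
)

# Credentials for the S3-compatible Linode bucket
s3_endpoint = os.environ["AWS_ENDPOINT_URL"]
```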
-
Remember to change the third line of the Dockerfile because we are no longer using jupyter or black.
```dockerfile
# FROM apache/airflow:2.4.2
# COPY /requirements/ /requirements/
# RUN pip install --user --upgrade pip
RUN pip install --no-cache-dir --user -r /requirements/production.txt
```
-
Extend the Airflow image, initialize the containers with the docker-compose YAML, and start the Docker containers with the Airflow service:
```bash
docker build . --tag extending_airflow:latest
docker-compose up airflow-init
docker-compose up -d
```
-
Open port 8080 and run the Dag_for_production.
-
If you want to learn more about how the code was developed, check out the documentation on the Wiki (sorry, but the documentation for the KPI we want to demonstrate is in Spanish).
-
Alternatively, you can access the notebooks in the draft/ folder, where the code is developed step by step.
-
You can learn from the architecture we designed and the pipeline we implemented.
-
You can play around with the PowerBI dashboard or draw your own conclusions using the machine learning models we used. The links can be found in the Produce section.
ToDo
- helper link
- helper link
- helper link
- helper link
- dependency resolver and version conflicts
- add border to image in github markdown
- linode object-storage guides for python
- delete multiple files and specific pattern in s3 boto3
- linode urls
- hidden file env not copied using docker copy
- how can i access environment variables in python