
Life Expectancy

Project Description

What is it? 💡 To generate insights about life expectancy, we built an ETL pipeline that creates a data warehouse from the APIs of the World Bank, the World Health Organization, and the United Nations. With this clean data, we created dashboards and predictions based on machine learning models. Why did we build this project? 💡 This is the final project for Henry's Data Science Bootcamp.

What was our motivation? 💡 To help reduce the life-expectancy gap between developed and underdeveloped countries by three years over the next 10 years! What did we learn? 💡 Agile methodologies (SCRUM) and a GitHub flow for collaborating as a team. The pipeline was built on the following architecture: "ETL with Airflow running as a web service," "Data Analytics with simple Notebooks," "Data Lake and Data Warehouse as a Service on Linode Cloud," "Interactive Dashboards using PowerBI and Streamlit for ML predictions," and "Docker for the development and production environments".

Table of Contents

Parts of the Project

1. Architecture

  • We are building a pipeline with Apache Airflow. It begins by Extracting raw data from the World Bank, the World Health Organization, and the United Nations via their respective APIs. Second, we Transform this raw data, using a bucket to save and retrieve it in parquet format. Third, we Load the clean data into a Postgres data warehouse so that it is ready for any PowerBI or Streamlit connection (a minimal DAG sketch follows this list).

  • In production, we use an EC2-like compute instance for the Airflow pipeline, an Object Storage (S3-compatible) bucket for storing the parquet files, and a managed relational database for the data warehouse and backups, all on the Linode Cloud platform because it is less expensive and simpler than AWS, Azure, or Google Cloud.
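
  • As an illustration only, a minimal Airflow DAG wiring the three stages could look like the sketch below; the DAG and task names are assumptions, not the exact task IDs used in this repository.

    # Minimal sketch of the ETL wiring; names here are illustrative, not the
    # exact DAG/task IDs used in this repository.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract():
        """Request raw indicators from the World Bank, WHO, and UN APIs."""


    def transform():
        """Clean the raw data and save/retrieve parquet files in the bucket."""


    def load():
        """Ingest the clean tables into the Postgres data warehouse."""


    with DAG(
        dag_id="life_expectancy_etl",          # hypothetical name
        start_date=datetime(2022, 1, 1),
        schedule_interval="@monthly",
        catchup=False,
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)
        t_load = PythonOperator(task_id="load", python_callable=load)

        t_extract >> t_transform >> t_load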

2. DevOps

  • In development mode, we use docker-compose to orchestrate the Airflow pipeline and the Postgres database. Because there will be large files, we save them as binary parquet files tracked with Git LFS (Large File Storage). The remote parquet files on GitHub are used for connections with PowerBI or Streamlit.

  • In production mode, the EC2-like instance will pull any changes from the development environment, save and retrieve data from the S3 bucket manually or automatically, and ingest the clean data into the Postgres data warehouse. PowerBI and Streamlit will be able to access this data warehouse and display dashboards in real time.

3. Extraction

  • First, we select the indices that we will request from the APIs and add them to the corresponding JSON file.

     # data/datos_inyectados/naciones_unidas.json
     {
         "22": ["mort", "tasa_mortalidad_infantil"],
         "54": ["pop", "densidad_población_por_kilómetro_cuadrado)"],
         "65": ["imigrt", "migración_neta_total"],
         "49": ["pop", "población_total_por_sexo"],
         "60": ["mort", "total_muertes_por_sexo"],
         "53": ["pop", "tasa_bruta_cambio_natural_población"],
         "66": ["imigrt", "tasa_bruta_migración_neta"],
         "72": ["pop", "proporción_sexos_población_total"],
         "1": ["fam", "prevalencia_anticonceptivos_porcentaje"],
         "67": ["pop", "mediana_edad_población"],
         "59": ["mort", "tasa_bruta_mortalidad_por_1000_habitantes"],
         "51": ["pop", "tasa_bruta_variación_total_población"],
         "50": ["pop", "cambio_de_la_población"],
         "41": ["pop", "población_femenina_edad_reproductiva_(15-49 años)"],
         "24": ["mort", "tasa_mortalidad_menores_cinco_años"],
         "52": ["pop", "cambio_natural_población"],
         "19": ["fert", "tasa_fertilidad"],
         "42": ["marstat", "estado_civil_casado_porcentaje"]
     }
    # data/datos_inyectados/banco_mundial.json
    {
        "SP.DYN.LE00.IN": "esperanza_vida_total",
        "SP.DYN.LE00.FE.IN": "esperanza_vida_mujeres",
        "SP.DYN.LE00.MA.IN": "esperanza_vida_varones",
        "SI.POV.GINI": "índice_gini",
        "SE.XPD.TOTL.GD.ZS": "gasto_púb_educacion_pje",
        "SE.COM.DURS": "duración_educ_obligatoria",
        "NY.GDP.PCAP.CD": "pib_pc_usd_actuales",
        "NY.GDP.MKTP.PP.CD": "pib_ppa_prec_inter",
        "IQ.SCI.OVRL": "capacidad_estadística",
        "SP.POP.TOTL.FE.ZS": "población_mujeres_pje",
        "SP.POP.TOTL.MA.ZS": "población_hombres_pje",
        "NY.GDP.PCAP.PP.CD": "pib_pc_prec_inter",
        "AG.LND.FRST.ZS": "porcentaje_de_bosque",
        "EN.ATM.CO2E.PC": "emisiones_co2",
        "SH.XPD.CHEX.PC.CD": "inversion_salud_percapita",
        "SH.MED.BEDS.ZS": "camas_hospitales_c/1000personas",
        "SP.DYN.IMRT.IN": "mortalidad_infantil_c/1000nacimientos",
        "SH.H2O.BASW.ZS": "acceso_agua_potable(%)",
        "SH.STA.BASS.ZS": "acceso_servicios_sanitarios(%)",
        "SH.STA.SUIC.P5": "tasa_mortalidad_suicidio_c/100.000",
        "SL.UEM.TOTL.ZS": "tasa_desempleo",
        "SP.URB.TOTL.IN.ZS": "tasa_poblacion_urbana",
        "NY.GNP.PCAP.CD": "INB_percapita",
        "PV.EST": "estabilidad_política"
    }
    # data/datos_inyectados/mundial_salud.json
    {
       "M_Est_cig_curr": "df_OMS_M_Est_cig_curr",
       "NCD_BMI_30A": "df_OMS_NCD_BMI_30A",
       "NUTRITION_ANAEMIA_CHILDREN_PREV": "df_OMS_NUTRITION_ANAEMIA_CHILDREN_PREV",
       "NUTRITION_ANAEMIA_REPRODUCTIVEAGE_PREV": "df_OMS_NUTRITION_ANAEMIA_REPRODUCTIVEAGE_PREV",
       "SA_0000001688": "df_OMS_SA_0000001688"
    }        
  • In this first step, we create the folders called "datos_brutos", "datos_pre_procesados", and "datos_procesados", where data will be saved and retrieved.

  • Next, raw data from the World Bank and the United Nations APIs is used to generate a countries dataframe.

  • We then generate raw data for the indices previously chosen in the injected JSON data files (see the extraction sketch after the task table below).

    | Development                     | Production                                |
    | ------------------------------- | ----------------------------------------- |
    | folders_creation                | folders_creation                          |
    | creation_of_df_contry_from_twb  | upload_from_datos_inyectados_to_S3_bucket |
    | creation_of_df_contry_from_unpd | creation_of_df_contry_from_twb            |
    | twb_indice_extraction           | creation_of_df_contry_from_unpd           |
    | unpd_indice_extraction          | twb_indice_extraction                     |
    |                                 | unpd_indice_extraction                    |
    |                                 | upload_from_datos_brutos_to_S3_bucket     |
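
  • As a rough illustration of what one extraction task does (not the exact task code; file paths under data/ are assumptions based on the folder names above), an indicator listed in banco_mundial.json can be requested from the World Bank API and stored as parquet:

    # Sketch: pull one World Bank indicator from the injected JSON file and
    # save the raw response as parquet (output path is an assumption).
    import json

    import pandas as pd
    import requests

    with open("data/datos_inyectados/banco_mundial.json") as f:
        indicators = json.load(f)

    code, spanish_name = "SP.DYN.LE00.IN", indicators["SP.DYN.LE00.IN"]
    url = f"https://api.worldbank.org/v2/country/all/indicator/{code}"
    response = requests.get(url, params={"format": "json", "per_page": 20000})
    metadata, rows = response.json()          # the API returns [page metadata, rows]

    raw = pd.json_normalize(rows)             # one row per country-year observation
    raw.to_parquet(f"data/datos_brutos/{spanish_name}.parquet", index=False)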

4. Transformation

  • For the transformation stage, we create a table dataframe with all of the previously extracted indices.

  • We rename all columns with the Spanish names of the extracted indices.

  • We also create new temporary table dataframes for each table of the database that we will create in the following step (see the sketch after the task table below).

    | Development                           | Production                                    |
    | ------------------------------------- | --------------------------------------------- |
    | read_and_transformation_for_twb_unpd  | read_and_transformation_for_twb_unpd          |
    | final_transformations_for_twb_unpd    | final_transformations_for_twb_unpd            |
    | rename_country_columns                | rename_country_columns                        |
    | create_new_columns_for_temp_tables    | create_new_columns_for_temp_tables            |
    |                                       | upload_from_datos_pre_procesados_to_S3_bucket |
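
  • A minimal sketch of the renaming step, assuming the raw World Bank data was saved as a single parquet file with one column per indicator code (file names here are hypothetical):

    # Sketch of the column-renaming step; the real logic lives in the Airflow tasks.
    import json

    import pandas as pd

    with open("data/datos_inyectados/banco_mundial.json") as f:
        spanish_names = json.load(f)   # e.g. "SP.DYN.LE00.IN" -> "esperanza_vida_total"

    # Hypothetical raw file with one column per World Bank indicator code.
    df = pd.read_parquet("data/datos_brutos/twb_indices.parquet")
    df = df.rename(columns=spanish_names)

    # Persist the pre-processed table for the load stage.
    df.to_parquet("data/datos_pre_procesados/twb_indices.parquet", index=False)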

5. Load

  • Finally, we create each data table of the data warehouse for this project. We chose two fact tables ("ingreso" and "indice") and two dimension tables ("pais" and "nivel").

  • The data is then loaded into a Postgres database (a minimal load sketch follows the task table below).

    | Development              | Production                                |
    | ------------------------ | ----------------------------------------- |
    | create_df_table_ingreso  | create_df_table_ingreso                   |
    | create_df_table_pais     | create_df_table_pais                      |
    | create_df_table_nivel    | create_df_table_nivel                     |
    | create_df_table_indice   | create_df_table_indice                    |
    | remove_temporal_tables   | remove_temporal_tables                    |
    | load_to_postgres_db      | upload_from_datos_procesados_to_S3_bucket |
    |                          | load_to_postgres_db_clusters              |
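
  • A minimal sketch of the load step, assuming the final tables were written as parquet files named after the warehouse tables; the connection uses the DB_* variables from the .env file shown in the production section:

    # Sketch: push the final tables into the Postgres data warehouse.
    import os

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine(
        f"postgresql://{os.environ['DB_USERNAME']}:{os.environ['DB_PASSWORD']}"
        f"@{os.environ['DB_HOST']}:{os.environ['DB_PORT']}/{os.environ['DB_NAME']}"
    )

    for table in ("pais", "nivel", "ingreso", "indice"):
        # Hypothetical parquet paths named after the warehouse tables.
        df = pd.read_parquet(f"data/datos_procesados/{table}.parquet")
        df.to_sql(table, engine, if_exists="replace", index=False)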

6. EDA

  • ToDo

7. Machine Learning

  • ToDo

Products

1. Airflow

In development

  • We use Git LFS to track, save, and retrieve data from the data/ folder.

  • We use notebooks in the draft/ folder to connect to the API sources, perform exploratory data analysis on the extracted raw data, transform it, and load it into the various databases.

  • Extraction

  • Transformation

  • Load

In production

  • We use an EC2-like instance on Linode to run Airflow. It is connected to an S3 bucket on Linode Object Storage to track, save, and retrieve data, and the data warehouse is finally stored in Database Clusters on Linode's platform (a connection sketch follows this list).

  • Extraction

  • Transformation

  • Load
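
  • Because Linode Object Storage is S3-compatible, the bucket can be reached with boto3 plus the custom endpoint from the .env file. A minimal sketch, with hypothetical bucket and key names:

    # Sketch: connect to the S3-compatible Linode Object Storage bucket.
    import os

    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ["AWS_ENDPOINT_URL"],
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    )

    # Hypothetical bucket and key names mirroring the local data/ layout.
    s3.upload_file(
        "data/datos_brutos/esperanza_vida_total.parquet",
        "life-expectancy-bucket",
        "datos_brutos/esperanza_vida_total.parquet",
    )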

2. Streamlit

ToDo

3. PowerBI

The dashboard is built in PowerBI. Some screenshots:

  • This is the data warehouse connection from Linode's Database Clusters to PowerBI.

  • Image 1: Shows the income disparity between countries with low, mid-low, mid-high, and high incomes, along with a dynamic map displaying the average life expectancy in each country:

  • Image 2: A record of the behavior of life expectancy at birth according to year and income level:

  • Image 3: A record of the behavior of the 7 factors that have the greatest influence on life expectancy:

How to Install and Run the Project

1. Locally

  • To experiment with it locally, you must have Docker and docker-compose installed on your computer. You can check whether they are installed by running:

    docker --version
    docker-compose --version
  • If not, consult the Docker documentation for installation instructions specific to your operating system.

  • Create the environment variables and folders that will be used during the Airflow workflow:

    mkdir -p ./dags ./logs ./plugins ./data
    echo -e "AIRFLOW_UID=$(id -u)" > .env
  • On Linux, this code will automatically generate the .env file and folders; alternatively, create the folders manually, rename the file .env.sample to .env, and add the following line.

    AIRFLOW_UID=50000
    
  • Because we use an extended version of Airflow, the following command must be executed to build the extended Airflow Docker image.

    docker build . --tag extending_airflow:latest
  • Initialize Airflow with the following command so that the docker-compose YAML file runs without errors.

    docker-compose up airflow-init
  • This will start the Docker containers with the Airflow service, the scheduler, and a Postgres database to store the data.

  • Finally, we can run it with:

    docker-compose up -d

2. Production

  • We launch Ubuntu 22.04.1 LTS on a Linode instance (very similar to an AWS EC2 instance), accept the SSH fingerprint, and set a password.

  • Install Docker and docker-compose.

  • Create an SSH key for GitHub.

    ssh-keygen -t ed25519 -b 4096
  • We clone the repository and add the corresponding .env file with the S3 bucket and Postgres database keys.

    # .env
    AIRFLOW_UID=1000
    AWS_ACCESS_KEY_ID=AWS_ACCESS_KEY_ID
    AWS_SECRET_ACCESS_KEY=AWS_SECRET_ACCESS_KEY
    AWS_ENDPOINT_URL=AWS_ENDPOINT_URL
    DB_USERNAME=DB_USERNAME
    DB_PASSWORD=DB_PASSWORD
    DB_HOST=DB_HOST
    DB_PORT=DB_PORT
    DB_NAME=DB_NAME
  • Remember to change the requirements line of the Dockerfile so that it installs production.txt, because we are no longer using jupyter or black.

    # FROM apache/airflow:2.4.2
    # COPY /requirements/ /requirements/
    # RUN pip install --user --upgrade pip
    RUN pip install --no-cache-dir --user -r /requirements/production.txt
    
  • Build the extended version of Airflow, create the containers with the docker-compose YAML file, and start the Docker containers with the Airflow service.

    docker build . --tag extending_airflow:latest
    docker-compose up airflow-init
    docker-compose up -d
  • Open port 8080 and run the Dag_for_production.

How to ...

1. Contribute

  • If you want to learn more about how the code was developed, check out the documentation in the Wiki (sorry, the documentation for the KPI we want to demonstrate is in Spanish).

  • Alternatively, you can open the notebooks in the draft folder, where the code is developed step by step.

2. Use the Project

  • You can learn from the architecture we designed and the pipeline we implemented.

  • You can play around with the PowerBI dashboard or draw your own conclusions using the machine learning models we use. The links can be found in the Products section.

Test

ToDo

+ Info

License: GNU GPLv3
