This is a data pipeline that takes in Fitbit data, uploads it to a Data Lake (Google Cloud Storage), copies it into a BigQuery database, and injects SQL into BigQuery to create a data schema ready for analysis.
- Visualize changes in Fitbit biometrics across time periods
- WHY: my Fitbit wellness report has been broken for months and I would like to see how activities impact my health
- Example Use Cases: see the impact of a...
- fitness routine
- medication
- stressful life event
- Practice building a scalable data pipeline with best practices and fault tolerance
- develop a schema based on the Fitbit Data API and analysis needs
- incrementally read each technology's documentation and build the pipeline
- implement pipeline steps with key metrics in mind
- learn from feedback from the course peer review process
- Meet submission deadlines for the DataTalks.club 2025 Course Schedule
- OPTIONALLY Download Your Fitbit Data – Retrieves biometric data from the Fitbit API and stores it in JSON format.
- Flattens JSON Tables and Converts to Parquet – Transforms the JSON files into Parquet format for optimized storage, transmission, and processing (a sketch of the download and flatten steps follows this list).
- Upload Fitbit Data to a Google Cloud Storage Bucket – Transfers the Parquet files to GCS, utilizing the bucket as a Data Lake.
- Create BigQuery Heart Rate, Sleep and Profile Tables – Creates external BigQuery tables for the user profile, sleep, and heart rate data stored in GCS (a sketch of the upload and external-table steps also follows this list).
- Partitions and Transforms Data in BigQuery – Using dbt, injects SQL-based transformations into BigQuery to clean, standardize, and prepare data for analysis.
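The repository's own scripts handle these steps, but as a rough illustration of the download and flatten steps, the sketch below pulls one day of heart-rate data from the Fitbit Web API and flattens the nested JSON into Parquet. The access token and file names are placeholder assumptions, not the project's actual code.

```python
# Illustrative sketch of the download and flatten steps (not the repo's actual modules).
# Assumes a valid Fitbit OAuth2 access token and the requests, pandas, and pyarrow packages.
import json
import requests
import pandas as pd

ACCESS_TOKEN = "YOUR_FITBIT_OAUTH2_TOKEN"  # placeholder
DATE = "2025-03-16"

# Download one day of heart-rate data as JSON from the Fitbit Web API.
resp = requests.get(
    f"https://api.fitbit.com/1/user/-/activities/heart/date/{DATE}/1d.json",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
payload = resp.json()

with open(f"heartrate_{DATE}.json", "w") as f:
    json.dump(payload, f)

# Flatten the nested JSON into a table and write Parquet.
# json_normalize expands nested keys such as "value.restingHeartRate".
records = payload["activities-heart"]
df = pd.json_normalize(records)
df = df.drop(columns=["value.heartRateZones"], errors="ignore")  # drop list column for simplicity
df.to_parquet(f"heartrate_{DATE}.parquet", index=False)  # requires pyarrow
```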
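The upload and external-table steps boil down to a GCS upload plus an external table definition. Below is a minimal sketch using the official google-cloud-storage and google-cloud-bigquery clients; the bucket, dataset, and table names are placeholders, and the repo itself performs these steps through Airflow and Terraform rather than a standalone script.

```python
# Illustrative sketch of the GCS upload and external-table steps
# (bucket, dataset, and table names are placeholders).
from google.cloud import bigquery, storage

BUCKET = "UNIQUE-google-bucket-name"
PROJECT = "google_project_name"

# Push the local Parquet file into the GCS bucket used as the Data Lake.
storage_client = storage.Client(project=PROJECT)
blob = storage_client.bucket(BUCKET).blob("fitbit/heartrate/heartrate_2025-03-16.parquet")
blob.upload_from_filename("heartrate_2025-03-16.parquet")

# Expose the Parquet files in GCS as an external BigQuery table.
bq_client = bigquery.Client(project=PROJECT)
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = [f"gs://{BUCKET}/fitbit/heartrate/*.parquet"]

table = bigquery.Table(f"{PROJECT}.fitbit_raw.heartrate_external")
table.external_data_configuration = external_config
bq_client.create_table(table, exists_ok=True)  # the fitbit_raw dataset must already exist
```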
- Python to connect to the Fitbit API, download the data, and reformat the downloaded JSON files to Parquet
- Apache Airflow orchestrates and schedules the download, reformatting, upload, database transfer, and SQL transformation steps (a minimal DAG sketch follows this list)
- PostgreSQL provides Airflow a database to store workflow metadata about DAGs, tasks, runs, and other elements
- Google BigQuery serves as the analytics data warehouse; table partitioning is done in the dbt staging process
- dbt (Data Build Tool) injects SQL data transformations into BigQuery. Keeping the SQL external allows better version control and makes the SQL code easier to maintain.
- Docker encapsulates the pipeline, ensuring portability.
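For a sense of how these pieces fit together in Airflow, here is a minimal, hypothetical DAG sketch. The task names, callables, and dbt path are assumptions for illustration, not the repo's actual DAG definitions, and it assumes Airflow 2.4+.

```python
# Hypothetical sketch of the orchestration flow; task names, callables, and the
# dbt path are placeholders, not the repo's actual DAG definitions (Airflow 2.4+).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def download_fitbit_data(**_):
    """Pull JSON from the Fitbit API (placeholder)."""


def flatten_to_parquet(**_):
    """Flatten the JSON files and write Parquet (placeholder)."""


def upload_to_gcs(**_):
    """Upload the Parquet files to the GCS bucket (placeholder)."""


def create_bigquery_tables(**_):
    """Create external BigQuery tables over the GCS files (placeholder)."""


with DAG(
    dag_id="fitbit_pipeline_sketch",
    start_date=datetime(2024, 11, 21),
    schedule="@daily",
    catchup=False,
) as dag:
    download = PythonOperator(task_id="download_fitbit_data", python_callable=download_fitbit_data)
    flatten = PythonOperator(task_id="flatten_to_parquet", python_callable=flatten_to_parquet)
    upload = PythonOperator(task_id="upload_to_gcs", python_callable=upload_to_gcs)
    create_tables = PythonOperator(task_id="create_bigquery_tables", python_callable=create_bigquery_tables)
    # The SQL transformations live in the dbt project and run as a shell step.
    dbt_run = BashOperator(task_id="dbt_transformations", bash_command="dbt build --project-dir /opt/dbt")

    download >> flatten >> upload >> create_tables >> dbt_run
```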
Requires: Terraform, a Google Cloud Platform project, and the Google Cloud CLI
- create a service account and download a .json key file (a short Python sanity check of the key and roles is sketched after these setup steps)
- GCP Dashboard -> IAM & Admin -> Service accounts -> Create service account
- set a name & leave all other fields with default values -> Create and continue
- Grant the Viewer role (Basic > Viewer) -> Continue -> Done
- 3 dots below Actions -> Manage keys -> Add key -> Create new key -> JSON -> Create
- Add Cloud Storage & BigQuery permissions to your Service Account
- find your service account at IAM Cloud UI
- use "+Add another role" to add these roles:
- Storage Admin
- Storage Object Admin
- BigQuery Admin
- Viewer
- Enable the IAM API
- Enable the IAM Service Account Credentials API
- Add Compute VM permissions to your Service Account
- find your service account at IAM Cloud UI
- use "+Add another role" to add these roles:
- Compute Instance Admin
- Compute Network Admin
- Compute Security Admin
- Enable the Compute Engine API
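With the roles above attached, an optional sanity check is to load the key file with the Python Google Cloud clients and list buckets and datasets; the key path below is a placeholder, and the calls should succeed without errors if the Storage and BigQuery roles were granted.

```python
# Optional sanity check that the service-account key and roles work.
# The key path is a placeholder; point it at wherever you saved the JSON key.
from google.cloud import bigquery, storage
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file(
    "/path/to/service_credentials.json"
)

# Storage Admin / Storage Object Admin: listing buckets should succeed.
storage_client = storage.Client(credentials=creds, project=creds.project_id)
print([bucket.name for bucket in storage_client.list_buckets()])

# BigQuery Admin: listing datasets should succeed (empty on a fresh project).
bq_client = bigquery.Client(credentials=creds, project=creds.project_id)
print([ds.dataset_id for ds in bq_client.list_datasets()])
```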
git clone https://github.com/MichaelSalata/compare-my-biometrics.git
ssh-keygen -t rsa -b 2048 -C "[email protected]"
NOTE: pick a unique gcs_bucket_name like projectName-fitbit-bucket
credentials = "/path/to/service_credentials.json"
project = "google_project_name"
gcs_bucket_name = "UNIQUE-google-bucket-name"
ssh_user = "[email protected]"
public_ssh_key_path = "~/path/to/id_rsa.pub"
private_ssh_key_path = "~/path/to/id_rsa"
example
credentials = "/home/michael/.google/credentials/google_credentials.json"
project = "dtc-de-1287361"
gcs_bucket_name = "dtc-de-1287361-fb-bucket"
ssh_user = "michael"
public_ssh_key_path = "~/.ssh/id_rsa.pub"
private_ssh_key_path = "~/.ssh/id_rsa"
OPTIONAL: Use YOUR Fitbit Data
Alternatively, you can run the example DAG (parallel_backfill_fitbit_example_data), which uses my example Fitbit data spanning 11-21-2024 to 3-16-2025
cd ./terraform
terraform init
terraform apply
OPTION 1:
- run
bash ./setup_scripts/visit_8080_on_vm.sh
OPTION 2:
- get your Compute Instance's External IP from the Google Cloud VM instances page
- visit External IP:8080
- choose and run the appropriate DAG (or trigger it through Airflow's REST API, as sketched below)
- Ctrl+C will stop Terraform running
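If you prefer not to click through the web UI, Airflow's stable REST API can also trigger the example DAG. The sketch below assumes the port 8080 mentioned above; the External IP and login credentials are placeholders, and basic-auth must be enabled in your Airflow deployment.

```python
# Optional: trigger the example DAG over Airflow's stable REST API instead of the web UI.
# The external IP and credentials are placeholders; adjust them to your deployment.
import requests

AIRFLOW_URL = "http://EXTERNAL_IP:8080"           # the VM's External IP from the GCP console
DAG_ID = "parallel_backfill_fitbit_example_data"  # example DAG mentioned above

resp = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("airflow", "airflow"),  # replace with your Airflow web login
    json={"conf": {}},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])
```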
- These commands will destroy the resources your service account provisioned
cd ./terraform
terraform destroy
Thanks to Alexey, Manuel, and the DataTalks.club community. Their Data Engineering Course was instrumental in creating this project.
- get the project hosted in the cloud ✅ 2025-04-07
- make the pipeline idempotent
- implement CI/CD
- handle secure user data with Airflow Secrets

