This repository contains my personal data pipeline, built with Prefect to automate data extraction, transformation, and loading (ETL). The pipeline runs on my NAS, integrates multiple data sources, and exports the processed results to MotherDuck for analysis in Metabase.
The pipeline serves two main purposes:
- Automation: Automates repetitive data tasks by extracting data from APIs, cloud services, and other sources.
- Analysis: Loads transformed data into MotherDuck, where Metabase connects for visualization and insights.
The pipeline is orchestrated with Prefect, using flows and tasks to manage dependencies and execution. The scheduling logic is defined in `schedules/hourly.py`, ensuring that jobs run on an hourly basis. The execution happens on my NAS.
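As a rough illustration of that flow/task pattern, here is a minimal sketch with hypothetical task names and placeholder bodies; it is not the real code, which lives in `vtasks/schedules/hourly.py`:

```python
# Minimal sketch of the flow/task pattern, NOT the real code:
# task bodies are placeholders and names are hypothetical.
from prefect import flow, task


@task
def extract_prices() -> dict:
    # Stand-in for an extraction step (e.g. a crypto price API call)
    return {"BTC": 97000.0}


@task
def load_to_motherduck(data: dict) -> None:
    # Stand-in for the MotherDuck export step
    print(f"would load {len(data)} rows")


@flow(log_prints=True)
def hourly() -> None:
    load_to_motherduck(extract_prices())


if __name__ == "__main__":
    # Serve the flow on an hourly cron schedule (Prefect 2.10+ / 3.x)
    hourly.serve(name="hourly", cron="0 * * * *")
```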
- Extract: Data is sourced from various integrations, including Dropbox, Google Sheets, APIs, and local files.
- Load: Processed data is exported to MotherDuck for storage and analysis (see the sketch after this list).
- Transform: Data processing is primarily handled using `dbt`, with additional transformations as needed.
- Visualization: Metabase connects to MotherDuck to generate insights and reports.
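For context on the load step, here is a minimal sketch of pushing a pandas DataFrame to MotherDuck through `duckdb`; the `analytics` database and `crypto_prices` table are assumptions, not the project's actual schema:

```python
# Hedged sketch of the load step: database and table names are
# hypothetical, and authentication relies on the MOTHERDUCK_TOKEN
# environment variable being set.
import duckdb
import pandas as pd


def export_table(df: pd.DataFrame, table: str) -> None:
    # The "md:" prefix tells duckdb to connect to MotherDuck
    con = duckdb.connect("md:analytics")
    con.register("incoming", df)  # expose the DataFrame to SQL
    con.execute(f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM incoming")
    con.close()


export_table(
    pd.DataFrame({"symbol": ["BTC"], "price_usd": [97000.0]}),
    "crypto_prices",
)
```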
```
vtasks
├── common          # Shared utilities and helpers
├── jobs            # Individual data extraction and processing jobs
│   ├── backups     # Backup management
│   ├── crypto      # Crypto price tracking
│   ├── dropbox     # Dropbox integrations
│   ├── gcal        # Google Calendar data processing
│   ├── gsheets     # Google Sheets integrations
│   └── indexa      # Indexa Capital data extraction
├── schedules       # Prefect scheduling logic
│   └── hourly.py   # Main schedule triggering all jobs hourly
└── vdbt            # dbt project for data modeling
```
> [!NOTE]
> This project uses UV for dependency management and execution.
- Install UV (if not installed): `pip install uv`
- Set up the virtual environment: `uv venv .venv`
- Install dependencies (in editable mode for local development): `uv pip install --editable .`
- Ensure pre-commit hooks are installed: `pre-commit install`
To run the main hourly schedule manually:

```
uv run python -m vtasks.schedules.hourly
```
To run a specific job, use module-based execution:

```
uv run python -m vtasks.jobs.dropbox.export_tables
```

This ensures that relative imports work correctly.
> [!NOTE]
> This pipeline is deployed manually to my NAS, since I prefer not to grant GitHub Actions access to it. This gives me better control and security over the deployment process.
- Ensure you are connected to Tailscale.
- Set the Prefect API URL to point to the NAS.
- Deploy all flows manually using Prefect:

```
set PREFECT_API_URL=http://tnas:6006/api
prefect --no-prompt deploy --all
```
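The `prefect --no-prompt deploy --all` command registers every deployment defined in `prefect.yaml`. A hypothetical excerpt of one such entry might look like this (the entrypoint, schedule, and work pool name are illustrative, not the actual file):

```yaml
# Hypothetical prefect.yaml excerpt (illustrative, not the real file)
deployments:
  - name: hourly
    entrypoint: vtasks/schedules/hourly.py:hourly
    schedules:
      - cron: "0 * * * *"
    work_pool:
      name: default-pool
```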
To ensure all files reflect the correct version, use `bump2version`:

```
bump2version patch  # Or major/minor
```

This automates version updates across `prefect.yaml`, `pyproject.toml`, and `uv.lock`.
This repository is licensed under MIT.