lea is a minimalist alternative to SQL orchestrators like dbt and SQLMesh.
lea aims to be simple and provides sane defaults. We happily use it every day at Carbonfact to manage our BigQuery data warehouse. We will actively maintain it and add features, while welcoming contributions.
Examples:

- Jaffle shop 🥪
- Incremental
- School 🏫
- Compare development to production 👯‍♀️
- Using MotherDuck 🦆
 
Install the CLI with pip:

```sh
pip install lea-cli
```

This installs the `lea` command. It also makes the `lea` Python library available.
lea is configured via environment variables.
For BigQuery:

```sh
# Required
LEA_WAREHOUSE=bigquery
# Required
LEA_BQ_LOCATION=EU
# Required
LEA_BQ_DATASET_NAME=kaya
# Required, the project where the dataset is located
LEA_BQ_PROJECT_ID=carbonfact-dwh
# Optional, allows using a different project for compute
LEA_BQ_COMPUTE_PROJECT_ID=carbonfact-dwh-compute
# Not necessary if you're logged in with the gcloud CLI
LEA_BQ_SERVICE_ACCOUNT=<JSON dump of the service account file>  # not a path ⚠️
# Defaults to https://www.googleapis.com/auth/bigquery
LEA_BQ_SCOPES=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/drive
# LOGICAL or PHYSICAL, defaults to PHYSICAL
LEA_BQ_STORAGE_BILLING_MODEL=PHYSICAL
```

For DuckDB:

```sh
# Required
LEA_WAREHOUSE=duckdb
# Required
LEA_DUCKDB_PATH=duckdb.db
# Optional
LEA_DUCKDB_EXTENSIONS=parquet,httpfs
```

For MotherDuck:

```sh
# Required
LEA_WAREHOUSE=motherduck
# Required
MOTHERDUCK_TOKEN=<get this from https://app.motherduck.com/settings/tokens>
# Required
LEA_MOTHERDUCK_DATABASE=bike_sharing
# Optional
LEA_DUCKDB_EXTENSIONS=parquet,httpfs
```

For DuckLake:

```sh
# Required
LEA_WAREHOUSE=ducklake
# Required
LEA_DUCKLAKE_DATA_PATH=gcs://bike-sharing-analytics
# Required
LEA_DUCKLAKE_CATALOG_DATABASE=metadata.ducklake
# Optional
LEA_DUCKLAKE_S3_ENDPOINT=storage.googleapis.com
# Optional
LEA_DUCKDB_EXTENSIONS=parquet,httpfs
```

DuckLake needs a database to manage metadata, which is what LEA_DUCKLAKE_CATALOG_DATABASE is for.
These parameters can be provided in a .env file, or directly in the shell. Each command also has an `--env` flag to provide a path to a .env file.
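For instance, here's a minimal .env file for the DuckDB configuration shown above:

```sh
# .env
LEA_WAREHOUSE=duckdb
LEA_DUCKDB_PATH=duckdb.db
```

It can then be passed explicitly to any command:

```sh
lea run --env /path/to/.env
```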
This is the main command. It runs the SQL queries stored in the scripts directory:

```sh
lea run
```

You can indicate the directory where the scripts are stored:

```sh
lea run --scripts /path/to/scripts
```

The scripts are run concurrently. They are organized in a DAG, which is traversed in topological order. The DAG's structure is determined automatically by analyzing the dependencies between queries.
Each script is expected to be placed under a schema, represented by a directory. Schemas can have sub-schemas. Here's an example:
```
scripts/
    schema_1/
        table_1.sql
        table_2.sql
    schema_2/
        table_3.sql
        table_4.sql
        sub_schema_2_1/
            table_5.sql
            table_6.sql
```
Each script is materialized into a table. The table is named according to the script's name, following the warehouse convention.
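For instance, lea would infer that the following script depends on `schema_1.table_1` just by parsing the query. The column names are made up for the sketch:

```sql
-- scripts/schema_2/table_3.sql
SELECT
    id,
    amount
FROM schema_1.table_1
```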
A script may contain multiple SQL statements, separated by semicolons. The last statement has to be a SELECT statement, while the previous ones are expected to be procedural statements (e.g. DECLARE and SET in BigQuery). The way it works under the hood is by creating a session for the script, and running all statements within the same session.
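For instance, here's what a multi-statement BigQuery script might look like. It's a sketch: the variable and the source table are made up:

```sql
-- Procedural statements come first
DECLARE cutoff DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

-- The last statement must be a SELECT; it's what gets materialized
SELECT
    order_id,
    placed_at
FROM staging.orders
WHERE placed_at >= cutoff
```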
SQL queries can be templated with Jinja. A .sql.jinja extension is necessary for lea to recognise them.
You have access to an env variable within the template context, which is simply an access point to os.environ.
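For instance, here's a hypothetical users.sql.jinja script that reads a made-up COUNTRY_CODE environment variable through `env`:

```sql
SELECT
    user_id,
    country
FROM staging.users
WHERE country = '{{ env.COUNTRY_CODE }}'
```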
By default, lea creates an isolation layer with production. The way this is done depends on your warehouse:

- BigQuery: by appending a `_<user>` suffix to schema names
- DuckDB: by adding a `_<user>` suffix to the database file name

In other words, a development environment is used by default. Use the `--production` flag when executing `lea run` to disable this behaviour and instead target the production environment:

```sh
lea run --production
```

The `<user>` is determined automatically from the login name. It can be overridden by setting the LEA_USERNAME environment variable.
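For example, with LEA_USERNAME=max and the DuckDB configuration from above, the naming works out roughly as follows (a sketch):

```sh
lea run               # development: writes to duckdb_max.db
lea run --production  # production: writes to duckdb.db
```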
A single script can be run:

```sh
lea run --select core.users
```

Several scripts can be run:

```sh
lea run --select core.users --select core.orders
```

Similar to dbt, lea also supports graph operators:

```sh
lea run --select core.users+   # users and everything that depends on it
lea run --select +core.users   # users and everything it depends on
lea run --select +core.users+  # users with both its ancestors and descendants
```

You can select all scripts in a schema:

```sh
lea run --select core/  # the trailing slash matters
```

This also works with sub-schemas:

```sh
lea run --select analytics.finance/
```

There are thus 8 possible operators:
```
schema.table    (table by itself)
schema.table+   (table with its descendants)
+schema.table   (table with its ancestors)
+schema.table+  (table with its ancestors and descendants)
schema/         (all tables in schema)
schema/+        (all tables in schema with their descendants)
+schema/        (all tables in schema with their ancestors)
+schema/+       (all tables in schema with their ancestors and descendants)
```
Combinations are possible:
```sh
lea run --select core.users+ --select +core.orders
```

There's an Easter egg that allows choosing scripts that have been committed or modified in the current Git branch:

```sh
lea run --select git
lea run --select git+  # includes all descendants
```

This becomes very handy when using lea in continuous integration.
WAP is a data engineering pattern that ensures data consistency and reliability. It's the data engineering equivalent of blue-green deployment in the software engineering world.
lea follows the WAP pattern by default. When you execute lea run, it actually creates temporary tables with an ___audit suffix. Once all of them have been materialized without errors, these audit tables are promoted to replace the existing tables.
This is a good default behavior. Let's say you refresh table foo. Then you refresh table bar that depends on foo. If the refresh of bar fails, you're left with a corrupt state. This is what the WAP pattern solves. In WAP mode, when you run foo's script, it creates a foo___audit table. If bar's script fails, then the run stops and foo is not modified.
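Schematically, a failed run looks something like this (an illustration, with made-up table names):

```
lea run
├── core.foo → core.foo___audit  ✓ materialized
└── core.bar → core.bar___audit  ✗ query fails
    promotion is skipped: core.foo and core.bar remain as they were
```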
There is no lea test command. Tests are run together with the regular script when lea run is executed. The run stops whenever a test fails.
There are two types of tests:
- Singular tests: queries which return failing rows. They are stored in a `tests` directory; an example is sketched after this list.
- Assertion tests: comment annotations in the queries themselves:
  - `#NO_NULLS` checks that all values in a column are not null.
  - `#UNIQUE` checks that a column's values are unique.
  - `#UNIQUE_BY(<by>)` checks that a column's values are unique within a group.
  - `#SET{<elements>}` checks that a column's values are in a set of values.
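As an illustration, here's what the singular test tests/check_n_users.sql selected later in this section might contain. Only the convention of returning failing rows comes from lea; the body is a made-up example:

```sql
-- tests/check_n_users.sql
-- The test fails if any row is returned, i.e. if core.users is empty
SELECT COUNT(*) AS n_users
FROM core.users
HAVING COUNT(*) = 0
```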
 
Here's an example of a query annotated with assertion tests:
```sql
SELECT
    -- #UNIQUE
    -- #NO_NULLS
    user_id,
    -- #NO_NULLS
    address,
    -- #UNIQUE_BY(address)
    full_name,
    -- #SET{'A', 'B', 'AB', 'O'}
    blood_type
FROM core.users
```

You can run a single test via the `--select` flag:
```sh
lea run --select tests.check_n_users
```

Or even run all the tests, like so:

```sh
lea run --select tests/  # the trailing slash matters
```

Note that when you run a script that is not a test, all the applicable tests are run as well. For instance, the following command will run the `core.users` script along with all the tests that apply to it:

```sh
lea run --select core.users
```

You may decide to run all scripts without executing tests, which is obviously not advisable:

```sh
lea run --unselect tests/
lea run --select core.users --unselect tests/
```

When you call `lea run`, it generates audit tables, which are then promoted to replace the original tables. This is done to ensure the data is consistent and reliable. lea doesn't run a script when its audit table already exists and the script hasn't been modified since the audit table was created. This avoids unnecessary re-runs of scripts that haven't changed.
For instance:

- You execute `lea run` to sync all tables from sources; no errors, all tables are materialized.
- You modify a script named core/expenses.sql, which depends on staging/customers.sql and staging/orders.sql.
- You execute `lea run --select core.expenses+` to run all impacted tables again.
- core__expenses___audit is materialized in your data warehouse, but the -- #NO_NULLS assertion test on one of its columns fails.
- After reviewing the data in core__expenses___audit, you edit core/expenses.sql to filter out the rows where NULLs appear.
- You execute `lea run`.
- The staging/customers.sql and staging/orders.sql scripts are skipped, because they haven't been modified since staging__customers and staging__orders were last materialized.
- The core/expenses.sql script is run, because it was modified after core__expenses was last materialized.
- All audit tables are wiped from the database, as the whole DAG has run successfully! 🎉
 
You can disable this behavior altogether:
```sh
lea run --restart
```

At Carbonfact, we cluster most of our tables by customer. This is done to optimize query performance and reduce costs. lea allows you to automatically cluster tables that contain a given field:

```sh
LEA_BQ_DEFAULT_CLUSTERING_FIELDS=account_slug
```

You can also specify multiple fields, meaning that tables which contain both fields will be clustered:

```sh
LEA_BQ_DEFAULT_CLUSTERING_FIELDS=account_slug,brand_slug
```

For each table, lea will use the clustering fields it can and ignore the others. With the previous configuration, if your table defines `account_slug` but not `brand_slug`, it will be clustered by `account_slug`.
You can also define clustering fields for a specific table:
```sql
SELECT
  account_slug,
  -- #CLUSTERING_FIELD
  object_kind,
  value
FROM my_table
```
If you define specific clustering fields for a table, they will be added in addition to the default ones. This is done to ensure that project-wide clustering fields are kept up to date in each table.
Big Blue is a SaaS product to monitor and optimize BigQuery costs. As part of their offering, they provide a Pick API. The idea is that some queries should be run on-demand, while others should be run on a reservation. Big Blue's Pick API suggests which billing model to use for each query.
We use this at Carbonfact, and so this API is available out of the box in lea. You can enable it by setting the following environment variables:
```sh
LEA_BQ_BIG_BLUE_PICK_API_KEY=<get this from https://your-company.biq.blue/settings.html>
LEA_BQ_BIG_BLUE_PICK_API_URL=https://pick.biq.blue
LEA_BQ_BIG_BLUE_PICK_API_ON_DEMAND_PROJECT_ID=on-demand-compute-project-id
LEA_BQ_BIG_BLUE_PICK_API_RESERVATION_PROJECT_ID=reservation-compute-project-id
```

Feel free to reach out to [email protected] if you want to know more and/or contribute.
We have suggested some issues as good places to get started.
lea is free and open-source software licensed under the Apache License, Version 2.0.