This repository hosts the material for a workshop given at Small Data SF 2025.
During this workshop, you will use the Python library dlt to build an extract, load, transform (ELT) pipeline for the official GitHub REST API.
We'll go through the full lifecycle of a data project:
- Load data from a REST API
- Ensure data quality via manual exploration and checks
- Transform raw data into clean data and metrics
- Build a data product (e.g., report, web app)
- Deploy the pipeline and data product
We'll introduce and suggest several tools throughout the workshop: dlt, LLM scaffolding, Continue, duckdb, Motherduck, marimo, ibis, and more!
This workshop will alternate between:
- Tutorial: speakers explain and demonstrate concepts
- Exercise: participants code to solve a task
Most exercises are open-ended, and participants are invited to explore their own path (e.g., ingest data from different endpoints). It's also possible to follow along with the speaker during exercise segments.
To avoid getting stuck, this repository includes several checkpoints to resume from.
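As a preview of that lifecycle, here is a stdlib-only sketch of the extract-load-transform pattern. It uses mock records (shaped loosely like GitHub issues, not real API output) and `sqlite3` standing in for the DuckDB/MotherDuck and dlt tooling used in the workshop:

```python
import sqlite3

# "Extract": mock records shaped like issues from the GitHub REST API
# (fields and values are illustrative, not real API output).
raw_issues = [
    {"id": 1, "state": "open", "comments": 3},
    {"id": 2, "state": "closed", "comments": 0},
    {"id": 3, "state": "open", "comments": 5},
]

# "Load": write the raw records into a local database
# (sqlite3 stands in for the DuckDB/MotherDuck used in the workshop).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE issues (id INTEGER, state TEXT, comments INTEGER)")
con.executemany("INSERT INTO issues VALUES (:id, :state, :comments)", raw_issues)

# "Transform": derive a simple metric from the raw table.
open_issues, total_comments = con.execute(
    "SELECT COUNT(*), SUM(comments) FROM issues WHERE state = 'open'"
).fetchone()
print(open_issues, total_comments)  # → 2 8
```

In the workshop, dlt replaces the hand-written extract and load steps (including schema inference and pagination), but the overall shape stays the same.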
All of the workshop material is in this repository. A brief overview:
- `README.md` contains all of the written instructions for the workshop. See the `## Setup` section for installation instructions.
- `pyproject.toml` and `uv.lock` define Python dependencies.
- `.continue/` contains MCP configuration for the Continue IDE extension.
- `.cursor/` contains MCP configuration for the Cursor IDE.
- `.vscode/` contains MCP configuration for the GitHub Copilot extension.
- Start by cloning this repository on your local machine:

  ```sh
  git clone https://github.com/dlt-hub/small-data-sf-2025-
  ```

- Move to the repository directory:

  ```sh
  cd small-data-sf-2025-
  ```

- Create a virtual Python environment and activate it:

  ```sh
  # on Linux & macOS
  python -m venv .venv && source .venv/bin/activate

  # on Windows
  python -m venv .venv && .venv\Scripts\activate
  ```

- Install Python dependencies:

  ```sh
  pip install -r requirements.txt
  ```
During the workshop, we will use the Python library dlt. It is open source and under the Apache 2.0 license. We will also use the Python library dlthub, which includes paid features. A 30-day trial license can be self-issued anonymously for development, education, and CI operations purposes.
- Set up the Python environment, which includes `dlt` and `dlthub`.
- Self-issue a license for `dlthub` and the specified features:

  ```sh
  dlt license issue dlthub.transformation
  ```

  It should automatically store the token. If it prints a warning, follow the instructions.
- Verify the token is properly set with `dlt license info`. The result should look like:

  ```
  Searching dlt license in environment or secrets toml
  License found
  License Id: 736xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx
  Licensee: machine:4366xxxxxxxxxxxxxxxxxxxxxxxxxxxx
  Issuer: dltHub Inc.
  License Type: self-issued-trial
  Issued: 2025-11-xx xx:xx:xx
  Scopes: dlthub.transformation
  Valid Until: 2025-12-xx xx:xx:xx
  ```
The workshop will focus on using public data from the official GitHub REST API. The REST API is free to use, but you need a GitHub token to make requests to most endpoints.
- Log in to GitHub
- Go to https://github.com/settings/apps
- Select `Personal access tokens > Tokens (classic)`
- Click `Generate new token > Generate new token (classic)`
- Set a `note`. You don't need to select any `scope`.
- Click `Generate token`
- Store the token value securely. We will add it to `.dlt/secrets.toml` during the workshop.
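For reference, `.dlt/secrets.toml` is dlt's local secrets file. A sketch of how the GitHub token might be stored there; the exact section and key names depend on how the source is defined in the workshop code, so treat them as assumptions:

```toml
# .dlt/secrets.toml — sketch only; section/key names are assumptions
[sources.github]
access_token = "ghp_xxxxxxxxxxxxxxxxxxxx"  # placeholder, not a real token
```

This file should never be committed to version control.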
The vast majority of the workshop will happen on your local machine. We'll use Motherduck during the later steps to show how to go from local development to production. You can sign up via email, Google, or GitHub. The free tier is sufficient for the workshop (you might receive a free business trial on first signup).
- Log in or sign up to Motherduck: https://app.motherduck.com/
- (On signup) Go through the onboarding flow. Look out for the `Skip` button.
- Go to https://app.motherduck.com/settings/tokens
- Click `Create token` to generate a `Read/Write Token`
- Store the token value securely. We will add it to `.dlt/secrets.toml` during the workshop.
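As with the GitHub token, a sketch of how the Motherduck token might end up in `.dlt/secrets.toml`; the section path and key names follow dlt's usual destination-credentials layout but should be treated as assumptions until confirmed in the workshop:

```toml
# .dlt/secrets.toml — sketch only; section/key names are assumptions
[destination.motherduck.credentials]
password = "your-motherduck-token"  # placeholder for the Read/Write Token
database = "my_db"                  # hypothetical database name
```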
During the workshop, we will use the VSCode extension Continue to build data pipelines using dlt, LLMs, and MCP servers. It is open source and under the Apache 2.0 license. You will need an LLM API key to use it (OpenAI, Anthropic, Mistral, etc.). Using self-hosted LLMs is also possible.
Note. If you already have a subscription with Cursor, Copilot, Windsurf, etc., you will be able to follow along with these tools. The interface and configuration will differ slightly though.
- Install VSCode: https://code.visualstudio.com/
- Inside VSCode, go to the `Extensions` tab
- Search for `continue` and install the extension (by `continue.dev`, identifier `continue.continue`)
- Go to the Continue chat panel. You can find it by opening the command palette (`CTRL + SHIFT + P`) and executing `Continue: Focus Continue Chat`
- Under `Models`, set your LLM API key. We suggest selecting models X or Y. TODO complete steps
- Under `Tools`, you should see the MCP server loaded with some tools if you properly set up your Python environment.