
Keep It Simple and Scalable

This repository hosts the material for a workshop given at Small Data SF 2025.

During this workshop, you will use the Python library dlt to build an extract, load, transform (ELT) pipeline for the official GitHub REST API.

We'll go through the full lifecycle of a data project:

  1. Load data from a REST API
  2. Ensure data quality via manual exploration and checks
  3. Transform raw data into clean data and metrics
  4. Build a data product (e.g., report, web app)
  5. Deploy the pipeline and data product

We'll introduce and suggest several tools throughout the workshop: dlt, LLM scaffolding, Continue, DuckDB, MotherDuck, marimo, ibis, and more!
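
To give a concrete taste of step 1, here is a minimal sketch of what a dlt pipeline against the GitHub REST API can look like, using dlt's built-in rest_api source. The repository path and resource name are illustrative only; the workshop exercises may use different endpoints.

    # Minimal sketch: load one GitHub endpoint into a local DuckDB file.
    # The repo path and resource name below are illustrative only.
    import dlt
    from dlt.sources.rest_api import rest_api_source

    source = rest_api_source({
        "client": {"base_url": "https://api.github.com/"},
        "resources": [
            {"name": "issues", "endpoint": {"path": "repos/dlt-hub/dlt/issues"}},
        ],
    })

    pipeline = dlt.pipeline(
        pipeline_name="github_workshop",
        destination="duckdb",
        dataset_name="github_data",
    )
    print(pipeline.run(source))  # prints a summary of the load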

Workshop format

This workshop will alternate between:

  • Tutorial: speakers explain and demonstrate concepts
  • Exercise: participants code to solve a task

Most exercises are open-ended, and participants are invited to explore their own path (e.g., ingest data from different endpoints). It's also possible to follow along with the speaker during exercise segments.

To avoid getting stuck, this repository includes several checkpoints to resume from.

Repository structure

All of the workshop material is in this repository. A brief overview:

  • README.md contains all of the written instructions for the workshop. See the Setup section below for installation instructions
  • pyproject.toml and uv.lock define Python dependencies
  • .continue/ contains MCP configuration for the Continue IDE extension
  • .cursor/ contains MCP configuration for the Cursor IDE
  • .vscode/ contains MCP configuration for the GitHub Copilot extension

Setup

  1. Start by cloning this repository on your local machine

    git clone https://github.com/dlt-hub/small-data-sf-2025

  2. Move to the repository directory

    cd small-data-sf-2025

Python environment

  1. Create a virtual Python environment and activate it

    # on Linux & macOS
    python -m venv .venv && source .venv/bin/activate
    # on Windows
    python -m venv .venv && .venv\Scripts\activate
  2. Install Python dependencies (they are declared in pyproject.toml; if you use uv, uv sync also works)

    pip install -r requirements.txt
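
To confirm the installation worked, check that the dlt CLI is available:

    dlt --version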

dltHub

During the workshop, we will use the Python library dlt. It is open source and under the Apache 2.0 license. We will also use the Python library dlthub, which includes paid features. A 30-day trial license can be self-issued anonymously for development, education, and CI purposes.

  1. Set up the Python environment, which includes dlt and dlthub
  2. Self-issue a license for dlthub and the specified features:

    dlt license issue dlthub.transformation

  3. The command should automatically store the token. If it prints a warning, follow the instructions.
  4. Verify the token is properly set with dlt license info. The result should look like:
    Searching dlt license in environment or secrets toml
    License found
    
    License Id: 736xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx
    Licensee: machine:4366xxxxxxxxxxxxxxxxxxxxxxxxxxxx
    Issuer: dltHub Inc.
    License Type: self-issued-trial
    Issued: 2025-11-xx xx:xx:xx
    Scopes: dlthub.transformation
    Valid Until: 2025-12-xx xx:xx:xx

GitHub REST API

The workshop will focus on using public data from the official GitHub REST API. The REST API is free to use, but you need a GitHub token to make requests to most endpoints.

  1. Log in to GitHub
  2. Go to https://github.com/settings/apps
  3. Select Personal access tokens > Tokens (classic)
  4. Click Generate new token > Generate new token (classic)
  5. Set a note. You don't need to select any scope.
  6. Click Generate token
  7. Store the token value securely. We will add it to .dlt/secrets.toml during the workshop.
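
If you want to sanity-check the token before the workshop, a short standard-library-only script against GitHub's rate limit endpoint is enough (the placeholder below is where your token goes; keep it out of version control):

    # Optional sanity check for the GitHub token (standard library only).
    import urllib.request

    TOKEN = "<your GitHub token>"  # placeholder; do not commit real tokens
    req = urllib.request.Request(
        "https://api.github.com/rate_limit",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status)  # 200 means GitHub accepted the token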


MotherDuck

Most of the workshop happens on your local machine. We'll use MotherDuck during the later steps to show how to go from local development to production. You can sign up via email, Google, or GitHub. The free tier is sufficient for the workshop (you might receive a free business trial on first signup).

  1. Log in or sign up at https://app.motherduck.com/
  2. (on signup) Go through the onboarding flow. Look out for the Skip button.
  3. Go to https://app.motherduck.com/settings/tokens
  4. Click Create token to generate a Read/Write Token
  5. Store the token value securely. We will add it to .dlt/secrets.toml during the workshop.
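
Once the token is in .dlt/secrets.toml, switching a pipeline from local DuckDB to MotherDuck is essentially a one-word change, since motherduck is a built-in dlt destination. A sketch, reusing the pipeline from the introduction:

    # Sketch: the same pipeline, now loading to MotherDuck instead of a
    # local DuckDB file. dlt reads the MotherDuck token from .dlt/secrets.toml.
    import dlt

    pipeline = dlt.pipeline(
        pipeline_name="github_workshop",
        destination="motherduck",
        dataset_name="github_data",
    )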

Continue

During the workshop, we will use the VSCode extension Continue to build data pipelines using dlt, LLMs, and MCP servers. It is open source and under the Apache 2.0 license. You will need an LLM API key to use it (OpenAI, Anthropic, Mistral, etc.). Using self-hosted LLMs is also possible.

Note: If you already have a subscription to Cursor, Copilot, Windsurf, etc., you can follow along with those tools instead; the interface and configuration will differ slightly.

  1. Install VSCode https://code.visualstudio.com/
  2. Inside VSCode, go to the Extensions tab
  3. Search for continue and install the extension (by continue.dev; identifier continue.continue)
  4. Go to the Continue chat panel. You can find it by opening the command palette (CTRL + SHIFT + P) and executing Continue: Focus Continue Chat
  5. Under Models, set your LLM API key. We suggest selecting models X or Y (TODO: complete steps)
  6. Under Tools, you should see the MCP server loaded with some tools if you properly set up your Python environment.
