This repository serves as a local data platform for working with MTA data. It encompasses the following key functionalities:
- Data Ingestion: Fetches data from the Socrata API.
- Data Cleaning: Performs necessary cleaning and preprocessing of the ingested data.
- SQL Transformation Pipeline: Executes a series of SQL transformations to prepare the data for analysis.
- Data Visualization: Generates insights and visualizes them through a data application.
This end-to-end workflow enables efficient data processing and insight generation from MTA datasets.
# What does this repo use
This project assumes you are using a code IDE, either locally (such as VS Code) or in the cloud with GitHub Codespaces. To use Codespaces, make a free GitHub account, click the green Code button at the top of this repo, and select Codespaces.
Before proceeding, you will need to install `uv`. You can install it via pip:

```bash
pip install uv
```
Alternatively, follow the instructions here: Install UV.
Once `uv` is installed, clone the repository:

```bash
git clone https://github.com/ChristianCasazza/mtadata
```
You can also give the repo a custom directory name by adding it at the end:

```bash
git clone https://github.com/ChristianCasazza/mtadata custom_name
```
Then, navigate into the repository directory:

```bash
cd custom_name
```
This repository includes two setup scripts:
- `setup.sh`: for Linux/macOS
- `setup.bat`: for Windows
These scripts automate the following tasks:
- Create and activate a virtual environment using `uv`.
- Install project dependencies.
- Ask for your Socrata App Token (`SOCRATA_API_TOKEN`). If no key is provided, the script will use the community key: `uHoP8dT0q1BTcacXLCcxrDp8z`.
  - Important: The community key is shared and rate-limited. Please use your own key if possible. You can obtain one in two minutes by signing up here and following these instructions.
- Copy `.env.example` to `.env` and append `SOCRATA_API_TOKEN` to the file.
- Dynamically generate the `LAKE_PATH` variable for your system and append it to `.env` (a sketch of the resulting file appears after this list).
- Start the Dagster development server.
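For reference, a minimal sketch of what `.env` might contain after setup (the values shown are placeholders; `LAKE_PATH` will be an absolute path specific to your machine):

```
SOCRATA_API_TOKEN=your_app_token_here
LAKE_PATH=/absolute/path/to/mtastats.duckdb
```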
On Linux/macOS, run:

```bash
./setup.sh
```

If you encounter a `Permission denied` error, ensure the script is executable by running:

```bash
chmod +x setup.sh
```
On Windows, run:

```bash
setup.bat
```

If PowerShell does not recognize the script, ensure you're in the correct directory and prefix the script name with `.\`:

```bash
.\setup.bat
```
The script will guide you through the setup interactively. Once complete, your `.env` file will be configured, and the Dagster server will be running.
After the setup script finishes, you can access the Dagster web UI. The script will display a URL in the terminal. Click on the URL or paste it into your browser to access the Dagster interface.
- In the Dagster web UI, click on the Assets tab in the top-left corner.
- Then, in the top-right corner, click on View Global Asset Lineage.
- In the top-right corner, click Materialize All to start downloading and processing all of the data.
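Under the hood, each node in the lineage graph is a Dagster asset. As a rough, illustrative sketch only (not the repo's actual code; the asset name and body are placeholders), a raw dataset asset looks something like this:

```python
from dagster import asset

@asset
def mta_hourly_subway_socrata():
    # Illustrative body: fetch the raw dataset from its source API and
    # write it as parquet files under data/opendata/nyc/mta/nyc/.
    ...
```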
Materializing all assets will execute the following pipeline:
- Ingest MTA data from the Socrata API, weather data from the Open-Meteo API, and the 67M-row hourly subway dataset from R2 as parquet files in `data/opendata/nyc/mta/nyc/`.
- Create a DuckDB file with views on each raw dataset's parquet files (see the sketch after this list).
- Execute a SQL transformation pipeline with DBT on the raw datasets.
The entire pipeline should take 2-5 minutes, with most of the time spent ingesting the large hourly dataset.
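To make the DuckDB step concrete, here is a minimal sketch of registering a raw dataset's parquet files as a view (the database filename and per-dataset directory layout below are assumptions; the repo's actual script may differ):

```python
import duckdb

# Open (or create) the lake's DuckDB file, then expose the raw parquet
# files as a view so the SQL transformations can read them in place.
con = duckdb.connect("mtastats.duckdb")
con.execute("""
    CREATE OR REPLACE VIEW mta_hourly_subway_socrata AS
    SELECT *
    FROM read_parquet('data/opendata/nyc/mta/nyc/mta_hourly_subway_socrata/*.parquet')
""")
con.close()
```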
- `SOCRATA_API_TOKEN`: If you use the community key, you may encounter rate limits. It's strongly recommended to use your own key.
- `LAKE_PATH`: This variable is dynamically generated by `exportpath.py` and added to your `.env` file during setup. It represents the location of the DuckDB file for the DBT transformations (a rough sketch of this step follows).
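As a rough illustration of what `exportpath.py` plausibly does (an assumption based on its name and role, not a copy of the script), generating an absolute `LAKE_PATH` might look like:

```python
from pathlib import Path

# Resolve the DuckDB file's absolute location on this machine and append
# it to .env so downstream tools can find it from any working directory.
lake_path = Path("mta/mtastats/sources/mta/mtastats.duckdb").resolve()
with open(".env", "a") as f:
    f.write(f"LAKE_PATH={lake_path}\n")
```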
After your pipeline run finishes, you can run a local UI to browse the downloaded datasets and view their schemas. Setting the toggle to LLM mode makes it easier to copy and paste a table into an LLM.

To run the UI, open a new terminal alongside your existing Dagster instance. Then, run the command:

```bash
uv run scripts/create.py app
```
Harlequin is a terminal-based local SQL editor. To install it, open a new terminal and run:

```bash
pip install harlequin
```

Then use it to connect to the DuckDB file we created with `scripts/create.py`:

```bash
harlequin mta/mtastats/sources/mta/mtastats.duckdb
```
The DuckDB file will already have views on the tables, so it can be queried directly. For example:
```sql
SELECT
    COUNT(*) AS total_rows,
    MIN(transit_timestamp) AS min_transit_timestamp,
    MAX(transit_timestamp) AS max_transit_timestamp
FROM mta_hourly_subway_socrata;
```
This query will return the total number of rows, the earliest timestamp, and the latest timestamp in the dataset.
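You can also run the same sanity check from Python instead of Harlequin (assuming the `duckdb` and `pandas` packages are installed; the file path matches the Harlequin command above):

```python
import duckdb

# Open the lake read-only and check the row count and timestamp range.
con = duckdb.connect("mta/mtastats/sources/mta/mtastats.duckdb", read_only=True)
print(con.execute("""
    SELECT
        COUNT(*) AS total_rows,
        MIN(transit_timestamp) AS min_transit_timestamp,
        MAX(transit_timestamp) AS max_transit_timestamp
    FROM mta_hourly_subway_socrata
""").fetchdf())
```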
The `DuckDBWrapper` class provides a simple interface to interact with DuckDB, allowing you to register data files (Parquet, CSV, JSON), execute queries, and export results in multiple formats.
In the top-right corner of your notebook, select your `.venv` under Python environments. If you're using VS Code, it may suggest installing the Jupyter and Python extensions.
Then, in the notebook, you just need to run the first two cells. The first cell will load the `DuckDBWrapper` class. In the second cell, you can initialize a `DuckDBWrapper` instance, either in memory or backed by a file:

```python
con = DuckDBWrapper()                      # in-memory database
con = DuckDBWrapper("my_database.duckdb")  # file-backed database
```
You can run the rest of the cells to learn how to utilize the class.
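As a taste of what those cells cover, a session might look roughly like the following. The method names below are hypothetical stand-ins, not the wrapper's confirmed API; follow the notebook's actual cells for the real calls:

```python
# Hypothetical sketch -- register a data file, then query it.
con = DuckDBWrapper()

# Register a parquet file as a queryable table (illustrative method name
# and file path, not confirmed by the repo).
con.register("hourly_ridership", "data/opendata/nyc/mta/nyc/example.parquet")

# Run a query against the registered table (illustrative method name).
rows = con.query("SELECT COUNT(*) FROM hourly_ridership")
```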
---
# How to Run the Data App UI
## Step 1: Open a New Terminal
## Step 2: Check if Node.js is Installed
Before running the app, check if you have Node.js installed by running the following command:

```bash
node -v
```

If Node.js is installed, this will display the current version (e.g., `v16.0.0` or higher). If you see a version number, you're ready to proceed to the next step.
If it is not installed:
- Go to the Node.js download page.
- Download the appropriate installer for your operating system (Windows, macOS, or Linux).
- Follow the instructions to install Node.js.
Once installed, verify the installation by running the `node -v` command again to ensure it displays the version number.
## Step 3: Navigate to the App Directory

Change to the `mtastats` directory where the app is located by running the following command:

```bash
cd mtastats
```
## Step 4: Install Dependencies

With Node.js installed, run the following command to install the necessary dependencies:

```bash
npm install
```
## Step 5: Build the Data Sources

After installing the dependencies, build the data sources by running:

```bash
npm run sources
```
## Step 6: Start the Data App UI

Now, run the following command to start the Data App UI locally:

```bash
npm run dev
```

This will start the Data App UI on your local machine. You can access it by visiting the address shown in your terminal, typically `http://localhost:3000`.