This repository provides a quickstart guide for basic operations with DataFusion: registering datasets, executing SQL queries, exporting results, and inspecting table schemas. The core functionality is encapsulated in the `DataFusionWrapper` class, which simplifies common tasks, such as specifying the file type when reading a dataset, and provides a user-friendly interface for working with DataFusion. The example notebook reads several files from the `data/examples` folder and then runs a few SQL queries against them.
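For context, here is a minimal sketch of the file-type dispatch such a wrapper might perform when registering a dataset. The `register_by_extension` helper is hypothetical; it only illustrates the idea of choosing between DataFusion's `register_parquet`, `register_csv`, and `register_json` based on the file suffix:

```python
from pathlib import Path

from datafusion import SessionContext


def register_by_extension(ctx: SessionContext, name: str, path: str) -> None:
    """Hypothetical helper: pick the registration call from the file suffix."""
    suffix = Path(path).suffix.lower()
    if suffix == ".parquet":
        ctx.register_parquet(name, path)
    elif suffix == ".csv":
        ctx.register_csv(name, path)
    elif suffix == ".json":
        ctx.register_json(name, path)
    else:
        raise ValueError(f"Unsupported file type: {suffix}")
```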
To manage your Python environment and dependencies, we recommend using `uv`, a tool that enhances `pip` with features like better dependency management and simplified virtual environment creation.
```bash
pip install uv
```
- Simplified Virtual Environments: Easily create and manage virtual environments.
- Improved Package Management: Handles dependencies more effectively than `pip`.
- Intuitive Commands: Streamlines workflows for Python projects.
- Create a virtual environment:

  ```bash
  uv venv
  ```

  This will create a virtual environment in the `.venv` folder.

- Activate the virtual environment:

  - On Linux/Mac:

    ```bash
    source .venv/bin/activate
    ```

  - On Windows:

    ```bash
    .venv\Scripts\activate
    ```

- Add `datafusion` and `ipykernel` to your virtual environment:

  ```bash
  uv add datafusion ipykernel
  ```

- Your environment is now ready for running the examples in this repository. When using the `.ipynb` file, you will need to manually set the kernel to your venv, as shown below.
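If Jupyter does not list the venv automatically, you can register it as a named kernel with `ipykernel` (the kernel name `datafusion-quickstart` is just an example):

```bash
# Run with the venv activated to register it as a Jupyter kernel
python -m ipykernel install --user --name datafusion-quickstart
```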
- File: `datafusion_quickstart.ipynb`
- This notebook demonstrates how to use the `DataFusionWrapper` class for:
  - Registering datasets.
  - Executing queries.
  - Inspecting schemas.
  - Exporting results.
- File: `python_files/sql_pipeline.py`
- This script provides an example of running a multi-query SQL pipeline using the `DataFusionWrapper` class:
  - Registers datasets from `data/examples/`.
  - Reads SQL queries from `python_files/sql/`.
  - Exports query results to `data/exports/` as Parquet files.
- Dataset Registration: Simplifies adding datasets (Parquet, CSV, JSON) as tables.
- SQL Query Execution: Run SQL queries on registered tables with ease.
- Export Results: Save query results in Parquet, CSV, or JSON formats.
- Table and Schema Inspection: Use built-in functions to explore the structure of your data.
```python
from datafusion_wrapper import DataFusionWrapper

con = DataFusionWrapper()

# Register datasets
paths = [
    "data/examples/mta_operations_statement/file_1.parquet",
    "data/examples/mta_hourly_subway_socrata/*.parquet",
]
table_names = [
    "mta_operations_statement",
    "mta_hourly_subway_socrata",
]
con.register_data(paths, table_names)

# Show tables
con.show_tables()

# Show schema of a specific table
con.show_schema("mta_operations_statement")
```
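The snippet above stops at schema inspection. To sketch the query and export steps, here is the equivalent flow in the underlying `datafusion` API (a minimal sketch assuming the wrapper delegates to a `SessionContext`; the query and output path are illustrative):

```python
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet(
    "mta_operations_statement",
    "data/examples/mta_operations_statement/file_1.parquet",
)

# Run a query; DataFusion returns a lazily evaluated DataFrame
df = ctx.sql("SELECT * FROM mta_operations_statement LIMIT 10")

# Materialize the result and write it out as Parquet
df.write_parquet("data/exports/sample")
```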
Refer to `python_files/sql_pipeline.py` for a detailed example of registering datasets, running multiple SQL queries, and exporting results to Parquet files. This script:

- Dynamically resolves paths relative to the repository root.
- Reads SQL queries from `python_files/sql/`.
- Exports the results to `data/exports/`.

A rough sketch of that pattern is shown below.
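This is not the script itself, just an illustration: the repo-root resolution assumes the script lives one level below the root, and the export uses the raw `datafusion` DataFrame API.

```python
from pathlib import Path

from datafusion import SessionContext

# Resolve paths relative to the repository root, assuming this script
# lives in python_files/, one level below the root
REPO_ROOT = Path(__file__).resolve().parents[1]
SQL_DIR = REPO_ROOT / "python_files" / "sql"
EXPORT_DIR = REPO_ROOT / "data" / "exports"

ctx = SessionContext()
# ... register tables here, as in the usage example above ...

# Run every .sql file in the query folder and export each result as Parquet
for sql_file in sorted(SQL_DIR.glob("*.sql")):
    result = ctx.sql(sql_file.read_text())
    result.write_parquet(str(EXPORT_DIR / sql_file.stem))
```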
With this setup, you can quickly start working with DataFusion for SQL-based operations on structured data!