Add DataFrame API Documentation for DataFusion Python #1132
Open
kosiew wants to merge 9 commits into apache:main from kosiew:dataframe-rst
Commits
584afd7
feat: add API reference documentation for DataFrame and index
kosiew f544348
feat: add tests for validating RST syntax, code blocks, and internal …
kosiew 962f9ce
refactor: remove test script for DataFrame API documentation in RST f…
kosiew 5f203e3
Merge branch 'main' into dataframe-rst
kosiew dca035c
fix: correct formatting inconsistencies in dataframe.rst
kosiew d7d67fb
fix: correct header formatting in functions.rst
kosiew c4c8d14
fix: adjust formatting for code block in dataframe.rst
kosiew 391bfe4
fix: skip documentation for duplicate modules in autoapi configuration
kosiew ce9dd91
fix: add cross reference to io pages
kosiew
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,387 @@ | ||
.. Licensed to the Apache Software Foundation (ASF) under one | ||
.. or more contributor license agreements. See the NOTICE file | ||
.. distributed with this work for additional information | ||
.. regarding copyright ownership. The ASF licenses this file | ||
.. to you under the Apache License, Version 2.0 (the | ||
.. "License"); you may not use this file except in compliance | ||
.. with the License. You may obtain a copy of the License at | ||
|
||
.. http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
.. Unless required by applicable law or agreed to in writing, | ||
.. software distributed under the License is distributed on an | ||
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
.. KIND, either express or implied. See the License for the | ||
.. specific language governing permissions and limitations | ||
.. under the License. | ||
|
||
================= | ||
DataFrame API | ||
================= | ||
|
||
Overview | ||
-------- | ||
|
||
The ``DataFrame`` class is the core abstraction in DataFusion. It represents tabular data together with the | ||
operations applied to it, and provides an API for transformations such as filtering, projection, | ||
aggregation, and joins. | ||
|
||
A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when | ||
terminal operations like ``collect()``, ``show()``, or ``to_pandas()`` are called. | ||
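To make the lazy-evaluation model concrete, here is a toy sketch in plain Python. It is not DataFusion's implementation: ``LazyFrame``, its methods, and the list-of-dicts data are hypothetical stand-ins that only illustrate how transformations can be recorded as a plan and executed when a terminal operation is called.

```python
from typing import Any, Callable

Row = dict[str, Any]
Step = Callable[[list[Row]], list[Row]]


class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not DataFusion)."""

    def __init__(self, rows: list[Row], plan: tuple[Step, ...] = ()):
        self._rows = rows
        self._plan = plan  # recorded steps; nothing has run yet

    def filter(self, predicate: Callable[[Row], bool]) -> "LazyFrame":
        # Record the step and return a new frame; no rows are touched here.
        step = lambda rows: [r for r in rows if predicate(r)]
        return LazyFrame(self._rows, self._plan + (step,))

    def limit(self, n: int) -> "LazyFrame":
        step = lambda rows: rows[:n]
        return LazyFrame(self._rows, self._plan + (step,))

    def collect(self) -> list[Row]:
        # Terminal operation: only now is the recorded plan executed.
        rows = self._rows
        for step in self._plan:
            rows = step(rows)
        return rows


evaluated = []  # records which rows the predicate actually inspected


def is_over_25(row: Row) -> bool:
    evaluated.append(row["age"])
    return row["age"] > 25


frame = LazyFrame([{"age": 20}, {"age": 30}, {"age": 40}])
pending = frame.filter(is_over_25).limit(1)  # builds a plan only
calls_before_collect = len(evaluated)        # the predicate has not run yet
result = pending.collect()                   # executes filter, then limit
```

In this toy model the predicate is not invoked until ``collect()`` runs, which mirrors the documented behaviour that execution is deferred to terminal operations.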
|
||
Creating DataFrames | ||
------------------- | ||
|
||
DataFrames can be created in several ways: | ||
|
||
* From SQL queries via a ``SessionContext``: | ||
|
||
.. code-block:: python | ||
|
||
from datafusion import SessionContext | ||
|
||
ctx = SessionContext() | ||
df = ctx.sql("SELECT * FROM your_table") | ||
|
||
* From registered tables: | ||
|
||
.. code-block:: python | ||
|
||
df = ctx.table("your_table") | ||
|
||
* From various data sources: | ||
|
||
.. code-block:: python | ||
|
||
# From CSV files (see :ref:`io_csv` for detailed options) | ||
df = ctx.read_csv("path/to/data.csv") | ||
|
||
# From Parquet files (see :ref:`io_parquet` for detailed options) | ||
df = ctx.read_parquet("path/to/data.parquet") | ||
|
||
# From JSON files (see :ref:`io_json` for detailed options) | ||
df = ctx.read_json("path/to/data.json") | ||
|
||
# From Avro files (see :ref:`io_avro` for detailed options) | ||
df = ctx.read_avro("path/to/data.avro") | ||
|
||
# From Pandas DataFrame | ||
import pandas as pd | ||
pandas_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) | ||
df = ctx.from_pandas(pandas_df) | ||
|
||
# From Arrow data | ||
import pyarrow as pa | ||
batch = pa.RecordBatch.from_arrays( | ||
[pa.array([1, 2, 3]), pa.array([4, 5, 6])], | ||
names=["a", "b"] | ||
) | ||
df = ctx.from_arrow(batch) | ||
|
||
For detailed information about reading from different data sources, see the :doc:`I/O Guide <../user-guide/io/index>`. | ||
For custom data sources, see :ref:`io_custom_table_provider`. | ||
|
||
Common DataFrame Operations | ||
--------------------------- | ||
|
||
DataFusion's DataFrame API offers a wide range of operations: | ||
|
||
.. code-block:: python | ||
|
||
from datafusion import column, literal | ||
|
||
# Select specific columns | ||
df = df.select("col1", "col2") | ||
|
||
# Select with expressions | ||
df = df.select(column("a") + column("b"), column("a") - column("b")) | ||
|
||
# Filter rows | ||
df = df.filter(column("age") > literal(25)) | ||
|
||
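# Add computed columns (strings are concatenated with functions.concat) | ||
from datafusion import functions as f | ||
df = df.with_column("full_name", f.concat(column("first_name"), literal(" "), column("last_name"))) | ||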
|
||
# Multiple column additions | ||
df = df.with_columns( | ||
(column("a") + column("b")).alias("sum"), | ||
(column("a") * column("b")).alias("product") | ||
) | ||
|
||
# Sort data | ||
df = df.sort(column("age").sort(ascending=False)) | ||
|
||
# Join DataFrames | ||
df = df1.join(df2, on="user_id", how="inner") | ||
|
||
# Aggregate data | ||
from datafusion import functions as f | ||
df = df.aggregate( | ||
[], # Group by columns (empty for global aggregation) | ||
[f.sum(column("amount")).alias("total_amount")] | ||
) | ||
|
||
# Limit rows | ||
df = df.limit(100) | ||
|
||
# Drop columns | ||
df = df.drop("temporary_column") | ||
|
||
Terminal Operations | ||
------------------- | ||
|
||
To materialize the results of your DataFrame operations: | ||
|
||
.. code-block:: python | ||
|
||
# Collect all data as PyArrow RecordBatches | ||
result_batches = df.collect() | ||
|
||
# Convert to various formats | ||
pandas_df = df.to_pandas() # Pandas DataFrame | ||
polars_df = df.to_polars() # Polars DataFrame | ||
arrow_table = df.to_arrow_table() # PyArrow Table | ||
py_dict = df.to_pydict() # Python dictionary | ||
py_list = df.to_pylist() # Python list of dictionaries | ||
|
||
# Display results | ||
df.show() # Print tabular format to console | ||
|
||
# Count rows | ||
count = df.count() | ||
|
||
HTML Rendering in Jupyter | ||
------------------------- | ||
|
||
When working in Jupyter notebooks or other environments that support rich HTML display, | ||
DataFusion DataFrames automatically render as nicely formatted HTML tables. This functionality | ||
is provided by the ``_repr_html_`` method, which is automatically called by Jupyter. | ||
|
||
Basic HTML Rendering | ||
~~~~~~~~~~~~~~~~~~~~ | ||
|
||
In a Jupyter environment, simply displaying a DataFrame object will trigger HTML rendering: | ||
|
||
.. code-block:: python | ||
|
||
# Will display as HTML table in Jupyter | ||
df | ||
|
||
# Explicit display also uses HTML rendering | ||
display(df) | ||
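As background on the mechanism, the sketch below shows the rich-display hook in isolation. ``TinyTable`` is a hypothetical class written for illustration and has nothing to do with DataFusion's own renderer; it only demonstrates that an object exposing ``_repr_html_`` returns a string of HTML for the notebook to show.

```python
from html import escape


class TinyTable:
    """Hypothetical class showing the rich-display hook that notebooks look for."""

    def __init__(self, columns: list[str], rows: list[tuple]):
        self.columns = columns
        self.rows = rows

    def _repr_html_(self) -> str:
        # Jupyter calls this method, if present, when the object is the
        # last expression in a cell or is passed to display().
        header = "".join(f"<th>{escape(c)}</th>" for c in self.columns)
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
            for row in self.rows
        )
        return f"<table><thead><tr>{header}</tr></thead><tbody>{body}</tbody></table>"


table = TinyTable(["a", "b"], [(1, 4), (2, 5)])
markup = table._repr_html_()  # the same string a notebook would render
```

Calling ``_repr_html_()`` directly like this is also a convenient way to inspect rendered output outside a notebook.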
|
||
HTML Rendering Customization | ||
---------------------------- | ||
|
||
DataFusion provides extensive customization options for HTML table rendering through the | ||
``datafusion.html_formatter`` module. | ||
|
||
Configuring the HTML Formatter | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
You can customize how DataFrames are rendered by configuring the formatter: | ||
|
||
.. code-block:: python | ||
|
||
from datafusion.html_formatter import configure_formatter | ||
|
||
configure_formatter( | ||
max_cell_length=30, # Maximum length of cell content before truncation | ||
max_width=800, # Maximum width of table in pixels | ||
max_height=400, # Maximum height of table in pixels | ||
max_memory_bytes=2 * 1024 * 1024, # Maximum memory used for rendering (2 MiB) | ||
min_rows_display=10, # Minimum rows to display | ||
repr_rows=20, # Number of rows to display in representation | ||
enable_cell_expansion=True, # Allow cells to be expandable on click | ||
custom_css=None, # Custom CSS to apply | ||
show_truncation_message=True, # Show message when data is truncated | ||
style_provider=None, # Custom style provider class | ||
use_shared_styles=True # Share styles across tables to reduce duplication | ||
) | ||
|
||
Custom Style Providers | ||
~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
For advanced styling needs, you can create a custom style provider class: | ||
|
||
.. code-block:: python | ||
|
||
from datafusion.html_formatter import configure_formatter | ||
|
||
class CustomStyleProvider: | ||
    def get_cell_style(self) -> str: | ||
        return "background-color: #f5f5f5; color: #333; padding: 8px; border: 1px solid #ddd;" | ||
|
||
    def get_header_style(self) -> str: | ||
        return "background-color: #4285f4; color: white; font-weight: bold; padding: 10px;" | ||
|
||
# Apply custom styling | ||
configure_formatter(style_provider=CustomStyleProvider()) | ||
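The provider in the example above is duck-typed: it only needs the two methods shown. One way to make that contract explicit is a ``typing.Protocol``. The sketch below is an assumption for illustration: ``StyleProviderLike`` is a hypothetical local name, not something imported from DataFusion.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class StyleProviderLike(Protocol):
    """Hypothetical description of the two methods used in the example above."""

    def get_cell_style(self) -> str: ...

    def get_header_style(self) -> str: ...


class CustomStyleProvider:
    def get_cell_style(self) -> str:
        return "background-color: #f5f5f5; color: #333; padding: 8px; border: 1px solid #ddd;"

    def get_header_style(self) -> str:
        return "background-color: #4285f4; color: white; font-weight: bold; padding: 10px;"


provider = CustomStyleProvider()

# isinstance() against a runtime_checkable Protocol only checks that the
# methods exist; it does not validate their return values.
conforms = isinstance(provider, StyleProviderLike)

# The returned strings are inline CSS, so they fit directly in a style attribute.
sample_cell = f'<td style="{provider.get_cell_style()}">42</td>'
```

A check like this can catch a misspelled method name before the provider is passed to ``configure_formatter``.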
|
||
Custom Type Formatters | ||
~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
You can register custom formatters for specific data types: | ||
|
||
.. code-block:: python | ||
|
||
import datetime | ||
|
||
from datafusion.html_formatter import get_formatter | ||
|
||
formatter = get_formatter() | ||
|
||
# Format integers with color based on value | ||
def format_int(value): | ||
    return f'<span style="color: {"red" if value > 100 else "blue"}">{value}</span>' | ||
|
||
formatter.register_formatter(int, format_int) | ||
|
||
# Format date values | ||
def format_date(value): | ||
    return f'<span class="date-value">{value.isoformat()}</span>' | ||
|
||
formatter.register_formatter(datetime.date, format_date) | ||
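Because a type formatter is an ordinary function from a value to an HTML string, it can be checked on its own before it is registered. The sketch below repeats the two formatters from the example and calls them directly; the sample values are made up for illustration.

```python
import datetime


def format_int(value: int) -> str:
    # Values above 100 are shown in red, everything else in blue.
    color = "red" if value > 100 else "blue"
    return f'<span style="color: {color}">{value}</span>'


def format_date(value: datetime.date) -> str:
    return f'<span class="date-value">{value.isoformat()}</span>'


small = format_int(7)
# → '<span style="color: blue">7</span>'
large = format_int(150)
# → '<span style="color: red">150</span>'
day = format_date(datetime.date(2024, 1, 15))
# → '<span class="date-value">2024-01-15</span>'
```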
|
||
Custom Cell Builders | ||
~~~~~~~~~~~~~~~~~~~~ | ||
|
||
For complete control over cell rendering: | ||
|
||
.. code-block:: python | ||
|
||
from datafusion.html_formatter import get_formatter | ||
|
||
formatter = get_formatter() | ||
|
||
def custom_cell_builder(value, row, col, table_id): | ||
    try: | ||
        num_value = float(value) | ||
        if num_value > 0: # Positive values get green | ||
            return f'<td style="background-color: #d9f0d3">{value}</td>' | ||
        if num_value < 0: # Negative values get red | ||
            return f'<td style="background-color: #f0d3d3">{value}</td>' | ||
    except (ValueError, TypeError): | ||
        pass | ||
|
||
    # Default styling for non-numeric or zero values | ||
    return f'<td style="border: 1px solid #ddd">{value}</td>' | ||
|
||
formatter.set_custom_cell_builder(custom_cell_builder) | ||
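The branches of the cell builder are easy to verify by calling it directly. The sketch below repeats the builder from the example and feeds it a positive number, a negative number, zero, a non-numeric string, and ``None``. The ``row``, ``col``, and ``table_id`` arguments are placeholder values chosen for illustration, since this builder ignores them.

```python
def custom_cell_builder(value, row, col, table_id):
    try:
        num_value = float(value)
        if num_value > 0:  # Positive values get green
            return f'<td style="background-color: #d9f0d3">{value}</td>'
        if num_value < 0:  # Negative values get red
            return f'<td style="background-color: #f0d3d3">{value}</td>'
    except (ValueError, TypeError):
        # float("n/a") raises ValueError, float(None) raises TypeError
        pass

    # Default styling for non-numeric or zero values
    return f'<td style="border: 1px solid #ddd">{value}</td>'


samples = [3.5, -2, 0, "n/a", None]
cells = [
    custom_cell_builder(value, row=0, col=i, table_id="example")
    for i, value in enumerate(samples)
]
```

Note that a numeric string such as ``"12"`` also takes the positive branch, because ``float("12")`` succeeds.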
|
||
Custom Header Builders | ||
~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Similarly, you can customize the rendering of table headers: | ||
|
||
.. code-block:: python | ||
|
||
from datafusion.html_formatter import get_formatter | ||
|
||
formatter = get_formatter() | ||
|
||
def custom_header_builder(field): | ||
    tooltip = f"Type: {field.type}" | ||
    return f'<th style="background-color: #333; color: white" title="{tooltip}">{field.name}</th>' | ||
|
||
formatter.set_custom_header_builder(custom_header_builder) | ||
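The header builder only reads ``field.name`` and ``field.type``, so it can be exercised with any object that has those two attributes. The sketch below uses ``types.SimpleNamespace`` as a hypothetical stand-in for the schema field; the field name and type string are made up for illustration.

```python
from types import SimpleNamespace


def custom_header_builder(field):
    tooltip = f"Type: {field.type}"
    return f'<th style="background-color: #333; color: white" title="{tooltip}">{field.name}</th>'


# Stand-in for a schema field: only the two attributes the builder reads.
fake_field = SimpleNamespace(name="amount", type="int64")
header_cell = custom_header_builder(fake_field)
# → '<th style="background-color: #333; color: white" title="Type: int64">amount</th>'
```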
|
||
Managing Formatter State | ||
------------------------ | ||
|
||
The HTML formatter is a single global instance, so configuration changes apply to every DataFrame rendered afterwards. You can reset or retrieve it: | ||
|
||
.. code-block:: python | ||
|
||
from datafusion.html_formatter import reset_formatter, reset_styles_loaded_state, get_formatter | ||
|
||
# Reset the formatter to default settings | ||
reset_formatter() | ||
|
||
# Reset only the styles loaded state (useful when styles were loaded but need reloading) | ||
reset_styles_loaded_state() | ||
|
||
# Get the current formatter instance to make changes | ||
formatter = get_formatter() | ||
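The general pattern behind a shared, resettable formatter can be sketched in a few lines of plain Python. Everything here is hypothetical: ``ToyFormatter``, the ``toy_*`` functions, and the default values are invented for illustration and say nothing about how DataFusion stores its formatter internally.

```python
from dataclasses import dataclass


@dataclass
class ToyFormatter:
    """Hypothetical formatter with two illustrative settings and made-up defaults."""

    max_cell_length: int = 25
    repr_rows: int = 10


_formatter = ToyFormatter()  # module-level shared instance


def toy_get_formatter() -> ToyFormatter:
    return _formatter


def toy_configure_formatter(**kwargs) -> None:
    # Replace the shared instance so every later lookup sees the new settings.
    global _formatter
    _formatter = ToyFormatter(**kwargs)


def toy_reset_formatter() -> None:
    # Discard all customisation and go back to the defaults.
    global _formatter
    _formatter = ToyFormatter()


toy_configure_formatter(max_cell_length=50)
configured = toy_get_formatter().max_cell_length   # 50

toy_reset_formatter()
after_reset = toy_get_formatter().max_cell_length  # 25, the toy default
```

The practical consequence of shared state is that one notebook cell's configuration affects all later cells until it is reset.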
|
||
Advanced Example: Dashboard-Style Formatting | ||
-------------------------------------------- | ||
|
||
This example shows how to apply dashboard-style formatting to your DataFrames: | ||
|
||
.. code-block:: python | ||
|
||
from datafusion.html_formatter import configure_formatter, get_formatter | ||
|
||
# Define custom CSS | ||
custom_css = """ | ||
.datafusion-table { | ||
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; | ||
border-collapse: collapse; | ||
width: 100%; | ||
box-shadow: 0 2px 3px rgba(0,0,0,0.1); | ||
} | ||
.datafusion-table th { | ||
position: sticky; | ||
top: 0; | ||
z-index: 10; | ||
} | ||
.datafusion-table tr:hover td { | ||
background-color: #f1f7fa !important; | ||
} | ||
.datafusion-table .numeric-positive { | ||
color: #0a7c00; | ||
} | ||
.datafusion-table .numeric-negative { | ||
color: #d13438; | ||
} | ||
""" | ||
|
||
class DashboardStyleProvider: | ||
    def get_cell_style(self) -> str: | ||
        return "padding: 8px 12px; border-bottom: 1px solid #e0e0e0;" | ||
|
||
    def get_header_style(self) -> str: | ||
        return ("background-color: #0078d4; color: white; font-weight: 600; " | ||
                "padding: 12px; text-align: left; border-bottom: 2px solid #005a9e;") | ||
|
||
# Apply configuration | ||
configure_formatter( | ||
max_height=500, | ||
enable_cell_expansion=True, | ||
custom_css=custom_css, | ||
style_provider=DashboardStyleProvider(), | ||
max_cell_length=50 | ||
) | ||
|
||
# Add custom formatters for numbers | ||
formatter = get_formatter() | ||
|
||
def format_number(value): | ||
    try: | ||
        num = float(value) | ||
        cls = "numeric-positive" if num > 0 else "numeric-negative" if num < 0 else "" | ||
        return f'<span class="{cls}">{value:,}</span>' if cls else f'{value:,}' | ||
    except (ValueError, TypeError): | ||
        return str(value) | ||
|
||
formatter.register_formatter(int, format_number) | ||
formatter.register_formatter(float, format_number) | ||
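The number formatter from the example can be checked in isolation. The sketch below repeats it and calls it with a few made-up inputs to show which CSS class each sign receives and how non-numeric input falls through to ``str(value)``.

```python
def format_number(value):
    try:
        num = float(value)
        cls = "numeric-positive" if num > 0 else "numeric-negative" if num < 0 else ""
        # The ',' format spec adds thousands separators; it raises ValueError
        # for strings, which the except clause below turns into str(value).
        return f'<span class="{cls}">{value:,}</span>' if cls else f'{value:,}'
    except (ValueError, TypeError):
        return str(value)


positive = format_number(1234567)
# → '<span class="numeric-positive">1,234,567</span>'
negative = format_number(-1234.5)
# → '<span class="numeric-negative">-1,234.5</span>'
zero = format_number(0)
# → '0'
text = format_number("n/a")
# → 'n/a'
```

Zero gets no wrapping ``<span>``, so it keeps the default cell colour from the surrounding styles.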
|
||
Best Practices | ||
-------------- | ||
|
||
1. **Memory Management**: For large datasets, use ``max_memory_bytes`` to limit memory usage. | ||
|
||
2. **Responsive Design**: Set reasonable ``max_width`` and ``max_height`` values to ensure tables display well on different screens. | ||
|
||
3. **Style Optimization**: Use ``use_shared_styles=True`` to avoid duplicate style definitions when displaying multiple tables. | ||
|
||
4. **Reset When Needed**: Call ``reset_formatter()`` when you want to start fresh with default settings. | ||
|
||
5. **Cell Expansion**: Use ``enable_cell_expansion=True`` when cells might contain longer content that users may want to see in full. | ||
|
||
Additional Resources | ||
-------------------- | ||
|
||
* :doc:`../user-guide/dataframe` - Complete guide to using DataFrames | ||
* :doc:`../user-guide/io/index` - I/O Guide for reading data from various sources | ||
* :doc:`../user-guide/data-sources` - Comprehensive data sources guide | ||
* :ref:`io_csv` - CSV file reading | ||
* :ref:`io_parquet` - Parquet file reading | ||
* :ref:`io_json` - JSON file reading | ||
* :ref:`io_avro` - Avro file reading | ||
* :ref:`io_custom_table_provider` - Custom table providers | ||
* `API Reference <https://arrow.apache.org/datafusion-python/api/index.html>`_ - Full API reference |
It might be nice to cross reference to the io pages.
Agree.