Add DataFrame usage guide with HTML rendering customization options #1108

Merged
timsaucer merged 5 commits into apache:main on Apr 27, 2025

@kosiew kosiew commented Apr 22, 2025

Which issue does this PR close?

Closes #1100


Rationale for this change

This change provides users with a dedicated and detailed guide for working with DataFrames in DataFusion. It introduces essential concepts, usage examples, and advanced features like HTML rendering customization, making it easier for both new and experienced users to take full advantage of the DataFrame API. This documentation enhancement will improve developer experience and usability.


What changes are included in this PR?

  • Adds a new documentation file: docs/source/user-guide/dataframe.rst
  • Introduces detailed sections covering:
    • Overview and basic usage of DataFrames
    • HTML rendering behavior in notebook environments
    • Customization options for DataFrame HTML display (e.g., themes, precision, truncation)
    • Support for custom style providers and formatters
    • Contextual formatting using a context manager
  • Updates docs/source/user-guide/basics.rst with a reference to the new DataFrame guide

Are there any user-facing changes?

Yes — this PR adds new user-facing documentation:

  • A complete guide to working with DataFrames
  • Instructions for customizing their display in interactive environments like Jupyter Notebooks

There are no breaking changes to the public API — only enhancements to documentation.

@timsaucer timsaucer left a comment

This is excellent! Thank you very much for it. My only comments are to fix the rst parsing.

and Arrow.

A DataFrame represents a logical plan that can be composed through operations like filtering, projection, and aggregation.
The actual execution happens when terminal operations like `collect()` or `show()` are called.

These need double back ticks to render properly.

``collect()`` or ``show()``

When working in Jupyter notebooks or other environments that support HTML rendering, DataFrames will
automatically display as formatted HTML tables, making it easier to visualize your data.

The `_repr_html_` method is called automatically by Jupyter to render a DataFrame. This method

double back ticks
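
The backtick comments above could also be checked mechanically. The following is a minimal sketch, not part of this PR: the helper name `double_backticks` and the regular expression are hypothetical, and the pattern is only a heuristic that skips role references such as `:ref:` targets and hyperlink references ending in `_`.

```python
import re

# Heuristic: a single-backtick span that is not already part of a
# double-backtick literal, not preceded by a role (":ref:`...`"),
# and not followed by "_" (an RST hyperlink reference).
SINGLE_BACKTICK = re.compile(r"(?<![`:_])`([^`\n]+)`(?![`_])")


def double_backticks(text):
    """Rewrite `code` spans as ``code`` so RST renders them as inline literals."""
    return SINGLE_BACKTICK.sub(r"``\1``", text)


before = "like `collect()` or `show()` are called."
print(double_backticks(before))
```

Running the helper a second time leaves the text unchanged, since spans that already use double backticks do not match the pattern.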

The actual execution happens when terminal operations like `collect()` or `show()` are called.

Basic Usage
----------

The ---- underline needs to be the same length as the title above it. It's one - too short.

df.show()

HTML Rendering
-------------

Needs one extra -

plain text output.

Customizing HTML Rendering
-------------------------

Needs one more -

The formatter settings affect all DataFrames displayed after configuration.

Custom Style Providers
---------------------

Needs one more -

configure_formatter(style_provider=MyStyleProvider())

Creating a Custom Formatter
--------------------------

Needs one more -

custom_html = formatter.format_html(batches, schema)

Managing Formatters
------------------

Needs one more -

print(formatter.theme)

Contextual Formatting
--------------------

Needs one more -
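
The underline-length comments above all follow one rule: a section underline must be at least as long as its title. A minimal sketch of a checker, assuming only the standard library (the function name `short_underlines` is hypothetical and is not part of this PR or of Sphinx):

```python
import re

# A line made of one repeated adornment character, e.g. "-----" or "=====".
UNDERLINE = re.compile(r"^([-=~^#*+])\1+$")


def short_underlines(rst_text):
    """Return (line_number, title, missing) for each underline shorter than its title."""
    lines = rst_text.splitlines()
    problems = []
    for i in range(1, len(lines)):
        title, underline = lines[i - 1].rstrip(), lines[i].rstrip()
        # Skip blank titles, lines that are not underlines, and overline/underline pairs.
        if not title or not UNDERLINE.match(underline) or UNDERLINE.match(title):
            continue
        missing = len(title) - len(underline)
        if missing > 0:
            problems.append((i + 1, title, missing))
    return problems


# "Basic Usage" has 11 characters but only 10 dashes; "HTML Rendering" is correct.
sample = "Basic Usage\n" + "-" * 10 + "\n\nHTML Rendering\n" + "-" * 14 + "\n"
for line_no, title, missing in short_underlines(sample):
    print(f"line {line_no}: '{title}' underline is {missing} character(s) too short")
```

The reported line number is that of the underline itself, which is where the extra characters need to be added.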

kosiew commented Apr 27, 2025

Thank you @timsaucer for the detailed review.
I have corrected the above.

@timsaucer timsaucer merged commit 91b6635 into apache:main Apr 27, 2025
16 checks passed
@timsaucer

Thank you again!

kosiew added a commit to kosiew/datafusion-python that referenced this pull request Apr 28, 2025
…pache#1108)

* docs: enhance user guide with detailed DataFrame operations and examples

* move /docs/source/api/dataframe.rst into user-guide

* docs: remove  DataFrame API documentation

* docs: fix formatting inconsistencies in DataFrame user guide

* Two minor corrections to documentation rendering

---------

Co-authored-by: Tim Saucer <[email protected]>
timsaucer added a commit that referenced this pull request May 5, 2025
…Memory and Display Controls (#1119)

* feat: add configurable max table bytes and min table rows for DataFrame display

* Revert "feat: add configurable max table bytes and min table rows for DataFrame display"

This reverts commit f9b78fa.

* feat: add FormatterConfig for configurable DataFrame display options

* refactor: simplify attribute extraction in get_formatter_config function

* refactor: remove hardcoded constants and use FormatterConfig for display options

* refactor: simplify record batch collection by using FormatterConfig for display options

* feat: add max_memory_bytes, min_rows_display, and repr_rows parameters to DataFrameHtmlFormatter

* feat: add tests for HTML formatter row display settings and memory limit

* refactor: extract Python formatter retrieval into a separate function

* Revert "feat: add tests for HTML formatter row display settings and memory limit"

This reverts commit e089d7b.

* feat: add tests for HTML formatter row and memory limit configurations

* Revert "feat: add tests for HTML formatter row and memory limit configurations"

This reverts commit 4090fd2.

* feat: add tests for new parameters and validation in DataFrameHtmlFormatter

* Reorganize tests

* refactor: rename and restructure formatter functions for clarity and maintainability

* feat: implement PythonFormatter struct and refactor formatter retrieval for improved clarity

* refactor: improve comments and restructure FormatterConfig usage in PyDataFrame

* Add DataFrame usage guide with HTML rendering customization options (#1108)

* docs: enhance user guide with detailed DataFrame operations and examples

* move /docs/source/api/dataframe.rst into user-guide

* docs: remove  DataFrame API documentation

* docs: fix formatting inconsistencies in DataFrame user guide

* Two minor corrections to documentation rendering

---------

Co-authored-by: Tim Saucer <[email protected]>

* Update documentation

* refactor: streamline HTML rendering documentation

* refactor: extract validation logic into separate functions for clarity

* Implement feature X to enhance user experience and optimize performance

* feat: add validation method for FormatterConfig to ensure positive integer values

* add comment - ensure minimum rows are collected even if memory or row limits are hit

* Update html_formatter documentation

* update tests

* remove unused type hints from imports in html_formatter.py

* remove redundant tests for DataFrameHtmlFormatter and clean up assertions

* refactor get_attr function to support generic default values

* build_formatter_config_from_python return PyResult

* fix ruff errors

* trigger ci

* fix: remove redundant newline in test_custom_style_provider_html_formatter

* add more tests

* trigger ci

* Fix ruff errors

* fix clippy error

* feat: add validation for parameters in configure_formatter

* test: add tests for invalid parameters in configure_formatter

* Fix ruff errors

---------

Co-authored-by: Tim Saucer <[email protected]>
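
The commit list above mentions a FormatterConfig with max_memory_bytes, min_rows_display, and repr_rows, validated as positive integers. The following is only an illustrative Python sketch of that kind of validation. The class name `FormatterConfigSketch` and the default values are assumptions made for the example and are not taken from the merged code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatterConfigSketch:
    """Hypothetical stand-in for a display configuration; defaults are placeholders."""

    max_memory_bytes: int = 2 * 1024 * 1024
    min_rows_display: int = 20
    repr_rows: int = 10

    def __post_init__(self):
        # Reject zero, negatives, non-integers, and bools (bool is an int subclass).
        for name in ("max_memory_bytes", "min_rows_display", "repr_rows"):
            value = getattr(self, name)
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                raise ValueError(f"{name} must be a positive integer, got {value!r}")


config = FormatterConfigSketch(repr_rows=25)
print(config.repr_rows)
```

Validating in the constructor means an invalid setting fails at configuration time rather than later, when a DataFrame is rendered.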