Skip to content

Conversation

BlakeOrth
Copy link
Contributor

Which issue does this PR close?

This does not fully close, but is an incremental building block component for:

The full context of how this code is likely to progress can be seen in the POC for this effort:

Rationale for this change

For particularly large requests, in terms of number of objects in a table or large objects, the number of operations for a query may be quite large. In these cases, understanding the aggregate impact of various object store operations is likely the best way to understand the impact those operations had on a particular query. This PR allows users of an instrumented object store to understand and display basic summary statistics related to the RequestDetails collected during a query.

What changes are included in this PR?

  • Adds a RequestSummary type for the instrumented object store to display summary statistics about instrumented requests
  • Adds a generic Stats type to track the statistics for the summary
  • Adds tests for the new code
  • Adds a basic summary output to the user-facing display when profiling is enabled
  • Adds docs for new and newly exported public items

Are these changes tested?

Yes. The new functionality has tests implemented, aside from testing the actual display output. The functional output can be seen below:

DataFusion CLI v50.1.0
> \object_store_profiling enabled
ObjectStore Profile mode set to Enabled
> CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
0 row(s) fetched.
Elapsed 0.268 seconds.

Object Store Profiling
Instrumented Object Store: instrument_mode: Enabled, inner: HttpStore
2025-10-13T22:15:50.518465131+00:00 operation=Get duration=0.030742s size=8 range: bytes=174965036-174965043 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-10-13T22:15:50.549263341+00:00 operation=Get duration=0.033060s size=34322 range: bytes=174930714-174965035 path=hits_compatible/athena_partitioned/hits_1.parquet

Summaries:
Get
count: 2
duration min: 0.030742s
duration max: 0.033060s
duration avg: 0.031901s
size min: 8 B
size max: 34322 B
size avg: 17165 B
size sum: 34330 B

>

Are there any user-facing changes?

Yes? Just like the previous PR this does change the user-facing output, but there's no API breaking changes.

cc @alamb

 - Adds a `RequestSummary` type for the instrumented object store to
   display summary statistics about instrumented requests
 - Adds a generic Stats type to track the statistics for the summary
 - Adds tests for the new code
 - Adds a basic summary output to the user-facing display when profiling
   is enabled
 - Adds docs for new and newly exported public items
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me -- thank you @BlakeOrth

Summaries:
Get
count: 1
[SUMMARY_DURATION]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a nice to have future feature, it would be great to have these values be more human readable -- e.g. in stead of 17446363 B have it be 17.4 MB

@alamb
Copy link
Contributor

alamb commented Oct 14, 2025

Another potential improvement is to reduce the number of lines used for display:

From:

Summaries:
Get
count: 2
duration min: 0.030742s
duration max: 0.033060s
duration avg: 0.031901s
size min: 8 B
size max: 34322 B
size avg: 17165 B
size sum: 34330 B

To something like

Summaries :
GET count: 2
duration(min/max/avg): 0.030742s / 0.033060s / 0.031901s
size(min/max/avg/sum): 8 B / 34322 B / 17165 B / 34330 B

but then again maybe that is even harder to read 🤔

@alamb alamb enabled auto-merge October 14, 2025 21:26
@alamb alamb added this pull request to the merge queue Oct 14, 2025
Merged via the queue into apache:main with commit ca2585e Oct 14, 2025
28 checks passed
@BlakeOrth
Copy link
Contributor Author

@alamb Yeah, I'm not sure what the best way to display these is in a concise but readable manner. I think in the original issue for this I had also postulated that perhaps a table output, similar to the table information and query results might work.

+----------+-----------+-----------+-----------+---------+
|  Metric  |    min    |    max    |    avg    |   sum   |
+----------+-----------+-----------+-----------+---------+
| duration | 0.030742s | 0.033060s | 0.031901s | N/A     |
| size     | 8 B       | 34322 B   | 17165 B   | 34330 B |
+----------+-----------+-----------+-----------+---------+

Admittedly I'm probably not the right person to make that decision; UI/UX are not my strong suit.

dariocurr pushed a commit to dariocurr/datafusion that referenced this pull request Oct 15, 2025
* Adds summary output to CLI instrumented object stores
 - Adds a `RequestSummary` type for the instrumented object store to
   display summary statistics about instrumented requests
 - Adds a generic Stats type to track the statistics for the summary
 - Adds tests for the new code
 - Adds a basic summary output to the user-facing display when profiling
   is enabled
 - Adds docs for new and newly exported public items

* - Updates integration test and validation snapshot
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants