@yuancu yuancu commented Oct 16, 2025

Description

The chart command returns aggregation results that can easily be pivoted into a two-dimensional table.

Work items

  • support span
  • support limit, limit=top x, limit=bottom x
  • support useother, otherstr
  • correct limit behavior with non-cumulative aggregation functions (min, max, avg, etc.); fixed in #4594 (Fix timechart OTHER category aggregation for non-cumulative functions)
  • support usenull, nullstr
  • support non-string fields as column split
  • add integration tests
  • add explain tests
  • add a doc
  • Add a brief walk-through of the implementation
  • Anonymizer & test

Difference between stats, timechart, and chart

  • with stats
    chart is conceptually similar to stats in that both compute aggregated values and group them by given criteria. The main difference lies in the output format.

    For example, the query in the introduction can be rewritten with stats as ... | stats count BY status, host, which gives the following result:

    status host count
    200 www1 11835
    200 www2 11186
    200 www3 11261
    400 www1 233
    400 www2 257
    400 www3 211
    403 www2 228
    404 www1 244
    404 www2 209
    404 www3 237

    Each field specified in the BY clause becomes a separate column in the result table, and each row is a unique combination of their values. In chart, by contrast, each unique status becomes a row and each distinct value of host becomes a column. This format makes the results easier to view and visualize.

    As with the syntax of stats, ... | CHART count OVER status BY host can equivalently be written as ... | CHART count BY status, host.

    • Note that chart ignores documents with NULL in the row split or in the aggregation result, whereas stats keeps them.
  • with timechart

    The key difference between chart and timechart is that timechart always uses a default time field (@timestamp) as the row split rather than a user-specified field. ... | timechart agg BY field is conceptually equivalent to ... | CHART agg OVER _time BY field.

    The following table summarizes the differences between the three commands.

    | Feature | chart | stats | timechart |
    | --- | --- | --- | --- |
    | BY clause fields | Limited to two (row-split, column-split) | Multiple (3+ possible) | Always uses _time + one optional field |
    | Primary purpose | Consolidated visualizations | Detailed statistical calculations | Time-based analysis |
    | Output format | Table format optimized for visualization (Not implemented) | Row-based results | Table format time series visualization (Not implemented) |
    | Best use case | Data comparisons across categories | Detailed data analysis | Trend analysis over time |
    | X-axis control | Any field | N/A | Always _time |
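
The NULL-handling difference between stats and chart described above can be illustrated with a small Python sketch. This is not the plugin's code: the documents and the helper names `stats_count` and `chart_count` are hypothetical, and the sketch only models counting by (status, host).

```python
from collections import Counter

# Hypothetical sample documents; one has a NULL (None) row-split value.
docs = [
    {"status": 200, "host": "www1"},
    {"status": 200, "host": "www1"},
    {"status": None, "host": "www2"},
    {"status": 404, "host": "www2"},
]


def stats_count(docs):
    """Model of `stats count BY status, host`: NULL groups are kept."""
    return Counter((d["status"], d["host"]) for d in docs)


def chart_count(docs):
    """Model of `chart count OVER status BY host`: rows whose row split
    (status) is NULL are dropped before counting."""
    return Counter(
        (d["status"], d["host"]) for d in docs if d["status"] is not None
    )


print(stats_count(docs))
print(chart_count(docs))
```

In this model the stats result contains a group keyed by `(None, "www2")`, while the chart result does not.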

Related Issues

Resolves #399

Implementation Walk-through

Ideally, chart should pivot the result into a two-dimensional table. For example, given the following table:

a b val
m x 3
m y 4

| chart avg(val) by a, b should make it a table like this:

a x y
m 3 4

However, it seems that dynamic pivoting is not supported in SQL/Calcite (see the original discussion in #3965 (comment)). Therefore, the result table of the implemented chart looks like this:

a b avg(val)
m x 3
m y 4

The pivoting can be performed in the front-end.

As long as parameters such as usenull, useother, and limit do not affect the result, chart avg(val) by a, b is therefore equivalent to stats avg(val) by a, b.

When these parameters do affect the result, the chart command finds the top-N categories of b, aggregates the remaining categories into an OTHER category, and aggregates rows whose b is null into a "NULL" category. This leads to the following implementation:

  1. normal aggregation based on a, b (equivalent to stats agg_func by a, b)
  2. find out the top-N categories (unique values of column b) by aggregating on the above aggregation results
    1. aggregate on b
    2. sort on aggregation results
    3. number the rows
  3. left join the ranked results with the original aggregation
  4. keep rows whose row number is no greater than the limit, categorizing the rest to OTHER or NULL
  5. aggregate again, because values categorized as OTHER or NULL need to be merged
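
The five steps above can be sketched in plain Python. This is an illustrative model only, not the Calcite plan built by the PR: the function name `chart_with_limit` and its parameters are hypothetical, the aggregation is fixed to sum, categories are ranked by their summed values, and rows with a NULL row split or NULL value are skipped. Re-aggregating in step 5 is only valid here because sum is cumulative; functions such as avg need separate handling that this sketch does not cover.

```python
from collections import defaultdict


def chart_with_limit(rows, limit=2, use_other=True, other_str="OTHER",
                     use_null=True, null_str="NULL"):
    """rows: iterable of (a, b, val); returns sorted (a, label, sum) tuples."""
    # Step 1: plain aggregation by (a, b), like `stats sum(val) by a, b`.
    agg = defaultdict(int)
    for a, b, val in rows:
        if a is None or val is None:
            continue  # assumption: no row split or no value means ignored
        agg[(a, b)] += val

    # Step 2: aggregate on b, sort by the totals, and number the rows.
    totals = defaultdict(int)
    for (_, b), v in agg.items():
        if b is not None:  # NULL column splits do not take part in ranking
            totals[b] += v
    ranked = sorted(totals, key=lambda b: (-totals[b], b))
    row_number = {b: i + 1 for i, b in enumerate(ranked)}

    # Steps 3-5: attach the rank to each aggregated row, relabel rows
    # outside the limit as OTHER and NULL column splits as NULL, then
    # aggregate once more so relabelled rows are merged.
    merged = defaultdict(int)
    for (a, b), v in agg.items():
        if b is None:
            if not use_null:
                continue
            label = null_str
        elif row_number[b] <= limit:
            label = str(b)  # column-split values are treated as strings
        elif use_other:
            label = other_str
        else:
            continue
        merged[(a, label)] += v
    return sorted((a, label, v) for (a, label), v in merged.items())


rows = [("m", "x", 3), ("m", "y", 4), ("m", "z", 1),
        ("n", "x", 5), ("n", None, 2), ("n", "z", 1)]
print(chart_with_limit(rows, limit=2))
```

With this input, x and y have the largest totals, so the z rows fall into OTHER and the row with a missing b falls into NULL.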

Note:

This implementation does not reuse the implementation of timechart, in order to avoid some existing bugs. A follow-up PR will merge the two implementations, since chart is essentially a superset of timechart in terms of functionality.
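
Because the command returns unpivoted (row split, column split, value) rows, the front-end pivoting mentioned in the walk-through could look roughly like the following Python sketch. The function name `pivot_long_rows` and its `row_label` parameter are hypothetical, and column-split values are assumed to be strings.

```python
def pivot_long_rows(rows, row_label="a"):
    """Pivot (row_key, col_key, value) tuples into a header plus table body.

    Cells with no matching input row are filled with None.
    """
    cols = sorted({col for _, col, _ in rows})
    cells = {}
    for row_key, col_key, value in rows:
        cells.setdefault(row_key, {})[col_key] = value
    header = [row_label] + cols
    body = [[r] + [cells[r].get(c) for c in cols] for r in sorted(cells)]
    return header, body


# The unpivoted result from the walk-through: a, b, avg(val).
header, body = pivot_long_rows([("m", "x", 3), ("m", "y", 4)])
print(header)
print(body)
```

For the two rows from the walk-through this yields the header a, x, y and the single row m, 3, 4, matching the ideal pivoted table.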

Future work items

  • support multiple aggregation functions (Left as a TODO in the future: the output will be messy when multiple aggregations are involved because the results are not pivoted.)
  • unify implementation of timechart and chart
  • support more bin options such as bins (after #4612, Fix bins on time-related fields)

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@yuancu yuancu added the feature label Oct 16, 2025
@yuancu yuancu force-pushed the issues/399 branch 2 times, most recently from 8297023 to 6b8934e October 24, 2025 06:12
@yuancu yuancu marked this pull request as ready for review October 24, 2025 08:56
@yuancu yuancu marked this pull request as draft October 28, 2025 14:38
@yuancu yuancu marked this pull request as ready for review October 29, 2025 01:58
@yuancu yuancu changed the title from "WIP: Support chart command in PPL" to "Support chart command in PPL" Oct 29, 2025
@yuancu yuancu force-pushed the issues/399 branch 2 times, most recently from 182c87b to 2bc738c November 4, 2025 08:00
@yuancu yuancu force-pushed the issues/399 branch 2 times, most recently from d7ecaf9 to b5bd9b7 November 5, 2025 09:37
Comment on lines +292 to +293
: CHART chartOptions* statsAggTerm (OVER rowSplit)? (BY columnSplit)?
| CHART chartOptions* statsAggTerm BY rowSplit (COMMA)? columnSplit

@penghuo penghuo Nov 5, 2025

Not a blocker, but we should support chartOptions both at the beginning and at the end. This will allow users to modify queries easily without changing the aggregation and group-by components.

Collaborator Author

I'm a little confused about what you mean by supporting chartOptions eventually. Do you mean that we should support more chart options, or that we should support placing them somewhere other than before statsAggTerm?

Collaborator

My bad, we should support chartOptions both at the beginning and at the end.

Collaborator

e.g. chart count by row over col limit=10

The reason is that PPL discovery customers are typically keyboard-focused, so it is easier for them to modify options at the end of the query than at the beginning.

Collaborator Author

I get it. I'll improve the grammar in the follow-up PR that unifies timechart and chart.

Comment on lines 2085 to 2091
if (!SqlTypeUtil.isCharacter(colSplit.getType())) {
  colSplit =
      relBuilder.alias(
          context.rexBuilder.makeCast(
              UserDefinedFunctionUtils.NULLABLE_STRING, colSplit, true, true),
          columSplitName);
}
Collaborator

Non-blocker: can we rely on type coercion? E.g. if(row_number < topK, row, "OTHER")

@penghuo penghuo left a comment

Thanks for the change!
Please follow up on unifying the chart and timechart implementations in follow-up PRs.

@penghuo penghuo enabled auto-merge (squash) November 6, 2025 17:24
@penghuo penghuo merged commit 5523932 into opensearch-project:main Nov 7, 2025
35 checks passed
@opensearch-trigger-bot

The backport to 2.19-dev failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.19-dev 2.19-dev
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.19-dev
# Create a new branch
git switch --create backport/backport-4579-to-2.19-dev
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 55239323cac124df414ad1cee319f0b0f33a4513
# Push it to GitHub
git push --set-upstream origin backport/backport-4579-to-2.19-dev
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.19-dev

Then, create a pull request where the base branch is 2.19-dev and the compare/head branch is backport/backport-4579-to-2.19-dev.

@yuancu yuancu deleted the issues/399 branch November 7, 2025 09:08
yuancu added a commit to yuancu/sql-plugin that referenced this pull request Nov 7, 2025
* WIP: Make poc implementation for chart command

Signed-off-by: Yuanchun Shen <[email protected]>

* Support param useother and otherstr

Signed-off-by: Yuanchun Shen <[email protected]>

* Support usenull and nullstr (when both row split and col split present)

Signed-off-by: Yuanchun Shen <[email protected]>

* Append a final aggregation to merge OTHER categories

Signed-off-by: Yuanchun Shen <[email protected]>

* Handle common agg functions for OTHER category for timechart

Signed-off-by: Yuanchun Shen <[email protected]>

* Fix timechart IT

Signed-off-by: Yuanchun Shen <[email protected]>

* Sort earliest results with asc order

Signed-off-by: Yuanchun Shen <[email protected]>

* Support non-string fields as column split

Signed-off-by: Yuanchun Shen <[email protected]>

* Fix min/earliest order & fix non-accumulative agg for chart

Signed-off-by: Yuanchun Shen <[email protected]>

* Hint non-null in aggregateWithTrimming

Signed-off-by: Yuanchun Shen <[email protected]>

* Add integration tests for chart command

Signed-off-by: Yuanchun Shen <[email protected]>

* Add unit tests

Signed-off-by: Yuanchun Shen <[email protected]>

* Add doc for chart command

Signed-off-by: Yuanchun Shen <[email protected]>

* Prompt users that multiple agg is not supported

Signed-off-by: Yuanchun Shen <[email protected]>

* Add explain ITs

Signed-off-by: Yuanchun Shen <[email protected]>

* Remove unimplemented support for multiple aggregations in chart command

Signed-off-by: Yuanchun Shen <[email protected]>

* Add unit tests for chart command

Signed-off-by: Yuanchun Shen <[email protected]>

* Remove irrelevant yaml test

Signed-off-by: Yuanchun Shen <[email protected]>

* Tweak chart.rst

Signed-off-by: Yuanchun Shen <[email protected]>

* Swap the order of chart output to ensure metrics come last

Signed-off-by: Yuanchun Shen <[email protected]>

* Filter rows without col split when calculate grand total

Signed-off-by: Yuanchun Shen <[email protected]>

* Chores: tweak code order

Signed-off-by: Yuanchun Shen <[email protected]>

* Add anonymize test to chart command

Signed-off-by: Yuanchun Shen <[email protected]>

* Change grammart from limit=top 10 to limit=top10

Signed-off-by: Yuanchun Shen <[email protected]>

* Update chart doc

Signed-off-by: Yuanchun Shen <[email protected]>

* Rename __row_number__ for chart to _row_number_chart_

Signed-off-by: Yuanchun Shen <[email protected]>

* Sort by row and col splits on top of chart results

Signed-off-by: Yuanchun Shen <[email protected]>

* Ignore rows without a row split in chart command

Signed-off-by: Yuanchun Shen <[email protected]>

* Keep categories with max summed values when top k is set

Signed-off-by: Yuanchun Shen <[email protected]>

* Simplify toAddHintsOnAggregate condition

Signed-off-by: Yuanchun Shen <[email protected]>

* Chores: eliminate unnecessary variables

Signed-off-by: Yuanchun Shen <[email protected]>

* Apply a non-null filter on fields referred by aggregations

Signed-off-by: Yuanchun Shen <[email protected]>

* Fix chart plans

Signed-off-by: Yuanchun Shen <[email protected]>

* Get rid of record class

Signed-off-by: Yuanchun Shen <[email protected]>

* Move ranking by column split to a helper function

Signed-off-by: Yuanchun Shen <[email protected]>

---------

Signed-off-by: Yuanchun Shen <[email protected]>
(cherry picked from commit 5523932)

Development

Successfully merging this pull request may close these issues.

[RFC] PPL Chart Command

4 participants