-
Notifications
You must be signed in to change notification settings - Fork 177
Support chart command in PPL
#4579
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
8297023 to
6b8934e
Compare
chart command in PPLchart command in PPL
Signed-off-by: Yuanchun Shen <[email protected]>
182c87b to
2bc738c
Compare
Signed-off-by: Yuanchun Shen <[email protected]>
Signed-off-by: Yuanchun Shen <[email protected]>
d7ecaf9 to
b5bd9b7
Compare
Signed-off-by: Yuanchun Shen <[email protected]>
Signed-off-by: Yuanchun Shen <[email protected]>
Signed-off-by: Yuanchun Shen <[email protected]>
| : CHART chartOptions* statsAggTerm (OVER rowSplit)? (BY columnSplit)? | ||
| | CHART chartOptions* statsAggTerm BY rowSplit (COMMA)? columnSplit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a blocker, but we should support chartOptions at beginning and at the end. This will allow users to modify queries easily without changing the aggregation and group-by components.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused on what do you mean by support chartOptions eventually. Do you mean that we should support more chart options or we should support placing them not only before statsAggTerm, etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my bad, we should support chartOptions at beginning and at the end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. chart count by row over col limit=10
The reason is that PPL discovery customer are typically keyboard-focused, making it easier for them to modify options at the end of the query rather than at the beginning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get it. I'll improve the grammar in the following PR unifying timechart & chart
integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteChartCommandIT.java
Outdated
Show resolved
Hide resolved
integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
Show resolved
Hide resolved
| if (!SqlTypeUtil.isCharacter(colSplit.getType())) { | ||
| colSplit = | ||
| relBuilder.alias( | ||
| context.rexBuilder.makeCast( | ||
| UserDefinedFunctionUtils.NULLABLE_STRING, colSplit, true, true), | ||
| columSplitName); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocker, can we rely on type-corecetion? if(row_number<topK, row, "OTHER")
integ-test/src/test/resources/expectedOutput/calcite/explain_chart_with_limit.yaml
Outdated
Show resolved
Hide resolved
core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Yuanchun Shen <[email protected]>
Signed-off-by: Yuanchun Shen <[email protected]>
Signed-off-by: Yuanchun Shen <[email protected]>
Signed-off-by: Yuanchun Shen <[email protected]>
Signed-off-by: Yuanchun Shen <[email protected]>
Signed-off-by: Yuanchun Shen <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the change!
Please follow up on unifiy chart and timechart implementation in following PRs.
|
The backport to To backport manually, run these commands in your terminal: # Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.19-dev 2.19-dev
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.19-dev
# Create a new branch
git switch --create backport/backport-4579-to-2.19-dev
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 55239323cac124df414ad1cee319f0b0f33a4513
# Push it to GitHub
git push --set-upstream origin backport/backport-4579-to-2.19-dev
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.19-devThen, create a pull request where the |
* WIP: Make poc implementation for chart command Signed-off-by: Yuanchun Shen <[email protected]> * Support param useother and otherstr Signed-off-by: Yuanchun Shen <[email protected]> * Support usenull and nullstr (when both row split and col split present) Signed-off-by: Yuanchun Shen <[email protected]> * Append a final aggregation to merge OTHER categories Signed-off-by: Yuanchun Shen <[email protected]> * Handle common agg functions for OTHER category for timechart Signed-off-by: Yuanchun Shen <[email protected]> * Fix timechart IT Signed-off-by: Yuanchun Shen <[email protected]> * Sort earliest results with asc order Signed-off-by: Yuanchun Shen <[email protected]> * Support non-string fields as column split Signed-off-by: Yuanchun Shen <[email protected]> * Fix min/earliest order & fix non-accumulative agg for chart Signed-off-by: Yuanchun Shen <[email protected]> * Hint non-null in aggregateWithTrimming Signed-off-by: Yuanchun Shen <[email protected]> * Add integration tests for chart command Signed-off-by: Yuanchun Shen <[email protected]> * Add unit tests Signed-off-by: Yuanchun Shen <[email protected]> * Add doc for chart command Signed-off-by: Yuanchun Shen <[email protected]> * Prompt users that multiple agg is not supported Signed-off-by: Yuanchun Shen <[email protected]> * Add explain ITs Signed-off-by: Yuanchun Shen <[email protected]> * Remove unimplemented support for multiple aggregations in chart command Signed-off-by: Yuanchun Shen <[email protected]> * Add unit tests for chart command Signed-off-by: Yuanchun Shen <[email protected]> * Remove irrelevant yaml test Signed-off-by: Yuanchun Shen <[email protected]> * Tweak chart.rst Signed-off-by: Yuanchun Shen <[email protected]> * Swap the order of chart output to ensure metrics come last Signed-off-by: Yuanchun Shen <[email protected]> * Filter rows without col split when calculate grand total Signed-off-by: Yuanchun Shen <[email protected]> * Chores: tweak code order Signed-off-by: Yuanchun Shen <[email protected]> * Add anonymize test to chart command Signed-off-by: Yuanchun Shen <[email protected]> * Change grammart from limit=top 10 to limit=top10 Signed-off-by: Yuanchun Shen <[email protected]> * Update chart doc Signed-off-by: Yuanchun Shen <[email protected]> * Rename __row_number__ for chart to _row_number_chart_ Signed-off-by: Yuanchun Shen <[email protected]> * Sort by row and col splits on top of chart results Signed-off-by: Yuanchun Shen <[email protected]> * Ignore rows without a row split in chart command Signed-off-by: Yuanchun Shen <[email protected]> * Keep categories with max summed values when top k is set Signed-off-by: Yuanchun Shen <[email protected]> * Simplify toAddHintsOnAggregate condition Signed-off-by: Yuanchun Shen <[email protected]> * Chores: eliminate unnecessary variables Signed-off-by: Yuanchun Shen <[email protected]> * Apply a non-null filter on fields referred by aggregations Signed-off-by: Yuanchun Shen <[email protected]> * Fix chart plans Signed-off-by: Yuanchun Shen <[email protected]> * Get rid of record class Signed-off-by: Yuanchun Shen <[email protected]> * Move ranking by column split to a helper function Signed-off-by: Yuanchun Shen <[email protected]> --------- Signed-off-by: Yuanchun Shen <[email protected]> (cherry picked from commit 5523932)
Description
The chart command returns an aggregation result that can be easily pivoted to a two-dimension table format.
Work items
limit,limit=top x,limit=bottom xuseother,otherstrusenull,nullstrDifference between stats, timechart, and chart
with
statschartis conceptually similar tostatsin that they all compute an aggregation value and then group them by a given criterion. The main differences lies in their output format.For example, for the query in introduction, we can rewrite it with stats:
... | stats count BY status, host. It gives the following result:Each field specified in BY clause becomes a separate column in the result table; each row is a unique combination of them. Whereas in
chart, each unique status becomes a row; each individual value of host becomes a column. This format makes it easier to view and visualize the results.Similar to the syntax of
stats, an equivalent expression of... | CHART count OVER status BY hostis... | CHART count BY status, host.chartwill ignore documents with NULL in row split and in aggregation results, whilestatswill keep them.with
timechartThe key difference between
chartandtimechartis thattimechartleverages a default@timestampfield as the field to perform aggregation on.... | timechart agg BY fieldis conceptually equivalent to... | CHART agg OVER _time BY fieldThe following table summarizes the differences between the three commands.
Related Issues
Resolves #399
Implementation Walk-through
Ideally, chart should pivot the result into a 2-dimension table. E.g. for the following table:
| chart avg(val) by a, bshould make it a table like this:However, it seems dynamic pivoting is not supported in SQL/Calcite (see original discussion in #3965 (comment)). Therefore, the result table for the implemented
chartis like:The pivoting can be performed in the front-end.
The above operation is equivalent to
stats avg(val) by a, b-- this is the case when parameters likeusenull,useother, andlimitis not involved in the result.When these parameters are involved,
chartcommand will find the top-N categories ofb, aggregating the rest to anOTHERcategory, and aggregating those whosebis null to a "NULL" category. This leads to the following implementation:stats agg_func by a, b)Note:
This implementation did not reuse the implementation of timechart to circumvent some existing bugs. A following PR will merge their implementation as chart essentially is a superset of timechart in terms of functionality.
Future work items
timechartandchartbins(after Fixbinson time-related fields #4612 )Check List
--signoffor-s.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.