
feat: some optimistic paths for instant manipulate#7812

Open
waynexia wants to merge 2 commits into main from inst-mani-fast-path

Conversation

@waynexia
Member

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

Some optimistic paths to avoid unnecessary array materialization

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.
  • API changes are backward compatible.
  • Schema or data changes are backward compatible.

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant performance optimizations within the InstantManipulate operator by intelligently reusing column data. By identifying and propagating information about constant tag columns, the system can avoid redundant materialization of array data, leading to more efficient query execution, especially in scenarios where the output record batch size changes but certain columns remain static.

Highlights

  • Optimized Column Materialization: Introduced optimistic paths to avoid unnecessary array materialization for columns that remain constant across a series, significantly improving performance for certain query patterns.
  • Tag Column Propagation: Added a tag_columns field to the InstantManipulate logical plan, allowing the planner to pass hints about tag columns to the execution plan. This enables the new optimization for reusing non-sample columns.
  • Dynamic Tag Column Resolution: Implemented logic to dynamically resolve tag_columns from the input SeriesDivide logical plan if not explicitly provided, ensuring the optimization can be applied even when tag columns are inferred.
  • New Helper Function for Column Reuse: Added a reuse_constant_column helper function to efficiently handle the reuse of constant columns when the output record batch length changes, avoiding recomputation or re-allocation.
Changelog
  • src/promql/src/extension_plan/instant_manipulate.rs
    • Imported DataType, ScalarValue, and Extension for enhanced type handling and logical plan extensions.
    • Removed unused ArrowResult import.
    • Imported SeriesDivide for logical plan introspection.
    • Added tag_columns field to InstantManipulate struct to store planner-provided tag column hints.
    • Updated InstantManipulate::new constructor and UserDefinedLogicalNodeCore implementations to accept and propagate tag_columns.
    • Implemented resolve_tag_columns method to infer tag columns from SeriesDivide input if not explicitly set.
    • Added reuse_all_non_sample_columns field to InstantManipulateExec and InstantManipulateStream to control the new optimization.
    • Optimized vector capacity allocation for take_indices and aligned_ts in InstantManipulateStream.
    • Modified InstantManipulateStream::take_record_batch to implement the logic for reusing non-sample columns based on reuse_all_non_sample_columns.
    • Introduced reuse_constant_column function to efficiently slice or extend constant columns.
    • Added new unit tests for rebuild_should_recover_tag_columns_from_series_divide_input and tsid_fast_path_reuses_non_sample_columns_when_output_grows to validate the new functionality.
  • src/query/src/promql/planner.rs
    • Modified PromPlanner::plan_instant_manipulate to pass the tag_columns from the context to the InstantManipulate logical plan.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions github-actions bot added the docs-not-required This change does not impact docs. label Mar 14, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an optimistic execution path for InstantManipulate to improve performance by avoiding unnecessary array materialization, particularly when using __tsid. The changes are mostly centered around instant_manipulate.rs, with a key modification in take_record_batch_optional to reuse constant columns. While the optimization itself is well-implemented, I've identified a critical issue in the planner logic that currently prevents this new fast path from being activated. Additionally, I've suggested an improvement to make the column resolution logic more robust. Addressing these points will ensure the new feature works as intended and improves the overall resilience of the query plan.

.time_index_column
.clone()
.expect("time index should be set in `setup_context`"),
self.ctx.tag_columns.clone(),
Contributor


high

The tag_columns passed to InstantManipulate::new here is incorrect when use_tsid is true. self.ctx.tag_columns holds the original tag columns from the table metadata, not ["__tsid"]. This prevents the reuse_all_non_sample_columns optimization in InstantManipulate from being triggered, which undermines a key goal of this pull request.

To fix this, you should conditionally pass vec!["__tsid".to_string()] when self.ctx.use_tsid is true, similar to how series_key_columns is determined for SeriesDivide.

            if self.ctx.use_tsid {
                vec![store_api::metric_engine_consts::DATA_SCHEMA_TSID_COLUMN_NAME.to_string()]
            } else {
                self.ctx.tag_columns.clone()
            },

Comment on lines +228 to +235
let LogicalPlan::Extension(Extension { node }) = input else {
return Vec::new();
};

node.as_any()
.downcast_ref::<SeriesDivide>()
.map(|series_divide| series_divide.tags().to_vec())
.unwrap_or_default()
Contributor


medium

This implementation of resolve_tag_columns only inspects the immediate input of InstantManipulate. However, the planner constructs a plan where the input is a SeriesNormalize node, which in turn wraps the SeriesDivide node. Consequently, the downcast_ref::<SeriesDivide>() will fail, and tag columns won't be resolved from the input plan if they are not explicitly provided.

While the primary fix should be in the planner to pass the correct tag_columns, making this function more robust by looking through SeriesNormalize would be a valuable improvement for future use cases. You might consider traversing the plan to find the underlying SeriesDivide node.
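The suggested traversal can be sketched with a toy plan model. The enum and names here are illustrative stand-ins, not the real DataFusion types; the actual code would walk `LogicalPlan` inputs and use `downcast_ref` on each extension node:

```rust
// Toy model of the review suggestion: peel pass-through wrapper nodes
// (e.g. SeriesNormalize) until a SeriesDivide is found, then return its
// tags. A real implementation would traverse DataFusion LogicalPlan
// nodes and downcast each Extension node instead.
enum Plan {
    SeriesDivide { tags: Vec<String>, input: Box<Plan> },
    SeriesNormalize { input: Box<Plan> },
    Scan,
}

fn resolve_tag_columns(mut plan: &Plan) -> Vec<String> {
    loop {
        match plan {
            Plan::SeriesDivide { tags, .. } => return tags.clone(),
            // Look through wrappers that don't change the series key.
            Plan::SeriesNormalize { input } => plan = input.as_ref(),
            Plan::Scan => return Vec::new(),
        }
    }
}

fn main() {
    // InstantManipulate's input is SeriesNormalize wrapping SeriesDivide.
    let plan = Plan::SeriesNormalize {
        input: Box::new(Plan::SeriesDivide {
            tags: vec!["__tsid".to_string()],
            input: Box::new(Plan::Scan),
        }),
    };
    assert_eq!(resolve_tag_columns(&plan), vec!["__tsid".to_string()]);
    println!("ok");
}
```

With a single-level `downcast_ref` only, the same input would resolve to no tags at all, which is the failure mode the comment describes.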

Comment on lines +239 to +240
let reuse_all_non_sample_columns =
matches!(self.tag_columns.as_slice(), [tag] if tag == "__tsid");
Contributor


So we only set reuse_all_non_sample_columns to true when there are no tags except tsid?

.with_exprs_and_inputs(vec![], vec![series_divide])
.unwrap();

assert_eq!(plan.tag_columns, vec!["__tsid".to_string()]);
Contributor


Also assert reuse_all_non_sample_columns?

reuse_all_non_sample_columns: true,
input,
metric: ExecutionPlanMetricsSet::new(),
});
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In non-test code, we won't set reuse_all_non_sample_columns to true because tags also contain a host column. Is it expected?

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@waynexia
Member Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe2a9f91d4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +653 to +654
if self.reuse_all_non_sample_columns && Some(index) != self.field_index {
arrays.push(reuse_constant_column(array, output_len)?);


P1 Badge Keep secondary value columns on the take() path

When __tsid is used as the series key, this branch treats every non-time, non-field_index column as a constant and rebuilds it with reuse_constant_column(). PromPlanner::prom_vector_selector_to_plan still passes only self.ctx.field_columns.first() into InstantManipulate, so on metric-engine tables with multiple value columns (field_1, field_2, …) those extra samples now bypass take_indices entirely. Any instant query that expands or skips points via lookback will therefore return stale/corrupted values for the secondary fields instead of the rows selected by compute::take().
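The hazard described above can be made concrete with a toy example. Plain `Vec<i64>`s stand in for Arrow arrays, and `reuse_as_constant` mimics what constant reuse would do to a column that is in fact not constant:

```rust
// Toy illustration of the P1 above: a genuinely per-row value column
// must go through the take() path; treating it as constant and resizing
// it keeps the original row order and returns stale values.
fn take(array: &[i64], indices: &[usize]) -> Vec<i64> {
    // Gather rows by index, like arrow's compute::take kernel.
    indices.iter().map(|&i| array[i]).collect()
}

fn reuse_as_constant(array: &[i64], output_len: usize) -> Vec<i64> {
    // Slice-or-pad, valid only if every row really holds the same value.
    let mut out = array.to_vec();
    out.truncate(output_len);
    out.resize(output_len, *array.last().unwrap());
    out
}

fn main() {
    let field_2 = vec![10, 20, 30];       // secondary value column
    let take_indices = vec![0, 2, 2, 2];  // lookback duplicates row 2

    // Correct: the rows actually selected by take().
    assert_eq!(take(&field_2, &take_indices), vec![10, 30, 30, 30]);

    // Wrong: constant reuse preserves row order and pads the tail,
    // so rows 1 and 2 no longer match the selected indices.
    assert_eq!(reuse_as_constant(&field_2, 4), vec![10, 20, 30, 30]);
    println!("ok");
}
```

The two outputs agree only when the column is truly constant within the series, which is why the reuse path must be gated to columns guaranteed constant by the series key.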



Labels

docs-not-required This change does not impact docs. size/S
