Conversation
|
🔴 Unable to merge your PR into the base branch. Please rebase or merge it with the base branch. |
|
🟢 |
There was a problem hiding this comment.
Pull request overview
This PR extends KQP fulltext index access to support LIKE/StartsWith/EndsWith-style predicates (and a “Query” mode for FulltextMatch) by rewriting logical plans to push these predicates through fulltext index reads, and by updating the fulltext query representation to support composing query fragments.
Changes:
- Change fulltext query representation from a single value to a list of values (proto + expr node), and plumb it through compilation/execution.
- Update logical fulltext index rewrite to recognize
FulltextMatch“Mode=Query” and LIKE/StartsWith/EndsWith predicates, generating appropriate index query terms and (when needed) a post-filter. - Update fulltext index unit tests to use the new “Mode=Query” path and add coverage for LIKE/StartsWith/EndsWith cases.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| ydb/core/protos/kqp_physical.proto | Make QueryValue repeated to carry query fragments. |
| ydb/core/kqp/expr_nodes/kqp_expr_nodes.json | Change FullTextIndex Query node type to TExprList. |
| ydb/core/kqp/query_compiler/kqp_query_compiler.cpp | Emit multiple QueryValue entries from the query list. |
| ydb/core/kqp/executer_actor/kqp_tasks_graph.cpp | Concatenate repeated QueryValue parts into the runtime query string. |
| ydb/core/kqp/opt/logical/kqp_opt_log_indexes.cpp | Rewrite detection/rewriting for fulltext predicates; add “Mode=Query” + LIKE-based postfilter construction; build query fragments. |
| ydb/core/kqp/opt/logical/kqp_opt_log.cpp | Adjust optimizer handler registration order to accommodate the updated rewrite. |
| ydb/core/kqp/runtime/kqp_full_text_source.cpp | Remove runtime Pire-based postfiltering and adjust ngram detection logic. |
| ydb/core/kqp/common/kqp_yql.h | Rename BM25 setting keys to B/K1. |
| ydb/core/kqp/ut/indexes/fulltext/kqp_indexes_fulltext_ut.cpp | Update FulltextMatch tests to use “Mode=Query”; add LIKE/StartsWith/EndsWith assertions. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| auto& query = Queries[0]; | ||
| if (!query.NamedOptions) return settings; | ||
| for(auto& arg : query.NamedOptions->Children()) { | ||
| auto nameValueTuple = TExprBase(arg).Cast<TCoNameValueTuple>(); | ||
| TExprBase value = TExprBase(nameValueTuple.Value().Cast().Ptr()); | ||
| TString name = nameValueTuple.Name().StringValue(); | ||
| name.to_lower(); | ||
| if (name == "mode") | ||
| continue; | ||
|
|
||
| if (DefaultOperator) { | ||
| settings.push_back(Build<TCoNameValueTuple>(ctx, pos) | ||
| .Name<TCoAtom>() | ||
| .Value(TKqpReadTableFullTextIndexSettings::DefaultOperatorSettingName) | ||
| .Value(nameValueTuple.Name().StringValue()) | ||
| .Build() | ||
| .Value(DefaultOperator) | ||
| .Value(value) | ||
| .Done()); |
There was a problem hiding this comment.
Settings() currently forwards all named options from the first fulltext predicate into TKqlReadTableFullTextIndex settings (except Mode). But TKqpReadTableFullTextIndexSettings::Parse() hard-fails (YQL_ENSURE(false)) on any unknown setting name, so passing through unrelated/typoed options will crash compilation. Please either (1) filter to the supported setting names only (B/K1/DefaultOperator/MinimumShouldMatch/ItemsLimit/SkipLimit) or (2) validate and reject unknown options here with a clear error, instead of emitting them into the index-read settings list.
| @@ -1797,7 +1750,7 @@ class TFullTextMatchSource : public TActorBootstrapped<TFullTextMatchSource>, pu | |||
| Words.emplace_back(MakeIntrusive<TWordReadState>(wordIndex++, query, IndexTableReader)); | |||
| } | |||
|
|
|||
| GeneratePostfilterMatchers(analyzer.analyzers(), expr); | |||
| IsNgram = true; | |||
| } | |||
There was a problem hiding this comment.
IsNgram is set to true for any column analyzer that uses ngram/edge-ngram, and then set to true unconditionally when the analyzer matches the searched column. This makes non-ngram fulltext indexes (or indexes where only other columns use ngrams) run the ngram-specific heuristic in StartWordReads() (dropping “imbalanced” terms), which can change matching semantics and/or performance unexpectedly. Please set IsNgram only based on the analyzers for the matched search column and only when use_filter_ngram/use_filter_edge_ngram is enabled for that column.
| static constexpr TStringBuf BFactorSettingName = "B"; | ||
| static constexpr TStringBuf K1FactorSettingName = "K1"; |
There was a problem hiding this comment.
Renaming the fulltext BM25 setting keys from BFactor/K1Factor to B/K1 is a breaking change for existing queries that might be passing the old names. If compatibility matters, consider accepting both spellings in TKqpReadTableFullTextIndexSettings::Parse() (treat old names as aliases) while keeping the new canonical names for output.
| TExprNode::TPtr NamedOptions; | ||
| THashMap<std::string_view, TExprNode::TPtr> Settings; | ||
| bool StartsWithAny = false; | ||
| bool EndsWithWithAny = false; |
There was a problem hiding this comment.
Typo/inconsistent naming: EndsWithWithAny reads like a duplication of “With”. Consider renaming to EndsWithAny (or HasSuffixWildcard) to make the intent clearer and avoid propagating the typo to future code.
| bool EndsWithWithAny = false; | |
| bool EndsWithAny = false; |
Changelog entry
Changelog category
Description for reviewers
...