-
Notifications
You must be signed in to change notification settings - Fork 177
Support wildcard for replace command #4698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| /** | ||
| * Utility for wildcard-based string replacement in PPL replace command. | ||
| * | ||
| * <p>Supports SPL-style wildcard matching where '*' matches zero or more characters. Captured |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: No way to escape asterisks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great catch! For now I am marking it as a limitation in the docs since the use case for literal asterisks is rare. Do you think we should implement escape instead or we can keep it as a limitation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can keep it as a limitation for now, but I'm a bit concerned about breaking a potential existing usage with no workaround. I don't know if who's using asterisks today, the number is probably low, but it'd be surprise me if it was 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Especially if we're already building the final pattern with regex split (L76), we might be able to just make the split have negative lookahead with Pattern.split("(?!\\\\)\\*")
Feel free to skip this if it's more complicated than that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to support escaping asterisks
| } | ||
|
|
||
| // Fast path: no wildcards = literal replacement | ||
| if (!pattern.contains("*")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: What are the perf implications of reprocessing the pattern from scratch on every record?
Ideally it'd be faster to compile the pattern once and reuse it to reduce branching in the inner loop. But I don't know if it's a measurable difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good catch! Added LRU pattern cache to compile once per unique pattern string instead of compiling on every row
| * @param pattern The pattern with wildcards | ||
| * @return List of captured strings (one per wildcard), or null if no match | ||
| */ | ||
| public static List<String> matchAndCapture(String input, String pattern) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (if-minor): Compiling an equivalent Regex matcher might be faster and simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great suggestion! Refactored to use regex matching
| /** Match pattern against input and capture wildcard portions. */ | ||
| public static List<String> matchAndCapture(String input, String pattern) { | ||
| Pattern compiledPattern = | ||
| PATTERN_CACHE.computeIfAbsent(pattern, WildcardReplaceUtils::compileWildcardPattern); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if pattern contains regex? I think we need to escape regex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
He's using Pattern.quote to escape the literal parts on line 80.
| import org.opensearch.sql.expression.function.UDFOperandMetadata; | ||
|
|
||
| /** UDF for wildcard-based string replacement in PPL replace command. */ | ||
| public class WildcardReplaceFunctionImpl extends ImplementorUDF { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use regex to execute the replace, should we rather utilize existing function for regex replacement? In that case, we just need to compile pattern during planning, and don't need caching during execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is a great improvement! I have updated, now wildcard patterns convert to regex at planning time. Also updated to use Clacite's REGEXP_REPLACE_3 operator instead of custom UDF, removed WildcardReplaceFunctionImpl, WildcardReplaceUtils, and runtime caching
fd2f6c3 to
125d868
Compare
|
The bwc-tests failed due to no space left on device which can be fixed by #4716 , no related to this PR but better to merge upstream (after 4716 merged) then rerun. |
ecab069 to
c2cfc01
Compare
| * @param wildcardPattern wildcard pattern with '*' and escape sequences (\*, \\) | ||
| * @return regex pattern with capture groups | ||
| */ | ||
| public static String convertWildcardPatternToRegex(String wildcardPattern) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add comprehensive unit tests for the methods in WildcardUtils.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added unit tests in WildcardUtilsTest
| * @param pattern wildcard pattern with escape sequences | ||
| * @return array of pattern parts | ||
| */ | ||
| public static String[] splitWildcards(String pattern) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to private
| * @param str string to count wildcards in | ||
| * @return number of unescaped wildcards | ||
| */ | ||
| public static int countWildcards(String str) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to private
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]> # Conflicts: # integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java # Conflicts: # integ-test/src/test/java/org/opensearch/sql/calcite/CalciteNoPushdownIT.java
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
c2cfc01 to
1ddce43
Compare
|
The backport to To backport manually, run these commands in your terminal: # Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.19-dev 2.19-dev
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.19-dev
# Create a new branch
git switch --create backport/backport-4698-to-2.19-dev
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 f97adb00227d0b209becc4cea72501af221533e6
# Push it to GitHub
git push --set-upstream origin backport/backport-4698-to-2.19-dev
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.19-devThen, create a pull request where the |
(cherry picked from commit f97adb0) Signed-off-by: Kai Huang <[email protected]>
(cherry picked from commit f97adb0) Signed-off-by: Kai Huang <[email protected]>
Description
Support wildcard for replace command
Related Issues
Check List
--signoffor-s.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.