Skip to content

Conversation

@ahkcs
Copy link
Contributor

@ahkcs ahkcs commented Oct 29, 2025

Description

Support wildcard for replace command

Related Issues

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

/**
* Utility for wildcard-based string replacement in PPL replace command.
*
* <p>Supports SPL-style wildcard matching where '*' matches zero or more characters. Captured
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: No way to escape asterisks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great catch! For now I am marking it as a limitation in the docs since the use case for literal asterisks is rare. Do you think we should implement escape instead or we can keep it as a limitation?

Copy link
Collaborator

@Swiddis Swiddis Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can keep it as a limitation for now, but I'm a bit concerned about breaking a potential existing usage with no workaround. I don't know if who's using asterisks today, the number is probably low, but it'd be surprise me if it was 0.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Especially if we're already building the final pattern with regex split (L76), we might be able to just make the split have negative lookahead with Pattern.split("(?!\\\\)\\*")

Feel free to skip this if it's more complicated than that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to support escaping asterisks

}

// Fast path: no wildcards = literal replacement
if (!pattern.contains("*")) {
Copy link
Collaborator

@Swiddis Swiddis Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: What are the perf implications of reprocessing the pattern from scratch on every record?

Ideally it'd be faster to compile the pattern once and reuse it to reduce branching in the inner loop. But I don't know if it's a measurable difference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good catch! Added LRU pattern cache to compile once per unique pattern string instead of compiling on every row

* @param pattern The pattern with wildcards
* @return List of captured strings (one per wildcard), or null if no match
*/
public static List<String> matchAndCapture(String input, String pattern) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (if-minor): Compiling an equivalent Regex matcher might be faster and simpler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great suggestion! Refactored to use regex matching

/** Match pattern against input and capture wildcard portions. */
public static List<String> matchAndCapture(String input, String pattern) {
Pattern compiledPattern =
PATTERN_CACHE.computeIfAbsent(pattern, WildcardReplaceUtils::compileWildcardPattern);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if pattern contains regex? I think we need to escape regex.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

He's using Pattern.quote to escape the literal parts on line 80.

import org.opensearch.sql.expression.function.UDFOperandMetadata;

/** UDF for wildcard-based string replacement in PPL replace command. */
public class WildcardReplaceFunctionImpl extends ImplementorUDF {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use regex to execute the replace, should we rather utilize existing function for regex replacement? In that case, we just need to compile pattern during planning, and don't need caching during execution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is a great improvement! I have updated, now wildcard patterns convert to regex at planning time. Also updated to use Clacite's REGEXP_REPLACE_3 operator instead of custom UDF, removed WildcardReplaceFunctionImpl, WildcardReplaceUtils, and runtime caching

@ahkcs ahkcs requested review from Swiddis and ykmr1224 October 30, 2025 22:59
@LantaoJin
Copy link
Member

The bwc-tests failed due to no space left on device which can be fixed by #4716 , no related to this PR but better to merge upstream (after 4716 merged) then rerun.

* @param wildcardPattern wildcard pattern with '*' and escape sequences (\*, \\)
* @return regex pattern with capture groups
*/
public static String convertWildcardPatternToRegex(String wildcardPattern) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add comprehensive unit tests for the methods in WildcardUtils.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added unit tests in WildcardUtilsTest

* @param pattern wildcard pattern with escape sequences
* @return array of pattern parts
*/
public static String[] splitWildcards(String pattern) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be private?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to private

* @param str string to count wildcards in
* @return number of unescaped wildcards
*/
public static int countWildcards(String str) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to private

ahkcs added 12 commits November 4, 2025 09:37
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>

# Conflicts:
#	integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java

# Conflicts:
#	integ-test/src/test/java/org/opensearch/sql/calcite/CalciteNoPushdownIT.java
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
@Swiddis Swiddis merged commit f97adb0 into opensearch-project:main Nov 4, 2025
39 of 57 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.19-dev failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.19-dev 2.19-dev
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.19-dev
# Create a new branch
git switch --create backport/backport-4698-to-2.19-dev
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 f97adb00227d0b209becc4cea72501af221533e6
# Push it to GitHub
git push --set-upstream origin backport/backport-4698-to-2.19-dev
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.19-dev

Then, create a pull request where the base branch is 2.19-dev and the compare/head branch is backport/backport-4698-to-2.19-dev.

expani pushed a commit to vinaykpud/sql that referenced this pull request Nov 4, 2025
ahkcs added a commit to ahkcs/sql that referenced this pull request Nov 4, 2025
LantaoJin pushed a commit that referenced this pull request Nov 5, 2025
(cherry picked from commit f97adb0)

Signed-off-by: Kai Huang <[email protected]>
@LantaoJin LantaoJin added the backport-manually Filed a PR to backport manually. label Nov 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport 2.19-dev backport-failed backport-manually Filed a PR to backport manually. enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RFC] Add replace Command to PPL (Calcite Engine)

4 participants