Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MONGOCRYPT-755 Implement StrEncode #928

Open
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

marksg07
Copy link
Collaborator

No description provided.

@marksg07 marksg07 marked this pull request as ready for review December 31, 2024 20:07
src/mc-text-search-str-encode-private.h Show resolved Hide resolved
src/mc-text-search-str-encode.c Outdated Show resolved Hide resolved
src/mc-text-search-str-encode.c Outdated Show resolved Hide resolved
src/mc-text-search-str-encode.c Outdated Show resolved Hide resolved
src/mc-text-search-str-encode.c Show resolved Hide resolved
src/mc-text-search-str-encode.c Show resolved Hide resolved
@marksg07 marksg07 marked this pull request as draft January 6, 2025 16:02
@marksg07 marksg07 requested a review from erwee January 6, 2025 21:36
@marksg07 marksg07 marked this pull request as ready for review January 7, 2025 16:20
src/mc-text-search-str-encode.c Outdated Show resolved Hide resolved
src/mc-text-search-str-encode-private.h Outdated Show resolved Hide resolved
src/mc-text-search-str-encode-private.h Outdated Show resolved Hide resolved
src/mc-text-search-str-encode.c Outdated Show resolved Hide resolved
src/mc-text-search-str-encode.c Show resolved Hide resolved
src/mc-text-search-str-encode.c Outdated Show resolved Hide resolved
src/mc-text-search-str-encode.c Outdated Show resolved Hide resolved
@marksg07 marksg07 requested a review from erwee January 13, 2025 21:18
src/mc-str-encode-string-sets-private.h Outdated Show resolved Hide resolved
src/mc-str-encode-string-sets.c Show resolved Hide resolved
}

bool mc_substring_set_insert(mc_substring_set_t *set, uint32_t base_start_idx, uint32_t base_end_idx) {
if (base_start_idx > base_end_idx || base_end_idx >= set->base_string->codepoint_len) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be BSON_ASSERTs I think.. also need to add BSON_ASSERT_PARAM(set).

The = in base_end_idx >= set->base_string->codepoint_len implies that there's an extra fake byte at the end, so this doesn't allow you to include that. What if we just don't include the additional 0xff byte in codepoint_len, so that the caller doesn't have to think about the fake byte, and so the entire range of valid inputs for base_end_idx is just [0, codepoint_len], instead of [0, codepoint_len-1].

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like having codepoint_len as the length of the codepoint_offsets array. We give the fake byte its own codepoint offset, so I feel it should be included in the length. What I could do is, since there's nothing actually programmatically wrong with allowing base_end_idx == set->base_string->codepoint_len, just allow that case but never use it. We can verify in testing (well, we already do) that we are never actually inserting the bad byte.


void mc_substring_set_iter_init(mc_substring_set_iter_t *it, mc_substring_set_t *set) {
it->set = set;
it->cur_node = NULL;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be it->cur_node = set->set[0] because if not, the subsequent call to iter_next will always skip index 0?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Fixed.

src/mc-str-encode-string-sets.c Outdated Show resolved Hide resolved
src/mc-str-encode-string-sets.c Show resolved Hide resolved
src/mc-str-encode-string-sets.c Outdated Show resolved Hide resolved
// maxkgram_2 = sum_(j=lb, ub, (cbclen - j + 1)) # same sum bounds as maxkgram_1
// msize = sum_(j=lb, ub, (min(mlen, cbclen) - j + 1))
// in both cases, msize can be rewritten as:
// msize = sum_(j=lb, min(ub, cbclen), (min(mlen, cbclen) - j + 1))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, maybe let's just keep this simple and calculate it the way it's done in the paper?

uint32_t maxkgram_1 = calc_number_of_substrings(mlen, spec->lb, spec->ub);
uint32_t maxkgram_2 = calc_number_of_substrings(cbclen, spec->lb, BSON_MIN(spec->ub, cbclen));
uint32_t msize = BSON_MIN(maxkgram_1, maxkgram_2);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, I don't like doing more calculations just in order to match the paper. I feel that the variable names and the inline comments give enough reason for why this counting method is correct -- it's clear that we are padding the length to the lesser of cbclen or mlen, and then calculating how many substrings between length lb and ub there would be if the string actually was that max padded length.

@marksg07 marksg07 requested a review from erwee January 15, 2025 21:59
Copy link
Contributor

@kevinAlbs kevinAlbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. Only substantial comment is to limit string lengths to prevent possible overflows.

Comment on lines +47 to +52
fprintf(stderr,
"Testing nofold suffix/prefix case: str=\"%s\", lb=%u, ub=%u, unfolded_codepoint_len=%u\n",
str,
lb,
ub,
unfolded_codepoint_len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest using the TEST_PRINTF and TEST_STDERR_PRINTF macros to flush stdout/stderr and avoid mixed output in Evergreen logs. The macros were recently introduced in b193dba. Run git merge master to include them.

    TEST_STDERR_PRINTF("Testing nofold suffix/prefix case: str=\"%s\", lb=%u, ub=%u, unfolded_codepoint_len=%u\n",
                       str,
                       lb,
                       ub,
                       unfolded_codepoint_len);

Comment on lines +26 to +27
#undef MIN
#define MIN(a, b) (((a) < (b)) ? (a) : (b))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#undef MIN
#define MIN(a, b) (((a) < (b)) ? (a) : (b))

Suggest using BSON_MIN to simplify.

@@ -119,6 +119,58 @@ bool mc_FLE2RangeInsertSpec_parse(mc_FLE2RangeInsertSpec_t *out,
bool use_range_v2,
mongocrypt_status_t *status);

typedef struct {
// mlen is the max string length that can be indexed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// mlen is the max string length that can be indexed.
// mlen is the max string length (in characters, not bytes) that can be indexed.

Comment on lines +65 to +66
mc_FLE2TextSearchInsertSpec_t spec =
{str, byte_len, {{0, 0, 0}, false}, {{lb, ub}, true}, {{0, 0}, false}, false, false};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mc_FLE2TextSearchInsertSpec_t spec =
{str, byte_len, {{0, 0, 0}, false}, {{lb, ub}, true}, {{0, 0}, false}, false, false};
mc_FLE2TextSearchInsertSpec_t spec = {.v = str, .len = byte_len, .suffix = {{lb, ub}, true}};

Suggest using designated initializers and omitting fields that are expected to be zero-initialized to improve readability.

uint32_t affix_count = 0;
uint32_t total_real_affix_count = 0;
while (mc_affix_set_iter_next(&it, &affix, &affix_len, &affix_count)) {
// Since all substrings are just views on the base string, we can use pointer math to find our start and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Since all substrings are just views on the base string, we can use pointer math to find our start and
// Since all substrings are just views on the base string, we can use pointer math to find our start and end

ASSERT(sets->exact.len == byte_len);
ASSERT(0 == memcmp(sets->exact.data, str, byte_len));

if (unfolded_codepoint_len > mlen || lb > max_padded_len) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (unfolded_codepoint_len > mlen || lb > max_padded_len) {
if (lb > max_padded_len) {

Redundant with above check.

}
set->start_indices[idx] = base_start_idx;
set->end_indices[idx] = base_end_idx;
set->substring_counts[idx] = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider storing and incrementing the current set size in mc_affix_set_t, rather than requiring callers to track the index:

set->start_indices[set->cur_idx] = base_start_idx;
set->end_indices[set->cur_idx] = base_end_idx;
set->substring_counts[set->cur_idx] = 1;
set->cur_idx++;

That may help avoid exposing implementation details of mc_affix_set_t to the caller.

it->cur_idx = 0;
}

bool mc_affix_set_iter_next(mc_affix_set_iter_t *it, const char **str, uint32_t *len, uint32_t *count) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool mc_affix_set_iter_next(mc_affix_set_iter_t *it, const char **str, uint32_t *len, uint32_t *count) {
bool mc_affix_set_iter_next(mc_affix_set_iter_t *it, const char **str, uint32_t *byte_len, uint32_t *count) {

To clarify output is byte length, not character length.

// Linked list node in the hashset.
typedef struct _mc_substring_set_node_t {
uint32_t start_offset;
uint32_t len;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uint32_t len;
uint32_t byte_len;

mc_utf8_string_with_bad_char_t *mc_utf8_string_with_bad_char_from_buffer(const char *buf, uint32_t len) {
BSON_ASSERT_PARAM(buf);
mc_utf8_string_with_bad_char_t *ret = bson_malloc0(sizeof(mc_utf8_string_with_bad_char_t));
_mongocrypt_buffer_init_size(&ret->buf, len + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect this could overflow if len is UINT32_MAX. Similarly, the CBC length calculations may overflow when adding 15.

I suggest rejecting too-long strings in mc_text_search_str_encode, and using a limit much smaller than UINT32_MAX. If the limit is near UINT32_MAX, these operations may be prohibitively slow to be useful and could risk a denial-of-service attack. Consider using 16MiB (16777216 bytes) to match the maximum insert size of a BSON document (maxBsonObjectSize from the hello command reply).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants