MONGOCRYPT-755 Implement StrEncode #928

marksg07 · 2024-12-20T20:09:54Z

No description provided.

src/mc-text-search-str-encode-private.h

src/mc-text-search-str-encode.c

src/mc-str-encode-string-sets-private.h

src/mc-str-encode-string-sets.c

erwee · 2025-01-15T16:16:14Z

src/mc-str-encode-string-sets.c

+}
+
+bool mc_substring_set_insert(mc_substring_set_t *set, uint32_t base_start_idx, uint32_t base_end_idx) {
+    if (base_start_idx > base_end_idx || base_end_idx >= set->base_string->codepoint_len) {


I think these should be BSON_ASSERTs. We also need to add BSON_ASSERT_PARAM(set).

The = in base_end_idx >= set->base_string->codepoint_len implies that there's an extra fake byte at the end, and this check doesn't allow you to include it. What if we just don't count the additional 0xff byte in codepoint_len? Then the caller doesn't have to think about the fake byte, and the entire range of valid inputs for base_end_idx is just [0, codepoint_len] instead of [0, codepoint_len-1].

I like having codepoint_len as the length of the codepoint_offsets array. We give the fake byte its own codepoint offset, so I feel it should be included in the length. Since there's nothing actually programmatically wrong with allowing base_end_idx == set->base_string->codepoint_len, what I could do is allow that case but never use it. We can verify in testing (well, we already do) that we are never actually inserting the bad byte.
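For reference, the relaxed check described in the reply could look roughly like the sketch below. It is not the code in this PR: `demo_string_t` and `demo_range_is_valid` are hypothetical names, the end index is assumed to be exclusive, and a plain `assert` stands in for `BSON_ASSERT_PARAM`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the base string; only the field needed here. */
typedef struct {
    uint32_t codepoint_len; /* assumed to count the trailing fake 0xFF codepoint */
} demo_string_t;

/* Returns true when [start, end) is an acceptable range, assuming an exclusive
 * end index. end == codepoint_len is accepted, so the check itself does not
 * forbid covering the fake codepoint; callers are expected never to do so. */
bool demo_range_is_valid(const demo_string_t *s, uint32_t start, uint32_t end) {
    assert(s != NULL); /* plays the role of a parameter assertion */
    return start <= end && end <= s->codepoint_len;
}
```

With this shape, the only rejected inputs are reversed ranges and ranges that run past the end of the offsets array.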

erwee · 2025-01-15T16:29:58Z

src/mc-str-encode-string-sets.c

+
+void mc_substring_set_iter_init(mc_substring_set_iter_t *it, mc_substring_set_t *set) {
+    it->set = set;
+    it->cur_node = NULL;


Shouldn't this be it->cur_node = set->set[0]? If not, the subsequent call to iter_next will always skip index 0.

Good point. Fixed.
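To illustrate the issue, here is a minimal sketch of an iterator over a chained hash set. All names (`demo_set_t`, `demo_iter_init`, `demo_iter_next`) and the fixed bucket count are assumptions for illustration, not the PR's implementation. The point is that init must start at the head of bucket 0, because next only moves to a later bucket when the current node is NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BUCKETS 4

typedef struct demo_node {
    uint32_t value;
    struct demo_node *next;
} demo_node_t;

typedef struct {
    demo_node_t *buckets[DEMO_BUCKETS]; /* each bucket is a linked list, possibly empty */
} demo_set_t;

typedef struct {
    const demo_set_t *set;
    const demo_node_t *cur_node; /* next node to yield, or NULL if the bucket is exhausted */
    uint32_t cur_idx;            /* bucket that cur_node belongs to */
} demo_iter_t;

void demo_iter_init(demo_iter_t *it, const demo_set_t *set) {
    assert(it != NULL && set != NULL);
    it->set = set;
    it->cur_idx = 0;
    it->cur_node = set->buckets[0]; /* start at bucket 0 so it is not skipped */
}

bool demo_iter_next(demo_iter_t *it, uint32_t *out) {
    assert(it != NULL && out != NULL);
    /* Move forward to the next non-empty bucket if the current one is exhausted. */
    while (it->cur_node == NULL) {
        if (it->cur_idx + 1 >= DEMO_BUCKETS) {
            return false; /* no buckets left */
        }
        it->cur_idx++;
        it->cur_node = it->set->buckets[it->cur_idx];
    }
    *out = it->cur_node->value;
    it->cur_node = it->cur_node->next;
    return true;
}
```

If init set `cur_node` to NULL instead, the first call to next would advance to bucket 1 before yielding anything, which is the skipped-index behavior raised above.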

src/mc-str-encode-string-sets.c

erwee · 2025-01-15T19:07:36Z

src/mc-text-search-str-encode.c

+    //     maxkgram_2 = sum_(j=lb, ub, (cbclen - j + 1))        # same sum bounds as maxkgram_1
+    //     msize      = sum_(j=lb, ub, (min(mlen, cbclen) - j + 1))
+    // in both cases, msize can be rewritten as:
+    //     msize      = sum_(j=lb, min(ub, cbclen), (min(mlen, cbclen) - j + 1))


On second thought, maybe let's just keep this simple and calculate it the way it's done in the paper?

uint32_t maxkgram_1 = calc_number_of_substrings(mlen, spec->lb, spec->ub);
uint32_t maxkgram_2 = calc_number_of_substrings(cbclen, spec->lb, BSON_MIN(spec->ub, cbclen));
uint32_t msize = BSON_MIN(maxkgram_1, maxkgram_2);

Eh, I don't like doing more calculations just in order to match the paper. I feel that the variable names and the inline comments give enough reason for why this counting method is correct -- it's clear that we are padding the length to the lesser of cbclen or mlen, and then calculating how many substrings between length lb and ub there would be if the string actually was that max padded length.
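The single-sum form from the code comment, msize = sum_(j=lb, min(ub, cbclen), (min(mlen, cbclen) - j + 1)), can be sketched as below. `demo_count_substrings` and `demo_msize` are hypothetical helpers written for this thread, not functions from the PR. The sketch clamps the upper bound to the padded length, assumes lb >= 1, and assumes inputs small enough that the running total does not overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Number of substrings with length in [lb, ub] of a string of n codepoints:
 * sum over j = lb .. min(ub, n) of (n - j + 1). Returns 0 when lb > n. */
uint32_t demo_count_substrings(uint32_t n, uint32_t lb, uint32_t ub) {
    assert(lb >= 1 && lb <= ub);
    uint32_t hi = ub < n ? ub : n;
    uint32_t total = 0;
    for (uint32_t j = lb; j <= hi; j++) {
        total += n - j + 1; /* a length-j substring can start at n - j + 1 positions */
    }
    return total;
}

/* Pad to the lesser of mlen and cbclen, then count as if the string had that length. */
uint32_t demo_msize(uint32_t mlen, uint32_t cbclen, uint32_t lb, uint32_t ub) {
    uint32_t padded_len = mlen < cbclen ? mlen : cbclen;
    return demo_count_substrings(padded_len, lb, ub);
}
```

For example, with mlen = 5, cbclen = 16, lb = 2, ub = 3, the padded length is 5 and the count is (5 - 2 + 1) + (5 - 3 + 1) = 7.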

kevinAlbs

Nice work. Only substantial comment is to limit string lengths to prevent possible overflows.

kevinAlbs · 2025-01-15T15:00:31Z

test/test-mc-text-search-str-encode.c

+    fprintf(stderr,
+            "Testing nofold suffix/prefix case: str=\"%s\", lb=%u, ub=%u, unfolded_codepoint_len=%u\n",
+            str,
+            lb,
+            ub,
+            unfolded_codepoint_len);


Suggest using the TEST_PRINTF and TEST_STDERR_PRINTF macros to flush stdout/stderr and avoid mixed output in Evergreen logs. The macros were recently introduced in b193dba. Run git merge master to include them.

TEST_STDERR_PRINTF("Testing nofold suffix/prefix case: str=\"%s\", lb=%u, ub=%u, unfolded_codepoint_len=%u\n", str, lb, ub, unfolded_codepoint_len);

kevinAlbs · 2025-01-16T16:56:03Z

test/test-mc-text-search-str-encode.c

+#undef MIN
+#define MIN(a, b) (((a) < (b)) ? (a) : (b))


Suggested change

#undef MIN

#define MIN(a, b) (((a) < (b)) ? (a) : (b))

Suggest using BSON_MIN to simplify.

kevinAlbs · 2025-01-16T16:58:48Z

src/mc-fle2-encryption-placeholder-private.h

@@ -119,6 +119,58 @@ bool mc_FLE2RangeInsertSpec_parse(mc_FLE2RangeInsertSpec_t *out,
                                  bool use_range_v2,
                                  mongocrypt_status_t *status);

+typedef struct {
+    // mlen is the max string length that can be indexed.


Suggested change

// mlen is the max string length that can be indexed.

// mlen is the max string length (in characters, not bytes) that can be indexed.

kevinAlbs · 2025-01-16T20:23:20Z

test/test-mc-text-search-str-encode.c

+            mc_FLE2TextSearchInsertSpec_t spec =
+                {str, byte_len, {{0, 0, 0}, false}, {{lb, ub}, true}, {{0, 0}, false}, false, false};


Suggested change

mc_FLE2TextSearchInsertSpec_t spec =

{str, byte_len, {{0, 0, 0}, false}, {{lb, ub}, true}, {{0, 0}, false}, false, false};

mc_FLE2TextSearchInsertSpec_t spec = {.v = str, .len = byte_len, .suffix = {{lb, ub}, true}};

Suggest using designated initializers and omitting fields that are expected to be zero-initialized to improve readability.
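As a small illustration of why this is safe: in C99 and later, members omitted from a designated initializer list are zero-initialized. The struct below is a simplified, hypothetical stand-in (`demo_spec_t` and its field names are assumptions), not the real `mc_FLE2TextSearchInsertSpec_t`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified spec types for illustration only. */
typedef struct {
    uint32_t lb;
    uint32_t ub;
} demo_affix_params_t;

typedef struct {
    demo_affix_params_t value;
    bool set;
} demo_optional_affix_t;

typedef struct {
    const char *v;
    uint32_t len;
    demo_optional_affix_t suffix;
    demo_optional_affix_t prefix;
    bool casef;
    bool diacf;
} demo_spec_t;

/* Builds a suffix-only spec. prefix, casef, and diacf are not named in the
 * initializer, so they are zero-initialized (false / 0). */
demo_spec_t demo_make_suffix_spec(const char *str, uint32_t byte_len, uint32_t lb, uint32_t ub) {
    assert(str != NULL);
    demo_spec_t spec = {.v = str, .len = byte_len, .suffix = {{lb, ub}, true}};
    return spec;
}
```

Compared with the positional form, a reader can see which option is enabled without counting braces.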

kevinAlbs · 2025-01-16T20:30:08Z

test/test-mc-text-search-str-encode.c

+        uint32_t affix_count = 0;
+        uint32_t total_real_affix_count = 0;
+        while (mc_affix_set_iter_next(&it, &affix, &affix_len, &affix_count)) {
+            // Since all substrings are just views on the base string, we can use pointer math to find our start and


Suggested change

// Since all substrings are just views on the base string, we can use pointer math to find our start and

// Since all substrings are just views on the base string, we can use pointer math to find our start and end

kevinAlbs · 2025-01-16T21:05:23Z

test/test-mc-text-search-str-encode.c

+    ASSERT(sets->exact.len == byte_len);
+    ASSERT(0 == memcmp(sets->exact.data, str, byte_len));
+
+    if (unfolded_codepoint_len > mlen || lb > max_padded_len) {


Suggested change

if (unfolded_codepoint_len > mlen || lb > max_padded_len) {

if (lb > max_padded_len) {

Redundant with above check.

kevinAlbs · 2025-01-17T14:02:52Z

src/mc-str-encode-string-sets.c

+    }
+    set->start_indices[idx] = base_start_idx;
+    set->end_indices[idx] = base_end_idx;
+    set->substring_counts[idx] = 1;


Consider storing and incrementing the current set size in mc_affix_set_t, rather than requiring callers to track the index:

set->start_indices[set->cur_idx] = base_start_idx;
set->end_indices[set->cur_idx] = base_end_idx;
set->substring_counts[set->cur_idx] = 1;
set->cur_idx++;

That may help avoid exposing implementation details of mc_affix_set_t to the caller.
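A fuller sketch of that idea, under the assumption of a fixed-capacity set: the set tracks its own insertion position and rejects inserts once full. `demo_affix_set_t`, `demo_affix_set_insert`, and `DEMO_CAPACITY` are hypothetical names, not the PR's types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CAPACITY 8

typedef struct {
    uint32_t start_indices[DEMO_CAPACITY];
    uint32_t end_indices[DEMO_CAPACITY];
    uint32_t substring_counts[DEMO_CAPACITY];
    uint32_t cur_idx; /* number of entries inserted so far */
} demo_affix_set_t;

/* Appends one entry at the set's own cursor. The caller never supplies an index.
 * Returns false for a reversed range or when the set is full. */
bool demo_affix_set_insert(demo_affix_set_t *set, uint32_t base_start_idx, uint32_t base_end_idx) {
    assert(set != NULL);
    if (base_start_idx > base_end_idx || set->cur_idx >= DEMO_CAPACITY) {
        return false;
    }
    set->start_indices[set->cur_idx] = base_start_idx;
    set->end_indices[set->cur_idx] = base_end_idx;
    set->substring_counts[set->cur_idx] = 1;
    set->cur_idx++;
    return true;
}
```

The capacity check also moves into one place instead of being repeated at each call site.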

kevinAlbs · 2025-01-17T14:05:45Z

src/mc-str-encode-string-sets.c

+    it->cur_idx = 0;
+}
+
+bool mc_affix_set_iter_next(mc_affix_set_iter_t *it, const char **str, uint32_t *len, uint32_t *count) {


Suggested change

bool mc_affix_set_iter_next(mc_affix_set_iter_t *it, const char **str, uint32_t *len, uint32_t *count) {

bool mc_affix_set_iter_next(mc_affix_set_iter_t *it, const char **str, uint32_t *byte_len, uint32_t *count) {

To clarify output is byte length, not character length.
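The distinction matters for any non-ASCII input. As a quick sketch (with `demo_utf8_codepoint_count` being a hypothetical helper that assumes already-valid UTF-8 and does no validation), the codepoint count can be derived by skipping continuation bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts codepoints in a buffer assumed to hold valid UTF-8 by counting every
 * byte that is not a continuation byte (bit pattern 10xxxxxx). */
uint32_t demo_utf8_codepoint_count(const char *buf, uint32_t byte_len) {
    assert(buf != NULL);
    uint32_t count = 0;
    for (uint32_t i = 0; i < byte_len; i++) {
        if (((unsigned char)buf[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For "caf\xc3\xa9" the byte length is 5 while the codepoint count is 4, so an unqualified `len` is ambiguous.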

kevinAlbs · 2025-01-17T14:06:58Z

src/mc-str-encode-string-sets.c

+// Linked list node in the hashset.
+typedef struct _mc_substring_set_node_t {
+    uint32_t start_offset;
+    uint32_t len;


Suggested change

uint32_t len;

uint32_t byte_len;

kevinAlbs · 2025-01-17T14:57:43Z

src/mc-str-encode-string-sets.c

+mc_utf8_string_with_bad_char_t *mc_utf8_string_with_bad_char_from_buffer(const char *buf, uint32_t len) {
+    BSON_ASSERT_PARAM(buf);
+    mc_utf8_string_with_bad_char_t *ret = bson_malloc0(sizeof(mc_utf8_string_with_bad_char_t));
+    _mongocrypt_buffer_init_size(&ret->buf, len + 1);


I expect len + 1 could overflow if len is UINT32_MAX. Similarly, the CBC length calculations may overflow when adding 15.

I suggest rejecting too-long strings in mc_text_search_str_encode, and using a limit much smaller than UINT32_MAX. If the limit is near UINT32_MAX, these operations may be too slow to be useful and could risk a denial-of-service attack. Consider using 16 MiB (16777216 bytes) to match the maximum insert size of a BSON document (maxBsonObjectSize from the hello command reply).
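A rough sketch of such a guard follows. `DEMO_MAX_ENCODE_BYTE_LEN` and `demo_checked_lengths` are hypothetical, and the round-up-to-16 expression is only one plausible place where a "+ 15" appears; the real CBC length formula in the PR may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cap of 16 MiB, following the suggestion above. */
#define DEMO_MAX_ENCODE_BYTE_LEN (16u * 1024u * 1024u)

/* Rejects over-long input first, so the additions below cannot wrap around.
 * On success, writes len + 1 (room for one extra byte) and len rounded up to
 * a multiple of 16. */
bool demo_checked_lengths(uint32_t len, uint32_t *with_extra_byte, uint32_t *rounded_to_16) {
    assert(with_extra_byte != NULL && rounded_to_16 != NULL);
    if (len > DEMO_MAX_ENCODE_BYTE_LEN) {
        return false;
    }
    *with_extra_byte = len + 1u;
    *rounded_to_16 = ((len + 15u) / 16u) * 16u;
    return true;
}
```

Without the early return, len = UINT32_MAX would make len + 1 wrap to 0 and the allocation would be sized incorrectly.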

marksg07 added 10 commits December 20, 2024 20:09

MONGOCRYPT-755 Implement StrEncode

70e2ef4

Comments + cleanup

fe6f93b

more comments

c8678c8

fix

5215b80

fix ff

e5e8c58

fix

92bfeb0

f

ceacd48

windows

cbd420d

ll

54f6815

lld

481f378

marksg07 requested review from kevinAlbs, erwee and markbenvenuto December 31, 2024 20:07

marksg07 marked this pull request as ready for review December 31, 2024 20:07

Merge branch 'master' into marksg07/mongocrypt-755

85a12ba

erwee requested changes Jan 2, 2025

marksg07 marked this pull request as draft January 6, 2025 16:02

marksg07 removed request for kevinAlbs and markbenvenuto January 6, 2025 16:02

marksg07 added 3 commits January 6, 2025 21:23

unicode

723427d

comment

0286858

comments

b0c023f

marksg07 requested a review from erwee January 6, 2025 21:36

marksg07 added 2 commits January 6, 2025 21:43

const

cb6bcf2

windows

10792c2

marksg07 marked this pull request as ready for review January 7, 2025 16:20

marksg07 requested review from kevinAlbs and markbenvenuto January 7, 2025 16:20

erwee requested changes Jan 9, 2025

marksg07 added 4 commits January 10, 2025 22:25

Hashset

4bcba8a

PR fixes

3e0301e

fix bug

dad5688

a

48f80c1

marksg07 requested a review from erwee January 13, 2025 21:18

marksg07 added 2 commits January 13, 2025 21:59

more leaks

59e5944

Merge branch 'master' into marksg07/mongocrypt-755

67b5d07

erwee requested changes Jan 15, 2025

Fixes

d8f11cb

marksg07 requested a review from erwee January 15, 2025 21:59

erwee approved these changes Jan 16, 2025

kevinAlbs reviewed Jan 17, 2025
