-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MONGOCRYPT-755 Implement StrEncode #928
base: master
Are you sure you want to change the base?
Conversation
src/mc-str-encode-string-sets.c
Outdated
} | ||
|
||
bool mc_substring_set_insert(mc_substring_set_t *set, uint32_t base_start_idx, uint32_t base_end_idx) { | ||
if (base_start_idx > base_end_idx || base_end_idx >= set->base_string->codepoint_len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be BSON_ASSERTs I think.. also need to add BSON_ASSERT_PARAM(set).
The =
in base_end_idx >= set->base_string->codepoint_len
implies that there's an extra fake byte at the end, so this doesn't allow you to include that. What if we just don't include the additional 0xff byte in codepoint_len
, so that the caller doesn't have to think about the fake byte, and so the entire range of valid inputs for base_end_idx
is just [0, codepoint_len]
, instead of [0, codepoint_len-1]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like having codepoint_len
as the length of the codepoint_offsets
array. We give the fake byte its own codepoint offset, so I feel it should be included in the length. What I could do is, since there's nothing actually programmatically wrong with allowing base_end_idx == set->base_string->codepoint_len
, just allow that case but never use it. We can verify in testing (well, we already do) that we are never actually inserting the bad byte.
src/mc-str-encode-string-sets.c
Outdated
|
||
void mc_substring_set_iter_init(mc_substring_set_iter_t *it, mc_substring_set_t *set) { | ||
it->set = set; | ||
it->cur_node = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be it->cur_node = set->set[0]
because if not, the subsequent call to iter_next
will always skip index 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Fixed.
// maxkgram_2 = sum_(j=lb, ub, (cbclen - j + 1)) # same sum bounds as maxkgram_1 | ||
// msize = sum_(j=lb, ub, (min(mlen, cbclen) - j + 1)) | ||
// in both cases, msize can be rewritten as: | ||
// msize = sum_(j=lb, min(ub, cbclen), (min(mlen, cbclen) - j + 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, maybe let's just keep this simple and calculate it the way it's done in the paper?
uint32_t maxkgram_1 = calc_number_of_substrings(mlen, spec->lb, spec->ub);
uint32_t maxkgram_2 = calc_number_of_substrings(cbclen, spec->lb, BSON_MIN(spec->ub, cbclen));
uint32_t msize = BSON_MIN(maxkgram_1, maxkgram_2);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh, I don't like doing more calculations just in order to match the paper. I feel that the variable names and the inline comments give enough reason for why this counting method is correct -- it's clear that we are padding the length to the lesser of cbclen or mlen, and then calculating how many substrings between length lb and ub there would be if the string actually was that max padded length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work. Only substantial comment is to limit string lengths to prevent possible overflows.
fprintf(stderr, | ||
"Testing nofold suffix/prefix case: str=\"%s\", lb=%u, ub=%u, unfolded_codepoint_len=%u\n", | ||
str, | ||
lb, | ||
ub, | ||
unfolded_codepoint_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest using the TEST_PRINTF
and TEST_STDERR_PRINTF
macros to flush stdout/stderr and avoid mixed output in Evergreen logs. The macros were recently introduced in b193dba. Run git merge master
to include them.
TEST_STDERR_PRINTF("Testing nofold suffix/prefix case: str=\"%s\", lb=%u, ub=%u, unfolded_codepoint_len=%u\n",
str,
lb,
ub,
unfolded_codepoint_len);
#undef MIN | ||
#define MIN(a, b) (((a) < (b)) ? (a) : (b)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#undef MIN | |
#define MIN(a, b) (((a) < (b)) ? (a) : (b)) |
Suggest using BSON_MIN
to simplify.
@@ -119,6 +119,58 @@ bool mc_FLE2RangeInsertSpec_parse(mc_FLE2RangeInsertSpec_t *out, | |||
bool use_range_v2, | |||
mongocrypt_status_t *status); | |||
|
|||
typedef struct { | |||
// mlen is the max string length that can be indexed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// mlen is the max string length that can be indexed. | |
// mlen is the max string length (in characters, not bytes) that can be indexed. |
mc_FLE2TextSearchInsertSpec_t spec = | ||
{str, byte_len, {{0, 0, 0}, false}, {{lb, ub}, true}, {{0, 0}, false}, false, false}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mc_FLE2TextSearchInsertSpec_t spec = | |
{str, byte_len, {{0, 0, 0}, false}, {{lb, ub}, true}, {{0, 0}, false}, false, false}; | |
mc_FLE2TextSearchInsertSpec_t spec = {.v = str, .len = byte_len, .suffix = {{lb, ub}, true}}; |
Suggest using designated initializers and omitting fields that are expected to be zero-initialized to improve readability.
uint32_t affix_count = 0; | ||
uint32_t total_real_affix_count = 0; | ||
while (mc_affix_set_iter_next(&it, &affix, &affix_len, &affix_count)) { | ||
// Since all substrings are just views on the base string, we can use pointer math to find our start and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Since all substrings are just views on the base string, we can use pointer math to find our start and | |
// Since all substrings are just views on the base string, we can use pointer math to find our start and end |
ASSERT(sets->exact.len == byte_len); | ||
ASSERT(0 == memcmp(sets->exact.data, str, byte_len)); | ||
|
||
if (unfolded_codepoint_len > mlen || lb > max_padded_len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (unfolded_codepoint_len > mlen || lb > max_padded_len) { | |
if (lb > max_padded_len) { |
Redundant with above check.
} | ||
set->start_indices[idx] = base_start_idx; | ||
set->end_indices[idx] = base_end_idx; | ||
set->substring_counts[idx] = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider storing and incrementing the current set size in mc_affix_set_t
, rather than requiring callers to track the index:
set->start_indices[set->cur_idx] = base_start_idx;
set->end_indices[set->cur_idx] = base_end_idx;
set->substring_counts[set->cur_idx] = 1;
set->cur_idx++;
That may help avoid exposing implementation details of mc_affix_set_t
to the caller.
it->cur_idx = 0; | ||
} | ||
|
||
bool mc_affix_set_iter_next(mc_affix_set_iter_t *it, const char **str, uint32_t *len, uint32_t *count) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool mc_affix_set_iter_next(mc_affix_set_iter_t *it, const char **str, uint32_t *len, uint32_t *count) { | |
bool mc_affix_set_iter_next(mc_affix_set_iter_t *it, const char **str, uint32_t *byte_len, uint32_t *count) { |
To clarify output is byte length, not character length.
// Linked list node in the hashset. | ||
typedef struct _mc_substring_set_node_t { | ||
uint32_t start_offset; | ||
uint32_t len; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint32_t len; | |
uint32_t byte_len; |
mc_utf8_string_with_bad_char_t *mc_utf8_string_with_bad_char_from_buffer(const char *buf, uint32_t len) { | ||
BSON_ASSERT_PARAM(buf); | ||
mc_utf8_string_with_bad_char_t *ret = bson_malloc0(sizeof(mc_utf8_string_with_bad_char_t)); | ||
_mongocrypt_buffer_init_size(&ret->buf, len + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect this could overflow if len
is UINT32_MAX
. Similarly, the CBC length calculations may overflow when adding 15.
I suggest rejecting too-long strings in mc_text_search_str_encode
, and using a limit much smaller than UINT32_MAX
. If the limit is near UINT32_MAX
, these operations may be prohibitively slow to be useful and could risk a denial-of-service attack. Consider using 16MiB (16777216 bytes) to match the maximum insert size of a BSON document (maxBsonObjectSize
from the hello
command reply).
No description provided.