Skip to content

[Rust] FlatBufferBuilder::create_shared_string is O(N^2) #8687

@PollRobots

Description

@PollRobots

The string_pool is maintained as a sorted list of offsets, which when create_shared_string is called are searched using slice::binary_search_by

If a match is not found, then the string is stored using create_byte_string and the index is inserted into string_pool using Vec::insert

So the happy path is an O(LogN) binary search

The unhappy path is an O(N) vector insertion

Every happy path requires a preceding unhappy path, so if you have N unique strings and M duplicates, then the cost of all calls to create_shared_string becomes O(N^2) + O(MLogN) or simply O(N^2)

I have a stress test that is writing a flatbuffer with ~500,000 unique strings, 80% of our serialization time is in create_shared_string

I can reduce this to ~30% of our serialization time (total serialization time goes from to 20.4s to 4.1s) by

  1. changing the type of string_pool to HashMap<String, WIPOffset<&'fbb str>>, and
  2. changing the implementation of create_shared_string to
    #[inline]
    pub fn create_shared_string<'a: 'b, 'b>(&'a mut self, s: &'b str) -> WIPOffset<&'fbb str> {
        self.assert_not_nested(
            "create_shared_string can not be called when a table or vector is under construction",
        );

        match self.strings_pool.get(s) {
            Some(address) => address,
            None => {
                let address = WIPOffset::new(self.create_byte_string(s.as_bytes()).value());
                self.strings_pool.insert(s.to_owned(), address);
                address
            }
        }
    }

Metadata

Metadata

Assignees

No one assigned

    Labels

    pr-requestedA Pull Request is requested to move the issue forward.rust

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions