-
Notifications
You must be signed in to change notification settings - Fork 1.3k
SCIP-aware symbol storage #60800
Description
(NOTE: This is currently just an idea, not an actively worked on project.)
In our database today, we essentially have a Radix tree of SCIP symbol names stored in the codeintel_scip_symbol_names table.
The reason we use a radix tree instead of storing the SCIP symbol strings (or not storing them at all) directly is because:
- We want to reduce disk usage/table size (this table is one of the biggest in the code intel database)
- We need to be able to materialize the symbol strings during querying (see this CTE)
However, the radix tree does not take into account the structure of SCIP at all, making it very hard to implement different forms of matching on symbol names (such as that requested in https://github.com/sourcegraph/sourcegraph/issues/59957) other than exact matching.
Instead, we could utilize the structure of SCIP symbols:
<scheme> ' ' <manager> ' ' <package-name> ' ' <version> ' ' (<descriptor>)+
Out of these, there are probably not that many <scheme> ' ' <manager> ' ' <package-name> ' ' <version> prefixes in total anyways. In terms of the Radix tree, the total number of nodes near the root up to a depth of 3-4 are probably not that high. We could leverage this to split out the prefixes into a more structured form:
codeintel_scip_symbol_names_v2
id | upload_id | scheme_manager | package_name | package_version | symbol_ids : (array[int])
codeintel_scip_symbol_descriptors
id | reference_count | name_segment | prefix_id
So essentially splitting the symbol name into two, with the prefix stored in an easily accessible fashion.
Like the old schema, this new schema would allow:
- Searching for symbols based on exact package name and version
- Reasonably fast deletion for uploads - Identify rows in
codeintel_scip_symbol_names_v2, iterate oversymbol_idsarrays, apply decrements toreference_count, and delete rows withreference_count == 0.
Unlike the old schema, this new schema would:
- Allow searching for packages and versions in a fuzzy way, as requested in https://github.com/sourcegraph/sourcegraph/issues/59957
- Will likely actually reduce the storage significantly because we no longer have different rows when the descriptors are repeated across different uploads.
- (Optional) This space savings might allow us to be a bit loose with the Radix tree and break the
name_segmentvalues based on descriptors rather than at random intervals, in case that helps with better insertion logic.- Sketch of insertion logic: construct
DescriptorPrefixTreein Go for a batch by splitting symbol strings at descriptor boundaries, do a breadth first traversal, at each tree depth retrieving the IDs for the prefixes and then perform an insert-or-bump-refcount with the(name_segment, prefix_id)pair.
- Sketch of insertion logic: construct
- (Optional) This space savings might allow us to be a bit loose with the Radix tree and break the
Added downsides of new schema:
- We'd need to storage two integers in
codeintel_scip_symbolsinstead of just one that we're using now (symbol_id), because the symbol itself would be broken up into two chunks (one "flat", one tree-ified). That seems OK given the storage we'd recover from de-duplicating the trees across uploads. - (Risk) Because of the ref count bumping, there's the potential that the write path (= inserting uploads) interferes with reads (i.e. code nav).
- (Risk) Adding an index for
(name_segment, prefix_id)for insertions add a bunch of overhead/disk utilization.