SCIP-aware symbol storage

(NOTE: This is currently just an idea, not an actively worked on project.)

In our database today, we essentially have a [Radix tree](https://en.wikipedia.org/wiki/Radix_tree) of SCIP symbol names stored in the [`codeintel_scip_symbol_names` table](https://sourcegraph.com/github.com/sourcegraph/sourcegraph@8e7dff7ab117c63a02c7224c827aea68ce45de08/-/blob/internal/database/schema.codeintel.md?L145-168).

The reason we use a radix tree instead of storing the SCIP symbol strings (or not storing them at all) directly is because:
1. We want to reduce disk usage/table size (this table is one of the biggest in the code intel database)
2. We need to be able to materialize the symbol strings during querying (see this [CTE](https://sourcegraph.com/github.com/sourcegraph/sourcegraph@8e7dff7ab117c63a02c7224c827aea68ce45de08/-/blob/internal/codeintel/codenav/internal/lsifstore/metadata_by_position.go?L118:7-118:20#tab=references))

However, the radix tree does not take into account the structure of SCIP at all, making it very hard to implement different forms of matching on symbol names (such as that requested in https://github.com/sourcegraph/sourcegraph/issues/59957) other than exact matching.

Instead, we could utilize the structure of SCIP symbols:

```
<scheme> ' ' <manager> ' ' <package-name> ' ' <version> ' ' (<descriptor>)+
```


Out of these, there are probably not that many `<scheme> ' ' <manager> ' ' <package-name> ' ' <version>` prefixes in total anyways. In terms of the Radix tree, the total number of nodes near the root up to a depth of 3-4 are probably not that high. We could leverage this to split out the prefixes into a more structured form:

```
codeintel_scip_symbol_names_v2
id | upload_id | scheme_manager | package_name | package_version | symbol_ids : (array[int])

codeintel_scip_symbol_descriptors
id | reference_count | name_segment | prefix_id
```

So essentially splitting the symbol name into two, with the prefix stored in an easily accessible fashion.

Like the old schema, this new schema would allow:
- Searching for symbols based on exact package name and version
- Reasonably fast deletion for uploads - Identify rows in `codeintel_scip_symbol_names_v2`, iterate over `symbol_ids` arrays, apply decrements to `reference_count`, and delete rows with `reference_count == 0`.

Unlike the old schema, this new schema would:
- Allow searching for packages and versions in a fuzzy way, as requested in https://github.com/sourcegraph/sourcegraph/issues/59957
- Will likely actually reduce the storage significantly because we no longer have different rows when the descriptors are repeated across different uploads.
    - (Optional) This space savings might allow us to be a bit loose with the Radix tree and break the `name_segment` values based on descriptors rather than at random intervals, in case that helps with better insertion logic.
      - Sketch of insertion logic: construct `DescriptorPrefixTree` in Go for a batch by splitting symbol strings at descriptor boundaries, do a breadth first traversal, at each tree depth retrieving the IDs for the prefixes and then perform an insert-or-bump-refcount with the `(name_segment, prefix_id)` pair.

Added downsides of new schema:
- We'd need to storage two integers in `codeintel_scip_symbols` instead of just one that we're using now (`symbol_id`), because the symbol itself would be broken up into two chunks (one "flat", one tree-ified). That seems OK given the storage we'd recover from de-duplicating the trees across uploads.
- (Risk) Because of the ref count bumping, there's the potential that the write path (= inserting uploads) interferes with reads (i.e. code nav).
- (Risk) Adding an index for `(name_segment, prefix_id)` for insertions add a bunch of overhead/disk utilization.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

SCIP-aware symbol storage #60800

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

SCIP-aware symbol storage #60800

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions