This repository was archived by the owner on Sep 30, 2024. It is now read-only.
SCIP Tree-sitter CLI evaluation logic + workspace indexing mode#57894
Merged
Candidate ambiguity is a measure of how detailed the candidate SCIP is in comparison to the ground truth. When a candidate symbol has high ambiguity, it occurs in many places where the ground truth SCIP uses different symbols. Method overloads in Java demonstrate this: if you have 20 overloads of the same method (with different parameters), scip-java produces 20 different symbols (e.g. "NodeRenderer#render(+19)"), whereas our current methods produce a single symbol "NodeRenderer#render()" for all those occurrences. This commit penalises such occurrences by the logarithm of ambiguity.
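To make the idea concrete, here is a minimal sketch of how ambiguity could be counted and applied. The function names, the string-based symbol representation, and in particular the penalty form `weight / (1 + ln(ambiguity))` are assumptions for illustration; the description above only says the penalty is logarithmic in ambiguity.

```rust
use std::collections::{HashMap, HashSet};

/// Ambiguity of each candidate symbol: the number of distinct ground-truth
/// symbols seen at the locations where that candidate symbol occurs.
/// `pairs` holds one (candidate symbol, ground-truth symbol) entry per
/// shared occurrence (hypothetical input shape).
fn ambiguity<'a>(pairs: &[(&'a str, &'a str)]) -> HashMap<&'a str, usize> {
    let mut seen: HashMap<&'a str, HashSet<&'a str>> = HashMap::new();
    for &(candidate, truth) in pairs {
        seen.entry(candidate).or_default().insert(truth);
    }
    seen.into_iter()
        .map(|(candidate, truths)| (candidate, truths.len()))
        .collect()
}

/// Hypothetical penalty: divide the weight by 1 + ln(ambiguity), so an
/// unambiguous candidate (ambiguity 1) keeps its full weight.
fn penalised(weight: f64, ambiguity: usize) -> f64 {
    weight / (1.0 + (ambiguity.max(1) as f64).ln())
}
```

With this form, a candidate that stands in for 20 overloads is discounted more than one that stands in for 2, but the discount grows slowly rather than linearly.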
After computing the weights (using the same Jaccard measure) of individual (candidate, ground truth) symbol pairs, we collect all the ground truth symbols that can be assigned to a given candidate and normalise the weight of each pair by dividing it by the sum of all weights for that candidate. The idea is to reassert that the mapping of symbols is fuzzy, so we shouldn't select just one symbol; instead we spread the fuzziness over all the occurrences and normalise the weights so they add up to one. Note that for a single alternative the weight will be 1, but that's not a problem: even if some occurrences were missed, they will be counted as false negatives, heavily discounting the effect of this spurious 1.0 TP.
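The normalisation step can be sketched as follows. This is an illustration under stated assumptions, not the code from this PR: occurrences are modelled as plain `u32` identifiers, symbols as `String`s, and `jaccard` and `normalise` are hypothetical names.

```rust
use std::collections::{HashMap, HashSet};

/// Jaccard similarity of two occurrence sets: |A ∩ B| / |A ∪ B|.
fn jaccard(a: &HashSet<u32>, b: &HashSet<u32>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// Divide the weight of each (candidate, ground truth) pair by the total
/// weight of all pairs sharing the same candidate, so that the weights of
/// each candidate add up to one.
fn normalise(weights: &HashMap<(String, String), f64>) -> HashMap<(String, String), f64> {
    // Sum the weights per candidate symbol.
    let mut totals: HashMap<&str, f64> = HashMap::new();
    for ((candidate, _), w) in weights {
        *totals.entry(candidate.as_str()).or_insert(0.0) += *w;
    }
    weights
        .iter()
        .map(|(pair, w)| {
            let total = totals[pair.0.as_str()];
            // Guard against a candidate whose weights are all zero.
            let share = if total > 0.0 { *w / total } else { 0.0 };
            (pair.clone(), share)
        })
        .collect()
}
```

A candidate with a single ground-truth alternative ends up with a weight of exactly 1, which is the "spurious 1.0 TP" case mentioned above.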
Contributor
Could you update the checklist in https://github.com/sourcegraph/sourcegraph/issues/58005 with some more details reflecting the sub-tasks involved? That'd be helpful in better understanding what's done/completed in the main issue itself, vs that info being scattered across multiple PR descriptions.
docker-images/syntax-highlighter/crates/scip-treesitter-cli/src/evaluate.rs
Comment on lines +38 to +44

    if occs == 0 {
        Err(anyhow!(
            "Index contains no occurrences and cannot be used for evaluation"
        ))
    } else {
        Ok(())
    }
Contributor
Suggested change

    - if occs == 0 {
    -     Err(anyhow!(
    -         "Index contains no occurrences and cannot be used for evaluation"
    -     ))
    - } else {
    -     Ok(())
    - }
    + if occs == 0 {
    +     return Err(anyhow!(
    +         "Index contains no occurrences and cannot be used for evaluation"
    +     ));
    + }
    + Ok(())
Slightly shorter
Comment on lines +32 to +36

    let mut occs = 0;

    for doc in &idx.documents {
        occs += doc.occurrences.len();
    }
Contributor
Suggested change

    - let mut occs = 0;
    - for doc in &idx.documents {
    -     occs += doc.occurrences.len();
    - }
    + let num_occs = idx.documents.iter().map(|d| d.occurrences.len()).sum();
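For reference, the two suggestions in this thread can be combined into one self-contained sketch. `Index`, `Document`, and `validate_index` are hypothetical stand-ins (the real code uses the SCIP index types and `anyhow`); a plain `String` error is used here only to keep the example free of external crates. Note that `sum()` usually needs its result type spelled out, so the binding carries a `usize` annotation.

```rust
// Hypothetical stand-ins for the SCIP index types; only the fields used here.
struct Document {
    occurrences: Vec<String>,
}

struct Index {
    documents: Vec<Document>,
}

/// Reject an index that contains no occurrences at all.
fn validate_index(idx: &Index) -> Result<(), String> {
    // `sum` is generic over its output, hence the explicit `usize`.
    let num_occs: usize = idx.documents.iter().map(|d| d.occurrences.len()).sum();
    if num_occs == 0 {
        return Err(
            "Index contains no occurrences and cannot be used for evaluation".to_string(),
        );
    }
    Ok(())
}
```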
docker-images/syntax-highlighter/crates/scip-treesitter-cli/src/evaluate.rs
Comment on lines 70 to 71

    // For each symbol pair we maintain an Overlap instance
    let mut overlaps: HashMap<SymbolPair, Overlap> = HashMap::new();
Contributor
nit: The comment is not adding anything over the type annotation.
varungandhi-src approved these changes on Nov 14, 2023
I don't care about the style comments that much, but please at least add help text for the different arguments.
vovakulikov pushed a commit that referenced this pull request on Dec 12, 2023

* Support indexing entire workspace
* Separate evaluate and index subcommands into different modules
* Add --evaluate argument to index command, use serde for json
* Punish candidates with high ambiguity

  Candidate ambiguity is a measure of how detailed the candidate SCIP is in comparison to the ground truth. When a candidate symbol has high ambiguity, it occurs in many places where the ground truth SCIP uses different symbols. Method overloads in Java demonstrate this: if you have 20 overloads of the same method (with different parameters), scip-java produces 20 different symbols (e.g. "NodeRenderer#render(+19)"), whereas our current methods produce a single symbol "NodeRenderer#render()" for all those occurrences. This commit penalises such occurrences by the logarithm of ambiguity.
* Introduce normalised weighting of candidates

  After computing the weights (using the same Jaccard measure) of individual (candidate, ground truth) symbol pairs, we collect all the ground truth symbols that can be assigned to a given candidate and normalise the weight of each pair by dividing it by the sum of all weights for that candidate. The idea is to reassert that the mapping of symbols is fuzzy, so we shouldn't select just one symbol; instead we spread the fuzziness over all the occurrences and normalise the weights so they add up to one. Note that for a single alternative the weight will be 1, but that's not a problem: even if some occurrences were missed, they will be counted as false negatives, heavily discounting the effect of this spurious 1.0 TP.
* bzl: Remove library target from Rust crate (#58221)
* fix: Stop sanitizing path unnecessarily
* cleanup: Remove incorrect dep on CLI in highlighter binary
* bzl: Re-add library target for cargo compat
* build: Add comment for build targets
* config: Hoist walkdir to workspace-level dep

Co-authored-by: Varun Gandhi <varun.gandhi@sourcegraph.com>