
SCIP Tree-sitter CLI evaluation logic + workspace indexing mode #57894

Merged
keynmol merged 32 commits into main from scip-treeistter-cli-evaluate
Nov 14, 2023
Conversation

@keynmol (Contributor) commented Oct 25, 2023

This PR:

  • Refactors the indexing subcommand and separates it from the main entry point
  • Adds a workspace indexing mode - this is very useful for pointing the index + evaluate command at an existing, fully set-up project where scip-java can be run
  • Moves some common helpers into lib.rs
  • Adds the evaluation command (a hypothetical sketch of the subcommand layout follows this list)
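For orientation only: a purely hypothetical sketch of how separate index and evaluate subcommands could be laid out with a clap-style parser. The real CLI's argument parser, structure, and flag names are not shown on this page, so everything below is an assumption.

use clap::{Parser, Subcommand};
use std::path::PathBuf;

// Hypothetical layout; not the scip-treesitter CLI's actual definition.
#[derive(Parser)]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Index individual files or an entire workspace.
    Index {
        /// Hypothetical flag for the workspace indexing mode.
        #[arg(long)]
        workspace: Option<PathBuf>,
        /// Optionally evaluate the fresh index against a ground-truth SCIP index.
        #[arg(long)]
        evaluate: Option<PathBuf>,
    },
    /// Compare a candidate SCIP index against a ground-truth one.
    Evaluate {
        candidate: PathBuf,
        ground_truth: PathBuf,
    },
}

fn main() {
    let _cli = Cli::parse();
    // Dispatch on _cli.command here.
}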

Test plan

  • Currently, only tests for the evaluation fundamentals are added
  • The next PR will add snapshot tests for the precision/recall outputs (standard definitions, sketched below), plus some test codebases
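For context, the precision/recall numbers follow the standard definitions, applied to weighted true-positive/false-positive/false-negative totals (see the commit notes below). A minimal sketch, with illustrative names rather than this PR's actual code:

// Standard precision/recall over (possibly fractional) weighted counts.
fn precision_recall(tp: f64, fp: f64, fn_: f64) -> (f64, f64) {
    // precision = TP / (TP + FP); recall = TP / (TP + FN)
    let precision = if tp + fp > 0.0 { tp / (tp + fp) } else { 0.0 };
    let recall = if tp + fn_ > 0.0 { tp / (tp + fn_) } else { 0.0 };
    (precision, recall)
}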

@cla-bot cla-bot bot added the cla-signed label Oct 25, 2023
@keynmol keynmol changed the title SCIP Treesitter CLI evaluation logic SCIP Treesitter CLI evaluation logic + workspace indexing mode Oct 26, 2023
Candidate ambiguity is a measure of how detailed the candidate SCIP index is in
comparison to the ground truth.

When a candidate symbol has high ambiguity, it means that it occurs in many
places where the ground-truth SCIP index uses different symbols.

Method overloads in Java demonstrate this: if you have 20 overloads of the
same method (but with different parameters), scip-java actually produces 20
different symbols (e.g. "NodeRenderer#render(+19)").

Our current approach produces just a single symbol, "NodeRenderer#render()",
for all of those occurrences.

This commit penalises such occurrences by the logarithm of ambiguity.
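The exact penalty formula isn't spelled out here; as a rough sketch of the idea, assuming the penalty divides a pair's weight by 1 + ln(ambiguity) (the helper name and formula are assumptions, not this PR's code):

// Dampen a candidate symbol's contribution by the logarithm of its ambiguity,
// i.e. the number of distinct ground-truth symbols it overlaps with.
fn penalised_weight(raw_weight: f64, ambiguity: usize) -> f64 {
    // An unambiguous candidate (ambiguity == 1) keeps its weight unchanged;
    // one covering 20 overloads is divided by roughly 1 + ln(20) ≈ 4.
    raw_weight / (1.0 + (ambiguity as f64).ln())
}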
After computing the weights (using the same Jaccard measure) of the individual
(candidate, ground truth) symbol pairs, we collect all the ground-truth symbols
that can be assigned to a given candidate and normalise the weight of each pair
by dividing it by the sum of all of that candidate's weights.

The idea behind this is to reassert the fact that the mapping of symbols is
fuzzy, and therefore we shouldn't select just one symbol - instead we spread
the fuzziness over all the occurrences and normalise the weights so they add up
to one.

Note that for a single alternative the weight will be 1, but that's not a
problem: even if some occurrences were missed, they will be counted as false
negatives, heavily discounting the effect of this spurious 1.0 true positive.
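A minimal sketch of that normalisation step, assuming the Jaccard weights live in a map keyed by (candidate, ground truth) symbol pairs; the names and types are illustrative, not this PR's actual code:

use std::collections::HashMap;

// Illustrative aliases; the real symbol types are not shown here.
type Candidate = String;
type GroundTruth = String;

// Rescale each candidate's pair weights so that, per candidate, they sum to 1.0.
fn normalise_weights(
    weights: &HashMap<(Candidate, GroundTruth), f64>,
) -> HashMap<(Candidate, GroundTruth), f64> {
    // Total weight each candidate symbol has spread over ground-truth symbols.
    let mut totals: HashMap<&Candidate, f64> = HashMap::new();
    for ((candidate, _), w) in weights {
        *totals.entry(candidate).or_insert(0.0) += w;
    }
    // Divide every pair's weight by its candidate's total; a candidate with a
    // single alternative ends up with exactly 1.0, as noted above.
    weights
        .iter()
        .map(|((c, g), w)| ((c.clone(), g.clone()), w / totals[c]))
        .collect()
}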
@keynmol keynmol marked this pull request as ready for review November 7, 2023 12:14
@varungandhi-src (Contributor) commented:

Could you update the checklist in https://github.com/sourcegraph/sourcegraph/issues/58005 with some more details reflecting the sub-tasks involved? That'd be helpful for understanding what's done/completed in the main issue itself, rather than having that info scattered across multiple PR descriptions.

@varungandhi-src varungandhi-src mentioned this pull request Nov 9, 2023
Comment on lines +38 to +44
if occs == 0 {
    Err(anyhow!(
        "Index contains no occurrences and cannot be used for evaluation"
    ))
} else {
    Ok(())
}

Suggested change
-if occs == 0 {
-    Err(anyhow!(
-        "Index contains no occurrences and cannot be used for evaluation"
-    ))
-} else {
-    Ok(())
-}
+if occs == 0 {
+    return Err(anyhow!(
+        "Index contains no occurrences and cannot be used for evaluation"
+    ));
+}
+Ok(())

Slightly shorter

Comment on lines +32 to +36
let mut occs = 0;

for doc in &idx.documents {
    occs += doc.occurrences.len();
}

Suggested change
-let mut occs = 0;
-for doc in &idx.documents {
-    occs += doc.occurrences.len();
-}
+let num_occs: usize = idx.documents.iter().map(|d| d.occurrences.len()).sum();

Comment on lines 70 to 71
// For each symbol pair we maintain an Overlap instance
let mut overlaps: HashMap<SymbolPair, Overlap> = HashMap::new();

nit: The comment is not adding anything over the type annotation.

@varungandhi-src varungandhi-src changed the title SCIP Treesitter CLI evaluation logic + workspace indexing mode SCIP Tree-sitter CLI evaluation logic + workspace indexing mode Nov 10, 2023
@varungandhi-src (Contributor) left a comment:

I don't care about the style comments that much, but please at least add help text for the different arguments.

@keynmol keynmol merged commit aa76466 into main Nov 14, 2023
@keynmol keynmol deleted the scip-treeistter-cli-evaluate branch November 14, 2023 13:02
vovakulikov pushed a commit that referenced this pull request Dec 12, 2023
* Support indexing entire workspace
* Separate evaluate and index subcommands into different modules
* Add --evaluate argument to index command, use serde for json
* Punish candidates with high ambiguity
* Introduce normalised weighting of candidates
* bzl: Remove library target from Rust crate (#58221)

* fix: Stop sanitizing path unnecessarily

* cleanup: Remove incorrect dep on CLI in highlighter binary

* bzl: Re-add library target for cargo compat

* build: Add comment for build targets

* config: Hoist walkdir to workspace-level dep

---------

Co-authored-by: Varun Gandhi <varun.gandhi@sourcegraph.com>