
SCIP Tree-sitter CLI evaluation logic + workspace indexing mode #57894

Merged
keynmol merged 32 commits into main from scip-treeistter-cli-evaluate
Nov 14, 2023
Conversation

@keynmol (Contributor) commented Oct 25, 2023

This PR:

  • Refactors the indexing subcommand and separates it from the main entry point
  • Adds a workspace indexing mode - this is very useful for pointing the index + evaluate command at an existing, fully set-up project where scip-java can be run
  • Moves some common helpers into lib.rs
  • Adds the evaluation command (a hypothetical sketch of the subcommand layout follows this list)
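For orientation only: a purely hypothetical sketch of how separate index and evaluate subcommands could be laid out with a clap-style parser. The real CLI's argument parser, structure, and flag names are not shown on this page, so everything below is an assumption.

use clap::{Parser, Subcommand};
use std::path::PathBuf;

// Hypothetical layout; not the scip-treesitter CLI's actual definition.
#[derive(Parser)]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Index individual files or an entire workspace.
    Index {
        /// Hypothetical flag for the workspace indexing mode.
        #[arg(long)]
        workspace: Option<PathBuf>,
        /// Optionally evaluate the fresh index against a ground-truth SCIP index.
        #[arg(long)]
        evaluate: Option<PathBuf>,
    },
    /// Compare a candidate SCIP index against a ground-truth one.
    Evaluate {
        candidate: PathBuf,
        ground_truth: PathBuf,
    },
}

fn main() {
    let _cli = Cli::parse();
    // Dispatch on _cli.command here.
}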

Test plan

  • Currently, only tests for the evaluation fundamentals are added
  • The next PR will add snapshot tests for the precision/recall outputs (standard definitions, sketched below), plus some test codebases
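For context, the precision/recall numbers follow the standard definitions, applied to weighted true-positive/false-positive/false-negative totals (see the commit notes below). A minimal sketch, with illustrative names rather than this PR's actual code:

// Standard precision/recall over (possibly fractional) weighted counts.
fn precision_recall(tp: f64, fp: f64, fn_: f64) -> (f64, f64) {
    // precision = TP / (TP + FP); recall = TP / (TP + FN)
    let precision = if tp + fp > 0.0 { tp / (tp + fp) } else { 0.0 };
    let recall = if tp + fn_ > 0.0 { tp / (tp + fn_) } else { 0.0 };
    (precision, recall)
}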

@cla-bot cla-bot bot added the cla-signed label Oct 25, 2023
@keynmol keynmol changed the title SCIP Treesitter CLI evaluation logic SCIP Treesitter CLI evaluation logic + workspace indexing mode Oct 26, 2023
Candidate ambiguity is a measure of how detailed the candidate SCIP index is in
comparison to the ground truth.

When a candidate symbol has high ambiguity, it means that it occurs in many
places where the ground-truth SCIP index uses different symbols.

Method overloads in Java demonstrate this: if you have 20 overloads of the
same method (but with different parameters), scip-java actually produces 20
different symbols (e.g. "NodeRenderer#render(+19)").

Our current approach produces just a single symbol, "NodeRenderer#render()",
for all of those occurrences.

This commit penalises such occurrences by the logarithm of ambiguity.
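The exact penalty formula isn't spelled out here; as a rough sketch of the idea, assuming the penalty divides a pair's weight by 1 + ln(ambiguity) (the helper name and formula are assumptions, not this PR's code):

// Dampen a candidate symbol's contribution by the logarithm of its ambiguity,
// i.e. the number of distinct ground-truth symbols it overlaps with.
fn penalised_weight(raw_weight: f64, ambiguity: usize) -> f64 {
    // An unambiguous candidate (ambiguity == 1) keeps its weight unchanged;
    // one covering 20 overloads is divided by roughly 1 + ln(20) ≈ 4.
    raw_weight / (1.0 + (ambiguity as f64).ln())
}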
After computing the weights (using the same Jaccard measure) of the individual
(candidate, ground truth) symbol pairs, we collect all the ground-truth symbols
that can be assigned to a given candidate and normalise the weight of each pair
by dividing it by the sum of all of that candidate's weights.

The idea behind this is to reassert the fact that the mapping of symbols is
fuzzy, and therefore we shouldn't select just one symbol - instead we spread
the fuzziness over all the occurrences and normalise the weights so they add up
to one.

Note that for a single alternative the weight will be 1, but that's not a
problem: even if some occurrences were missed, they will be counted as false
negatives, heavily discounting the effect of this spurious 1.0 true positive.
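A minimal sketch of that normalisation step, assuming the Jaccard weights live in a map keyed by (candidate, ground truth) symbol pairs; the names and types are illustrative, not this PR's actual code:

use std::collections::HashMap;

// Illustrative aliases; the real symbol types are not shown here.
type Candidate = String;
type GroundTruth = String;

// Rescale each candidate's pair weights so that, per candidate, they sum to 1.0.
fn normalise_weights(
    weights: &HashMap<(Candidate, GroundTruth), f64>,
) -> HashMap<(Candidate, GroundTruth), f64> {
    // Total weight each candidate symbol has spread over ground-truth symbols.
    let mut totals: HashMap<&Candidate, f64> = HashMap::new();
    for ((candidate, _), w) in weights {
        *totals.entry(candidate).or_insert(0.0) += w;
    }
    // Divide every pair's weight by its candidate's total; a candidate with a
    // single alternative ends up with exactly 1.0, as noted above.
    weights
        .iter()
        .map(|((c, g), w)| ((c.clone(), g.clone()), w / totals[c]))
        .collect()
}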
@keynmol keynmol marked this pull request as ready for review November 7, 2023 12:14
@varungandhi-src (Contributor) commented:

Could you update the checklist in https://github.com/sourcegraph/sourcegraph/issues/58005 with some more details reflecting the sub-tasks involved? That'd be helpful for understanding what's done/completed in the main issue itself, rather than having that info scattered across multiple PR descriptions.

@varungandhi-src varungandhi-src mentioned this pull request Nov 9, 2023
Comment on lines +38 to +44
if occs == 0 {
    Err(anyhow!(
        "Index contains no occurrences and cannot be used for evaluation"
    ))
} else {
    Ok(())
}

Suggested change
-if occs == 0 {
-    Err(anyhow!(
-        "Index contains no occurrences and cannot be used for evaluation"
-    ))
-} else {
-    Ok(())
-}
+if occs == 0 {
+    return Err(anyhow!(
+        "Index contains no occurrences and cannot be used for evaluation"
+    ));
+}
+Ok(())

Slightly shorter

Comment on lines +32 to +36
let mut occs = 0;

for doc in &idx.documents {
    occs += doc.occurrences.len();
}

Suggested change
-let mut occs = 0;
-for doc in &idx.documents {
-    occs += doc.occurrences.len();
-}
+let num_occs: usize = idx.documents.iter().map(|d| d.occurrences.len()).sum();

Comment on lines 70 to 71
// For each symbol pair we maintain an Overlap instance
let mut overlaps: HashMap<SymbolPair, Overlap> = HashMap::new();

nit: The comment is not adding anything over the type annotation.

@varungandhi-src varungandhi-src changed the title SCIP Treesitter CLI evaluation logic + workspace indexing mode SCIP Tree-sitter CLI evaluation logic + workspace indexing mode Nov 10, 2023
@varungandhi-src (Contributor) left a comment:

I don't care about the style comments that much, but please at least add help text for the different arguments.

@keynmol keynmol merged commit aa76466 into main Nov 14, 2023
@keynmol keynmol deleted the scip-treeistter-cli-evaluate branch November 14, 2023 13:02
vovakulikov pushed a commit that referenced this pull request Dec 12, 2023
* Support indexing entire workspace
* Separate evaluate and index subcommands into different modules
* Add --evaluate argument to index command, use serde for json
* Punish candidates with high ambiguity
* Introduce normalised weighting of candidates
* bzl: Remove library target from Rust crate (#58221)

* fix: Stop sanitizing path unnecessarily

* cleanup: Remove incorrect dep on CLI in highlighter binary

* bzl: Re-add library target for cargo compat

* build: Add comment for build targets

* config: Hoist walkdir to workspace-level dep

---------

Co-authored-by: Varun Gandhi <varun.gandhi@sourcegraph.com>