Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Extract reposcheduler and add support for syntactic indexes #60958

Merged
merged 29 commits into from
Apr 2, 2024

Conversation

keynmol
Copy link
Contributor

@keynmol keynmol commented Mar 8, 2024

Part of #59795
Fixes ENG-23423

This PR's main goal is to extract the logic that selects repositories that should be scanned for new commits-candidates for syntactic indexing.

As part of this work, I've extracted this specific functionality into a new component, called reposcheduler, to reduce the area of responsibility assumed by auto-indexing service.

This is a relatively minor refactoring, mostly invisible from the outside.

The implementation of reposcheduling logic was mostly ported over, with some changes to SQL queries.

We are using a new table, instead of a index_type extra field on the lsif_last_index_scan table, because there is currently no way to introduce a composite uniqueness constraint (repository_id, index_type) in a backwards compatible way.

This PR will be squashed and merged. Old commits are left in PR history to refer to previous approaches.

Test plan

  • Existing tests that cover precise indexing
  • A copy of those tests that cover syntactic testing in isolation
  • Table tests that operate on shared tests and prove isolation of precise and syntactic schedulers

@cla-bot cla-bot bot added the cla-signed label Mar 8, 2024
@keynmol keynmol marked this pull request as ready for review March 21, 2024 15:45
Copy link
Contributor

@varungandhi-src varungandhi-src left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a first round visiting 20/30 files, will do a second round later since this PR is quite big. 😬

Comment on lines 26 to 34
"codeintel_autoindexing_store",
metrics.WithLabels("op"),
metrics.WithCountHelp("Total number of method invocations."),
)
})

op := func(name string) *observation.Operation {
return observationCtx.Operation(observation.Op{
Name: fmt.Sprintf("codeintel.reposcheduler.store.%s", name),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these two names be more similar/do you know what both of these are doing?

Copy link
Contributor

@varungandhi-src varungandhi-src left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM, would be good to have more clarity on the test logic and the decision to set the boolean to be on by default.

@keynmol keynmol requested a review from varungandhi-src March 27, 2024 11:06
Copy link
Contributor

@varungandhi-src varungandhi-src left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the changes. The test cases in particular are much more easily understandable.

Comment on lines +94 to +112
// The following tests simulate the passage of time (i.e. repeated scheduled invocations of the repo scheduling logic)
// T is time

// N.B. We use 1 hour process delay in all those tests,
// which means that IF the repository id was returned, it will be put on cooldown for an hour

// T = 0: No records in last index scan table, so we return all repos permitted by the limit parameter
assertRepoList(t, tt.store, NewBatchOptions(1*time.Hour, true, nil, 2), now, []int{50, 51})

// T + 20 minutes: first two repositories are still on cooldown
assertRepoList(t, tt.store, NewBatchOptions(1*time.Hour, true, nil, 100), now.Add(time.Minute*20), []int{52, 53})

// T + 30 minutes: all repositories are on cooldown
assertRepoList(t, tt.store, NewBatchOptions(1*time.Hour, true, nil, 100), now.Add(time.Minute*30), []int(nil))

// T + 90 minutes: all repositories are visible again
// Repos 50, 51 are visible because they were scheduled at (T + 0) - which is more than 1 hour ago
// Repos 52, 53 are visible because they were scheduled at (T + 20) - which is more than 1 hour ago
assertRepoList(t, tt.store, NewBatchOptions(1*time.Hour, true, nil, 100), now.Add(time.Minute*90), []int{50, 51, 52, 53})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, now this is much more understandable. I think part of what was throwing me off was the names. Here, the function name starts with assert and the query this bottoms out is getRepositoriesForIndexScanQuery. However, that query modifies state, and consequently, this assertBlah function also modifies state, whereas normally you'd expect 'get' and 'assert' to not modify state.

I don't have specific alternate naming suggestions, but the usage of get for a query which does inserts seems a bit unfortunate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the upsert nature of that operation is a bit confusing - also makes it hard to debug.
The insert is pushed into the query for (I believe) atomicity and to handle possible dotcom batch sizes (though I wonder how much of a problem it is to literally return only integer repository IDs, you can do thousands of them)

We can rename it to Calculate/Compute, and replace this single helper method assert with a combo of assert + a local computeReposForScheduling or something, but I think we'd hit diminishing returns wrt readability

type RepositorySchedulingService interface {
GetRepositoriesForIndexScan(
ctx context.Context,
_ RepositoryBatchOptions,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Go question), why the underscore here in an interface definition?

@keynmol keynmol force-pushed the refactoring-extract-reposcheduler branch from 80aa55c to 5962c57 Compare April 2, 2024 10:16
@keynmol keynmol merged commit 612e26d into main Apr 2, 2024
@keynmol keynmol deleted the refactoring-extract-reposcheduler branch April 2, 2024 10:37
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants