Extract reposcheduler and add support for syntactic indexes #60958

keynmol · 2024-03-08T14:31:23Z

Part of #59795
Fixes ENG-23423

This PR's main goal is to extract the logic that selects repositories to be scanned for new commits that are candidates for syntactic indexing.

As part of this work, I've extracted this specific functionality into a new component, called reposcheduler, to reduce the area of responsibility assumed by the auto-indexing service.

This is a relatively minor refactoring, mostly invisible from the outside.

The implementation of the repo scheduling logic was mostly ported over, with some changes to the SQL queries.

We are using a new table, instead of an extra index_type field on the lsif_last_index_scan table, because there is currently no way to introduce a composite uniqueness constraint (repository_id, index_type) in a backwards-compatible way.
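To illustrate the trade-off, here is a minimal in-memory sketch, not the actual schema. It assumes the existing table allows only one row per repository_id; the types `scanRow`, `sharedTable`, and `separateTables` and the helper `demo` are hypothetical names made up for this example. With a single-column key, recording a syntactic scan overwrites the precise row, while separate tables keep the two schedulers isolated.

```go
package main

import "fmt"

// scanRow is a hypothetical row carrying index_type as an extra field.
type scanRow struct {
	indexType  string
	lastScanAt int // minutes since an arbitrary start, for illustration only
}

// sharedTable models one table whose uniqueness constraint covers
// repository_id only (an assumption), so each repository has a single row.
type sharedTable map[int]scanRow

// separateTables models the approach described above: one table per
// index type, each keyed by repository ID.
type separateTables struct {
	precise   map[int]int
	syntactic map[int]int
}

// demo records a precise scan and then a syntactic scan for repository 50
// and reports whether the precise record survives in each model.
func demo() (sharedKeepsPrecise, separateKeepsPrecise bool) {
	shared := sharedTable{}
	shared[50] = scanRow{indexType: "precise", lastScanAt: 0}
	// An upsert keyed on repository_id alone replaces the precise row.
	shared[50] = scanRow{indexType: "syntactic", lastScanAt: 20}
	sharedKeepsPrecise = shared[50].indexType == "precise"

	sep := separateTables{precise: map[int]int{}, syntactic: map[int]int{}}
	sep.precise[50] = 0
	sep.syntactic[50] = 20
	_, separateKeepsPrecise = sep.precise[50]
	return sharedKeepsPrecise, separateKeepsPrecise
}

func main() {
	shared, separate := demo()
	fmt.Println("shared table keeps precise record:", shared)
	fmt.Println("separate tables keep precise record:", separate)
}
```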

This PR will be squashed and merged. Old commits are left in PR history to refer to previous approaches.

Test plan

Existing tests that cover precise indexing
A copy of those tests that covers syntactic indexing in isolation
Table tests that operate on shared tests and prove isolation of precise and syntactic schedulers

internal/codeintel/autoindexing/internal/background/scheduler/job_scheduler.go

varungandhi-src

I did a first round visiting 20/30 files, will do a second round later since this PR is quite big. 😬

internal/codeintel/autoindexing/init.go

internal/codeintel/autoindexing/internal/background/scheduler/job_scheduler.go

cmd/worker/internal/codeintel/autoindexing_scheduler.go

internal/codeintel/reposcheduler/init.go

internal/codeintel/reposcheduler/store.go

varungandhi-src · 2024-03-25T11:03:20Z

internal/codeintel/reposcheduler/store_observability.go

+			"codeintel_autoindexing_store",
+			metrics.WithLabels("op"),
+			metrics.WithCountHelp("Total number of method invocations."),
+		)
+	})
+
+	op := func(name string) *observation.Operation {
+		return observationCtx.Operation(observation.Op{
+			Name:              fmt.Sprintf("codeintel.reposcheduler.store.%s", name),


Should these two names be more similar/do you know what both of these are doing?

internal/codeintel/reposcheduler/store.go

internal/codeintel/reposcheduler/store_test.go

varungandhi-src

Mostly LGTM, would be good to have more clarity on the test logic and the decision to set the boolean to be on by default.

internal/codeintel/reposcheduler/store_test.go

internal/database/schema.md

migrations/frontend/squashed.sql

varungandhi-src

Thanks for all the changes. The test cases in particular are much more easily understandable.

internal/codeintel/reposcheduler/init.go

internal/codeintel/reposcheduler/options.go

internal/codeintel/reposcheduler/store.go

varungandhi-src · 2024-03-27T11:28:43Z

internal/codeintel/reposcheduler/store_test.go

+			// The following tests simulate the passage of time (i.e. repeated scheduled invocations of the repo scheduling logic)
+			// T is time
+
+			// N.B. We use 1 hour process delay in all those tests,
+			// which means that IF the repository id was returned, it will be put on cooldown for an hour
+
+			// T = 0: No records in last index scan table, so we return all repos permitted by the limit parameter
+			assertRepoList(t, tt.store, NewBatchOptions(1*time.Hour, true, nil, 2), now, []int{50, 51})
+
+			// T + 20 minutes: first two repositories are still on cooldown
+			assertRepoList(t, tt.store, NewBatchOptions(1*time.Hour, true, nil, 100), now.Add(time.Minute*20), []int{52, 53})
+
+			// T + 30 minutes: all repositories are on cooldown
+			assertRepoList(t, tt.store, NewBatchOptions(1*time.Hour, true, nil, 100), now.Add(time.Minute*30), []int(nil))
+
+			// T + 90 minutes: all repositories are visible again
+			// Repos 50, 51 are visible because they were scheduled at (T + 0) - which is more than 1 hour ago
+			// Repos 52, 53 are visible because they were scheduled at (T + 20) - which is more than 1 hour ago
+			assertRepoList(t, tt.store, NewBatchOptions(1*time.Hour, true, nil, 100), now.Add(time.Minute*90), []int{50, 51, 52, 53})


Thank you, now this is much more understandable. I think part of what was throwing me off was the names. Here, the function name starts with assert and the query this bottoms out in is getRepositoriesForIndexScanQuery. However, that query modifies state, and consequently, this assertBlah function also modifies state, whereas normally you'd expect 'get' and 'assert' to not modify state.

I don't have specific alternate naming suggestions, but the usage of get for a query which does inserts seems a bit unfortunate.

Yeah, the upsert nature of that operation is a bit confusing, and it also makes it hard to debug.
The insert is pushed into the query for (I believe) atomicity and to handle possible dotcom batch sizes (though I wonder how much of a problem it is to literally return only integer repository IDs; you can do thousands of them).

We could rename it to Calculate/Compute and replace this single assert helper with a combination of an assert plus a local computeReposForScheduling or something, but I think we'd hit diminishing returns wrt readability.
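For readers following this thread, the state-modifying behaviour can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `cooldownStore`, `newCooldownStore`, and `runScenario` are hypothetical names, the real logic lives in a SQL query, and the exact cooldown boundary comparison is assumed. The method borrows the `computeReposForScheduling` name floated above to make the write visible in the name.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// cooldownStore is a hypothetical in-memory stand-in for the last index scan table.
type cooldownStore struct {
	repoIDs  []int             // all known repositories, sorted by ID
	lastScan map[int]time.Time // repository ID -> time it was last returned
}

func newCooldownStore(repoIDs ...int) *cooldownStore {
	ids := append([]int(nil), repoIDs...)
	sort.Ints(ids)
	return &cooldownStore{repoIDs: ids, lastScan: map[int]time.Time{}}
}

// computeReposForScheduling returns up to limit repositories that are not on
// cooldown and, as a side effect, records now as their last scan time.
func (s *cooldownStore) computeReposForScheduling(processDelay time.Duration, limit int, now time.Time) []int {
	var selected []int
	for _, id := range s.repoIDs {
		if len(selected) == limit {
			break
		}
		last, seen := s.lastScan[id]
		if seen && now.Sub(last) < processDelay {
			continue // still on cooldown
		}
		s.lastScan[id] = now // the "upsert" part: reading also writes
		selected = append(selected, id)
	}
	return selected
}

// runScenario replays the timeline from the test above: T, T+20, T+30, T+90 minutes.
func runScenario() [][]int {
	now := time.Date(2024, 3, 27, 0, 0, 0, 0, time.UTC)
	s := newCooldownStore(50, 51, 52, 53)
	return [][]int{
		s.computeReposForScheduling(time.Hour, 2, now),
		s.computeReposForScheduling(time.Hour, 100, now.Add(20*time.Minute)),
		s.computeReposForScheduling(time.Hour, 100, now.Add(30*time.Minute)),
		s.computeReposForScheduling(time.Hour, 100, now.Add(90*time.Minute)),
	}
}

func main() {
	for i, repos := range runScenario() {
		fmt.Printf("step %d: %v\n", i, repos)
	}
}
```

Because every call writes the cooldown timestamps, the order of calls matters, which is why the assertions in the test have to be read as a timeline rather than as independent checks.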

mmanela · 2024-03-26T20:20:10Z

internal/codeintel/reposcheduler/init.go

+type RepositorySchedulingService interface {
+	GetRepositoriesForIndexScan(
+		ctx context.Context,
+		_ RepositoryBatchOptions,


(Go question), why the underscore here in an interface definition?
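As general Go background on this question: parameter names in an interface method are documentation only, and within one parameter list either all parameters are named or none are, so `_` is a way to leave one parameter unnamed while keeping `ctx` named. The sketch below shows the three equivalent spellings; `batchOptions`, `scheduler`, `fixedScheduler`, and `demo` are hypothetical names, not types from this PR.

```go
package main

import (
	"context"
	"fmt"
)

// batchOptions is a hypothetical stand-in for RepositoryBatchOptions.
type batchOptions struct{ Limit int }

// scheduler declares the same signature three ways. All three are valid Go,
// and the parameter names do not affect which types implement the interface.
type scheduler interface {
	// Every parameter named.
	Named(ctx context.Context, opts batchOptions) []int
	// Blank identifier for the second parameter: since ctx is named,
	// the second parameter needs some identifier as well, and _ is allowed.
	Blank(ctx context.Context, _ batchOptions) []int
	// No parameter names at all.
	Unnamed(context.Context, batchOptions) []int
}

// fixedScheduler implements scheduler; implementations pick their own names.
type fixedScheduler struct{ repos []int }

func (f fixedScheduler) take(limit int) []int {
	if limit > len(f.repos) {
		limit = len(f.repos)
	}
	return f.repos[:limit]
}

func (f fixedScheduler) Named(_ context.Context, opts batchOptions) []int {
	return f.take(opts.Limit)
}

func (f fixedScheduler) Blank(_ context.Context, options batchOptions) []int {
	return f.take(options.Limit)
}

func (f fixedScheduler) Unnamed(_ context.Context, o batchOptions) []int {
	return f.take(o.Limit)
}

// demo calls the method declared with the blank identifier.
func demo() []int {
	var s scheduler = fixedScheduler{repos: []int{50, 51, 52}}
	return s.Blank(context.Background(), batchOptions{Limit: 2})
}

func main() {
	fmt.Println(demo())
}
```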

cla-bot bot added the cla-signed label Mar 8, 2024

keynmol mentioned this pull request Mar 13, 2024

Syntactic indexing: create job scheduling worker #59795

Closed

keynmol marked this pull request as ready for review March 21, 2024 15:45

mmanela added the team/product-platform label Mar 21, 2024

mmanela reviewed Mar 21, 2024

internal/codeintel/autoindexing/internal/background/scheduler/job_scheduler.go (outdated, resolved)

keynmol mentioned this pull request Mar 22, 2024

Ingesting syntactic indexes via upload #60951

Merged

keynmol requested a review from varungandhi-src March 25, 2024 09:01

varungandhi-src reviewed Mar 25, 2024

varungandhi-src reviewed Mar 26, 2024

internal/codeintel/reposcheduler/store_test.go (outdated, resolved)

varungandhi-src suggested changes Mar 26, 2024

internal/codeintel/reposcheduler/store_test.go (outdated, resolved)

internal/database/schema.md (outdated, resolved)

migrations/frontend/squashed.sql (outdated, resolved)

keynmol requested a review from varungandhi-src March 27, 2024 11:06

varungandhi-src approved these changes Mar 27, 2024

mmanela reviewed Mar 27, 2024

keynmol added 17 commits April 2, 2024 11:05

Extract reposcheduler and add support for syntactic indexes

262d2ef

gazelle

2577e90

Run migration generators

b5213ea

go formatting

23e4aa7

Update mocks and bazel build for tests

fd44cab

Make migration idempotent

83cdb83

regenerate files

37353fb

Refactor reposcheduler test

1bb0010

DRY the global policy tests as well

0245cc6

Include indexing type in lsif_last_index_scan primary key

6a935c0

This fixes scheduling, isolating the precise indexes from syntactic ones in the last scan table.

Make migration idempotent

4822ea0

check in generated stuff

046f438

Use constraint name for conflict handling

3febbd4

Switch to a separate table

2b86a02

Make migration idempotent

fd55462

Remove spurious comments

9438272

Remove unnecessary goroutine parameter

6e4e4fb

keynmol added 12 commits April 2, 2024 11:06

PR comments

e3721b9

Default syntactic indexing to false, make it non null in DB

b24994a

PR comments

31370b1

Regenerate migrations

2d2b6ba

Push RepositoryToIndex conversion into the store

b8c521a

Add extensive comments to reposcheduler tests

f033f14

Refactor tests

bba1af3

Add comments, clarify metric and variable names

0528dde

gofmt

11bc9d3

Revert changes to AutoIndexingService

951733a

fixup

919f34b

PR comments

5962c57

keynmol force-pushed the refactoring-extract-reposcheduler branch from 80aa55c to 5962c57 Compare April 2, 2024 10:16

keynmol merged commit 612e26d into main Apr 2, 2024

keynmol deleted the refactoring-extract-reposcheduler branch April 2, 2024 10:37
