Skip to content

Commit 81b5d9c

Browse files
authored
Refactor batch processing logic for performance and stability (#63)
* Simplified batch processing logic, by relying on the `.git/svn/.metadata` file, and commit messages from the latest commit * Removed slow-running `svn log` * Fix file perms issue from Podman vs Docker Containers * Shorten volume root from /sourcegraph/ to /sg/ * Improved job state tracking * Cleaned up logging, added local git repo metadata to log events * Refactor failure logic * Fixing fetch job success / fail evaluation
1 parent 4ede23e commit 81b5d9c

38 files changed

+2518
-3476
lines changed

.github/workflows/cleanup-cache.yml

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
name: Cleanup old Docker Images from GitHub Container Registry
1+
name: Cleanup Podman build cache from GitHub Container Registry
22

33
on:
44

@@ -13,17 +13,17 @@ on:
1313
type: string
1414
description: Package name
1515
default: repo-converter/podman-build-cache
16-
required: true
16+
required: false
1717
min-versions-to-keep:
1818
type: number
1919
description: Minimum number of versions to keep
2020
default: 20
21-
required: true
21+
required: false
2222
delete-only-untagged-versions:
2323
type: boolean
2424
description: Delete only untagged versions?
2525
default: false
26-
required: true
26+
required: false
2727

2828

2929
jobs:
@@ -37,7 +37,7 @@ jobs:
3737
with:
3838
# Setting defaults for the scheduled runs
3939
# https://stackoverflow.com/a/73495922
40-
delete-only-untagged-versions: ${{ github.event.inputs.delete-only-untagged-versions || false }}
41-
min-versions-to-keep: ${{ github.event.inputs.min-versions-to-keep || 20 }}
42-
package-name: ${{ github.event.inputs.package-name || 'repo-converter/podman-build-cache' }}
4340
package-type: container
41+
package-name: ${{ github.event.inputs.package-name || 'repo-converter/podman-build-cache' }}
42+
min-versions-to-keep: ${{ github.event.inputs.min-versions-to-keep || 20 }}
43+
delete-only-untagged-versions: ${{ github.event.inputs.delete-only-untagged-versions || false }}

.github/workflows/cleanup-images.yml

Lines changed: 8 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
name: Cleanup old Docker Images from GitHub Container Registry
1+
name: Cleanup old images from GitHub Container Registry
22

33
on:
44

@@ -12,17 +12,18 @@ on:
1212
package-name:
1313
type: string
1414
description: Package name
15-
required: true
15+
default: repo-converter
16+
required: false
1617
min-versions-to-keep:
1718
type: number
1819
description: Minimum number of versions to keep
1920
default: 20
20-
required: true
21+
required: false
2122
delete-only-untagged-versions:
2223
type: boolean
2324
description: Delete only untagged versions?
2425
default: true
25-
required: true
26+
required: false
2627

2728

2829
jobs:
@@ -36,7 +37,7 @@ jobs:
3637
with:
3738
# Setting defaults for the scheduled runs
3839
# https://stackoverflow.com/a/73495922
39-
delete-only-untagged-versions: ${{ github.event.inputs.delete-only-untagged-versions || true }}
40-
min-versions-to-keep: ${{ github.event.inputs.min-versions-to-keep || 20 }}
41-
package-name: ${{ github.event.inputs.package-name || 'repo-updater' }}
4240
package-type: container
41+
package-name: ${{ github.event.inputs.package-name || 'repo-updater' }}
42+
min-versions-to-keep: ${{ github.event.inputs.min-versions-to-keep || 20 }}
43+
delete-only-untagged-versions: ${{ github.event.inputs.delete-only-untagged-versions || true }}

.github/workflows/docker-build-old.bak

Lines changed: 0 additions & 55 deletions
This file was deleted.

.gitignore

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@ config/repos-to-convert*
66
config/service-account-key.json
77
dev/stats/repos.txt
88
logs/
9+
notes/
910
src-serve-root/
1011

1112
# Byte-compiled / optimized / DLL files

AGENT.md

Lines changed: 59 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,14 +1,17 @@
1-
# AI Agent Guidelines for repo-converter Project
1+
# AI Agent Guidelines for repo-converter
2+
23
- The purpose of this project is to convert repos from other repo types (ex. Subversion) to Git
34
- It runs in a Podman container
45
- src/main.py is the entrypoint for the container
56
- The usage of this project is described to users in `./README.md` and `./docs/repo-converter.md`
67

78
## Build/Test Commands
9+
810
- Build and start all containers: `./build/build.sh`
911
- Build and start all containers, and follow the repo-converter container's logs: `./build/build.sh f`
1012

1113
## Code Style Guidelines
14+
1215
- Python version: 3.13.2
1316
- Imports: local modules first, standard libs second, then third-party libs with URLs in comments
1417
- Variables: Snake case (e.g., `local_repo_path`)
@@ -18,3 +21,58 @@
1821
- Security: the log function calls the `redact()` function before logging, to ensure no credentials are leaked in logs
1922
- Documentation: Use Python best practices for docstrings
2023
- Comments: Use `#` for comments, and add lots of comments
24+
25+
## Design Facts
26+
27+
When a user runs the command `svn info https://svn.apache.org/repos/asf/lucene`, the output shows:
28+
```
29+
Path: lucene
30+
URL: https://svn.apache.org/repos/asf/lucene
31+
Relative URL: ^/lucene
32+
Repository Root: https://svn.apache.org/repos/asf
33+
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
34+
Revision: 1927289
35+
Node Kind: directory
36+
Last Changed Author: vajda
37+
Last Changed Rev: 1921508
38+
Last Changed Date: 2024-10-23 13:00:39 +0000 (Wed, 23 Oct 2024)"
39+
```
40+
41+
Reorganized for readability:
42+
```
43+
URL: https://svn.apache.org/repos/asf/lucene <- Just the URL from the command
44+
Repository Root: https://svn.apache.org/repos/asf <- The URL to the svn repo's root directory
45+
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68 <- The svn server assigns a UUID for each repo
46+
Path: lucene <- The subdirectory path, from the repo root, to the directory in the URL passed into the `svn info` command
47+
Revision: 1927289 <- The current tip of the repo's revision number index
48+
Last Changed Rev: 1921508 <- The last revision number which included a change in the requested URL
49+
Last Changed Date: 2024-10-23 13:00:39 +0000 (Wed... <- The date of the last revision which included a change in the requested URL
50+
```
51+
52+
1. The human usage of svn is different than git
53+
- In git, repos are cheap for the server, and easy for human users to create, self-service, so humans usually create a new git repo for each company product or project
54+
- In svn, only the server administrators can create new repos, so most companies only have one repo per svn server, or one repo per division of the company's org chart, so each division's products / projects are usually rooted in top-level or second-level directory of the repo. Perforce is also quite similar this way.
55+
56+
2. The svn revision number index is shared across the entire repo, therefore different products / projects / top-level folders may only be relevant at different ranges of the revision index.
57+
- If a product was built and completed early in the svn repo's history and left dormant, then only ranges of lower revision numbers would be relevant for efficient batch processing, even though this project may have maintenance commits later in the svn repo's history.
58+
- If multiple separate projects are both in active development at the same time, that range of revision numbers may go back and forth between those projects.
59+
60+
3. This repo-converter system is designed to adapt repo organization to match the differences in human usage from svn to git, i.e. break up a large svn repo into smaller git repos.
61+
- From the perspectives of conversion system stability and performance: each repo conversion job execution only needs to handle a subset of revision numbers.
62+
63+
4. svn (and thus, the `git svn` command in git) is largely abandoned, so many of the quality of life features which have been added to git over the years are missing in svn
64+
- svn does not have rate limiting built in, so there's no HTTP 429 response, or `Retry-After` header
65+
- `git svn fetch` does not have robust error handling, retries, or batch processing features, so many features need to be built into the repo-converter system to make up for these shortcomings
66+
- If batch processing logic is required, such as allowing the user to specify a number of commits in the `fetch-batch-size` config, it must find a way to make use of a subset of revisions specifically relevant to code changes under the targeted directory.
67+
- The `--log-window-size n` arg to the `git svn fetch` command is only used to request `n` revision numbers from the svn repo's index of revision numbers, per network request, which may or may not have any commits to the subdirectory the job is trying to convert
68+
- However, this may still be useful for svn repos / subdirectories which seem to have performance issues, so we should implement a retry / backoff method
69+
- The default `--log-window-size` is 100, so if a request times out, we should cut this value in half and retry
70+
- For customer svn repo revision indexes with millions of revisions, smaller window sizes result in many, many network requests to the svn server
71+
- Request timeouts happen more often in conversion jobs when:
72+
- The `--log-window-size` is too large, and the request times out / gets dropped at after 10 minutes of processing time
73+
- The `--log-window-size` is too small, which multiplies the number of requests required to convert the repo, and every new request has its own possibility of timing out
74+
75+
5. The `_calculate_batch_revisions`, then `_git_svn_fetch` functions are only ever called after the `_check_if_repo_already_up_to_date` function concludes that the local git repo is behind the remote svn server, based on the "Last Changed Rev" response to the `svn info` command above
76+
- Therefore, any `git svn fetch` execution which doesn't return any lines of output is considered a failure, even if the return_code is 0
77+
- Therefore, we cannot trust the return code as an indicator of task success
78+
- We must determine task success based on data in the local git repo, and the lines in stdout from the `git svn fetch` command

0 commit comments

Comments
 (0)