Repo size is ~866MB #1930

@BananaJeanss

Description

Right now, the repo size is unnecessarily large.

git filter-repo --analyze:

Unpacked Packed
844.83MB 605.33MB

Folder sizes:

.git Worktree Combined
548MB 318MB 866MB

From what I could find, this is caused mostly by large JPGs/PNGs/WebPs, plus some large PDF files.
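One way to check which file types dominate is to total the blob sizes in history by extension. This is a rough sketch built on the same rev-list / cat-file pipeline used in this issue. The function name `history_size_by_ext` is made up for illustration, and the sizes are uncompressed blob sizes, so they will not add up to the packed .git size:

```shell
# Hypothetical helper: sum uncompressed blob sizes across all of history,
# grouped by (lowercased) file extension, largest total first.
history_size_by_ext() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" {
           path = $0
           sub(/^[^ ]+ [^ ]+ /, "", path)   # drop type and size, keep the full path
           n = split(path, parts, "/")
           base = parts[n]
           ext = "(none)"
           # treat dotfiles such as .gitignore as having no extension
           if (match(base, /\.[^.]+$/) && RSTART > 1) ext = tolower(substr(base, RSTART + 1))
           total[ext] += $2
         }
         END { for (e in total) printf "%.2fMB %s\n", total[e] / 1024 / 1024, e }' |
    sort -rn
}
```

Running `history_size_by_ext | head` inside the repo should show whether images or PDFs account for most of the history.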

Solution

git filter-repo should be used to clean up the history, in one of two ways: either remove only old, deleted, unused blobs with a bash script, or run git filter-repo --strip-blobs-bigger-than 5M to drop every blob larger than 5M, including files that are currently in use. The second option is riskier, and anything in public/ should be moved to the CDN first, but the payoff will definitely be worth it, because I don't think anyone wants to download a ~548MB git repo just to edit a single file.
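Before stripping by size, it helps to know which files on the current branch would be affected, since those are the ones that would need a new home on the CDN first. A minimal sketch: `list_large_tracked` is a hypothetical helper name, and the default limit assumes that filter-repo's 5M means 5*1024*1024 bytes:

```shell
# Hypothetical helper: list files tracked on a branch whose blob size exceeds a byte limit.
# Usage: list_large_tracked [branch] [limit_bytes]
list_large_tracked() {
  branch="${1:-main}"
  limit="${2:-5242880}"   # assumed equivalent of --strip-blobs-bigger-than 5M
  # -l adds the blob size as the 4th field; the path follows after a tab
  git -c core.quotePath=false ls-tree -r -l "$branch" |
    awk -F'\t' -v limit="$limit" '{
           split($1, meta, " ")   # meta: mode, type, object id, size
           if (meta[2] == "blob" && meta[4] + 0 > limit) printf "%d\t%s\n", meta[4], $2
         }' |
    sort -rn
}
```

Each output line is the size in bytes, a tab, and the path, largest first. Anything listed under public/ is a candidate to move before the rewrite.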

This will most likely need someone with force-push permissions, though, and everyone else would need to re-clone the repo afterwards, if I'm correct.
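Since existing clones would still point at the old history, anything a contributor has not pushed would be awkward to carry over after the rewrite. Here is a small sketch of a check they could run before replacing an old clone. The helper name `unpushed_commits` is made up for illustration, and it only covers commits on local branches, not stashes or uncommitted changes:

```shell
# Hypothetical helper: show commits on local branches that no remote-tracking branch contains.
unpushed_commits() {
  git log --branches --not --remotes --oneline
}
```

If this prints nothing, the clone has no local-only commits and can simply be replaced by a fresh clone.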

I tested this bash script, and got the .git down to 386M by only removing unused files:

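# get every path that has ever existed in history (%(rest) is the path from rev-list)
# cut -f4- keeps the whole path, so filenames containing spaces are not truncated
git rev-list --objects --all | \
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | \
  grep '^blob ' | cut -d' ' -f4- | sort -u > all_files.txt

# get currently used files
# core.quotePath=false keeps non-ASCII paths unquoted so they match the list above
git -c core.quotePath=false ls-tree -r --name-only main | sort > current_files.txt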

# get dead files by comparing all files to current files
comm -23 all_files.txt current_files.txt > deleted_files.txt

# strip dead files from history
git filter-repo --invert-paths --paths-from-file deleted_files.txt

By just straight up doing git filter-repo --strip-blobs-bigger-than 3M, I got the .git size down to 219M.

By combining the bash script and the --strip-blobs-bigger-than 3M, I got it down to 159M, which is obviously a lot better than 548MB.
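To confirm that a rewrite did what was intended, the largest blob still reachable from any ref can be checked afterwards. A sketch, with `largest_blob` as a hypothetical helper name:

```shell
# Hypothetical helper: print "<size_in_bytes> <path>" for the largest blob reachable from any ref.
largest_blob() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { sub(/^blob /, ""); print }' |
    sort -n | tail -n 1
}
```

After --strip-blobs-bigger-than 3M, the reported size should be no more than 3145728 bytes, assuming filter-repo treats M as 1024*1024.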

Largest files

Largest files in history, from running:

git rev-list --objects --all | \
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | \
  awk '/^blob/ {print $3, $4}' | sort -u | sort -n | \
  awk '{printf "| %.2fMB | %s |\n", $1/1024/1024, $2}' | tail -10

Size File
9.96MB public/jobs/zephyr-group-pic.jpg
10.25MB public/winter/2.png
12.28MB public/hc-cdn/7bf19e299e3e8253096906cef8d599c7aedeed09_image.png
13.05MB public/fiscal-sponsorship/hcb-gource.gif
16.58MB public/winter/11.gif
22.57MB public/home/assemble.jpg
22.96MB public/train_starry_night.png
29.65MB public/home/outernet-110.jpg
38.72MB public/philanthropy/hackclub.pdf
40.43MB public/onboard/first_and_hack_club.pdf
