-
Notifications
You must be signed in to change notification settings - Fork 80
Description
We have been using gzip to compress logs from benchmarks. However, GZip has poor compression ratio among common compression tools, and is especially bad for certain log files we produce. This consumes a ridiculous amount of disk space, but can be easily fixed by switching to a better compression tool.
We also need to ensure there are Python bindings so that we can process those logs using our existing Python scripts with minimal modification.
Example: ci-perf-result
We keep a history of performance log in the https://github.com/mmtk/ci-perf-result repository. When checked out, the files in the workspace (not including .git itself) takes 2 GB of space. Most of the space is consumed by log files of a few test cases.
-rw-r--r-- 1 wks wks 64711 Dec 12 14:57 antlr.6000.144.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 6653 Dec 12 14:57 bloat.6000.198.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 7345 Dec 12 14:57 eclipse.6000.504.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 6260 Dec 12 14:57 fop.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 7022 Dec 12 14:57 hsqldb.6000.762.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 6605 Dec 12 14:57 jython.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 226549 Dec 12 14:57 luindex.6000.132.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 6955 Dec 12 14:57 lusearch.6000.204.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 121510 Dec 12 14:57 pmd.6000.294.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 552 Dec 12 14:57 runbms_args.yml
-rw-r--r-- 1 wks wks 8246 Dec 12 14:57 runbms.yml
-rw-r--r-- 1 wks wks 6510 Dec 12 14:57 sunflow.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 6551 Dec 12 14:57 xalan.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
luindexprints out the table of contents of the King James Version Bible while indexing them, accumulating to more than 1000 lines per run.pmdprints out the warnings on the test corpus while executing.antlrlogs the input file names during execution.
While 226 KB doesn't look large for a single run, it can accumulate. With one 226 KB per plan per CI run, it eventually accumulates to a gigabyte.
How effective are other compression tools?
DaCapo 2006
I tried to compress the logs from antlr, fop, luindex and pmd in DaCapo 2006 (i.e. from ci-perf-result) using several tools. antlr, luindex and pmd have large log output, and fop is chosen as a control group.
| size (bytes) | benchmark | compression | options |
|---|---|---|---|
| 1155654 | antlr | uncompressed | |
| 127836 | antlr | lzo | |
| 77267 | antlr | lzo | -9 |
| 64651 | antlr | gz | |
| 63424 | antlr | gz | -9 |
| 45655 | antlr | lz4 | |
| 15083 | antlr | bz2 | |
| 15083 | antlr | bz2 | -9 |
| 13522 | antlr | lz4 | -9 |
| 6980 | antlr | zstd | |
| 6284 | antlr | bz3 | |
| 6196 | antlr | lrz | |
| 6161 | antlr | zstd | -9 |
| 6064 | antlr | lrz | -z |
| 5959 | antlr | lz | |
| 5935 | antlr | zstd | -19 |
| 5920 | antlr | xz | |
| 5920 | antlr | xz | -9 |
| 5695 | antlr | lz | -9 |
| 161919 | fop | uncompressed | |
| 20744 | fop | lzo | |
| 9766 | fop | lz4 | |
| 6841 | fop | lz4 | -9 |
| 6666 | fop | lzo | -9 |
| 6546 | fop | bz2 | |
| 6546 | fop | bz2 | -9 |
| 6202 | fop | gz | |
| 5976 | fop | gz | -9 |
| 5114 | fop | zstd | |
| 4816 | fop | bz3 | |
| 4659 | fop | lrz | |
| 4643 | fop | zstd | -9 |
| 4545 | fop | lrz | -z |
| 4257 | fop | zstd | -19 |
| 4207 | fop | lz | |
| 4204 | fop | xz | |
| 4204 | fop | xz | -9 |
| 4080 | fop | lz | -9 |
| 1858161 | luindex | uncompressed | |
| 323165 | luindex | lzo | |
| 226487 | luindex | gz | |
| 201132 | luindex | lzo | -9 |
| 197991 | luindex | gz | -9 |
| 158852 | luindex | lz4 | |
| 83571 | luindex | lz4 | -9 |
| 30587 | luindex | bz2 | |
| 30587 | luindex | bz2 | -9 |
| 9509 | luindex | zstd | |
| 8683 | luindex | zstd | -9 |
| 7868 | luindex | bz3 | |
| 7563 | luindex | lrz | |
| 7458 | luindex | zstd | -19 |
| 7212 | luindex | lrz | -z |
| 6762 | luindex | lz | |
| 6656 | luindex | xz | |
| 6656 | luindex | xz | -9 |
| 6468 | luindex | lz | -9 |
| 2377643 | pmd | uncompressed | |
| 183840 | pmd | lzo | |
| 154526 | pmd | lzo | -9 |
| 127098 | pmd | lz4 | |
| 121452 | pmd | gz | |
| 118288 | pmd | gz | -9 |
| 81500 | pmd | lz4 | -9 |
| 22670 | pmd | bz2 | |
| 22670 | pmd | bz2 | -9 |
| 8176 | pmd | zstd | |
| 7265 | pmd | zstd | -9 |
| 6476 | pmd | xz | |
| 6476 | pmd | xz | -9 |
| 6465 | pmd | lz | |
| 6376 | pmd | lrz | |
| 6314 | pmd | bz3 | |
| 6296 | pmd | zstd | -19 |
| 6180 | pmd | lrz | -z |
| 5802 | pmd | lz | -9 |
From the table, we see that lzip -9 is always the winner of compression ratio for the DaCapo 2006 outputs. bzip3, lrz, xz and zstd also work reasonably well. lz4, gz, bz2 and lzo are all unsatisfactory.
fastdebug
Note that the compression ratio depends on the input data. Here is a result from compressing the execution log of fastdebug in an OpenJDK test with DaCapo Chopin, i.e. concurrentimmix in linux-x86_64-lusearch-fastdebug-2.5 picked from https://github.com/mmtk/mmtk-openjdk/actions/runs/19810034387
| size (bytes) | compression | options |
|---|---|---|
| 710863 | uncompressed | |
| 66154 | lzo | |
| 61895 | lz4 | |
| 40964 | lzo | -9 |
| 40016 | lz4 | -9 |
| 36698 | zstd | |
| 32920 | gz | |
| 30236 | gz | -9 |
| 29598 | lrz | |
| 28106 | lz | |
| 27822 | zstd | -9 |
| 26459 | lrz | -z |
| 25800 | xz | |
| 25800 | xz | -9 |
| 23453 | zstd | -19 |
| 21685 | bz2 | |
| 21685 | bz2 | -9 |
| 20987 | lz | -9 |
| 19215 | bz3 |
This time bz3 works better than lz -9, and bz2 works better than zstd and xz. But gz is still worse. I think one difference between this log and that of luindex in DaCapo 2006 is that this log contains a timestamp on each line, and the logs contain many numbers that are distinct among all log lines.
DaCapo Chopin
In general, DaCapo Chopin produces much less verbose logs than DaCapo 2006. If each file is small, compression algorithms will just give diminishing return. The table below is also from ConcurrentImmix, but from linux-x86_64-lusearch-release-2.5 from https://github.com/mmtk/mmtk-openjdk/actions/runs/19810034387
| size (bytes) | compression | options |
|---|---|---|
| 8158 | uncompressed | |
| 4694 | lz4 | |
| 4662 | lzo | |
| 4427 | lz4 | -9 |
| 4079 | lzo | -9 |
| 3443 | zstd | |
| 3432 | bz2 | |
| 3432 | bz2 | -9 |
| 3377 | bz3 | |
| 3366 | zstd | -9 |
| 3275 | gz | |
| 3274 | gz | -9 |
| 3251 | lrz | |
| 3250 | zstd | -19 |
| 3162 | lrz | -z |
| 3140 | xz | |
| 3140 | xz | -9 |
| 3096 | lz | |
| 3095 | lz | -9 |
The difference between gz and the best performer lz is not that significant, but still saves about 10% space.
What should we change?
Things that generate GZip-compressed files
It is mostly not mmtk-core itself, but many related tools.
One thing to change is running-ng. It hard-coded gzip into runbms.py. See: https://github.com/anupli/running-ng/blob/64ee6876a076e42b671cfd9c0e6a7ddd41fb3bdf/src/running/command/runbms.py#L387 This should be easy to fix because we can change it to another tool, such as lzip -9, or make it a configurable option.
Things that read GZip-compressed files
The other thing that should be changed are Python scripts. We have quite some scripts that work with GZip-compressed files. For example,
- log_preprocessor.py in
running-ngfilters GZip-compressed log files. - parse.py in
ci-perf-kitalso reads GZip-compressed files. - plotty also has log parsers that reads GZip-compressed files.
To ensure we don't lock ourselves into one compression format, we should replace the python scripts with a more general tool that can auto-detect the compression format of a file.
ci-perf-results
We may convert all files in the ci-perf-result repo to another format, such as lzip.
However, Git is snapshot-based and keeps all historical versions of files. If we want to keep the Git history of the ci-perf-results repo, it won't save us any space to convert the log formats.
But if we only use the ci-perf-result repository as a place to hold the logs and render the web pages, we may consider doing the conversion and replace the entire git history of the self-hosted branch of ci-perf-result.
And it is wiser to put logs into a tar archive instead of compressing individual files. As we can see in the output below, an lzip-compressed archive is only 30KB, but compressing individual files will sum up to 100KB (total 100). Note that each file takes a multiple of 4KB of space on disk, depending on the file system.
total 524
drwxr-xr-x 2 wks wks 4096 Dec 15 15:08 hammer-2025-12-05-Fri-064328-before
drwxr-xr-x 2 wks wks 4096 Dec 15 15:09 hammer-2025-12-05-Fri-064328-archived
drwxr-xr-x 2 wks wks 4096 Dec 15 15:08 hammer-2025-12-05-Fri-064328-after
-rw-r--r-- 1 wks wks 30377 Dec 15 15:10 hammer-2025-12-05-Fri-064328-archived.tar.lz
-rw-r--r-- 1 wks wks 33844 Dec 15 15:11 hammer-2025-12-05-Fri-064328-archived.tar.xz
-rw-r--r-- 1 wks wks 34435 Dec 15 15:11 hammer-2025-12-05-Fri-064328-archived.tar.zstd
-rw-r--r-- 1 wks wks 416529 Dec 15 15:10 hammer-2025-12-05-Fri-064328-archived.tar.gz
./hammer-2025-12-05-Fri-064328-before:
total 488
-rw-r--r-- 1 wks wks 552 Dec 15 15:08 runbms_args.yml
-rw-r--r-- 1 wks wks 6260 Dec 15 15:08 fop.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 6510 Dec 15 15:08 sunflow.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 6551 Dec 15 15:08 xalan.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 6605 Dec 15 15:08 jython.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 6653 Dec 15 15:08 bloat.6000.198.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 6955 Dec 15 15:08 lusearch.6000.204.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 7022 Dec 15 15:08 hsqldb.6000.762.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 7345 Dec 15 15:08 eclipse.6000.504.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 8246 Dec 15 15:08 runbms.yml
-rw-r--r-- 1 wks wks 64711 Dec 15 15:08 antlr.6000.144.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 121510 Dec 15 15:08 pmd.6000.294.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 226549 Dec 15 15:08 luindex.6000.132.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
./hammer-2025-12-05-Fri-064328-archived:
total 6672
-rw-r--r-- 1 wks wks 552 Dec 15 15:09 runbms_args.yml
-rw-r--r-- 1 wks wks 8246 Dec 15 15:09 runbms.yml
-rw-r--r-- 1 wks wks 161919 Dec 15 15:09 fop.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks 162933 Dec 15 15:09 xalan.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks 166221 Dec 15 15:09 jython.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks 171026 Dec 15 15:09 lusearch.6000.204.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks 171263 Dec 15 15:09 sunflow.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks 180060 Dec 15 15:09 bloat.6000.198.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks 188478 Dec 15 15:09 eclipse.6000.504.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks 203881 Dec 15 15:09 hsqldb.6000.762.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks 1155654 Dec 15 15:09 antlr.6000.144.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks 1858161 Dec 15 15:09 luindex.6000.132.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks 2377643 Dec 15 15:09 pmd.6000.294.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
./hammer-2025-12-05-Fri-064328-after:
total 100
-rw-r--r-- 1 wks wks 552 Dec 15 15:08 runbms_args.yml
-rw-r--r-- 1 wks wks 4080 Dec 15 15:08 fop.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 4206 Dec 15 15:08 bloat.6000.198.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 4216 Dec 15 15:08 sunflow.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 4283 Dec 15 15:08 jython.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 4290 Dec 15 15:08 xalan.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 4405 Dec 15 15:08 lusearch.6000.204.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 4470 Dec 15 15:08 hsqldb.6000.762.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 4707 Dec 15 15:08 eclipse.6000.504.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 5695 Dec 15 15:08 antlr.6000.144.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 5802 Dec 15 15:08 pmd.6000.294.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 6468 Dec 15 15:08 luindex.6000.132.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 8246 Dec 15 15:08 runbms.yml
And it will reduce the space even further if we archive the output of an entire GC run into one single .tar.lz archive. In the output below, before contains all logs produced on hammer.moma on 2025-12-05. after-individual has each individual log re-compressed using lzip -9. after-archived is a directory of uncompressed log files, and we compress it as a hole into after-archived.tar.lz.
wks@luna ~/j/c/whole-run> ll
total 188
drwxr-xr-x 10 wks wks 4096 Dec 15 15:29 after-archived
-rw-r--r-- 1 wks wks 176953 Dec 15 15:33 after-archived.tar.lz
drwxr-xr-x 10 wks wks 4096 Dec 15 15:31 after-individual
drwxr-xr-x 10 wks wks 4096 Dec 15 15:28 before
wks@luna ~/j/c/whole-run> ll before
total 32
drwxr-xr-x 3 wks wks 4096 Dec 15 15:28 canary
drwxr-xr-x 3 wks wks 4096 Dec 15 15:28 gencopy
drwxr-xr-x 3 wks wks 4096 Dec 15 15:28 genimmix
drwxr-xr-x 3 wks wks 4096 Dec 15 15:28 immix
drwxr-xr-x 3 wks wks 4096 Dec 15 15:28 marksweep
drwxr-xr-x 3 wks wks 4096 Dec 15 15:28 nogc
drwxr-xr-x 3 wks wks 4096 Dec 15 15:28 semispace
drwxr-xr-x 3 wks wks 4096 Dec 15 15:28 stickyimmix
wks@luna ~/j/c/whole-run> ll before/genimmix/
total 4
drwxr-xr-x 2 wks wks 4096 Dec 15 15:28 hammer-2025-12-05-Fri-064328
wks@luna ~/j/c/whole-run> ll before/genimmix/hammer-2025-12-05-Fri-064328/
total 488
-rw-r--r-- 1 wks wks 64711 Dec 15 15:28 antlr.6000.144.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 6653 Dec 15 15:28 bloat.6000.198.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 7345 Dec 15 15:28 eclipse.6000.504.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 6260 Dec 15 15:28 fop.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 7022 Dec 15 15:28 hsqldb.6000.762.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 6605 Dec 15 15:28 jython.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 226549 Dec 15 15:28 luindex.6000.132.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 6955 Dec 15 15:28 lusearch.6000.204.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 121510 Dec 15 15:28 pmd.6000.294.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 552 Dec 15 15:28 runbms_args.yml
-rw-r--r-- 1 wks wks 8246 Dec 15 15:28 runbms.yml
-rw-r--r-- 1 wks wks 6510 Dec 15 15:28 sunflow.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 6551 Dec 15 15:28 xalan.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
wks@luna ~/j/c/whole-run> du -sh *
53M after-archived
176K after-archived.tar.lz
892K after-individual
3.9M before
If we can reduce 3.9M to 176K, we should be able to reduce the 2GB repository to about 100M.