Skip to content

Switch away from GZip for log compression #1436

@wks

Description

@wks

We have been using gzip to compress logs from benchmarks. However, GZip has poor compression ratio among common compression tools, and is especially bad for certain log files we produce. This consumes a ridiculous amount of disk space, but can be easily fixed by switching to a better compression tool.

We also need to ensure there are Python bindings so that we can process those logs using our existing Python scripts with minimal modification.

Example: ci-perf-result

We keep a history of performance log in the https://github.com/mmtk/ci-perf-result repository. When checked out, the files in the workspace (not including .git itself) takes 2 GB of space. Most of the space is consumed by log files of a few test cases.

-rw-r--r-- 1 wks wks  64711 Dec 12 14:57 antlr.6000.144.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   6653 Dec 12 14:57 bloat.6000.198.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   7345 Dec 12 14:57 eclipse.6000.504.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   6260 Dec 12 14:57 fop.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   7022 Dec 12 14:57 hsqldb.6000.762.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   6605 Dec 12 14:57 jython.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 226549 Dec 12 14:57 luindex.6000.132.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   6955 Dec 12 14:57 lusearch.6000.204.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 121510 Dec 12 14:57 pmd.6000.294.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks    552 Dec 12 14:57 runbms_args.yml
-rw-r--r-- 1 wks wks   8246 Dec 12 14:57 runbms.yml
-rw-r--r-- 1 wks wks   6510 Dec 12 14:57 sunflow.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   6551 Dec 12 14:57 xalan.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
  • luindex prints out the table of contents of the King James Version Bible while indexing them, accumulating to more than 1000 lines per run.
  • pmd prints out the warnings on the test corpus while executing.
  • antlr logs the input file names during execution.

While 226 KB doesn't look large for a single run, it can accumulate. With one 226 KB per plan per CI run, it eventually accumulates to a gigabyte.

How effective are other compression tools?

DaCapo 2006

I tried to compress the logs from antlr, fop, luindex and pmd in DaCapo 2006 (i.e. from ci-perf-result) using several tools. antlr, luindex and pmd have large log output, and fop is chosen as a control group.

size (bytes) benchmark compression options
1155654 antlr uncompressed
127836 antlr lzo
77267 antlr lzo -9
64651 antlr gz
63424 antlr gz -9
45655 antlr lz4
15083 antlr bz2
15083 antlr bz2 -9
13522 antlr lz4 -9
6980 antlr zstd
6284 antlr bz3
6196 antlr lrz
6161 antlr zstd -9
6064 antlr lrz -z
5959 antlr lz
5935 antlr zstd -19
5920 antlr xz
5920 antlr xz -9
5695 antlr lz -9
161919 fop uncompressed
20744 fop lzo
9766 fop lz4
6841 fop lz4 -9
6666 fop lzo -9
6546 fop bz2
6546 fop bz2 -9
6202 fop gz
5976 fop gz -9
5114 fop zstd
4816 fop bz3
4659 fop lrz
4643 fop zstd -9
4545 fop lrz -z
4257 fop zstd -19
4207 fop lz
4204 fop xz
4204 fop xz -9
4080 fop lz -9
1858161 luindex uncompressed
323165 luindex lzo
226487 luindex gz
201132 luindex lzo -9
197991 luindex gz -9
158852 luindex lz4
83571 luindex lz4 -9
30587 luindex bz2
30587 luindex bz2 -9
9509 luindex zstd
8683 luindex zstd -9
7868 luindex bz3
7563 luindex lrz
7458 luindex zstd -19
7212 luindex lrz -z
6762 luindex lz
6656 luindex xz
6656 luindex xz -9
6468 luindex lz -9
2377643 pmd uncompressed
183840 pmd lzo
154526 pmd lzo -9
127098 pmd lz4
121452 pmd gz
118288 pmd gz -9
81500 pmd lz4 -9
22670 pmd bz2
22670 pmd bz2 -9
8176 pmd zstd
7265 pmd zstd -9
6476 pmd xz
6476 pmd xz -9
6465 pmd lz
6376 pmd lrz
6314 pmd bz3
6296 pmd zstd -19
6180 pmd lrz -z
5802 pmd lz -9

From the table, we see that lzip -9 is always the winner of compression ratio for the DaCapo 2006 outputs. bzip3, lrz, xz and zstd also work reasonably well. lz4, gz, bz2 and lzo are all unsatisfactory.

fastdebug

Note that the compression ratio depends on the input data. Here is a result from compressing the execution log of fastdebug in an OpenJDK test with DaCapo Chopin, i.e. concurrentimmix in linux-x86_64-lusearch-fastdebug-2.5 picked from https://github.com/mmtk/mmtk-openjdk/actions/runs/19810034387

size (bytes) compression options
710863 uncompressed
66154 lzo
61895 lz4
40964 lzo -9
40016 lz4 -9
36698 zstd
32920 gz
30236 gz -9
29598 lrz
28106 lz
27822 zstd -9
26459 lrz -z
25800 xz
25800 xz -9
23453 zstd -19
21685 bz2
21685 bz2 -9
20987 lz -9
19215 bz3

This time bz3 works better than lz -9, and bz2 works better than zstd and xz. But gz is still worse. I think one difference between this log and that of luindex in DaCapo 2006 is that this log contains a timestamp on each line, and the logs contain many numbers that are distinct among all log lines.

DaCapo Chopin

In general, DaCapo Chopin produces much less verbose logs than DaCapo 2006. If each file is small, compression algorithms will just give diminishing return. The table below is also from ConcurrentImmix, but from linux-x86_64-lusearch-release-2.5 from https://github.com/mmtk/mmtk-openjdk/actions/runs/19810034387

size (bytes) compression options
8158 uncompressed
4694 lz4
4662 lzo
4427 lz4 -9
4079 lzo -9
3443 zstd
3432 bz2
3432 bz2 -9
3377 bz3
3366 zstd -9
3275 gz
3274 gz -9
3251 lrz
3250 zstd -19
3162 lrz -z
3140 xz
3140 xz -9
3096 lz
3095 lz -9

The difference between gz and the best performer lz is not that significant, but still saves about 10% space.

What should we change?

Things that generate GZip-compressed files

It is mostly not mmtk-core itself, but many related tools.

One thing to change is running-ng. It hard-coded gzip into runbms.py. See: https://github.com/anupli/running-ng/blob/64ee6876a076e42b671cfd9c0e6a7ddd41fb3bdf/src/running/command/runbms.py#L387 This should be easy to fix because we can change it to another tool, such as lzip -9, or make it a configurable option.

Things that read GZip-compressed files

The other thing that should be changed are Python scripts. We have quite some scripts that work with GZip-compressed files. For example,

  • log_preprocessor.py in running-ng filters GZip-compressed log files.
  • parse.py in ci-perf-kit also reads GZip-compressed files.
  • plotty also has log parsers that reads GZip-compressed files.

To ensure we don't lock ourselves into one compression format, we should replace the python scripts with a more general tool that can auto-detect the compression format of a file.

ci-perf-results

We may convert all files in the ci-perf-result repo to another format, such as lzip.

However, Git is snapshot-based and keeps all historical versions of files. If we want to keep the Git history of the ci-perf-results repo, it won't save us any space to convert the log formats.

But if we only use the ci-perf-result repository as a place to hold the logs and render the web pages, we may consider doing the conversion and replace the entire git history of the self-hosted branch of ci-perf-result.

And it is wiser to put logs into a tar archive instead of compressing individual files. As we can see in the output below, an lzip-compressed archive is only 30KB, but compressing individual files will sum up to 100KB (total 100). Note that each file takes a multiple of 4KB of space on disk, depending on the file system.

total 524
drwxr-xr-x 2 wks wks   4096 Dec 15 15:08 hammer-2025-12-05-Fri-064328-before
drwxr-xr-x 2 wks wks   4096 Dec 15 15:09 hammer-2025-12-05-Fri-064328-archived
drwxr-xr-x 2 wks wks   4096 Dec 15 15:08 hammer-2025-12-05-Fri-064328-after
-rw-r--r-- 1 wks wks  30377 Dec 15 15:10 hammer-2025-12-05-Fri-064328-archived.tar.lz
-rw-r--r-- 1 wks wks  33844 Dec 15 15:11 hammer-2025-12-05-Fri-064328-archived.tar.xz
-rw-r--r-- 1 wks wks  34435 Dec 15 15:11 hammer-2025-12-05-Fri-064328-archived.tar.zstd
-rw-r--r-- 1 wks wks 416529 Dec 15 15:10 hammer-2025-12-05-Fri-064328-archived.tar.gz

./hammer-2025-12-05-Fri-064328-before:
total 488
-rw-r--r-- 1 wks wks    552 Dec 15 15:08 runbms_args.yml
-rw-r--r-- 1 wks wks   6260 Dec 15 15:08 fop.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   6510 Dec 15 15:08 sunflow.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   6551 Dec 15 15:08 xalan.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   6605 Dec 15 15:08 jython.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   6653 Dec 15 15:08 bloat.6000.198.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   6955 Dec 15 15:08 lusearch.6000.204.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   7022 Dec 15 15:08 hsqldb.6000.762.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   7345 Dec 15 15:08 eclipse.6000.504.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   8246 Dec 15 15:08 runbms.yml
-rw-r--r-- 1 wks wks  64711 Dec 15 15:08 antlr.6000.144.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 121510 Dec 15 15:08 pmd.6000.294.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 226549 Dec 15 15:08 luindex.6000.132.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz

./hammer-2025-12-05-Fri-064328-archived:
total 6672
-rw-r--r-- 1 wks wks     552 Dec 15 15:09 runbms_args.yml
-rw-r--r-- 1 wks wks    8246 Dec 15 15:09 runbms.yml
-rw-r--r-- 1 wks wks  161919 Dec 15 15:09 fop.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks  162933 Dec 15 15:09 xalan.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks  166221 Dec 15 15:09 jython.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks  171026 Dec 15 15:09 lusearch.6000.204.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks  171263 Dec 15 15:09 sunflow.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks  180060 Dec 15 15:09 bloat.6000.198.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks  188478 Dec 15 15:09 eclipse.6000.504.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks  203881 Dec 15 15:09 hsqldb.6000.762.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks 1155654 Dec 15 15:09 antlr.6000.144.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks 1858161 Dec 15 15:09 luindex.6000.132.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log
-rw-r--r-- 1 wks wks 2377643 Dec 15 15:09 pmd.6000.294.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log

./hammer-2025-12-05-Fri-064328-after:
total 100
-rw-r--r-- 1 wks wks  552 Dec 15 15:08 runbms_args.yml
-rw-r--r-- 1 wks wks 4080 Dec 15 15:08 fop.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 4206 Dec 15 15:08 bloat.6000.198.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 4216 Dec 15 15:08 sunflow.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 4283 Dec 15 15:08 jython.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 4290 Dec 15 15:08 xalan.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 4405 Dec 15 15:08 lusearch.6000.204.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 4470 Dec 15 15:08 hsqldb.6000.762.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 4707 Dec 15 15:08 eclipse.6000.504.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 5695 Dec 15 15:08 antlr.6000.144.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 5802 Dec 15 15:08 pmd.6000.294.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 6468 Dec 15 15:08 luindex.6000.132.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.lz
-rw-r--r-- 1 wks wks 8246 Dec 15 15:08 runbms.yml

And it will reduce the space even further if we archive the output of an entire GC run into one single .tar.lz archive. In the output below, before contains all logs produced on hammer.moma on 2025-12-05. after-individual has each individual log re-compressed using lzip -9. after-archived is a directory of uncompressed log files, and we compress it as a hole into after-archived.tar.lz.

wks@luna ~/j/c/whole-run> ll
total 188
drwxr-xr-x 10 wks wks   4096 Dec 15 15:29 after-archived
-rw-r--r--  1 wks wks 176953 Dec 15 15:33 after-archived.tar.lz
drwxr-xr-x 10 wks wks   4096 Dec 15 15:31 after-individual
drwxr-xr-x 10 wks wks   4096 Dec 15 15:28 before
wks@luna ~/j/c/whole-run> ll before
total 32
drwxr-xr-x 3 wks wks 4096 Dec 15 15:28 canary
drwxr-xr-x 3 wks wks 4096 Dec 15 15:28 gencopy
drwxr-xr-x 3 wks wks 4096 Dec 15 15:28 genimmix
drwxr-xr-x 3 wks wks 4096 Dec 15 15:28 immix
drwxr-xr-x 3 wks wks 4096 Dec 15 15:28 marksweep
drwxr-xr-x 3 wks wks 4096 Dec 15 15:28 nogc
drwxr-xr-x 3 wks wks 4096 Dec 15 15:28 semispace
drwxr-xr-x 3 wks wks 4096 Dec 15 15:28 stickyimmix
wks@luna ~/j/c/whole-run> ll before/genimmix/
total 4
drwxr-xr-x 2 wks wks 4096 Dec 15 15:28 hammer-2025-12-05-Fri-064328
wks@luna ~/j/c/whole-run> ll before/genimmix/hammer-2025-12-05-Fri-064328/
total 488
-rw-r--r-- 1 wks wks  64711 Dec 15 15:28 antlr.6000.144.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   6653 Dec 15 15:28 bloat.6000.198.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   7345 Dec 15 15:28 eclipse.6000.504.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   6260 Dec 15 15:28 fop.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   7022 Dec 15 15:28 hsqldb.6000.762.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   6605 Dec 15 15:28 jython.6000.240.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 226549 Dec 15 15:28 luindex.6000.132.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   6955 Dec 15 15:28 lusearch.6000.204.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks 121510 Dec 15 15:28 pmd.6000.294.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks    552 Dec 15 15:28 runbms_args.yml
-rw-r--r-- 1 wks wks   8246 Dec 15 15:28 runbms.yml
-rw-r--r-- 1 wks wks   6510 Dec 15 15:28 sunflow.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
-rw-r--r-- 1 wks wks   6551 Dec 15 15:28 xalan.6000.324.jdk-mmtk.genimmix.common_mmtk.dacapo2006.log.gz
wks@luna ~/j/c/whole-run> du -sh *
53M     after-archived
176K    after-archived.tar.lz
892K    after-individual
3.9M    before

If we can reduce 3.9M to 176K, we should be able to reduce the 2GB repository to about 100M.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions