>> kawipiko-archiver --help >> kawipiko-archiver --man
--sources <path> --archive <path> --compress <gzip | zopfli | brotli | identity> --compress-level <number> --compress-cache <path> --sources-cache <path> --exclude-index --exclude-strip --exclude-cache --include-etag --exclude-slash-redirects --include-folder-listing --exclude-paths-index --progress --debug --version --help (show this short help) --man (show the full manual) --sources-md5 (dump an ``md5sum`` of the sources) --sources-cpio (dump a ``cpio.gz`` of the sources) --sbom --sbom-text --sbom-json
--sources
The path to the source folder that is the root of the static website content.
--archive
The path to the target CDB file that contains the archived static content.
--compress
, and --compress-level
Each individual file (and consequently of the corresponding HTTP response body) is compressed with either
gzip
,zopfli
orbrotli
; by default (or alternatively withidentity
) no compression is used.Even if compression is explicitly requested, if the compression ratio is bellow a certain threshold (depending on the uncompressed size), the file is stored without any compression. (It's senseless to force the client to spend time and decompress the response body if that time is not recovered during network transmission.)
The compression level can be chosen, the value depending on the algorithm:
gzip
---1
for algorithm default,-2
for Huffman only,0
to9
for fast to slow;zopfli
---1
for algorithm default,0
to30
iterations for fast to slow;brotli
---1
for algorithm default,0
to9
for fast to slow,-2
for extreme;- (by "algorithm default", it is meant "what that algorithm considers the recommended default compression level";)
kawipiko
by default uses the maximum compression level for each algorithm; (i.e.9
forgzip
,30
forzopfli
, and-2
forbrotli
;)
--compress-cache <path>
, and --sources-cache <path>
At the given path a single file is created (that is an BBolt database), that will be used to cache the following information:
- in case of
--sources-cache
, the fingerprint of each file contents is stored, so that if the file was not changed, re-reading it shouldn't be attempted unless it is absolutely necessary; also if the file is small enough, its contents is stored in this database (deduplicated by its fingerprint);- in case of
--compress-cache
the compression outcome of each file contents is stored (deduplicated by its fingerprint), so that compression is done only once over multiple runs;Each of these caches can be safely reused between multiple related archives, especially when they have many files in common. Each of these caches can be independently used (or shared).
Using these caches allows one to very quickly rebuild an archive when only a couple of files have been changed, without even touching the file-system for the unchanged ones.
--exclude-index
Disables using_index.*
andindex.*
files (where.*
is one of.html
,.htm
,.xhtml
,.xht
,.txt
,.json
, and.xml
) to respond to a request whose URL path ends in/
(corresponding to the folder wherein_index.*
orindex.*
file is located). (This can be used to implement "slash" blog style URL's like/blog/whatever/
which maps to/blog/whatever/index.html
.)
--exclude-strip
Disables using a file with the suffix.html
,.htm
,.xhtml
,.xht
, and.txt
to respond to a request whose URL does not exactly match an existing file. (This can be used to implement "suffix-less" blog style URL's like/blog/whatever
which maps to/blog/whatever.html
.)
--exclude-cache
Disables adding anCache-Control: public, immutable, max-age=3600
header that forces the browser (and other intermediary proxies) to cache the response for an hour (thepublic
andmax-age=3600
arguments), and furthermore not request it even on reloads (theimmutable
argument).
--include-etag
Enables adding an
ETag
response header that contains the SHA256 of the response body.By not including the
ETag
header (i.e. the default), and because identical headers are stored only one, if one has many files of the same type (that in turn withoutETag
generates the same headers), this can lead to significant reduction in stored headers blocks, including reducing RAM usage. (At this moment it does not support HTTP conditional requests, i.e. theIf-None-Match
,If-Modified-Since
and their counterparts; however thisETag
header might be used in conjuction withHEAD
requests to see if the resource has changed.)
--exclude-slash-redirects
Disables adding redirects to/from paths with/without / (For example, by default, if /file exists, then there is also a /file/ redirect towards /file; and vice-versa from /folder towards /folder/.)
--include-folder-listing
Enables the creation of an internal list of folders.
--exclude-paths-index
Disables the creation of an internal list of references that can be used in conjunction with the--index-all
flag of thekawipiko-server
.
--progress
Enables periodic reporting of various metrics.
--debug
Enables verbose logging. It will log various information about the archived files (including compression statistics).
- any file with the following prefixes:
.
,#
; - any file with the following suffixes:
~
,#
,.log
,.tmp
,.temp
,.lock
; - any file that contains the following:
#
; - any file that exactly matches the following:
Thumbs.db
,.DS_Store
; - (at the moment these rules are not configurable through flags;)
By placing a file whose name matches _wildcard.*
(i.e. with the prefix _wildcard.
and any other suffix), it will be used to respond to any request whose URL fails to find a "better" match.
These wildcard files respect the folder hierarchy, in that wildcard files in (direct or transitive) subfolders override the wildcard file in their parents (direct or transitive).
In addition to _wildcard.*
, there is also support for _200.html
(or just 200.html
), plus _404.html
(or just 404.html
).
By placing a file whose name is _redirects
(or _redirects.txt
), it instructs the archiver to create redirect responses.
The syntax is quite simple:
# This is a comment. # NOTE: Absolute paths are allowed only at the top of the sources folder. /some-path https://example.com/ 301 # NOTE: Relative paths are always, and are reinterpreted as relative to the containing folder. ./some-path https://example.com/ 302 # NOTE: Redirects only for a specific domain. (The protocol is irelevant.) # (Allowed only at the top of the sources folder.) ://example.com/some-path https://example.com/ 303 http://example.com/some-path https://example.com/ 307 https://example.com/some-path https://example.com/ 308
You freely use symlinks (including pointing outside of the content root) and they will be crawled during archival respecting the "logical" hierarchy they introduce. (Any loop that you introduce into the hierarchy will be ignored and a warning will be issued.)
You can safely symlink or hardlink the same file (or folder) in multiple places (within the content hierarchy), and its data will be stored only once. (The same applies to duplicated files that have exactly the same data.)