Final report #773

Open
wants to merge 41 commits into
base: main
41 commits
273d0e3
Add (hopefully) final draft of the final report
dcoutts Jul 1, 2025
c8060a3
Apply the same formatting to the integration notes as final report
dcoutts Jul 1, 2025
09e573a
section in integration notes on security of hash based data structures
dcoutts Jul 1, 2025
a4b291e
Add the previous reports as references
dcoutts Jul 1, 2025
dcc18b8
Add references to the specification items with the targets
jeltsch Jul 1, 2025
4db6e3b
Polish a bit
jeltsch Jul 1, 2025
90fe4be
Revise the part on meeting the memory targets
jeltsch Jul 1, 2025
cea37c3
Elaborate on interactions with Mithril in integration notes
dcoutts Jul 2, 2025
a5f543b
Tweak final report title, and add subtitle
dcoutts Jul 2, 2025
f77131f
Minor edits to final report introduction
dcoutts Jul 2, 2025
592a11e
Final report: add a changelog
jorisdral Jul 2, 2025
d45fd47
Revise the part on the upsert benchmarks
jeltsch Jul 2, 2025
3a43516
Restore 80-columns layout for paragraphs with citations
jeltsch Jul 3, 2025
ed87915
Improve the formatting of the metadata source
jeltsch Jul 3, 2025
1918726
Add @dcoutts to references as integration notes co-author
jeltsch Jul 3, 2025
ad91509
Restore spaces dropped by bibliography style
jeltsch Jul 3, 2025
25e2c7d
Change `master` to `main` in GitHub URLs
jeltsch Jul 3, 2025
7e96f24
Fix the URL of the API documentation
jeltsch Jul 3, 2025
d0b1533
Add (no-break) spaces before citation references
jeltsch Jul 3, 2025
1201acc
Slightly improve the beginning of the introduction
jeltsch Jul 3, 2025
9e93f86
Add integration notes section on possible file system incompatibility…
dcoutts Jul 4, 2025
65a2c1c
Remove `locator` field from the bibliography
jeltsch Jul 5, 2025
874d157
Revise the section on hashing and security
jeltsch Jul 5, 2025
cf64c3b
Revise the section on problems in connection with XFS
jeltsch Jul 5, 2025
205f50e
Make appendix-related references use correct terminology
jeltsch Jul 7, 2025
9f8eed8
Update `lsm-tree-bench-wp8` to `utxo-bench`
jeltsch Jul 7, 2025
f5cc532
Fix formula for hash clash probability
jeltsch Jul 7, 2025
678e544
Add makefile
jeltsch Jul 7, 2025
fa4ab3b
Disable built-in make rules
jeltsch Jul 7, 2025
5173ccc
Switch all GitHub URLs to the commit tagged `final-report`
jeltsch Jul 7, 2025
4590f2f
Change remaining occurrences of `alpha` to `final-report`
jeltsch Jul 8, 2025
5ba9a3e
Add explicit reference to hash salts being per session
jeltsch Jul 8, 2025
930d33c
Update publication months
jeltsch Jul 8, 2025
7b15ff3
Change “Monoidal updates” to “Upserts”
jeltsch Jul 8, 2025
e9f9882
Make references to requirements hyperlinks
jeltsch Jul 8, 2025
0ea9cc3
Improve shell code blocks
jeltsch Jul 8, 2025
cef44f1
Make it clear that we talk about `fio` scores at one point
jeltsch Jul 8, 2025
ef77159
Make a phrase about serial execution clearer
jeltsch Jul 8, 2025
c3dd9ef
Clarify that keys are only roughly uniformly distributed
jeltsch Jul 8, 2025
6538634
Clarify that a list of options is not exhaustive
jeltsch Jul 8, 2025
33beeae
Clarify that another list of options is not exhaustive
jeltsch Jul 8, 2025
1,479 changes: 1,479 additions & 0 deletions doc/final-report/final-report.md

Large diffs are not rendered by default.

17 changes: 17 additions & 0 deletions doc/final-report/ieee-software.csl
@@ -0,0 +1,17 @@
<?xml version="1.0" encoding="utf-8"?>
<style xmlns="http://purl.org/net/xbiblio/csl" version="1.0" default-locale="en-US">
<!-- Generated with https://github.com/citation-style-language/utilities/tree/master/generate_dependent_styles/data/ieee -->
<info>
<title>IEEE Software</title>
<id>http://www.zotero.org/styles/ieee-software</id>
<link href="http://www.zotero.org/styles/ieee-software" rel="self"/>
<link href="http://www.zotero.org/styles/ieee" rel="independent-parent"/>
<link href="http://ieeexplore.ieee.org/servlet/opac?punumber=52" rel="documentation"/>
<category citation-format="numeric"/>
<category field="engineering"/>
<category field="communications"/>
<issn>0740-7459</issn>
<updated>2014-05-15T02:20:32+00:00</updated>
<rights license="http://creativecommons.org/licenses/by-sa/3.0/">This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License</rights>
</info>
</style>
124 changes: 111 additions & 13 deletions doc/final-report/integration-notes.md
@@ -1,9 +1,25 @@
# Storing the Cardano ledger state on disk: integration notes for high-performance backend

Authors: Joris Dral, Wolfgang Jeltsch
Date: May 2025

## Sessions
---
title: "Storing the Cardano ledger state on disk:
integration notes for high-performance backend"
author:
- Duncan Coutts
- Joris Dral
- Wolfgang Jeltsch
date: July 2025

toc: true
numbersections: true
classoption:
- 11pt
- a4paper
geometry:
- margin=2.5cm
header-includes:
- \usepackage{microtype}
- \usepackage{mathpazo}
---

# Sessions

Creating new empty tables or opening tables from snapshots requires a `Session`.
The session can be created using `openSession`, which has to be done in the
@@ -15,7 +31,7 @@ Closing the session will automatically close all tables, but this is only
intended to be a backup functionality: ideally the user closes all tables
manually.

## The compact index
# The compact index

The compact index is a memory-efficient data structure that maintains serialised
keys. Rather than storing full keys, it only stores the first 64 bits of each
@@ -24,9 +40,9 @@ key.
The compact index only works properly if in most cases it can determine the
order of two serialised keys by looking at their 64-bit prefixes. This is the
case, for example, when the keys are hashes: the probability that two hashes
have the same 64-bit prefixes is $\frac{1}{2}^{64}$ and thus very small. If the
hashes are 256 bits in size, then the compact index uses 4 times less memory
than if it would store the full keys.
have the same 64-bit prefixes is $2^{-64}$ and thus very small. If the hashes
are 256 bits in size, then the compact index uses only a quarter of the memory
that it would need for storing the full keys.
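
The idea of ordering keys by their 64-bit prefixes can be illustrated with a minimal sketch. It is not the compact index implementation: keys are modelled as plain byte lists, and the names `prefix64`, `comparePrefixes` and `memoryFraction` are made up for this example.

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Word (Word64, Word8)

-- | The first 64 bits of a serialised key as a big-endian number.
-- Keys shorter than 8 bytes are zero-padded (an assumption of this sketch).
prefix64 :: [Word8] -> Word64
prefix64 key = foldl step 0 (take 8 (key ++ repeat 0))
  where
    step acc byte = (acc `shiftL` 8) .|. fromIntegral byte

-- | Outcome of comparing two keys by their 64-bit prefixes only.
data PrefixComparison
  = Decided Ordering -- ^ the prefixes differ, so the order is known
  | NeedFullKeys     -- ^ the prefixes clash, so a fallback is required
  deriving (Eq, Show)

comparePrefixes :: [Word8] -> [Word8] -> PrefixComparison
comparePrefixes k1 k2 =
  case compare (prefix64 k1) (prefix64 k2) of
    EQ -> NeedFullKeys
    ordering -> Decided ordering

-- | Fraction of memory used when storing prefixes instead of full keys.
memoryFraction :: Int -> Double
memoryFraction keyBits = 64 / fromIntegral keyBits

main :: IO ()
main = do
  -- The keys differ in byte 8, so the prefixes decide the order.
  print (comparePrefixes [1, 2, 3, 4, 5, 6, 7, 8, 9] [1, 2, 3, 4, 5, 6, 7, 9, 0])
  -- The keys differ only in byte 9, so the prefixes clash.
  print (comparePrefixes [1, 2, 3, 4, 5, 6, 7, 8, 9] [1, 2, 3, 4, 5, 6, 7, 8, 0])
  print (memoryFraction 256)
```

The program prints `Decided LT`, `NeedFullKeys` and `0.25`. The second case is the one in which a comparison cannot be made from the prefixes alone.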

There is a backup mechanism in place for the case when the 64-bit prefixes of
keys are not sufficient to make a comparison. This backup mechanism is less
@@ -60,7 +76,7 @@ keys is as good as any other total ordering. However, the consensus layer will
face the situation where a range lookup or a cursor read returns key–value pairs
slightly out of order. Currently, we do not expect this to cause problems.

## Snapshots
# Snapshots

Snapshots currently require support for hard links. This means that on Windows
the library only works when using NTFS. Support for other file systems could be
@@ -84,7 +100,7 @@ a cheaper non-SSD drive. This feature was unfortunately not anticipated in the
project specification and so is not currently included. As discussed above, it
could be added with some additional work.

## Value resolving
# Value resolving

When instantiating the `ResolveValue` class, it is usually advisable to
implement `resolveValue` such that it works directly on the serialised values.
@@ -94,7 +110,7 @@ function is intended to work like `(+)`, then `resolveValue` could add the raw
bytes of the serialised values and would likely achieve better performance this
way.
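
As a rough illustration of adding raw bytes, the following sketch sums two unsigned integers directly in their serialised form. It assumes a little-endian, fixed-width encoding and uses byte lists to stay dependency-free. The function `addSerialised` is a hypothetical helper, not part of the `ResolveValue` class, and real code would operate on the library's own byte representation.

```haskell
import Data.Word (Word8)

-- | Add two unsigned integers given as little-endian byte strings of equal
-- length, wrapping around on overflow. No intermediate number type is built.
addSerialised :: [Word8] -> [Word8] -> [Word8]
addSerialised = go 0
  where
    go :: Int -> [Word8] -> [Word8] -> [Word8]
    go carry (x : xs) (y : ys) =
      let s = fromIntegral x + fromIntegral y + carry
       in fromIntegral (s `mod` 256) : go (s `div` 256) xs ys
    go _ _ _ = []

main :: IO ()
main = print (addSerialised [255, 0] [1, 0])
```

This prints `[0,1]`, the little-endian encoding of 256, which is the sum of 255 and 1.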

## `io-classes` incompatibility
# `io-classes` incompatibility

At the time of writing, various packages in the `cardano-node` stack depend on
`io-classes-1.5` and the 1.5-versions of its daughter packages, like
@@ -124,3 +140,85 @@ It is known to us that the `ouroboros-consensus` stack has not been updated to
https://github.com/IntersectMBO/ouroboros-network/pull/4951. We advise fixing
this Nix-related bug rather than downgrading `lsm-tree`’s dependency on
`io-classes` to version 1.5.

# Security of hash-based data structures

Data structures based on hashing have to be considered carefully when they may
be used with untrusted data. For example, an attacker who can control the keys
in a hash table may be able to provoke hash collisions and cause unexpected
performance problems this way. This is why the Haskell Cardano node
implementation does not use hash tables but ordering-based containers, such as
those provided by `Data.Map`.

The Bloom filters in an LSM-tree are hash-based data structures. For the sake of
performance, they do not use cryptographic hashes. Thus, without additional
measures, an attacker can in principle choose keys whose hashes select mostly
the same bits. This is a potential problem for the UTxO and other stake-related
tables in Cardano, since it is the users who get to pick their UTxO keys (TxIn)
and stake keys (verification key hashes) and these keys will hash the same way
on all other Cardano nodes.

This issue was not considered in the original project specification, but we have
taken it into account and have included a mitigation in `lsm-tree`. The
mitigation is that, on the initial creation of a session, a random salt is
generated and stored persistently as part of the session. This salt is then used
as part of the Bloom filter hashing for all runs in all tables of the session.
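
The effect of a salt on Bloom filter hashing can be sketched as follows. The hash below is 64-bit FNV-1a with the salt mixed into the initial state, chosen here only because it is short. It is a stand-in and not the hash function that `lsm-tree` uses, and the names `saltedHash` and `bitPositions` are hypothetical.

```haskell
import Data.Bits (shiftR, xor, (.|.))
import Data.Word (Word64, Word8)

type Salt = Word64

-- | FNV-1a (64-bit) with the salt mixed into the initial state.
-- A stand-in for illustration; not the hash function used by lsm-tree.
saltedHash :: Salt -> [Word8] -> Word64
saltedHash salt = foldl step (14695981039346656037 `xor` salt)
  where
    step h byte = (h `xor` fromIntegral byte) * 1099511628211

-- | Bit positions that a key occupies in a filter of 'numBits' bits,
-- derived from one salted hash by double hashing.
bitPositions :: Salt -> Int -> Word64 -> [Word8] -> [Word64]
bitPositions salt numHashes numBits key =
  [(h1 + fromIntegral i * h2) `mod` numBits | i <- [0 .. numHashes - 1]]
  where
    h1 = saltedHash salt key
    h2 = (h1 `shiftR` 32) .|. 1

main :: IO ()
main = do
  let key = [0xde, 0xad, 0xbe, 0xef]
  -- The positions of the same key depend on the salt, so keys prepared to
  -- collide under one salt need not collide in a session with another salt.
  print (bitPositions 1 3 1024 key)
  print (bitPositions 2 3 1024 key)
```

In this sketch, each step of the hash is a bijection on the 64-bit state, so two different salts always yield different hash values for the same key.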

The consequence is that, while it is in principle still possible to produce hash
collisions in the Bloom filter, this now depends on knowing the salt. However,
every node has a different salt. Thus a system-wide attack becomes impossible.
It is only plausible to target individual nodes, but discovering a node’s salt
is extremely difficult. In principle there is a timing side channel, in that
collisions will cause more I/O and thus cause longer running times. To exploit
this, an attacker would need to get upstream of a victim node, supply a valid
block and measure the timing of receiving the block downstream. There would,
however, be a large amount of noise spoiling such an attack.

Overall, our judgement is that our mitigation is sufficient, but it merits a
security review from others who may make a different judgement. It is also worth
noting that the described hash clash issue may occur in other LSM-tree
implementations used in other software, related and unrelated to Cardano. In
particular, RocksDB does not appear to use a salt at all.

Note that using a per-run or per-table hash salt would incur non-trivial costs,
because it would reduce the sharing available in bulk Bloom filter lookups,
where several keys are looked up in several filters. Given that the Bloom filter
lookup is a performance-sensitive part of the overall database implementation,
such an approach to salting does not seem feasible. Therefore, we chose to
generate hash salts per session.

In the Cardano context, a downside of picking Bloom filter salts per session and
thus per node is that this interacts poorly with sharing of pre-created
databases. While it would still be possible to copy a whole database session,
since this includes the salt, doing so would result in the salt being shared
between nodes. If SPOs shared databases widely with each other, to avoid
processing the entire chain, then the salt diversity would be lost.

Picking Bloom filter salts per session is particularly problematic in Mithril,
which shares a single copy of the database. It may be necessary for proper
Mithril support to add a re-salting operation and to perform this operation
after cloning a Mithril snapshot. Re-salting would involve re-creating the Bloom
filters for all table runs, which would mean reading each run, inserting its
keys into a new Bloom filter and finally writing out the new Bloom filter.
Adding such a feature would, of course, incur additional development work, but
the infrastructure needed is present already.
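
The re-salting procedure described above can be sketched with a toy Bloom filter. Everything here is hypothetical and simplified: a filter is a list of set bit positions in a 64-bit space, a run is a list of keys, and `positions` is a toy hash unrelated to the hashing scheme of `lsm-tree`.

```haskell
import Data.List (nub, sort)
import Data.Word (Word8)

type Key = [Word8]
type Salt = Int

-- | A toy Bloom filter: the salt it was built with plus its set bits.
data Filter = Filter {filterSalt :: Salt, setBits :: [Int]}
  deriving (Eq, Show)

-- | Toy salted hash yielding two bit positions in a 64-bit filter.
positions :: Salt -> Key -> [Int]
positions salt key = [h `mod` 64, (h `div` 64) `mod` 64]
  where
    h = foldl (\acc byte -> (acc * 31 + fromIntegral byte) `mod` 1000003) salt key

-- | Build a filter for the keys of one run under the given salt.
buildFilter :: Salt -> [Key] -> Filter
buildFilter salt keys = Filter salt (sort (nub (concatMap (positions salt) keys)))

-- | Membership test; may give false positives but never false negatives.
mayContain :: Filter -> Key -> Bool
mayContain (Filter salt bits) key = all (`elem` bits) (positions salt key)

-- | Re-salt a session: read the keys of every run back and build fresh
-- filters. An old filter cannot be converted directly because it does not
-- record which keys set which bits.
resaltSession :: Salt -> [[Key]] -> [Filter]
resaltSession newSalt = map (buildFilter newSalt)

main :: IO ()
main = do
  let runs = [[[1, 2], [3, 4]], [[5, 6]]]
      filters = resaltSession 42 runs
  -- Every key must still be found in the re-salted filter of its run.
  print (and [mayContain f k | (f, ks) <- zip filters runs, k <- ks])
```

The program prints `True`. The cost in this sketch is one pass over the keys of each run, which mirrors the read, insert and write steps listed above.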

# Possible incompatibility with the XFS file system

We have seen at least one failure when disabling disk caching via the table
configuration, using the `DiskCacheNone` setting. Although this is unconfirmed, we
suspect that some versions of Linux’s XFS file system implementation, in
particular the one used by the default AWS Amazon Linux 2023 AMI, do not support
the system call that underlies [`fileSetCaching`] from the `unix` package. This
is an `fcntl` call, used to set the file status flag `O_DIRECT`. XFS certainly
supports `O_DIRECT`, but it may support it only when the file in question is
opened using this flag, not when trying to set this flag for an already open
file.

This problem can be worked around by using the ext4 file system or by using
`DiskCacheAll` in the table configuration, the latter at the cost of using more
memory and putting pressure on the page cache. If this problem is confirmed to
be widespread, it may become necessary to extend the `unix` package to allow
setting the `O_DIRECT` flag upon file opening.
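
Whether a given file system accepts the request on an already open file can be probed before choosing a disk cache policy. The following is a sketch under assumptions, not part of `lsm-tree`: it requires `unix` version 2.8 or later, the helper `tryDisableCaching` and the scratch file name are hypothetical, and the probe is only meaningful if the scratch file lives on the same file system as the database.

```haskell
import Control.Exception (IOException, try)
import System.Posix.Fcntl (fileSetCaching)
import System.Posix.Files (removeLink)
import System.Posix.IO (OpenFileFlags (..), OpenMode (..), closeFd, defaultFileFlags, openFd)
import System.Posix.Types (Fd)

-- | Try to switch off OS-level caching (O_DIRECT on Linux) for an already
-- open file. Returns False if the underlying fcntl call is rejected.
tryDisableCaching :: Fd -> IO Bool
tryDisableCaching fd = do
  result <- try (fileSetCaching fd False) :: IO (Either IOException ())
  pure (either (const False) (const True) result)

main :: IO ()
main = do
  let path = "o-direct-probe.tmp" -- hypothetical scratch file
  fd <- openFd path ReadWrite defaultFileFlags {creat = Just 0o600}
  supported <- tryDisableCaching fd
  closeFd fd
  removeLink path
  putStrLn $
    if supported
      then "uncached I/O can be enabled on an open file here"
      else "fcntl rejected the request; fall back to cached I/O"
```

An application could use the result to select `DiskCacheAll` instead of `DiskCacheNone` on affected systems, accepting the memory cost mentioned above.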

[`fileSetCaching`]: https://hackage-content.haskell.org/package/unix-2.8.7.0/docs/System-Posix-Fcntl.html#v:fileSetCaching
18 changes: 18 additions & 0 deletions doc/final-report/makefile
@@ -0,0 +1,18 @@
.POSIX:

.SILENT:

.SUFFIXES:

.PHONY: all
all: final-report.pdf integration-notes.pdf

final-report.pdf: final-report.md pipelining.pdf
	pandoc -C -o final-report.pdf final-report.md

integration-notes.pdf: integration-notes.md
	pandoc -o integration-notes.pdf integration-notes.md

.PHONY: clean
clean:
	rm -f final-report.pdf integration-notes.pdf
Binary file added doc/final-report/pipelining.pdf
Binary file not shown.
Binary file added doc/final-report/references/utxo-db-api.pdf
Binary file not shown.
Binary file added doc/final-report/references/utxo-db-lsm.pdf
Binary file not shown.
Binary file added doc/final-report/references/utxo-db.pdf
Binary file not shown.