Commit c1bbd81

Merge pull request #773 from IntersectMBO/dcoutts/final-report
Final report
2 parents ad260c7 + 3430b87

9 files changed: +1706 -14 lines changed

bench/micro/Bench/Database/LSMTree.hs

Lines changed: 2 additions & 1 deletion
@@ -79,8 +79,9 @@ instance ResolveValue V3 where
 
 benchConfig :: TableConfig
 benchConfig = defaultTableConfig
-    { confWriteBufferAlloc = AllocNumEntries 20000
+    { confWriteBufferAlloc = AllocNumEntries 1000
     , confFencePointerIndex = CompactIndex
+    , confDiskCachePolicy = DiskCacheNone
     }
 
 benchSalt :: Bloom.Salt
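
The hunk above is taken from a module that defines `instance ResolveValue V3`, and the
integration notes added by this commit recommend implementing `resolveValue` directly on
the serialised bytes rather than deserialising first. The following standalone sketch
illustrates that idea for values that are 8-byte little-endian counters. It is only an
illustration: the helper names are made up, and it does not use the library's actual
`ResolveValue` class signature.

```haskell
import qualified Data.ByteString as BS
import           Data.Bits (shiftL, shiftR, (.|.))
import           Data.Word (Word64, Word8)

-- Decode an 8-byte little-endian unsigned counter.
decodeLE :: BS.ByteString -> Word64
decodeLE = BS.foldr (\b acc -> acc `shiftL` 8 .|. fromIntegral b) 0

-- Encode a counter back to 8 little-endian bytes.
encodeLE :: Word64 -> BS.ByteString
encodeLE w = BS.pack [fromIntegral (w `shiftR` (8 * i)) :: Word8 | i <- [0 .. 7]]

-- Resolve two serialised values as if by (+) on the underlying counters,
-- without constructing an intermediate domain-level Haskell value.
resolveSerialised :: BS.ByteString -> BS.ByteString -> BS.ByteString
resolveSerialised x y = encodeLE (decodeLE x + decodeLE y)
```

A production version would also check that both inputs are exactly 8 bytes long before
combining them.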

doc/final-report/final-report.md

Lines changed: 1556 additions & 0 deletions
Large diffs are not rendered by default.

doc/final-report/ieee-software.csl

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+<?xml version="1.0" encoding="utf-8"?>
+<style xmlns="http://purl.org/net/xbiblio/csl" version="1.0" default-locale="en-US">
+  <!-- Generated with https://github.com/citation-style-language/utilities/tree/master/generate_dependent_styles/data/ieee -->
+  <info>
+    <title>IEEE Software</title>
+    <id>http://www.zotero.org/styles/ieee-software</id>
+    <link href="http://www.zotero.org/styles/ieee-software" rel="self"/>
+    <link href="http://www.zotero.org/styles/ieee" rel="independent-parent"/>
+    <link href="http://ieeexplore.ieee.org/servlet/opac?punumber=52" rel="documentation"/>
+    <category citation-format="numeric"/>
+    <category field="engineering"/>
+    <category field="communications"/>
+    <issn>0740-7459</issn>
+    <updated>2014-05-15T02:20:32+00:00</updated>
+    <rights license="http://creativecommons.org/licenses/by-sa/3.0/">This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License</rights>
+  </info>
+</style>

doc/final-report/integration-notes.md

Lines changed: 115 additions & 13 deletions
@@ -1,9 +1,25 @@
-# Storing the Cardano ledger state on disk: integration notes for high-performance backend
-
-Authors: Joris Dral, Wolfgang Jeltsch
-Date: May 2025
-
-## Sessions
+---
+title: "Storing the Cardano ledger state on disk:
+  integration notes for high-performance backend"
+author:
+- Duncan Coutts
+- Joris Dral
+- Wolfgang Jeltsch
+date: July 2025
+
+toc: true
+numbersections: true
+classoption:
+- 11pt
+- a4paper
+geometry:
+- margin=2.5cm
+header-includes:
+- \usepackage{microtype}
+- \usepackage{mathpazo}
+---
+
+# Sessions
 
 Creating new empty tables or opening tables from snapshots requires a `Session`.
 The session can be created using `openSession`, which has to be done in the
@@ -15,7 +31,7 @@ Closing the session will automatically close all tables, but this is only
 intended to be a backup functionality: ideally the user closes all tables
 manually.
 
-## The compact index
+# The compact index
 
 The compact index is a memory-efficient data structure that maintains serialised
 keys. Rather than storing full keys, it only stores the first 64 bits of each
@@ -24,9 +40,9 @@ key.
 The compact index only works properly if in most cases it can determine the
 order of two serialised keys by looking at their 64-bit prefixes. This is the
 case, for example, when the keys are hashes: the probability that two hashes
-have the same 64-bit prefixes is $\frac{1}{2}^{64}$ and thus very small. If the
-hashes are 256 bits in size, then the compact index uses 4 times less memory
-than if it would store the full keys.
+have the same 64-bit prefixes is $2^{-64}$ and thus very small. If the hashes
+are 256 bits in size, then the compact index uses 4 times less memory than if it
+would store the full keys.
 
 There is a backup mechanism in place for the case when the 64-bit prefixes of
 keys are not sufficient to make a comparison. This backup mechanism is less
@@ -60,7 +76,7 @@ keys is as good as any other total ordering. However, the consensus layer will
 face the situation where a range lookup or a cursor read returns key–value pairs
 slightly out of order. Currently, we do not expect this to cause problems.
 
-## Snapshots
+# Snapshots
 
 Snapshots currently require support for hard links. This means that on Windows
 the library only works when using NTFS. Support for other file systems could be
@@ -84,7 +100,7 @@ a cheaper non-SSD drive. This feature was unfortunately not anticipated in the
 project specification and so is not currently included. As discussed above, it
 could be added with some additional work.
 
-## Value resolving
+# Value resolving
 
 When instantiating the `ResolveValue` class, it is usually advisable to
 implement `resolveValue` such that it works directly on the serialised values.
@@ -94,7 +110,7 @@ function is intended to work like `(+)`, then `resolveValue` could add the raw
 bytes of the serialised values and would likely achieve better performance this
 way.
 
-## `io-classes` incompatibility
+# `io-classes` incompatibility
 
 At the time of writing, various packages in the `cardano-node` stack depend on
 `io-classes-1.5` and the 1.5-versions of its daughter packages, like
@@ -124,3 +140,89 @@ It is known to us that the `ouroboros-consensus` stack has not been updated to
 https://github.com/IntersectMBO/ouroboros-network/pull/4951. We would advise to
 fix this Nix-related bug rather than downgrading `lsm-tree`’s dependency on
 `io-classes` to version 1.5.
+
+# Security of hash-based data structures
+
+Data structures based on hashing have to be considered carefully when they may
+be used with untrusted data. For example, an attacker who can control the keys
+in a hash table may be able to provoke hash collisions and cause unexpected
+performance problems this way. This is why the Haskell Cardano node
+implementation does not use hash tables but ordering-based containers, such as
+those provided by `Data.Map`.
+
+The Bloom filters in an LSM-tree are hash-based data structures. For the sake of
+performance, they do not use cryptographic hashes. Thus, without additional
+measures, an attacker can in principle choose keys whose hashes identify mostly
+the same bits. This is a potential problem for the UTxO and other stake-related
+tables in Cardano, since it is the users who get to pick their UTxO keys (TxIn)
+and stake keys (verification key hashes) and these keys will hash the same way
+on all other Cardano nodes.
+
+This issue was not considered in the original project specification, but we have
+taken it into account and have included a mitigation in `lsm-tree`. The
+mitigation is that, on the initial creation of a session, a random salt is
+conjured and stored persistently as part of the session. This salt is then used
+as part of the Bloom filter hashing for all runs in all tables of the session.
+
+The consequence is that, while it is in principle still possible to produce hash
+collisions in the Bloom filter, this now depends on knowing the salt. However,
+every node should have a different salt, in which case no single block can be
+used to attack every node in the system. It is only plausible to target
+individual nodes, but discovering a node’s salt is extremely difficult. In
+principle there is a timing side channel, in that collisions will cause more
+I/O and thus cause longer running times. To exploit this, an attacker would
+need to get upstream of a victim node, supply a valid block on top of the
+current chain and measure the timing of receiving the block downstream. There
+would, however, be a large amount of noise spoiling such measurements,
+necessitating many samples. Creating many samples requires creating many
+blocks that the victim node will adopt, which requires substantial stake (or
+successfully executing an eclipse attack).
+
+Overall, our judgement is that our mitigation is sufficient, but it merits a
+security review from others who may make a different judgement. It is also worth
+noting that the described hash clash issue may occur in other LSM-tree
+implementations used in other software, related and unrelated to Cardano. In
+particular, RocksDB does not appear to use a salt at all.
+
+Note that using a per-run or per-table hash salt would incur non-trivial costs,
+because it would reduce the sharing available in bulk Bloom filter lookups,
+where several keys are looked up in several filters. Given that the Bloom filter
+lookup is a performance-sensitive part of the overall database implementation,
+such an approach to salting does not seem feasible. Therefore, we chose to
+generate hash salts per session.
+
+In the Cardano context, a downside of picking Bloom filter salts per session
+and thus per node is that this interacts poorly with sharing of pre-created
+databases. While it would still be possible to copy a whole database session,
+since this includes the salt, doing so would result in the salt being shared
+between nodes. If SPOs shared databases widely with each other, to avoid
+processing the entire chain, then the salt diversity would be lost.
+
+Picking Bloom filter salts per session is particularly problematic for Mithril.
+The current Mithril PoC works by copying the node's on-disk file formats. This
+design has numerous drawbacks, but would be particularly bad in this context
+because it would share the same Bloom filter salt with all Mithril users. If
+Mithril were to use a proper externally defined snapshot format, rather than
+just copying the node's on-disk formats, then restoring a snapshot would
+naturally involve creating a new LSM-tree session and thus a fresh local salt.
+This would solve the problem.
+
+# Possible incompatibility with the XFS file system
+
+We have seen at least one failure when disabling disk caching via the table
+configuration, using the `DiskCacheNone` setting. Although this is unconfirmed,
+we suspect that some versions of Linux’s XFS file system implementation, in
+particular the one used by the default AWS Amazon Linux 2023 AMI, do not support
+the system call that underlies [`fileSetCaching`] from the `unix` package. This
+is an `fcntl` call, used to set the file status flag `O_DIRECT`. XFS certainly
+supports `O_DIRECT`, but it may support it only when the file in question is
+opened using this flag, not when trying to set this flag for an already open
+file.
+
+This problem can be worked around by using the ext4 file system or by using
+`DiskCacheAll` in the table configuration, the latter at the cost of using more
+memory and putting pressure on the page cache. If this problem is confirmed to
+be widespread, it may become necessary to extend the `unix` package to allow
+setting the `O_DIRECT` flag upon file opening.
+
+[`fileSetCaching`]: https://hackage-content.haskell.org/package/unix-2.8.7.0/docs/System-Posix-Fcntl.html#v:fileSetCaching
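
For reference, the figures quoted in the compact-index hunk above work out as follows:
the chance that two particular hashed keys share a 64-bit prefix is
$2^{-64} \approx 5.4 \times 10^{-20}$, and for 256-bit keys the index keeps only 64 of
256 bits per key, i.e. $256 / 64 = 4$ times less memory than storing the full keys.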
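
The per-session Bloom filter salt described in the security section above can be
pictured with a small sketch. This is not `lsm-tree`'s actual hashing code; it only
shows the shape of the mitigation: the salt enters every hash computation, so the bit
positions a given key maps to differ from session to session. Here `hashWithSalt` from
the `hashable` package stands in for the real non-cryptographic hash, and the parameter
names are invented for the example.

```haskell
import Data.Hashable (Hashable, hashWithSalt)

-- Bit positions that a key sets (on insert) or probes (on lookup) in a
-- Bloom filter of 'numBits' bits, using 'numHashes' hash functions.
-- Mixing the per-session salt into every hash means two sessions map the
-- same key to different positions, so precomputed collisions do not
-- transfer from one node to another.
bloomPositions :: Hashable k => Int -> Int -> Int -> k -> [Int]
bloomPositions sessionSalt numBits numHashes key =
  [ hashWithSalt (sessionSalt + i) key `mod` numBits | i <- [1 .. numHashes] ]
```

Keeping the salt fixed per session, rather than per run or per table, preserves the
sharing in bulk lookups that the notes above describe: the salted hash of a key can be
reused across all the filters consulted for that lookup.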
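
The XFS issue above suggests a defensive pattern around the `DiskCacheNone` path:
attempt the `fcntl`-based toggle and fall back to cached I/O if the file system rejects
it. Below is a rough sketch, assuming `fileSetCaching :: Fd -> Bool -> IO ()` from
`System.Posix.Fcntl` (the function linked above, where `False` disables caching) and
treating any `IOException` as "not supported". It illustrates the workaround, not how
`lsm-tree` itself reacts to the failure.

```haskell
import Control.Exception (IOException, try)
import System.Posix.Fcntl (fileSetCaching)
import System.Posix.Types (Fd)

-- Try to disable OS page caching (O_DIRECT) on an already-open file
-- descriptor. On file systems that reject the underlying fcntl call, as
-- suspected for some XFS configurations, report failure instead of
-- throwing, so the caller can fall back to cached I/O (DiskCacheAll).
tryDisableCaching :: Fd -> IO Bool
tryDisableCaching fd = do
  result <- try (fileSetCaching fd False) :: IO (Either IOException ())
  case result of
    Left _   -> pure False  -- fcntl rejected: keep using the page cache
    Right () -> pure True   -- O_DIRECT is now set on the descriptor
```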

doc/final-report/makefile

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+.POSIX:
+
+.SUFFIXES:
+
+.PHONY: all
+all: final-report.pdf integration-notes.pdf
+
+final-report.pdf: final-report.md ieee-software.csl pipelining.pdf
+	pandoc --citeproc $< -o $@
+
+integration-notes.pdf: integration-notes.md
+	pandoc $< -o $@
+
+.PHONY: clean
+clean:
+	rm -f final-report.pdf integration-notes.pdf

doc/final-report/pipelining.pdf

11.1 KB
Binary file not shown.
323 KB
Binary file not shown.

293 KB
Binary file not shown.

194 KB
Binary file not shown.
