---
title: "Storing the Cardano ledger state on disk:
  integration notes for high-performance backend"
author:
  - Duncan Coutts
  - Joris Dral
  - Wolfgang Jeltsch
date: July 2025

toc: true
numbersections: true
classoption:
  - 11pt
  - a4paper
geometry:
  - margin=2.5cm
header-includes:
  - \usepackage{microtype}
  - \usepackage{mathpazo}
---

# Sessions

Creating new empty tables or opening tables from snapshots requires a `Session`.
The session can be created using `openSession`, which has to be done in the
[…]

Closing the session will automatically close all tables, but this is only
intended to be a backup functionality: ideally the user closes all tables
manually.

# The compact index

The compact index is a memory-efficient data structure that maintains serialised
keys. Rather than storing full keys, it only stores the first 64 bits of each
key. […]

The compact index only works properly if in most cases it can determine the
order of two serialised keys by looking at their 64-bit prefixes. This is the
case, for example, when the keys are hashes: the probability that two hashes
have the same 64-bit prefixes is $2^{-64}$ and thus very small. If the hashes
are 256 bits in size, then the compact index uses four times less memory than
storing the full keys would.
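The scheme can be illustrated with a small, self-contained sketch (an illustration of the idea, not the actual `lsm-tree` code): compare keys via their packed 64-bit prefixes, falling back to a full comparison only when the prefixes tie.

```haskell
import Data.Bits (shiftL, (.|.))
import Data.List (foldl')
import Data.Word (Word8, Word64)

-- A serialised key, modelled here simply as a list of bytes.
type SerialisedKey = [Word8]

-- Pack the first 8 bytes (big-endian) into a Word64, padding with zeros.
-- Big-endian packing makes the numeric order of prefixes agree with the
-- lexicographic order of the underlying byte strings.
prefix64 :: SerialisedKey -> Word64
prefix64 k =
  foldl' (\acc b -> acc `shiftL` 8 .|. fromIntegral b) 0
         (take 8 (k ++ replicate 8 0))

-- Compare via the cheap 64-bit prefixes; fall back to comparing the full
-- keys only when the prefixes tie, which is rare for hash-like keys.
compareKeys :: SerialisedKey -> SerialisedKey -> Ordering
compareKeys a b =
  case compare (prefix64 a) (prefix64 b) of
    EQ  -> compare a b  -- backup path
    ord -> ord

main :: IO ()
main = do
  print (compareKeys [1,2,3,4,5,6,7,8] [1,2,3,4,5,6,7,9])     -- LT, via prefix
  print (compareKeys [1,2,3,4,5,6,7,8,0] [1,2,3,4,5,6,7,8,1]) -- LT, via backup
```

For hash-like keys the backup path is almost never taken, so in the common case only one machine word per key is inspected.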

There is a backup mechanism in place for the case when the 64-bit prefixes of
keys are not sufficient to make a comparison. This backup mechanism is less
[…]

keys is as good as any other total ordering. However, the consensus layer will
face the situation where a range lookup or a cursor read returns key–value pairs
slightly out of order. Currently, we do not expect this to cause problems.

# Snapshots

Snapshots currently require support for hard links. This means that on Windows
the library only works when using NTFS. Support for other file systems could be
[…]

a cheaper non-SSD drive. This feature was unfortunately not anticipated in the
project specification and so is not currently included. As discussed above, it
could be added with some additional work.

# Value resolving

When instantiating the `ResolveValue` class, it is usually advisable to
implement `resolveValue` such that it works directly on the serialised values.
[…]

function is intended to work like `(+)`, then `resolveValue` could add the raw
bytes of the serialised values and would likely achieve better performance this
way.
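As a self-contained sketch of this idea (the encoding and function names here are assumptions for illustration, not `lsm-tree`'s actual API): if values are 64-bit counters serialised as fixed-width little-endian bytes, the resolve function can add two values byte by byte, with carries, without ever deserialising them.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Word (Word8, Word64)

-- Hypothetical fixed-width encoding: a Word64 as 8 little-endian bytes.
serialise :: Word64 -> [Word8]
serialise w = [fromIntegral (w `shiftR` (8 * i)) | i <- [0 .. 7]]

deserialise :: [Word8] -> Word64
deserialise = foldr (\b acc -> fromIntegral b + 256 * acc) 0

-- Resolve two serialised counters like `(+)`, adding raw bytes with carry,
-- without deserialising; the final carry is dropped, matching Word64 overflow.
resolveSerialised :: [Word8] -> [Word8] -> [Word8]
resolveSerialised = go 0
  where
    go _ [] [] = []
    go carry (a : as) (b : bs) =
      let s = fromIntegral a + fromIntegral b + carry :: Word64
      in fromIntegral (s .&. 0xff) : go (s `shiftR` 8) as bs
    go _ _ _ = error "value length mismatch"

main :: IO ()
main =
  print (deserialise (resolveSerialised (serialise 300) (serialise 25)))
  -- prints 325
```

Correctness rests on byte-wise addition with carry agreeing with `serialise (x + y)`, which holds for this fixed-width little-endian format; merging runs can then combine values without a deserialise/serialise round trip.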

# `io-classes` incompatibility

At the time of writing, various packages in the `cardano-node` stack depend on
`io-classes-1.5` and the 1.5-versions of its daughter packages, like
[…]

It is known to us that the `ouroboros-consensus` stack has not been updated to
https://github.com/IntersectMBO/ouroboros-network/pull/4951. We would advise to
fix this Nix-related bug rather than downgrading `lsm-tree`’s dependency on
`io-classes` to version 1.5.

# Security of hash-based data structures

Data structures based on hashing have to be considered carefully when they may
be used with untrusted data. For example, an attacker who can control the keys
in a hash table may be able to provoke hash collisions and cause unexpected
performance problems this way. This is why the Haskell Cardano node
implementation does not use hash tables but ordering-based containers, such as
those provided by `Data.Map`.

The Bloom filters in an LSM-tree are hash-based data structures. For the sake of
performance, they do not use cryptographic hashes. Thus, without additional
measures, an attacker can in principle choose keys whose hashes identify mostly
the same bits. This is a potential problem for the UTxO and other stake-related
tables in Cardano, since it is the users who get to pick their UTxO keys (TxIn)
and stake keys (verification key hashes), and these keys will hash the same way
on all other Cardano nodes.

This issue was not considered in the original project specification, but we have
taken it into account and have included a mitigation in `lsm-tree`. The
mitigation is that, on the initial creation of a session, a random salt is
conjured and stored persistently as part of the session. This salt is then used
as part of the Bloom filter hashing for all runs in all tables of the session.
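To make the effect of salting concrete, here is a small sketch (the FNV-1a hash and the double-hashing scheme are illustrative assumptions, not the actual `lsm-tree` implementation): the same key maps to different filter bits under different salts, so collisions crafted against one salt do not transfer to another.

```haskell
import Data.Bits (xor)
import Data.List (foldl')
import Data.Word (Word8, Word64)

-- FNV-1a seeded with a salt; purely illustrative, not the hash that
-- `lsm-tree` actually uses.
hashWithSalt :: Word64 -> [Word8] -> Word64
hashWithSalt salt = foldl' step (0xcbf29ce484222325 `xor` salt)
  where
    step h b = (h `xor` fromIntegral b) * 0x100000001b3

-- The filter bits a key would set: `k` indices into a filter of `m` bits,
-- derived from two salted hashes (classic double hashing).
bloomBits :: Word64 -> Int -> Int -> [Word8] -> [Int]
bloomBits salt m k key =
  [ fromIntegral ((h1 + fromIntegral i * h2) `mod` fromIntegral m)
  | i <- [0 .. k - 1] ]
  where
    h1 = hashWithSalt salt key
    h2 = hashWithSalt (salt + 1) key

main :: IO ()
main = do
  -- The same key sets different bits under different session salts.
  print (bloomBits 0  4096 3 [0xde, 0xad, 0xbe, 0xef])
  print (bloomBits 42 4096 3 [0xde, 0xad, 0xbe, 0xef])
```

Because each FNV-1a step is invertible on `Word64`, distinct salts necessarily give distinct hashes for the same key, which is what denies an attacker a single set of colliding keys that works against every node.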

The consequence is that, while it is in principle still possible to produce hash
collisions in the Bloom filter, this now depends on knowing the salt. However,
every node should have a different salt, in which case no single block can be
used to attack every node in the system. It is only plausible to target
individual nodes, but discovering a node’s salt is extremely difficult. In
principle there is a timing side channel, in that collisions will cause more I/O
and thus longer running times. To exploit this, an attacker would need to get
upstream of a victim node, supply a valid block on top of the current chain and
measure the timing of receiving the block downstream. There would, however, be a
large amount of noise spoiling such measurements, necessitating many samples.
Creating many samples requires creating many blocks that the victim node will
adopt, which requires substantial stake (or successfully executing an eclipse
attack).

Overall, our judgement is that our mitigation is sufficient, but it merits a
security review from others who may make a different judgement. It is also worth
noting that the described hash clash issue may occur in other LSM-tree
implementations used in other software, related and unrelated to Cardano. In
particular, RocksDB does not appear to use a salt at all.

Note that using a per-run or per-table hash salt would incur non-trivial costs,
because it would reduce the sharing available in bulk Bloom filter lookups,
where several keys are looked up in several filters. Given that the Bloom filter
lookup is a performance-sensitive part of the overall database implementation,
such an approach to salting does not seem feasible. Therefore, we chose to
generate hash salts per session.

In the Cardano context, a downside of picking Bloom filter salts per session and
thus per node is that this interacts poorly with sharing of pre-created
databases. While it would still be possible to copy a whole database session,
since this includes the salt, doing so would result in the salt being shared
between nodes. If SPOs shared databases widely with each other, to avoid
processing the entire chain, then the salt diversity would be lost.

Picking Bloom filter salts per session is particularly problematic for Mithril.
The current Mithril PoC works by copying the node's on-disk file formats. This
design has numerous drawbacks, but it would be particularly bad in this context
because it would share the same Bloom filter salt with all Mithril users. If
Mithril were to use a proper, externally defined snapshot format, rather than
just copying the node's on-disk formats, then restoring a snapshot would
naturally involve creating a new LSM-tree session and thus a fresh local salt.
This would solve the problem.

# Possible incompatibility with the XFS file system

We have seen at least one failure when disabling disk caching via the table
configuration, using the `DiskCacheNone` setting. Although this is unconfirmed,
we suspect that some versions of Linux’s XFS file system implementation, in
particular the one used by the default AWS Amazon Linux 2023 AMI, do not support
the system call that underlies [`fileSetCaching`] from the `unix` package. This
is an `fcntl` call, used to set the file status flag `O_DIRECT`. XFS certainly
supports `O_DIRECT`, but it may support it only when the file in question is
opened with this flag, not when trying to set this flag for an already open
file.

This problem can be worked around by using the ext4 file system or by using
`DiskCacheAll` in the table configuration, the latter at the cost of using more
memory and putting pressure on the page cache. If this problem is confirmed to
be widespread, it may become necessary to extend the `unix` package to allow
setting the `O_DIRECT` flag upon file opening.

[`fileSetCaching`]: https://hackage-content.haskell.org/package/unix-2.8.7.0/docs/System-Posix-Fcntl.html#v:fileSetCaching