IntersectMBO · dcoutts · Jul 1, 2025
@@ -0,0 +1,17 @@
+<?xml version="1.0" encoding="utf-8"?>
+<style xmlns="http://purl.org/net/xbiblio/csl" version="1.0" default-locale="en-US">
+  <!-- Generated with https://github.com/citation-style-language/utilities/tree/master/generate_dependent_styles/data/ieee -->
+  <info>
+    <title>IEEE Software</title>
+    <id>http://www.zotero.org/styles/ieee-software</id>
+    <link href="http://www.zotero.org/styles/ieee-software" rel="self"/>
+    <link href="http://www.zotero.org/styles/ieee" rel="independent-parent"/>
+    <link href="http://ieeexplore.ieee.org/servlet/opac?punumber=52" rel="documentation"/>
+    <category citation-format="numeric"/>
+    <category field="engineering"/>
+    <category field="communications"/>
+    <issn>0740-7459</issn>
+    <updated>2014-05-15T02:20:32+00:00</updated>
+    <rights license="http://creativecommons.org/licenses/by-sa/3.0/">This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License</rights>
+  </info>
+</style>
@@ -1,9 +1,25 @@
-# Storing the Cardano ledger state on disk: integration notes for high-performance backend
-
-Authors: Joris Dral, Wolfgang Jeltsch
-Date: May 2025
-
-## Sessions
+---
+title: "Storing the Cardano ledger state on disk:
+        integration notes for high-performance backend"
+author:
+  - Duncan Coutts
+  - Joris Dral
+  - Wolfgang Jeltsch
+date: July 2025
+
+toc: true
+numbersections: true
+classoption:
+ - 11pt
+ - a4paper
+geometry:
+ - margin=2.5cm
+header-includes:
+ - \usepackage{microtype}
+ - \usepackage{mathpazo}
+---
+
+# Sessions
 
 Creating new empty tables or opening tables from snapshots requires a `Session`.
 The session can be created using `openSession`, which has to be done in the
@@ -15,7 +31,7 @@ Closing the session will automatically close all tables, but this is only
 intended to be a backup functionality: ideally the user closes all tables
 manually.
 
-## The compact index
+# The compact index
 
 The compact index is a memory-efficient data structure that maintains serialised
 keys. Rather than storing full keys, it only stores the first 64 bits of each
@@ -24,9 +40,9 @@ key.
 The compact index only works properly if in most cases it can determine the
 order of two serialised keys by looking at their 64-bit prefixes. This is the
 case, for example, when the keys are hashes: the probability that two hashes
-have the same 64-bit prefixes is $\frac{1}{2}^{64}$ and thus very small. If the
-hashes are 256 bits in size, then the compact index uses 4 times less memory
-than if it would store the full keys.
+have the same 64-bit prefix is $2^{-64}$ and thus very small. If the hashes
+are 256 bits in size, then the compact index uses a quarter of the memory that
+storing the full keys would require.
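+
+As a rough illustration of the idea, and not `lsm-tree`'s actual
+implementation, the following Haskell sketch compares serialised keys by their
+64-bit prefixes and reports when the prefixes tie, so that a comparison of the
+full keys is needed. The names `prefix64` and `compareByPrefix` and the
+representation of keys as `[Word8]` are assumptions made for this example.
+
+```haskell
+import Data.Bits (shiftL, (.|.))
+import Data.Word (Word64, Word8)
+
+-- | Pack the first 8 bytes of a serialised key into a Word64 (big-endian),
+-- padding keys shorter than 8 bytes with zero bytes.
+prefix64 :: [Word8] -> Word64
+prefix64 key = foldl step 0 (take 8 (key ++ repeat 0))
+  where
+    step acc byte = (acc `shiftL` 8) .|. fromIntegral byte
+
+-- | Compare two keys by prefix only. 'Nothing' means that the prefixes tie
+-- and the full keys have to be consulted.
+compareByPrefix :: [Word8] -> [Word8] -> Maybe Ordering
+compareByPrefix a b = case compare (prefix64 a) (prefix64 b) of
+  EQ    -> Nothing
+  other -> Just other
+
+main :: IO ()
+main = do
+  -- Two hypothetical 256-bit keys that differ in their eighth byte.
+  let k1 = [0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08] ++ replicate 24 0xAA
+      k2 = [0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x09] ++ replicate 24 0x00
+  print (compareByPrefix k1 k2) -- prints: Just LT
+  print (compareByPrefix k1 k1) -- prints: Nothing
+```
+
+With 256-bit keys, each entry in this scheme keeps 8 bytes instead of 32, which
+is where the factor of 4 comes from.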
 
 There is a backup mechanism in place for the case when the 64-bit prefixes of
 keys are not sufficient to make a comparison. This backup mechanism is less
@@ -60,7 +76,7 @@ keys is as good as any other total ordering. However, the consensus layer will
 face the situation where a range lookup or a cursor read returns key–value pairs
 slightly out of order. Currently, we do not expect this to cause problems.
 
-## Snapshots
+# Snapshots
 
 Snapshots currently require support for hard links. This means that on Windows
 the library only works when using NTFS. Support for other file systems could be
@@ -84,7 +100,7 @@ a cheaper non-SSD drive. This feature was unfortunately not anticipated in the
 project specification and so is not currently included. As discussed above, it
 could be added with some additional work.
 
-## Value resolving
+# Value resolving
 
 When instantiating the `ResolveValue` class, it is usually advisable to
 implement `resolveValue` such that it works directly on the serialised values.
@@ -94,7 +110,7 @@ function is intended to work like `(+)`, then `resolveValue` could add the raw
 bytes of the serialised values and would likely achieve better performance this
 way.
 
-## `io-classes` incompatibility
+# `io-classes` incompatibility
 
 At the time of writing, various packages in the `cardano-node` stack depend on
 `io-classes-1.5` and the 1.5-versions of its daughter packages, like
@@ -124,3 +140,85 @@ It is known to us that the `ouroboros-consensus` stack has not been updated to
 https://github.com/IntersectMBO/ouroboros-network/pull/4951. We would advise to
 fix this Nix-related bug rather than downgrading `lsm-tree`’s dependency on
 `io-classes` to version 1.5.
+
+# Security of hash-based data structures
+
+Data structures based on hashing have to be considered carefully when they may
+be used with untrusted data. For example, an attacker who can control the keys
+in a hash table may be able to provoke hash collisions and cause unexpected
+performance problems this way. This is why the Haskell Cardano node
+implementation does not use hash tables but ordering-based containers, such as
+those provided by `Data.Map`.
+
+The Bloom filters in an LSM-Tree are hash-based data structures. For the sake of
+performance, they do not use cryptographic hashes. Thus, without additional
+measures, an attacker can in principle choose keys whose hashes select mostly
+the same bits. This is a potential problem for the UTxO and other stake-related
+tables in Cardano, since it is the users who get to pick their UTxO keys (TxIn)
+and stake keys (verification key hashes) and these keys will hash the same way
+on all Cardano nodes.
+
+This issue was not considered in the original project specification, but we have
+taken it into account and have included a mitigation in `lsm-tree`. The
+mitigation is that, on the initial creation of a session, a random salt is
+generated and stored persistently as part of the session. This salt is then used
+as part of the Bloom filter hashing for all runs in all tables of the session.
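+
+To make the effect of the salt concrete, here is a minimal Haskell sketch of a
+salted Bloom filter. It is not the `lsm-tree` implementation: the FNV-1a hash,
+the double-hashing probe scheme, the `Integer`-backed bit set and all names
+(`Salt`, `saltedHash`, `Bloom`, `probes`) are assumptions chosen for brevity,
+and the salts in `main` are fixed placeholders where a real session would use
+a random value.
+
+```haskell
+import Data.Bits (setBit, shiftR, testBit, xor, (.|.))
+import Data.List (foldl')
+import Data.Word (Word64, Word8)
+
+-- | Hypothetical per-session salt. A real one would be drawn from a random
+-- source once, at session creation, and stored with the session.
+type Salt = Word64
+
+-- | FNV-1a over the key bytes, with the salt mixed into the initial state.
+-- This stands in for whatever fast, non-cryptographic hash a filter uses.
+saltedHash :: Salt -> [Word8] -> Word64
+saltedHash salt = foldl' step (0xcbf29ce484222325 `xor` salt)
+  where
+    step h byte = (h `xor` fromIntegral byte) * 0x100000001b3
+
+data Bloom = Bloom
+  { bloomSalt   :: Salt
+  , bloomSize   :: Int      -- number of bits in the filter
+  , bloomHashes :: Int      -- number of probes per key
+  , bloomBits   :: Integer  -- bit set, kept in an Integer for simplicity
+  }
+
+emptyBloom :: Salt -> Int -> Int -> Bloom
+emptyBloom salt size hashes = Bloom salt size hashes 0
+
+-- | Bit positions probed for a key, derived from one salted hash by
+-- double hashing.
+probes :: Bloom -> [Word8] -> [Int]
+probes b key =
+  [ fromIntegral ((h1 + fromIntegral i * h2) `mod` fromIntegral (bloomSize b))
+  | i <- [0 .. bloomHashes b - 1]
+  ]
+  where
+    h1 = saltedHash (bloomSalt b) key
+    h2 = (h1 `shiftR` 32) .|. 1 -- forced odd so that it is never zero
+
+insert :: [Word8] -> Bloom -> Bloom
+insert key b = b { bloomBits = foldl' setBit (bloomBits b) (probes b key) }
+
+member :: [Word8] -> Bloom -> Bool
+member key b = all (testBit (bloomBits b)) (probes b key)
+
+main :: IO ()
+main = do
+  let key   = [0xde, 0xad, 0xbe, 0xef]
+      nodeA = emptyBloom 0x1234 1024 3 -- placeholder salt of one node
+      nodeB = emptyBloom 0x9abc 1024 3 -- placeholder salt of another node
+  print (probes nodeA key) -- bit positions depend on node A's salt
+  print (probes nodeB key) -- bit positions depend on node B's salt
+  print (member key (insert key nodeA)) -- prints: True
+```
+
+Because the probed bit positions depend on the salt, a set of keys crafted to
+collide under one salt has no particular reason to collide under another.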
+
+The consequence is that, while it is in principle still possible to produce hash
+collisions in the Bloom filter, this now depends on knowing the salt. However,
+every node has a different salt. Thus a system-wide attack becomes impossible.
+It is only plausible to target individual nodes, but discovering a node’s salt
+is extremely difficult. In principle there is a timing side channel, in that
+collisions will cause more I/O and thus cause longer running times. To exploit
+this, an attacker would need to get upstream of a victim node, supply a valid
+block and measure the timing of receiving the block downstream. There would,
+however, be a large amount of noise spoiling such an attack.
+
+Overall, our judgement is that our mitigation is sufficient, but it merits a
+security review from others who may make a different judgement. It is also worth
+noting that the described hash collision issue may occur in other LSM-tree
+implementations used in other software, related and unrelated to Cardano. In
+particular, RocksDB does not appear to use a salt at all.
+
+Note that using a per-run or per-table hash salt would incur non-trivial costs,
+because it would reduce the sharing available in bulk Bloom filter lookups,
+where several keys are looked up in several filters. Given that the Bloom filter
+lookup is a performance-sensitive part of the overall database implementation,
+such an approach to salting does not seem feasible. Therefore, we chose to
+generate hash salts per session.
+
+In the Cardano context, a downside of picking Bloom filter salts per session and
+thus per node is that this interacts poorly with sharing of pre-created
+databases. While it would still be possible to copy a whole database session,
+since this includes the salt, doing so would result in the salt being shared
+between nodes. If SPOs shared databases widely with each other, to avoid
+processing the entire chain, then the salt diversity would be lost.
+
+Picking Bloom filter salts per session is particularly problematic in Mithril,
+which shares a single copy of the database. It may be necessary for proper
+Mithril support to add a re-salting operation and to perform this operation
+after cloning a Mithril snapshot. Re-salting would involve re-creating the Bloom
+filters for all table runs, which would mean reading each run, inserting its
+keys into a new Bloom filter and finally writing out the new Bloom filter.
+Adding such a feature would, of course, incur additional development work, but
+the infrastructure needed is present already.
+
+# Possible incompatibility with the XFS file system
+
+We have seen at least one failure when disabling disk caching via the table
+configuration, using the `DiskCacheNone` setting. Although this is unconfirmed, we
+suspect that some versions of Linux’s XFS file system implementation, in
+particular the one used by the default AWS Amazon Linux 2023 AMI, do not support
+the system call that underlies [`fileSetCaching`] from the `unix` package. This
+is an `fcntl` call, used to set the file status flag `O_DIRECT`. XFS certainly
+supports `O_DIRECT`, but it may support it only when the file in question is
+opened using this flag, not when trying to set this flag for an already open
+file.
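+
+The suspected behaviour can be probed with a small Haskell program along the
+following lines. This is only a sketch: it assumes the `unix` package in a
+version that provides `System.Posix.Fcntl.fileSetCaching`, the helper
+`tryDisableCaching` is a name invented for this example, and it does not
+reflect how `lsm-tree` itself handles the failure.
+
+```haskell
+import Control.Exception (IOException, bracket, try)
+import System.Environment (getArgs)
+import System.IO (IOMode (ReadMode), openFile)
+import System.Posix.Fcntl (fileSetCaching)
+import System.Posix.IO (closeFd, handleToFd)
+import System.Posix.Types (Fd)
+
+-- | Try to disable caching, that is, to set O_DIRECT, on a descriptor that
+-- is already open. Returns False if the underlying fcntl call is rejected.
+tryDisableCaching :: Fd -> IO Bool
+tryDisableCaching fd = do
+  result <- try (fileSetCaching fd False) :: IO (Either IOException ())
+  pure (either (const False) (const True) result)
+
+main :: IO ()
+main = do
+  args <- getArgs
+  case args of
+    [path] ->
+      -- Open the file normally first, then try to switch it to O_DIRECT.
+      bracket (openFile path ReadMode >>= handleToFd) closeFd $ \fd -> do
+        ok <- tryDisableCaching fd
+        putStrLn $
+          if ok
+            then "O_DIRECT set on the open file"
+            else "could not set O_DIRECT; keeping cached I/O"
+    _ -> putStrLn "usage: pass the path of a file on the file system to probe"
+```
+
+Running such a probe against a file on the affected XFS volume and against a
+file on an ext4 volume would help to confirm or refute the suspicion.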
+
+This problem can be worked around by using the ext4 file system or by using
+`DiskCacheAll` in the table configuration, the latter at the cost of using more
+memory and putting pressure on the page cache. If this problem is confirmed to
+be widespread, it may become necessary to extend the `unix` package to allow
+setting the `O_DIRECT` flag upon file opening.
+
+[`fileSetCaching`]: https://hackage-content.haskell.org/package/unix-2.8.7.0/docs/System-Posix-Fcntl.html#v:fileSetCaching
@@ -0,0 +1,18 @@
+.POSIX:
+
+.SILENT:
+
+.SUFFIXES:
+
+.PHONY: all
+all: final-report.pdf integration-notes.pdf
+
+final-report.pdf: final-report.md pipelining.pdf
+	pandoc -C -o final-report.pdf final-report.md
+
+integration-notes.pdf: integration-notes.md
+	pandoc -o integration-notes.pdf integration-notes.md
+
+.PHONY: clean
+clean:
+	rm -f final-report.pdf integration-notes.pdf