Commit c1bbd81

Merge pull request #773 from IntersectMBO/dcoutts/final-report
Final report
2 parents ad260c7 + 3430b87

9 files changed: +1706 -14 lines changed

bench/micro/Bench/Database/LSMTree.hs

Lines changed: 2 additions & 1 deletion
@@ -79,8 +79,9 @@ instance ResolveValue V3 where
 
 benchConfig :: TableConfig
 benchConfig = defaultTableConfig
-    { confWriteBufferAlloc = AllocNumEntries 20000
+    { confWriteBufferAlloc = AllocNumEntries 1000
     , confFencePointerIndex = CompactIndex
+    , confDiskCachePolicy = DiskCacheNone
     }
 
 benchSalt :: Bloom.Salt
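
The hunk above is taken from a module that defines `instance ResolveValue V3`, and the
integration notes added by this commit recommend implementing `resolveValue` directly on
the serialised bytes rather than deserialising first. The following standalone sketch
illustrates that idea for values that are 8-byte little-endian counters. It is only an
illustration: the helper names are made up, and it does not use the library's actual
`ResolveValue` class signature.

```haskell
import qualified Data.ByteString as BS
import           Data.Bits (shiftL, shiftR, (.|.))
import           Data.Word (Word64, Word8)

-- Decode an 8-byte little-endian unsigned counter.
decodeLE :: BS.ByteString -> Word64
decodeLE = BS.foldr (\b acc -> acc `shiftL` 8 .|. fromIntegral b) 0

-- Encode a counter back to 8 little-endian bytes.
encodeLE :: Word64 -> BS.ByteString
encodeLE w = BS.pack [fromIntegral (w `shiftR` (8 * i)) :: Word8 | i <- [0 .. 7]]

-- Resolve two serialised values as if by (+) on the underlying counters,
-- without constructing an intermediate domain-level Haskell value.
resolveSerialised :: BS.ByteString -> BS.ByteString -> BS.ByteString
resolveSerialised x y = encodeLE (decodeLE x + decodeLE y)
```

A production version would also check that both inputs are exactly 8 bytes long before
combining them.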

doc/final-report/final-report.md

Lines changed: 1556 additions & 0 deletions
Large diffs are not rendered by default.

doc/final-report/ieee-software.csl

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+<?xml version="1.0" encoding="utf-8"?>
+<style xmlns="http://purl.org/net/xbiblio/csl" version="1.0" default-locale="en-US">
+  <!-- Generated with https://github.com/citation-style-language/utilities/tree/master/generate_dependent_styles/data/ieee -->
+  <info>
+    <title>IEEE Software</title>
+    <id>http://www.zotero.org/styles/ieee-software</id>
+    <link href="http://www.zotero.org/styles/ieee-software" rel="self"/>
+    <link href="http://www.zotero.org/styles/ieee" rel="independent-parent"/>
+    <link href="http://ieeexplore.ieee.org/servlet/opac?punumber=52" rel="documentation"/>
+    <category citation-format="numeric"/>
+    <category field="engineering"/>
+    <category field="communications"/>
+    <issn>0740-7459</issn>
+    <updated>2014-05-15T02:20:32+00:00</updated>
+    <rights license="http://creativecommons.org/licenses/by-sa/3.0/">This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License</rights>
+  </info>
+</style>

doc/final-report/integration-notes.md

Lines changed: 115 additions & 13 deletions
@@ -1,9 +1,25 @@
-# Storing the Cardano ledger state on disk: integration notes for high-performance backend
-
-Authors: Joris Dral, Wolfgang Jeltsch
-Date: May 2025
-
-## Sessions
+---
+title: "Storing the Cardano ledger state on disk:
+  integration notes for high-performance backend"
+author:
+- Duncan Coutts
+- Joris Dral
+- Wolfgang Jeltsch
+date: July 2025
+
+toc: true
+numbersections: true
+classoption:
+- 11pt
+- a4paper
+geometry:
+- margin=2.5cm
+header-includes:
+- \usepackage{microtype}
+- \usepackage{mathpazo}
+---
+
+# Sessions
 
 Creating new empty tables or opening tables from snapshots requires a `Session`.
 The session can be created using `openSession`, which has to be done in the
@@ -15,7 +31,7 @@ Closing the session will automatically close all tables, but this is only
 intended to be a backup functionality: ideally the user closes all tables
 manually.
 
-## The compact index
+# The compact index
 
 The compact index is a memory-efficient data structure that maintains serialised
 keys. Rather than storing full keys, it only stores the first 64 bits of each
@@ -24,9 +40,9 @@ key.
 The compact index only works properly if in most cases it can determine the
 order of two serialised keys by looking at their 64-bit prefixes. This is the
 case, for example, when the keys are hashes: the probability that two hashes
-have the same 64-bit prefixes is $\frac{1}{2}^{64}$ and thus very small. If the
-hashes are 256 bits in size, then the compact index uses 4 times less memory
-than if it would store the full keys.
+have the same 64-bit prefixes is $2^{-64}$ and thus very small. If the hashes
+are 256 bits in size, then the compact index uses 4 times less memory than if it
+would store the full keys.
 
 There is a backup mechanism in place for the case when the 64-bit prefixes of
 keys are not sufficient to make a comparison. This backup mechanism is less
@@ -60,7 +76,7 @@ keys is as good as any other total ordering. However, the consensus layer will
 face the situation where a range lookup or a cursor read returns key–value pairs
 slightly out of order. Currently, we do not expect this to cause problems.
 
-## Snapshots
+# Snapshots
 
 Snapshots currently require support for hard links. This means that on Windows
 the library only works when using NTFS. Support for other file systems could be
@@ -84,7 +100,7 @@ a cheaper non-SSD drive. This feature was unfortunately not anticipated in the
 project specification and so is not currently included. As discussed above, it
 could be added with some additional work.
 
-## Value resolving
+# Value resolving
 
 When instantiating the `ResolveValue` class, it is usually advisable to
 implement `resolveValue` such that it works directly on the serialised values.
@@ -94,7 +110,7 @@ function is intended to work like `(+)`, then `resolveValue` could add the raw
 bytes of the serialised values and would likely achieve better performance this
 way.
 
-## `io-classes` incompatibility
+# `io-classes` incompatibility
 
 At the time of writing, various packages in the `cardano-node` stack depend on
 `io-classes-1.5` and the 1.5-versions of its daughter packages, like
@@ -124,3 +140,89 @@ It is known to us that the `ouroboros-consensus` stack has not been updated to
 https://github.com/IntersectMBO/ouroboros-network/pull/4951. We would advise to
 fix this Nix-related bug rather than downgrading `lsm-tree`’s dependency on
 `io-classes` to version 1.5.
+
+# Security of hash-based data structures
+
+Data structures based on hashing have to be considered carefully when they may
+be used with untrusted data. For example, an attacker who can control the keys
+in a hash table may be able to provoke hash collisions and cause unexpected
+performance problems this way. This is why the Haskell Cardano node
+implementation does not use hash tables but ordering-based containers, such as
+those provided by `Data.Map`.
+
+The Bloom filters in an LSM-tree are hash-based data structures. For the sake of
+performance, they do not use cryptographic hashes. Thus, without additional
+measures, an attacker can in principle choose keys whose hashes identify mostly
+the same bits. This is a potential problem for the UTxO and other stake-related
+tables in Cardano, since it is the users who get to pick their UTxO keys (TxIn)
+and stake keys (verification key hashes) and these keys will hash the same way
+on all other Cardano nodes.
+
+This issue was not considered in the original project specification, but we have
+taken it into account and have included a mitigation in `lsm-tree`. The
+mitigation is that, on the initial creation of a session, a random salt is
+conjured and stored persistently as part of the session. This salt is then used
+as part of the Bloom filter hashing for all runs in all tables of the session.
+
+The consequence is that, while it is in principle still possible to produce hash
+collisions in the Bloom filter, this now depends on knowing the salt. However,
+every node should have a different salt, in which case no single block can be
+used to attack every node in the system. It is only plausible to target
+individual nodes, but discovering a node’s salt is extremely difficult. In
+principle there is a timing side channel, in that collisions will cause more
+I/O and thus cause longer running times. To exploit this, an attacker would
+need to get upstream of a victim node, supply a valid block on top of the
+current chain and measure the timing of receiving the block downstream. There
+would, however, be a large amount of noise spoiling such measurements,
+necessitating many samples. Creating many samples requires creating many
+blocks that the victim node will adopt, which requires substantial stake (or
+successfully executing an eclipse attack).
+
+Overall, our judgement is that our mitigation is sufficient, but it merits a
+security review from others who may make a different judgement. It is also worth
+noting that the described hash clash issue may occur in other LSM-tree
+implementations used in other software, related and unrelated to Cardano. In
+particular, RocksDB does not appear to use a salt at all.
+
+Note that using a per-run or per-table hash salt would incur non-trivial costs,
+because it would reduce the sharing available in bulk Bloom filter lookups,
+where several keys are looked up in several filters. Given that the Bloom filter
+lookup is a performance-sensitive part of the overall database implementation,
+such an approach to salting does not seem feasible. Therefore, we chose to
+generate hash salts per session.
+
+In the Cardano context, a downside of picking Bloom filter salts per session
+and thus per node is that this interacts poorly with sharing of pre-created
+databases. While it would still be possible to copy a whole database session,
+since this includes the salt, doing so would result in the salt being shared
+between nodes. If SPOs shared databases widely with each other, to avoid
+processing the entire chain, then the salt diversity would be lost.
+
+Picking Bloom filter salts per session is particularly problematic for Mithril.
+The current Mithril PoC works by copying the node's on-disk file formats. This
+design has numerous drawbacks, but would be particularly bad in this context
+because it would share the same Bloom filter salt with all Mithril users. If
+Mithril were to use a proper externally defined snapshot format, rather than
+just copying the node's on-disk formats, then restoring a snapshot would
+naturally involve creating a new LSM-tree session and thus a fresh local salt.
+This would solve the problem.
+
+# Possible incompatibility with the XFS file system
+
+We have seen at least one failure when disabling disk caching via the table
+configuration, using the `DiskCacheNone` setting. Although this is unconfirmed,
+we suspect that some versions of Linux’s XFS file system implementation, in
+particular the one used by the default AWS Amazon Linux 2023 AMI, do not support
+the system call that underlies [`fileSetCaching`] from the `unix` package. This
+is an `fcntl` call, used to set the file status flag `O_DIRECT`. XFS certainly
+supports `O_DIRECT`, but it may support it only when the file in question is
+opened using this flag, not when trying to set this flag for an already open
+file.
+
+This problem can be worked around by using the ext4 file system or by using
+`DiskCacheAll` in the table configuration, the latter at the cost of using more
+memory and putting pressure on the page cache. If this problem is confirmed to
+be widespread, it may become necessary to extend the `unix` package to allow
+setting the `O_DIRECT` flag upon file opening.
+
+[`fileSetCaching`]: https://hackage-content.haskell.org/package/unix-2.8.7.0/docs/System-Posix-Fcntl.html#v:fileSetCaching
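
For reference, the figures quoted in the compact-index hunk above work out as follows:
the chance that two particular hashed keys share a 64-bit prefix is
$2^{-64} \approx 5.4 \times 10^{-20}$, and for 256-bit keys the index keeps only 64 of
256 bits per key, i.e. $256 / 64 = 4$ times less memory than storing the full keys.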
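
The per-session Bloom filter salt described in the security section above can be
pictured with a small sketch. This is not `lsm-tree`'s actual hashing code; it only
shows the shape of the mitigation: the salt enters every hash computation, so the bit
positions a given key maps to differ from session to session. Here `hashWithSalt` from
the `hashable` package stands in for the real non-cryptographic hash, and the parameter
names are invented for the example.

```haskell
import Data.Hashable (Hashable, hashWithSalt)

-- Bit positions that a key sets (on insert) or probes (on lookup) in a
-- Bloom filter of 'numBits' bits, using 'numHashes' hash functions.
-- Mixing the per-session salt into every hash means two sessions map the
-- same key to different positions, so precomputed collisions do not
-- transfer from one node to another.
bloomPositions :: Hashable k => Int -> Int -> Int -> k -> [Int]
bloomPositions sessionSalt numBits numHashes key =
  [ hashWithSalt (sessionSalt + i) key `mod` numBits | i <- [1 .. numHashes] ]
```

Keeping the salt fixed per session, rather than per run or per table, preserves the
sharing in bulk lookups that the notes above describe: the salted hash of a key can be
reused across all the filters consulted for that lookup.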
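
The XFS issue above suggests a defensive pattern around the `DiskCacheNone` path:
attempt the `fcntl`-based toggle and fall back to cached I/O if the file system rejects
it. Below is a rough sketch, assuming `fileSetCaching :: Fd -> Bool -> IO ()` from
`System.Posix.Fcntl` (the function linked above, where `False` disables caching) and
treating any `IOException` as "not supported". It illustrates the workaround, not how
`lsm-tree` itself reacts to the failure.

```haskell
import Control.Exception (IOException, try)
import System.Posix.Fcntl (fileSetCaching)
import System.Posix.Types (Fd)

-- Try to disable OS page caching (O_DIRECT) on an already-open file
-- descriptor. On file systems that reject the underlying fcntl call, as
-- suspected for some XFS configurations, report failure instead of
-- throwing, so the caller can fall back to cached I/O (DiskCacheAll).
tryDisableCaching :: Fd -> IO Bool
tryDisableCaching fd = do
  result <- try (fileSetCaching fd False) :: IO (Either IOException ())
  case result of
    Left _   -> pure False  -- fcntl rejected: keep using the page cache
    Right () -> pure True   -- O_DIRECT is now set on the descriptor
```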

doc/final-report/makefile

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+.POSIX:
+
+.SUFFIXES:
+
+.PHONY: all
+all: final-report.pdf integration-notes.pdf
+
+final-report.pdf: final-report.md ieee-software.csl pipelining.pdf
+	pandoc --citeproc $< -o $@
+
+integration-notes.pdf: integration-notes.md
+	pandoc $< -o $@
+
+.PHONY: clean
+clean:
+	rm -f final-report.pdf integration-notes.pdf

doc/final-report/pipelining.pdf

11.1 KB
Binary file not shown.
323 KB
Binary file not shown.

293 KB
Binary file not shown.

194 KB
Binary file not shown.
