Commit 76bdfe9

RFC: Direct row codecs for persistent

Add RFC describing the direct decode/encode design that bypasses PersistValue entirely. Includes the SqlBackend bridge via DirectEntity + Typeable, compound type codecs (PgDecode/PgEncode), and the Hedis-style automatic pipelining design. Also adds ARCHITECTURE.md and README.md with benchmark results for persistent-postgresql-ng.

1 parent 855c20c

7 files changed: +7589 / -0 lines changed

RFC-direct-decode.md: 1,055 additions, 0 deletions (large diff not rendered)

persistent-postgresql-ng/ARCHITECTURE.md: 408 additions, 0 deletions (large diff not rendered)

persistent-postgresql-ng/README.md: 158 additions, 0 deletions
# persistent-postgresql-ng

A PostgreSQL backend for [persistent](https://hackage.haskell.org/package/persistent) that uses the **binary wire protocol** and **libpq pipeline mode**.

Mostly a drop-in replacement for `persistent-postgresql`: all standard persistent operations work without code changes beyond type signatures and imports.
## What's different

| Feature | persistent-postgresql | persistent-postgresql-ng |
|---------|----------------------|--------------------------|
| Wire protocol | Text (via postgresql-simple) | Binary (via postgresql-binary) |
| Automatic pipelining | No | Yes: Hedis-style lazy reply stream |
| Bulk insert | `INSERT ... VALUES (?,?,...), (?,?,...), ...` | `INSERT ... SELECT * FROM UNNEST($1::type[], ...)` |
| IN clauses | `IN (?,?,?,...)` | `= ANY($1)` |
| Direct decode path | No | Yes: zero `PersistValue` allocation |
| Result fetch modes | All-at-once only | All-at-once, single-row, chunked (PG17+) |
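The bulk-insert and IN-clause rows translate to ordinary persistent calls. A hedged sketch, assuming a `User` entity with an `age` field defined elsewhere via the usual `mkPersist` Template Haskell (the names `User`, `UserAge`, `bulkInsert`, and `usersByAge` are illustrative, not from this package):

```haskell
import Control.Monad.IO.Class (MonadIO)
import Database.Persist
import Database.Persist.Sql (SqlPersistT)

-- Assumes `User` / `UserAge` come from an entity definition elsewhere.
bulkInsert :: MonadIO m => [User] -> SqlPersistT m [Key User]
bulkInsert = insertMany
-- persistent-postgresql:    INSERT ... VALUES (?,?),(?,?),...
-- persistent-postgresql-ng: one INSERT ... SELECT * FROM UNNEST(...)

usersByAge :: MonadIO m => [Int] -> SqlPersistT m [Entity User]
usersByAge ages = selectList [UserAge <-. ages] []
-- persistent-postgresql:    WHERE "age" IN (?,?,...)
-- persistent-postgresql-ng: WHERE "age" = ANY($1)
```

No call-site changes are needed; the backend rewrites the generated SQL.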
## Benchmarks

Measured against `persistent-postgresql` on the same PostgreSQL 16 instance, under three network conditions: localhost (0ms), 1ms added latency per direction (2ms RTT), and 5ms per direction (10ms RTT). Latency was introduced using a TCP delay proxy (`bench/delay-proxy.py`).
### 0ms latency (localhost, TCP loopback)

![Benchmark: 0ms latency](bench/bench-0ms.svg)

| Benchmark | pipeline | simple | speedup |
|-----------|----------|--------|---------|
| **get ×100 (pipelined reads)** | 1.7ms | 4.7ms | **2.8×** |
| **insert ×100 (pipelined RETURNING)** | 10.8ms | 12.8ms | 1.2× |
| **upsert ×100 (pipelined RETURNING)** | 8.9ms | 12.7ms | **1.4×** |
| insertMany ×1000 (UNNEST) | 5.3ms | 14.1ms | **2.7×** |
| delete ×100 then select | 4.5ms | 7.5ms | **1.7×** |
| mixed DML ×100 then select | 14.6ms | 29.9ms | **2.0×** |
| selectList ×100 | 8.6ms | 11.2ms | 1.3× |

At zero latency, the advantage comes from the binary protocol and UNNEST-based bulk inserts. Individual `get` and `insert` are comparable because round-trip time is negligible.
### 1ms latency per direction (2ms RTT, nearby datacenter)

![Benchmark: 1ms latency](bench/bench-1ms.svg)

| Benchmark | pipeline | simple | speedup |
|-----------|----------|--------|---------|
| **get ×100 (pipelined reads)** | **11ms** | 310ms | **28×** |
| **insert ×100 (pipelined RETURNING)** | **13ms** | 314ms | **24×** |
| **upsert ×100 (pipelined RETURNING)** | **13ms** | 321ms | **25×** |
| insertMany ×1000 (UNNEST) | 8.6ms | 31.0ms | **3.6×** |
| selectList ×100 | 16.6ms | 25.8ms | **1.6×** |
| select IN ×20 | 17.4ms | 24.8ms | **1.4×** |

With even modest latency, the automatic pipelining dominates. `mapM get keys`, `mapM insert records`, and `forM_ records upsert` all send their queries before reading any results: one flush instead of 100 round-trips.
56+
57+
### 5ms latency per direction (10ms RTT, cross-region)
58+
59+
![Benchmark: 5ms latency](bench/bench-5ms.svg)
60+
61+
62+
| Benchmark | pipeline | simple | speedup |
63+
|-----------|----------|--------|---------|
64+
| **get ×100 (pipelined reads)** | **50ms** | 1.19s | **24×** |
65+
| **insert ×100 (pipelined RETURNING)** | **41ms** | 1.20s | **29×** |
66+
| insertMany ×1000 (UNNEST) | 22.8ms | 72.6ms | **3.2×** |
67+
| selectList ×100 | 47.9ms | 74.0ms | **1.5×** |
68+
| select IN ×20 | 44.1ms | 70.3ms | **1.6×** |
69+
70+
The speedup scales linearly with latency. At 10ms RTT, 100 sequential round-trips cost 1000ms minimum. The pipeline pays one RTT for the flush and reads all 100 results from the server's already-buffered responses.
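The round-trip arithmetic above can be made explicit with a toy cost model (illustrative only, not part of the benchmark suite):

```haskell
-- Toy cost model: sequential queries pay one RTT each;
-- a pipeline pays a single RTT for the final flush.
sequentialMs, pipelinedMs :: Double -> Int -> Double
sequentialMs rtt n = rtt * fromIntegral n  -- e.g. 10ms RTT x 100 queries
pipelinedMs  rtt _ = rtt                   -- one flush, one RTT

main :: IO ()
main = print (sequentialMs 10 100, pipelinedMs 10 100)  -- (1000.0,10.0)
```

This ignores per-query server time, which is why the measured pipeline numbers (50ms, not 10ms) sit above the floor.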
### Attributing the speedup: binary protocol vs pipelining

The improvements come from three independent sources: the binary protocol, UNNEST bulk inserts, and pipelining. The 0ms column isolates the binary-protocol effect (pipelining has no benefit when round-trips are free); the 1ms column shows the combined effect, and the difference reveals the pipelining contribution.

| Benchmark | 0ms: pipeline / simple | 1ms: pipeline / simple | Source of speedup |
|-----------|:---:|:---:|---|
| **get ×100** | 1.7ms / 4.7ms (2.8×) | 11ms / 310ms (**28×**) | 0ms: binary decode. 1ms: **Hedis-style lazy pipelining** (100 queries in 1 flush) |
| **insert ×100** | 10.8ms / 12.8ms (1.2×) | 13ms / 314ms (**24×**) | 0ms: binary encode. 1ms: **lazy RETURNING pipelining** |
| **delete ×100** | 8.4ms / 12.9ms (1.5×) | 25ms / 592ms (**24×**) | 0ms: binary protocol. 1ms: **fire-and-forget pipelining** |
| **update ×100** | 8.3ms / 12.5ms (1.5×) | 25ms / 555ms (**22×**) | 0ms: binary protocol. 1ms: **fire-and-forget pipelining** |
| **replace ×100** | 11.1ms / 11.5ms (1.0×) | 27ms / 602ms (**22×**) | 0ms: ~neutral. 1ms: **fire-and-forget pipelining** |
| **insertMany ×1000** | 7.2ms / 16.7ms (2.3×) | 8.6ms / 31.0ms (**3.6×**) | 0ms: **UNNEST** (1 query vs N). 1ms: UNNEST + fewer round-trips |
| **selectList ×100** | 13.5ms / 15.6ms (1.2×) | 16.6ms / 25.8ms (**1.6×**) | 0ms: binary decode. 1ms: binary + pipelined setup |
| **upsert ×100** | 8.9ms / 12.7ms (1.4×) | 13ms / 321ms (**25×**) | 0ms: binary protocol. 1ms: **lazy RETURNING pipelining** |
| **deleteWhere ×100** | 90ms / 99ms (1.1×) | 119ms / 750ms (**6.3×**) | 0ms: ~neutral. 1ms: **fire-and-forget pipelining** |

**Summary of sources:**

| Source | Typical gain at 0ms | Typical gain at 1ms/dir |
|--------|:---:|:---:|
| Binary protocol (encode/decode) | 1.2-2.8× | 1.2-2.8× |
| UNNEST bulk insert | 2.3× | 3.6× |
| Fire-and-forget DML pipelining | 1.0× | 20-24× |
| Hedis-style lazy pipelining (get, insert, upsert) | 1.0× | 24-28× |
| Combined (best case) | 2.8× | **28×** |

The binary protocol provides a constant-factor improvement regardless of latency. Pipelining provides a latency-proportional improvement that dominates at any non-zero network distance.
### Running benchmarks

```bash
# Baseline (direct connection)
stack bench persistent-postgresql-ng

# With artificial latency via TCP proxy
python3 bench/delay-proxy.py 15432 localhost 5432 1 &   # 1ms per direction
PGPORT=15432 PGHOST=127.0.0.1 stack bench persistent-postgresql-ng
kill %1

# With system-level latency (macOS, requires root)
sudo bench/run-with-latency.sh 1   # 1ms via dummynet
```
## Automatic pipelining (Hedis-style)

All read operations (`get`, `getBy`, `insert` with RETURNING, `count`, `exists`) use a [Hedis-style](https://www.iankduncan.com/engineering/2026-02-17-archive-redis-pipelining) lazy reply stream for automatic, optimal pipelining. No API changes are required: standard persistent code like `mapM get keys` is pipelined automatically.

The technique:

1. At connection time, an infinite lazy list of server replies is created using `unsafeInterleaveIO`. Each element, when forced, flushes the send buffer and reads one result.
2. Each command **sends** eagerly (writes to the output buffer) and **receives** lazily (pops an unevaluated thunk from the reply list via `atomicModifyIORef`).
3. The actual network read happens when the caller inspects the result value. If 100 `get` calls are sequenced before any result is inspected, all 100 queries are sent in one flush and the results are read sequentially from the server's response buffer.

The ordering guarantee comes from the lazy list structure: each thunk N is created inside thunk N-1's `unsafeInterleaveIO` body, so replies are always read in pipeline order regardless of evaluation order.
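The steps above can be sketched as a standalone program, generic over the reply-reading action (`lazyReplies` and `popReply` are illustrative names, not this package's API; the real backend's action flushes libpq and reads one result):

```haskell
import Data.IORef (IORef, newIORef, atomicModifyIORef)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Build an infinite lazy list of replies. Forcing element N runs
-- `readReply` once; thunk N+1 is created inside thunk N's body, so
-- replies are consumed in pipeline order regardless of evaluation order.
lazyReplies :: IO a -> IO [a]
lazyReplies readReply = go
  where
    go = unsafeInterleaveIO $ do
      r  <- readReply   -- in the real backend: flush, then read one result
      rs <- go          -- next thunk, created inside this one
      pure (r : rs)

-- Each command pops one unevaluated thunk from the shared stream.
popReply :: IORef [a] -> IO a
popReply ref = atomicModifyIORef ref (\xs -> (tail xs, head xs))

main :: IO ()
main = do
  counter <- newIORef (0 :: Int)
  -- Stand-in for the network read: each forced reply bumps a counter.
  stream  <- lazyReplies (atomicModifyIORef counter (\n -> (n + 1, n + 1)))
  ref     <- newIORef stream
  a <- popReply ref   -- still a thunk; nothing has been read yet
  b <- popReply ref
  print (b, a)        -- forcing b first still yields replies in order: (2,1)
```

Forcing `b` forces the first cons cell (reply 1, already bound to `a`) before the second, which is exactly the ordering guarantee described above.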
Write operations (`delete`, `update`, `replace`, `deleteWhere`, `updateWhere`) remain fire-and-forget: they send the query and don't read the result until a subsequent read operation (or transaction commit) drains them.
## Direct decode path

In addition to the standard `PersistValue`-based path, the backend supports a direct codec path that bypasses `PersistValue` entirely. See the [RFC](../RFC-direct-decode.md) for full design details.

```haskell
-- Switch one import to opt in:
import Database.Persist.Sql.Experimental -- instead of Database.Persist.Sql
```

For code with the concrete backend type (zero overhead, full specialization):

```haskell
rawSqlDirect
  "SELECT name, age FROM users WHERE age > $1"
  (writeParam (18 :: Int))
  :: ReaderT (WriteBackend PostgreSQLBackend) m [(Text, Int64)]
```

For code through `SqlBackend` (uses the `DirectEntity` + `Typeable` bridge):

```haskell
rawSqlDirectCompat
  "SELECT name, age FROM users WHERE age > $1"
  [toPersistValue (18 :: Int)]
  :: ReaderT SqlBackend m (Maybe [(Text, Int64)])
```
## Architecture

See [ARCHITECTURE.md](ARCHITECTURE.md) for detailed internals: pipeline mode, binary protocol, connection lifecycle, error handling, and the direct decode/encode layer.