# persistent-postgresql-ng

A PostgreSQL backend for [persistent](https://hackage.haskell.org/package/persistent) that uses the **binary wire protocol** and **libpq pipeline mode**.

Mostly a drop-in replacement for `persistent-postgresql`: all standard persistent operations work without code changes beyond imports and type signatures.

## What's different
| Feature | persistent-postgresql | persistent-postgresql-ng |
|---------|----------------------|--------------------------|
| Wire protocol | Text (via postgresql-simple) | Binary (via postgresql-binary) |
| Automatic pipelining | No | Yes (Hedis-style lazy reply stream) |
| Bulk insert | `INSERT ... VALUES (?,?,...), (?,?,...), ...` | `INSERT ... SELECT * FROM UNNEST($1::type[], ...)` |
| IN clauses | `IN (?,?,?,...)` | `= ANY($1)` |
| Direct decode path | No | Yes (zero `PersistValue` allocation) |
| Result fetch modes | All-at-once only | All-at-once, single-row, chunked (PG17+) |
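To make the bulk-insert row concrete, the UNNEST shape can be sketched as a small SQL builder. This is a hedged illustration: `unnestInsertSql` is a hypothetical helper, not part of this library's API. The point is that N rows bind one array parameter per column and run as a single statement, instead of one placeholder per cell:

```haskell
import Data.List (intercalate)

-- Hypothetical helper: build the UNNEST-style bulk INSERT described in
-- the table above, given a table name and (column, pg type) pairs.
-- Each column becomes one array parameter: $1::text[], $2::int8[], ...
unnestInsertSql :: String -> [(String, String)] -> String
unnestInsertSql table cols =
  "INSERT INTO " ++ table
    ++ " (" ++ intercalate ", " (map fst cols) ++ ")"
    ++ " SELECT * FROM UNNEST("
    ++ intercalate ", " [ "$" ++ show i ++ "::" ++ ty ++ "[]"
                        | (i, (_, ty)) <- zip [1 :: Int ..] cols ]
    ++ ")"

main :: IO ()
main = putStrLn (unnestInsertSql "users" [("name", "text"), ("age", "int8")])
```

However many rows are inserted, the statement text stays fixed, so one prepared statement covers every batch size.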
## Benchmarks

Measured against `persistent-postgresql` on the same PostgreSQL 16 instance, under three network conditions: localhost (0ms), 1ms added latency per direction (2ms RTT), and 5ms per direction (10ms RTT).

Latency was introduced using a TCP delay proxy (`bench/delay-proxy.py`).

### 0ms latency (localhost, TCP loopback)
| Benchmark | pipeline | simple | speedup |
|-----------|----------|--------|---------|
| **get ×100 (pipelined reads)** | 1.7ms | 4.7ms | **2.8×** |
| **insert ×100 (pipelined RETURNING)** | 10.8ms | 12.8ms | 1.2× |
| **upsert ×100 (pipelined RETURNING)** | 8.9ms | 12.7ms | **1.4×** |
| insertMany ×1000 (UNNEST) | 5.3ms | 14.1ms | **2.7×** |
| delete ×100 then select | 4.5ms | 7.5ms | **1.7×** |
| mixed DML ×100 then select | 14.6ms | 29.9ms | **2.0×** |
| selectList ×100 | 8.6ms | 11.2ms | 1.3× |

At zero latency, the advantage comes from the binary protocol and UNNEST-based bulk inserts. Individual `get` and `insert` calls are comparable because round-trip time is negligible.
### 1ms latency per direction (2ms RTT, nearby datacenter)
| Benchmark | pipeline | simple | speedup |
|-----------|----------|--------|---------|
| **get ×100 (pipelined reads)** | **11ms** | 310ms | **28×** |
| **insert ×100 (pipelined RETURNING)** | **13ms** | 314ms | **24×** |
| **upsert ×100 (pipelined RETURNING)** | **13ms** | 321ms | **25×** |
| insertMany ×1000 (UNNEST) | 8.6ms | 31.0ms | **3.6×** |
| selectList ×100 | 16.6ms | 25.8ms | **1.6×** |
| select IN ×20 | 17.4ms | 24.8ms | **1.4×** |

With even modest latency, automatic pipelining dominates. `mapM get keys`, `mapM insert records`, and `forM_ records upsert` all send their queries before reading any results: one flush instead of 100 round-trips.
### 5ms latency per direction (10ms RTT, cross-region)
| Benchmark | pipeline | simple | speedup |
|-----------|----------|--------|---------|
| **get ×100 (pipelined reads)** | **50ms** | 1.19s | **24×** |
| **insert ×100 (pipelined RETURNING)** | **41ms** | 1.20s | **29×** |
| insertMany ×1000 (UNNEST) | 22.8ms | 72.6ms | **3.2×** |
| selectList ×100 | 47.9ms | 74.0ms | **1.5×** |
| select IN ×20 | 44.1ms | 70.3ms | **1.6×** |

The speedup scales linearly with latency. At 10ms RTT, 100 sequential round-trips cost 1000ms minimum; the pipeline pays one RTT for the flush and reads all 100 results from the server's already-buffered responses.
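The round-trip floor behind these numbers can be computed directly. This is a deliberately simplified model (function names are illustrative): server and decode time are ignored, so the values are floors, not predictions.

```haskell
-- Back-of-envelope model: a sequential client pays one RTT per query,
-- while a pipelined client pays roughly one RTT total for the flush.
sequentialFloorMs :: Double -> Double -> Double
sequentialFloorMs nQueries rttMs = nQueries * rttMs

pipelinedFloorMs :: Double -> Double
pipelinedFloorMs rttMs = rttMs

main :: IO ()
main = do
  print (sequentialFloorMs 100 10)  -- the 1s floor behind simple's 1.19s
  print (pipelinedFloorMs 10)       -- why pipeline lands near 50ms, not 1s
```

The measured times sit above these floors by roughly the same margin, which is the constant query-execution and decode cost.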
### Attributing the speedup: binary protocol vs pipelining

The improvements come from three independent sources: the binary protocol, UNNEST-based bulk inserts, and automatic pipelining. The 0ms column isolates the binary-protocol effect (pipelining has no benefit when round-trips are free). The 1ms column shows the combined effect, and the difference reveals the pipelining contribution.

| Benchmark | 0ms: pipeline / simple | 1ms: pipeline / simple | Source of speedup |
|-----------|:---:|:---:|---|
| **get ×100** | 1.7ms / 4.7ms (2.8×) | 11ms / 310ms (**28×**) | 0ms: binary decode. 1ms: **Hedis-style lazy pipelining** (100 queries in 1 flush) |
| **insert ×100** | 10.8ms / 12.8ms (1.2×) | 13ms / 314ms (**24×**) | 0ms: binary encode. 1ms: **lazy RETURNING pipelining** |
| **delete ×100** | 8.4ms / 12.9ms (1.5×) | 25ms / 592ms (**24×**) | 0ms: binary protocol. 1ms: **fire-and-forget pipelining** |
| **update ×100** | 8.3ms / 12.5ms (1.5×) | 25ms / 555ms (**22×**) | 0ms: binary protocol. 1ms: **fire-and-forget pipelining** |
| **replace ×100** | 11.1ms / 11.5ms (1.0×) | 27ms / 602ms (**22×**) | 0ms: ~neutral. 1ms: **fire-and-forget pipelining** |
| **insertMany ×1000** | 7.2ms / 16.7ms (2.3×) | 8.6ms / 31.0ms (**3.6×**) | 0ms: **UNNEST** (1 query vs N). 1ms: UNNEST + fewer round-trips |
| **selectList ×100** | 13.5ms / 15.6ms (1.2×) | 16.6ms / 25.8ms (**1.6×**) | 0ms: binary decode. 1ms: binary + pipelined setup |
| **upsert ×100** | 8.9ms / 12.7ms (1.4×) | 13ms / 321ms (**25×**) | 0ms: binary protocol. 1ms: **lazy RETURNING pipelining** |
| **deleteWhere ×100** | 90ms / 99ms (1.1×) | 119ms / 750ms (**6.3×**) | 0ms: ~neutral. 1ms: **fire-and-forget pipelining** |

**Summary of sources:**

| Source | Typical gain at 0ms | Typical gain at 1ms/dir |
|--------|:---:|:---:|
| Binary protocol (encode/decode) | 1.2-2.8× | 1.2-2.8× |
| UNNEST bulk insert | 2.3× | 3.6× |
| Fire-and-forget DML pipelining | 1.0× | 20-24× |
| Hedis-style lazy pipelining (get, insert, upsert) | 1.0× | 24-28× |
| Combined (best case) | 2.8× | **28×** |

The binary protocol provides a constant-factor improvement regardless of latency. Pipelining provides a latency-proportional improvement that dominates at any non-zero network distance.
### Running benchmarks

```bash
# Baseline (direct connection)
stack bench persistent-postgresql-ng

# With artificial latency via TCP proxy
python3 bench/delay-proxy.py 15432 localhost 5432 1 &  # 1ms per direction
PGPORT=15432 PGHOST=127.0.0.1 stack bench persistent-postgresql-ng
kill %1

# With system-level latency (macOS, requires root)
sudo bench/run-with-latency.sh 1  # 1ms via dummynet
```
## Automatic pipelining (Hedis-style)

All read operations (`get`, `getBy`, `insert` with RETURNING, `count`, `exists`) use a [Hedis-style](https://www.iankduncan.com/engineering/2026-02-17-archive-redis-pipelining) lazy reply stream for automatic, optimal pipelining. No API changes are required: standard persistent code like `mapM get keys` is pipelined automatically.

The technique:

1. At connection time, an infinite lazy list of server replies is created using `unsafeInterleaveIO`. Each element, when forced, flushes the send buffer and reads one result.
2. Each command **sends** eagerly (writes to the output buffer) and **receives** lazily (pops an unevaluated thunk from the reply list via `atomicModifyIORef`).
3. The actual network read happens when the caller inspects the result value. If 100 `get` calls are sequenced before any result is inspected, all 100 queries go out in one flush and the results are read sequentially from the server's response buffer.

The ordering guarantee comes from the lazy list structure: each thunk N is created inside thunk N-1's `unsafeInterleaveIO` body, so replies are always read in pipeline order regardless of evaluation order.

Write operations (`delete`, `update`, `replace`, `deleteWhere`, `updateWhere`) remain fire-and-forget: they send the query and don't read the result until a subsequent read operation (or transaction commit) drains them.
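The three steps above can be sketched self-contained, with an in-memory queue standing in for the socket. Names like `replyStream` and `command` are illustrative, not this package's internals; only the `unsafeInterleaveIO`/`atomicModifyIORef` shape matches the description:

```haskell
import Data.IORef
import System.IO.Unsafe (unsafeInterleaveIO)

-- Toy stand-ins for the libpq connection: a send buffer and the
-- "server's" in-order reply queue.
data Conn = Conn { sendBuf :: IORef [String], serverReplies :: IORef [String] }

-- Sending is eager, but the query only lands in the output buffer.
sendQuery :: Conn -> String -> IO ()
sendQuery c q = modifyIORef' (sendBuf c) (++ [q])

-- Forcing a reply flushes everything buffered so far, then pops one result.
flushAndRead :: Conn -> IO String
flushAndRead c = do
  pending <- atomicModifyIORef' (sendBuf c) (\qs -> ([], qs))
  modifyIORef' (serverReplies c) (++ [ "result:" ++ q | q <- pending ])
  atomicModifyIORef' (serverReplies c) (\rs -> (tail rs, head rs))

-- Step 1: the infinite lazy reply stream. Thunk N is created inside
-- thunk N-1's unsafeInterleaveIO body, so replies come back in order.
replyStream :: Conn -> IO [String]
replyStream c = unsafeInterleaveIO $ do
  r  <- flushAndRead c
  rs <- replyStream c
  pure (r : rs)

-- Step 2: a command sends eagerly and pops an unevaluated thunk
-- (head/tail keep the pop lazy, as in Hedis).
command :: Conn -> IORef [String] -> String -> IO String
command c ref q = do
  sendQuery c q
  atomicModifyIORef ref (\rs -> (tail rs, head rs))

main :: IO ()
main = do
  conn <- Conn <$> newIORef [] <*> newIORef []
  ref  <- newIORef =<< replyStream conn
  -- Three "gets" sequenced before any result is inspected:
  results <- mapM (command conn ref) ["q1", "q2", "q3"]
  buffered <- readIORef (sendBuf conn)
  putStrLn ("buffered before any read: " ++ show (length buffered))
  -- Step 3: forcing the first result flushes all three queries at once.
  mapM_ putStrLn results
```

All three queries are still buffered after `mapM` returns; the first `putStrLn` forces the first thunk, which flushes the batch and then reads replies back in pipeline order.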
## Direct decode path

In addition to the standard `PersistValue`-based path, the backend supports a direct codec path that bypasses `PersistValue` entirely. See the [RFC](../RFC-direct-decode.md) for full design details.

```haskell
-- Switch one import to opt in:
import Database.Persist.Sql.Experimental  -- instead of Database.Persist.Sql
```

For code with the concrete backend type (zero overhead, full specialization):

```haskell
rawSqlDirect
  "SELECT name, age FROM users WHERE age > $1"
  (writeParam (18 :: Int))
  :: ReaderT (WriteBackend PostgreSQLBackend) m [(Text, Int64)]
```

For code through `SqlBackend` (uses `DirectEntity` + `Typeable` bridge):

```haskell
rawSqlDirectCompat
  "SELECT name, age FROM users WHERE age > $1"
  [toPersistValue (18 :: Int)]
  :: ReaderT SqlBackend m (Maybe [(Text, Int64)])
```
## Architecture

See [ARCHITECTURE.md](ARCHITECTURE.md) for detailed internals: pipeline mode, binary protocol, connection lifecycle, error handling, and the direct decode/encode layer.