Adopt GenericXLog for WAL-based crash atomicity and replication #294

@tjgreen42

Summary

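pg_textsearch currently uses MarkBufferDirty + FlushOneBuffer for all buffer modifications, with no WAL logging. This means multi-buffer updates have no WAL-based crash atomicity, and index changes are not carried to physical replicas.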

Adopting Postgres's GenericXLog API would give us proper WAL-based crash atomicity and correct physical replication for free. Segment data pages (the bulk of writes) can remain as MarkBufferDirty + FlushRelationBuffers since they're immutable and unreachable until linked via the metapage.

Scope

Only pointer/metadata operations need GenericXLog — not segment data writes:

| Call site | Buffers | Current approach |
|---|---|---|
| tp_add_docid_to_pages (first page) | 2 (docid + meta) | FlushOneBuffer ordering |
| tp_add_docid_to_pages (chain extend) | 2 (new + old) | FlushOneBuffer ordering |
| tp_add_docid_to_pages (single add) | 1 (docid) | MarkBufferDirty |
| tp_clear_docid_pages | 1 (meta) | FlushOneBuffer |
| tp_build_init_metapage | 1 (meta) | FlushOneBuffer |
| tp_buildempty | 1 (meta) | FlushOneBuffer |
| tp_build corpus stats | 1 (meta) | FlushOneBuffer |
| tp_link_l0_chain_head | 2 (seg + meta) | MarkBufferDirty (no flush) |
| tp_bulk_load_spill_check | 2 (seg + meta) | MarkBufferDirty (no flush) |
| tp_merge_level_segments | 2 (seg + meta) | MarkBufferDirty (no flush) |
| tp_vacuum_replace_segment | 3 (new + prev + meta) | MarkBufferDirty (no flush) |
| tp_bulkdelete stats | 1 (meta) | MarkBufferDirty (no flush) |
| build_parallel metapage | 1 (meta) | MarkBufferDirty (no flush) |

Segment data page writes (segment.c writer, build_context.c dict backpatch, merge.c sink) stay as-is — they're immutable and not reachable until linked.
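
To make the multi-buffer cases concrete, here is a small standalone toy model of the staging behaviour GenericXLog provides. It is not PostgreSQL code and not pg_textsearch code: every `toy_*` name and the `ToyPage` layout are hypothetical stand-ins so the sketch compiles on its own. In a real call site the sequence would be GenericXLogStart, GenericXLogRegisterBuffer for each exclusively locked buffer, in-place edits on the returned page copies, then GenericXLogFinish (or GenericXLogAbort to discard). The real API also caps how many buffers one record may register.

```c
/* Toy model only: in-memory stand-ins that mirror the shape of the
 * GenericXLog API (Start / RegisterBuffer / Finish / Abort). */
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE 64
#define TOY_MAX_PAGES 4

typedef struct ToyPage { unsigned char data[TOY_PAGE_SIZE]; } ToyPage;

typedef struct ToyXLogState {
    ToyPage *target[TOY_MAX_PAGES]; /* the "shared buffer" pages */
    ToyPage  staged[TOY_MAX_PAGES]; /* private working copies */
    int      npages;
} ToyXLogState;

/* Stand-in for GenericXLogStart(): begin an empty change set. */
static void toy_xlog_start(ToyXLogState *st) { st->npages = 0; }

/* Stand-in for GenericXLogRegisterBuffer(): return a private copy to edit. */
static ToyPage *toy_xlog_register(ToyXLogState *st, ToyPage *page)
{
    assert(st->npages < TOY_MAX_PAGES);
    int i = st->npages++;
    st->target[i] = page;
    st->staged[i] = *page;
    return &st->staged[i];
}

/* Stand-in for GenericXLogFinish(): apply all staged pages together.
 * The real call also emits one WAL record covering every registered page. */
static void toy_xlog_finish(ToyXLogState *st)
{
    for (int i = 0; i < st->npages; i++)
        *st->target[i] = st->staged[i];
    st->npages = 0;
}

/* Stand-in for GenericXLogAbort(): drop the copies, targets stay untouched. */
static void toy_xlog_abort(ToyXLogState *st) { st->npages = 0; }

/* Hypothetical two-page update in the spirit of the "seg + meta" rows:
 * byte 0 of the segment page holds its next pointer, byte 0 of the
 * metapage holds the chain head. */
static bool toy_link_head(ToyPage *seg, ToyPage *meta,
                          unsigned char new_head, bool fail_midway)
{
    ToyXLogState st;
    toy_xlog_start(&st);
    ToyPage *seg_copy = toy_xlog_register(&st, seg);
    ToyPage *meta_copy = toy_xlog_register(&st, meta);

    seg_copy->data[0] = meta_copy->data[0]; /* new segment points at old head */
    if (fail_midway)
    {
        toy_xlog_abort(&st);                /* neither page changes */
        return false;
    }
    meta_copy->data[0] = new_head;          /* metapage points at new segment */
    toy_xlog_finish(&st);                   /* both pages change together */
    return true;
}
```

Calling `toy_link_head` with `fail_midway` set leaves both pages as they were, while a normal call updates the segment pointer and the metapage head in one step. That all-or-nothing property is what the FlushOneBuffer ordering in the table currently approximates by hand.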

Blocker

Initial implementation found that GenericXLogFinish during aminsert (the tp_add_docid_to_pages single-docid-add path) causes a BufferContent LWLock self-deadlock on the second INSERT to any BM25 index. Key findings:

  • GenericXLog in tp_build_init_metapage (DDL/CREATE INDEX path) works fine
  • GenericXLog in tp_add_docid_to_pages (DML/aminsert path) deadlocks
  • Even a no-op GenericXLog (register buffer, don't modify, finish) triggers it
  • GenericXLogAbort works — only GenericXLogFinish causes the hang
  • The bloom contrib extension uses GenericXLog in aminsert without issues
  • Not caused by TimescaleDB (tested without it)
  • Reproduces on both debug and release PG18 builds

Next step: attach a debugger to get the exact stack trace and identify which LockBuffer call blocks and which prior lock holds the conflicting BufferContent lock.

Context

This was scoped out during work on #291. The flush-ordering fix in PR #292 is the immediate tactical fix. This issue tracks the proper architectural solution.
