Background
The tp_memtable_spill_threshold GUC controls when a memtable spills to a disk segment. The default is 32M posting entries.
In parallel builds, each worker has its own memtable, and each can grow up to this threshold. Total memory usage therefore scales linearly with the number of workers.
Problem
We need experimental data to understand RAM consumption as a function of:
- tp_memtable_spill_threshold value
- Document characteristics (term count, term length distribution)
- Number of parallel workers
Tasks
- Gather experimental evidence on real workloads (MS MARCO, Wikipedia, etc.)
- Develop a formula to estimate RAM usage: f(threshold, doc_stats) -> bytes
- Use this to:
  - Issue warnings if estimated usage exceeds maintenance_work_mem
  - Optionally cap tp_memtable_spill_threshold or reduce worker count
- Document memory requirements for users
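The warn-and-cap step could look roughly like the sketch below. It is only an illustration: `plan_build`, `BuildPlan`, and `bytes_per_entry` are hypothetical names, `bytes_per_entry` stands in for whatever the eventual f(threshold, doc_stats) yields per posting entry, and the sketch assumes maintenance_work_mem is a total budget shared by all workers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BuildPlan:
    threshold: int           # spill threshold to use, in posting entries
    workers: int             # number of parallel workers
    warning: Optional[str]   # set when the requested settings exceed the budget


def plan_build(threshold, workers, bytes_per_entry, maintenance_work_mem_bytes):
    """Warn and cap the threshold if the estimate exceeds the memory budget.

    Assumption: the budget is shared across workers, so the estimate is
    threshold * bytes_per_entry * workers.
    """
    estimated = threshold * bytes_per_entry * workers
    if estimated <= maintenance_work_mem_bytes:
        return BuildPlan(threshold, workers, None)

    warning = (
        f"estimated memtable usage {estimated} bytes exceeds "
        f"maintenance_work_mem ({maintenance_work_mem_bytes} bytes)"
    )
    # Largest threshold that keeps all workers inside the budget (at least 1).
    capped = max(1, maintenance_work_mem_bytes // (bytes_per_entry * workers))
    return BuildPlan(capped, workers, warning)
```

Reducing the worker count instead of capping the threshold would follow the same pattern, dividing the budget by `threshold * bytes_per_entry` to get the largest affordable number of workers.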
Notes
- Each TpPostingEntry is ~12 bytes (ItemPointerData + int32 frequency)
- Term dictionary adds overhead (string storage + hash table)
- Current rough estimate: 32M entries * 12 bytes ≈ 384MB minimum per memtable
- With 4 workers, that's ~1.5GB+ total
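The rough estimate above can be reproduced with a few lines of arithmetic. This is a lower bound only: it counts posting entries and ignores the term dictionary, and it assumes "32M" means 32 * 2^20 entries. The function name is hypothetical.

```python
POSTING_ENTRY_BYTES = 12                    # ~12 bytes per TpPostingEntry, per the note above
DEFAULT_SPILL_THRESHOLD = 32 * 1024 * 1024  # assumption: "32M" = 32 * 2^20 entries
MIB = 1024 * 1024


def posting_bytes_lower_bound(threshold=DEFAULT_SPILL_THRESHOLD, workers=1):
    """Bytes held by posting entries alone when every worker's memtable is full."""
    return threshold * POSTING_ENTRY_BYTES * workers


print(posting_bytes_lower_bound() // MIB)           # 384
print(posting_bytes_lower_bound(workers=4) // MIB)  # 1536
```

Real usage will be higher once string storage and hash table overhead are included, which is the gap the experiments are meant to quantify.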
Related Changes
This issue was created during parallel build improvements that changed spill logic from memory-based (maintenance_work_mem / workers) to count-based (tp_memtable_spill_threshold) for consistency between serial and parallel builds.