Estimate RAM consumption for memtable based on tp_memtable_spill_threshold #169

@tjgreen42

Description

Background

The tp_memtable_spill_threshold GUC controls when a memtable spills to a disk segment. The default is currently 32M posting entries.

In parallel builds, each worker has its own memtable, and each one can grow up to this threshold. Total memory usage therefore scales linearly with the number of workers.

Problem

We need experimental data to understand RAM consumption as a function of:

  • tp_memtable_spill_threshold value
  • Document characteristics (term count, term length distribution)
  • Number of parallel workers

Tasks

  1. Gather experimental evidence on real workloads (MS MARCO, Wikipedia, etc.)
  2. Develop a formula to estimate RAM usage: f(threshold, doc_stats) -> bytes
  3. Use this to:
    • Issue warnings if estimated usage exceeds maintenance_work_mem
    • Optionally cap tp_memtable_spill_threshold or reduce worker count
    • Document memory requirements for users
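As a starting point for tasks 2 and 3, the estimator could take roughly the shape sketched below. This is a placeholder model, not a measured formula. It uses the ~12 bytes per posting entry from the Notes; the `DocStats` fields, the 64-byte per-term hash overhead, and all function names are assumptions made for illustration. The real coefficients would have to come from the experiments in task 1.

```python
from dataclasses import dataclass

POSTING_ENTRY_BYTES = 12      # from the Notes: ItemPointerData + int32 frequency
HASH_OVERHEAD_PER_TERM = 64   # ASSUMPTION: placeholder until measured


@dataclass
class DocStats:
    """Hypothetical corpus summary; field names are illustrative only."""
    distinct_terms: int     # distinct terms expected in one full memtable
    avg_term_bytes: float   # mean term length in bytes


def estimate_memtable_bytes(threshold: int, stats: DocStats) -> int:
    """f(threshold, doc_stats) -> bytes for one memtable at spill time."""
    postings = threshold * POSTING_ENTRY_BYTES
    dictionary = stats.distinct_terms * (stats.avg_term_bytes + HASH_OVERHEAD_PER_TERM)
    return int(postings + dictionary)


def check_budget(threshold: int, stats: DocStats, workers: int,
                 maintenance_work_mem_bytes: int):
    """Return (estimated_total_bytes, exceeds_budget, capped_threshold)."""
    total = workers * estimate_memtable_bytes(threshold, stats)
    exceeds = total > maintenance_work_mem_bytes
    # Simplification: the cap only accounts for posting storage and ignores
    # the term dictionary, so it is an upper bound on a safe threshold.
    per_worker = maintenance_work_mem_bytes // workers
    capped = min(threshold, per_worker // POSTING_ENTRY_BYTES)
    return total, exceeds, capped
```

A caller could issue a warning when `exceeds_budget` is true and, optionally, apply `capped_threshold` or lower the worker count instead.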

Notes

  • Each TpPostingEntry is ~12 bytes (ItemPointerData + int32 frequency)
  • Term dictionary adds overhead (string storage + hash table)
  • Current rough estimate: 32M entries * 12 bytes ≈ 384MB minimum per memtable
  • With 4 workers, that's ~1.5GB+ total
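The arithmetic behind the last two notes can be reproduced with a few lines of Python. This covers posting entries only, so it is a lower bound that leaves out the term dictionary. The function name is illustrative, and "32M" is read here as 32 × 10^6; if the default is really 32 × 2^20 entries, the same figures apply in MiB rather than MB.

```python
POSTING_ENTRY_BYTES = 12  # approximate size of one TpPostingEntry, per the Notes


def posting_floor_bytes(threshold_entries: int, workers: int = 1) -> int:
    """Lower bound on memtable RAM: posting entries only, no term dictionary."""
    return threshold_entries * POSTING_ENTRY_BYTES * workers


default_threshold = 32_000_000  # ASSUMPTION: "32M" read as 32 * 10**6

print(posting_floor_bytes(default_threshold))             # 384000000  (~384 MB)
print(posting_floor_bytes(default_threshold, workers=4))  # 1536000000 (~1.5 GB)
```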

Related Changes

This issue was created during parallel build improvements that changed spill logic from memory-based (maintenance_work_mem / workers) to count-based (tp_memtable_spill_threshold) for consistency between serial and parallel builds.
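To make the difference between the two spill policies concrete, here is a minimal sketch of each as a predicate. It does not reflect the extension's actual code: the function names, the use of `>=`, and the byte-based inputs are assumptions chosen only to illustrate the description above.

```python
def should_spill_memory_based(memtable_bytes: int,
                              maintenance_work_mem_bytes: int,
                              workers: int) -> bool:
    """Old policy (as described): each worker gets an equal share of the budget."""
    return memtable_bytes >= maintenance_work_mem_bytes // workers


def should_spill_count_based(posting_entries: int, spill_threshold: int) -> bool:
    """New policy (as described): spill once the entry count reaches the threshold."""
    return posting_entries >= spill_threshold
```

Under these assumptions, a 300 MiB memtable with a 1 GiB budget and 4 workers would spill under the memory-based rule (its share is 256 MiB), while the count-based rule keeps it in memory as long as it holds fewer entries than the threshold, regardless of worker count. That gap is what the estimate requested in this issue is meant to quantify.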
