Skip to content

Add opt-in union-find chat history summarizer#4940

Draft
kimjune01 wants to merge 9 commits intoAider-AI:mainfrom
kimjune01:feat/topics-command
Draft

Add opt-in union-find chat history summarizer#4940
kimjune01 wants to merge 9 commits intoAider-AI:mainfrom
kimjune01:feat/topics-command

Conversation

@kimjune01
Copy link

@kimjune01 kimjune01 commented Mar 19, 2026

Part 1 of 2. This PR adds the backend. Part 2 adds /topics and /drop-topic commands on top of it. Split for reviewability — together they give users selective control over chat history.

Summary

  • Adds --chat-history-summarizer union-find flag, opt-in alternative to the default recursive summarizer
  • Groups messages into topic clusters by TF-IDF similarity, summarizes each cluster independently
  • Produces the same output format, uses the same model cascade, falls back to recursive if output exceeds budget
  • Without the flag, this PR is a no-op. Default behavior completely unchanged
  • Benchmarked as quality-equivalent (136 paired observations, McNemar p=0.248, 1.14× cost)
  • No new dependencies (pure Python TF-IDF embedder)

Test plan

  • 57 new tests passing (tests/basic/test_chat_summary_uf.py + tests/basic/test_smoke_uf.py)
  • 523 existing tests passing, zero regressions

Why

The current recursive summarizer compresses chat history into a single opaque text blob. Original messages are discarded. There's no provenance, and no way to selectively remove one stale topic without re-summarizing everything.

Union-find compaction groups messages into topic-coherent clusters and summarizes each one independently. Every summary traces back to its source messages through find(). Topics become addressable units you can inspect and drop.

This is the backend. User-facing commands (/topics, /drop-topic) come in a follow-up PR once the foundation is reviewed.

Background

The algorithm was prototyped against gemini-cli, where it showed a +8–18pp recall advantage over flat summarization across 7 trials (1 significant at p=0.039, rest directional). It was then ported to aider for validation against aider's stronger recursive baseline. Full experiment methodology, preregistration, and data are in the research repo.

What changes

New files (549 lines of production code, 689 lines of tests):

File Lines What it does
aider/context_window.py 328 Forest (union-find cluster store with stable ordering and weighted centroids) + ContextWindow (hot/cold zones with graduation and eviction)
aider/embedding_service.py 80 TFIDFEmbedder — pure Python, incremental vocabulary, no external dependencies
aider/cluster_summarizer.py 56 Per-cluster summarization via existing model cascade (simple_send_with_retries)
aider/chat_summary_uf.py 85 ChatSummaryUF(ChatSummary) — drop-in subclass with incremental feeding and mandatory fallback
tests/basic/test_chat_summary_uf.py 689 49 tests covering 12 areas (see table below)
tests/basic/test_smoke_uf.py 314 8 smoke tests with realistic Flask #1169 conversation data

Modified files (15 lines changed):

File Change
aider/args.py --chat-history-summarizer argument (default recursive, choices [recursive, union-find])
aider/main.py Conditional at summarizer construction — ChatSummaryUF when union-find, ChatSummary otherwise

What doesn't change

  • Default summarization (recursive, unchanged)
  • Threading model (summarize_start/summarize_worker/summarize_end)
  • Output format ([summary_msg, "Ok.", *hot_messages])
  • summarize_all() behavior (delegates to parent)
  • Existing commands (/clear, /drop, /tokens, /reset)
  • Existing tests (no modifications, all 523 still passing)

How it works

Messages flow through a hot zone and a cold forest:

  1. Each user/assistant message is TF-IDF-embedded and pushed to the hot zone
  2. When the hot zone exceeds graduate_at (26 messages), the oldest graduates to the forest
  3. Graduated messages merge with the nearest existing cluster if cosine similarity ≥ 0.15, or form a new singleton
  4. If cluster count exceeds max_cold_clusters (10), the closest pair is force-merged
  5. Dirty clusters (merged but not yet summarized) are summarized via the model cascade
  6. render() returns cold summaries + hot contents, formatted as [summary_msg, "Ok.", *hot_messages]

The overlap window (graduate_at=26, evict_at=30) gives resolve_dirty() time to summarize before eviction.

Safety

  1. Mandatory fallback. If the union-find result exceeds max_tokens or is ≥ input tokens, falls back to super().summarize() (recursive). Worst case, you get the current system.
  2. Stale-safety preserved. summarize_end() stale check works identically. The _fed_count mechanism triggers a full forest rebuild when done_messages changes (shrinks, is cleared, or is replaced by a previous summarization result).
  3. Same tokenizer. Inherits self.token_count = self.models[0].token_count from ChatSummary.__init__().
  4. Stable root ordering. roots() returns clusters in insertion order via a tracked _root_order list, ensuring deterministic render() output across calls.
  5. Weighted centroid averaging. union() weights centroids by cluster size (emb * size / total), preventing small clusters from distorting large ones over repeated merges.
  6. No new dependencies. TF-IDF embedder is pure Python. No numpy, no scipy, no API calls.

Test coverage

49 unit tests in tests/basic/test_chat_summary_uf.py, organized by area:

Area Tests What's verified
Forest mechanics 12 Insert, union, roots, compact, nearest_root, dirty tracking, path compression, dirty input collection
Stable root ordering 4 Insertion order preserved, order after merge, deterministic across calls, multiple merges
Weighted centroid 3 Equal-size midpoint, unequal-size weights toward larger, sparse dict weighting
Cluster summarization 4 First model success, fallback to second model, all fail raises ValueError, single model (not list)
Flag selection 3 Subclass relationship, ChatSummaryUF constructs correctly, default is ChatSummary (not UF)
Output format 2 Not-too-big returns unchanged, summary + "Ok." + hot_messages format
Fallback to recursive 2 Result exceeds max_tokens, not enough messages for graduation
summarize_all() parity 1 Delegates to parent, same output format
Stale discard + rebuild 2 _fed_count shrink triggers _init_context_window(), incremental feeding tracks correctly
Incremental feeding 2 Skips non-user/assistant roles, empty content not fed
Low token budget 1 Graceful fallback without crash
TF-IDF embedder 7 Sparse dict output, empty string, stopwords filtered, vocabulary growth, doc count, cosine similarity (high for similar, low for different)
ContextWindow integration 6 Append/render, hot count tracking, graduation to forest, force merge, cold+hot render, dirty resolution

8 smoke tests in tests/basic/test_smoke_uf.py with a distilled Flask #1169 conversation (3 topics: path traversal, Windows drive letters, file descriptor leak):

Test What's verified
Clusters form from realistic data Cold clusters and hot messages created from multi-topic conversation
Output format matches recursive [summary_msg, "Ok.", *hot_messages] structure preserved
Render produces cold and hot render() returns cluster summaries followed by hot contents
Distinct topics form separate clusters Messages about different topics cluster separately, not by timestamp
Fallback to recursive when inflated Low budget triggers graceful fallback
Stale rebuild after shrink Correct behavior when done_messages shrinks (summary applied)
System messages filtered System role messages not fed into forest
Deterministic render render() produces identical output on repeated calls

Benchmark

Tested on 17 real aider conversations (136 paired observation points where both backends triggered summarization):

Metric Value
McNemar's test p = 0.248 (no significant difference in recall)
Cost ratio 1.14× (union-find uses slightly more tokens due to per-cluster prompts)
Latency overhead Sub-millisecond (TF-IDF embedding + union-find operations; LLM calls dominate)

The union-find backend produces quality-equivalent summaries. The value is structured context: visibility into what the model remembers, and selective control over what it forgets.

Follow-up (not in this PR)

  • /topics — read-only command showing topic clusters with token counts and previews
  • /drop-topic N — selective topic removal with done_messages sync and threading guard
  • Cross-session topic persistence ([Feature] Chat history archive #4079)

Future paths

The new code is fully contained — 4 new files, no modifications to base_coder.py, no changes to the threading model, no new state that other systems depend on. The opt-in flag keeps all three options cheap:

  • Make it the default: Change default="recursive" to default="union-find" in args.py. One line.
  • Rip it out: Delete 4 files, remove 15 lines from args.py + main.py. Five minutes.
  • Keep as-is: Zero maintenance. The recursive path doesn't know the union-find path exists.

kimjune01 and others added 2 commits March 18, 2026 15:47
…ry-summarizer)

Port 4 modules from standalone implementation (145 existing tests):
- context_window.py: Forest (union-find clusters) + ContextWindow (hot/cold zones)
- embedding_service.py: Pure Python TF-IDF embedder (no new dependencies)
- cluster_summarizer.py: Per-cluster summarization via model cascade
- chat_summary_uf.py: ChatSummaryUF(ChatSummary) drop-in subclass

Integration: --chat-history-summarizer union-find flag in args.py,
conditional construction in main.py. Default unchanged (recursive).

Backend fixes applied during port:
- Stable root ordering via _root_order list (deterministic render output)
- Weighted centroid averaging in union() (prevents cluster identity distortion)

Safety: mandatory fallback to recursive if output exceeds budget, stale-safety
preserved via _fed_count mechanism, same tokenizer and model cascade.

49 new tests covering 12 areas. 523 existing tests unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
8 tests with realistic 3-topic conversation (path traversal, Windows drive
letters, file descriptor leak): cluster formation, output format, cold+hot
render, distinct topic clustering, recursive fallback, stale rebuild,
system message filtering, deterministic rendering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@CLAassistant
Copy link

CLAassistant commented Mar 19, 2026

CLA assistant check
All committers have signed the CLA.

kimjune01 and others added 2 commits March 18, 2026 19:18
- Fix hot tail mismatch: track fed message indices so hot_count maps back
  to correct original messages even with system/tool messages interspersed
- Fix unbounded _hot growth: trim graduated entries after each append
- Fix empty embedding drop in union(): preserve non-empty side
- Remove unused Forest.is_dirty() and Forest.dirty_inputs()
- Remove list-vector branches from _cosine_similarity and union()
  (all embeddings are sparse dicts from TFIDFEmbedder)
- Add tests: mixed-role preservation (2), memory bounds (2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- resolve_dirty() failure now falls back to recursive instead of crashing
- Remove _maybe_evict() and evict_at parameter — dead code since
  _maybe_graduate() already keeps hot zone at <= graduate_at
- Update all tests to remove evict_at references

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Non-root members' _content and _embedding are only needed before merge
(for dirty input collection and centroid computation). After union(),
only the root's centroid and summary matter. Without cleanup, _content
grows unbounded in long sessions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kimjune01 kimjune01 marked this pull request as draft March 19, 2026 03:42
kimjune01 and others added 4 commits March 18, 2026 20:42
Co-authored-by: aider (claude-sonnet-4-5) <aider@aider.chat>
Control-loop bug: summarization triggers on tokens (too_big) but
graduation triggers on message count (graduate_at=26). Token budget
fires first, no cold clusters exist, falls back to recursive every
time. Union-find path was unreachable in real usage.

Fix: when summarize() runs with no cold clusters and >4 hot messages,
force_graduate() moves the oldest half to the cold forest. This breaks
the deadlock and lets clusters form before rendering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants