Add opt-in union-find chat history summarizer by kimjune01 · Pull Request #4940 · Aider-AI/aider

kimjune01 · 2026-03-19T02:05:12Z

Part 1 of 2. This PR adds the backend. Part 2 adds /topics and /drop-topic commands on top of it. Split for reviewability — together they give users selective control over chat history.

Summary

Adds --chat-history-summarizer union-find flag, opt-in alternative to the default recursive summarizer
Groups messages into topic clusters by TF-IDF similarity, summarizes each cluster independently
Produces the same output format, uses the same model cascade, falls back to recursive if output exceeds budget
Without the flag, this PR is a no-op. Default behavior completely unchanged
Benchmarked as quality-equivalent (136 paired observations, McNemar p=0.248, 1.14× cost)
No new dependencies (pure Python TF-IDF embedder)

Test plan

57 new tests passing (tests/basic/test_chat_summary_uf.py + tests/basic/test_smoke_uf.py)
523 existing tests passing, zero regressions

Why

The current recursive summarizer compresses chat history into a single opaque text blob. Original messages are discarded. There's no provenance, and no way to selectively remove one stale topic without re-summarizing everything.

More control over chat history #3607 — selective history control ("I want to drop the debugging context but keep the refactor decisions")
Editing history in realtime #2219 — see/edit context ("I want to see what the model's context actually contains")
[FeatReq] Ask the user to remove some files or to clear the chat history when the AI context is full instead of giving error #948 — token breakdown with actions ("show me what's consuming tokens and let me act on it")

Union-find compaction groups messages into topic-coherent clusters and summarizes each one independently. Every summary traces back to its source messages through find(). Topics become addressable units you can inspect and drop.

This is the backend. User-facing commands (/topics, /drop-topic) come in a follow-up PR once the foundation is reviewed.

Background

The algorithm was prototyped against gemini-cli, where it showed a +8–18pp recall advantage over flat summarization across 7 trials (1 significant at p=0.039, rest directional). It was then ported to aider for validation against aider's stronger recursive baseline. Full experiment methodology, preregistration, and data are in the research repo.

What changes

New files (549 lines of production code, 689 lines of tests):

File	Lines	What it does
`aider/context_window.py`	328	`Forest` (union-find cluster store with stable ordering and weighted centroids) + `ContextWindow` (hot/cold zones with graduation and eviction)
`aider/embedding_service.py`	80	`TFIDFEmbedder` — pure Python, incremental vocabulary, no external dependencies
`aider/cluster_summarizer.py`	56	Per-cluster summarization via existing model cascade (`simple_send_with_retries`)
`aider/chat_summary_uf.py`	85	`ChatSummaryUF(ChatSummary)` — drop-in subclass with incremental feeding and mandatory fallback
`tests/basic/test_chat_summary_uf.py`	689	49 tests covering 12 areas (see table below)
`tests/basic/test_smoke_uf.py`	314	8 smoke tests with realistic Flask #1169 conversation data

Modified files (15 lines changed):

File	Change
`aider/args.py`	`--chat-history-summarizer` argument (default `recursive`, choices `[recursive, union-find]`)
`aider/main.py`	Conditional at summarizer construction — `ChatSummaryUF` when `union-find`, `ChatSummary` otherwise

What doesn't change

Default summarization (recursive, unchanged)
Threading model (summarize_start/summarize_worker/summarize_end)
Output format ([summary_msg, "Ok.", *hot_messages])
summarize_all() behavior (delegates to parent)
Existing commands (/clear, /drop, /tokens, /reset)
Existing tests (no modifications, all 523 still passing)

How it works

Messages flow through a hot zone and a cold forest:

Each user/assistant message is TF-IDF-embedded and pushed to the hot zone
When the hot zone exceeds graduate_at (26 messages), the oldest graduates to the forest
Graduated messages merge with the nearest existing cluster if cosine similarity ≥ 0.15, or form a new singleton
If cluster count exceeds max_cold_clusters (10), the closest pair is force-merged
Dirty clusters (merged but not yet summarized) are summarized via the model cascade
render() returns cold summaries + hot contents, formatted as [summary_msg, "Ok.", *hot_messages]

The overlap window (graduate_at=26, evict_at=30) gives resolve_dirty() time to summarize before eviction.

Safety

Mandatory fallback. If the union-find result exceeds max_tokens or is ≥ input tokens, falls back to super().summarize() (recursive). Worst case, you get the current system.
Stale-safety preserved. summarize_end() stale check works identically. The _fed_count mechanism triggers a full forest rebuild when done_messages changes (shrinks, is cleared, or is replaced by a previous summarization result).
Same tokenizer. Inherits self.token_count = self.models[0].token_count from ChatSummary.__init__().
Stable root ordering. roots() returns clusters in insertion order via a tracked _root_order list, ensuring deterministic render() output across calls.
Weighted centroid averaging. union() weights centroids by cluster size (emb * size / total), preventing small clusters from distorting large ones over repeated merges.
No new dependencies. TF-IDF embedder is pure Python. No numpy, no scipy, no API calls.

Test coverage

49 unit tests in tests/basic/test_chat_summary_uf.py, organized by area:

Area	Tests	What's verified
Forest mechanics	12	Insert, union, roots, compact, nearest_root, dirty tracking, path compression, dirty input collection
Stable root ordering	4	Insertion order preserved, order after merge, deterministic across calls, multiple merges
Weighted centroid	3	Equal-size midpoint, unequal-size weights toward larger, sparse dict weighting
Cluster summarization	4	First model success, fallback to second model, all fail raises ValueError, single model (not list)
Flag selection	3	Subclass relationship, ChatSummaryUF constructs correctly, default is ChatSummary (not UF)
Output format	2	Not-too-big returns unchanged, summary + "Ok." + hot_messages format
Fallback to recursive	2	Result exceeds max_tokens, not enough messages for graduation
`summarize_all()` parity	1	Delegates to parent, same output format
Stale discard + rebuild	2	`_fed_count` shrink triggers `_init_context_window()`, incremental feeding tracks correctly
Incremental feeding	2	Skips non-user/assistant roles, empty content not fed
Low token budget	1	Graceful fallback without crash
TF-IDF embedder	7	Sparse dict output, empty string, stopwords filtered, vocabulary growth, doc count, cosine similarity (high for similar, low for different)
ContextWindow integration	6	Append/render, hot count tracking, graduation to forest, force merge, cold+hot render, dirty resolution

8 smoke tests in tests/basic/test_smoke_uf.py with a distilled Flask #1169 conversation (3 topics: path traversal, Windows drive letters, file descriptor leak):

Test	What's verified
Clusters form from realistic data	Cold clusters and hot messages created from multi-topic conversation
Output format matches recursive	`[summary_msg, "Ok.", *hot_messages]` structure preserved
Render produces cold and hot	`render()` returns cluster summaries followed by hot contents
Distinct topics form separate clusters	Messages about different topics cluster separately, not by timestamp
Fallback to recursive when inflated	Low budget triggers graceful fallback
Stale rebuild after shrink	Correct behavior when done_messages shrinks (summary applied)
System messages filtered	System role messages not fed into forest
Deterministic render	`render()` produces identical output on repeated calls

Benchmark

Tested on 17 real aider conversations (136 paired observation points where both backends triggered summarization):

Metric	Value
McNemar's test	p = 0.248 (no significant difference in recall)
Cost ratio	1.14× (union-find uses slightly more tokens due to per-cluster prompts)
Latency overhead	Sub-millisecond (TF-IDF embedding + union-find operations; LLM calls dominate)

The union-find backend produces quality-equivalent summaries. The value is structured context: visibility into what the model remembers, and selective control over what it forgets.

Follow-up (not in this PR)

/topics — read-only command showing topic clusters with token counts and previews
/drop-topic N — selective topic removal with done_messages sync and threading guard
Cross-session topic persistence ([Feature] Chat history archive #4079)

Future paths

The new code is fully contained — 4 new files, no modifications to base_coder.py, no changes to the threading model, no new state that other systems depend on. The opt-in flag keeps all three options cheap:

Make it the default: Change default="recursive" to default="union-find" in args.py. One line.
Rip it out: Delete 4 files, remove 15 lines from args.py + main.py. Five minutes.
Keep as-is: Zero maintenance. The recursive path doesn't know the union-find path exists.

…ry-summarizer) Port 4 modules from standalone implementation (145 existing tests): - context_window.py: Forest (union-find clusters) + ContextWindow (hot/cold zones) - embedding_service.py: Pure Python TF-IDF embedder (no new dependencies) - cluster_summarizer.py: Per-cluster summarization via model cascade - chat_summary_uf.py: ChatSummaryUF(ChatSummary) drop-in subclass Integration: --chat-history-summarizer union-find flag in args.py, conditional construction in main.py. Default unchanged (recursive). Backend fixes applied during port: - Stable root ordering via _root_order list (deterministic render output) - Weighted centroid averaging in union() (prevents cluster identity distortion) Safety: mandatory fallback to recursive if output exceeds budget, stale-safety preserved via _fed_count mechanism, same tokenizer and model cascade. 49 new tests covering 12 areas. 523 existing tests unaffected. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

8 tests with realistic 3-topic conversation (path traversal, Windows drive letters, file descriptor leak): cluster formation, output format, cold+hot render, distinct topic clustering, recursive fallback, stale rebuild, system message filtering, deterministic rendering. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CLAassistant · 2026-03-19T02:05:22Z

All committers have signed the CLA.

- Fix hot tail mismatch: track fed message indices so hot_count maps back to correct original messages even with system/tool messages interspersed - Fix unbounded _hot growth: trim graduated entries after each append - Fix empty embedding drop in union(): preserve non-empty side - Remove unused Forest.is_dirty() and Forest.dirty_inputs() - Remove list-vector branches from _cosine_similarity and union() (all embeddings are sparse dicts from TFIDFEmbedder) - Add tests: mixed-role preservation (2), memory bounds (2) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- resolve_dirty() failure now falls back to recursive instead of crashing - Remove _maybe_evict() and evict_at parameter — dead code since _maybe_graduate() already keeps hot zone at <= graduate_at - Update all tests to remove evict_at references Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Non-root members' _content and _embedding are only needed before merge (for dirty input collection and centroid computation). After union(), only the root's centroid and summary matter. Without cleanup, _content grows unbounded in long sessions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-authored-by: aider (claude-sonnet-4-5) <aider@aider.chat>

…istencies" This reverts commit 7ddc0a2.

Control-loop bug: summarization triggers on tokens (too_big) but graduation triggers on message count (graduate_at=26). Token budget fires first, no cold clusters exist, falls back to recursive every time. Union-find path was unreachable in real usage. Fix: when summarize() runs with no cold clusters and >4 hot messages, force_graduate() moves the oldest half to the cold forest. This breaks the deadlock and lets clusters form before rendering. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

kimjune01 and others added 2 commits March 18, 2026 15:47

kimjune01 and others added 2 commits March 18, 2026 19:18

kimjune01 marked this pull request as draft March 19, 2026 03:42

kimjune01 and others added 4 commits March 18, 2026 20:42

fix: resolve union-find memory leak and embedding type inconsistencies

7ddc0a2

Co-authored-by: aider (claude-sonnet-4-5) <aider@aider.chat>

Revert "fix: resolve union-find memory leak and embedding type incons…

816c46c

…istencies" This reverts commit 7ddc0a2.

fix: token-aware graduation + integration tests with real tokenizer

a3a0606

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add opt-in union-find chat history summarizer#4940

Add opt-in union-find chat history summarizer#4940
kimjune01 wants to merge 9 commits intoAider-AI:mainfrom
kimjune01:feat/topics-command

kimjune01 commented Mar 19, 2026 •

edited

Loading

Uh oh!

CLAassistant commented Mar 19, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kimjune01 commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Why

Background

What changes

What doesn't change

How it works

Safety

Test coverage

Benchmark

Follow-up (not in this PR)

Future paths

Uh oh!

CLAassistant commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kimjune01 commented Mar 19, 2026 •

edited

Loading

CLAassistant commented Mar 19, 2026 •

edited

Loading