Implementation - Layer 5: Chat History Summary #3

mrhillsman (Maintainer) · 2025-06-12T17:20:02Z

Balancing Active Context vs. Summarization: The Moving Window Strategy

The Core Trade-off

The decision of when to summarize is essentially asking: "When does the cost of maintaining full context exceed the value of perfect recall?"

Key Considerations:

Token limits and costs
Information loss from summarization
Relevance decay over time
Task continuity requirements
Real-time performance

Here's a practical framework based on the CWA principles:

Dynamic Summarization Triggers

from datetime import timedelta

class ContextWindowManager:
    def __init__(self):
        # Thresholds based on empirical performance data
        self.token_soft_limit = 50_000  # ~50% of typical 100k window
        self.token_hard_limit = 80_000  # ~80% - must summarize
        self.message_age_limit = timedelta(hours=2)
        self.topic_shift_threshold = 0.7  # Cosine similarity
        
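    def should_summarize(self, context_state):
        """Multi-factor decision for summarization.

        check_message_age() and detect_topic_change() are helper methods
        not shown here; each is expected to return a bool.
        """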
        
        triggers = {
            "token_pressure": context_state.total_tokens > self.token_soft_limit,
            "age_decay": self.check_message_age(context_state),
            "topic_shift": self.detect_topic_change(context_state),
            "task_completion": context_state.current_task_completed,
            "explicit_checkpoint": context_state.user_requested_summary
        }
        
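        # Weighted scoring: each trigger that fired (True counts as 1) adds its weight.
        # No single weight exceeds 0.5, so at least two triggers must fire together.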
        trigger_weights = {
            "token_pressure": 0.4,
            "age_decay": 0.2,
            "topic_shift": 0.25,
            "task_completion": 0.1,
            "explicit_checkpoint": 0.05
        }
        
        score = sum(
            triggers[k] * trigger_weights[k] 
            for k in triggers
        )
        
        return score > 0.5 or context_state.total_tokens > self.token_hard_limit
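
To see how the weights above interact, here is a small standalone sketch of the same scoring rule. The weights are copied from the class; `trigger_score` and `crosses_threshold` are hypothetical helper names used only for illustration.

```python
# Weights copied from ContextWindowManager.should_summarize above.
TRIGGER_WEIGHTS = {
    "token_pressure": 0.4,
    "age_decay": 0.2,
    "topic_shift": 0.25,
    "task_completion": 0.1,
    "explicit_checkpoint": 0.05,
}


def trigger_score(fired):
    """Sum the weights of the triggers whose names appear in `fired`."""
    return sum(w for name, w in TRIGGER_WEIGHTS.items() if name in fired)


def crosses_threshold(fired, threshold=0.5):
    """True when the combined weight is strictly above the threshold."""
    return trigger_score(fired) > threshold
```

Token pressure alone scores 0.4 and does not trigger a summary, while token pressure plus a topic shift scores 0.65 and does. An explicit user request on its own scores only 0.05, so a deployment may want to treat that trigger as an override rather than a weighted input.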

Real-World Example: AI Assistant Deployment

Timeline of a 4-Hour Architecture Session

9:00 AM - Session Start

context_state = {
    "tokens": 0,
    "messages": [],
    "active_task": "Design microservices architecture",
    "participants": ["architect_jane", "dev_bob", "dev_alice"]
}

9:00-10:30 AM - Initial Discussion (~25,000 tokens)

Discussing high-level architecture
Exploring different patterns
Decision: Keep everything in active context
Why: Active debate, constant back-references, building shared understanding

10:30 AM - First Summarization Trigger

# Topic shift detected: Moving from "patterns" to "specific service design"
summary_1 = {
    "time_range": "9:00-10:30",
    "key_decisions": [
        "Use event-driven architecture",
        "Reject synchronous REST for inter-service comm"
    ],
    "context_preserved": [
        "Event sourcing pattern details",
        "Rejected alternatives and reasons"
    ],
    "tokens_before": 25_000,
    "tokens_after": 3_000,  # 88% reduction
}

# Active context now contains:
# - Summary reference (3,000 tokens)
# - Last 10 messages for continuity (2,000 tokens)
# - Current topic messages (starting fresh)
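
The post-summarization context described above (one summary entry plus the last 10 messages) could be assembled roughly like this. It is a minimal sketch: the `Message` type and `rebuild_active_context` name are assumptions, not part of the original design.

```python
from dataclasses import dataclass


@dataclass
class Message:
    author: str
    text: str


def rebuild_active_context(messages, summary_text, keep_last=10):
    """Return the new active context: a summary entry plus the recent tail."""
    # Guard keep_last == 0, because messages[-0:] would return every message.
    tail = messages[-keep_last:] if keep_last > 0 else []
    summary_entry = Message(author="system", text=f"[Summary] {summary_text}")
    return [summary_entry, *tail]
```

New messages on the current topic are then appended after the retained tail.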

10:30 AM-12:00 PM - Service Design (~40,000 new tokens)

Designing individual services
API contracts
Critical moment at 11:45 AM: Someone asks "Why didn't we use REST again?"
System retrieves from summary: "REST rejected due to coupling concerns at 10:15 AM"
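
How the system finds the right decision in the summary is not specified. One naive possibility is keyword overlap between the question and the stored `key_decisions`; a production system would more likely use embeddings. The stopword list and function names below are illustrative assumptions.

```python
import re

# Tiny illustrative stopword list; a real one would be much larger.
STOPWORDS = {"why", "didn", "t", "we", "use", "again", "the", "a", "for", "to"}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lookup_decisions(question, decisions, min_overlap=1):
    """Return the decisions sharing words with the question, best match first."""
    query_words = tokenize(question) - STOPWORDS
    scored = []
    for decision in decisions:
        overlap = len(query_words & tokenize(decision))
        if overlap >= min_overlap:
            scored.append((overlap, decision))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [decision for _, decision in scored]
```

With the `key_decisions` from `summary_1`, the question "Why didn't we use REST again?" matches only the entry about rejecting synchronous REST.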

12:00 PM - Task Completion Trigger

# Major milestone completed
summary_2 = {
    "combines": ["summary_1", "current_context"],
    "hierarchical_summary": {
        "executive": "Designed event-driven microservices with 5 core services",
        "decisions": [...],  # All major decisions
        "technical_details": {
            "preserved_in_full": ["API contracts", "Event schemas"],
            "summarized": ["Discussion threads", "Alternative explorations"]
        }
    }
}

The Sliding Window Pattern

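from datetime import datetime, timedelta

class SlidingContextWindow:
    def __init__(self):
        self.context = []  # message groups, each with .timestamp and .content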
        self.window_stages = {
            "hot": {  # Full verbatim in context
                "max_tokens": 20_000,
                "max_age": timedelta(minutes=30),
                "description": "Active discussion"
            },
            "warm": {  # Compressed but detailed
                "max_tokens": 10_000,
                "max_age": timedelta(hours=2),
                "compression_ratio": 3,
                "description": "Recent context with key details"
            },
            "cool": {  # Highly summarized
                "max_tokens": 5_000,
                "max_age": timedelta(hours=8),
                "compression_ratio": 10,
                "description": "Session highlights only"
            },
            "cold": {  # Moved to Layer 5 (Summary storage)
                "storage": "database",
                "description": "Searchable long-term memory"
            }
        }
    
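    def age_context(self):
        """Progressive compression as context ages"""
        for message_group in self.context:
            age = datetime.now() - message_group.timestamp
            
            # Past the hot window's max age and not yet compressed:
            # compress from hot to warm. `stage` is an assumed attribute
            # that prevents re-compressing the same group on every call.
            # self.compress() is a summarization helper not shown here.
            already_warm = getattr(message_group, "stage", "hot") != "hot"
            if age > self.window_stages["hot"]["max_age"] and not already_warm:
                message_group.content = self.compress(
                    message_group.content,
                    ratio=self.window_stages["warm"]["compression_ratio"],
                    preserve=["decisions", "code", "key_terms"]
                )
                message_group.stage = "warm"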
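
The `max_age` values in `window_stages` imply a mapping from message age to stage. A minimal sketch of that mapping, assuming each limit is an inclusive upper bound; `stage_for_age` is a hypothetical helper name.

```python
from datetime import timedelta

# Age limits copied from window_stages above; "cold" has no upper bound.
STAGE_LIMITS = [
    ("hot", timedelta(minutes=30)),
    ("warm", timedelta(hours=2)),
    ("cool", timedelta(hours=8)),
]


def stage_for_age(age):
    """Return the first stage whose max age still covers `age`."""
    for name, max_age in STAGE_LIMITS:
        if age <= max_age:
            return name
    return "cold"
```

A message group that is 45 minutes old maps to "warm", and one that is 9 hours old maps to "cold".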

Intelligent Summarization Strategies

1. Task-Aware Summarization

def summarize_by_task_phase(messages, task_context):
    if task_context.phase == "exploration":
        # Preserve alternative paths and reasoning
        return {
            "summary_type": "exploratory",
            "preserve_ratio": 0.4,  # Keep 40% of content
            "focus": ["alternatives", "pros_cons", "questions"]
        }
    
    elif task_context.phase == "implementation":
        # Focus on decisions and technical details
        return {
            "summary_type": "technical",
            "preserve_ratio": 0.2,  # Keep 20% of content
            "focus": ["decisions", "code", "configurations"],
            "can_discard": ["small_talk", "resolved_questions"]
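        }

    # Unknown phase: fail loudly instead of silently returning None
    raise ValueError(f"Unsupported task phase: {task_context.phase!r}")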
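
One way to read `preserve_ratio` is as a target size for the summary relative to the source. The sketch below turns a ratio into a token budget; the `floor` parameter and the function name are assumptions added for illustration.

```python
def summary_token_budget(source_tokens, preserve_ratio, floor=200):
    """Target summary size in tokens for a given preserve ratio.

    The floor keeps very short inputs from being squeezed to nothing,
    and the result never exceeds the source size.
    """
    if not 0 < preserve_ratio <= 1:
        raise ValueError("preserve_ratio must be in (0, 1]")
    target = round(source_tokens * preserve_ratio)
    return min(source_tokens, max(floor, target))
```

Under this reading, a 25,000-token exploration phase at a ratio of 0.4 would be summarized to roughly 10,000 tokens, and a 40,000-token implementation phase at 0.2 to roughly 8,000.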

2. Importance Scoring

import re

def score_message_importance(message, context):
    # message is assumed to expose .text, .id, and .author_role
    importance = 0.0
    text = message.text
    
    # Decision indicators
    if any(phrase in text.lower() for phrase in 
           ["decided to", "we'll go with", "final choice"]):
        importance += 0.3
    
    # Contains code or links (e.g. to technical specs)
    if "```" in text or re.search(r'https?://', text):
        importance += 0.2
    
    # Referenced later
    if message.id in context.back_references:
        importance += 0.4
    
    # From key stakeholder
    if message.author_role in ["architect", "team_lead"]:
        importance += 0.1
    
    return importance
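
Once messages have importance scores, one plausible use is to decide which messages stay verbatim within a token budget. This is a greedy sketch, not an optimal packing; the tuple layout and function name are assumptions.

```python
def select_for_retention(scored_messages, token_budget):
    """Pick messages to keep verbatim, highest importance first.

    scored_messages: list of (message_id, importance, tokens) tuples
    in chronological order. Returns the kept ids in chronological order.
    """
    kept_ids = set()
    used = 0
    # sorted() is stable, so equally important messages keep their order.
    for msg_id, _importance, tokens in sorted(
        scored_messages, key=lambda m: m[1], reverse=True
    ):
        if used + tokens <= token_budget:
            kept_ids.add(msg_id)
            used += tokens
    return [m[0] for m in scored_messages if m[0] in kept_ids]
```

Messages that do not fit the budget would then be handed to the summarizer instead of being dropped.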

Practical Guidelines

Keep in Active Context When:

Active task ongoing - Summarizing mid-task loses critical nuance
Frequent back-references - Users saying "as I mentioned earlier"
Technical debugging - Full error messages and stack traces needed
Rapid iteration - Quick back-and-forth under 30 minutes
Token budget available - Under 50% of context window

Trigger Summarization When:

Topic shift - Moving to unrelated subject (similarity < 0.7)
Natural break points - Task completed, break time, participant change
Token pressure - Approaching 70-80% of context limit
Time decay - Messages older than 2 hours with no references
Explicit checkpoint - User says "let's summarize what we've decided"
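
The topic-shift trigger in the list above can be sketched with plain cosine similarity. This assumes every message already has an embedding vector (how those are produced is out of scope here) and that the new message is compared against the average of recent messages, which is one choice among several.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Component-wise average of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def detect_topic_change(recent_embeddings, new_embedding, threshold=0.7):
    """True when the new message is dissimilar to the recent discussion."""
    if not recent_embeddings:
        return False  # nothing to compare against yet
    centroid = mean_vector(recent_embeddings)
    return cosine_similarity(centroid, new_embedding) < threshold
```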

Generic Strategy (call it the "2-Hour Rule"; a reasonable baseline and therefore the recommended starting point)

< 30 minutes: Almost never summarize (unless token pressure)
30-120 minutes: Summarize at natural breakpoints
> 2 hours: Aggressively summarize older portions
> 8 hours: Move to cold storage (Layer 5)
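
The four bands above translate almost directly into a decision function. In this sketch the returned action names are made up for illustration, and treating token pressure as a trigger in the 30-120 minute band is an assumption the list does not state.

```python
from datetime import timedelta


def two_hour_rule(age, token_pressure=False, at_breakpoint=False):
    """Map the age of a context segment to a suggested action."""
    if age > timedelta(hours=8):
        return "move_to_cold_storage"
    if age > timedelta(hours=2):
        return "summarize_aggressively"
    if age >= timedelta(minutes=30):
        # Assumption: token pressure also justifies summarizing here.
        return "summarize" if (at_breakpoint or token_pressure) else "keep"
    return "summarize" if token_pressure else "keep"
```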

Anti-Pattern to Avoid

# DON'T DO THIS - Aggressive immediate summarization
def bad_approach(message):
    if len(context.messages) > 10:
        # Loses critical nuance and frustrates users
        context.messages = [summarize(context.messages)]
        context.add(message)
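
A gentler alternative to the anti-pattern above is to summarize only when a token limit is exceeded, and even then to leave the most recent messages untouched. In this sketch `count_tokens` and `summarize` are callables supplied by the caller; both are hypothetical stand-ins for a real tokenizer and summarizer.

```python
def compact_if_needed(messages, count_tokens, summarize, token_limit, keep_recent=10):
    """Summarize only the older messages, and only when over the token limit."""
    if keep_recent <= 0:
        raise ValueError("keep_recent must be positive")
    total = sum(count_tokens(m) for m in messages)
    if total <= token_limit or len(messages) <= keep_recent:
        return messages  # under budget, or too few messages to split
    older = messages[:-keep_recent]
    recent = messages[-keep_recent:]
    return [summarize(older), *recent]
```

The recent tail stays verbatim, so users can still refer back to what was just said.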

Metrics to gauge success?

User satisfaction: A/B testing of user preference for sliding window vs. fixed history
Context retrieval accuracy: x% of relevant information preserved in summaries
Token efficiency: x% reduction in token usage with <x% information loss
Task continuity: x% of tasks completed without context-related interruptions
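
Two of these metrics are straightforward to compute. The sketch below assumes "relevant information" is represented as a list of short fact strings and checks for them by case-insensitive substring match, which is a crude proxy for real retrieval evaluation.

```python
def token_reduction(tokens_before, tokens_after):
    """Fraction of tokens saved by summarization (0.88 means 88% fewer)."""
    if tokens_before <= 0:
        raise ValueError("tokens_before must be positive")
    return 1 - tokens_after / tokens_before


def retrieval_accuracy(relevant_facts, summary_text):
    """Fraction of the relevant facts that still appear in the summary."""
    if not relevant_facts:
        raise ValueError("need at least one relevant fact")
    haystack = summary_text.lower()
    found = sum(1 for fact in relevant_facts if fact.lower() in haystack)
    return found / len(relevant_facts)
```

For the first summary in the timeline, `token_reduction(25_000, 3_000)` gives the 88% figure quoted there.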

The key insight: Summarization is not about age or size alone, but about information value decay relative to the current conversation state. Rather than always applying one fixed strategy, the best system adapts its summarization strategy to task type, conversation dynamics, and user behavior patterns.
