Conversation

@enyst (Collaborator) commented Oct 26, 2025

Summary
The current LLMSummarizingCondenser triggers condensation purely by event count (max_size). This causes premature condensation for large-context models (e.g., GPT-5, Gemini), especially on hard tasks where the agent reads many files first. The agent can condense multiple times before it writes any code, forget its objectives, and loop back to earlier stages.

What’s happening

  • Condensation is triggered whenever len(view) > max_size (default 120). This ignores the actual tokenized prompt size and the model's context window (a simplified sketch of the current check follows this list).
  • In practice, large-context models could continue without condensing because the prompt token count is still within budget, yet we condense early and lose useful context.
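
For reference, the current trigger is roughly equivalent to the following (simplified sketch, not the exact source):

def should_condense(self, view) -> bool:
    # Condense purely on event count, regardless of how many tokens
    # those events occupy in the prompt.
    return len(view) > self.max_size  # max_size defaults to 120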

Changes

  • LLMSummarizingCondenser is now token-budget aware when llm.max_input_tokens is available:
    • should_condense computes token usage via LLMConvertibleEvent.events_to_messages + llm.get_token_count and compares it to a budget: max_input_tokens - max_output_tokens - headroom.
    • headroom = token_margin_ratio * max_input_tokens (default 0.1), leaving buffer for the response and metadata.
    • get_condensation uses a binary search to keep as much tail context as fits under the token budget while preserving keep_first events at the head.
    • If limits are unknown or token counting fails, behavior falls back to the original event-count logic, so backward compatibility is preserved (a sketch of the check follows this list).
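
A minimal sketch of the token-budget check described above, assuming the names used in this description (llm.max_input_tokens, llm.max_output_tokens, llm.get_token_count, LLMConvertibleEvent.events_to_messages, token_margin_ratio, max_size); the actual diff may differ in details:

def should_condense(self, view) -> bool:
    max_input = getattr(self.llm, "max_input_tokens", None)
    if max_input:
        try:
            max_output = getattr(self.llm, "max_output_tokens", None) or 0
            # Reserve headroom for the response and prompt metadata.
            headroom = int(self.token_margin_ratio * max_input)  # default ratio: 0.1
            budget = max_input - max_output - headroom

            messages = LLMConvertibleEvent.events_to_messages(view.events)
            total_tokens = self.llm.get_token_count(messages)
            return total_tokens > budget
        except Exception:
            # Token counting failed; fall through to the event-count heuristic.
            pass
    # Original behavior when model limits are unknown.
    return len(view) > self.max_size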

Why this helps

  • Reduces premature condensation for large-context models
  • Preserves more recent, relevant context while maintaining safe headroom
  • Leaves behavior unchanged when model limits are unknown

Testing

  • Added a focused unit test for token-budget behavior (mocking token counts)
  • Existing condenser tests continue to pass

Open questions

  • Is 10% headroom a sensible default across providers? Should it be model-specific?
  • Do we want to expose token_margin_ratio widely in config/presets?

Co-authored-by: openhands <[email protected]>



Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant  Base Image
golang   golang:1.21-bookworm
java     eclipse-temurin:17-jdk
python   nikolaik/python-nodejs:python3.12-nodejs22

Pull (multi-arch manifest)

docker pull ghcr.io/openhands/agent-server:44dba94-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-44dba94-python \
  ghcr.io/openhands/agent-server:44dba94-python

All tags pushed for this build

ghcr.io/openhands/agent-server:44dba94-golang
ghcr.io/openhands/agent-server:v1.0.0a4_golang_tag_1.21-bookworm_binary
ghcr.io/openhands/agent-server:44dba94-java
ghcr.io/openhands/agent-server:v1.0.0a4_eclipse-temurin_tag_17-jdk_binary
ghcr.io/openhands/agent-server:44dba94-python
ghcr.io/openhands/agent-server:v1.0.0a4_nikolaik_s_python-nodejs_tag_python3.12-nodejs22_binary

The 44dba94 tag is a multi-arch manifest (amd64/arm64); your client pulls the right arch automatically.

…nput_tokens

- Add token-aware should_condense that compares tokenized messages against a budget derived from llm.max_input_tokens, llm.max_output_tokens, and a configurable token_margin_ratio
- Choose tail size via binary search to keep as much recent context as fits, falling back to event-count heuristic when limits are unknown
- Preserve backward compatibility; default event-count behavior remains when model limits are absent

Co-authored-by: openhands <[email protected]>
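
The binary search over the tail mentioned in the commit message above could be shaped roughly like the helper below. This is an illustrative sketch: the helper name is hypothetical, it ignores the tokens consumed by the summary message itself, and it assumes the same llm/keep_first attributes as in the description.

def _largest_tail_within_budget(self, view, budget: int) -> int:
    # Largest number of most-recent events (after the keep_first head) whose
    # tokenized messages, together with the head, still fit under the budget.
    head = view.events[: self.keep_first]
    tail = view.events[self.keep_first:]
    lo, hi = 0, len(tail)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        candidate = head + tail[len(tail) - mid:]
        messages = LLMConvertibleEvent.events_to_messages(candidate)
        if self.llm.get_token_count(messages) <= budget:
            lo = mid  # the mid most-recent events fit; try keeping more
        else:
            hi = mid - 1
    return lo

The search relies on token counts growing monotonically as more events are kept, which holds as long as each event contributes a non-negative number of tokens.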
if max_input:
    # Build messages for token counting
    messages = LLMConvertibleEvent.events_to_messages(view.events)
    total_tokens = self.llm.get_token_count(messages)
Collaborator:

Worth noting the LLM used by the condenser is not necessarily the LLM used by the agent, and condensation is intended to benefit the latter.


# Prefer token-aware check when LLM has context window info and
# we can estimate message tokens. Fallback to event-count otherwise.
try:
Collaborator:

This entire block probably deserves to be pulled out into a function that can be used by any condenser. I don't think there are any others in the SDK that need this info but it'd be good to have for folks extending condensers if it was, e.g., a static method on the base class.
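
As a rough illustration of that suggestion, the budget/usage computation could be exposed on the base condenser class as something like the following (hypothetical class and method names, not part of the current diff):

class CondenserBase:  # stand-in for the SDK's actual condenser base class
    @staticmethod
    def token_usage_and_budget(llm, view, margin_ratio: float = 0.1):
        """Return (total_tokens, budget), or None when model limits are unknown.

        Subclasses could call this instead of re-implementing the budget
        arithmetic. Per the earlier review note, llm should be the agent's
        LLM, since that is the context window condensation protects.
        """
        max_input = getattr(llm, "max_input_tokens", None)
        if not max_input:
            return None
        max_output = getattr(llm, "max_output_tokens", None) or 0
        budget = max_input - max_output - int(margin_ratio * max_input)
        messages = LLMConvertibleEvent.events_to_messages(view.events)
        return llm.get_token_count(messages), budget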

@blacksmith-sh (bot) commented Nov 3, 2025

[Automatic Post]: It has been a while since there was any activity on this PR. @enyst, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

