[Bug]: timing out causes the agent to stop #5601

Open
1 task done
neubig opened this issue Dec 14, 2024 · 9 comments · May be fixed by #5951
Labels: bug (Something isn't working); fix-me (Attempt to fix this issue with OpenHands)

Comments

@neubig
Contributor

neubig commented Dec 14, 2024

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Describe the bug and reproduction steps

Currently, when a program has been running for 120 seconds and the command times out, the agent's state changes to "agent has encountered an error", and you have to send the agent a message asking it to keep going.

[Screenshot: Screenshot_20241214_112534_Chrome]

Better behavior would be that the agent gets a message that the command timed out but does not stop (this was the behavior in previous versions of OpenHands).
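The requested behavior can be illustrated with a minimal sketch. It is not OpenHands code: `CommandObservation` and `run_with_timeout` are hypothetical names, and the sketch only shows the idea of reporting a timeout as data that the agent can read, rather than as an error that ends the run.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CommandObservation:
    """Hypothetical result object handed back to the agent."""
    output: str
    timed_out: bool


def run_with_timeout(argv: list[str], timeout_s: float) -> CommandObservation:
    """Run a command and report a timeout as data instead of raising."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
        return CommandObservation(output=done.stdout, timed_out=False)
    except subprocess.TimeoutExpired:
        # The child process has been killed; the caller still gets an observation.
        return CommandObservation(
            output=f"Command timed out after {timeout_s} seconds.",
            timed_out=True,
        )


# A command that sleeps longer than its budget, and one that finishes in time.
slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5)
fast = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=10)
```

With this shape, the caller can keep its loop running and simply pass `slow.output` to the agent as the result of the command.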

OpenHands Installation

app.all-hands.dev

OpenHands Version

No response

Operating System

None

Logs, Errors, Screenshots, and Additional Context

No response

@neubig neubig added bug Something isn't working fix-me Attempt to fix this issue with OpenHands labels Dec 14, 2024
@openhands-agent
Contributor

OpenHands started fixing the issue! You can monitor the progress here.

@enyst
Collaborator

enyst commented Dec 18, 2024

Better behavior would be that the agent gets a message that the command timed out but does not stop (this was the behavior in previous versions of OpenHands).

I'll add here for the record: I am unable to reproduce this behavior on a local installation (with local Docker); it works just fine there. I can see it only on the hosted version.

We use many timeouts in the code, and in this case, I've looked at 3 of them:

  • the runtime timeout is not hit on a local install; the 5 extra seconds help. I suspect this is the one that causes trouble on the hosted version.
  • the bash command timeout is hit; the agent is told that the command timed out and continues normally
  • the same goes for the ipython timeout.

All of these default to the value of the sandbox.timeout setting.
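The layering described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual OpenHands code: `SandboxConfig`, `command_timeout`, and `http_request_timeout` are hypothetical names, and the default of 120 is taken from the issue description rather than from the source. The point is that the outer HTTP request waits slightly longer than the inner command timeout, so the inner one fires first and can report cleanly.

```python
from dataclasses import dataclass

# Assumption: the "5 extra seconds" mentioned above, written as a named constant.
HTTP_TIMEOUT_MARGIN_S = 5


@dataclass
class SandboxConfig:
    """Hypothetical stand-in for the settings object; only `timeout` comes from the thread."""
    timeout: int = 120  # assumed default, based on the 120 seconds in the issue description


def command_timeout(config: SandboxConfig) -> int:
    """Timeout enforced inside the sandbox on the bash or IPython command."""
    return config.timeout


def http_request_timeout(config: SandboxConfig) -> int:
    """Timeout on the client's HTTP request to the runtime, padded by the margin."""
    return config.timeout + HTTP_TIMEOUT_MARGIN_S


config = SandboxConfig()
inner = command_timeout(config)
outer = http_request_timeout(config)
```

If the runtime itself stops responding, only the outer timeout can fire, which matches the suspicion that the runtime timeout is the one misbehaving on the hosted version.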

@enyst enyst mentioned this issue Dec 18, 2024
1 task
@avi12

avi12 commented Dec 30, 2024

@neubig
Contributor Author

neubig commented Dec 30, 2024

It would still be great to get this fixed.

@avi12

avi12 commented Dec 31, 2024 via email

@rbren
Collaborator

rbren commented Dec 31, 2024

The problem here is that the underlying runtime dies (e.g. due to running out of memory), which leaves the HTTP client in the lurch. The HTTP request times out, and we get this error.

It's not an easy fix unfortunately. We could probably add an API to check how many times the runtime has rebooted, and send the user a message like "Runtime rebooted, potentially due to memory usage. Please try again."
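The reboot-count idea could look something like the sketch below. No such API exists in the thread; `get_restart_count` is a hypothetical callable standing in for whatever the runtime would expose, and only the message text is taken from the comment above.

```python
from typing import Callable, Optional

# Message text quoted from the proposal in this thread.
REBOOT_MESSAGE = "Runtime rebooted, potentially due to memory usage. Please try again."


def explain_failure(
    restarts_before: int,
    get_restart_count: Callable[[], int],
) -> Optional[str]:
    """After a failed request, return a user-facing message if the runtime restarted.

    `restarts_before` is the count recorded before the request was sent;
    `get_restart_count` is a hypothetical hook that queries the current count.
    """
    restarts_now = get_restart_count()
    if restarts_now > restarts_before:
        return REBOOT_MESSAGE
    # No restart detected: the failure has some other cause.
    return None


rebooted = explain_failure(0, lambda: 1)
unchanged = explain_failure(2, lambda: 2)
```

Comparing the count before and after the request keeps the check cheap and avoids guessing at the cause from the HTTP timeout alone.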

@neubig
Contributor Author

neubig commented Dec 31, 2024

OK, sounds good. I confirmed that if I just run a command that times out (sleep 120) I get the expected message.

Separately, this is happening when I run the OpenHands unit tests using our standard unit-testing GitHub workflow. A combination of:

  1. A better error message (this issue) and
  2. Configurable runtime size (Add runtime size configuration feature #5805)

should make this significantly better.

@neubig neubig added fix-me Attempt to fix this issue with OpenHands and removed fix-me Attempt to fix this issue with OpenHands labels Dec 31, 2024
@openhands-agent
Contributor

OpenHands started fixing the issue! You can monitor the progress here.

@openhands-agent
Contributor

An attempt was made to automatically fix this issue, but it was unsuccessful. A branch named 'openhands-fix-issue-5601' has been created with the attempted changes. You can view the branch here. Manual intervention may be required.

Additional details about the failure:
While some progress has been made in understanding the root cause, the issue hasn't been fully resolved yet. From the thread discussion, it became clear that:

  1. The original issue is more complex than just a timeout problem: it is related to the underlying runtime dying (e.g., due to memory issues), which causes the HTTP client to time out.

  2. The simple timeout case (like running sleep 120) works as expected, but the more complex cases involving runtime failures still need to be addressed.

  3. A proposed solution was mentioned to add an API to check runtime reboot counts and provide better error messages like "Runtime rebooted, potentially due to memory usage. Please try again."

  4. The issue is being addressed alongside another PR (Add runtime size configuration feature #5805) for configurable runtime size.

The AI agent's last message seems to be describing an ideal solution but doesn't reflect the actual current state of the fix. The thread indicates this is still an ongoing issue that requires additional work, particularly around handling runtime failures and providing better error messages to users.

neubig pushed a commit that referenced this issue Jan 1, 2025
This ensures that all requests go through the proper error handling path,
including the 502 error handling that converts the error to a more helpful
AgentRuntimeDisconnectedError message.

Fixes #5601
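
As a rough illustration of what the commit message describes, the sketch below routes every request through one checking function that turns a 502 into a more helpful error. Only the name AgentRuntimeDisconnectedError comes from the commit message; the class body, `check_response`, `send_request`, and the example URL are hypothetical stand-ins, not the actual OpenHands implementation.

```python
from typing import Callable


class AgentRuntimeDisconnectedError(Exception):
    """Stand-in for the error named in the commit message; the real class lives in OpenHands."""


def check_response(status_code: int, url: str) -> None:
    """Single error-handling path: convert a 502 into a descriptive error."""
    if status_code == 502:
        raise AgentRuntimeDisconnectedError(
            f"Runtime appears to be disconnected (502 from {url})."
        )
    if status_code >= 400:
        raise RuntimeError(f"Request to {url} failed with status {status_code}.")


def send_request(do_request: Callable[[], int], url: str) -> int:
    """Hypothetical wrapper: every request goes through check_response.

    `do_request` performs the HTTP call and returns its status code.
    """
    status = do_request()
    check_response(status, url)
    return status


ok_status = send_request(lambda: 200, "http://runtime.invalid/execute")
```

Funnelling all calls through one wrapper is what prevents a request from bypassing the 502 handling.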
@neubig neubig self-assigned this Jan 1, 2025

5 participants