-
Notifications
You must be signed in to change notification settings - Fork 42
feat: add node-level timeouts to prevent stuck queued states #462
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 3 commits
735cd6d
4303a3d
8634281
4ee52fd
4be13c9
9839bc6
de475b8
2c7f3c1
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -13,17 +13,21 @@ | |
|
|
||
|
|
||
| async def find_state(namespace_name: str, nodes: list[str]) -> State | None: | ||
| current_time_ms = int(time.time() * 1000) | ||
| data = await State.get_pymongo_collection().find_one_and_update( | ||
| { | ||
| "namespace_name": namespace_name, | ||
| "status": StateStatusEnum.CREATED, | ||
| "node_name": { | ||
| "$in": nodes | ||
| }, | ||
| "enqueue_after": {"$lte": int(time.time() * 1000)} | ||
| "enqueue_after": {"$lte": current_time_ms} | ||
| }, | ||
| { | ||
| "$set": {"status": StateStatusEnum.QUEUED} | ||
| "$set": { | ||
| "status": StateStatusEnum.QUEUED, | ||
| "queued_at": current_time_ms | ||
| } | ||
| }, | ||
| return_document=ReturnDocument.AFTER | ||
| ) | ||
|
Comment on lines
20
to
52
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The query seems to be correct, however how will we repick timed-out states? |
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -38,6 +38,7 @@ | |
| from apscheduler.schedulers.asyncio import AsyncIOScheduler | ||
| from apscheduler.triggers.cron import CronTrigger | ||
| from .tasks.trigger_cron import trigger_cron | ||
| from .tasks.check_node_timeout import check_node_timeout | ||
|
|
||
| # Define models list | ||
| DOCUMENT_MODELS = [State, GraphTemplate, RegisteredNode, Store, Run, DatabaseTriggers] | ||
|
|
@@ -76,6 +77,15 @@ async def lifespan(app: FastAPI): | |
| max_instances=1, | ||
| id="every_minute_task" | ||
| ) | ||
| scheduler.add_job( | ||
| check_node_timeout, | ||
| CronTrigger.from_crontab("* * * * *"), | ||
| replace_existing=True, | ||
| misfire_grace_time=60, | ||
| coalesce=True, | ||
| max_instances=1, | ||
| id="check_node_timeout_task" | ||
| ) | ||
|
Comment on lines
+87
to
+95
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. not needed with db queries |
||
| scheduler.start() | ||
|
|
||
| # main logic of the server | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -28,6 +28,9 @@ class State(BaseDatabaseModel): | |
| retry_count: int = Field(default=0, description="Number of times the state has been retried") | ||
| fanout_id: str = Field(default_factory=lambda: str(uuid.uuid4()), description="Fanout ID of the state") | ||
| manual_retry_fanout_id: str = Field(default="", description="Fanout ID from a manual retry request, ensuring unique retries for unite nodes.") | ||
| queued_at: Optional[int] = Field(None, description="Unix time in milliseconds when state was queued") | ||
| timeout_minutes: Optional[int] = Field(None, gt=0, description="Timeout in minutes for this specific state, taken from node registration") | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. a better model here is to store |
||
| queued_at: Optional[int] = Field(None, description="Unix time in milliseconds when the state was queued") | ||
|
agam1092005 marked this conversation as resolved.
Outdated
|
||
|
|
||
| @before_event([Insert, Replace, Save]) | ||
| def _generate_fingerprint(self): | ||
|
|
@@ -102,5 +105,12 @@ class Settings: | |
| ("status", 1), | ||
| ], | ||
| name="run_id_status_index" | ||
| ), | ||
| IndexModel( | ||
| [ | ||
| ("status", 1), | ||
| ("queued_at", 1), | ||
| ], | ||
| name="timeout_query_index" | ||
| ) | ||
|
agam1092005 marked this conversation as resolved.
|
||
| ] | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. while this model of periodic jobs will work, its unnecessary as we can write a database query to figure out timeout nodes, we probably do not need to set the status timeout just from if the status is Queued and current_time > timeout_at we can figure it. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,42 @@ | ||
| import time | ||
| from app.models.db.state import State | ||
| from app.models.state_status_enum import StateStatusEnum | ||
| from app.singletons.logs_manager import LogsManager | ||
| from app.config.settings import get_settings | ||
|
|
||
| logger = LogsManager().get_logger() | ||
|
|
||
|
|
||
| async def check_node_timeout(): | ||
| try: | ||
| settings = get_settings() | ||
| current_time_ms = int(time.time() * 1000) | ||
|
|
||
| logger.info(f"Checking for timed out nodes at {current_time_ms}") | ||
|
|
||
| # Find all QUEUED states with queued_at set | ||
| queued_states = await State.find( | ||
| State.status == StateStatusEnum.QUEUED, | ||
| State.queued_at != None | ||
| ).to_list() | ||
|
coderabbitai[bot] marked this conversation as resolved.
Outdated
|
||
|
|
||
| states_to_timeout = [] | ||
|
|
||
| for state in queued_states: | ||
| # Use state-specific timeout if available, otherwise fall back to global | ||
| timeout_minutes = state.timeout_minutes if state.timeout_minutes else settings.node_timeout_minutes | ||
| timeout_ms = timeout_minutes * 60 * 1000 | ||
| timeout_threshold = current_time_ms - timeout_ms | ||
|
|
||
| if state.queued_at <= timeout_threshold: | ||
| state.status = StateStatusEnum.TIMEDOUT | ||
| state.error = f"Node execution timed out after {timeout_minutes} minutes" | ||
| states_to_timeout.append(state) | ||
|
|
||
| if states_to_timeout: | ||
| # Update all timed out states in bulk | ||
| await State.save_all(states_to_timeout) | ||
| logger.info(f"Marked {len(states_to_timeout)} states as TIMEDOUT") | ||
|
|
||
| except Exception: | ||
| logger.error("Error checking node timeout", exc_info=True) | ||
Uh oh!
There was an error while loading. Please reload this page.