18 changes: 14 additions & 4 deletions lmdeploy/pytorch/paging/scheduler.py
@@ -246,7 +246,11 @@ def _reorder_waiting():
     def _schedule_decoding(self, prealloc_size: int = 0):
         """Schedule decoding."""
 
-        running = self.running
+        def _reorder_running():
Collaborator:

I think we can just sort running in reverse order, so we don't need nested loops.

@Tsundoku958 (Contributor, Author) commented on Nov 17, 2025:
I noticed that in the original logic, when the running sequences are traversed, the eviction priority follows the same order as the block-allocation requests. Consider the following two scenarios:

  1. When the sequences are sorted in descending order of arrival time and some free blocks remain during scheduling, the latest requests are allocated space first. This may cause the earliest-arriving requests to fail to obtain GPU blocks.
  2. When the sequences are sorted in ascending order of arrival time and almost no free blocks remain during scheduling, the earliest-arriving requests are evicted first. This violates the First-Come-First-Served (FCFS) principle.

Therefore, I think a nested loop addresses both situations: the outer loop allocates in arrival order, and the inner loop preempts the most recently arrived running sequences when eviction cannot free enough blocks (see the sketch below).
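
A minimal, self-contained sketch of that nested-loop idea. The `Seq` fields and `schedule_fcfs` here are hypothetical illustrations, not the scheduler's actual `SchedulerSequence`/block-manager API:

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Seq:
    name: str
    arrive_time: float
    held: int  # blocks this sequence already holds on the GPU (hypothetical field)
    need: int  # additional blocks required for its next step (hypothetical field)


def schedule_fcfs(running, free_blocks):
    """Serve sequences in arrival order; preempt the newest on shortage."""
    queue = deque(sorted(running, key=lambda s: s.arrive_time))
    scheduled, preempted = [], []
    while queue:
        seq = queue.popleft()  # outer loop: earliest arrival first
        # Inner loop: reclaim blocks from the most recently arrived
        # running sequences until this one fits (or nothing is left).
        while free_blocks < seq.need and queue:
            victim = queue.pop()
            free_blocks += victim.held
            preempted.append(victim)
        if free_blocks < seq.need:
            preempted.append(seq)  # still no room: back to waiting
            continue
        free_blocks -= seq.need
        seq.held += seq.need
        scheduled.append(seq)
    return scheduled, preempted


running = [
    Seq('a', arrive_time=1.0, held=4, need=2),
    Seq('b', arrive_time=2.0, held=3, need=2),
    Seq('c', arrive_time=3.0, held=2, need=2),
]
scheduled, preempted = schedule_fcfs(running, free_blocks=5)
print([s.name for s in scheduled])  # ['a', 'b']
print([s.name for s in preempted])  # ['c']  (the newest yields, not the oldest)
```

Sorting ascending plus tail preemption keeps both guarantees at once: the earliest arrivals are served first, and when blocks run out it is the newest arrivals that give way.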

"""Reorder running."""
return sorted(self.running, key=lambda seq: seq.arrive_time)

running = _reorder_running()
assert len(running) != 0

eviction_helper = self.eviction_helper
@@ -270,9 +274,9 @@ def __evict_for_seq(seq: SchedulerSequence, num_required_blocks: int):
             return eviction_helper.evict_for_seq(seq, evictable, prealloc_size)
 
         # 1. running
-        for seq in running:
+        while len(running) > 0:
             # token + n
-
+            seq = running.pop(0)
             num_required_blocks = self.block_manager.num_required_blocks(seq, prealloc_size)
             if len(seq.logical_blocks) + num_required_blocks > self.block_manager.num_gpu_blocks:
                 # Reach max gpu cache size.
@@ -284,7 +288,13 @@ def __evict_for_seq(seq: SchedulerSequence, num_required_blocks: int):
                 seq.set_step(0)
                 continue
 
-            if not __evict_for_seq(seq, num_required_blocks):
+            while not __evict_for_seq(seq, num_required_blocks):
+                if len(running) == 0:
+                    break
+                seq_preempted = running.pop(-1)
+                self._set_message_status(seq_preempted, MessageStatus.WAITING)
+
+            if self.block_manager.get_num_free_gpu_blocks() < num_required_blocks:
                 self._set_message_status(seq, MessageStatus.WAITING)
                 continue
