router and infer parrall. #965

hiworldwzj · 2025-07-09T05:51:28Z

No description provided.

gemini-code-assist

Summary of Changes

Hello @hiworldwzj, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant architectural refactoring to enhance the parallelism and efficiency of the LightLLM serving system. It transitions the communication paradigm between the router and inference components from a request-response RPC model to a continuous, shared-memory-driven approach. This change aims to reduce communication latency, streamline request processing, and provide a more robust mechanism for handling request lifecycle events like abortion.

Highlights

Inter-process Communication Overhaul: Replaced RPC-based communication between the router and inference processes with a shared-memory-based ShmReqsIOBuffer for improved parallelism and reduced overhead. This new mechanism allows for more direct and efficient data exchange.
Continuous Inference Loop: Introduced a persistent infer_loop within the inference backend that continuously processes requests and commands from the shared memory. This eliminates the need for explicit RPC calls for prefill, decode, and pause operations, streamlining the inference pipeline.
Enhanced Abort Handling: Implemented a dedicated AbortedReqCmd dataclass and a shared memory mechanism for robustly signaling and processing aborted requests. The router now explicitly marks requests as router_aborted via the shared buffer, ensuring timely and consistent cleanup by the inference processes.
Simplified Scheduling and Pausing: Removed the complex, event-driven pausing and scheduling logic, including pause_strategy and schedule_event. The new batch generation in the router manager now proceeds only when no requests are paused, simplifying the overall scheduling mechanism.
Node-level Synchronization: Added new distributed group and broadcast primitives (node_broadcast_tensor, node_nccl_group) to facilitate efficient synchronization of shared memory operations across inference processes within the same node.
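
To make the shared-memory hand-off described in the highlights concrete, here is a minimal sketch of a single-slot buffer built on Python's `multiprocessing.shared_memory`. It is not the PR's `ShmReqsIOBuffer`: the class name `TinyShmBuffer`, the header layout (a ready flag plus a payload length), the pickled payload, and the request dictionaries are all assumptions made for illustration. A real multi-process version would also need proper synchronization between writer and readers.

```python
import os
import pickle
import struct
from multiprocessing import shared_memory

# Assumed header layout: 1-byte ready flag + 4-byte payload length.
HEADER = struct.Struct("<BI")


class TinyShmBuffer:
    """Hypothetical single-slot buffer; not LightLLM's ShmReqsIOBuffer."""

    def __init__(self, name: str, size: int = 1 << 16):
        try:
            self.shm = shared_memory.SharedMemory(name=name, create=True, size=size)
            self.owner = True
            HEADER.pack_into(self.shm.buf, 0, 0, 0)  # start empty
        except FileExistsError:
            # Another handle already created the segment, so attach to it.
            self.shm = shared_memory.SharedMemory(name=name, create=False)
            self.owner = False

    def is_ready(self) -> bool:
        ready, _ = HEADER.unpack_from(self.shm.buf, 0)
        return ready == 1

    def write_obj(self, obj) -> None:
        data = pickle.dumps(obj)
        if HEADER.size + len(data) > self.shm.size:
            raise ValueError("payload too large for buffer")
        start = HEADER.size
        self.shm.buf[start:start + len(data)] = data
        # Publish the header only after the payload bytes are in place.
        HEADER.pack_into(self.shm.buf, 0, 1, len(data))

    def read_obj(self):
        _, length = HEADER.unpack_from(self.shm.buf, 0)
        start = HEADER.size
        obj = pickle.loads(bytes(self.shm.buf[start:start + length]))
        HEADER.pack_into(self.shm.buf, 0, 0, 0)  # mark the slot as consumed
        return obj

    def close(self) -> None:
        self.shm.close()
        if self.owner:
            self.shm.unlink()


def demo() -> list:
    name = f"demo_reqs_{os.getpid()}"
    router_side = TinyShmBuffer(name)  # creates the segment
    infer_side = TinyShmBuffer(name)   # attaches to the same segment
    try:
        router_side.write_obj([{"req_id": 1, "cmd": "run"}, {"req_id": 2, "cmd": "abort"}])
        assert infer_side.is_ready()
        return infer_side.read_obj()
    finally:
        infer_side.close()
        router_side.close()
```

Both handles live in one process here only to keep the sketch self-contained; in the design the summary describes, the writer would be the router and the readers would be the inference processes.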

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a significant architectural change by refactoring the interaction between the router and the inference engine. It moves from an RPC-based prefill/decode cycle to a continuous batching model where the inference engine runs its own loop. Communication is now handled via a shared memory buffer (ShmReqsIOBuffer), which should improve performance by reducing RPC overhead. The changes are extensive, touching routing, request queuing, and the model inference backend.

My review has identified a few areas for improvement, primarily concerning code clarity and robustness. I've suggested simplifying some logic to be more idiomatic and readable. More critically, I've pointed out the use of bare except blocks which can hide bugs and should be replaced with specific exception handling.
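
As a rough illustration of the "inference engine runs its own loop" model, the sketch below drains pending commands and then performs one decode step per iteration. It is an assumption-laden toy, not the PR's `infer_loop`: a `queue.SimpleQueue` stands in for the shared memory buffer, and `RunCmd`, `AbortCmd`, and `InferLoop` are hypothetical names.

```python
import queue
from dataclasses import dataclass


@dataclass
class RunCmd:  # hypothetical: start generating for a request
    req_id: int
    max_new_tokens: int


@dataclass
class AbortCmd:  # hypothetical stand-in for an abort command
    req_id: int


class InferLoop:
    """Toy continuous loop; a queue stands in for the shared memory buffer."""

    def __init__(self):
        self.cmds = queue.SimpleQueue()
        self.running = {}    # req_id -> tokens still to generate
        self.generated = {}  # req_id -> tokens generated so far

    def _drain_cmds(self) -> None:
        # Pick up everything the router has posted, without blocking.
        while True:
            try:
                cmd = self.cmds.get_nowait()
            except queue.Empty:
                return
            if isinstance(cmd, RunCmd):
                self.running[cmd.req_id] = cmd.max_new_tokens
                self.generated[cmd.req_id] = 0
            elif isinstance(cmd, AbortCmd):
                self.running.pop(cmd.req_id, None)

    def step(self) -> bool:
        """Run one iteration; return True while requests are still running."""
        self._drain_cmds()
        for req_id in list(self.running):
            self.generated[req_id] += 1  # pretend to decode one token
            self.running[req_id] -= 1
            if self.running[req_id] <= 0:
                del self.running[req_id]
        return bool(self.running)


def demo() -> dict:
    loop = InferLoop()
    loop.cmds.put(RunCmd(req_id=1, max_new_tokens=3))
    loop.cmds.put(RunCmd(req_id=2, max_new_tokens=5))
    loop.step()
    loop.step()
    loop.cmds.put(AbortCmd(req_id=2))  # abort arrives mid-generation
    while loop.step():
        pass
    return loop.generated
```

The point of the sketch is that an abort is just another message picked up at the top of the next iteration, so no separate RPC is needed to interrupt the loop.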

gemini-code-assist · 2025-07-09T05:54:03Z

lightllm/server/router/shm_reqs_io_buffer.py

+        try:
+            shm = shared_memory.SharedMemory(name=self.name, create=True, size=LIGHTLLM_REQS_BUFFER_BYTE_SIZE)
+        except:
+            shm = shared_memory.SharedMemory(name=self.name, create=False, size=LIGHTLLM_REQS_BUFFER_BYTE_SIZE)


Using a bare except: is a bad practice as it can catch unexpected exceptions (like KeyboardInterrupt or SystemExit) and hide bugs. It's better to catch specific exceptions. In this case, you should catch FileExistsError.

Suggested change

-        try:
-            shm = shared_memory.SharedMemory(name=self.name, create=True, size=LIGHTLLM_REQS_BUFFER_BYTE_SIZE)
-        except:
-            shm = shared_memory.SharedMemory(name=self.name, create=False, size=LIGHTLLM_REQS_BUFFER_BYTE_SIZE)
+        try:
+            shm = shared_memory.SharedMemory(name=self.name, create=True, size=LIGHTLLM_REQS_BUFFER_BYTE_SIZE)
+        except FileExistsError:
+            shm = shared_memory.SharedMemory(name=self.name, create=False, size=LIGHTLLM_REQS_BUFFER_BYTE_SIZE)
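
The difference between the two forms can be seen in isolation with a small sketch. The helper names and the `create` callables below are hypothetical stand-ins for the `SharedMemory(...)` call; only the exception-handling behavior is the point.

```python
def attach_with_bare_except(create):
    try:
        return create()
    except:  # noqa: E722 (deliberately bare, to show the problem)
        return "attached"


def attach_with_specific_except(create):
    try:
        return create()
    except FileExistsError:
        return "attached"


def already_exists():
    # The only failure the fallback is meant to handle.
    raise FileExistsError("segment already exists")


def permission_denied():
    # A genuine problem that should surface, not trigger the fallback.
    raise PermissionError("cannot create segment")


def interrupted():
    # Simulates Ctrl-C arriving during the create call.
    raise KeyboardInterrupt
```

With the bare `except:`, all three failures silently take the fallback path. With `except FileExistsError:`, only `already_exists` does, while `PermissionError` and `KeyboardInterrupt` propagate to the caller.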

gemini-code-assist · 2025-07-09T05:54:04Z

lightllm/server/router/shm_reqs_io_buffer.py

+            try:
+                shm = shared_memory.SharedMemory(name=self.name, create=True, size=LIGHTLLM_REQS_BUFFER_BYTE_SIZE)
+                logger.info(f"create shm {self.name}")
+            except:
+                shm = shared_memory.SharedMemory(name=self.name, create=False, size=LIGHTLLM_REQS_BUFFER_BYTE_SIZE)
+                logger.info(f"link shm {self.name}")


Similar to the previous point, the bare except: here should be replaced with a specific exception, FileExistsError, to avoid masking other potential errors and improve code robustness.

Suggested change

-            try:
-                shm = shared_memory.SharedMemory(name=self.name, create=True, size=LIGHTLLM_REQS_BUFFER_BYTE_SIZE)
-                logger.info(f"create shm {self.name}")
-            except:
-                shm = shared_memory.SharedMemory(name=self.name, create=False, size=LIGHTLLM_REQS_BUFFER_BYTE_SIZE)
-                logger.info(f"link shm {self.name}")
+            try:
+                shm = shared_memory.SharedMemory(name=self.name, create=True, size=LIGHTLLM_REQS_BUFFER_BYTE_SIZE)
+                logger.info(f"create shm {self.name}")
+            except FileExistsError:
+                shm = shared_memory.SharedMemory(name=self.name, create=False, size=LIGHTLLM_REQS_BUFFER_BYTE_SIZE)
+                logger.info(f"link shm {self.name}")
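
Since this create-or-link pattern appears more than once, one option is to factor it into a helper. The sketch below is only a suggestion: `create_or_link_shm` is a hypothetical name, and the size check is an addition that is not in the PR. It relies on the documented behavior that `size` is ignored when attaching to an existing block, so the helper verifies the attached segment is large enough.

```python
import logging
import os
from multiprocessing import shared_memory

logger = logging.getLogger(__name__)


def create_or_link_shm(name: str, size: int):
    """Return (shm, created). Hypothetical helper, not part of the PR."""
    try:
        shm = shared_memory.SharedMemory(name=name, create=True, size=size)
        logger.info("create shm %s", name)
        return shm, True
    except FileExistsError:
        # `size` is ignored on attach, so check the existing segment explicitly.
        shm = shared_memory.SharedMemory(name=name, create=False)
        if shm.size < size:
            shm.close()
            raise ValueError(f"existing shm {name} is smaller than {size} bytes")
        logger.info("link shm %s", name)
        return shm, False


def demo() -> tuple:
    name = f"demo_link_{os.getpid()}"
    first, created_first = create_or_link_shm(name, 1024)
    second, created_second = create_or_link_shm(name, 1024)
    try:
        first.buf[0] = 42      # write through the creating handle
        seen = second.buf[0]   # read through the linked handle
        return created_first, created_second, seen
    finally:
        second.close()
        first.close()
        first.unlink()
```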

gemini-code-assist · 2025-07-09T05:54:04Z

lightllm/server/router/manager.py

+    def _get_paused_req_num(self) -> int:
+        if self.running_batch is None:
+            return 0
+        else:
+            count = 0
+            for req in self.running_batch.reqs:
+                if req.is_paused:
+                    count += 1
+            return count


This function can be written more concisely using a generator expression with sum(). This improves readability and is more idiomatic Python.

    def _get_paused_req_num(self) -> int:
        if self.running_batch is None:
            return 0
        return sum(req.is_paused for req in self.running_batch.reqs)
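
A quick way to check that the two forms agree is to run them over stand-in objects. `FakeReq` and `FakeBatch` below are hypothetical test doubles, not LightLLM types. The sketch also shows one caveat: `sum(req.is_paused ...)` adds the flag values themselves, so it matches the loop only when `is_paused` is a real boolean (or 0/1). Counting with a filter avoids that assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeReq:  # hypothetical stand-in for a request object
    is_paused: object


@dataclass
class FakeBatch:  # hypothetical stand-in for the running batch
    reqs: List[FakeReq]


def paused_loop(batch: Optional[FakeBatch]) -> int:
    # Original form: explicit counter.
    if batch is None:
        return 0
    count = 0
    for req in batch.reqs:
        if req.is_paused:
            count += 1
    return count


def paused_sum(batch: Optional[FakeBatch]) -> int:
    # Suggested form: sums the flag values directly.
    if batch is None:
        return 0
    return sum(req.is_paused for req in batch.reqs)


def paused_filter(batch: Optional[FakeBatch]) -> int:
    # Alternative: counts truthy flags, independent of their numeric value.
    if batch is None:
        return 0
    return sum(1 for req in batch.reqs if req.is_paused)
```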

gemini-code-assist · 2025-07-09T05:54:04Z

lightllm/server/router/manager.py

+    def _get_paused_req_num_in_dp_index(self, dp_index: int) -> int:
+        if self.running_batch is None:
+            return 0
+        else:
+            count = 0
+            for req in self.running_batch.reqs:
+                if req.is_paused and req.sample_params.suggested_dp_index == dp_index:
+                    count += 1
+            return count


Similar to _get_paused_req_num, this function can be made more concise and readable by using sum() with a generator expression.

    def _get_paused_req_num_in_dp_index(self, dp_index: int) -> int:
        if self.running_batch is None:
            return 0
        return sum(
            req.is_paused and req.sample_params.suggested_dp_index == dp_index
            for req in self.running_batch.reqs
        )
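
If the caller ends up asking for the paused count of several dp indexes in a row, a single pass that groups by index may be worth considering. This is only a sketch of that idea: `paused_per_dp_index` and `make_req` are hypothetical helpers, and the request objects are `SimpleNamespace` stand-ins that mirror the attribute names used in the diff.

```python
from collections import Counter
from types import SimpleNamespace


def make_req(is_paused: bool, dp_index: int) -> SimpleNamespace:
    # Hypothetical stand-in with the same attribute shape as in the diff.
    return SimpleNamespace(
        is_paused=is_paused,
        sample_params=SimpleNamespace(suggested_dp_index=dp_index),
    )


def paused_in_dp_index(reqs, dp_index: int) -> int:
    # Per-index count, as in the suggested rewrite.
    return sum(
        req.is_paused and req.sample_params.suggested_dp_index == dp_index
        for req in reqs
    )


def paused_per_dp_index(reqs) -> Counter:
    # One pass over the batch; missing indexes read as 0.
    return Counter(
        req.sample_params.suggested_dp_index for req in reqs if req.is_paused
    )
```

A `Counter` returns 0 for indexes with no paused requests, so lookups behave like the per-index function without a second scan of the batch.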

hiworldwzj force-pushed the wzj_router branch from 1cc3784 to e186bed Compare July 14, 2025 09:11

hiworldwzj added 27 commits July 15, 2025 02:50

add shm_reqs_buffer.py

90b9ab3

fix

a6d2f24

fix

a2fcf3b

fix

bf7ce59

fix

b0062a5

fix

ca45a2c

fix

433a442

fix

297dcbc

fix

c80f351

fix

d9a4774

first overlap demo.

a52ca86

fix

4cb1f7b

fix

9cac880

fix

445c959

fix

992d7b5

fix

f885680

fix

c01ea7e

fix

e4dfa95

fix all

7e6cbf6

fix

f693e55

fix

3de8ab5

fix

2a5a9ce

fix

df79f8e

fix

5834e80

fix

dd5d872

fix

d3a9f88

fix

d4aa1ee

hiworldwzj and others added 30 commits July 15, 2025 06:24

fix

2dff9a1

fix

4c111a3

fix

b172eaf

fix

241ec63

inference overlap

678bb5f

Merge branch 'wzj_router' of https://github.com/ModelTC/lightllm into…

2d46245

… wzj_router

fix

204f6fa

fix

dde6f18

fix

6cd8c56

back the infer_struct

8cc2325

overlap sample

a7fbb15

add mtp index

ef35cf6

fix

2d81be7

fix next token ids.

8705d0a

fix

35f1bfa

fix

d437d7a

mtp overlap (draft)

ac47e1f

fix

4cfae7a

diverse mode ok

9dd795d

add penalty_counter mode

517c18e

fix

5fb73f9

fix

dcb76ed

fix

2efa7f5

fix

4157ff4

improve pin mem manager

7c1a597

overlap mtp

965cdae

merge latest

65fb8c6

fix

1dad212

fix mtp

0e3cdb7

fix

098ae9d
