Gateway returns not meaningful response when pod is running but container not ready #781

Open
Jeffwan opened this issue Mar 3, 2025 · 2 comments · May be fixed by #787
Labels
area/gateway kind/bug Something isn't working kind/documentation Improvements or additions to documentation

Comments

@Jeffwan
Collaborator

Jeffwan commented Mar 3, 2025

🐛 Describe the bug

We made a few changes in recent weeks to make sure gateway responses are explainable, but I still see some unexpected cases today.

Pod Running - {"error":{"code":500,"message":"invalid character 'u' looking for beginning of value"}}

Checking the pod status:

NAME                                           READY   STATUS    RESTARTS   AGE
deepseek-r1-671b-858b4b9569-4w46n-head-2r8p8   0/1     Running

Checking the gateway logs:

E0303 00:05:28.686093       1 gateway.go:502] "error to unmarshal response" err="invalid character 'u' looking for beginning of value" requestID="89b2bf28-e03d-4211-b007-5a1b9eebc8de" responseBody="upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: Connection refused"
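The log line above shows the gateway trying to parse a plain-text upstream error as JSON. A minimal sketch of how that case could be turned into a more meaningful response is shown below. The helper name `describeUpstreamBody`, the choice of 503, and the message wording are assumptions for illustration, not the gateway's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// errorDetail and errorBody mirror the {"error":{"code":...,"message":...}}
// envelope seen in the responses quoted in this issue.
type errorDetail struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

type errorBody struct {
	Error errorDetail `json:"error"`
}

// describeUpstreamBody is a hypothetical helper. It checks whether the
// upstream body is a JSON object. If it is not (for example the plain-text
// "upstream connect error ..." body), it returns 503 and a message that
// includes the raw body instead of surfacing the JSON parser error.
func describeUpstreamBody(body []byte) (int, string) {
	var parsed map[string]any
	if err := json.Unmarshal(body, &parsed); err == nil {
		return http.StatusOK, "upstream returned JSON"
	}
	// Assumption: a non-JSON body means the backend was unreachable or not ready.
	return http.StatusServiceUnavailable,
		fmt.Sprintf("upstream returned a non-JSON response, the pod may not be ready: %s", body)
}

func main() {
	body := []byte("upstream connect error or disconnect/reset before headers")
	code, msg := describeUpstreamBody(body)
	out, _ := json.Marshal(errorBody{Error: errorDetail{Code: code, Message: msg}})
	fmt.Println(string(out))
}
```

With this shape, the client would see the upstream failure text rather than "invalid character 'u' looking for beginning of value".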

The root problem is that we only consider the Pod status (phase) and do not check whether its containers are ready. In that case, the server is not yet ready to serve requests, but the router still routes the request to the pod, which results in a failure.
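The check described above can be sketched as follows. The `Pod` and `PodCondition` types here are simplified, hypothetical stand-ins for the Kubernetes API types (in `k8s.io/api/core/v1` the corresponding values are `PodRunning`, `PodReady`, and `ConditionTrue`), and `isRoutable` and `filterRoutable` are illustrative names, not functions from the gateway.

```go
package main

import "fmt"

// PodCondition and Pod are simplified stand-ins for the Kubernetes types,
// kept local so the sketch is self-contained.
type PodCondition struct {
	Type   string // e.g. "Ready"
	Status string // "True", "False", or "Unknown"
}

type Pod struct {
	Name        string
	Phase       string // e.g. "Pending", "Running"
	Terminating bool   // stands in for a non-nil deletion timestamp
	Conditions  []PodCondition
}

// isRoutable reports whether a pod should receive traffic: it must be
// Running, not terminating, and have the Ready condition set to "True".
// Checking only Phase == "Running" is the gap described in this issue.
func isRoutable(p Pod) bool {
	if p.Phase != "Running" || p.Terminating {
		return false
	}
	for _, c := range p.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false // no Ready condition reported yet
}

// filterRoutable keeps only the pods that pass isRoutable.
func filterRoutable(pods []Pod) []Pod {
	var out []Pod
	for _, p := range pods {
		if isRoutable(p) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []Pod{
		{Name: "head-not-ready", Phase: "Running", Conditions: []PodCondition{{"Ready", "False"}}},
		{Name: "head-ready", Phase: "Running", Conditions: []PodCondition{{"Ready", "True"}}},
	}
	for _, p := range filterRoutable(pods) {
		fmt.Println(p.Name)
	}
}
```

A pod showing `0/1 Running` would be excluded by this filter, so the router would have no target and could return an explicit error instead of forwarding the request.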

We should fix this issue and add a detailed documentation page on the state machine and the result codes a user may receive.

Pod terminating - {"error":{"code":503,"message":"error on getting pods for model deepseek-r1-671b"}}

Pod does not exist - {"error":{"code":400,"message":"model deepseek-r1-671b does not exist"}}

ContainerCreating - {"error":{"code":503,"message":"error on getting pods for model deepseek-r1-671b"}}
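As a starting point for the documentation page, the responses observed in this issue could be recorded as data. This is only a sketch: the type and function names are hypothetical, and the table reflects what was observed here, not a specification of what the gateway should return.

```go
package main

import "fmt"

// observedResponse records one pod state and the gateway response seen for it.
type observedResponse struct {
	State   string
	Code    int
	Message string
}

// observed lists the cases reported in this issue, as observed on 0.2.0.
var observed = []observedResponse{
	{"Running, container not ready", 500, "invalid character 'u' looking for beginning of value"},
	{"Terminating", 503, "error on getting pods for model deepseek-r1-671b"},
	{"Pod does not exist", 400, "model deepseek-r1-671b does not exist"},
	{"ContainerCreating", 503, "error on getting pods for model deepseek-r1-671b"},
}

// codeFor returns the observed status code for a state, if one was recorded.
func codeFor(state string) (int, bool) {
	for _, r := range observed {
		if r.State == state {
			return r.Code, true
		}
	}
	return 0, false
}

func main() {
	for _, r := range observed {
		fmt.Printf("%-30s %d  %s\n", r.State, r.Code, r.Message)
	}
}
```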

Steps to Reproduce

Make sure the pod is Running, but use a readiness probe to keep the container from becoming ready.

Expected behavior

In such a case, the router should not forward requests to the pod.

Environment

0.2.0

@Jeffwan Jeffwan added area/gateway kind/bug Something isn't working kind/documentation Improvements or additions to documentation labels Mar 3, 2025
@Jeffwan Jeffwan added this to the v0.3.0 milestone Mar 3, 2025
@Jeffwan Jeffwan changed the title from "Gateway returns not meaningful response when pod is ready but container not ready" to "Gateway returns not meaningful response when pod is running but container not ready" Mar 3, 2025
@Jeffwan
Collaborator Author

Jeffwan commented Mar 3, 2025

Had a short sync-up with @varungup90. I mean the pod is Running but not ready. A pod only becomes ready after all of its containers are ready.

@varungup90 varungup90 linked a pull request Mar 3, 2025 that will close this issue
@Jeffwan
Collaborator Author

Jeffwan commented Mar 4, 2025


I think the problem is probably that the environment was missing the change from #776, so the gateway forwarded the request to a worker pod. Note that the worker uses a different probe from the head.

After applying the change, it works fine:

{"error":{"code":503,"message":"error on getting pods for model deepseek-r1-671b"}}

We should improve the logs, for example to indicate that the pod may not be ready.
