Incorrect Routing of Upstream Packets (Two clients and one game server) #988

Open
zezhehh opened this issue Jul 8, 2024 · 23 comments
Labels
kind/bug Something isn't working

Comments

@zezhehh
Contributor

zezhehh commented Jul 8, 2024

What happened:

Hi all,

We've observed an issue where a proxy pod uses the same socket for traffic from different clients to the same game server, so one of the clients does not receive responses from the game server. Have any specific edge cases been identified as causing this?

[Our architecture setup]
In the game server Kubernetes cluster, we have a Load Balancer that routes to multiple proxy pods (not running as sidecars) and control planes with the Agones provider. We’re using the same token for both clients.

image

What you expected to happen:

We expect both clients to receive their corresponding responses.

image

How to reproduce it (as minimally and precisely as possible):

Unknown. Once it started occurring at some point, it kept happening intermittently throughout the day. We suspect there may be a buggy state in a specific pod instance.

Anything else we need to know?:

Environment:

  • Quilkin version:
    v0.8.0

  • Execution environment (binary, container, etc):
    kubernetes, container

    Containers:
      quilkin:
        Container ID:  containerd://9975e4e955b7102297415506422e3c1ebd5b4c39a61bd5039656807e5ae4a1a7
        Image:         us-docker.pkg.dev/quilkin/release/quilkin:0.8.0
        Image ID:      us-docker.pkg.dev/quilkin/release/quilkin@sha256:3f0abe1af9bc378b16af84f7497f273caaf79308fd04ca302974d070fe68b8e2
  • Operating system:

  • Custom filters? (Yes/No - if so, what do they do?):

    version: v1alpha1
    filters:
      - name: quilkin.filters.capture.v1alpha1.Capture
        config:
          suffix:
            size: 7
            remove: true
      - name: quilkin.filters.token_router.v1alpha1.TokenRouter
  • Log(s):
  • Others:
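
For readers unfamiliar with the filter chain above, here is a minimal Python sketch of what a suffix capture followed by token routing amounts to, assuming the Capture filter takes the last 7 bytes of each packet as the token and strips them before forwarding. This is not Quilkin's code; the function names, the tokens, and the payload are hypothetical, and only the endpoint addresses come from this report.

```python
from typing import Dict, List, Set, Tuple

Address = Tuple[str, int]


def capture_suffix(packet: bytes, size: int = 7, remove: bool = True) -> Tuple[bytes, bytes]:
    """Return (token, payload): the trailing `size` bytes and what gets forwarded."""
    if len(packet) < size:
        raise ValueError("packet is shorter than the configured token size")
    split = len(packet) - size
    token = packet[split:]
    # With remove=True the token is stripped before the packet goes upstream.
    payload = packet[:split] if remove else packet
    return token, payload


def route_by_token(token: bytes, endpoints: Dict[Address, Set[bytes]]) -> List[Address]:
    """Return every endpoint whose token set contains the captured token."""
    return sorted(addr for addr, tokens in endpoints.items() if token in tokens)


# Hypothetical tokens; the two game server addresses appear later in this thread.
ENDPOINTS: Dict[Address, Set[bytes]] = {
    ("10.64.8.46", 8884): {b"gameAAA"},
    ("10.64.8.46", 7262): {b"gameBBB"},
}

token, payload = capture_suffix(b"move:north" + b"gameBBB")
destinations = route_by_token(token, ENDPOINTS)
```

Because both clients send the same token, the token alone selects the game server; it cannot distinguish the clients from each other. That distinction has to come from the client's source address.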
@zezhehh zezhehh added the kind/bug Something isn't working label Jul 8, 2024
@XAMPPRocky
Collaborator

Thank you for your issue! Would you mind testing one of the latest images to see if you can reproduce it? Just to eliminate the possibility that it has already been fixed.

@zezhehh
Contributor Author

zezhehh commented Jul 9, 2024

> Thank you for your issue! Would you mind testing one of the latest images to see if you can reproduce it? Just to eliminate the possibility that it has already been fixed.

@XAMPPRocky Thanks for your reply. We would appreciate it if a more recent image were available.
us-docker.pkg.dev/quilkin/release/quilkin:0.9.0-dev-50d91e4
Or is there a guide for building a custom image?


Edited: make build-image works

@markmandel
Contributor

You can also grab an image from one of our PR builds, e.g. #987 (comment)

@zezhehh
Contributor Author

zezhehh commented Jul 10, 2024

> Thank you for your issue! Would you mind testing one of the latest images to see if you can reproduce it? Just to eliminate the possibility that it has already been fixed.

@XAMPPRocky We have tried using the latest image, but unfortunately, the issue persists. Additionally, the CPU and memory usage are much higher than in version v0.8.0. 🥲

@XAMPPRocky
Collaborator

@zezhehh That is odd, because we have used and tested this setup, with one token per gameserver and multiple clients on a single proxy, and haven't had any issues. Would you be able to check the load balancer setup you're running in front of the proxy? That is my first guess, as we don't put any load balancers in front of the proxies, so it's the one difference I see from the setup we have tested.

@markmandel
Contributor

@zezhehh can you share what kind of LB it is? (i.e. is it a Google Cloud / AWS LoadBalancer? How is it configured etc?) Maybe there is something in there.

@zezhehh
Contributor Author

zezhehh commented Jul 11, 2024

@XAMPPRocky @markmandel

We have a GCP k8s LB configured as follows, which should preserve the original source IP:port.

Name:                     quilkin-proxy
Namespace:                quilkin
Labels:                   app.kubernetes.io/managed-by=Terraform
Annotations:              cloud.google.com/l4-rbs: enabled
                          cloud.google.com/neg: {"exposed_ports": {"7337":{},"7338":{}}}
                          cloud.google.com/neg-status:
                            {"network_endpoint_groups":{"7337":"k8s1-c8523907-quilkin-quilkin-proxy-7337-c77ae3d6","7338":"k8s1-c8523907-quilkin-quilkin-proxy-7338-c0...
                          service.kubernetes.io/backend-service: k8s2-e2o2llay-quilkin-quilkin-proxy-6p9r6aaw
                          service.kubernetes.io/firewall-rule: k8s2-e2o2llay-quilkin-quilkin-proxy-6p9r6aaw
                          service.kubernetes.io/firewall-rule-for-hc: k8s2-e2o2llay-quilkin-quilkin-proxy-6p9r6aaw-fw
                          service.kubernetes.io/healthcheck: k8s2-e2o2llay-quilkin-quilkin-proxy-6p9r6aaw
                          service.kubernetes.io/udp-forwarding-rule: a6675171c0d494944ac00781b235adf0
Selector:                 role=proxy
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.64.10.69
IPs:                      10.64.10.69
IP:                       34.34.148.137
LoadBalancer Ingress:     34.34.148.137
Port:                     proxy-udp  7337/UDP
TargetPort:               proxy-udp/UDP
NodePort:                 proxy-udp  30118/UDP
Endpoints:                10.64.64.185:7777,10.64.64.187:7777,10.64.64.74:7777
Port:                     ping-udp  7338/UDP
TargetPort:               ping-udp/UDP
NodePort:                 ping-udp  31821/UDP
Endpoints:                10.64.64.185:7600,10.64.64.187:7600,10.64.64.74:7600
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     30737

The issue has been confirmed by examining dumped UDP traffic in the pcap file, which can be viewed using Wireshark.

Quilkin Proxy Capture.pcap.zip

The file contains data from two ongoing games.

The Quilkin proxy is identified as 10.64.65.99.

In the first game, involving clients 35.189.221.32:38400 and 35.189.221.32:38402, communicating with the game server 10.64.8.46:8884, everything is functioning correctly.

However, in the second game, involving clients 35.189.221.32:38404 and 35.189.221.32:38405, communicating with the game server 10.64.8.46:7262, the client 35.189.221.32:38404 did not receive any response.

An example request can be identified with correlation_id: 044bbeb (Started from the packet No. 1835).

image

  • Packet No. 1835: The client 35.189.221.32:38404 sent a request to the proxy listener port at 10.64.65.99:7777.
  • Packet No. 1836: The proxy forwarded the packet to the game server 10.64.8.46:7262 from the socket at 10.64.65.99:34118.
  • Packets No. 1837-1839: The game server sent three packets in response to the same socket 10.64.65.99:34118.
  • Packets No. 1840-1842: The proxy forwarded these three packets to the other client at 35.189.221.32:38405.
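
The pattern in the packets above can be checked mechanically. The following Python sketch is a heuristic, not a pcap parser: it assumes the capture has already been reduced to ordered (source, destination) address pairs, and it pairs each client-to-proxy packet with the next packet the proxy sends from a non-listener port. The function name is hypothetical. The first four entries mirror packets No. 1835, 1836, 1837 and 1840; the last two are an illustrative pair for the second client, consistent with the observation that both clients went through port 34118. The source port of the proxy-to-client packet is assumed to be the listener port.

```python
from collections import defaultdict

PROXY_LISTEN = ("10.64.65.99", 7777)


def shared_upstream_sockets(packets, proxy_listen=PROXY_LISTEN):
    """Map (proxy upstream socket, game server) -> clients, where more than one client shares it."""
    clients_by_upstream = defaultdict(set)
    pending_client = None
    for src, dst in packets:
        if dst == proxy_listen:
            # Client -> proxy listener: remember who sent it.
            pending_client = src
        elif src[0] == proxy_listen[0] and src != proxy_listen and pending_client is not None:
            # Proxy -> game server from an upstream socket: attribute it to that client.
            clients_by_upstream[(src, dst)].add(pending_client)
            pending_client = None
    return {key: sorted(value) for key, value in clients_by_upstream.items() if len(value) > 1}


PACKETS = [
    (("35.189.221.32", 38404), PROXY_LISTEN),            # No. 1835
    (("10.64.65.99", 34118), ("10.64.8.46", 7262)),      # No. 1836
    (("10.64.8.46", 7262), ("10.64.65.99", 34118)),      # No. 1837 (response)
    (PROXY_LISTEN, ("35.189.221.32", 38405)),            # No. 1840 (source port assumed)
    (("35.189.221.32", 38405), PROXY_LISTEN),            # illustrative, not a numbered packet
    (("10.64.65.99", 34118), ("10.64.8.46", 7262)),      # illustrative, not a numbered packet
]

collisions = shared_upstream_sockets(PACKETS)
```

A non-empty result means at least two client addresses were forwarded through the same proxy-side socket to the same game server, which is the symptom described here. Interleaved traffic from many clients can confuse the pairing, so treat a hit as a pointer to look at in Wireshark rather than as proof.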

For visual reference:

We have a normal game:
image

And a problematic one:
image

@markmandel
Contributor

To clarify the point a little further before I start going over unit tests and seeing if I can replicate this in one:

  1. Is it intermittent where you get mis-routed packets, or once it starts, it doesn't stop?
  2. Is there any chance the game server could be sending data to the wrong port in the proxy by accident?

@zezhehh
Contributor Author

zezhehh commented Jul 12, 2024

> To clarify the point a little further before I start going over unit tests and seeing if I can replicate this in one:
>
>   1. Is it intermittent where you get mis-routed packets, or once it starts, it doesn't stop?
>   2. Is there any chance the game server could be sending data to the wrong port in the proxy by accident?

  1. It's intermittent, yes. Eventually, it occurs more frequently once it "starts."
  2. From the pcap file, we can see that Quilkin used the same socket, 10.64.65.99:34118, to communicate with the game server at 10.64.8.46:7262 on behalf of two clients (35.189.221.32:38404 and 35.189.221.32:38405). The game server sent both responses to that one port, but that is the source port it observed for both clients. Therefore, the answer is no (at least in this example).

Note: We also observed that the problematic port is likely the same one (34118). Not sure if the info is helpful.

@markmandel
Contributor

Hmm, not 100% sure I followed that. I need to double check the code, because I know this got optimised a while back (not by me, so I'm not as familiar anymore) so we could handle way more endpoints per proxy, but I'm fairly sure it should be:

image

I.e. for each client connecting, there should be a different proxy port that the gameserver sends packets back to.

If it's the same port, I'm not sure how we differentiate which packet should go where 🤔 are you saying there is only one quilkin proxy port being used by the gameserver process?
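
To make the expected behaviour concrete, here is a minimal Python sketch of a per-client session table, assuming one upstream UDP socket per (client, endpoint) pair. It is not Quilkin's implementation; the class and method names are hypothetical, and the sockets bind to localhost only so the example is runnable anywhere.

```python
import socket


class SessionTable:
    """One upstream socket per (client address, endpoint address) pair."""

    def __init__(self):
        self._by_key = {}    # (client, endpoint) -> upstream socket
        self._by_port = {}   # upstream local port -> client address

    def upstream_for(self, client, endpoint):
        """Return the upstream socket for this client/endpoint, creating it on first use."""
        key = (client, endpoint)
        sock = self._by_key.get(key)
        if sock is None:
            sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            sock.bind(("127.0.0.1", 0))  # the OS picks a free ephemeral port
            self._by_key[key] = sock
            self._by_port[sock.getsockname()[1]] = client
        return sock

    def client_for(self, upstream_port):
        """Reverse lookup used when the game server replies to an upstream port."""
        return self._by_port.get(upstream_port)

    def close(self):
        for sock in self._by_key.values():
            sock.close()
        self._by_key.clear()
        self._by_port.clear()
```

With this shape, two clients talking to the same game server always get two distinct local ports, so a reply arriving on a given port identifies exactly one client. The capture in this issue shows one port serving two clients, which this design would not produce.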

@XAMPPRocky
Collaborator

If it's one port, I'm going to reckon that the load balancer is not preserving the IP:port of the client when sending traffic to the proxy, so to the proxy it looks like a single client.

@zezhehh
Contributor Author

zezhehh commented Jul 15, 2024

@markmandel @XAMPPRocky Yes, we understand that the issue lies in the fact that the proxy uses the same port for two clients communicating with the game server. However, it is a symptom and not something we intentionally did (we didn’t change the source code to remap the socket usage). What we can confirm is:

  1. We are running multiple proxies, but the problem arises when one specific proxy is used to route the traffic for those two clients.
  2. After passing through the load balancer, traffic from the two clients reaches the proxy from two different source ports (same IP).
  3. The proxy forwards the packets from the clients to the game server using the same port.

(Points 2 and 3 can be observed in the dumped UDP traffic.)

@XAMPPRocky
Collaborator

XAMPPRocky commented Jul 15, 2024

@zezhehh To clarify, I mean that I'm not sure the load balancer is always providing a unique ip:port pair. I'm not saying you've made a change, but that, however the load balancer works or is configured, it may not always be sending unique addresses.

Would you be able to test this with a NodePort for proxy traffic instead? I think cutting out the load balancer will help us determine if you can replicate it with direct traffic to the proxy.

@markmandel
Contributor

> Yes, we understand that the issue lies in the fact that the proxy uses the same port for two clients communicating with the game server.

That wasn't what I was getting at. My point was that traffic from the proxy to the game server should go over two different ports at 10.64.65.99, since there should be a separate port for each backing client (and with two clients, there should be two ports).

Is that what you are seeing?

> The proxy forwards the packets from the clients to the game server using the same port.

It seems like you are... but that leaves me extremely confused, because then ALL traffic back to clients would only go to one client. Without a different port on the proxy for each connection to the game server and back -- there's no way to differentiate where the traffic should head back to.

@zezhehh
Contributor Author

zezhehh commented Jul 16, 2024

> @zezhehh To clarify, I mean that I'm not sure the load balancer is always providing a unique ip:port pair. I'm not saying you've made a change, but that, however the load balancer works or is configured, it may not always be sending unique addresses.
>
> Would you be able to test this with a NodePort for proxy traffic instead? I think cutting out the load balancer will help us determine if you can replicate it with direct traffic to the proxy.

Hmm.. I don't think the Load Balancer is the issue here. We have .spec.externalTrafficPolicy set to Local, as per the official doc:

> Local preserves the client source IP and avoids a second hop for LoadBalancer and NodePort type Services, but risks potentially imbalanced traffic spreading.

@zezhehh
Contributor Author

zezhehh commented Jul 16, 2024

@markmandel

Sorry for any confusion. I'll try to make it clear.

> The proxy forwards the packets from the clients to the game server using the same port.

Yes, it's what we're seeing. The game server records the client socket (which is actually the proxy socket) for each client, identified by user ID, and responds to this specific socket. In other words, the game server doesn’t check for conflicts with other sockets but simply sends the response to the originating socket.

This situation only occurs when the error arises. In most cases, everything functions as expected: two clients from two sockets...
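
The game server behaviour described above can be modelled in a few lines of Python. This is only a toy model under stated assumptions: the class, the function, and the user IDs are hypothetical, and the proxy side is reduced to a plain port-to-client dictionary. It shows why replies become indistinguishable once two users are seen from the same proxy socket.

```python
class GameServerModel:
    """Remembers, per user ID, the address the latest packet arrived from."""

    def __init__(self):
        self.reply_addr = {}

    def on_packet(self, user_id, src):
        # No conflict check: two users may end up with the same address.
        self.reply_addr[user_id] = src

    def reply_to(self, user_id):
        return self.reply_addr[user_id]


def proxy_return_client(sessions, upstream_port):
    """Proxy-side reverse lookup: one upstream port can name only one client."""
    return sessions.get(upstream_port)


SHARED_SOCKET = ("10.64.65.99", 34118)  # the proxy socket seen in the capture

server = GameServerModel()
server.on_packet("user-a", SHARED_SOCKET)  # hypothetical user IDs
server.on_packet("user-b", SHARED_SOCKET)

# Whatever the proxy stores for port 34118, it is a single client address.
sessions = {34118: ("35.189.221.32", 38405)}
targets = {
    user: proxy_return_client(sessions, server.reply_to(user)[1])
    for user in ("user-a", "user-b")
}
```

Both users resolve to the same client address, matching the capture in which replies meant for 35.189.221.32:38404 were delivered to 35.189.221.32:38405.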

@markmandel
Contributor

Got it - thanks.

Also, I assume there is more than one endpoint in play at this point as well? (Just so we can replicate this as closely as we can in a unit test.)

@zezhehh
Contributor Author

zezhehh commented Jul 18, 2024

> Also, I assume there is more than one endpoint in play at this point as well? (Just so we can replicate this as closely as we can in a unit test.)

@markmandel Yes, those automated testing matches occur within the same cluster as the real matches.

@markmandel markmandel self-assigned this Aug 23, 2024
markmandel added a commit to markmandel/quilkin that referenced this issue Aug 23, 2024
This is an integration test to ensure that concurrent clients to the
same proxy and endpoint didn't mix packets. Could not replicate the
reported issue below, but it felt like a good test to have for concurrency
testing.

Work on googleforgames#988
@markmandel
Contributor

I finally got some time to look into this - check out the test I wrote in #1010 -- unfortunately I could not replicate any of your reported issues

Would love you to look at the test though, see if there is something else to the scenario that I didn't manage to capture in the integration test. Let me know if you see anything.

markmandel added a commit that referenced this issue Aug 24, 2024
This is an integration test to ensure that concurrent clients to the
same proxy and endpoint didn't mix packets. Could not replicate the
reported issue below, but it felt like a good test to have for concurrency
testing.

Work on #988
@zezhehh
Contributor Author

zezhehh commented Aug 26, 2024

> I finally got some time to look into this - check out the test I wrote in #1010 -- unfortunately I could not replicate any of your reported issues
>
> Would love you to look at the test though, see if there is something else to the scenario that I didn't manage to capture in the integration test. Let me know if you see anything.

hmm.. Could you try to allocate the clients with the same ip and different ports?

@markmandel
Contributor

> hmm.. Could you try to allocate the clients with the same ip and different ports?

Unless you mean something else, the unit test has two sockets on the same IP (localhost) but different ports -- so I believe this tests this scenario, unless I am misunderstanding?
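
For anyone who wants to see the "same IP, different ports" scenario in isolation, here is a minimal Python sketch. It is not the test from #1010 and it leaves the proxy out entirely: two UDP client sockets on localhost talk to an echo server, which replies to the exact address each packet came from. The function name is hypothetical.

```python
import socket


def roundtrip_two_clients(timeout: float = 2.0):
    """Send one packet from each of two localhost clients and collect the echoes."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    clients = [socket.socket(socket.AF_INET, socket.SOCK_DGRAM) for _ in range(2)]
    try:
        server.bind(("127.0.0.1", 0))
        server.settimeout(timeout)
        for client in clients:
            client.bind(("127.0.0.1", 0))  # same IP, OS-assigned distinct port
            client.settimeout(timeout)

        for index, client in enumerate(clients):
            client.sendto(f"hello-{index}".encode(), server.getsockname())

        for _ in clients:
            data, sender = server.recvfrom(1024)
            server.sendto(data, sender)  # echo to the exact source address

        replies = [client.recvfrom(1024)[0] for client in clients]
        addresses = [client.getsockname() for client in clients]
        return replies, addresses
    finally:
        for sock in [server, *clients]:
            sock.close()
```

Each client gets its own payload back because the server distinguishes them by source port alone. Any component placed between them, such as a proxy, has to preserve that one-to-one mapping for the same result to hold.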

@zezhehh
Contributor Author

zezhehh commented Aug 26, 2024

> > hmm.. Could you try to allocate the clients with the same ip and different ports?
>
> Unless you mean something else, the unit test has two sockets on the same IP (localhost) but different ports -- so I believe this tests this scenario, unless I am misunderstanding?

Okay, then all good. Thanks! Our setup differs in a few other ways (the same token for both clients, etc.), but let's talk tomorrow! :)

@markmandel
Contributor

Just for easy discovery, assuming there's an issue in Quilkin, it's likely one of these spots:

So weird.


3 participants