Fix for unexpected socket closures and data leakage under heavy load #646

Open: wants to merge 2 commits into base: master

Conversation

todddialpad

This addresses issue #645, as well as aiohttp/aiohappyeyeballs#93 and aiohttp/aiohappyeyeballs#112.

  # libuv will make socket non-blocking
- tr._open(sock.fileno())
+ tr._open(sockfd)
Member

The approach looks correct -- but I'm wondering how vanilla asyncio handles the same thing?

Author

I think vanilla asyncio has an easier problem: it can use Python sockets "all the way down" and let reference counting take care of cleanup, while here we need to manage the disconnect with libuv, which deals in file descriptors. I suspect there is some error-handling path where a file descriptor is closed while the Python socket object remains alive and not detached, so when that object is finally closed, it breaks any new socket that happens to have been given the same file descriptor.
e.g. create socket s, call a loop method passing in an explicit socket, <bad error path which will end with sock.close()> overlapping with an .accept. I think the .accept never results in a Python socket object being created.

So, between the methods that accept sockets and the methods that work directly on file descriptors internally, can there be a discrepancy?
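
The hazard described above can be sketched with plain sockets on a POSIX system. This is only an illustration of descriptor-number reuse, not uvloop's code: `os.close()` stands in for a lower layer (such as libuv) closing the descriptor while the Python socket object is still alive and not detached, and `stale_close_demo` is a hypothetical name.

```python
import os
import socket


def stale_close_demo():
    """Return (reused, broken) for a stale socket object closing a reused fd."""
    stale = socket.socket()
    fd = stale.fileno()

    # Stand-in for a lower layer closing the descriptor behind Python's back.
    os.close(fd)

    # POSIX hands out the lowest free descriptor number, so an unrelated new
    # socket will usually receive the number that was just released.
    victim = socket.socket()
    reused = victim.fileno() == fd

    # The stale object still believes it owns `fd` and closes that number.
    try:
        stale.close()
    except OSError:
        pass  # raised when the number was not reused (it is already closed)

    # If the number was reused, the unrelated socket has just lost its fd.
    try:
        victim.getsockopt(socket.SOL_SOCKET, socket.SO_TYPE)
        broken = False
    except OSError:
        broken = True

    try:
        victim.close()
    except OSError:
        pass  # its descriptor may already be gone
    return reused, broken


reused, broken = stale_close_demo()
print(f"fd number reused: {reused}, unrelated socket broken: {broken}")
```

Whenever the number is reused, the unrelated socket ends up broken, which is the kind of cross-talk this comment is guessing at.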

@MarkusSintonen

@todddialpad very nice! Do you know whether it helps with the other issue, #506, which also seems to be related to incorrect sharing of sockets?

Is there any possibility of adding a test here?

@todddialpad
Author

> @todddialpad very nice! Do you know whether it helps with the other issue, #506, which also seems to be related to incorrect sharing of sockets?
>
> Is there any possibility of adding a test here?

I am trying to get a stable test. It is tricky because, if my guess is correct, it is a race condition. I think the race occurs when TLS negotiation during a call to loop.create_connection with an explicit socket is cancelled, and a subsequent incoming connection is accepted before the CancelledError is propagated. I think both libuv (or uvloop) and then aiohttp close the underlying file descriptor.

So if this is the case, I don't think this will fix issue #506, which could have a similar but different root cause.
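
The first half of that scenario, a TLS handshake cancelled while the loop holds an explicitly supplied socket, can be sketched on the standard asyncio loop. This is not a reproducer for the bug (the race also needs a concurrent accept and uvloop); it only shows the shape of the call sequence. The never-accepting local listener, the 0.2 second timeout, and the function name are assumptions made for illustration.

```python
import asyncio
import socket
import ssl


async def cancelled_tls_handshake(timeout=0.2):
    """Time out a TLS handshake on an explicitly supplied socket."""
    loop = asyncio.get_running_loop()

    # A listener that never accept()s: the TCP connect succeeds through the
    # kernel backlog, but nothing ever answers the TLS ClientHello.
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    sock = socket.socket()
    sock.setblocking(False)
    await loop.sock_connect(sock, listener.getsockname())

    timed_out = False
    try:
        await asyncio.wait_for(
            loop.create_connection(
                asyncio.Protocol,
                sock=sock,
                ssl=ssl.create_default_context(),
                server_hostname="localhost",
            ),
            timeout,
        )
    except asyncio.TimeoutError:
        timed_out = True  # the handshake was cancelled mid-flight

    # Give the loop a few iterations to run its transport cleanup callbacks.
    for _ in range(3):
        await asyncio.sleep(0)
    fileno_after = sock.fileno()  # -1 means the socket object was closed

    sock.close()  # closing a Python socket object twice is harmless
    listener.close()
    return timed_out, fileno_after


timed_out, fileno_after = asyncio.run(cancelled_tls_handshake())
print(f"timed out: {timed_out}, sock.fileno() afterwards: {fileno_after}")
```

The caller still holds `sock` after the timeout, so any cleanup it performs on that object happens in addition to whatever the event loop already did with the underlying descriptor.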

@MarkusSintonen

> So if this is the case, I don't think this will fix issue #506, which could have a similar but different root cause.

OK, I see. The linked issue was also concerning, as it looked like it was trying to write data into an incorrect socket. We also observed that error at around the same times that we observed response data being leaked to incorrect requests. But we don't know whether that issue is actually related to the data leakage or is something else. (These RuntimeErrors don't happen with vanilla asyncio.)

@todddialpad
Author

> > @todddialpad very nice! Do you know whether it helps with the other issue, #506, which also seems to be related to incorrect sharing of sockets?
> > Is there any possibility of adding a test here?
>
> I am trying to get a stable test. It is tricky because, if my guess is correct, it is a race condition. I think the race occurs when TLS negotiation during a call to loop.create_connection with an explicit socket is cancelled, and a subsequent incoming connection is accepted before the CancelledError is propagated. I think both libuv (or uvloop) and then aiohttp close the underlying file descriptor.
>
> So if this is the case, I don't think this will fix issue #506, which could have a similar but different root cause.

I still haven't been able to isolate a standalone, self-contained test. The test environment in which I generated the same error we see in production involves 2 VMs with significant network latency between them. The first VM is just a web server; the second is a web server that accepts requests and then makes outgoing client requests (using aiohttp) to the first web server with TLS and a short timeout (around 1 second).

With this setup, I quite reliably get a failure within 250 connections. When I run with this patch applied, I have never had a failure in 20,000 connections.

We have also run this in our production environment. When we first encountered this failure, we hit it within 1 hour of using aiohttp >= 3.10. Since running with this patch we have been running for 5 days with no failures.

@todddialpad
Author

Is accepting this blocked on the tests that are failing? I don't think those failures are related to this change, as they are also failing for PR #644, which is solely a documentation change.

I looked at the test logs and I would guess that a dependency is causing the changed results. Related to this, I notice that in the failing tests, an alpha release of Cython 3.1 is being used (Using cached Cython-3.1.0a1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata). Is this intentional?
In the last test run that passed, the release version was used (Using cached Cython-3.0.11-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata).

@AntonArsentiev

Hello everyone. Am I right in thinking that this MR fixes the issue below?

"RuntimeError: File descriptor 2877 is used by transport <TCPTransport closed=False reading=True 0x55b8dc9baa90>"

@AntonArsentiev

Hello everyone :)

Like many other users of this library, I would be happy for this fix to be implemented in one of the upcoming releases.
Could you figure out roughly when to expect this fix?

@Dreamsorcerer

@1st1 @fantix Is anyone available to merge and release this? We're getting asked to put workarounds into aiohttp to deal with this, would be nice to have this fixed here instead.

@bdraco

bdraco commented Mar 31, 2025

We added a workaround for this issue in aio-libs/aiohttp#10464, but it's causing issues when used with the asyncio SelectorEventLoop (aio-libs/aiohttp#10617), so we will likely be reverting it and waiting for this PR instead.

@webknjaz

Hey @elprans @1st1, do you think you'll be able to spare a minute to get this in?

bdraco added a commit to aio-libs/aiohttp that referenced this pull request Apr 1, 2025
…10464 (#10656)

Reverts #10464

While this change improved the situation for uvloop users, it caused a
regression with `SelectorEventLoop` (issue #10617)

The alternative fix is MagicStack/uvloop#646
(not merged at the time of this PR)

issue #10617 appears to be very similar to
python/cpython@d5aeccf

If someone can come up with a working reproducer for #10617 we can
revisit this.
cc @top-oai

Minimal implementation that shows on cancellation the socket is cleaned
up without the explicit `close`
#10617 (comment)
so this should be unneeded unless I've missed something (very possible
with all the moving parts here)

## Related issue number

fixes #10617
patchback bot pushed a commit to aio-libs/aiohttp that referenced this pull request Apr 1, 2025
…10464 (#10656)

(cherry picked from commit 06db052)
patchback bot pushed a commit to aio-libs/aiohttp that referenced this pull request Apr 1, 2025
…10464 (#10656)

(cherry picked from commit 06db052)
bdraco added a commit to aio-libs/aiohttp that referenced this pull request Apr 1, 2025
…'s a failure in start_connection() #10464 (#10657)

**This is a backport of PR #10656 as merged into master
(06db052).**

Co-authored-by: J. Nick Koston <[email protected]>
@1st1
Member

1st1 commented Apr 16, 2025

> Hey @elprans @1st1, do you think you'll be able to spare a minute to get this in?

Sorry, @fantix and I will be going through this PR and others this week.

@jperezr21

Hey! I noticed that aiohttp 3.11.14 has been yanked. For those of us using uvloop and aiohttp and running into the File descriptor 91 is used by transport error, do you happen to know if there’s a temporary workaround or a specific combination of versions we can pin to in the meantime? Totally understand if we need to wait for this to be merged, just trying to keep things running smoothly in the short term. Thanks a lot!

@Dreamsorcerer

You can pin to a yanked version.

@jugalshah291

Hi folks,
We are also running into the issue below:

File descriptor 91 is used by transport

I wonder if there is a fix.

Our setup:

aiohappyeyeballs==2.6.1
aiohttp==3.11.18
aiohttp-cors==0.8.1
uvloop==0.21.0

@jperezr21

Pinning to aiohttp==3.11.14 solved it for me

@bdraco

bdraco commented May 8, 2025

> Pinning to aiohttp==3.11.14 solved it for me

Just a heads-up: if you're using the default asyncio event loop (typically SelectorEventLoop), pinning to aiohttp==3.11.14 may introduce other issues due to some side effects in that version. If you're using uvloop exclusively, it's likely fine.

Ideally, we were hoping this PR would be merged to avoid relying on workarounds in aiohttp as we've already been down that road, had to revert, and don’t want a repeat. Unfortunately, this PR seems to have stalled.
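
Since the advice above depends on which event loop a service actually runs, a quick check may help before pinning. The sketch below uses only the standard library; `loop_package` is a hypothetical helper, and the expectation that it reports `uvloop` when uvloop's loop is installed is an assumption based on where that class is defined, so verify it in your own environment.

```python
import asyncio


def loop_package(loop):
    """Return the top-level package that implements `loop`."""
    return type(loop).__module__.split(".")[0]


async def main():
    # Inspect the loop that is actually running this coroutine.
    return loop_package(asyncio.get_running_loop())


package = asyncio.run(main())
print(f"event loop implemented by: {package}")
```

Run inside the real application (for example from a startup hook), this reports the loop in use there rather than the interpreter default.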

@jugalshah291

jugalshah291 commented May 9, 2025

Can we please prioritize this PR? It seems to be impacting many users.

@todddialpad
Author

@fantix thanks for having a look at this. I am not sure what to make of the test failures. They all seem to be related to Unix transports and subprocess transports, while the PR should only affect TCP transports. I'm trying to reproduce them. My dev environment is Ubuntu / py3.12, and the tests that are failing here are passing there. For example:

test_process_send_signal_1 (test_process.Test_UV_Process.test_process_send_signal_1) ... ok
test_process_streams_basic_1 (test_process.Test_UV_Process.test_process_streams_basic_1) ... ok
test_process_streams_devnull (test_process.Test_UV_Process.test_process_streams_devnull) ... ok
test_process_streams_pass_fds (test_process.Test_UV_Process.test_process_streams_pass_fds) ... ok

Do you have any ideas on how to proceed?

@fantix
Member

fantix commented May 9, 2025

They are breaking in the debug build, maybe try this:

- name: Test (debug build)
  if: steps.release.outputs.version == 0
  run: |
    make distclean && make debug && make test

@todddialpad
Author

  • Tests / test (3.12, macos-latest) (pull_request)

Yes, I get the failures with the debug build, good eye. Thanks.

I have instrumented the changed code, and in a failing test, the modifications never even run (which makes sense since the test isn't creating any TCP connections).

I have built without this patch, and still see the failures with the debug build.

git clone --recursive https://github.com/magicstack/uvloop.git uvloop.official
cd uvloop.official/
python3 -m venv uvloop-dev
source uvloop-dev/bin/activate
pip install -e .[dev]
pip install psutil
make debug
make test
======================================================================
FAIL: test_process_streams_pass_fds (test_process.Test_UV_Process.test_process_streams_pass_fds) [Alive handle after test] (handle_name='UVProcessTransport')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "uvloop.official/uvloop/_testbase.py", line 142, in tearDown
    self.assertEqual(
AssertionError: 1 != 0 : alive UVProcessTransport after test

So, could an upstream dependency have broken the debug build?

@todddialpad
Author

> So, could an upstream dependency have broken the debug build?

Since the last successful test run, the following upstream dependencies have changed:

Cython-3.1.0 (was 3.0.12)
aiohttp-3.11.18 (was 3.11.16)
frozenlist-1.6.0 (was 1.5.0)
mypy_extensions-1.1.0 (was 1.0.0)
setuptools-80.3.1 (was 78.1.0)

I rebuilt using Cython-3.0.12 and the tests passed.

Would a manual execution of the tests on the main branch still pass (assuming it will grab Cython 3.1.0)?

@todddialpad
Author

> Would a manual execution of the tests on the main branch still pass (assuming it will grab Cython 3.1.0)?

I forked the main branch and tried running the tests. It fails with Cython 3.1.0. I pinned Cython to < 3.1.0 and the tests pass. I included this PR, and with the pinned Cython, all tests pass.

So I believe this PR could be merged. I created issue #677 for the Cython 3.1.0 failures.

@jugalshah291

Hi, checking back on this. Is there any ETA on when it will be merged?

@entelligence-ai-reviews

Walkthrough

This update refines the management of socket file descriptors in the Loop class when creating new TCP transports. By detaching the file descriptor from the socket and transferring ownership directly to the transport, the code ensures clearer resource management and eliminates unnecessary references to the original socket object. This change clarifies the lifecycle and responsibility for the socket's file descriptor within the transport implementation.

Changes

| File(s) | Summary |
|---|---|
| uvloop/loop.pyx | Modified TCP transport creation to detach the file descriptor from the socket using sock.detach(), pass it to tr._open(), and remove the call to tr._attach_fileobj(sock), improving resource management and clarifying ownership. |
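
The ownership transfer summarized above relies on the standard `socket.detach()` call. The sketch below shows only that standard-library behaviour on a POSIX system; `TinyOwner` and `hand_off` are hypothetical stand-ins for a transport that owns a raw descriptor, not uvloop's `TCPTransport`.

```python
import os
import socket


class TinyOwner:
    """Hypothetical stand-in for a transport that owns a raw descriptor."""

    def __init__(self, fd):
        self.fd = fd

    def close(self):
        if self.fd != -1:
            os.close(self.fd)
            self.fd = -1


def hand_off(sock):
    """Detach the descriptor from `sock` and give it to a single owner."""
    fd = sock.detach()  # `sock` is now closed from Python's point of view
    return TinyOwner(fd)


sock = socket.socket()
original_fd = sock.fileno()
owner = hand_off(sock)

fileno_after_detach = sock.fileno()  # the object no longer refers to the fd
sock.close()  # harmless: there is nothing left for this object to close

# The descriptor itself is still open; only the owner may close it now.
os.fstat(owner.fd)
same_fd = owner.fd == original_fd
owner.close()
print(fileno_after_detach, same_fd)
```

After `detach()`, a late `sock.close()` from any error path can no longer touch a descriptor number that may since have been handed to a different connection.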

Sequence Diagram

This diagram shows the interactions between components:

sequenceDiagram
    title Socket File Descriptor Handling in TCP Transport
    
    participant Client
    participant EventLoop as "Event Loop"
    participant TCPTransport
    participant Socket
    participant Protocol
    
    Client->>EventLoop: Create connection
    activate EventLoop
    
    EventLoop->>TCPTransport: new(self, protocol, None, waiter, context)
    activate TCPTransport
    
    Note over EventLoop,TCPTransport: Before PR: Used sock.fileno() directly
    Note over EventLoop,TCPTransport: After PR: Takes ownership of file descriptor
    
    EventLoop->>Socket: detach()
    activate Socket
    Socket-->>EventLoop: sockfd (file descriptor)
    deactivate Socket
    
    EventLoop->>TCPTransport: _open(sockfd)
    Note right of TCPTransport: Uses detached file descriptor
    
    TCPTransport->>TCPTransport: _init_protocol()
    EventLoop->>EventLoop: await waiter
    
    alt Exception occurs
        EventLoop->>TCPTransport: _close()
        TCPTransport-->>EventLoop: Exception propagated
    else Success
        Note over EventLoop,TCPTransport: Before PR: Called _attach_fileobj(sock)
        Note over EventLoop,TCPTransport: After PR: This call was removed
        
        alt ssl is True
            EventLoop->>Protocol: _get_app_transport(context)
            Protocol-->>EventLoop: app_transport
        end
    end
    
    deactivate TCPTransport
    deactivate EventLoop