How to Deterministically Release the Memory of a pyarrow.Table #45078

Open
georgikoyrushki95 opened this issue Dec 19, 2024 · 0 comments

Describe the usage question you have. Please include as many useful details as possible.

Hello, I have a Python use case involving Arrow Flight that is illustrated by the snippet below:

import pyarrow as pa
import pyarrow.flight as flight

def do_some_work(…):
    # … set-up
    client = flight.FlightClient(xxx)
    ticket = xxx

    reader = client.do_get(ticket)
    # Assume the table is quite large - 500 MB
    arrow_table: pa.Table = reader.read_all()

    # Assume res is a very small object, compared to
    # the size of the table.
    res = do_something_quick_with(arrow_table)

    # (A) From this point onwards arrow_table is no longer needed…

    # … rest of the pipeline that uses res and does a lot of other things …

The above snippet is a slight simplification. The real-world scenario is a little more complex, because the table is obtained inside a library I don’t have easy control over and is then passed to user-level code.

At point (A) above, benchmarking in high-volume scenarios has shown that freeing the memory held by arrow_table would help considerably. The table itself has no explicit .close() method, nor anything else indicating that its memory can be released on demand. A few things I have tried:

  • Obtaining the actual RecordBatchReader and calling close on it:
    reader: pa.RecordBatchReader = client.do_get(ticket).to_reader()
    # … use the reader to obtain the arrow table …

    # close the reader
    reader.close()
  • Deleting the reference to the arrow table via del and hoping at some point GC would kick in.
  • Deleting the reference via del and explicitly calling the GC (just for testing, I am aware this is not a recommended practice).

In the last two cases, as a debugging exercise, I printed the number of references to the arrow_table object before calling del. I expected it to be 1, but it was higher, so my assumption is that something inside the Flight framework is holding on to it.

With that said, my question is: is there a deterministic way, one that always works, to release the memory of a pyarrow.Table? I can see why doing this would be cumbersome in most cases, and why it is usually best to rely on reference counting plus the GC kicking in naturally, but in this particular case it would be quite useful.

I would also be grateful for some pointers on the lifetime implications of these objects in Python. It is not clear from the documentation, for example, whether the lifetime of arrow_table above is tied to the lifetime of the reader, or vice versa. Again, I appreciate that in 99% of cases we shouldn’t need to care about this, but for the remaining 1% a more in-depth explanation would be of great use!

P.S. I had to implement a near-identical example in Java, where the VectorSchemaRoot’s API conveniently exposes a .close() method, which works quite nicely for my use case.

Component(s)

Documentation, FlightRPC, Python
