Extracting file handles from DataArray #10320

charles-turner-1 · 2025-05-14T02:25:34Z

I've been working on some functionality that lets users inspect any of the following:

A netCDF file
A list of netCDF files (as strings and/or paths)
An xarray Dataset
An xarray DataArray
An intake-esm catalogue

in order to validate the user-supplied chunking passed to xarray when opening a dataset. The idea is that by inspecting the netCDF disk chunks, we can easily check whether the user-provided chunks are integer multiples of the disk chunks (if not, we expect performance issues), and adjust them to match the disk chunks if they are not.
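To make the check concrete, here is a minimal sketch of the multiple-of-disk-chunk test and one possible adjustment rule. The function names `misaligned_dims` and `align_chunks` are hypothetical and not taken from the tool or from xarray, the rounding rule (nearest multiple, at least one disk chunk) is only one of several reasonable choices, and chunk sizes are assumed to be positive integers keyed by dimension name (so values such as -1 or "auto" are not handled).

```python
from __future__ import annotations


def misaligned_dims(user_chunks: dict[str, int], disk_chunks: dict[str, int]) -> list[str]:
    """Return dimensions whose requested chunk is not an integer multiple of the disk chunk."""
    return [
        dim
        for dim, size in user_chunks.items()
        if dim in disk_chunks and size % disk_chunks[dim] != 0
    ]


def align_chunks(user_chunks: dict[str, int], disk_chunks: dict[str, int]) -> dict[str, int]:
    """Round each requested chunk to the nearest multiple of the disk chunk.

    Dimensions without a known disk chunk are passed through unchanged, and the
    result is never smaller than a single disk chunk.
    """
    aligned: dict[str, int] = {}
    for dim, size in user_chunks.items():
        disk = disk_chunks.get(dim)
        if disk is None:
            aligned[dim] = size
            continue
        multiple = max(1, round(size / disk))
        aligned[dim] = multiple * disk
    return aligned


# Example values chosen for illustration only.
user = {"time": 100, "lat": 90}
disk = {"time": 12, "lat": 45}
bad_dims = misaligned_dims(user, disk)
fixed = align_chunks(user, disk)
```

In this example only `time` is misaligned (100 is not a multiple of 12), and the sketch moves it to 96 while leaving `lat` at 90.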

Relevant functionality is here and a PR improving the functionality here.

However, I'm stumped on whether it's possible to actually extract all the file handles from an xr.DataArray, if opened with xr.open_mfdataset:

from functools import partial
from pathlib import Path

from xarray import DataArray, Dataset


def _get_file_handles(dataset: Dataset | DataArray) -> list[Path]:
    """
    Get the file handles from a dataset or dataarray.

    Parameters
    ----------
    dataset : Dataset | DataArray
        The dataset or dataarray to get the file handles from.

    Returns
    -------
    list[Path]
        A list of file handles.
    """

    if encoding_fname := dataset.encoding.get("source", False):
        # We must have a single file.
        return [Path(encoding_fname)]

    # If not, we are going to need to extract file handles from the ._close
    # attribute
    file_handles: list[Path] = []
    if isinstance(dataset._close, partial):
        # Extract the list of bound methods from the partial's arguments
        bound_methods = dataset._close.args[0]
        for bound_method in bound_methods:
            # The bound method is tied to a NetCDF4DataStore object
            if hasattr(bound_method, "__self__"):
                file_handles.append(Path(bound_method.__self__._filename))

    return file_handles

A dataset opened with xr.open_mfdataset has a ._close attribute from which the file handles can be extracted with the above logic. However, creating a DataArray from that dataset sets the .encoding['source'] attribute to the first file handle in the list of paths passed in to open the dataset, and I seem to lose any access to the full set of file handles.
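The extraction trick itself can be illustrated without xarray or any netCDF files. The sketch below uses stand-in objects: `FakeStore`, `close_all`, and `paths_from_closer` are hypothetical names invented for this illustration, and the only assumption carried over from the function above is that the closer is a `functools.partial` whose first positional argument is a list of bound `close` methods on objects carrying a `_filename` attribute.

```python
from functools import partial
from pathlib import Path


class FakeStore:
    """Hypothetical stand-in for a backend data store that remembers its file name."""

    def __init__(self, filename: str) -> None:
        self._filename = filename

    def close(self) -> None:
        pass


def close_all(closers):
    """Hypothetical stand-in for a callable that closes every underlying file."""
    for closer in closers:
        closer()


def paths_from_closer(closer) -> list[Path]:
    """Recover file paths from a partial whose first argument is a list of bound methods."""
    if not isinstance(closer, partial) or not closer.args:
        return []
    return [
        Path(method.__self__._filename)
        for method in closer.args[0]
        if hasattr(method, "__self__")
    ]


stores = [FakeStore("a.nc"), FakeStore("b.nc")]
closer = partial(close_all, [store.close for store in stores])
paths = paths_from_closer(closer)
```

A closer that is a single bound method rather than a partial yields an empty list here, which mirrors the situation described above where the full set of files is no longer reachable through this route.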

Since Datasets and DataArrays are lazily loaded, I assume a reference to each underlying file must still exist somewhere and could be accessed, even if doing so is a bit hacky.

Incidentally, AFAIK xarray doesn't really provide any mechanism to confirm that user-provided chunks line up with disk chunks, which is the gap the tool I've been working on aims to address (I know a warning is emitted if the chunking splits disk chunks, but no information is given on how to fix it). If this can be done cleanly, I'm happy to open a PR adding the functionality if the community thinks it would be useful.

Replies: 0 comments
