[Question] Efficiently selecting nearest time data per group in Xarray #10233

Parrot7483 · 2025-04-18T11:41:35Z

Parrot7483
Apr 18, 2025

Subject: Efficiently selecting nearest time data per group (sv) in Xarray

Hi Xarray community,

I'm working with GNSS data where I need to calculate satellite positions based on ephemeris data. I have two main Xarray Datasets:

ranges: Contains observation data, indexed by coordinates sv (satellite ID, string) and time (datetime). This dataset defines the (sv, time) pairs for which I need calculations.
nav_data: Contains satellite ephemeris data, also indexed by sv (string) and time (datetime). For each sv, there are multiple entries at different times, representing updated ephemeris messages.

Goal:

For each (sv, time) coordinate pair present in ranges, I need to select the corresponding ephemeris data from nav_data using the following logic:

Match the sv coordinate exactly.
For that specific sv, find the entry in nav_data with the nearest time to the time from the ranges pair.

Current (Slow) Approach:

I'm currently using nested loops, which is inefficient for my dataset size (potentially thousands of time steps and multiple satellites):

# ranges has coordinates sv, time
# nav_data has coordinates sv, time
# result is pre-allocated with coordinates matching ranges

for satellite in ranges.sv.values:
    # Pre-filter nav_data for the current satellite
    nav_data_sat = nav_data.sel(sv=satellite).dropna(dim='time', how='all')

    # Iterate through the time coordinates relevant for the calculation
    for dt in ranges.time.values:
        # Find the single ephemeris entry for 'satellite' closest in time to 'dt'
        # This assumes we need a result for every combination, adjust if ranges is sparse
        ephemeris = nav_data_sat.sel(time=dt, method='nearest')

        # Perform calculation using the selected ephemeris and dt
        # x, y, z, ... = satellite_position_velocity_clock_correction(ephemeris, dt)

        # Store results for this specific (dt, satellite) pair
        # result['x'].loc[dt, satellite] = x
        # ... etc ...

entire module here

Challenge & Attempts:

I need a vectorized Xarray solution to replace these loops. I've tried:

nav_data.reindex_like(ranges, method='nearest'): This doesn't work as intended because method='nearest' is applied to both time (which is desired) and sv (which is not desired – I need an exact match for sv).
nav_data.groupby('sv').apply(...): I attempted to group nav_data by sv and then use reindex or sel with method='nearest' within the applied function, something like lambda ds: ds.reindex(time=ranges.time, method='nearest'). However, I ran into issues getting this to work correctly, possibly related to handling the coordinates and combining the results back.
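One thing that may help with the first attempt: reindex accepts per-dimension indexers, so the nearest lookup can be limited to time while sv is left alone. Below is a minimal sketch with made-up values; the satellite IDs G01/G02, the variable a, and the name range_times are placeholders, not part of the real datasets. Note that the lookup matches index labels, not valid data, so an entry that is NaN for a given sv is still picked if its timestamp is nearest.

```python
import numpy as np
import xarray as xr

# Made-up stand-ins for nav_data and for the time coordinate of ranges.
nav_times = np.array(
    ["2025-01-01T00:00", "2025-01-01T06:00", "2025-01-01T12:00"],
    dtype="datetime64[ns]",
)
nav_data = xr.Dataset(
    {"a": (("sv", "time"), [[1.0, np.nan, 3.0], [10.0, 20.0, 30.0]])},
    coords={"sv": ["G01", "G02"], "time": nav_times},
)
range_times = np.array(
    ["2025-01-01T02:30", "2025-01-01T06:15"], dtype="datetime64[ns]"
)

# Reindex along time only; sv is not passed, so its labels are left untouched.
aligned = nav_data.reindex(time=range_times, method="nearest")

# G01 has no entry at 06:00, yet 06:00 is the nearest label to 06:15,
# so the second value for G01 comes back as NaN.
print(aligned.a.values)
```

This is why the per-satellite dropna in the loop version matters: without it, the gaps in nav_data leak into the result.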

Question:

What is the idiomatic Xarray way to efficiently perform this grouped nearest-neighbor lookup? Specifically, how can I select data from nav_data based on the (sv, time) coordinates in ranges, ensuring an exact match on sv and a nearest match on time for each group?

Thanks for any guidance or suggestions!

dcherian · 2025-04-18T16:21:02Z

dcherian
Apr 18, 2025
Maintainer

Interesting problem! Can you write a minimal example with synthetic data that we could test out please?

Parrot7483 · 2025-04-23T15:48:34Z

Parrot7483
Apr 23, 2025
Author

Here you go.

import numpy as np
import xarray as xr

def f(ephemeris, x):
    a = ephemeris.a.item()
    b = ephemeris.b.item()

    return a * x + b

# Function to compute values based on nearest time
def compute_nearest_time_values(ephemeris, observations, x):
    """
    Computes f(x) = a * x + b for each (id, time) pair in observations,
    using the coefficients a and b from the ephemeris entry with the same id
    and the nearest time.
    """
    # Pre-allocate the result with the same (id, time) coordinates as observations
    result = xr.Dataset(
        {
            "value": (("id", "time"), np.empty((len(observations.id), len(observations.time))))
        },
        coords={"id": observations.id, "time": observations.time},
    )

    for identifier in observations.id:
        # Exact match on id; drop time steps without any ephemeris data for this id
        ephemeris_id = ephemeris.sel(id=identifier).dropna(dim="time", how="all")

        for time in observations.time:
            # Find the nearest remaining ephemeris entry in time
            nearest = ephemeris_id.sel(time=time, method="nearest")
            # Compute the value
            value = f(nearest, x)
            # Store the result
            result["value"].loc[identifier, time] = value

    return result

# Dummy ids
i0, i1 = "id0", "id1"

# Dummy ephemeris
e0, e1, e2, e3 = "2025-01-01T00:00", "2025-01-01T06:00", "2025-01-01T12:00", "2025-01-01T18:00"
ephemeris = xr.Dataset(
    {
        "a": (("id", "time"), [[1, np.nan, 3, 4], [10, 20, np.nan, 40]]),
        "b": (("id", "time"), [[5, np.nan, 3, 2], [50, 40, np.nan, 20]]),
    },
    coords={"id": [i0, i1], "time": np.array([e0, e1, e2, e3], dtype="datetime64")},
)

# Dummy observation
o0, o1, o2 = "2025-01-01T02:30", "2025-01-01T06:15", "2025-01-01T12:15"
observations = xr.Dataset(
    coords={"id": [i0, i1], "time": np.array([o0, o1, o2], dtype="datetime64")},
)

x = 10
result = compute_nearest_time_values(ephemeris, observations, x)

assert float(result.sel(id=i0, time=o0).value) == 1 * x + 5
assert float(result.sel(id=i0, time=o1).value) == 3 * x + 3 # 06:00 does not exist, 12:00 is nearest
assert float(result.sel(id=i0, time=o2).value) == 3 * x + 3
assert float(result.sel(id=i1, time=o0).value) == 10 * x + 50
assert float(result.sel(id=i1, time=o1).value) == 20 * x + 40
assert float(result.sel(id=i1, time=o2).value) == 40 * x + 20 # 12:00 does not exist, 18:00 is nearest

Parrot7483 · 2025-05-05T09:08:17Z

Parrot7483
May 5, 2025
Author

The following approach works. It is reasonably faster than the nested loops, but not perfect:

Group by id and time using UniqueGrouper and BinGrouper.
Squeeze the id coordinate.
Select using method="pad" or method="nearest".
Collapse the time_bins coordinate using max or min to get a dense matrix.
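For anyone who wants something runnable in the meantime, here is a minimal sketch of a related but simpler technique. It is not the grouper-based recipe listed above: it still loops over ids, but drops the missing ephemeris entries per id and lets sel(..., method="nearest") handle all target times in one vectorized call. The helper name select_nearest_per_id and the extra variable ephemeris_time are made up for illustration; the data mirrors the synthetic example earlier in the thread.

```python
import numpy as np
import xarray as xr


def select_nearest_per_id(ephemeris, target_times):
    """Hypothetical helper: exact match on id, nearest valid entry in time."""
    selected = []
    for identifier in ephemeris.id.values:
        # Exact match on id, then drop time steps where all variables are NaN.
        per_id = ephemeris.sel(id=identifier).dropna(dim="time", how="all")
        # One vectorized nearest-neighbour lookup for all target times.
        nearest = per_id.sel(time=target_times, method="nearest")
        # Keep the matched ephemeris times, then relabel with the target times
        # so that the per-id results share one time axis.
        nearest = nearest.assign(ephemeris_time=("time", nearest.time.values))
        nearest = nearest.assign_coords(time=target_times)
        selected.append(nearest)
    # The scalar id coordinate of each piece becomes the new id dimension.
    return xr.concat(selected, dim="id")


ephemeris_times = np.array(
    ["2025-01-01T00:00", "2025-01-01T06:00", "2025-01-01T12:00", "2025-01-01T18:00"],
    dtype="datetime64[ns]",
)
ephemeris = xr.Dataset(
    {
        "a": (("id", "time"), [[1, np.nan, 3, 4], [10, 20, np.nan, 40]]),
        "b": (("id", "time"), [[5, np.nan, 3, 2], [50, 40, np.nan, 20]]),
    },
    coords={"id": ["id0", "id1"], "time": ephemeris_times},
)
target_times = np.array(
    ["2025-01-01T02:30", "2025-01-01T06:15", "2025-01-01T12:15"],
    dtype="datetime64[ns]",
)

x = 10
selected = select_nearest_per_id(ephemeris, target_times)
# Vectorized replacement for calling f(ephemeris, x) on every (id, time) pair.
value = selected.a * x + selected.b
```

The remaining Python loop runs once per id rather than once per (id, time) pair, which is usually the cheaper axis when there are a few dozen satellites and thousands of time steps.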

2 replies

dcherian May 8, 2025
Maintainer

Nice, sorry I didn't get to it. Can you post some code for anyone reading this later?

Parrot7483 May 15, 2025
Author

Code will come soon

Parrot7483 · 2025-05-15T15:52:22Z

Parrot7483
May 15, 2025
Author

I think the main issue is that xarray does not support selecting along multiple dimensions with a different method for each. One would need method=None (exact match) for one dimension and method="pad" for the other. I just ran into the same problem again.
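Since a single sel call applies one method to every indexer, one possible workaround is to chain two calls: the first with the default exact matching on id, the second with method="pad" on time only. A minimal sketch, assuming dense data with no missing entries; the values and the names targets and unknown_id_rejected are made up for illustration.

```python
import numpy as np
import xarray as xr

times = np.array(
    ["2025-01-01T00:00", "2025-01-01T06:00", "2025-01-01T12:00"],
    dtype="datetime64[ns]",
)
ephemeris = xr.Dataset(
    {"a": (("id", "time"), [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])},
    coords={"id": ["id0", "id1"], "time": times},
)
targets = np.array(
    ["2025-01-01T02:30", "2025-01-01T06:15"], dtype="datetime64[ns]"
)

# Step 1: exact match on id (method=None is the default).
by_id = ephemeris.sel(id=["id0", "id1"])
# Step 2: pad on time only, i.e. the last entry at or before each target.
padded = by_id.sel(time=targets, method="pad")

# An unknown id fails loudly instead of snapping to a neighbouring label.
try:
    ephemeris.sel(id=["id0", "id9"])
    unknown_id_rejected = False
except KeyError:
    unknown_id_rejected = True
```

With per-id gaps in the data this is not enough on its own, because the inexact match is made against the shared time index rather than against the entries that are valid for each id.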
