Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resample beliefs about instantaneous sensors #118

Merged
merged 30 commits into from
Nov 22, 2022

Conversation

Flix6x
Copy link
Collaborator

@Flix6x Flix6x commented Oct 31, 2022

This PR implements resampling for a BeliefDataFrame describing an instantaneous sensor. It does so by updating the data frequency without changing the event resolution. At some point, I'd like to start introducing terminology to distinguish between the resolution of events and the frequency of the data, but not in this PR.

@Flix6x Flix6x added the Resampling Dealing with downsampling and/or upsampling label Oct 31, 2022
@Flix6x Flix6x self-assigned this Oct 31, 2022
@Flix6x Flix6x marked this pull request as ready for review October 31, 2022 13:01
@Flix6x Flix6x requested a review from nhoening October 31, 2022 13:02
Copy link
Contributor

@nhoening nhoening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably super useful, but can be much clearer.

timely_beliefs/tests/test_belief_utils.py Show resolved Hide resolved
timely_beliefs/tests/test_belief_utils.py Outdated Show resolved Hide resolved
timely_beliefs/beliefs/classes.py Outdated Show resolved Hide resolved
@@ -749,3 +751,21 @@ def extreme_timedeltas_not_equal(
if isinstance(td_a, pd.Timedelta):
td_a = td_a.to_pytimedelta()
return td_a != td_b


def downsample_first(df: pd.DataFrame, resolution: timedelta) -> pd.DataFrame:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function name was zero helpful to me. Comment also didn't help a lot.

Is this downsampling, and in case of non-obvious cases opting for the first event?

Also, handling DST correctly is so crucial, we could even consider it in to he name.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising this issue. For me it boils down to a more fundamental question: does resampling affect the data frequency or the event resolution? Hopefully these examples illustrate the question:

Let's say there is a sensor recording the average wind speed over a period of 3 seconds. That is, its event resolution is 3 seconds. And let's say a record is made every second, so we get a sliding window of measurements. That is, its data frequency is 1 second. (To be explicit, I'm following pandas terminology here, which measures frequency in units of time, say, s rather than 1/s.)

Now we want to resample and save the results to a new sensor that records the maximum 3-second wind gust in a 10 minute period, for every sequential period of 10 minutes (this is actually a useful measure, e.g. reported by KNMI). That is, the new sensor has an event resolution of 10 minutes, and also a data frequency of 10 minutes.

This resampling operation is downsampling, with both the event resolution and the data frequency being downsampled to 10 minutes. Something like .resample(pd.Timedelta("PT10M")).max() should do the trick to transform the data, and the event resolution would be updated to 10 minutes.

Now consider a sensor recording instantaneous temperature readings, once per hour. That is, its event resolution is 0 hours and its data frequency is 1 hour. What would it mean to resample to 30 minutes? I see two possibilities:

  1. We measure instantaneous temperature every 30 minutes. The event resolution stays 0 hours and the data frequency becomes 30 minutes. Something like .resample(pd.Timedelta("PT30M")).ffill() (or .interpolate()) would do the trick to transform the data. The event resolution would not change. We are upsampling (with regards to the data frequency).
  2. We measure average temperatures over a period of 30 minutes. Some combination of .resample().interpolate().rolling().mean() should do the trick to transform the data. The event resolution would be updated to 30 minutes. We are upsampling (with regards to the data frequency) and downsampling (with regards to the event resolution).

For example, given temperature readings:

10 °C at 1 PM
12 °C at 2 PM

With linear interpolation, option 1 yields:

10 °C at 1 PM
11 °C at 1.30 PM
12 °C at 2 PM

With linear interpolation, option 2 yields:

10.5 °C average between 1 PM and 1.30 PM
11.5 °C average between 1.30 PM and 2 PM

For now, I'm gravitating towards a policy like "in timely-beliefs, event resampling, by default, updates both the data frequency and the event resolution for non-instantaneous sensors, and only the data frequency for instantaneous sensors." But I'd like to support updating the event resolution for instantaneous sensors, too, maybe only when explicitly requested. For example, consider instantaneous battery SoC measurements every 5 minutes. When resampled to a day, one might not really be interested to know the SoC at midnight every day (.first(), and keep the original zero event resolution), but rather something like the daily min or max SoC (.min()/.max(), and update the event resolution to 1 day).

Copy link
Contributor

@nhoening nhoening Nov 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the distinction now (between treating data frequency and event resolution)

W.r.t. your last paragraph,

  1. I can imagine resampling policies, which can be chosen by way of general configuration setting or even per sensor attribute. You named two policies here, hopefully we could keep that simple for now. And you know which default you'd prefer.
  2. The text you've written can probably enter the timely-beliefs documentation, so everybody can learn about our thoughts here.

Finally, you made no suggestion for a better function name yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resample_to_frequency or resample_frequency?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed the function.

I can imagine resampling policies

I now made the util function accept different resampling methods, and restricted its use in resample_events to BeliefsDataFrames containing as many events as data points. Solving the general case is a lot harder, but this should already support most use cases.

The text you've written can probably enter the timely-beliefs documentation, so everybody can learn about our thoughts here.

I opened #123 for this.



def downsample_first(df: pd.DataFrame, resolution: timedelta) -> pd.DataFrame:
"""Resample data representing instantaneous events.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumption (instantaneous events) is never checked. Or that the index is even a DateTimeIndex

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you reply on this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm more leaning towards making this a private function. Would that take away your concern?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite. If it matters so much, why not add an assert statement? I'm guessing you wouldn't be sure what the outcomes would be if df wasn't instantaneous?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #124 as a follow-up.

timely_beliefs/beliefs/utils.py Outdated Show resolved Hide resolved
@Flix6x
Copy link
Collaborator Author

Flix6x commented Nov 15, 2022

FYI After you get a chance to reply to our open comments, I plan to clear up the terminology in the docstrings with regards to the sensor resolution vs data frequency.

@Flix6x Flix6x added this to the 1.15.0 milestone Nov 21, 2022
Copy link
Contributor

@nhoening nhoening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve with the logic in place here, but I still don't grok something about the instantaneous case.

timely_beliefs/tests/test_belief_utils.py Outdated Show resolved Hide resolved
@@ -1416,6 +1434,13 @@ def resample_events(
return self
df = self

# Resample instantaneous sensors
# The event resolution stays zero, but the event frequency is updated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the event resolution doesn't stay zero in all cases, now, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does here, because the resampling method defaults to asfreq.

@@ -1416,6 +1434,13 @@ def resample_events(
return self
df = self

# Resample instantaneous sensors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't explain why your new function only applies to instantaneous sensors. Can you add that?

And why is that the one case in which we take so much care of DST transitions? (this resample_events function here has two other cases)

Note: The NB about keep_only_most_recent_belief=True applies to only one case, the reader might assume it's about all of them.



def downsample_first(df: pd.DataFrame, resolution: timedelta) -> pd.DataFrame:
"""Resample data representing instantaneous events.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite. If it matters so much, why not add an assert statement? I'm guessing you wouldn't be sure what the outcomes would be if df wasn't instantaneous?

@Flix6x
Copy link
Collaborator Author

Flix6x commented Nov 21, 2022

Not quite. If it matters so much, why not add an assert statement? I'm guessing you wouldn't be sure what the outcomes would be if df wasn't instantaneous?

You guessed correctly. I really intended this PR to be only about resampling instantaneous sensor data. That means I just haven't investigated yet whether all of this DST handling logic would be useful for resampling non-instantaneous sensor data, too. I'll gladly make that a follow-up ticket.

One extra complexity is that the function currently works on regular DataFrames, too. They don't have a sensor as metadata.

Copy link
Contributor

@nhoening nhoening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarifications!!

@Flix6x Flix6x merged commit 47a28a0 into main Nov 22, 2022
@Flix6x Flix6x deleted the resample-beliefs-about-instantaneous-sensors branch November 22, 2022 20:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Resampling Dealing with downsampling and/or upsampling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants