
replaced datapipe with ocf-data-sampler #323

Open · zaryab-ali wants to merge 8 commits into base: main
Conversation

zaryab-ali

Description

Removed ocf_datapipe from the entire file and replaced it with ocf_data_sampler. Some changes might still need to be made; needs reviewing.

Fixes #313

Comment on lines 186 to 191
    t0 = pd.Timestamp(sample["site_init_time_utc"][0])
    site_id = sample["site_id"][0]

    # Get valid times for this forecast
-   valid_times = pd.to_datetime(
-       [t0 + np.timedelta64((i + 1) * FREQ_MINS, "m") for i in range(n_valid_times)]
+   valid_times = pd.date_range(
+       start=t0 + pd.Timedelta(minutes=FREQ_MINS),
AUdaltsova (Contributor)
valid_times are supposed to be the timestamps the model makes predictions for, so they should not contain any history, i.e. they should start from t0, from which interval start and end are calculated. Judging by this TODO, we might not store it in the samples right now; you might have to either make the change in the dataset itself or pull interval_start/interval_end from the data config here to get the correct times.

zaryab-ali (Author)

Can you please have a look at the recent commit and let me know if the changes I made are correct or not?

Comment on lines 199 to 201
        start=t0 + pd.Timedelta(interval_start),
        end=t0 + pd.Timedelta(interval_end),
        freq=f"{time_resolution.astype(int)}min",
AUdaltsova (Contributor)

Lifting from the config is a good way to do it, but the sample ModelPipe is getting already has history, so the first time in it is not t0; it's t0 + interval_start. You can get t0 from last time - interval_end, for example, or first time - interval_start (you can see how that happens in this function).

zaryab-ali (Author) · Feb 26, 2025

Let me know if the following logic is OK:

    last_time = pd.Timestamp(sample["site_target_time_utc"][-1])
    t0 = last_time - pd.Timedelta(interval_end)

    valid_times = pd.date_range(
        start=t0 + pd.Timedelta(interval_start),
        end=t0 + pd.Timedelta(interval_end),
        freq=f"{time_resolution.astype(int)}min",
    )

AUdaltsova (Contributor)

Sorry, I should've explained better how time windows in our samples work. When a forecast is made at time t0 (which we also sometimes refer to as the init time), it creates a forecast for x minutes forward, which is what interval_end governs (in that case, interval_end = x). To do that, the model also needs information on what the generation was just before t0, so we supply y minutes of history, which is what interval_start governs (interval_start = -y in this example). So any sample will contain y + x minutes of generation data, some for history and some to check the forecast against at training time. When running inference, we are only interested in the [t0, t0 + interval_end] period, so these are the dates we need to extract from the sample.
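To make the window arithmetic concrete, here is a minimal sketch with made-up numbers (60 minutes of history, an 8-hour horizon, and 30-minute resolution are illustrative values, not ones taken from this repo):

    import pandas as pd

    interval_start = pd.Timedelta(minutes=-60)  # -y: 60 minutes of history
    interval_end = pd.Timedelta(minutes=480)    # x: 480 minutes of forecast
    freq = pd.Timedelta(minutes=30)

    # A sample's target times span [t0 + interval_start, t0 + interval_end]
    t0 = pd.Timestamp("2025-02-26 12:00")
    sample_times = pd.date_range(t0 + interval_start, t0 + interval_end, freq=freq)

    # Recover t0 from either end of the window, then keep only [t0, t0 + interval_end]
    t0_recovered = sample_times[-1] - interval_end  # == sample_times[0] - interval_start
    valid_times = sample_times[sample_times >= t0_recovered]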

Comment on lines 188 to 192
    config = load_yaml_configuration(self.config_path)

    interval_start = np.timedelta64(config.input_data.site.interval_start_minutes, "m")
    interval_end = np.timedelta64(config.input_data.site.interval_end_minutes, "m")
    time_resolution = np.timedelta64(config.input_data.site.time_resolution_minutes, "m")
AUdaltsova (Contributor)

Also, I would probably unpack the config in the init, or maybe even move the whole thing to main and pass the extracted interval_end (or whatever you end up using), to avoid doing it again and again for every batch.
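A minimal sketch of that suggestion, assuming the class is the ModelPipe mentioned above and that the config loader matches the diff (both names come from this thread; the exact layout is illustrative, not the repo's actual code):

    import numpy as np

    from ocf_data_sampler.config import load_yaml_configuration  # import path assumed

    class ModelPipe:  # hypothetical constructor; the real class takes more arguments
        def __init__(self, config_path: str):
            # Unpack the config once here instead of on every predict_batch call
            config = load_yaml_configuration(config_path)
            site = config.input_data.site
            self.interval_start = np.timedelta64(site.interval_start_minutes, "m")
            self.interval_end = np.timedelta64(site.interval_end_minutes, "m")
            self.time_resolution = np.timedelta64(site.time_resolution_minutes, "m")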


-   # Make mask dataset for sundown
+   # Get solar elevation and create sundown mask
    elevation = sample["site_solar_elevation"]
    da_sundown_mask = xr.DataArray(
        data=elevation < MIN_DAY_ELEVATION,
AUdaltsova (Contributor)

There used to be a pv_system dimension, which I would keep: the intention, I think, is for this to work for multiple sites simultaneously, and even if we end up using just one site this functionality would still allow it.
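For illustration, a self-contained sketch of keeping an explicit pv_system dimension on the mask (dimension names and dummy values are assumptions, not the repo's actual ones):

    import numpy as np
    import xarray as xr

    MIN_DAY_ELEVATION = 0  # illustrative threshold
    elevation = np.array([-5.0, 2.0, 10.0, 1.0, -3.0])  # stand-in for sample["site_solar_elevation"]
    site_id, times = 1, np.arange(5)

    # A leading pv_system axis keeps the mask usable for several sites at once
    da_sundown_mask = xr.DataArray(
        data=(elevation < MIN_DAY_ELEVATION)[np.newaxis, :],  # shape (pv_system, time)
        dims=["pv_system", "time_utc"],
        coords={"pv_system": [site_id], "time_utc": times},
    )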

-   data_pipeline = data_pipeline.set_length(num_batches)
+   # Filter for specific site IDs
+   dataset.valid_t0_and_site_ids = dataset.valid_t0_and_site_ids[
+       dataset.valid_t0_and_site_ids.site_id.isin(ALL_SITE_IDS)
+   ]
AUdaltsova (Contributor)

This is a good idea, but I think the selection might not work: at this point dataset.valid_t0_and_site_ids is, I think, organised as a pd.DataFrame with site_ids as column names (see here).
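The two layouts behave differently, as this toy example shows (the data is made up; only the filtering pattern matters):

    import pandas as pd

    # (a) site_id stored as a column: .isin on the column works
    df_a = pd.DataFrame({"site_id": [1, 2, 3]}, index=pd.to_datetime(["2025-01-01"] * 3))
    filtered_a = df_a[df_a.site_id.isin([1, 2])]

    # (b) site_ids as column names: filter the columns instead
    df_b = pd.DataFrame(columns=[1, 2, 3], index=pd.to_datetime(["2025-01-01"]))
    filtered_b = df_b[[c for c in df_b.columns if c in [1, 2]]]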

zaryab-ali (Author)

    valid_t0_per_site = pd.DataFrame(index=valid_t0_times_per_site)
    valid_t0_per_site["site_id"] = site_id

I think the above code sets site_id as a column, so the isin function should work properly; let me know if I am not seeing something here.



-   def get_datapipe(config_path: str) -> NumpyBatch:
-       """Construct datapipe yielding batches of concurrent samples for all sites
+   def get_datapipe(config_path: str):
AUdaltsova (Contributor)

I'd rename it to get_dataset, or maybe just move the contents into main, as it doesn't do anything complicated anymore; it just calls the dataset and filters.

Comment on lines 158 to 159
    dataset = SitesDataset(config_path)
    return dataset.datasets_dict["site"]
AUdaltsova (Contributor)

Instead of creating an entire dataset just for this, I'd maybe just unpack the config and open the sites data directly, as in the function that creates the .datasets_dict you're using here.
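Something along these lines, using the get_dataset_dict helper that comes up later in this thread (its import path and exact signature are my assumptions, not confirmed API):

    # Sketch only: open the sites data directly instead of building a SitesDataset
    config = load_yaml_configuration(config_path)  # loader import assumed as above
    datasets_dict = get_dataset_dict(config.input_data)  # dict keyed by data source
    da_site = datasets_dict["site"]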

AUdaltsova (Contributor) left a comment

Great work! I've left some comments, but it's starting to look very good now.

zaryab-ali (Author)

@AUdaltsova I made all the changes you suggested; let me go through them:

  1. Updated valid times according to your instructions (hopefully got it right this time).
  2. Unpacked the config in the main function instead of predict_batch.
  3. Added the pv_system dimension back.
  4. Left the filtering for specific IDs the same for now; I replied to your comment with my reasoning, so let me know about it.
  5. Got rid of the get_datapipe function and moved the functionality into main.
  6. Instead of creating a SitesDataset object, used the function get_dataset_dict. (I do have a question about this, though: the function returns a dict which I manually converted into an xr.Dataset, but when we get it as an attribute of the class, the class does some preprocessing on it before returning an xr.Dataset. Let me know if the method I am using is OK.)

zaryab-ali requested a review from AUdaltsova · March 10, 2025 01:52
zaryab-ali (Author)

Hi again @AUdaltsova, sorry for bothering you again; just wanted to leave a reminder that I committed the changes a few days ago, so have a look whenever you have time. I understand that an ocf-data-sampler model is not available right now, so if the changes look legit we can merge this into a dev branch for now, and when a model does become available we can merge it to main after some testing.

AUdaltsova (Contributor)

Hi @zaryab-ali, thanks for nudging me! Sorry to leave you hanging. I've taken a quick look and it looks good. I'll need to check it a bit more thoroughly before we can merge since there are no tests, and unfortunately it's a super busy time here at OCF, so it might take me a bit to get back to you! I'll try to do this at the first opportunity I get.

Successfully merging this pull request may close these issues.

Update backtest scripts to use ocf-data-sampler, for site