replaced datapipe with ocf-data-sampler #323
base: main
Conversation
scripts/backtest_sites.py
Outdated
```diff
  t0 = pd.Timestamp(sample["site_init_time_utc"][0])
  site_id = sample["site_id"][0]

  # Get valid times for this forecast
- valid_times = pd.to_datetime(
-     [t0 + np.timedelta64((i + 1) * FREQ_MINS, "m") for i in range(n_valid_times)]
+ valid_times = pd.date_range(
+     start=t0 + pd.Timedelta(minutes=FREQ_MINS),
```
valid_times are supposed to be the timestamps the model makes predictions for, so they should not contain any history, i.e. they should start from t0, from which interval start and end are calculated. Judging by this TODO, we might not store it in the samples right now; you might have to either make the change in the dataset itself or pull interval_start/interval_end from the data config here to get the correct times.
Can you please have a look at the recent commit and let me know if the changes I made are correct?
scripts/backtest_sites.py
Outdated
```python
    start=t0 + pd.Timedelta(interval_start),
    end=t0 + pd.Timedelta(interval_end),
    freq=f"{time_resolution.astype(int)}min",
```
Lifting from the config is a good way to do it, but the sample ModelPipe is getting already has history, so the first time in it is not t0, it's t0 + interval_start. You can get t0 from the last time minus interval_end, for example, or the first time minus interval_start (you can see how that happens in this function).
Let me know if the following logic is OK:

```python
last_time = pd.Timestamp(sample["site_target_time_utc"][-1])
t0 = last_time - pd.Timedelta(interval_end)
valid_times = pd.date_range(
    start=t0 + pd.Timedelta(interval_start),
    end=t0 + pd.Timedelta(interval_end),
    freq=f"{time_resolution.astype(int)}min",
)
```
Sorry, I should've explained better how the time windows in our samples work. When a forecast is made at time t0 (which we also sometimes refer to as the init time), it creates a forecast for x minutes forward, which is what interval_end governs (in that case, interval_end = x). To do that, the model also needs information on what the generation was just before, so we supply y minutes of history to it, which is what interval_start governs (interval_start = -y in this example). So any sample will contain y + x minutes of generation data, some for history and some to check the forecast against at training time. When running inference, we are only interested in the [t0, t0 + interval_end] period, so these are the dates we need to extract from the sample.
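Putting that together, a minimal runnable sketch of the extraction (the interval values and timestamps here are made up for illustration; in the real script they come from config.input_data.site and the sample's time axis):

```python
import pandas as pd

# Assumed interval settings, in the spirit of the config snippets above.
interval_start = pd.Timedelta(minutes=-60)   # y = 60 minutes of history
interval_end = pd.Timedelta(minutes=480)     # x = 480 minutes of forecast
time_resolution = pd.Timedelta(minutes=30)

# A sample's time axis spans [t0 + interval_start, t0 + interval_end],
# so t0 can be recovered from the last timestamp.
sample_times = pd.date_range("2024-06-01 11:00", "2024-06-01 20:00", freq="30min")
t0 = sample_times[-1] - interval_end

# For inference we only keep the forecast horizon, not the history:
valid_times = pd.date_range(
    start=t0 + time_resolution,  # first forecast step after t0
    end=t0 + interval_end,
    freq=time_resolution,
)
```

Note that valid_times deliberately starts after t0 rather than at t0 + interval_start, so no history timestamps leak into the forecast period.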
scripts/backtest_sites.py
Outdated
```python
config = load_yaml_configuration(self.config_path)

interval_start = np.timedelta64(config.input_data.site.interval_start_minutes, "m")
interval_end = np.timedelta64(config.input_data.site.interval_end_minutes, "m")
time_resolution = np.timedelta64(config.input_data.site.time_resolution_minutes, "m")
```
Also, I would probably unpack the config in the init, or maybe even move the whole thing to main and pass in the extracted interval_end (or whatever you end up using), to avoid doing it again and again for every batch.
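One way to do that is to hoist the unpacking into the constructor. This is only a sketch: a SimpleNamespace stands in for the object load_yaml_configuration would return, and the rest of the class is elided.

```python
from types import SimpleNamespace

import numpy as np


class ModelPipe:
    """Sketch: unpack the interval settings once at construction time
    instead of re-reading the config for every batch."""

    def __init__(self, config):
        site = config.input_data.site
        self.interval_start = np.timedelta64(site.interval_start_minutes, "m")
        self.interval_end = np.timedelta64(site.interval_end_minutes, "m")
        self.time_resolution = np.timedelta64(site.time_resolution_minutes, "m")


# Stand-in for what load_yaml_configuration(config_path) would return:
config = SimpleNamespace(
    input_data=SimpleNamespace(
        site=SimpleNamespace(
            interval_start_minutes=-60,
            interval_end_minutes=480,
            time_resolution_minutes=30,
        )
    )
)
pipe = ModelPipe(config)
```

Per-batch code can then reuse pipe.interval_start and friends without touching the config again.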
```diff
- # Make mask dataset for sundown
+ # Get solar elevation and create sundown mask
  elevation = sample["site_solar_elevation"]
  da_sundown_mask = xr.DataArray(
      data=elevation < MIN_DAY_ELEVATION,
```
There used to be a pv_system dimension, which I would keep: the intention, I think, is for this to work for multiple sites simultaneously, and even if we end up using just one site this functionality would still allow it.
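For illustration, a small sketch of the sundown mask with the pv_system dimension kept (the elevation values and the MIN_DAY_ELEVATION threshold are made up here):

```python
import numpy as np
import xarray as xr

MIN_DAY_ELEVATION = 0  # assumed threshold in degrees

# Toy solar-elevation data shaped (pv_system, time) instead of just (time,),
# so the sundown mask covers several sites at once.
elevation = xr.DataArray(
    data=np.array([[-5.0, 2.0, 10.0], [-1.0, -0.5, 4.0]]),
    dims=("pv_system", "time"),
    coords={"pv_system": [101, 102]},
)

# The comparison keeps both dimensions, giving one mask value per site per timestep.
da_sundown_mask = elevation < MIN_DAY_ELEVATION
```

Single-site runs then just mean a pv_system axis of length one, with no code changes needed.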
scripts/backtest_sites.py
Outdated
```diff
- data_pipeline = data_pipeline.set_length(num_batches)
+ # Filter for specific site IDs
+ dataset.valid_t0_and_site_ids = dataset.valid_t0_and_site_ids[
+     dataset.valid_t0_and_site_ids.site_id.isin(ALL_SITE_IDS)
```
This is a good idea, but I think the selection might not work - at this point dataset.valid_t0_and_site_ids is organised as a pd.DataFrame with site_ids as column names (see here).
```python
valid_t0_per_site = pd.DataFrame(index=valid_t0_times_per_site)
valid_t0_per_site["site_id"] = site_id
```

I think the above code sets site_id as a column, and the isin function should work properly - let me know if I am not seeing something here.
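A quick self-contained check of that pattern, with toy timestamps and site IDs mimicking how the per-site frames would be concatenated:

```python
import pandas as pd

ALL_SITE_IDS = [1, 3]  # assumed subset of sites to backtest

# Build valid_t0_and_site_ids the same way as above: one frame per site,
# with the valid t0 times as the index and site_id as a column.
per_site_t0s = {
    1: ["2024-01-01 00:00", "2024-01-01 00:30"],
    2: ["2024-01-01 00:00"],
    3: ["2024-01-01 01:00"],
}
frames = []
for site_id, times in per_site_t0s.items():
    df = pd.DataFrame(index=pd.to_datetime(times))
    df["site_id"] = site_id
    frames.append(df)
valid_t0_and_site_ids = pd.concat(frames)

# With site_id as a column, the .isin filter behaves as intended:
filtered = valid_t0_and_site_ids[valid_t0_and_site_ids.site_id.isin(ALL_SITE_IDS)]
```

If site_ids instead ended up as column names, the same filter would raise an AttributeError, so this is easy to verify against the real dataset.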
scripts/backtest_sites.py
Outdated
```diff
- def get_datapipe(config_path: str) -> NumpyBatch:
-     """Construct datapipe yielding batches of concurrent samples for all sites
+ def get_datapipe(config_path: str):
```
I'd rename it to get_dataset, or maybe just move the contents into main, as it doesn't do anything complicated anymore - it just calls the dataset and filters.
scripts/backtest_sites.py
Outdated
```python
dataset = SitesDataset(config_path)
return dataset.datasets_dict["site"]
```
Instead of creating an entire dataset just for this I'd maybe just unpack the config and open the sites data directly, as in the function that creates .datasets_dict you're using here
Great work! I've left some comments, but it's starting to look very good now.
@AUdaltsova I made all the changes you suggested, please go through them.
Hi again @AUdaltsova, sorry for bothering you again - just wanted to leave a reminder that I committed the changes a few days ago, so have a look whenever you have time. I understand that an ocf-data-sampler model is not available right now, so if the changes look legit we can merge this into a dev branch for now, and when a model does become available we can merge it to main after some testing.
Hi @zaryab-ali, thanks for nudging me! Sorry to leave you hanging. I've taken a quick look and it looks good - I'll need to check it a bit more thoroughly before we can merge since there are no tests, and unfortunately it's a super busy time here at OCF, so it might take me a bit to get back to you! Will try to do this the first opportunity I get.
Description
Removed ocf_datapipe from the entire file and replaced it with ocf_data_sampler. Some changes might still need to be made; needs reviewing.
Fixes #313