
replaced datapipe with ocf-data-sampler #323

Open · zaryab-ali wants to merge 8 commits into base: main
Conversation

zaryab-ali

Description

Removed ocf_datapipe from the entire file and replaced it with ocf_data_sampler. Some changes might still need to be made; needs reviewing.

Fixes #313

Comment on lines 186 to 191
    t0 = pd.Timestamp(sample["site_init_time_utc"][0])
    site_id = sample["site_id"][0]

    # Get valid times for this forecast
-   valid_times = pd.to_datetime(
-       [t0 + np.timedelta64((i + 1) * FREQ_MINS, "m") for i in range(n_valid_times)]
+   valid_times = pd.date_range(
+       start=t0 + pd.Timedelta(minutes=FREQ_MINS),
AUdaltsova (Contributor)
valid_times are supposed to be the timestamps the model makes predictions for, so they should not contain any history, i.e. they should start from t0, from which interval start and end are calculated. Judging by this TODO, we might not store it in the samples right now; you might have to either make the change in the dataset itself or pull interval_start/interval_end from the data config here to get the correct times.

zaryab-ali (Author)

Can you please have a look at the recent commit and let me know if the changes I made are correct or not?

Comment on lines 199 to 201
        start=t0 + pd.Timedelta(interval_start),
        end=t0 + pd.Timedelta(interval_end),
        freq=f"{time_resolution.astype(int)}min",
AUdaltsova (Contributor)

Lifting from the config is a good way to do it, but the sample ModelPipe is getting already has history, so the first time in it is not t0; it's t0 + interval_start. You can get t0 from last time - interval_end, for example, or first time - interval_start (you can see how that happens in this function).

zaryab-ali (Author) · Feb 26, 2025

Let me know if the following logic is OK:

    last_time = pd.Timestamp(sample["site_target_time_utc"][-1])
    t0 = last_time - pd.Timedelta(interval_end)

    valid_times = pd.date_range(
        start=t0 + pd.Timedelta(interval_start),
        end=t0 + pd.Timedelta(interval_end),
        freq=f"{time_resolution.astype(int)}min",
    )

AUdaltsova (Contributor)

Sorry, I should've explained better how time windows in our samples work. When a forecast is made at time t0 (which we also sometimes refer to as the init time), it creates a forecast for x minutes forward, which is what interval_end governs (in that case, interval_end = x). To do that, the model also needs information on what the generation was just before t0, so we supply y minutes of history, which is what interval_start governs (interval_start = -y in this example). So any sample will contain y + x minutes of generation data, some for history and some to check the forecast against at training time. When running inference, we are only interested in the [t0, t0 + interval_end] period, so these are the dates we need to extract from the sample.
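To make the window arithmetic concrete, here is a minimal sketch with made-up numbers (60 minutes of history, an 8-hour horizon, and 30-minute resolution are illustrative values, not ones taken from this repo):

    import pandas as pd

    interval_start = pd.Timedelta(minutes=-60)  # -y: 60 minutes of history
    interval_end = pd.Timedelta(minutes=480)    # x: 480 minutes of forecast
    freq = pd.Timedelta(minutes=30)

    # A sample's target times span [t0 + interval_start, t0 + interval_end]
    t0 = pd.Timestamp("2025-02-26 12:00")
    sample_times = pd.date_range(t0 + interval_start, t0 + interval_end, freq=freq)

    # Recover t0 from either end of the window, then keep only [t0, t0 + interval_end]
    t0_recovered = sample_times[-1] - interval_end  # == sample_times[0] - interval_start
    valid_times = sample_times[sample_times >= t0_recovered]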

Comment on lines 188 to 192
    config = load_yaml_configuration(self.config_path)

    interval_start = np.timedelta64(config.input_data.site.interval_start_minutes, "m")
    interval_end = np.timedelta64(config.input_data.site.interval_end_minutes, "m")
    time_resolution = np.timedelta64(config.input_data.site.time_resolution_minutes, "m")
AUdaltsova (Contributor)

Also, I would probably unpack the config in the init, or maybe even move the whole thing to main and pass the extracted interval_end (or whatever you end up using), to avoid doing it again and again for every batch.
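A minimal sketch of that suggestion, assuming the class is the ModelPipe mentioned above and that the config loader matches the diff (both names come from this thread; the exact layout is illustrative, not the repo's actual code):

    import numpy as np

    from ocf_data_sampler.config import load_yaml_configuration  # import path assumed

    class ModelPipe:  # hypothetical constructor; the real class takes more arguments
        def __init__(self, config_path: str):
            # Unpack the config once here instead of on every predict_batch call
            config = load_yaml_configuration(config_path)
            site = config.input_data.site
            self.interval_start = np.timedelta64(site.interval_start_minutes, "m")
            self.interval_end = np.timedelta64(site.interval_end_minutes, "m")
            self.time_resolution = np.timedelta64(site.time_resolution_minutes, "m")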


-   # Make mask dataset for sundown
+   # Get solar elevation and create sundown mask
    elevation = sample["site_solar_elevation"]
    da_sundown_mask = xr.DataArray(
        data=elevation < MIN_DAY_ELEVATION,
AUdaltsova (Contributor)

There used to be a pv_system dimension, which I would keep: the intention, I think, is for this to work for multiple sites simultaneously, and even if we end up using just one site this functionality would still allow it.
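For illustration, a self-contained sketch of keeping an explicit pv_system dimension on the mask (dimension names and dummy values are assumptions, not the repo's actual ones):

    import numpy as np
    import xarray as xr

    MIN_DAY_ELEVATION = 0  # illustrative threshold
    elevation = np.array([-5.0, 2.0, 10.0, 1.0, -3.0])  # stand-in for sample["site_solar_elevation"]
    site_id, times = 1, np.arange(5)

    # A leading pv_system axis keeps the mask usable for several sites at once
    da_sundown_mask = xr.DataArray(
        data=(elevation < MIN_DAY_ELEVATION)[np.newaxis, :],  # shape (pv_system, time)
        dims=["pv_system", "time_utc"],
        coords={"pv_system": [site_id], "time_utc": times},
    )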

-   data_pipeline = data_pipeline.set_length(num_batches)
+   # Filter for specific site IDs
+   dataset.valid_t0_and_site_ids = dataset.valid_t0_and_site_ids[
+       dataset.valid_t0_and_site_ids.site_id.isin(ALL_SITE_IDS)
+   ]
AUdaltsova (Contributor)

This is a good idea, but I think the selection might not work: at this point dataset.valid_t0_and_site_ids is, I think, organised as a pd.DataFrame with site_ids as column names (see here).
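The two layouts behave differently, as this toy example shows (the data is made up; only the filtering pattern matters):

    import pandas as pd

    # (a) site_id stored as a column: .isin on the column works
    df_a = pd.DataFrame({"site_id": [1, 2, 3]}, index=pd.to_datetime(["2025-01-01"] * 3))
    filtered_a = df_a[df_a.site_id.isin([1, 2])]

    # (b) site_ids as column names: filter the columns instead
    df_b = pd.DataFrame(columns=[1, 2, 3], index=pd.to_datetime(["2025-01-01"]))
    filtered_b = df_b[[c for c in df_b.columns if c in [1, 2]]]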

zaryab-ali (Author)

    valid_t0_per_site = pd.DataFrame(index=valid_t0_times_per_site)
    valid_t0_per_site["site_id"] = site_id

I think the above code sets site_id as a column, so the isin function should work properly; let me know if I am not seeing something here.



-   def get_datapipe(config_path: str) -> NumpyBatch:
-       """Construct datapipe yielding batches of concurrent samples for all sites
+   def get_datapipe(config_path: str):
AUdaltsova (Contributor)

I'd rename it to get_dataset, or maybe just move the contents into main, as it doesn't do anything complicated anymore; it just calls the dataset and filters.

Comment on lines 158 to 159
    dataset = SitesDataset(config_path)
    return dataset.datasets_dict["site"]
AUdaltsova (Contributor)

Instead of creating an entire dataset just for this, I'd maybe just unpack the config and open the sites data directly, as in the function that creates the .datasets_dict you're using here.
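Something along these lines, using the get_dataset_dict helper that comes up later in this thread (its import path and exact signature are my assumptions, not confirmed API):

    # Sketch only: open the sites data directly instead of building a SitesDataset
    config = load_yaml_configuration(config_path)  # loader import assumed as above
    datasets_dict = get_dataset_dict(config.input_data)  # dict keyed by data source
    da_site = datasets_dict["site"]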

AUdaltsova (Contributor) left a comment

Great work! I've left some comments, but it's starting to look very good now.

zaryab-ali (Author)

@AUdaltsova I made all the changes you suggested; let me go through them:

  1. Updated valid times according to your instructions (hopefully got it right this time).
  2. Unpacked the config in the main function instead of predict_batch.
  3. Added the pv_system dimension back.
  4. Left the filtering for specific IDs the same for now; I replied to your comment with my reasoning, so let me know about it.
  5. Got rid of the get_datapipe function and moved the functionality into main.
  6. Instead of creating a SitesDataset object, used the function get_dataset_dict. (I do have a question about this, though: the function returns a dict which I manually converted into an xr.Dataset, but when we get it as an attribute of the class, the class does some preprocessing on it before returning an xr.Dataset. Let me know if the method I am using is OK.)

zaryab-ali requested a review from AUdaltsova · March 10, 2025 01:52
zaryab-ali (Author)

Hi again @AUdaltsova, sorry for bothering you again; just wanted to leave a reminder that I committed the changes a few days ago, so have a look whenever you have time. I understand that an ocf-data-sampler model is not available right now, so if the changes look legit we can merge this into a dev branch for now, and when a model does become available we can merge it to main after some testing.

AUdaltsova (Contributor)

Hi @zaryab-ali, thanks for nudging me! Sorry to leave you hanging. I've taken a quick look and it looks good. I'll need to check it a bit more thoroughly before we can merge since there are no tests, and unfortunately it's a super busy time here at OCF, so it might take me a bit to get back to you! I'll try to do this at the first opportunity I get.

Successfully merging this pull request may close these issues.

Update backtest scripts to use ocf-data-sampler, for site