Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update backtest scripts to use ocf-data-sampler, for site #313

Open
zaryab-ali opened this issue Feb 1, 2025 · 15 comments · May be fixed by #323
Open

Update backtest scripts to use ocf-data-sampler, for site #313

zaryab-ali opened this issue Feb 1, 2025 · 15 comments · May be fixed by #323
Assignees

Comments

@zaryab-ali
Copy link

@peterdudfield you mentioned this in phase 2 in openclimatefix/ocf-data-sampler#98
if my understaning is correct, we are trying to change ocf_datapipes to ocf-data-sampler and the changes need to made in this file https://github.com/openclimatefix/PVNet/blob/main/scripts/backtest_sites.py

any guidence or reference for getting started would be really helpful

@peterdudfield
Copy link
Contributor

I think @AUdaltsova has been working on this, so Ill let @AUdaltsova update you.

@AUdaltsova
Copy link
Contributor

Hi @zaryab-ali, thanks for jumping in! You are right, it refers to the script you've mentioned and this UK GSP script, which is what I've been working on. We still very much need the site version sorted as well, so the help would be very welcome if it's something you want to do!

The general idea behind these scripts is to create samples for a given time window, run them through a pretrained model, and save inference results. In general, I expect you'll be able to get rid of some functions that build the datapipe from scratch and use the site dataset as is. Things like ModelPipe I'd expect will stay, but might need updating to work with ocf-data-sampler instead of ocf_datapipes.

I appreciate that it's a lot to get through those scripts and what they're trying to do! Certainly took me a while :) So don't hesitate to ping me if you need help.

Also, I expect I'll be adding the uk gsp script soon, so if you want you can wait for that to have something for reference.

@AUdaltsova
Copy link
Contributor

@zaryab-ali hi again! Just wanted to let you know that I've put the version of uk gsp backtest I've been working on here, so you can take a look if you want! NB that this script uses an older version of ocf-data-sampler (0.19 I think), so still relies on ocf_datapipes a little bit - your version shouldn't have that! You also don't need to include the ensemble_member parameter as it's specific to my work. Also, it is definitely not perfect, so feel free to take creative liberties if you think you can do something better/cleaner (eg I was thinking that populating the data config pulled off HF with paths can be done a lot better. Speaking of which, you might want to create some fake data you can point to for debugging!)

Hope that's helpful, let me know if you have any questions!

@zaryab-ali
Copy link
Author

@AUdaltsova thank you for the reference code,
i will start working on this ASAP
also let me know if there is a slack channel or something that i can join for asking questions, or should i ask about any future queries right here

@AUdaltsova
Copy link
Contributor

@zaryab-ali no problem, thanks for volunteering! As far as I know we don't operate a community slack or anything similar at the moment, so feel free to just ping me here. We have a discussions page if you have other questions unrelated to this issue

@AUdaltsova
Copy link
Contributor

Hi @zaryab-ali, I think you've deleted your comment about data download, just wanted to check if you need any help with that?

@AUdaltsova AUdaltsova assigned zaryab-ali and unassigned AUdaltsova Feb 6, 2025
@zaryab-ali
Copy link
Author

@AUdaltsova thanks for reaching out, i was having an issue with the huggingface data but thats resolved, i was hoping you could help me with another issue, i followed the readme instructions to the dot but i am getting this error when running the save_batches.py
In 'config.yaml': Could not find 'datamodule/ocf_datapipes.yaml'

Image

@AUdaltsova
Copy link
Contributor

Hi! Glad to hear it's resolved! Re: config error, you need to change where datamodule points in the config.yaml, as it defaults to a nonexistent one (maybe we should remove that...). If you're trying to create batches, I'd point it to streamed_batches.yaml (there is one in config/datamodule folder in PVNet), but be sure to check the data config it's pointing to, especially the data paths. You can look at data configs we have on HF for reference

@zaryab-ali
Copy link
Author

zaryab-ali commented Feb 13, 2025

@AUdaltsova hi again, i've made the necessary changes to backtest_sites.py ( might be a little rough and might need reviewing and changes ), i wanted to ask which branch of pvnet should i create a pull request in

@zaryab-ali
Copy link
Author

zaryab-ali commented Feb 14, 2025

also i think the example_configuation.yaml needs to be updated to something like this
`input_data:
default_history_minutes: 120(removed)
default_forecast_minutes: 480(removed)

site:
file_path: plcholder
metadata_file_path: plceholder
time_resolution_minutes: 15
interval_start_minutes: -120 # Change from 0 to -120 to get 120 minutes of history
interval_end_minutes: 480 # This is correct as is (matches your old forecast_minutes)
dropout_timedeltas_minutes: null
dropout_fraction: 0
`
and the names of the column in metadata.csv given in here need to be updated to work with ocf data sample

let me know if i am mistaken about any of this

@zaryab-ali zaryab-ali linked a pull request Feb 14, 2025 that will close this issue
@zaryab-ali
Copy link
Author

zaryab-ali commented Feb 14, 2025

@AUdaltsova i created a pull request here, i noticed that some of the changes i mentioned in the above comment about example_configuration.yaml are present in a commit in another branch, here

also i wanted to know if you could share link to any data other than the pv link in the read me, because right now i am using sort of a hacky way by creating my own csv and netcdf, because like i mentioned earlier, the coloumn names dont match with what ocf-data-sampler is expecting,

basically what i wanna know is that is this something that can be solved by changing the csv and netcdf or do i need to add some logic in backtest_sites.py to handle this during run time

@AUdaltsova
Copy link
Contributor

@zaryab-ali hi again! That's great news, thanks for being on it! Re: config changes, I think we've merged the dev branch we were using for all ocf-data-sampler compatible changes into main right about the time you've left those comments, which might be the source of the confusion. PVNet's main branch should be the one relying on ocf-data-sampler now, so your PR is in the right place I think. I'll leave further comments on the PR itself to keep it in one place.

Re: pv data structure, this script is not supposed to be working with the uk_pv dataset we have on HF, instead it should open local data with paths given in the config files, so yeah, if it works with csv and netcdf you've created it should be fine. I'll see if I can find an example of those structures for you so you can double-check everything is working as it should

@zaryab-ali
Copy link
Author

@AUdaltsova I'm working on the changes you suggested on the PR, let me know when you can provide the example data you mentioned above for double-checking

@AUdaltsova
Copy link
Contributor

@zaryab-ali hi again! I don't think we have any sites data in open use at the moment, but this script creating data for testing should be helpful! You can just use whatever it spits out or maybe extend the time a bit so you have more data to play with.

Though I've realised we don't actually have any models trained with ocf-data-sampler for sites yet, so it will be really hard to validate it's working correctly. I think a first pass is still going to be useful but unless you're willing to train a model yourself just for the sake of doing this, it can get really tricky to finish the script.

Just to be super clear, I think it will still be useful to merge your PR after some changes because it does a lot of useful work for this script and is not going to break anything anyway, and then if you want I can ping you when we have the models to validate this and you can do another PR to finish this? If this sounds okay, I can comment more on your PR with what I would check for in these circumstances, and we can go from there.

@zaryab-ali
Copy link
Author

zaryab-ali commented Feb 24, 2025

@AUdaltsova i have made the changes you suggested on the PR, and i guess you are right, training my own model would be a little excessive just for backtesting(plus i dont have the hardware to train a model to yeild any meaningful results), lemme know if you need me to make any other changes besides the one you mentioned in the PR #323, so i can make the changes and update the PR and we can do further testing when a model does become available

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants