[GSoC Project Proposal]: Extend CrocoLake's available datasets #75
Comments
@enrico-mi can you provide a link to the
Hi @MathewBiddle, I'm working on publishing it and will share the link by the end of the day.
@MathewBiddle I have added a link to the repo.
Hi @MathewBiddle, @enrico-mi, I am interested in working on this project. My background includes Python (over 4 years, including packages such as pandas, numpy, xarray, and sklearn), Git/GitHub, cloud technologies (Azure), and other data lifecycle management skills (building data pipelines, warehousing, etc.). I would love guidance on where to get started on this project.
Hi @MathewBiddle, @enrico-mi, I've brushed up on xarray and netCDF files, along with some conventions for FAIR data. If you can provide guidance on where to get started on this project, I can start contributing to this issue.
Hi @RATED-R-SUNDRAM, thank you for your interest in the project. Our application with Google is still pending, so we cannot get started yet. We should have news in a couple of weeks and, if the application is accepted, we'll be able to get started then. Thank you for your patience.
Sure @enrico-mi, looking forward to it.
Hi @enrico-mi, @david11133, my name is Miranda Zhu. I'd like to contribute to the CrocoLake project. As a junior studying data science at UC Berkeley, I have a strong foundation in Python and experience working with databases. After reviewing the repository, I have some questions about how to best contribute to the project:
Hi @Mirandazhu02, just to clarify, I'm not a mentor; I'm a contributor, just like you! Currently, I'm working on the CPR converter and am happy to share what I know. To get started, I highly recommend setting up the development environment with the latest If you're interested in adding a new conversion module, I suggest using existing modules, like Once Enrico is available, he'll be happy to provide example datasets to help you get started. Best of luck!
Hi @Mirandazhu02, great to hear of your interest!
CPR is also .csv so the GLODAP converter should be a good reference to start.
At the moment we're keeping all the parameters, and the mapping is done on the coordinates (latitude, longitude, time). We'll trim the parameters later, as I am waiting on colleagues' feedback on that point.
I have added a CONTRIBUTE.md file since you wrote this message, so you can look at that now for more info.
Project Description
CrocoLake is a data lake that gathers several physical and biogeochemical ocean observations, with the goal of providing an efficient format and a unified interface for data assimilation and ocean modeling activities. CrocoLakeTools contains the Python modules that convert existing datasets to CrocoLake's structure (i.e. Parquet format with a common schema), and a unified interface to load them into the same dataframe.
This project consists of taking an existing dataset that is not yet included in CrocoLake and developing new modules, or adapting existing ones, to convert it to CrocoLake's format. The list of possible datasets to include has already been put together by the mentor. The dataset to convert will be chosen together with the mentee, based on their experience and familiarity with the format and size of the original database. The project can then be tailored to the mentee's interests and skills: from the conversion of a single-file CSV dataset to the multi-processing conversion of a dataset containing multiple netCDF files.
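As a rough illustration of the simplest case (a single-file CSV dataset), the sketch below renames a source dataset's coordinate columns to a shared set of names and writes the result to Parquet with pandas. It is not CrocoLakeTools code: the column names in `COORDINATE_MAP`, the target names (`LATITUDE`, `LONGITUDE`, `TIME`), and both function names are hypothetical and chosen only for illustration. All other columns are kept unchanged. Writing Parquet with pandas also assumes that `pyarrow` or `fastparquet` is installed.

```python
import pandas as pd

# Hypothetical mapping from one source dataset's column names to a shared
# schema. These names are assumptions, not CrocoLake's actual schema.
COORDINATE_MAP = {
    "lat": "LATITUDE",
    "lon": "LONGITUDE",
    "date": "TIME",
}
TIME_COLUMN = "TIME"


def standardize_coordinates(df, coordinate_map=COORDINATE_MAP):
    """Return a copy of df with coordinate columns renamed to the shared names.

    Non-coordinate columns (the measured parameters) are kept as they are.
    """
    missing = [name for name in coordinate_map if name not in df.columns]
    if missing:
        raise KeyError(f"missing coordinate columns: {missing}")
    out = df.rename(columns=coordinate_map)
    if TIME_COLUMN in out.columns:
        # Parse timestamps so every converted dataset stores time as UTC.
        out[TIME_COLUMN] = pd.to_datetime(out[TIME_COLUMN], utc=True)
    return out


def convert_csv_to_parquet(csv_path, parquet_path):
    """Read a single CSV file, standardize its coordinates, write Parquet."""
    df = pd.read_csv(csv_path)
    # DataFrame.to_parquet needs pyarrow or fastparquet to be installed.
    standardize_coordinates(df).to_parquet(parquet_path, index=False)
```

A multi-file netCDF dataset could follow the same pattern, with one such conversion per file distributed across worker processes; the details would depend on the chosen dataset.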
Expected Outcomes
Skills Required
Python (pandas, xarray), git
Additional Background/Issues
The current version of CrocoLakeTools is here.
Mentor(s)
Enrico Milanese ([email protected], @enrico-mi)
Expected Project Size
175 hours
Project Difficulty
Intermediate