Description
Project Description
CrocoLake is a data lake that gathers physical and biogeochemical ocean observations from several datasets, with the goal of providing an efficient format and a unified interface for data assimilation and ocean modeling activities. CrocoLakeTools contains the Python modules that convert existing datasets to CrocoLake's structure (i.e. Parquet format with a common schema), along with a unified interface to load them into a single dataframe.
This project consists of taking an existing dataset that is not yet included in CrocoLake and developing new modules, or adapting existing ones, to convert it to CrocoLake's format. The mentor has already put together a list of candidate datasets. The dataset to convert will be chosen together with the mentee, based on their experience and familiarity with the format and size of the original dataset. The project can then be tailored to the mentee's interests and skills: from the conversion of a single-file CSV dataset to the multi-processing conversion of a dataset containing multiple netCDF files.
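To give a rough idea of what the simplest case (a single-file CSV dataset) might involve, here is a minimal sketch using pandas. It is not CrocoLakeTools' actual API: the column names in `COLUMN_MAP`, the dtypes in `TARGET_DTYPES`, the helper names `to_common_schema` and `convert_csv`, and the sample data are all hypothetical and only illustrate the idea of mapping a source dataset onto a common schema before writing Parquet.

```python
import io

import pandas as pd

# Hypothetical mapping from the source column names to a common schema.
# The target names are illustrative assumptions, not CrocoLake's real schema.
COLUMN_MAP = {
    "lat": "LATITUDE",
    "lon": "LONGITUDE",
    "time": "TIME",
    "depth_m": "DEPTH",
    "temp_degC": "TEMP",
}

# Hypothetical target dtypes for the numeric columns.
TARGET_DTYPES = {
    "LATITUDE": "float64",
    "LONGITUDE": "float64",
    "DEPTH": "float32",
    "TEMP": "float32",
}


def to_common_schema(df):
    """Rename, select, and cast columns so the frame follows the common schema."""
    out = df.rename(columns=COLUMN_MAP)
    # Keep only the columns that belong to the common schema, in a fixed order.
    out = out[list(COLUMN_MAP.values())]
    # Parse timestamps as timezone-aware UTC datetimes.
    out = out.assign(TIME=pd.to_datetime(out["TIME"], utc=True))
    return out.astype(TARGET_DTYPES)


def convert_csv(source, destination=None):
    """Read a CSV source, convert it, and optionally write it as Parquet."""
    converted = to_common_schema(pd.read_csv(source))
    if destination is not None:
        # Writing Parquet requires a Parquet engine such as pyarrow.
        converted.to_parquet(destination, index=False)
    return converted


# Made-up sample data standing in for a real source file.
SAMPLE_CSV = """lat,lon,time,depth_m,temp_degC,cruise_id
41.5,-70.6,2020-01-01T00:00:00Z,5.0,4.2,A1
41.6,-70.7,2020-01-01T06:00:00Z,10.0,4.0,A1
"""

converted = convert_csv(io.StringIO(SAMPLE_CSV))
print(converted.dtypes)
```

Passing a path as `destination` (for example `convert_csv("source.csv", "converted.parquet")`) would also write the result to disk. A multi-file netCDF dataset would likely follow the same pattern, with one conversion per file distributed across worker processes.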
Expected Outcomes
- CrocoLake's success depends on the number of observations that it can serve in a uniform manner, hence the importance of this project.
- Adding more conversion modules provides more examples of how to build one, reducing the effort required of future users who are interested in adding their own datasets.
- CrocoLakeTools is in its infancy, and this project would also provide feedback on its accessibility to new users.
Skills Required
Python (pandas, xarray), git
Additional Background/Issues
CrocoLakeTools' current version is here.
Mentor(s)
Enrico Milanese ([email protected], @enrico-mi)
Expected Project Size
175 hours
Project Difficulty
Intermediate