[GSoC Project Proposal]: Extend CrocoLake's available datasets #75

Open
enrico-mi opened this issue Feb 10, 2025 · 10 comments
Labels
GSoC25 project idea Designates a proposed project idea

Comments

@enrico-mi

enrico-mi commented Feb 10, 2025

Project Description

CrocoLake is a data lake that gathers physical and biogeochemical ocean observations, with the goal of providing an efficient format and a unified interface for data assimilation and ocean modeling activities. CrocoLakeTools contains the Python modules that convert existing datasets to CrocoLake's structure (i.e. parquet format with a common schema), and a unified interface to load them into the same dataframe.

This project consists of taking an existing dataset that is not yet included in CrocoLake and developing new modules, or adapting existing ones, to convert it to CrocoLake's format. The mentor has already put together a list of candidate datasets. The dataset to convert will be chosen together with the mentee, based on their experience and familiarity with the format and size of the original database. The project can then be tailored to the mentee's interests and skills: from the conversion of a single-file CSV dataset to the multi-processing conversion of a dataset containing multiple netCDF files.
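As a rough illustration of what such a conversion involves, the sketch below reads a small CSV with pandas, renames the source columns to a shared set of names, adds any target column the source lacks, and fixes the dtypes. The column names, `RENAME_MAP`, `TARGET_COLUMNS`, and `to_common_schema` are hypothetical and are not taken from CrocoLakeTools; CrocoLake's actual schema may differ.

```python
import io

import pandas as pd

# Hypothetical target column order; CrocoLake's real schema may differ.
TARGET_COLUMNS = ["LATITUDE", "LONGITUDE", "TIME", "TEMP", "PSAL"]

# Hypothetical mapping from the source dataset's column names to the target names.
RENAME_MAP = {
    "lat": "LATITUDE",
    "lon": "LONGITUDE",
    "date": "TIME",
    "temperature": "TEMP",
}


def to_common_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Rename source columns, add missing target columns as NaN, and cast dtypes."""
    # reindex keeps only the target columns, in order, and fills absent ones with NaN.
    out = df.rename(columns=RENAME_MAP).reindex(columns=TARGET_COLUMNS)
    out["TIME"] = pd.to_datetime(out["TIME"])
    float_cols = [c for c in TARGET_COLUMNS if c != "TIME"]
    out[float_cols] = out[float_cols].astype("float64")
    return out


# Tiny in-memory stand-in for a single-file CSV dataset (made-up values).
raw_csv = io.StringIO(
    "lat,lon,date,temperature\n"
    "10.5,-45.0,2020-01-01T00:00:00,12.3\n"
    "11.0,-44.5,2020-01-02T12:00:00,12.1\n"
)
converted = to_common_schema(pd.read_csv(raw_csv))

# Writing parquet with pandas needs pyarrow or fastparquet installed:
# converted.to_parquet("example.parquet")
```

Here the source has no salinity column, so `PSAL` is present in the output but filled with NaN, which keeps every converted dataset loadable with the same column set.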

Expected Outcomes

  • CrocoLake's success depends on the number of observations that it can serve in a uniform manner, hence the importance of this project.
  • Each new conversion module is one more example of how to build one, reducing the effort required of future users who want to add their own datasets.
  • CrocoLakeTools is in its early stages, and this project would also provide feedback on its accessibility to new users.

Skills Required

Python (pandas, xarray), git

Additional Background/Issues

The current version of CrocoLakeTools is here.

Mentor(s)

Enrico Milanese ([email protected], @enrico-mi)

Expected Project Size

175 hours

Project Difficulty

Intermediate

@enrico-mi enrico-mi added GSoC25 project idea Designates a proposed project idea labels Feb 10, 2025
@MathewBiddle
Contributor

@enrico-mi can you provide a link to the CrocoLakeTools GitHub repository?

@enrico-mi
Author

Hi @MathewBiddle -- I'm working on publishing it and will share the link by the end of the day

@enrico-mi
Author

@MathewBiddle I have added a link to the repo

@RATED-R-SUNDRAM

Hi @MathewBiddle, @enrico-mi, I am interested in working on this project. My background includes over 4 years of Python (including packages such as pandas, numpy, xarray, and sklearn), Git/GitHub, cloud technologies (Azure), and other data lifecycle management skills (building data pipelines, warehousing, etc.).

I would love to get a head start and some guidance on where to begin with this project.

@RATED-R-SUNDRAM

Hi @MathewBiddle, @enrico-mi, I've brushed up my understanding of xarray and netCDF files, along with certain conventions for FAIR data. If you can provide some guidance on where to begin, I can start contributing to this issue.

@enrico-mi
Author

Hi @RATED-R-SUNDRAM, thank you for your interest in the project. Our application with Google is still pending, so we cannot get started yet. We should have news in a couple of weeks, and if the application is accepted we'll be able to get started then. Thank you for your patience.

@RATED-R-SUNDRAM

RATED-R-SUNDRAM commented Feb 21, 2025

Sure @enrico-mi, looking forward to it.

@Mirandazhu02

Mirandazhu02 commented Mar 26, 2025

Hi @enrico-mi, @david11133,

My name is Miranda Zhu. I'd like to contribute to the CrocoLake project. As a junior studying data science at UC Berkeley, I have a strong foundation in Python and experience working with databases. After reviewing the repository, I have some questions about how best to contribute to the project:

  1. Looking at both the ARGO and GLODAP converters, I notice they handle different data sources and formats – ARGO processes already-converted parquet data with quality filtering, while GLODAP converts directly from CSV. For the CPR dataset with its rich biological parameters, which approach would you recommend as more suitable, and what specific data preprocessing challenges should I anticipate given the biological nature of the data?

  2. The CPR dataset has 422 parameters, which is significantly more than the current datasets in CrocoLake. How would you suggest I approach the parameter mapping strategy to maintain consistency with TRITON standards while accommodating this large number of biological variables?

  3. As a new contributor to CrocoLakeTools, what would be the most effective way to begin my work? Would you recommend first setting up the development environment with the latest code from the 'from-private' branch, exploring one of the existing converters in detail, or perhaps starting with documentation improvements while gaining familiarity with the codebase?

@david11133

david11133 commented Mar 30, 2025

Hi @Mirandazhu02 ,

Just to clarify, I'm not a mentor, but I'm a contributor, just like you! Currently, I'm working on the CPR converter and am happy to share what I know.

To get started, I highly recommend setting up the development environment with the latest develop branch and exploring the existing converters, like GLODAP. This will give you a solid understanding of the workflow. Contributing to documentation or tackling smaller fixes is also a great way to familiarize yourself with the codebase.

If you're interested in adding a new conversion module, I suggest using existing modules, like converterGLODAP.py or converterSprayGliders.py, as references. Following their structure will make the process much easier.

Once Enrico is available, he'll be happy to provide example datasets to help you get started.

Best of luck!

@enrico-mi
Author

Hi @Mirandazhu02 , great to hear of your interest!

  1. Looking at both the ARGO and GLODAP converters, I notice they handle different data sources and formats – ARGO processes already-converted parquet data with quality filtering, while GLODAP converts directly from CSV. For the CPR dataset with its rich biological parameters, which approach would you recommend as more suitable, and what specific data preprocessing challenges should I anticipate given the biological nature of the data?

CPR is also distributed as .csv, so the GLODAP converter should be a good reference to start from.

  2. The CPR dataset has 422 parameters, which is significantly more than the current datasets in CrocoLake. How would you suggest I approach the parameter mapping strategy to maintain consistency with TRITON standards while accommodating this large number of biological variables?

At the moment we're keeping all the parameters, and the mapping is done on the coordinates (latitude, longitude, time). We'll trim the parameters later, as I am waiting on colleagues' feedback on that point.
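One way to picture this coordinates-only mapping is the minimal sketch below: only the latitude, longitude, and time columns are renamed, and every other column is carried over unchanged. The source column names, `COORDINATE_MAP`, and `map_coordinates` are hypothetical; they are not the real CPR headers or the CrocoLakeTools implementation.

```python
import pandas as pd

# Hypothetical source-to-target names for the coordinate columns only.
COORDINATE_MAP = {
    "sample_lat": "LATITUDE",
    "sample_lon": "LONGITUDE",
    "sample_time": "TIME",
}


def map_coordinates(df: pd.DataFrame) -> pd.DataFrame:
    """Rename the coordinate columns and leave all parameter columns untouched."""
    missing = [name for name in COORDINATE_MAP if name not in df.columns]
    if missing:
        raise KeyError(f"missing coordinate columns: {missing}")
    out = df.rename(columns=COORDINATE_MAP)
    out["TIME"] = pd.to_datetime(out["TIME"], utc=True)
    return out


# Made-up rows with two placeholder parameter columns standing in for the many real ones.
sample = pd.DataFrame(
    {
        "sample_lat": [50.1, 50.4],
        "sample_lon": [-4.2, -4.6],
        "sample_time": ["2019-06-01 08:00", "2019-06-01 10:00"],
        "taxon_a": [3, 0],
        "taxon_b": [12, 7],
    }
)
mapped = map_coordinates(sample)
```

Because parameters pass through as they are, trimming them later only requires dropping columns, not revisiting the coordinate handling.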

  3. As a new contributor to CrocoLakeTools, what would be the most effective way to begin my work? Would you recommend first setting up the development environment with the latest code from the 'from-private' branch, exploring one of the existing converters in detail, or perhaps starting with documentation improvements while gaining familiarity with the codebase?

I have added a CONTRIBUTE.md file since you wrote this message, so you can look at that now for more info.
Please note that @david11133 is already working on the CPR dataset at the moment, so I suggest you look at other datasets (e.g. Saildrones or GDP) and reach out over email to discuss what you feel comfortable tackling.
