Open
Labels: good first issue (Good for newcomers)
Description
Background
The AquaData project aims to aggregate and showcase impactful stories from WorldFish initiatives (see Issue #10). Collecting relevant data from diverse sources is an essential part of this work. To support it, we need to develop a scraping tool that can efficiently gather information from a range of online and offline sources.
Objective
The primary goal of this scraping tool is to:
- Automate the process of extracting data from multiple sources, including websites, PDFs, and other document formats.
- Ensure the data collected is accurate, relevant, and formatted for subsequent use in the AquaData story generation tool or any other tool in development.
Key Requirements
- Versatility in Data Sources: The tool should handle a wide range of data sources with different structures and formats.
- Data Quality and Relevance: Implement filters or criteria to ensure the data scraped is pertinent to the AquaData project's objectives.
- Integration with Story Generation Tool: Design the scraper to feed data directly into the AI story generation tool being developed (related to Issue #10, "Improve AI stories generator").
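One way to make these requirements concrete is to agree on a single record shape that every source-specific scraper emits, plus a relevance check applied before anything reaches the story generator. The sketch below is only an illustration under assumptions: the `ScrapedRecord` fields, the `KEYWORDS` set, `is_relevant`, and the JSON Lines hand-off are all hypothetical and not part of any existing AquaData code.

```python
from dataclasses import dataclass, asdict
import json
import re

# Hypothetical keyword list; the real relevance criteria are still to be defined.
KEYWORDS = {"aquaculture", "fisheries", "fish", "nutrition"}


@dataclass
class ScrapedRecord:
    """Common shape for one scraped item, whatever its source (assumed fields)."""
    source: str  # URL or file path the text came from
    title: str
    text: str


def is_relevant(record: ScrapedRecord, keywords=KEYWORDS, min_hits: int = 1) -> bool:
    """Keep a record only if enough distinct keywords appear in its title or text."""
    words = set(re.findall(r"[a-z]+", (record.title + " " + record.text).lower()))
    return len(words & set(keywords)) >= min_hits


def to_json_lines(records) -> str:
    """Serialize the relevant records as JSON Lines, one possible hand-off format."""
    return "\n".join(json.dumps(asdict(r)) for r in records if is_relevant(r))
```

A plain keyword match is deliberately crude. It could be swapped for whatever relevance criteria the project settles on without changing the record shape.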
Proposed Actions
- Tool Design and Architecture: Outline the architecture of the scraping tool, including the technology stack and the approach for handling different data sources.
- Development and Testing: Begin coding the tool, then test it rigorously against a variety of data sources to ensure reliability and efficiency. This phase will include:
  - Developing scraping algorithms for different types of web pages and document formats. I will start with text files, then move on to Excel sheets if feasible.
  - Implementing error handling and data validation mechanisms to maintain the integrity of the scraped data.
- Data Processing and Formatting: Implement features for cleaning and formatting the scraped data in a way that is compatible with the story generation tool.
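The actions above could be prototyped as a small registry that maps each file type to an extractor, with cleaning and validation applied uniformly afterwards. This is a minimal sketch, assuming plain-text files as the first supported source. `ScrapeError`, `EXTRACTORS`, `clean`, `scrape`, and the `min_chars` threshold are hypothetical names chosen for illustration, and no Excel, PDF, or web extractor is implemented here.

```python
from pathlib import Path
import re


class ScrapeError(Exception):
    """Raised when a source cannot be read or fails validation."""


def extract_text_file(path: Path) -> str:
    """Read a plain-text source, replacing undecodable bytes instead of failing."""
    return path.read_text(encoding="utf-8", errors="replace")


# Registry mapping file suffix to extractor. Handlers for Excel, PDF, or HTML
# would be registered here later (none are implemented in this sketch).
EXTRACTORS = {".txt": extract_text_file}


def clean(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def scrape(path_str: str, min_chars: int = 20) -> dict:
    """Extract, clean, and validate one source; raise ScrapeError on any problem."""
    path = Path(path_str)
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        raise ScrapeError(f"no extractor registered for {path.suffix!r}")
    try:
        raw = extractor(path)
    except OSError as exc:
        raise ScrapeError(f"could not read {path}: {exc}") from exc
    text = clean(raw)
    if len(text) < min_chars:  # assumed validation rule: reject near-empty output
        raise ScrapeError(f"{path}: only {len(text)} characters after cleaning")
    return {"source": str(path), "text": text}
```

Keeping extraction separate from cleaning and validation means a new format only needs one new function and one registry entry.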
Resources
- Dataverse metadata from different CGIAR organizations (including WorldFish), updated every week: https://github.com/WorldFishCenter/aquadata.data.mapping/tree/main/inst/dataverse_raw
- Python function to get Dataverse metadata: https://github.com/WorldFishCenter/aquadata.data.mapping/blob/main/inst/python/get_dataverse_metadata.py
- Dataverse API documentation: https://guides.dataverse.org/en/latest/api/
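For orientation, a query against the Dataverse Search API might look like the sketch below. It does not reproduce the repository's `get_dataverse_metadata.py`. The helper names are hypothetical, `base_url` must be set to whichever Dataverse installation is being queried, and the response fields used here (`status`, `data.items`, `name`, `global_id`, `description`) should be checked against the API guide linked above before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url: str, query: str, per_page: int = 10, start: int = 0) -> str:
    """Build a Search API URL restricted to datasets."""
    params = {"q": query, "type": "dataset", "per_page": per_page, "start": start}
    return f"{base_url.rstrip('/')}/api/search?{urlencode(params)}"


def parse_search_response(payload: dict) -> list:
    """Reduce a decoded search response to the few fields a scraper might need."""
    if payload.get("status") != "OK":
        raise ValueError(f"unexpected status: {payload.get('status')!r}")
    items = payload.get("data", {}).get("items", [])
    return [
        {
            "name": item.get("name"),
            "global_id": item.get("global_id"),
            "description": item.get("description", ""),
        }
        for item in items
    ]


def search_datasets(base_url: str, query: str, per_page: int = 10) -> list:
    """Run one search request over the network and return the simplified items."""
    with urlopen(build_search_url(base_url, query, per_page), timeout=30) as resp:
        return parse_search_response(json.load(resp))
```

Splitting URL construction and response parsing from the network call keeps both easy to test offline, for example against the weekly metadata dumps listed above.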