Developing scraping tool #9

@langbart

Background

The AquaData project aims to aggregate and showcase impactful stories (see Issue #10) from WorldFish initiatives. An essential component of this project is the effective collection of relevant data from diverse sources. To support this, we need to develop a scraping tool that can efficiently gather information from various online and offline sources.

Objective

The primary goal of this scraping tool is to:

  1. Automate the process of extracting data from multiple sources, including websites, PDFs, and other document formats.
  2. Ensure the data collected is accurate, relevant, and formatted for subsequent use in the AquaData story generation tool or any other tool in development.

Key Requirements

  • Versatility in Data Sources: The tool should handle a wide range of data sources with different structures and formats.
  • Data Quality and Relevance: Implement filters or criteria to ensure the data scraped is pertinent to the AquaData project's objectives.
  • Integration with Story Generation Tool: Design the scraper to seamlessly feed data into the AI story generation tool being developed (related to Issue #10, "Improve AI stories generator").
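
As a rough illustration of the relevance and integration requirements above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the keyword list, the `Record` fields, the `is_relevant` and `to_jsonl` names, and the choice of JSON Lines as a hand-off format. The actual criteria and the input format expected by the story generation tool are not specified in this issue.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical keyword list; the real relevance criteria are still to be defined.
KEYWORDS = ("aquaculture", "fisheries", "fish")


@dataclass
class Record:
    source: str  # where the text came from (URL or file path)
    text: str    # extracted content


def is_relevant(text: str, keywords=KEYWORDS, min_hits: int = 1) -> bool:
    """Return True if at least `min_hits` of the keywords occur in the text."""
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw in lowered)
    return hits >= min_hits


def to_jsonl(records) -> str:
    """Serialize only the relevant records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(r)) for r in records if is_relevant(r.text))
```

A plain substring match is deliberately simple. It could later be swapped for something stricter (whole-word matching, per-source rules) without changing the shape of the output.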

Proposed Actions

  1. Tool Design and Architecture: Outline the architecture of the scraping tool, including the technology stack and the approach for handling different data sources.
  2. Development and Testing: Begin coding the tool, followed by rigorous testing with various data sources to ensure reliability and efficiency. This phase will include:
    • Developing scraping algorithms for different types of web pages and document formats. I will start with text files, then move on to Excel sheets if feasible.
    • Implementing error handling and data validation mechanisms to maintain the integrity of the scraped data.
  3. Data Processing and Formatting: Implement features for cleaning and formatting the scraped data in a way that is compatible with the story generation tool.
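
The steps above could start from something like the following Python sketch, which covers only plain-text files. It is one possible design, not a decided architecture: the `EXTRACTORS` registry, the function names, and the output dictionary are all hypothetical, and the standard library is used throughout so that no particular technology stack is assumed.

```python
import re
from pathlib import Path


def extract_text_file(path: Path) -> str:
    """Read a plain-text file, replacing undecodable bytes instead of failing."""
    return path.read_text(encoding="utf-8", errors="replace")


# Registry mapping file extensions to extractor functions. Only .txt is covered
# here; other formats (e.g. Excel sheets) would register their own function.
EXTRACTORS = {".txt": extract_text_file}


def clean(text: str) -> str:
    """Collapse runs of whitespace into single spaces and strip both ends."""
    return re.sub(r"\s+", " ", text).strip()


def scrape(path_str: str):
    """Return a dict with the cleaned text, or None if the source is unusable."""
    path = Path(path_str)
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        return None  # unsupported format
    try:
        raw = extractor(path)
    except OSError:
        return None  # missing or unreadable file
    text = clean(raw)
    if not text:
        return None  # validation: reject empty documents
    return {"source": str(path), "text": text}
```

Keeping extractors in a registry means a new source type only needs one new function and one new dictionary entry, while error handling, validation, and cleaning stay in a single place.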

Resources
