# aonprd-parse

A comprehensive data processing pipeline for cleaning, parsing, and deduplicating HTML data, and importing it into Memgraph.
## Table of Contents

- [Introduction](#introduction)
- [Features](#features)
- [Project Structure](#project-structure)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)
- [Logging](#logging)
- [Contributing](#contributing)
- [License](#license)
## Introduction

This project processes HTML data through several stages: manual cleaning, decomposition into graph structures, deduplication, and import into Memgraph for advanced querying and analysis. It uses asynchronous programming to improve performance in I/O-bound work.
## Features

- Manual Cleaning: Applies targeted string replacements to clean HTML files.
- Decomposition: Parses HTML into graph structures using BeautifulSoup and NetworkX (see the decomposition sketch below).
- Deduplication: Identifies and removes duplicate HTML files based on hashing and similarity metrics (see the deduplication sketch below).
- CSV Preparation: Converts graph data into CSV files suitable for Memgraph import.
- Memgraph Integration: Imports processed data into Memgraph with defined relationships (see the import sketch below).
- Logging: Comprehensive logging at each processing stage for easy debugging and monitoring.
- Asynchronous Processing: Uses asyncio to speed up I/O-bound operations (illustrated in the deduplication sketch below).
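
To make the decomposition step concrete, here is a minimal sketch of mapping an HTML tag hierarchy onto a directed NetworkX graph with BeautifulSoup. The node-labelling scheme is illustrative only; the project's actual logic lives in `src/decomposing/decomposer.py`.

```python
from bs4 import BeautifulSoup, Tag
import networkx as nx

def html_to_graph(html: str) -> nx.DiGraph:
    """Mirror the HTML tag hierarchy as a directed graph."""
    soup = BeautifulSoup(html, "html.parser")
    graph = nx.DiGraph()

    def add_subtree(node: Tag, parent_id: str | None) -> None:
        node_id = f"{node.name}-{id(node)}"      # unique within one parse
        graph.add_node(node_id, tag=node.name, attrs=dict(node.attrs))
        if parent_id is not None:
            graph.add_edge(parent_id, node_id)   # parent -> child edge
        for child in node.children:
            if isinstance(child, Tag):           # skip text and comment nodes
                add_subtree(child, node_id)

    root = soup.find()                           # outermost tag, if any
    if root is not None:
        add_subtree(root, None)
    return graph

g = html_to_graph("<html><body><p>hello</p></body></html>")
print(g.number_of_nodes(), g.number_of_edges())  # 3 2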
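The deduplication and asynchronous-processing features can be combined in one picture. A hedged sketch, assuming exact duplicates are grouped by content hash and file reads are offloaded to threads so the event loop stays responsive; the real pipeline also applies similarity metrics for near-duplicates, which this sketch omits.

```python
import asyncio
import hashlib
from collections import defaultdict
from pathlib import Path

async def hash_file(path: Path) -> tuple[Path, str]:
    # Offload blocking file I/O to a thread so the event loop stays free.
    data = await asyncio.to_thread(path.read_bytes)
    return path, hashlib.sha256(data).hexdigest()

async def find_exact_duplicates(directory: Path) -> dict[str, list[Path]]:
    # Hash all HTML files concurrently, then group paths by digest.
    tasks = [hash_file(p) for p in directory.glob("*.html")]
    groups: dict[str, list[Path]] = defaultdict(list)
    for path, digest in await asyncio.gather(*tasks):
        groups[digest].append(path)
    # Keep only digests shared by more than one file.
    return {d: ps for d, ps in groups.items() if len(ps) > 1}

if __name__ == "__main__":
    dupes = asyncio.run(find_exact_duplicates(Path("data/raw_html_data")))
    for digest, paths in dupes.items():
        print(digest[:12], [p.name for p in paths])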
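For the import stage, Memgraph provides a `LOAD CSV` Cypher clause and speaks the Bolt protocol, so it can be driven from Python with the `neo4j` driver. This is a hedged sketch only: the file path, node label, and column names are placeholders rather than the actual output schema of `csv_prep.py`, and the CSV must be readable by the Memgraph server itself (e.g. mounted into its container).

```python
from neo4j import GraphDatabase

# Memgraph's default Bolt endpoint; adjust host, port, and auth as needed.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))
with driver.session() as session:
    # Placeholder path and schema: use the CSVs produced under
    # data/import_files/ and the labels defined by the import scripts.
    session.run(
        'LOAD CSV FROM "/import/nodes.csv" WITH HEADER AS row '
        "CREATE (:Node {id: row.id, tag: row.tag})"
    )
driver.close()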
## Project Structure

```
aonprd-parse/
├── config/
│   ├── config.py
│   └── config.yaml
├── src/
│   ├── cleaning/
│   │   ├── __init__.py
│   │   ├── manual_cleaning.py
│   │   └── cleaner.py
│   ├── decomposing/
│   │   ├── __init__.py
│   │   ├── decomposer.py
│   │   └── condense_decomposition.py
│   ├── importing/
│   │   ├── __init__.py
│   │   ├── csv_prep.py
│   │   └── memgraph.py
│   ├── processing/
│   │   ├── __init__.py
│   │   └── unwrap.py
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── logging.py
│   │   ├── file_operations.py
│   │   └── data_handling.py
│   ├── __init__.py
│   └── process.py
├── data/
│   ├── raw_html_data/
│   ├── manual_cleaned_html_data/
│   ├── decomposed/
│   ├── condensed/
│   ├── processed/
│   └── import_files/
├── logs/
├── .gitignore
├── README.md
├── requirements.txt
└── pytest.ini
```
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/aonprd-parse.git
   cd aonprd-parse
   ```

2. Set up a virtual environment:

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
## Usage

Run the main processing script to execute the entire pipeline:

```bash
python src/process.py
```

Note: Ensure that all required directories and database files exist before running the scripts.
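
A small pre-flight check along the lines of that note, using the directory names from the project structure above (the list is illustrative, and database files are not checked here; adjust to your configuration):

```python
from pathlib import Path

# Directories the pipeline expects, per the project structure above.
required = [
    "data/raw_html_data", "data/manual_cleaned_html_data",
    "data/decomposed", "data/condensed", "data/processed",
    "data/import_files", "logs",
]
missing = [d for d in required if not Path(d).is_dir()]
if missing:
    raise SystemExit(f"Missing directories: {missing}")
print("All required directories present.")
```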
## Configuration

All configuration parameters are centralized in `config/config.py` and `config/config.yaml`. Adjust paths, logging configurations, and processing limits there as needed.
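
For reference, a YAML config of this shape is typically read with PyYAML; the keys shown are hypothetical, so consult `config/config.py` for the real schema:

```python
import yaml

# Load the project's YAML configuration file.
with open("config/config.yaml") as f:
    config = yaml.safe_load(f)

# Hypothetical keys, for illustration only.
print(config.get("paths"))
print(config.get("logging"))
```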
## Logging

Logs are stored in the `logs/` directory. Each script writes its own log file for easy tracking, and log levels can be adjusted in the configuration files.
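
The per-script pattern might look like the following minimal sketch using the standard library; `src/utils/logging.py` may differ in format and handler details:

```python
import logging
from pathlib import Path

def get_logger(name: str, log_dir: Path = Path("logs")) -> logging.Logger:
    """Return a logger that writes to its own file under logs/."""
    log_dir.mkdir(exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)  # adjustable via the configuration files
    if not logger.handlers:        # avoid duplicate handlers on reuse
        handler = logging.FileHandler(log_dir / f"{name}.log")
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger

logger = get_logger("manual_cleaning")
logger.info("stage started")
```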
## Contributing

Contributions are welcome! Please fork the repository and submit a pull request.
## License

This project is licensed under the MIT License.