# aonprd-parse

A comprehensive data processing pipeline for cleaning, parsing, and deduplicating HTML data, and importing it into Memgraph.
## Table of Contents

- [Introduction](#introduction)
- [Features](#features)
- [Project Structure](#project-structure)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)
- [Logging](#logging)
- [Contributing](#contributing)
- [License](#license)
## Introduction

This project processes HTML data through several stages: manual cleaning, decomposition into graph structures, deduplication, and import into Memgraph for advanced querying and analysis. It uses asynchronous programming to improve performance in I/O-bound work.
## Features

- Manual Cleaning: Applies targeted string replacements to clean HTML files.
- Decomposition: Parses HTML into graph structures using BeautifulSoup and NetworkX (see the decomposition sketch below).
- Deduplication: Identifies and removes duplicate HTML files based on hashing and similarity metrics (see the deduplication sketch below).
- CSV Preparation: Converts graph data into CSV files suitable for Memgraph import.
- Memgraph Integration: Imports processed data into Memgraph with defined relationships (see the import sketch below).
- Logging: Comprehensive logging at each processing stage for easy debugging and monitoring.
- Asynchronous Processing: Uses asyncio to speed up I/O-bound operations (illustrated in the deduplication sketch below).
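
To make the decomposition step concrete, here is a minimal sketch of mapping an HTML tag hierarchy onto a directed NetworkX graph with BeautifulSoup. The node-labelling scheme is illustrative only; the project's actual logic lives in `src/decomposing/decomposer.py`.

```python
from bs4 import BeautifulSoup, Tag
import networkx as nx

def html_to_graph(html: str) -> nx.DiGraph:
    """Mirror the HTML tag hierarchy as a directed graph."""
    soup = BeautifulSoup(html, "html.parser")
    graph = nx.DiGraph()

    def add_subtree(node: Tag, parent_id: str | None) -> None:
        node_id = f"{node.name}-{id(node)}"      # unique within one parse
        graph.add_node(node_id, tag=node.name, attrs=dict(node.attrs))
        if parent_id is not None:
            graph.add_edge(parent_id, node_id)   # parent -> child edge
        for child in node.children:
            if isinstance(child, Tag):           # skip text and comment nodes
                add_subtree(child, node_id)

    root = soup.find()                           # outermost tag, if any
    if root is not None:
        add_subtree(root, None)
    return graph

g = html_to_graph("<html><body><p>hello</p></body></html>")
print(g.number_of_nodes(), g.number_of_edges())  # 3 2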
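The deduplication and asynchronous-processing features can be combined in one picture. A hedged sketch, assuming exact duplicates are grouped by content hash and file reads are offloaded to threads so the event loop stays responsive; the real pipeline also applies similarity metrics for near-duplicates, which this sketch omits.

```python
import asyncio
import hashlib
from collections import defaultdict
from pathlib import Path

async def hash_file(path: Path) -> tuple[Path, str]:
    # Offload blocking file I/O to a thread so the event loop stays free.
    data = await asyncio.to_thread(path.read_bytes)
    return path, hashlib.sha256(data).hexdigest()

async def find_exact_duplicates(directory: Path) -> dict[str, list[Path]]:
    # Hash all HTML files concurrently, then group paths by digest.
    tasks = [hash_file(p) for p in directory.glob("*.html")]
    groups: dict[str, list[Path]] = defaultdict(list)
    for path, digest in await asyncio.gather(*tasks):
        groups[digest].append(path)
    # Keep only digests shared by more than one file.
    return {d: ps for d, ps in groups.items() if len(ps) > 1}

if __name__ == "__main__":
    dupes = asyncio.run(find_exact_duplicates(Path("data/raw_html_data")))
    for digest, paths in dupes.items():
        print(digest[:12], [p.name for p in paths])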
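For the import stage, Memgraph provides a `LOAD CSV` Cypher clause and speaks the Bolt protocol, so it can be driven from Python with the `neo4j` driver. This is a hedged sketch only: the file path, node label, and column names are placeholders rather than the actual output schema of `csv_prep.py`, and the CSV must be readable by the Memgraph server itself (e.g. mounted into its container).

```python
from neo4j import GraphDatabase

# Memgraph's default Bolt endpoint; adjust host, port, and auth as needed.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))
with driver.session() as session:
    # Placeholder path and schema: use the CSVs produced under
    # data/import_files/ and the labels defined by the import scripts.
    session.run(
        'LOAD CSV FROM "/import/nodes.csv" WITH HEADER AS row '
        "CREATE (:Node {id: row.id, tag: row.tag})"
    )
driver.close()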
## Project Structure

```
aonprd-parse/
├── config/
│   ├── config.py
│   └── config.yaml
├── src/
│   ├── cleaning/
│   │   ├── __init__.py
│   │   ├── manual_cleaning.py
│   │   └── cleaner.py
│   ├── decomposing/
│   │   ├── __init__.py
│   │   ├── decomposer.py
│   │   └── condense_decomposition.py
│   ├── importing/
│   │   ├── __init__.py
│   │   ├── csv_prep.py
│   │   └── memgraph.py
│   ├── processing/
│   │   ├── __init__.py
│   │   └── unwrap.py
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── logging.py
│   │   ├── file_operations.py
│   │   └── data_handling.py
│   ├── __init__.py
│   └── process.py
├── data/
│   ├── raw_html_data/
│   ├── manual_cleaned_html_data/
│   ├── decomposed/
│   ├── condensed/
│   ├── processed/
│   └── import_files/
├── logs/
├── .gitignore
├── README.md
├── requirements.txt
└── pytest.ini
```
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/aonprd-parse.git
   cd aonprd-parse
   ```

2. Set up a virtual environment:

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
## Usage

Run the main processing script to execute the entire pipeline:

```bash
python src/process.py
```

Note: Ensure that all required directories and database files exist before running the scripts.
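
A small pre-flight check along the lines of that note, using the directory names from the project structure above (the list is illustrative, and database files are not checked here; adjust to your configuration):

```python
from pathlib import Path

# Directories the pipeline expects, per the project structure above.
required = [
    "data/raw_html_data", "data/manual_cleaned_html_data",
    "data/decomposed", "data/condensed", "data/processed",
    "data/import_files", "logs",
]
missing = [d for d in required if not Path(d).is_dir()]
if missing:
    raise SystemExit(f"Missing directories: {missing}")
print("All required directories present.")
```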
## Configuration

All configuration parameters are centralized in `config/config.py` and `config/config.yaml`. Adjust paths, logging configurations, and processing limits there as needed.
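
For reference, a YAML config of this shape is typically read with PyYAML; the keys shown are hypothetical, so consult `config/config.py` for the real schema:

```python
import yaml

# Load the project's YAML configuration file.
with open("config/config.yaml") as f:
    config = yaml.safe_load(f)

# Hypothetical keys, for illustration only.
print(config.get("paths"))
print(config.get("logging"))
```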
## Logging

Logs are stored in the `logs/` directory. Each script writes its own log file for easy tracking, and log levels can be adjusted in the configuration files.
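
The per-script pattern might look like the following minimal sketch using the standard library; `src/utils/logging.py` may differ in format and handler details:

```python
import logging
from pathlib import Path

def get_logger(name: str, log_dir: Path = Path("logs")) -> logging.Logger:
    """Return a logger that writes to its own file under logs/."""
    log_dir.mkdir(exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)  # adjustable via the configuration files
    if not logger.handlers:        # avoid duplicate handlers on reuse
        handler = logging.FileHandler(log_dir / f"{name}.log")
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger

logger = get_logger("manual_cleaning")
logger.info("stage started")
```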
## Contributing

Contributions are welcome! Please fork the repository and submit a pull request.
## License

This project is licensed under the MIT License.