This repository contains a collection of logical fallacy datasets, tools for generating synthetic data, and resources for fine-tuning language models on logical fallacy detection and generation.
The organic datasets are sourced from various external projects.
Synthetic datasets are generated from the organic datasets to expand the number of examples for each fallacy category. The `generate_synthetic_data.py` script is used for this purpose.
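The actual generation logic in `generate_synthetic_data.py` is not reproduced here; as a minimal sketch, synthetic expansion can be done by filling category-specific templates with sampled entities. The template strings, names, and topics below are hypothetical placeholders, not content from the repository.

```python
import random

# Hypothetical templates; the real script may instead prompt an LLM per category.
TEMPLATES = {
    "ad_hominem": [
        "Don't trust {name}'s argument about {topic}; {name} is clearly biased.",
        "{name}'s point on {topic} is worthless, given who {name} is.",
    ],
    "slippery_slope": [
        "If we allow {topic}, before long everything else will fall apart too.",
    ],
}

def generate_synthetic_sentences(category: str, n: int, seed: int = 0) -> list[str]:
    """Expand one fallacy category into n synthetic example sentences."""
    rng = random.Random(seed)  # seeded for reproducible datasets
    names = ["Alice", "Bob", "Dr. Smith"]
    topics = ["the new policy", "climate change", "the budget"]
    sentences = []
    for _ in range(n):
        template = rng.choice(TEMPLATES[category])
        sentences.append(template.format(name=rng.choice(names),
                                         topic=rng.choice(topics)))
    return sentences
```

Seeding the random generator keeps regenerated datasets identical across runs, which makes downstream validation and diffing easier.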
The training datasets consist of articles generated from the synthetic sentences. Due to computational constraints, only 3 categories of articles have been fully generated. The training data is stored in JSONL format in the `data/training/` directory.
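JSONL stores one JSON object per line, which makes large training sets easy to stream and append to. A minimal reader/writer sketch is below; the `category`/`article` field names are assumptions for illustration, not the repository's actual schema.

```python
import json

def write_jsonl(path, records):
    """Write one JSON object per line (the JSONL convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical record shape for a training example:
# {"category": "slippery_slope", "article": "Full generated article text..."}
```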
- `generate_synthetic_data.py`: Main script for generating synthetic fallacy sentences and articles.
- `validate_dataset.py`: Script to validate the generated datasets.
- `check_status.py`: Script to check the status of fine-tuning jobs, create files, and test the model.
The generated datasets are used to fine-tune a LLaMA 2 or LLaMA 3 model. The fine-tuning process was performed using Anyscale.
- Clone the repository: `git clone https://github.com/kuwrom/fallacy_detection.git`
- Install the required dependencies: `pip install -r requirements.txt`
- Generate synthetic data: `python generate_synthetic_data.py`
- Validate the generated dataset: `python validate_dataset.py`
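The checks performed by `validate_dataset.py` are not documented here; a plausible minimal version validates that every JSONL line parses and carries the required fields. The `category`/`text` schema below is an assumption for illustration.

```python
import json

# Hypothetical schema; the real validate_dataset.py may check different fields.
REQUIRED_KEYS = {"category", "text"}

def validate_jsonl_lines(lines):
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for i, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            rec = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {i}: invalid JSON ({exc.msg})")
            continue
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            errors.append(f"line {i}: missing keys {sorted(missing)}")
        elif not str(rec["text"]).strip():
            errors.append(f"line {i}: empty text")
    return errors
```

Collecting all errors instead of stopping at the first one gives a full report in a single pass over a large generated file.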
- Use the `check_status.py` script for various operations:
  - Create files for fine-tuning
  - Start a fine-tuning job
  - List fine-tuning jobs
  - Retrieve file content
  - Test the fine-tuned model
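The exact command-line interface of `check_status.py` is not shown above; one common way to expose several operations from a single script is `argparse` subcommands. Everything below (subcommand names, the `--model` default, `file_id` argument) is a hypothetical sketch, not the script's real interface.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI mirroring the operations listed in the README."""
    parser = argparse.ArgumentParser(prog="check_status.py")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("create-files", help="Create/upload files for fine-tuning")

    start = sub.add_parser("start", help="Start a fine-tuning job")
    start.add_argument("--model", default="meta-llama/Llama-2-7b-chat-hf",
                       help="Base model to fine-tune (assumed default)")

    sub.add_parser("list", help="List fine-tuning jobs")

    retrieve = sub.add_parser("retrieve", help="Retrieve file content")
    retrieve.add_argument("file_id", help="ID of the file to fetch")

    test = sub.add_parser("test", help="Send a prompt to the fine-tuned model")
    test.add_argument("prompt")
    return parser
```

For example, `python check_status.py retrieve <file_id>` would dispatch to the retrieve operation under this layout.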
The results of the model fine-tuning can be found in the `result.txt` file.
Contributions to expand the dataset or improve the data generation process are welcome. Please submit a pull request or open an issue to discuss proposed changes.
This project builds upon the work of several open-source projects and datasets. We thank the authors and contributors of the original datasets for making their work available.