AI Content Detector & Humanizer

A comprehensive web application that combines AI content detection with text humanization capabilities. Analyze PDF documents for AI-generated content and transform AI-written text into natural, human-like writing while preserving academic integrity.

🚀 Features

🔍 PDF AI Content Detection

  • Advanced AI Detection: Classify text as Human-written, AI-generated, or hybrid content
  • PDF Annotation: Generate color-coded PDFs with visual highlights
  • Sentence-level Analysis: Precise classification at the sentence level
  • Interactive Visualizations: Charts and metrics for content analysis
  • Batch Processing: Handle multiple documents efficiently

✍️ AI Text Humanization

  • Citation Protection: Automatically detect and preserve academic citations
  • Smart Rewriting: Expand contractions, replace synonyms, add transitions
  • Customizable Intensity: Adjust transformation levels with sliders
  • Real-time Metrics: Track word count and sentence count changes
  • Academic Focus: Maintain formal tone while enhancing readability

🛠️ Technologies Used

Core Framework

  • Streamlit - Web application framework
  • Python 3.8+ - Backend programming language

PDF Processing

  • PyMuPDF (fitz) - PDF text extraction and annotation
  • ReportLab - PDF generation and manipulation

Natural Language Processing

  • spaCy - Advanced NLP processing and POS tagging
  • NLTK - Tokenization, stemming, and WordNet integration
  • Transformers - Hugging Face AI model integration

AI & Machine Learning

  • Hugging Face Transformers - Pre-trained AI detection models
  • scikit-learn - Machine learning utilities
  • torch - Deep learning framework

Data Processing & Visualization

  • pandas - Data manipulation and analysis
  • altair - Interactive visualizations and charts
  • NumPy - Numerical computing

Font & Typography

  • DejaVu Sans - Open-source font for PDF annotations
  • Noto Sans - Unicode-compatible font family

📁 Project Structure

AI-Content-Detector-Humanizer/
│
├── main.py                          # Main Streamlit application entry point
├── requirements.txt                 # Python dependencies
├── setup.sh                         # Environment setup script
├── nltk.txt                         # NLTK resource requirements
├── README.md                        # Project documentation
├── .gitignore                       # Git ignore rules
├── Procfile                         # Deployment configuration
├── DejaVuSans.ttf                   # Font file for PDF annotations
├── NotoSans-Regular.ttf             # Unicode-compatible font
│
├── api/                             # HTTP API service
│   └── humanize_api.py              # FastAPI humanizer endpoints
│
├── pages/                           # Streamlit multi-page modules
│   ├── ai_detection.py              # PDF detection and annotation page
│   ├── humanize_text.py             # Text humanization page
│   └── __pycache__/                 # Python bytecode cache
│
├── utils/                           # Utility modules and helpers
│   ├── __init__.py                  # Package initialization
│   ├── ai_detection_utils.py        # AI content classification logic
│   ├── citation_utils.py            # Citation detection and handling
│   ├── humanizer.py                 # Text humanization algorithms
│   ├── model_loaders.py             # ML model loading utilities
│   ├── pdf_utils.py                 # PDF processing functions
│   └── __pycache__/                 # Python bytecode cache
│
└── venv/                            # Python virtual environment (local)

🚀 Installation & Setup

Prerequisites

  • Python 3.8 or higher
  • pip (Python package manager)
  • Git

Step-by-Step Installation

  1. Clone the repository

    git clone https://github.com/your-username/ai-content-detector-humanizer.git
    cd ai-content-detector-humanizer
  2. Set up virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Download NLTK resources

    python -c "import nltk; nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')"
  5. Download spaCy model

    python -m spacy download en_core_web_sm

Quick Setup (Alternative)

Run the setup script:

chmod +x setup.sh
./setup.sh

🎯 Usage

Starting the Application

streamlit run main.py

The application will open in your default browser at http://localhost:8501

PDF AI Content Detection

  1. Navigate to the "PDF Detection & Annotation" page
  2. Upload a PDF document (up to 200MB)
  3. View AI classification results with interactive charts
  4. Download color-coded annotated PDF
  5. Analyze extracted text in the expandable section

AI Text Humanization

  1. Navigate to the "Humanize AI Text" page
  2. Paste your AI-generated text
  3. Adjust synonym replacement and transition probabilities
  4. Click "Humanize Text" to process
  5. View enhanced text with citation protection
  6. Download the humanized result

⚙️ Configuration

Environment Variables

Create a .env file for custom configuration:

HUGGINGFACE_TOKEN=your_hf_token_here
MODEL_CACHE_DIR=./model_cache
MAX_FILE_SIZE=209715200  # 200MB in bytes
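
A minimal sketch of reading these settings at startup, assuming the app uses python-dotenv (the loader code below is illustrative, not the repository's exact implementation):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory, if present

HF_TOKEN = os.environ.get("HUGGINGFACE_TOKEN")  # optional; needed for gated models
MODEL_CACHE_DIR = os.environ.get("MODEL_CACHE_DIR", "./model_cache")
MAX_FILE_SIZE = int(os.environ.get("MAX_FILE_SIZE", 209715200))  # 200MB default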

Model Configuration

The application uses Hugging Face models for AI detection. Configure in utils/model_loaders.py:

DETECTION_MODEL = "model-name"
CONFIDENCE_THRESHOLD = 0.8
BATCH_SIZE = 32
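
Continuing from the constants above, here is a hedged sketch of how they might drive inference: sentences are batched through a Hugging Face text-classification pipeline, and predictions below the confidence threshold fall back to an "uncertain" label (the task name and label handling are assumptions, not the repository's exact code):

from transformers import pipeline

detector = pipeline("text-classification", model=DETECTION_MODEL)

def classify_sentences(sentences):
    # Batched inference; each result is a {"label": ..., "score": ...} dict
    results = detector(sentences, batch_size=BATCH_SIZE, truncation=True)
    return [
        r["label"] if r["score"] >= CONFIDENCE_THRESHOLD else "uncertain"
        for r in results
    ]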

🔧 Advanced Features

Custom Model Integration

Extend AI detection capabilities by modifying utils/ai_detection_utils.py:

def classify_text_custom(text, model_name="your-custom-model"):
    # Implement custom classification logic
    pass
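
One way this hook could be fleshed out, assuming the custom model is a standard Hugging Face text-classification checkpoint (the default model name remains a placeholder):

from transformers import pipeline

def classify_text_custom(text, model_name="your-custom-model"):
    # Build a text-classification pipeline for the chosen checkpoint
    clf = pipeline("text-classification", model=model_name)
    # Returns a list of {"label": ..., "score": ...} dicts; labels depend on the model
    return clf(text, truncation=True)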

Citation Style Support

Add new citation patterns in utils/citation_utils.py:

CITATION_PATTERNS = {
    'apa': r'your-regex-pattern',
    'mla': r'your-regex-pattern',
    'chicago': r'your-regex-pattern'
}
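
As illustrative starting points (not the repository's tested patterns), simple regexes for common inline citation shapes might look like this:

CITATION_PATTERNS = {
    'apa': r'\([A-Z][A-Za-z\-]+(?: et al\.)?,\s*\d{4}[a-z]?\)',   # (Smith et al., 2020)
    'mla': r'\([A-Z][A-Za-z\-]+\s+\d+\)',                         # (Smith 123)
    'chicago': r'\([A-Z][A-Za-z\-]+\s+\d{4}\)',                   # (Smith 2020), author-date
}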

📊 Performance Optimization

Caching Strategies

The application implements Streamlit caching for:

  • Model loading and inference (see the sketch after this list)
  • PDF processing operations
  • Text humanization results
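
For the model-loading case, a minimal sketch using Streamlit's resource cache looks like this (the loader name is illustrative; the decorator is the point):

import streamlit as st
from transformers import pipeline

@st.cache_resource
def load_detector(model_name: str):
    # Runs once per process; Streamlit reuses the returned object across reruns
    return pipeline("text-classification", model=model_name)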

Memory Management

  • Lazy loading of large models
  • Automatic cleanup of temporary files (example after this list)
  • Efficient batch processing for large documents
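
For temporary files, a context-manager pattern keeps cleanup automatic; this sketch assumes a hypothetical annotate_pdf helper and may differ from the actual pdf_utils.py code:

import tempfile

def annotate_upload(uploaded_bytes):
    # The file is deleted automatically when the with-block exits
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        tmp.write(uploaded_bytes)
        tmp.flush()
        return annotate_pdf(tmp.name)  # hypothetical helper from utils.pdf_utils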

🧪 Testing

Run the test suite:

python -m pytest tests/ -v

Test Coverage

  • PDF text extraction accuracy
  • Citation detection and preservation (example test below)
  • AI classification consistency
  • Text humanization quality
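
As a concrete example, the citation case can be tested against the documented minimal_rewriting signature (the sample text and assertion are illustrative):

from utils.humanizer import minimal_rewriting

def test_citation_survives_humanization():
    text = "Recent studies (Smith et al., 2020) show promising results."
    result = minimal_rewriting(text, p_syn=0.5, p_trans=0.5)
    # The citation must come through verbatim, whatever else is rewritten
    assert "(Smith et al., 2020)" in result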

🐛 Troubleshooting

Common Issues

Issue: "No text could be extracted from PDF" Solution: Ensure PDF contains selectable text, not scanned images

Issue: "spaCy model not found" Solution: Run python -m spacy download en_core_web_sm

Issue: "NLTK resources missing" Solution: Run the NLTK download commands in installation steps

Issue: "Model loading timeout" Solution: Check internet connection and Hugging Face token

📈 REST API Documentation

This repository exposes a small HTTP API for the Humanizer so other services can transform AI-generated text programmatically. The API is implemented with FastAPI and provides interactive OpenAPI documentation at the following paths when the service is running:

  • Swagger UI: http://127.0.0.1:8000/docs
  • ReDoc: http://127.0.0.1:8000/redoc

Base URL (development): http://127.0.0.1:8000

Endpoints

  • GET /health — simple health check that returns { "status": "ok" }.
  • POST /humanize — humanize text and return the rewritten text plus metrics.

POST /humanize

  • Description: Protects citations, expands contractions, optionally replaces synonyms, and can add academic transitional phrases. Returns the final humanized text and word/sentence counts.
  • Request JSON body fields (mirrored in the sketch below):
    • text (string, required): Input text to humanize.
    • p_syn (float, optional, 0.0–1.0): Synonym replacement intensity. Default 0.2.
    • p_trans (float, optional, 0.0–1.0): Academic transition insertion probability. Default 0.2.
    • preserve_linebreaks (bool, optional): Preserve original line breaks. Default true.
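
A hedged sketch of how these fields map onto a FastAPI endpoint backed by the documented minimal_rewriting helper; the actual api/humanize_api.py may differ, and the word-count metric here uses a simple whitespace split for illustration:

from fastapi import FastAPI
from pydantic import BaseModel

from utils.humanizer import minimal_rewriting

app = FastAPI()

class HumanizeRequest(BaseModel):
    text: str
    p_syn: float = 0.2
    p_trans: float = 0.2
    preserve_linebreaks: bool = True

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/humanize")
def humanize(req: HumanizeRequest):
    # preserve_linebreaks handling is omitted in this sketch
    result = minimal_rewriting(req.text, p_syn=req.p_syn, p_trans=req.p_trans)
    return {
        "humanized_text": result,
        "orig_word_count": len(req.text.split()),
        "new_word_count": len(result.split()),
    }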

Example request (curl; the apostrophe in "can't" is shell-escaped):

curl -s -X POST "http://127.0.0.1:8000/humanize" \
   -H "Content-Type: application/json" \
   -d '{"text": "Recent studies (Smith et al., 2020) show promising results. It can'\''t be ignored.", "p_syn": 0.3, "p_trans": 0.2, "preserve_linebreaks": true}'

Example response (truncated):

{
   "humanized_text": "Moreover, Recent studies (Smith et al., 2020) show promising results. It cannot be ignored.",
   "orig_word_count": 11,
   "orig_sentence_count": 2,
   "new_word_count": 13,
   "new_sentence_count": 3,
   "words_added": 2,
   "sentences_added": 1
}

Running the API locally

  1. Install dependencies (ensure fastapi and uvicorn are present in requirements.txt):

     pip install -r requirements.txt
  2. Start the API server (development):

     python -m uvicorn api.humanize_api:app --host 127.0.0.1 --port 8000 --reload
  3. Open the interactive docs at http://127.0.0.1:8000/docs to try the endpoint with built-in examples.

Programmatic usage (Python example):

import requests

payload = {
    "text": "Recent studies (Smith et al., 2020) show promising results. It can't be ignored.",
    "p_syn": 0.3,
    "p_trans": 0.2,
    "preserve_linebreaks": True,
}

# POST to the local development server and surface HTTP errors early
r = requests.post("http://127.0.0.1:8000/humanize", json=payload)
r.raise_for_status()
print(r.json()["humanized_text"])

Custom Integration

The utility modules can still be imported for in-process usage (no HTTP):

from utils.ai_detection_utils import classify_text_hf
from utils.humanizer import minimal_rewriting

# AI Detection
classification_map, percentages = classify_text_hf(text)

# Text Humanization
humanized_text = minimal_rewriting(text, p_syn=0.2, p_trans=0.2)

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

Code Style

  • Follow PEP 8 guidelines
  • Use type hints where possible
  • Include docstrings for all functions
  • Write comprehensive tests

🙏 Acknowledgments

  • Hugging Face for pre-trained models and transformers library
  • Streamlit for the excellent web application framework
  • spaCy and NLTK for NLP capabilities
  • PyMuPDF team for robust PDF processing
  • Altair for beautiful visualizations

📞 Support

For support and questions:

  • Create an issue on GitHub
  • Check the documentation
  • Review troubleshooting section

🔮 Roadmap

  • Multi-language support
  • Additional citation styles
  • Real-time collaboration features
  • Advanced AI model fine-tuning
  • Mobile application
  • API service deployment ✅
  • Plugin system for extensibility

Built with ❤️ for the open-source community

Report Bug · Request Feature
