AI Content Detector & Humanizer

A comprehensive web application that combines AI content detection with text humanization capabilities. Analyze PDF documents for AI-generated content and transform AI-written text into natural, human-like writing while preserving academic integrity.

🚀 Features

🔍 PDF AI Content Detection

  • Advanced AI Detection: Classify text as Human-written, AI-generated, or hybrid content
  • PDF Annotation: Generate color-coded PDFs with visual highlights
  • Sentence-level Analysis: Precise classification at the sentence level
  • Interactive Visualizations: Charts and metrics for content analysis
  • Batch Processing: Handle multiple documents efficiently

✍️ AI Text Humanization

  • Citation Protection: Automatically detect and preserve academic citations
  • Smart Rewriting: Expand contractions, replace synonyms, add transitions
  • Customizable Intensity: Adjust transformation levels with sliders
  • Real-time Metrics: Track word count and sentence count changes
  • Academic Focus: Maintain formal tone while enhancing readability

🛠️ Technologies Used

Core Framework

  • Streamlit - Web application framework
  • Python 3.8+ - Backend programming language

PDF Processing

  • PyMuPDF (fitz) - PDF text extraction and annotation
  • ReportLab - PDF generation and manipulation

Natural Language Processing

  • spaCy - Advanced NLP processing and POS tagging
  • NLTK - Tokenization, stemming, and WordNet integration
  • Transformers - Hugging Face AI model integration

AI & Machine Learning

  • Hugging Face Transformers - Pre-trained AI detection models
  • scikit-learn - Machine learning utilities
  • torch - Deep learning framework

Data Processing & Visualization

  • pandas - Data manipulation and analysis
  • altair - Interactive visualizations and charts
  • NumPy - Numerical computing

Font & Typography

  • DejaVu Sans - Open-source font for PDF annotations
  • Noto Sans - Unicode-compatible font family

📁 Project Structure

AI-Content-Detector-Humanizer/
│
├── main.py                          # Main Streamlit application entry point
├── requirements.txt                 # Python dependencies
├── setup.sh                         # Environment setup script
├── nltk.txt                         # NLTK resource requirements
├── README.md                        # Project documentation
├── .gitignore                       # Git ignore rules
├── Procfile                         # Deployment configuration
├── DejaVuSans.ttf                   # Font file for PDF annotations
├── NotoSans-Regular.ttf             # Unicode-compatible font
│
├── api/                             # HTTP API service
│   └── humanize_api.py              # FastAPI humanizer endpoints
│
├── pages/                           # Streamlit multi-page modules
│   ├── ai_detection.py              # PDF detection and annotation page
│   ├── humanize_text.py             # Text humanization page
│   └── __pycache__/                 # Python bytecode cache
│
├── utils/                           # Utility modules and helpers
│   ├── __init__.py                  # Package initialization
│   ├── ai_detection_utils.py        # AI content classification logic
│   ├── citation_utils.py            # Citation detection and handling
│   ├── humanizer.py                 # Text humanization algorithms
│   ├── model_loaders.py             # ML model loading utilities
│   ├── pdf_utils.py                 # PDF processing functions
│   └── __pycache__/                 # Python bytecode cache
│
└── venv/                            # Python virtual environment (local)

🚀 Installation & Setup

Prerequisites

  • Python 3.8 or higher
  • pip (Python package manager)
  • Git

Step-by-Step Installation

  1. Clone the repository

    git clone https://github.com/your-username/ai-content-detector-humanizer.git
    cd ai-content-detector-humanizer
  2. Set up virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Download NLTK resources

    python -c "import nltk; nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')"
  5. Download spaCy model

    python -m spacy download en_core_web_sm

Quick Setup (Alternative)

Run the setup script:

chmod +x setup.sh
./setup.sh

🎯 Usage

Starting the Application

streamlit run main.py

The application will open in your default browser at http://localhost:8501

PDF AI Content Detection

  1. Navigate to the "PDF Detection & Annotation" page
  2. Upload a PDF document (up to 200MB)
  3. View AI classification results with interactive charts
  4. Download color-coded annotated PDF
  5. Analyze extracted text in the expandable section

AI Text Humanization

  1. Navigate to the "Humanize AI Text" page
  2. Paste your AI-generated text
  3. Adjust synonym replacement and transition probabilities
  4. Click "Humanize Text" to process
  5. View enhanced text with citation protection
  6. Download the humanized result

⚙️ Configuration

Environment Variables

Create a .env file for custom configuration:

HUGGINGFACE_TOKEN=your_hf_token_here
MODEL_CACHE_DIR=./model_cache
MAX_FILE_SIZE=209715200  # 200MB in bytes
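
A minimal sketch of reading these settings at startup, assuming the app uses python-dotenv (the loader code below is illustrative, not the repository's exact implementation):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory, if present

HF_TOKEN = os.environ.get("HUGGINGFACE_TOKEN")  # optional; needed for gated models
MODEL_CACHE_DIR = os.environ.get("MODEL_CACHE_DIR", "./model_cache")
MAX_FILE_SIZE = int(os.environ.get("MAX_FILE_SIZE", 209715200))  # 200MB default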

Model Configuration

The application uses Hugging Face models for AI detection. Configure in utils/model_loaders.py:

DETECTION_MODEL = "model-name"
CONFIDENCE_THRESHOLD = 0.8
BATCH_SIZE = 32
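
Continuing from the constants above, here is a hedged sketch of how they might drive inference: sentences are batched through a Hugging Face text-classification pipeline, and predictions below the confidence threshold fall back to an "uncertain" label (the task name and label handling are assumptions, not the repository's exact code):

from transformers import pipeline

detector = pipeline("text-classification", model=DETECTION_MODEL)

def classify_sentences(sentences):
    # Batched inference; each result is a {"label": ..., "score": ...} dict
    results = detector(sentences, batch_size=BATCH_SIZE, truncation=True)
    return [
        r["label"] if r["score"] >= CONFIDENCE_THRESHOLD else "uncertain"
        for r in results
    ]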

🔧 Advanced Features

Custom Model Integration

Extend AI detection capabilities by modifying utils/ai_detection_utils.py:

def classify_text_custom(text, model_name="your-custom-model"):
    # Implement custom classification logic
    pass
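
One way this hook could be fleshed out, assuming the custom model is a standard Hugging Face text-classification checkpoint (the default model name remains a placeholder):

from transformers import pipeline

def classify_text_custom(text, model_name="your-custom-model"):
    # Build a text-classification pipeline for the chosen checkpoint
    clf = pipeline("text-classification", model=model_name)
    # Returns a list of {"label": ..., "score": ...} dicts; labels depend on the model
    return clf(text, truncation=True)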

Citation Style Support

Add new citation patterns in utils/citation_utils.py:

CITATION_PATTERNS = {
    'apa': r'your-regex-pattern',
    'mla': r'your-regex-pattern',
    'chicago': r'your-regex-pattern'
}
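
As illustrative starting points (not the repository's tested patterns), simple regexes for common inline citation shapes might look like this:

CITATION_PATTERNS = {
    'apa': r'\([A-Z][A-Za-z\-]+(?: et al\.)?,\s*\d{4}[a-z]?\)',   # (Smith et al., 2020)
    'mla': r'\([A-Z][A-Za-z\-]+\s+\d+\)',                         # (Smith 123)
    'chicago': r'\([A-Z][A-Za-z\-]+\s+\d{4}\)',                   # (Smith 2020), author-date
}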

📊 Performance Optimization

Caching Strategies

The application implements Streamlit caching for:

  • Model loading and inference (see the sketch after this list)
  • PDF processing operations
  • Text humanization results
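
For the model-loading case, a minimal sketch using Streamlit's resource cache looks like this (the loader name is illustrative; the decorator is the point):

import streamlit as st
from transformers import pipeline

@st.cache_resource
def load_detector(model_name: str):
    # Runs once per process; Streamlit reuses the returned object across reruns
    return pipeline("text-classification", model=model_name)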

Memory Management

  • Lazy loading of large models
  • Automatic cleanup of temporary files (example after this list)
  • Efficient batch processing for large documents
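
For temporary files, a context-manager pattern keeps cleanup automatic; this sketch assumes a hypothetical annotate_pdf helper and may differ from the actual pdf_utils.py code:

import tempfile

def annotate_upload(uploaded_bytes):
    # The file is deleted automatically when the with-block exits
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        tmp.write(uploaded_bytes)
        tmp.flush()
        return annotate_pdf(tmp.name)  # hypothetical helper from utils.pdf_utils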

🧪 Testing

Run the test suite:

python -m pytest tests/ -v

Test Coverage

  • PDF text extraction accuracy
  • Citation detection and preservation (example test below)
  • AI classification consistency
  • Text humanization quality
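
As a concrete example, the citation case can be tested against the documented minimal_rewriting signature (the sample text and assertion are illustrative):

from utils.humanizer import minimal_rewriting

def test_citation_survives_humanization():
    text = "Recent studies (Smith et al., 2020) show promising results."
    result = minimal_rewriting(text, p_syn=0.5, p_trans=0.5)
    # The citation must come through verbatim, whatever else is rewritten
    assert "(Smith et al., 2020)" in result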

🐛 Troubleshooting

Common Issues

Issue: "No text could be extracted from PDF" Solution: Ensure PDF contains selectable text, not scanned images

Issue: "spaCy model not found" Solution: Run python -m spacy download en_core_web_sm

Issue: "NLTK resources missing" Solution: Run the NLTK download commands in installation steps

Issue: "Model loading timeout" Solution: Check internet connection and Hugging Face token

📈 REST API Documentation

This repository exposes a small HTTP API for the Humanizer so other services can transform AI-generated text programmatically. The API is implemented with FastAPI and provides interactive OpenAPI documentation at the following paths when the service is running:

  • Swagger UI: http://127.0.0.1:8000/docs
  • ReDoc: http://127.0.0.1:8000/redoc

Base URL (development): http://127.0.0.1:8000

Endpoints

  • GET /health — simple health check that returns { "status": "ok" }.
  • POST /humanize — humanize text and return the rewritten text plus metrics.

POST /humanize

  • Description: Protects citations, expands contractions, optionally replaces synonyms, and can add academic transitional phrases. Returns the final humanized text and word/sentence counts.
  • Request JSON body fields (mirrored in the sketch below):
    • text (string, required): Input text to humanize.
    • p_syn (float, optional, 0.0–1.0): Synonym replacement intensity. Default 0.2.
    • p_trans (float, optional, 0.0–1.0): Academic transition insertion probability. Default 0.2.
    • preserve_linebreaks (bool, optional): Preserve original line breaks. Default true.
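
A hedged sketch of how these fields map onto a FastAPI endpoint backed by the documented minimal_rewriting helper; the actual api/humanize_api.py may differ, and the word-count metric here uses a simple whitespace split for illustration:

from fastapi import FastAPI
from pydantic import BaseModel

from utils.humanizer import minimal_rewriting

app = FastAPI()

class HumanizeRequest(BaseModel):
    text: str
    p_syn: float = 0.2
    p_trans: float = 0.2
    preserve_linebreaks: bool = True

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/humanize")
def humanize(req: HumanizeRequest):
    # preserve_linebreaks handling is omitted in this sketch
    result = minimal_rewriting(req.text, p_syn=req.p_syn, p_trans=req.p_trans)
    return {
        "humanized_text": result,
        "orig_word_count": len(req.text.split()),
        "new_word_count": len(result.split()),
    }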

Example request (curl; the apostrophe in "can't" is shell-escaped):

curl -s -X POST "http://127.0.0.1:8000/humanize" \
   -H "Content-Type: application/json" \
   -d '{"text": "Recent studies (Smith et al., 2020) show promising results. It can'\''t be ignored.", "p_syn": 0.3, "p_trans": 0.2, "preserve_linebreaks": true}'

Example response (truncated):

{
   "humanized_text": "Moreover, Recent studies (Smith et al., 2020) show promising results. It cannot be ignored.",
   "orig_word_count": 11,
   "orig_sentence_count": 2,
   "new_word_count": 13,
   "new_sentence_count": 3,
   "words_added": 2,
   "sentences_added": 1
}

Running the API locally

  1. Install dependencies (ensure fastapi and uvicorn are present in requirements.txt):

     pip install -r requirements.txt
  2. Start the API server (development):

     python -m uvicorn api.humanize_api:app --host 127.0.0.1 --port 8000 --reload
  3. Open the interactive docs at http://127.0.0.1:8000/docs to try the endpoint with built-in examples.

Programmatic usage (Python example):

import requests

payload = {
    "text": "Recent studies (Smith et al., 2020) show promising results. It can't be ignored.",
    "p_syn": 0.3,
    "p_trans": 0.2,
    "preserve_linebreaks": True,
}

# POST to the local development server and surface HTTP errors early
r = requests.post("http://127.0.0.1:8000/humanize", json=payload)
r.raise_for_status()
print(r.json()["humanized_text"])

Custom Integration

The utility modules can still be imported for in-process usage (no HTTP):

from utils.ai_detection_utils import classify_text_hf
from utils.humanizer import minimal_rewriting

# AI Detection
classification_map, percentages = classify_text_hf(text)

# Text Humanization
humanized_text = minimal_rewriting(text, p_syn=0.2, p_trans=0.2)

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

Code Style

  • Follow PEP 8 guidelines
  • Use type hints where possible
  • Include docstrings for all functions
  • Write comprehensive tests

🙏 Acknowledgments

  • Hugging Face for pre-trained models and transformers library
  • Streamlit for the excellent web application framework
  • spaCy and NLTK for NLP capabilities
  • PyMuPDF team for robust PDF processing
  • Altair for beautiful visualizations

📞 Support

For support and questions:

  • Create an issue on GitHub
  • Check the documentation
  • Review troubleshooting section

🔮 Roadmap

  • Multi-language support
  • Additional citation styles
  • Real-time collaboration features
  • Advanced AI model fine-tuning
  • Mobile application
  • API service deployment ✅
  • Plugin system for extensibility

Built with ❤️ for the open-source community

Report Bug · Request Feature
