A comprehensive web application that combines AI content detection with text humanization capabilities. Analyze PDF documents for AI-generated content and transform AI-written text into natural, human-like writing while preserving academic integrity.
- Advanced AI Detection: Classify text as Human-written, AI-generated, or hybrid content
- PDF Annotation: Generate color-coded PDFs with visual highlights
- Sentence-level Analysis: Precise classification at the sentence level
- Interactive Visualizations: Charts and metrics for content analysis
- Batch Processing: Handle multiple documents efficiently
- Citation Protection: Automatically detect and preserve academic citations
- Smart Rewriting: Expand contractions, replace synonyms, add transitions
- Customizable Intensity: Adjust transformation levels with sliders
- Real-time Metrics: Track word count and sentence count changes
- Academic Focus: Maintain formal tone while enhancing readability
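The citation-protection and contraction-expansion steps above can be sketched as follows. The regex and contraction table here are illustrative assumptions, not the app's actual implementation (the real logic lives in utils/citation_utils.py and utils/humanizer.py):

```python
import re

# Illustrative contraction table (the app's real table is in utils/humanizer.py).
CONTRACTIONS = {"can't": "cannot", "don't": "do not", "it's": "it is"}

# Rough APA-style in-text citation pattern, e.g. "(Smith et al., 2020)".
CITATION_RE = re.compile(r"\([A-Z][A-Za-z]+(?: et al\.)?,\s*\d{4}\)")

def humanize_sketch(text: str) -> str:
    # 1. Protect citations by swapping them for placeholders.
    citations = CITATION_RE.findall(text)
    for i, cite in enumerate(citations):
        text = text.replace(cite, f"__CITE{i}__")
    # 2. Expand contractions (whole-word matches only).
    for short, long in CONTRACTIONS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", long, text)
    # 3. Restore the protected citations untouched.
    for i, cite in enumerate(citations):
        text = text.replace(f"__CITE{i}__", cite)
    return text
```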
- Streamlit - Web application framework
- Python 3.8+ - Backend programming language
- PyMuPDF (fitz) - PDF text extraction and annotation
- ReportLab - PDF generation and manipulation
- spaCy - Advanced NLP processing and POS tagging
- NLTK - Tokenization, stemming, and WordNet integration
- Hugging Face Transformers - Pre-trained models for AI detection and inference
- scikit-learn - Machine learning utilities
- torch - Deep learning framework
- pandas - Data manipulation and analysis
- altair - Interactive visualizations and charts
- NumPy - Numerical computing
- DejaVu Sans - Open-source font for PDF annotations
- Noto Sans - Unicode-compatible font family
AI-Content-Detector-Humanizer/
│
├── main.py                  # Main Streamlit application entry point
├── requirements.txt         # Python dependencies
├── setup.sh                 # Environment setup script
├── nltk.txt                 # NLTK resource requirements
├── README.md                # Project documentation
├── .gitignore               # Git ignore rules
├── Procfile                 # Deployment configuration
├── DejaVuSans.ttf           # Font file for PDF annotations
├── NotoSans-Regular.ttf     # Unicode-compatible font
│
├── pages/                   # Streamlit multi-page modules
│   ├── ai_detection.py      # PDF detection and annotation page
│   ├── humanize_text.py     # Text humanization page
│   └── __pycache__/         # Python bytecode cache
│
├── utils/                   # Utility modules and helpers
│   ├── __init__.py          # Package initialization
│   ├── ai_detection_utils.py  # AI content classification logic
│   ├── citation_utils.py    # Citation detection and handling
│   ├── humanizer.py         # Text humanization algorithms
│   ├── model_loaders.py     # ML model loading utilities
│   ├── pdf_utils.py         # PDF processing functions
│   └── __pycache__/         # Python bytecode cache
│
└── venv/                    # Python virtual environment (local)
- Python 3.8 or higher
- pip (Python package manager)
- Git
- Clone the repository:
  git clone https://github.com/your-username/ai-content-detector-humanizer.git
  cd ai-content-detector-humanizer
- Set up a virtual environment:
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
  pip install -r requirements.txt
- Download NLTK resources:
  python -c "import nltk; nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')"
- Download the spaCy model:
  python -m spacy download en_core_web_sm
Alternatively, run the setup script:
chmod +x setup.sh
./setup.sh

Then launch the app:
streamlit run main.py

The application will open in your default browser at http://localhost:8501.
- Navigate to the "PDF Detection & Annotation" page
- Upload a PDF document (up to 200MB)
- View AI classification results with interactive charts
- Download color-coded annotated PDF
- Analyze extracted text in the expandable section
- Navigate to the "Humanize AI Text" page
- Paste your AI-generated text
- Adjust synonym replacement and transition probabilities
- Click "Humanize Text" to process
- View enhanced text with citation protection
- Download the humanized result
Create a .env file for custom configuration:
HUGGINGFACE_TOKEN=your_hf_token_here
MODEL_CACHE_DIR=./model_cache
MAX_FILE_SIZE=209715200  # 200 MB in bytes

The application uses Hugging Face models for AI detection. Configure them in utils/model_loaders.py:
DETECTION_MODEL = "model-name"
CONFIDENCE_THRESHOLD = 0.8
BATCH_SIZE = 32

Extend AI detection capabilities by modifying utils/ai_detection_utils.py:
def classify_text_custom(text, model_name="your-custom-model"):
# Implement custom classification logic
    pass

Add new citation patterns in utils/citation_utils.py:
CITATION_PATTERNS = {
'apa': r'your-regex-pattern',
'mla': r'your-regex-pattern',
'chicago': r'your-regex-pattern'
}

The application implements Streamlit caching for:
- Model loading and inference
- PDF processing operations
- Text humanization results
- Lazy loading of large models
- Automatic cleanup of temporary files
- Efficient batch processing for large documents
Run the test suite:
python -m pytest tests/ -v

The suite covers:
- PDF text extraction accuracy
- Citation detection and preservation
- AI classification consistency
- Text humanization quality
Issue: "No text could be extracted from PDF"
Solution: Ensure the PDF contains selectable text, not scanned images.

Issue: "spaCy model not found"
Solution: Run python -m spacy download en_core_web_sm

Issue: "NLTK resources missing"
Solution: Run the NLTK download commands from the installation steps.

Issue: "Model loading timeout"
Solution: Check your internet connection and Hugging Face token.
This repository exposes a small HTTP API for the Humanizer so other services can transform AI-generated text programmatically. The API is implemented with FastAPI and provides interactive OpenAPI documentation at the following paths when the service is running:
- Swagger UI: http://127.0.0.1:8000/docs
- ReDoc: http://127.0.0.1:8000/redoc
Base URL (development): http://127.0.0.1:8000
Endpoints
- GET /health: simple health check that returns { "status": "ok" }.
- POST /humanize: humanize text and return the rewritten text plus metrics.
POST /humanize
- Description: Protects citations, expands contractions, optionally replaces synonyms, and can add academic transitional phrases. Returns the final humanized text and word/sentence counts.
- Request JSON body fields:
- text (string, required): Input text to humanize.
- p_syn (float, optional, 0.0-1.0): Synonym replacement intensity. Default 0.2.
- p_trans (float, optional, 0.0-1.0): Academic transition insertion probability. Default 0.2.
- preserve_linebreaks (bool, optional): Preserve original line breaks. Default true.
Example request (curl):
curl -s -X POST "http://127.0.0.1:8000/humanize" \
-H "Content-Type: application/json" \
  -d '{"text": "Recent studies (Smith et al., 2020) show promising results. It can'\''t be ignored.", "p_syn": 0.3, "p_trans": 0.2, "preserve_linebreaks": true}'

Example response (truncated):
{
"humanized_text": "Moreover, Recent studies (Smith et al., 2020) show promising results. It cannot be ignored.",
"orig_word_count": 11,
"orig_sentence_count": 2,
"new_word_count": 13,
"new_sentence_count": 3,
"words_added": 2,
"sentences_added": 1
}

Running the API locally
- Install dependencies (ensure fastapi and uvicorn are listed in requirements.txt):
  pip install -r requirements.txt
- Start the API server (development):
  python -m uvicorn api.humanize_api:app --host 127.0.0.1 --port 8000 --reload
- Open the interactive docs at http://127.0.0.1:8000/docs to try the endpoint with built-in examples.
Programmatic usage (Python example):
import requests
payload = {
"text": "Recent studies (Smith et al., 2020) show promising results. It can't be ignored.",
"p_syn": 0.3,
"p_trans": 0.2,
"preserve_linebreaks": True,
}
r = requests.post('http://127.0.0.1:8000/humanize', json=payload)
print(r.json()['humanized_text'])

The utility modules can still be imported for in-process usage (no HTTP):
from utils.ai_detection_utils import classify_text_hf
from utils.humanizer import minimal_rewriting
# AI Detection
classification_map, percentages = classify_text_hf(text)
# Text Humanization
humanized_text = minimal_rewriting(text, p_syn=0.2, p_trans=0.2)

We welcome contributions! Please see our Contributing Guidelines for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
- Follow PEP 8 guidelines
- Use type hints where possible
- Include docstrings for all functions
- Write comprehensive tests
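A function meeting these standards (type hints plus a docstring) might look like this; it is a made-up example, not code from this repository:

```python
def words_added(original: str, rewritten: str) -> int:
    """Return the change in whitespace-delimited word count.

    Args:
        original: Text before humanization.
        rewritten: Text after humanization.

    Returns:
        Positive if words were added, negative if removed.
    """
    return len(rewritten.split()) - len(original.split())
```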
- Hugging Face for pre-trained models and transformers library
- Streamlit for the excellent web application framework
- spaCy and NLTK for NLP capabilities
- PyMuPDF team for robust PDF processing
- Altair for beautiful visualizations
For support and questions:
- Create an issue on GitHub
- Check the documentation
- Review troubleshooting section
- Multi-language support
- Additional citation styles
- Real-time collaboration features
- Advanced AI model fine-tuning
- Mobile application
- [x] API service deployment
- Plugin system for extensibility
Built with ❤️ for the open-source community