naman-vefogix/SiteMap-Extractor

Sitemap Extractor API

A Django REST API that extracts URLs from XML sitemaps asynchronously using Celery and Redis.

Features

  • Extract URLs from XML sitemaps
  • Supports sitemap indexes and nested sitemaps
  • Asynchronous processing with Celery
  • Redis-based task queue and caching
  • Task status polling
  • Guest and authenticated user support
  • Pagination for large sitemap results
  • Database persistence of extraction results
  • Automatic expiration of stored results
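The first two features can be illustrated with the standard library alone. The block below is a minimal sketch, not the project's actual parser: `parse_sitemap` is a hypothetical helper that distinguishes a sitemap index from a plain URL set by looking at the root element defined in the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text):
    """Return (kind, locations) for one sitemap document.

    kind is "index" when the document lists nested sitemaps and
    "urlset" otherwise. Hypothetical helper, for illustration only.
    """
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == SITEMAP_NS + "sitemapindex" else "urlset"
    # Every <loc> element holds either a nested sitemap URL or a page URL.
    locations = [
        loc.text.strip()
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text
    ]
    return kind, locations


EXAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>"""

EXAMPLE_URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page-1</loc></url>
</urlset>"""

index_kind, index_locations = parse_sitemap(EXAMPLE_INDEX)
urlset_kind, urlset_locations = parse_sitemap(EXAMPLE_URLSET)
```

When the result is an index, each location would be fetched and parsed in turn, which is how nested sitemaps end up flattened into one list of page URLs.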

Tech Stack

  • Django
  • Django REST Framework
  • Celery
  • Redis
  • SQLite (Development)
  • Docker (Optional)

Installation

Clone Repository

git clone <repository-url>
cd <project-name>

Create Virtual Environment

python -m venv .venv

Activate Virtual Environment

Windows

.venv\Scripts\activate

Linux / Mac

source .venv/bin/activate

Install Dependencies

pip install -r requirements.txt

Environment Variables

Create a .env file:

SECRET_KEY=your-secret-key

DEBUG=True

REDIS_URL=redis://localhost:6379/0
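How the project reads these values is not shown in this README. As a rough sketch, assuming the variables are already present in the process environment (for example, loaded from `.env` by whatever loader the project uses), a settings module could interpret them like this. `load_settings` and the fallback Redis URL are illustrative assumptions.

```python
import os


def load_settings(env=os.environ):
    """Build a settings dict from environment variables (hypothetical helper)."""
    # DEBUG arrives as text, so it has to be converted to a boolean explicitly.
    debug = env.get("DEBUG", "False").strip().lower() in {"1", "true", "yes"}
    return {
        # SECRET_KEY has no safe default, so a missing value raises KeyError.
        "SECRET_KEY": env["SECRET_KEY"],
        "DEBUG": debug,
        # Assumed fallback, taken from the example value above.
        "REDIS_URL": env.get("REDIS_URL", "redis://localhost:6379/0"),
    }


example = load_settings({"SECRET_KEY": "your-secret-key", "DEBUG": "True"})
```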

Database Migration

python manage.py makemigrations
python manage.py migrate

Running Redis

Docker

docker run -d --name redis -p 6379:6379 redis

Verify that the container is running:

docker ps

Running Django

python manage.py runserver

The application is then available at:

http://127.0.0.1:8000/

Running Celery Worker

Linux / Mac

celery -A sitemap worker -l info

Windows

celery -A sitemap worker -l info --pool=solo

API Endpoints

Start Sitemap Extraction

POST /api/sitemap-extractor/

Request:

{
  "url": "https://example.com"
}

Response:

{
  "task_id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "status": "processing"
}
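A client call to this endpoint might look like the sketch below, which uses only the Python standard library. It assumes the development server address shown earlier and that no extra authentication or CSRF headers are needed, which this README does not confirm. The function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # development server address from this README


def build_extraction_request(site_url, base_url=BASE_URL):
    """Prepare the POST request without sending it."""
    body = json.dumps({"url": site_url}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/api/sitemap-extractor/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_extraction(site_url, base_url=BASE_URL):
    """Send the request and return the task_id from the JSON response."""
    request = build_extraction_request(site_url, base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["task_id"]


prepared = build_extraction_request("https://example.com")
```

Calling `start_extraction("https://example.com")` requires Django, Redis, and a Celery worker to be running.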

Check Task Status

GET /api/sitemap-task/<task_id>/

Possible responses:

{
  "status": "PENDING"
}

{
  "status": "STARTED"
}

{
  "status": "FAILURE",
  "error": "Error message"
}
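Because extraction runs in the background, a client has to poll this endpoint until the task finishes. The following is a minimal polling sketch under stated assumptions: `fetch_status` is a hypothetical callable that performs the GET request and returns the decoded JSON, and the interval and attempt limit are arbitrary illustrative values.

```python
import time


def wait_for_task(fetch_status, task_id, interval=2.0, max_attempts=30,
                  sleep=time.sleep):
    """Poll until the task reports SUCCESS or FAILURE.

    fetch_status(task_id) must return the status payload as a dict.
    """
    for _ in range(max_attempts):
        payload = fetch_status(task_id)
        status = payload.get("status")
        if status == "SUCCESS":
            return payload
        if status == "FAILURE":
            raise RuntimeError(payload.get("error", "extraction failed"))
        # PENDING or STARTED: wait before asking again.
        sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} polls")


# Usage with canned responses instead of a live server.
_responses = iter([
    {"status": "PENDING"},
    {"status": "STARTED"},
    {"status": "SUCCESS", "count": 2},
])
final = wait_for_task(lambda task_id: next(_responses), "demo-task",
                      sleep=lambda seconds: None)
```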

Get Paginated Results

GET /api/sitemap-task/<task_id>/?page=1

Response:

{
  "status": "SUCCESS",
  "count": 2500,
  "page": 1,
  "page_size": 100,
  "total_pages": 25,
  "results": [
    "https://example.com/page-1",
    "https://example.com/page-2"
  ]
}
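The fields in this response are related by simple arithmetic: `total_pages` is `count` divided by `page_size`, rounded up. The sketch below reproduces the shape of the response from a plain list. `paginate` is a hypothetical helper, and the default page size of 100 is taken from the example above rather than from the project's code.

```python
import math


def paginate(urls, page=1, page_size=100):
    """Slice a list of URLs into one page shaped like the API response."""
    count = len(urls)
    # An empty result still has one (empty) page.
    total_pages = max(1, math.ceil(count / page_size))
    if not 1 <= page <= total_pages:
        raise ValueError(f"page must be between 1 and {total_pages}")
    start = (page - 1) * page_size
    return {
        "status": "SUCCESS",
        "count": count,
        "page": page,
        "page_size": page_size,
        "total_pages": total_pages,
        "results": urls[start:start + page_size],
    }


all_urls = [f"https://example.com/page-{i}" for i in range(1, 2501)]
first_page = paginate(all_urls, page=1)
last_page = paginate(all_urls, page=25)
```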

Guest vs Authenticated Users

Guest Users

  • Maximum URLs: 1,000
  • Cached responses supported

Authenticated Users

  • Maximum URLs: 100,000
  • Results stored and reused until expiration
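One way to express these limits in code is shown below. This is only a sketch: the README does not say whether results over the limit are truncated or rejected, so truncation is an assumption here, and the names are hypothetical.

```python
GUEST_MAX_URLS = 1_000
AUTHENTICATED_MAX_URLS = 100_000


def url_limit(is_authenticated):
    """Return the maximum number of URLs for this kind of user."""
    return AUTHENTICATED_MAX_URLS if is_authenticated else GUEST_MAX_URLS


def apply_limit(urls, is_authenticated):
    """Keep at most the allowed number of URLs (assumes truncation)."""
    return urls[:url_limit(is_authenticated)]


found = [f"https://example.com/page-{i}" for i in range(1, 1501)]
guest_view = apply_limit(found, is_authenticated=False)
member_view = apply_limit(found, is_authenticated=True)
```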

Caching

Guest user results are cached in Redis under a per-domain key:

sitemap:<domain>

Cache timeout:

1 hour
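The key format and timeout above can be sketched as follows. How the project derives `<domain>` from the submitted URL is not documented here; this example assumes it is the lowercased host part of the URL, and `cache_key` is a hypothetical helper.

```python
from urllib.parse import urlparse

CACHE_TIMEOUT_SECONDS = 60 * 60  # 1 hour, as stated above


def cache_key(site_url):
    """Build the Redis key for a submitted URL (assumes scheme is present)."""
    domain = urlparse(site_url).netloc.lower()
    if not domain:
        raise ValueError("URL must include a scheme, e.g. https://example.com")
    return f"sitemap:{domain}"


key = cache_key("https://Example.com/sitemap.xml")
```

With Django's cache framework backed by Redis, the value would typically be stored with `cache.set(key, urls, timeout=CACHE_TIMEOUT_SECONDS)` and read back with `cache.get(key)`.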

Database Storage

Each extraction stores:

  • User ID
  • IP Address
  • Domain
  • Extracted URLs
  • Total URL Count
  • Status
  • Creation Timestamp
  • Expiration Timestamp

Statuses:

pending
completed
failed
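To make the stored fields concrete, the sketch below models one record as a plain dataclass. It is a stand-in for the project's Django model, which is not shown in this README; the field names and types are assumptions, and the expiration time is passed in explicitly because the retention period is not stated.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class ExtractionRecord:
    """Illustrative stand-in for the stored extraction result."""
    domain: str
    ip_address: str
    expires_at: datetime
    user_id: Optional[int] = None  # None for guest users (assumption)
    urls: List[str] = field(default_factory=list)
    status: str = "pending"  # one of: pending, completed, failed
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    @property
    def total_urls(self):
        return len(self.urls)

    def is_expired(self, now=None):
        """True once the expiration timestamp has passed."""
        now = now or datetime.now(timezone.utc)
        return now >= self.expires_at


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
record = ExtractionRecord(
    domain="example.com",
    ip_address="203.0.113.10",
    expires_at=created + timedelta(hours=1),  # retention chosen for the example
    urls=["https://example.com/page-1", "https://example.com/page-2"],
    status="completed",
    created_at=created,
)
```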

Development Notes

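Restart Celery Worker After Task Changes

A running worker does not pick up code changes on its own, so restart it after editing tasks. If the worker runs in Docker:

docker restart <celery-container>

or, if the worker runs directly on the host, stop it and start it again:

celery -A sitemap worker -P threads -c 8 -l info

Remove Pending Tasks from Celery

celery -A sitemap purge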

Future Improvements

  • Celery Beat cleanup task
  • Rate limiting
  • Export results as CSV
  • Monitoring and metrics

About

sitemap.xml extractor with DRF and AJAX
