Tender Extract API

FastAPI microservice that extracts structured tender data from South African government procurement documents using deterministic extraction. Supports PDF, DOCX, DOC, XLSX, XLS, PPTX, ODT, RTF, CSV, TXT, and ZIP archives.

About Tenders-SA

Tenders-SA.org is an AI-powered tender matching and application platform for South African businesses. It aggregates tenders from national, provincial, and municipal government departments, SOEs (Eskom, Transnet, SANRAL), and public entities — sourced directly from official OCDS (Open Contracting Data Standard) feeds.

Role of This Microservice

Government tender documents are the raw source of truth for procurement opportunities. The Tender Extract API converts unstructured documents into structured, queryable data fields such as:

Description and scope of work
Requirements and eligibility criteria
Tender reference numbers and titles
Closing dates and times
Issuing organizations and departments
Estimated values and bid bond requirements
B-BBEE level requirements and preference points
Contact information and briefing session details
Evaluation criteria and returnable documents
Confidence scores for extraction quality

This service is intentionally fast and deterministic — it uses regex-based extraction and format-specific parsers, not AI/LLM processing. When extraction confidence is low, the main application decides whether to run AI fallback enrichment.

Pipeline Architecture

graph LR
    A[Cloudflare Worker] -->|Raw OCDS + Document URL| B[Main App<br/>Ingestion Service]
    B -->|URL| C[Tender Extract API]
    C -->|Structured JSON| B
    B -->|Mapped Data| D[(Main Database)]
    D --> E[AI Enrichment<br/>(if confidence low)]

Features

Multi-format extraction — PDF, DOCX, DOC, XLSX, XLS, PPTX, ODT, RTF, CSV, TXT
ZIP archive support — Recursive extraction of mixed-format bid packs with result merging
Plugin architecture — Each format is a self-contained extractor module implementing BaseExtractor
Content-type detection — Magic byte sniffing > file extension > MIME type priority chain
Regex-first extraction — Deterministic pattern matching tuned for South African government tender formats
SA tender optimized — Patterns cover CIDB gradings, B-BBEE levels, 80/20 and 90/10 preference systems, MBD/SBD returnable document forms, and SA procurement terminology
Comprehensive field extraction — 20+ structured fields including description, requirements, dates, organization, financials, B-BBEE, contact, briefing sessions, evaluation criteria, and returnable documents
Multiple input modes — File upload (multipart) or URL fetch (JSON)
Confidence scoring — Weighted quality indicator (0.0–1.0) for extraction reliability
Full text passthrough — Always returns raw extracted text for AI fallback in the main application
URL validation — Enforced allowlist for trusted document hosts
Zero AI dependencies — No LLM, no OCR, no external APIs.

Quick Start

Local Development

python -m venv venv
venv\Scripts\activate      # Windows
# source venv/bin/activate  # Linux/Mac

pip install -r requirements.txt
uvicorn app.main:app --reload

Visit http://localhost:8000/docs for interactive Swagger UI.

Docker

docker build -t tender-extract .
docker run -p 8080:8080 tender-extract

API Reference

Health Check

GET /health

{
  "status": "healthy",
  "version": "0.1.0"
}

Readiness Probe

GET /ready

Returns 200 when the extractor is initialized and ready for traffic, 503 during startup.

Extract Tender Data

POST /v1/extract

Extracts structured tender information from a document. Accepts either a file upload or a URL. Supports all formats: PDF, DOCX, DOC, XLSX, XLS, PPTX, ODT, RTF, CSV, TXT, ZIP.

Mode A: File Upload

curl -X POST http://localhost:8000/v1/extract \
  -F "file=@tender.pdf"

Mode B: URL Fetch

curl -X POST http://localhost:8000/v1/extract \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.tenders-sa.org/tenders/abc123/specification.pdf"}'

Response

The response shape is identical regardless of input format:

{
  "description": "Supply and delivery of office stationery and consumables...",
  "requirements": [
    "Valid tax clearance certificate required",
    "B-BBEE Level 1 or 2 contributor preferred"
  ],
  "tender_number": "SCM/2026/001",
  "title": "Supply and Delivery of Office Stationery",
  "closing_date": "31 March 2026",
  "closing_time": "11:00",
  "issuing_organization": "Department of Public Works",
  "department": "Supply Chain Management",
  "estimated_value": "R 5,000,000.00",
  "confidence": 0.87,
  "full_text": "DEPARTMENT OF PUBLIC WORKS...",
  ...
}

Response Fields

Field	Type	Description
`description`	`string`	Cleaned plain-text description / scope of work
`requirements`	`list[string]`	Extracted requirements and eligibility criteria
`tender_number`	`string`	Tender/bid reference number
`title`	`string`	Tender title or name
`closing_date`	`string`	Submission deadline date
`closing_time`	`string`	Submission deadline time
`publication_date`	`string`	Date tender was published
`validity_period`	`string`	Bid validity period
`contract_period`	`string`	Contract duration
`issuing_organization`	`string`	Organization issuing the tender
`department`	`string`	Department within the organization
`delivery_location`	`string`	Delivery location for goods/services
`submission_address`	`string`	Physical address for bid submission
`estimated_value`	`string`	Estimated contract value
`bid_bond_required`	`string`	Bid bond / guarantee requirements
`payment_terms`	`string`	Payment terms
`bbbee`	`BBBEEInfo`	B-BBEE requirements and scoring
`contact`	`ContactInfo`	Contact person details
`briefing_session`	`BriefingSession`	Briefing/site inspection details
`evaluation_criteria`	`string`	Evaluation and scoring criteria
`special_conditions`	`string`	Special conditions of contract
`returnable_documents`	`list[string]`	Documents to be submitted
`document_type`	`string`	Classified document type (RFQ, TENDER, EOI, RFP)
`province`	`string`	Province extracted from document
`contract_type`	`string`	Contract framework (NEC3, JBCC, GCC, FIDIC)
`procurement_threshold`	`string`	NT threshold classification
`evaluation_structured`	`object`	Structured evaluation sub-criteria
`confidence`	`float`	Extraction confidence (0.0–1.0)
`pages_used`	`list[int]`	0-indexed pages/sheets analyzed
`raw_text_preview`	`string`	First 500 chars of extracted text
`full_text`	`string`	Complete extracted text (up to 50KB)

Error Codes

Status	Code	Description
415	`UNSUPPORTED_MEDIA_TYPE`	Unsupported document format (wrong magic bytes, extension, or MIME type)
422	`UNPROCESSABLE_ENTITY`	Validation error — file too large, invalid URL, URL fetch failure
501	`NOT_IMPLEMENTED`	Scanned/image-only PDF detected (OCR not supported)
504	`GATEWAY_TIMEOUT`	URL fetch timed out (configurable, default 90s)
500	`INTERNAL_ERROR`	Unexpected extraction failure
503	`SERVICE_UNAVAILABLE`	Extractor not yet initialized

Configuration

Environment Variable	Default	Description
`MAX_FILE_SIZE_BYTES`	`104857600` (100MB)	Maximum document file size
`URL_FETCH_TIMEOUT_SECONDS`	`90.0`	Timeout for URL fetching
`URL_FETCH_MAX_REDIRECTS`	`5`	Maximum redirects to follow
`ALLOWED_URL_HOSTS`	`etenders-api.tenders-sa.org, docs.tenders-sa.org, www.etenders.gov.za, etenders.gov.za`	Comma-separated trusted document URL hosts
`GUNICORN_WORKERS`	`2`	Gunicorn worker processes
`GUNICORN_TIMEOUT`	`120`	Gunicorn worker timeout (seconds)
`LOG_LEVEL`	`info`	Logging level
`PORT`	`8080`	HTTP port

Deployment

Deployed on AWS EC2. See the main project deployment documentation for infrastructure details.

Docker

docker build -t tender-extract .
docker run -p 8080:8080 tender-extract

Project Structure

tender-extract/
├── app/
│   ├── __init__.py           # Package info
│   ├── main.py               # FastAPI app, routes, URL validation, multi-format routing
│   ├── extractor.py          # Shared data types (ExtractionResult, etc.)
│   ├── schemas.py            # Pydantic request/response models
│   └── extractors/           # Plugin-based extractor modules
│       ├── __init__.py       # ExtractorRegistry — maps extensions/MIME/magic bytes
│       ├── base.py           # BaseExtractor abstract class
│       ├── pdf_extractor.py  # PDF (PyMuPDF)
│       ├── docx_extractor.py # DOCX (python-docx)
│       ├── legacy_doc_extractor.py  # Legacy .doc (antiword/catdoc + binary fallback)
│       ├── xlsx_extractor.py # XLSX (openpyxl)
│       ├── xls_extractor.py  # XLS (xlrd)
│       ├── pptx_extractor.py # PPTX (python-pptx)
│       ├── odt_extractor.py  # ODT (odfpy)
│       ├── rtf_extractor.py  # RTF (striprtf)
│       ├── text_extractor.py # CSV/TXT (stdlib)
│       └── zip_extractor.py  # ZIP (zipfile + recursive dispatch + result merging)
├── docs/
│   ├── API_REFERENCE.md      # Detailed API reference
│   ├── INTEGRATION_GUIDE.md  # Ingestion pipeline integration docs
│   └── openapi.json          # OpenAPI 3.0 specification
├── tests/
│   ├── __init__.py
│   ├── test_extractors.py    # 108 tests covering all extractors + registry
│   ├── test_zip_merge.py     # ZIP merger/reconciliation tests
│   └── test_url_policy.py    # URL allowlist + fetch policy tests
├── Dockerfile                # Multi-stage build for deployment
├── gunicorn.conf.py          # Gunicorn server config
├── requirements.txt          # Python dependencies
└── README.md

Development

Testing

pip install pytest pytest-asyncio
pytest tests/ -v

Currently 108 tests covering all extractors, registry dispatch, ZIP merging, and URL policy.

Adding a New Extractor

Create a new file in app/extractors/ implementing BaseExtractor
Register it in ExtractorRegistry in __init__.py (extension, MIME, and/or magic bytes)
All existing tests must still pass

Design Principles

No AI in the extraction path — Deterministic extraction only. AI enrichment is a separate concern handled by the main application when confidence is low.
SA-specific patterns — All regex patterns are designed and tuned for South African government tender formats.
Plugin architecture — Each format is a self-contained module. Adding a new format never modifies existing extractors.
Graceful degradation — Always returns a result, even for poorly formatted documents. Low-confidence results signal the main app to run AI fallback.

Links

Tenders-SA Platform — Main website
Developer Portal — API keys, docs, and pricing
GitHub — Source code & issues
Support — Email support

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
app		app
docs		docs
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
AGENTS.MD		AGENTS.MD
Dockerfile		Dockerfile
README.md		README.md
fly.toml		fly.toml
gunicorn.conf.py		gunicorn.conf.py
render.yaml		render.yaml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tender Extract API

About Tenders-SA

Role of This Microservice

Pipeline Architecture

Features

Quick Start

Local Development

Docker

API Reference

Health Check

Readiness Probe

Extract Tender Data

Mode A: File Upload

Mode B: URL Fetch

Response

Response Fields

Error Codes

Configuration

Deployment

Docker

Project Structure

Development

Testing

Adding a New Extractor

Design Principles

Links

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Tender Extract API

About Tenders-SA

Role of This Microservice

Pipeline Architecture

Features

Quick Start

Local Development

Docker

API Reference

Health Check

Readiness Probe

Extract Tender Data

Mode A: File Upload

Mode B: URL Fetch

Response

Response Fields

Error Codes

Configuration

Deployment

Docker

Project Structure

Development

Testing

Adding a New Extractor

Design Principles

Links

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages