SFDA Guidelines Crawler

An MVP crawler that collects drug-related guidelines, regulations, and linked PDF documents from the official Saudi Food & Drug Authority (SFDA) website.

What It Collects

Title
Sector/category
Document type
Publication/update date when visible
Page URL
PDF URL when available
Language
Source listing page
Crawl timestamp
Downloaded PDF path
PDF SHA-256 checksum
Extracted text path
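
As an illustration of the checksum field, here is a minimal sketch of computing a SHA-256 digest for a downloaded PDF. The function name sha256_of_file and the chunk size are assumptions for this example, not names taken from the crawler's source.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks.

    Hypothetical helper: the crawler's own implementation may differ.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size blocks so large PDFs are never fully loaded into memory.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Storing the digest next to the PDF path makes it possible to detect when a document at an unchanged URL has been replaced with new content.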

The crawler starts from these listing pages:

https://www.sfda.gov.sa/en/Guide?tags=2
https://www.sfda.gov.sa/en/regulations?tags=2

Note: the SFDA site navigation currently maps the Drugs sector to tags=2, while tags=1 appears there as Food.
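
The sector-to-tag mapping can be sketched as a small lookup that builds the start URLs. The names SECTOR_TAGS, LISTING_PATHS, and start_urls are hypothetical, and the mapping reflects only the two tags mentioned in the note; the site may change it at any time.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.sfda.gov.sa/en"

# Assumption: only the two tags mentioned in the note are known.
SECTOR_TAGS = {"Food": 1, "Drugs": 2}

# The two listing paths the crawler starts from.
LISTING_PATHS = ("Guide", "regulations")


def start_urls(sector: str) -> list[str]:
    """Build the listing URLs for a sector, raising KeyError for unknown sectors."""
    tag = SECTOR_TAGS[sector]
    query = urlencode({"tags": tag})
    return [f"{BASE_URL}/{path}?{query}" for path in LISTING_PATHS]
```

Calling start_urls("Drugs") reproduces the two URLs listed above.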

Setup

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

The activation command above is for Windows. On macOS or Linux, run source .venv/bin/activate instead.

Optionally copy .env.example to .env and adjust polite crawling settings. The current CLI reads environment variables directly.
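
Reading settings straight from the environment might look roughly like the sketch below. The variable names (CRAWLER_USER_AGENT, CRAWLER_REQUEST_DELAY, CRAWLER_TIMEOUT) and the defaults are assumptions for illustration only; check .env.example for the names the CLI actually uses.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlSettings:
    user_agent: str
    request_delay: float  # seconds between requests
    timeout: float        # seconds before a request is abandoned


def load_settings(env=None) -> CrawlSettings:
    """Build settings from environment variables.

    All variable names and default values here are hypothetical.
    """
    env = os.environ if env is None else env
    return CrawlSettings(
        user_agent=env.get("CRAWLER_USER_AGENT", "sfda-guidelines-crawler/0.1"),
        request_delay=float(env.get("CRAWLER_REQUEST_DELAY", "2")),
        timeout=float(env.get("CRAWLER_TIMEOUT", "30")),
    )
```

Passing a plain dict as env keeps the function easy to test without touching the real process environment.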

Run

python -m src.main --sector Drugs --download-pdfs true --extract-text true --max-pages 5

For a very small metadata-only smoke test:

python -m src.main --sector Drugs --download-pdfs false --extract-text false --max-pages 1 --request-delay 2

Outputs are written to:

data/raw_pdfs/
data/extracted_text/
data/sfda_guidelines.csv
data/sfda_guidelines.jsonl
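
The JSONL output holds one JSON object per line, so it can be loaded with the standard library alone. This is a minimal sketch; read_jsonl is a hypothetical helper, and the keys in each record depend on the crawler's actual schema.

```python
import json
from pathlib import Path


def read_jsonl(path) -> list[dict]:
    """Load a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Path taken from the output list above; run a crawl first so the file exists.
    rows = read_jsonl("data/sfda_guidelines.jsonl")
    print(f"{len(rows)} records loaded")
```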

Tests

python -m pytest

Embeddings and Supabase pgvector

Apply the schema in supabase/migrations/202605160001_sfda_pgvector.sql to your Supabase project. The migration creates:

sfda_documents
sfda_document_chunks
match_sfda_guidelines(...) hybrid search RPC

The vector column is extensions.vector(1536) for text-embedding-3-small. If you change OPENAI_EMBEDDING_DIMENSIONS, update the migration to the same dimension before ingesting.
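
A dimension mismatch between the embeddings and the vector column will cause inserts to fail, so a guard before ingestion can be useful. The sketch below is an assumption about how such a check could be written; the helper names are hypothetical and not part of the pipeline.

```python
import os

# Must equal the N in vector(N) from the migration file.
MIGRATION_DIMENSIONS = 1536


def configured_dimensions(env=None) -> int:
    """Read OPENAI_EMBEDDING_DIMENSIONS, defaulting to the migration's dimension."""
    env = os.environ if env is None else env
    return int(env.get("OPENAI_EMBEDDING_DIMENSIONS", MIGRATION_DIMENSIONS))


def check_embedding(vector, expected: int = MIGRATION_DIMENSIONS):
    """Return the vector unchanged, or raise if its length does not match."""
    if len(vector) != expected:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, but the column expects {expected}"
        )
    return vector
```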

Configure secrets in .env:

OPENAI_API_KEY=sk-...
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
OPENAI_EMBEDDING_DIMENSIONS=1536
SUPABASE_URL=https://your-project-ref.supabase.co
SUPABASE_SERVICE_ROLE_KEY=...

Preview chunk counts without API calls:

python -m src.embedding_pipeline --metadata data/sfda_guidelines.csv --dry-run
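
For intuition about what a chunk count means, here is a sketch of fixed-size character chunking with overlap. This is an assumed strategy for illustration; the pipeline's real chunking logic, sizes, and function names may differ.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows (hypothetical strategy)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append(piece)
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk, at the cost of embedding some text twice.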

Generate embeddings and upsert documents/chunks:

python -m src.embedding_pipeline --metadata data/sfda_guidelines.csv --batch-size 16
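
The --batch-size option suggests that chunks are sent to the embeddings API in groups. A minimal sketch of that grouping, using a hypothetical batched helper rather than the pipeline's own code:

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

With 40 chunks and a batch size of 16, this yields three batches of 16, 16, and 8 chunks.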

Hybrid search:

python -m src.search "clinical trial drug submission requirements" --sector Drugs --match-count 5

Use the Supabase service role key only in trusted server-side environments. Do not expose it in browser or client-side apps.

Politeness and Robustness

Reads robots.txt before crawling.
Uses a configurable user agent.
Waits a configurable delay between requests.
Retries failed requests with exponential backoff.
Applies request timeouts.
Deduplicates page URLs and PDF URLs.
Uses safe, normalized filenames.
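
Several of the behaviours above can be sketched with the standard library alone. All function names, default values, and normalization rules below are assumptions for illustration; they are not taken from the crawler's source.

```python
import re
import unicodedata
from urllib.parse import urldefrag
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def backoff_delays(retries: int, base: float = 1.0, factor: float = 2.0,
                   cap: float = 30.0) -> list[float]:
    """Return the wait (in seconds) before each retry, growing exponentially up to cap."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]


def dedupe_urls(urls) -> list[str]:
    """Drop URL fragments, then keep the first occurrence of each URL in order."""
    seen = set()
    unique = []
    for url in urls:
        key = urldefrag(url).url
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def safe_filename(title: str, extension: str = ".pdf", max_length: int = 100) -> str:
    """Turn a document title into a lowercase ASCII slug plus an extension."""
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    # Titles with no ASCII letters or digits fall back to a generic name.
    return (slug[:max_length].rstrip("-") or "document") + extension
```

One caveat of this slug approach: titles written entirely in Arabic all collapse to the fallback name, so a real implementation would likely append a hash or ID to keep filenames unique.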

Limitations

This MVP uses static HTML parsing with httpx and BeautifulSoup. If SFDA moves listing data fully behind JavaScript, add Playwright as a fallback fetcher.
Metadata quality depends on what each listing/detail page exposes consistently.
Text extraction quality varies for scanned PDFs, and image-only PDFs may yield little or no text; OCR is not included.
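
One way to flag PDFs that probably need OCR is to look at how much text was extracted per page. The heuristic below is a rough sketch; looks_scanned and the threshold of 50 characters per page are assumptions for this example, not part of the crawler.

```python
def looks_scanned(extracted_text: str, page_count: int,
                  min_chars_per_page: int = 50) -> bool:
    """Guess whether a PDF is image-only from its extracted text density."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count only visible characters so whitespace-only output is treated as empty.
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible / page_count < min_chars_per_page
```

Documents flagged this way could be queued for a later OCR pass instead of being embedded with near-empty text.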

Next Step: Production Hardening

Add a scheduled crawl job.
Add OCR fallback for scanned PDFs.
Add CI that runs tests on every push.
Add a small API around match_sfda_guidelines for applications.
