A robust, full-stack solution for ingesting PDF files, intelligently extracting Person entities using Google Cloud Document AI, and rigorously validating/cleaning the data for high-quality CSV/Database output.
What is this? This system automates the digitization of unstructured PDF documents containing personal information. It is designed to handle messy inputs (jumbled text, partial addresses, various date formats) and produce clean, standardized records.
What data are we ingesting? The system creates "Person" records containing:
- Full Name (First/Last)
- Mobile Number (Strictly validated Australian `04xxxxxxxx` format)
- Address (Cleaned and normalized)
- Email (Validated)
- Date of Birth & Last Seen (Normalized dates)
- Landline (Validated)
Data enters the system via two primary methods:
- Direct Upload: Users upload PDFs directly through the React Frontend.
- Google Cloud Storage (GCS): The system can process files stored in GCS buckets (supported in `processPDFs`).
Once a PDF is received:
- Google Document AI: The file is sent to a specific processor trained to identify entities (Person, Name, Address, etc.).
- Raw Entity Extraction: The app parses the AI response, mapping raw text entities to a `Person` object.
- Entity Recovery: If an address is "floating" (not linked to a specific person by AI) but overlaps vertically with a person record, the system intelligently "recovers" and assigns it.
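
The entity recovery step could look roughly like the sketch below. It is an illustration only: the function names, the `box.top` / `box.bottom` fields, and the "largest overlap wins" rule are assumptions, not the project's actual code, and it presumes each entity's bounding box has already been reduced to a top and bottom y-coordinate.

```javascript
// Hypothetical shapes (assumed for this sketch):
//   person:  { name, address?, box: { top, bottom } }
//   address: { text, box: { top, bottom } }

// Length of the vertical span shared by two boxes (0 if they do not overlap).
function verticalOverlap(a, b) {
  const top = Math.max(a.top, b.top);
  const bottom = Math.min(a.bottom, b.bottom);
  return Math.max(0, bottom - top);
}

// Assigns each floating address to the person it overlaps most.
// Returns the addresses that could not be assigned to anyone.
function recoverFloatingAddresses(persons, floatingAddresses) {
  const unassigned = [];
  for (const addr of floatingAddresses) {
    let best = null;
    let bestOverlap = 0;
    for (const person of persons) {
      if (person.address) continue; // keep addresses that are already linked
      const overlap = verticalOverlap(person.box, addr.box);
      if (overlap > bestOverlap) {
        best = person;
        bestOverlap = overlap;
      }
    }
    if (best) {
      best.address = addr.text;
    } else {
      unassigned.push(addr);
    }
  }
  return unassigned;
}
```

Picking the largest overlap is one possible policy; a real implementation might also require a minimum overlap ratio before accepting a match.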
Every record goes through a rigorous centralized validation suite (validators.js):
- Cleaning: Removes special characters, collapses whitespace, and trims noise.
- Smart Reordering:
  - The system detects if an address is jumbled (e.g., "State Postcode Street").
  - Strict Protection: It only attempts to reorder if it finds a Valid Australian State (NSW, VIC, QLD, WA, SA, TAS, ACT, NT). This prevents false positives where words like "Unit" (containing "nit") were mistakenly identified as "NT".
- Normalization: Ensures final format is `[Street] [State] [Postcode]`.
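
A minimal sketch of the address rules above, assuming a hypothetical `normalizeAddress` helper (the real logic lives in `validators.js` and may differ). It only reorders when a state abbreviation appears as a whole word directly followed by a four-digit postcode, so letters inside other words are never treated as a state.

```javascript
const STATES = ['NSW', 'VIC', 'QLD', 'WA', 'SA', 'TAS', 'ACT', 'NT'];

// Whole-word state followed by a 4-digit postcode, e.g. "NSW 2000".
const STATE_POSTCODE_RE = new RegExp(
  `\\b(${STATES.join('|')})\\b\\s*(\\d{4})\\b`,
  'i'
);

function normalizeAddress(raw) {
  // Cleaning: drop special characters, collapse whitespace, trim.
  const cleaned = String(raw)
    .replace(/[^A-Za-z0-9\s\/-]/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();

  // Strict protection: without a valid state + postcode pair, do not reorder.
  const match = cleaned.match(STATE_POSTCODE_RE);
  if (!match) return cleaned;

  // Everything except the matched pair is treated as the street part.
  const before = cleaned.slice(0, match.index);
  const after = cleaned.slice(match.index + match[0].length);
  const street = `${before} ${after}`.replace(/\s+/g, ' ').trim();

  return `${street} ${match[1].toUpperCase()} ${match[2]}`.trim();
}

console.log(normalizeAddress('NSW 2000 12 Main St'));  // 12 Main St NSW 2000
console.log(normalizeAddress('12 Main St, nsw 2000')); // 12 Main St NSW 2000
console.log(normalizeAddress('Unit 4 12 Main St'));    // Unit 4 12 Main St
```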
- Jumble Fix: Can repair numbers like `1234560488` by rotating them to find the valid `04` start.
- Format Constraint: Must be exactly 10 digits and start with `04`. Non-compliant numbers cause record Rejection.
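
The rotation idea can be sketched as follows. `fixMobile` is a hypothetical name, and returning `null` to signal rejection is an assumption made for this example.

```javascript
// Returns a 10-digit mobile starting with "04", or null if none can be formed.
function fixMobile(raw) {
  const digits = String(raw).replace(/\D/g, ''); // keep digits only
  if (digits.length !== 10) return null;

  // Try every rotation; i = 0 covers numbers that are already valid.
  for (let i = 0; i < digits.length; i++) {
    const rotated = digits.slice(i) + digits.slice(0, i);
    if (rotated.startsWith('04')) return rotated;
  }
  return null; // no rotation starts with "04": the record would be rejected
}

console.log(fixMobile('1234560488'));   // 0488123456
console.log(fixMobile('0412 345 678')); // 0412345678
console.log(fixMobile('1234567899'));   // null
```

Note that if more than one rotation starts with `04`, this sketch returns the first one it finds; how the real validator resolves that ambiguity is not stated here.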
- Normalization: Converts `20-Aug-2001`, `2001.08.20`, etc., to `YYYY-MM-DD`.
- Validation:
  - Enforces years between 1900 and 2025.
  - Validates day/month correctness (e.g., rejects Feb 30).
  - Invalid dates are cleared (set to empty) rather than rejecting the whole record.
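
As an illustration of the date rules, the sketch below handles only the two input shapes mentioned above (year-first numeric and day-month name-year). `normalizeDate` is a hypothetical name, and the real validator likely accepts more formats.

```javascript
const MONTHS = {
  jan: 1, feb: 2, mar: 3, apr: 4, may: 5, jun: 6,
  jul: 7, aug: 8, sep: 9, oct: 10, nov: 11, dec: 12,
};

// Returns "YYYY-MM-DD", or "" when the input is unparseable or invalid.
function normalizeDate(raw) {
  const s = String(raw ?? '').trim();
  let y, m, d;
  let match;

  if ((match = s.match(/^(\d{4})[.\/-](\d{1,2})[.\/-](\d{1,2})$/))) {
    // e.g. 2001.08.20 or 2001-8-20
    [y, m, d] = [match[1], match[2], match[3]].map(Number);
  } else if ((match = s.match(/^(\d{1,2})[.\/ -]([A-Za-z]{3,})[.\/ -](\d{4})$/))) {
    // e.g. 20-Aug-2001 or 20 August 2001
    d = Number(match[1]);
    m = MONTHS[match[2].slice(0, 3).toLowerCase()];
    y = Number(match[3]);
  } else {
    return '';
  }

  if (!m || y < 1900 || y > 2025) return '';

  // Round-trip through Date to reject impossible days such as Feb 30.
  const dt = new Date(Date.UTC(y, m - 1, d));
  if (dt.getUTCFullYear() !== y || dt.getUTCMonth() !== m - 1 || dt.getUTCDate() !== d) {
    return '';
  }

  return `${y}-${String(m).padStart(2, '0')}-${String(d).padStart(2, '0')}`;
}

console.log(normalizeDate('20-Aug-2001')); // 2001-08-20
console.log(normalizeDate('2001.08.20'));  // 2001-08-20
console.log(normalizeDate('30-Feb-2001')); // (empty string)
```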
When multiple records share the same Mobile Number:
- Winner: The record with a valid Address is prioritized.
- Tie-Breaker: If both have addresses (or neither), the one with the most populated fields (Name, Email, etc.) wins.
- Loser: Duplicates are discarded to ensure unique Person entities.
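
The deduplication rules above could be expressed along these lines. The field names in `FIELDS` and the function names are assumptions for the sketch; "valid address" is approximated here as "non-empty address", since records are assumed to have passed validation already.

```javascript
// Hypothetical Person field names, used only to count populated fields.
const FIELDS = [
  'firstName', 'lastName', 'mobile', 'address',
  'email', 'dateOfBirth', 'lastSeen', 'landline',
];

function populatedCount(record) {
  return FIELDS.filter(
    (f) => record[f] != null && String(record[f]).trim() !== ''
  ).length;
}

// True when `candidate` should replace `current` as the winner.
function isBetter(candidate, current) {
  const candHasAddr = Boolean(candidate.address);
  const currHasAddr = Boolean(current.address);
  if (candHasAddr !== currHasAddr) return candHasAddr; // address wins
  return populatedCount(candidate) > populatedCount(current); // tie-breaker
}

// Keeps exactly one record per mobile number.
function dedupeByMobile(records) {
  const winners = new Map();
  for (const record of records) {
    const current = winners.get(record.mobile);
    if (!current || isBetter(record, current)) {
      winners.set(record.mobile, record);
    }
  }
  return [...winners.values()];
}
```

On a complete tie the earlier record is kept, which is a choice made for this sketch rather than documented behaviour.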
- Database: Valid outcomes are inserted into PostgreSQL.
- Export: Users can download the final clean dataset as CSV or Excel files.
| Component | Tech | Purpose |
|---|---|---|
| Backend | Node.js / Express | API & Orchestration |
| Language | JavaScript (ES Modules) | Logic |
| Database | PostgreSQL | Persistent storage |
| AI Service | Google Document AI | OCR & Entity Extraction |
| Processing | Worker Threads | Offloading heavy CPU validation tasks |
| Queue | p-limit | Concurrency control for API rate limits |
| Frontend | React | User Interface |
| Styling | Tailwind CSS | UI Component styling |
server/
├── src/
│ ├── services/
│ │ ├── documentProcessor.js # Core orchestration (DocAI -> Extraction -> Validation)
│ │ └── validations.worker.js # CPU-bound validation tasks
│ ├── utils/
│ │ └── validators.js # The "Source of Truth" for all regex/cleaning logic
│ ├── config/ # Envs & Constants
│ └── routes/ # API Endpoints
└── ...

- `MAX_WORKERS`: Controls parallel processing power (Default: 24).
- `MAX_CONCURRENT_DOCAI_REQUESTS`: Throttles calls to Google to avoid quotas.
- `PROJECT_ID` / `PROCESSOR_ID`: Link to the specific Google Cloud resources.
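
One way these settings might be read from the environment is sketched below. `loadConfig` and its helpers are hypothetical, and the fallback of `5` for `MAX_CONCURRENT_DOCAI_REQUESTS` is a placeholder assumed for the example; only the `MAX_WORKERS` default of 24 comes from this document.

```javascript
// Parses a positive integer setting, falling back when it is unset.
function readPositiveInt(env, name, fallback) {
  const raw = env[name];
  if (raw === undefined || raw === '') return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return n;
}

// Reads a setting that has no sensible default.
function readRequired(env, name) {
  const value = env[name];
  if (!value) throw new Error(`${name} is required`);
  return value;
}

function loadConfig(env = process.env) {
  return {
    maxWorkers: readPositiveInt(env, 'MAX_WORKERS', 24),
    // 5 is an assumed placeholder, not a documented default.
    maxConcurrentDocAiRequests: readPositiveInt(env, 'MAX_CONCURRENT_DOCAI_REQUESTS', 5),
    projectId: readRequired(env, 'PROJECT_ID'),
    processorId: readRequired(env, 'PROCESSOR_ID'),
  };
}
```

Failing fast on a malformed value makes a misconfigured deployment visible at startup rather than midway through a batch.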
- Start Server: `cd server && npm start`
- Start Client: `cd client && npm run dev`
- Upload: Go to `localhost:5173`, drag & drop partial or full PDFs.
- Monitor: Watch the logs for "Processing file..." and "Success rate..." stats.
- Export: Download the clean list.