This sample demonstrates how to automatically split and categorize large combined PDF documents using Amazon Bedrock Data Automation (BDA). The solution showcases two approaches with different cost and accuracy trade-offs.
Many organizations deal with large combined PDF documents containing multiple document types. This sample provides a solution to:
- Automatically identify document boundaries within combined PDFs
- Categorize different document types (credit reports, loan applications, etc.)
- Flag missing required documents
- Reduce manual review time and improve accuracy
Amazon Bedrock Data Automation (BDA) supports splitting documents when using projects with the Amazon Bedrock API. When enabled, splitting allows BDA to take a PDF containing multiple logical documents and split it into separate documents for processing.
Once splitting is complete, each segment is processed independently, so a single input can mix document types. For example, if a PDF contains three bank statements and one W-2, splitting attempts to divide it into four separate documents, each processed on its own.
- Maximum input size: Up to 3,000 pages per input document
- Individual document limit: Up to 20 pages per split document
- Default setting: Document splitting is disabled by default but can be enabled via API
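As a rough sketch of what enabling splitting looks like, the request to create a BDA project carries a splitter override. The field names below follow our understanding of the AWS SDK for JavaScript v3 (`@aws-sdk/client-bedrock-data-automation`) and should be verified against your SDK version; the project name is a placeholder:

```javascript
// Sketch: parameters for CreateDataAutomationProject with document splitting
// enabled. Splitting is disabled by default, so the override below is what
// turns it on. Verify field names against your installed SDK version.
const projectParams = {
  projectName: "document-splitting-demo", // placeholder name
  standardOutputConfiguration: {
    document: {
      extraction: {
        granularity: { types: ["DOCUMENT", "PAGE"] },
        boundingBox: { state: "DISABLED" },
      },
      outputFormat: {
        textFormat: { types: ["MARKDOWN"] },
        additionalFileFormat: { state: "DISABLED" },
      },
    },
  },
  overrideConfiguration: {
    document: {
      splitter: { state: "ENABLED" }, // disabled by default
    },
  },
};

// You would pass this object to:
//   new CreateDataAutomationProjectCommand(projectParams)
// and send it with a BedrockDataAutomationClient.
console.log(JSON.stringify(projectParams.overrideConfiguration, null, 2));
```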
- Uses BDA Standard Output for document extraction ($0.010/page)
- Uses foundation models for post-processing and classification
- Best for: High-volume processing with budget constraints
- Uses BDA's custom blueprints with built-in classification ($0.040/page)
- Document splitting: Enabled with blueprint-specific processing
- Higher accuracy with dedicated document type blueprints
- Best for: Maximum accuracy requirements with dedicated budget
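To make the trade-off concrete, here is a back-of-the-envelope comparison at the per-page rates quoted above. This is a sketch only: it ignores the foundation-model token costs the first approach additionally incurs during post-processing.

```javascript
// Rough per-job cost at the page rates quoted above. The Standard Output
// approach also pays for foundation-model tokens during post-processing,
// which this sketch does not model.
const RATES = {
  standardOutput: 0.010, // USD per page, BDA Standard Output
  customOutput: 0.040,   // USD per page, BDA custom blueprints
};

function estimateCost(pages, ratePerPage) {
  // Round to tenths of a cent to avoid floating-point drift.
  return Math.round(pages * ratePerPage * 1000) / 1000;
}

const pages = 500; // e.g. a batch of combined loan packets
console.log(`Standard Output: $${estimateCost(pages, RATES.standardOutput)}`);
console.log(`Custom Output:   $${estimateCost(pages, RATES.customOutput)}`);
```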
- AWS Account with appropriate permissions
- Node.js 18+
- AWS CLI configured
- Docker (optional, for container deployment)
The easiest way to get started is using our interactive setup script:
```bash
./scripts/interactive-setup.sh
```

This all-in-one script provides options to:
- Setup and deploy the application
- Run the demo locally
- Update configuration settings
- Clean up AWS resources
- Clone the repository

```bash
git clone https://github.com/aws-samples/sample-document-splitting-with-amazon-bedrock-data-automation.git
cd sample-document-splitting-with-amazon-bedrock-data-automation
```

- Install dependencies

```bash
npm run install:all
```

- Configure environment

```bash
# Create sample blueprints using the ./scripts/interactive-setup.sh quick start.
cp backend/.env.example backend/.env
# Edit backend/.env with your AWS settings
```

- Start development servers

```bash
npm run dev
```

This will start both the backend server (port 8080) and the frontend development server (port 3000).
Configure your environment by editing backend/.env:

```bash
# AWS Configuration
AWS_REGION=us-east-1
S3_BUCKET=document-splitting-demo
BDA_PROFILE_ARN=arn:aws:bedrock:us-east-1:123456789012:data-automation-profile/your-profile-id
BDA_PROJECT_ARN=arn:aws:bedrock:us-east-1:123456789012:data-automation-project/your-project-id

# Application Configuration
NODE_ENV=development
PORT=8080
FRONTEND_URL=http://localhost:3000

# Processing Configuration
MAX_FILE_SIZE=52428800
MAX_PAGES=200
PROCESSING_TIMEOUT=300000

# Logging
LOG_LEVEL=info

# Demo Mode (set to true for mock responses when AWS services are not configured)
DEMO_MODE=true
```
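On startup, the backend needs these variables to be present and well-formed. As an illustration of how such a check could work (this is a hypothetical helper with the defaults documented above, not the sample's actual loader):

```javascript
// Hypothetical startup check: fail fast if required settings are missing,
// apply the documented defaults, and coerce numeric values. Not the
// sample's actual code.
function loadConfig(env = process.env) {
  const required = ["AWS_REGION", "S3_BUCKET", "BDA_PROFILE_ARN", "BDA_PROJECT_ARN"];
  const demoMode = env.DEMO_MODE === "true"; // demo mode tolerates missing AWS settings

  if (!demoMode) {
    const missing = required.filter((name) => !env[name]);
    if (missing.length > 0) {
      throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
    }
  }

  return {
    region: env.AWS_REGION ?? "us-east-1",
    port: Number(env.PORT ?? 8080),
    maxFileSize: Number(env.MAX_FILE_SIZE ?? 52428800),      // 50 MB
    maxPages: Number(env.MAX_PAGES ?? 200),
    processingTimeout: Number(env.PROCESSING_TIMEOUT ?? 300000), // 5 minutes
    demoMode,
  };
}

console.log(loadConfig({ DEMO_MODE: "true", PORT: "8080" }));
```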
```
sample-document-splitting-with-amazon-bedrock-data-automation/
├── backend/                      # Node.js API server
│   ├── src/                      # Backend source code
│   │   ├── config/               # Configuration files
│   │   ├── handlers/             # Route handlers
│   │   ├── services/             # Business logic
│   │   ├── utils/                # Utilities
│   │   └── index.js              # Entry point
│   ├── .env                      # Environment variables (create from .env.example)
│   └── package.json              # Backend dependencies
├── frontend/                     # React frontend
│   ├── public/                   # Static assets
│   │   └── docs/                 # Documentation
│   ├── src/                      # Frontend source code
│   │   ├── components/           # UI components
│   │   ├── pages/                # Application pages
│   │   └── App.tsx               # Main application component
│   └── package.json              # Frontend dependencies
├── samples/                      # Sample documents
│   └── documents/                # Sample PDF documents
├── scripts/                      # Utility scripts
│   ├── deploy-complete.sh        # Complete deployment script
│   ├── interactive-setup.sh      # All-in-one interactive setup and demo tool
│   ├── push_to_public_ecr.sh     # Push Docker image to ECR Public
│   └── upload-sample-document.sh # Upload sample document to S3
├── cleanup-all.sh                # Clean up all AWS resources
├── cloudformation-template.yaml  # CloudFormation template
├── Dockerfile                    # Docker configuration
└── package.json                  # Root package.json
```
For a guided deployment experience:

```bash
./scripts/interactive-setup.sh
```

Choose "Setup & Deploy" from the main menu, then select your preferred deployment option.

Alternatively, run the complete deployment script directly:

```bash
./scripts/deploy-complete.sh
```

To build and run the container image:

```bash
docker build -t document-splitting:latest .
docker run -p 8080:8080 document-splitting:latest
```

The system is configured to identify and process:
- Uniform Residential Loan Application (URLA) - 9-page loan application document with detailed borrower information
- Homebuyer Certificates - Certificates issued to participants who completed homebuyer education programs
- Uniform Residential Appraisal Report (Form 1004) - Property appraisal documents with contract prices and borrower details
- Bank Statements - Financial statements with account information and transaction summaries
- Uniform Underwriting and Transmittal Summary (Form 1008) - Multi-page mortgage underwriting forms
- Driver's License - US driver's license documents (using AWS public blueprint)
These document types are defined through custom Amazon Bedrock Data Automation blueprints that specify the exact fields and data structures to extract from each document type.
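As a rough illustration of what a blueprint declares, a bank-statement blueprint might look like the JSON schema below. The field names (`class`, `inferenceType`, `instruction`) reflect our understanding of the BDA blueprint format and should be checked against the BDA documentation and the sample's `scripts/interactive-setup.sh` output:

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "class": "Bank Statement",
  "description": "Financial statement with account information and transaction summaries",
  "type": "object",
  "properties": {
    "account_number": {
      "type": "string",
      "inferenceType": "explicit",
      "instruction": "The account number shown on the statement"
    },
    "statement_period": {
      "type": "string",
      "inferenceType": "explicit",
      "instruction": "The start and end dates the statement covers"
    }
  }
}
```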
- `POST /api/upload` - Upload PDF documents
- `POST /api/processing/start` - Start async document processing (both BDA Standard + Bedrock and Custom Output)
- `GET /api/processing/status/:jobId` - Check processing status and progress
- `GET /api/processing/result/:jobId` - Get processing results when completed
- `GET /api/processing/models` - Get supported Bedrock models
- `POST /api/processing/process` - Legacy endpoint (redirects to async flow)

- `GET /api/analysis/costs` - Calculate processing costs
- `GET /api/analysis/comparison` - Compare processing methods
- `GET /api/analysis/documents/:jobId` - Get document analysis results
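The async flow (start a job, poll its status, then fetch the result) can be sketched with a small polling helper. The endpoint paths match the list above, but the response field names (`status`) and status values are assumptions about this sample's API, not confirmed shapes:

```javascript
// Poll GET /api/processing/status/:jobId until the job completes, then
// fetch GET /api/processing/result/:jobId. `fetchJson` is injected so the
// flow is easy to test; with Node 18+ or a browser you would pass a thin
// wrapper around the global fetch. The `status` field name and the
// "COMPLETED"/"FAILED" values are assumptions about this sample's API.
async function waitForResult(jobId, fetchJson, { intervalMs = 2000, maxAttempts = 150 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const { status } = await fetchJson(`/api/processing/status/${jobId}`);
    if (status === "COMPLETED") {
      return fetchJson(`/api/processing/result/${jobId}`);
    }
    if (status === "FAILED") {
      throw new Error(`Processing job ${jobId} failed`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Timed out waiting for job ${jobId}`);
}
```

With the global fetch available in Node 18+, `fetchJson` could be as simple as `async (path) => (await fetch(baseUrl + path)).json()`.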
To remove all AWS resources created by this sample:
```bash
./scripts/interactive-setup.sh
```

Choose "Cleanup Resources" from the main menu.
For production deployments, implement proper authentication and authorization using AWS Cognito, IAM roles with least privilege principles, or integrate with your existing identity provider.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.