

Build a receipt and invoice processing pipeline with Amazon Textract

The repository provides a reference architecture for building an invoice automation pipeline that enables extraction, verification, archival, and intelligent search.


The following architecture diagram shows the stages of a receipt and invoice processing workflow. It starts with a Document Capture stage that securely collects and stores scanned invoices and receipts. In the extraction stage, the collected documents are passed to Amazon Textract’s AnalyzeExpense API, which extracts financially relevant fields such as Vendor Name, Invoice Receipt Date, Order Date, and Amount Due/Paid. In the next stage, a few predefined expense rules determine whether a receipt is auto-approved or rejected. Auto-approved and rejected documents go to their respective S3 buckets. For auto-approved documents, you can search all the extracted fields and values using OpenSearch, and the indexed metadata can be visualized in OpenSearch Dashboards. Auto-approved documents are also set up to be moved to Glacier Vault for long-term archival using S3 lifecycle policies.
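As a rough illustration of the extraction stage, the sketch below flattens an AnalyzeExpense-style response into a dictionary of summary fields. The helper name `summary_fields` and the sample response are hypothetical and hand-written for illustration. Only the `ExpenseDocuments` / `SummaryFields` / `Type` / `ValueDetection` nesting follows the Textract response shape, and this is not necessarily how the repository's own functions handle it.

```python
from typing import Any, Dict


def summary_fields(response: Dict[str, Any]) -> Dict[str, str]:
    """Collect {field type: detected text} from an AnalyzeExpense-style response."""
    fields: Dict[str, str] = {}
    for document in response.get("ExpenseDocuments", []):
        for field in document.get("SummaryFields", []):
            field_type = field.get("Type", {}).get("Text")
            value = field.get("ValueDetection", {}).get("Text", "")
            # Keep the first value seen for each field type.
            if field_type and field_type not in fields:
                fields[field_type] = value.strip()
    return fields


# Hand-written sample in the same shape (not real Textract output).
sample_response = {
    "ExpenseDocuments": [
        {
            "SummaryFields": [
                {"Type": {"Text": "VENDOR_NAME"}, "ValueDetection": {"Text": "Example Store"}},
                {"Type": {"Text": "INVOICE_RECEIPT_ID"}, "ValueDetection": {"Text": "123ABC456"}},
                {"Type": {"Text": "TOTAL"}, "ValueDetection": {"Text": "42.10"}},
            ]
        }
    ]
}

print(summary_fields(sample_response))
```

With boto3, a real response would come from the `analyze_expense` call on a Textract client, pointed at a document stored in S3.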


Steps to deploy

Clone the repository

git clone

Install dependencies

pip install -r requirements.txt

Deploy InvoiceProcessor stack

# If you are running cdk for the first time in this account, run `cdk bootstrap` first
cdk deploy

With the default configuration settings from the GitHub samples, the deployment takes around 25 minutes. It creates a Step Functions workflow that is invoked when a document is put under an Amazon S3 bucket/prefix and that processes the document until its content is indexed in an OpenSearch cluster.

The following is a sample output of the cdk deploy command, including useful links and information:

InvoiceProcessorWorkflow.CognitoUserPoolLink =
InvoiceProcessorWorkflow.DocumentQueueLink =
InvoiceProcessorWorkflow.DocumentUploadLocation = s3://invoiceprocessorworkflow-invoiceprocessorbucketf1-lzei1g235krx/uploads/
InvoiceProcessorWorkflow.OpenSearchDashboard =
InvoiceProcessorWorkflow.OpenSearchLink =
InvoiceProcessorWorkflow.RulesTableName = InvoiceProcessorWorkflow-ExpenseValidationRulesTableEB3DAEF1-I1IY5U27MWF7
InvoiceProcessorWorkflow.StepFunctionFlowLink =

This information is also available in the AWS CloudFormation Console.

After the cdk deployment is complete, create a couple of validation rules in the DynamoDB table. You can open CloudShell from the AWS Console and run these commands:

aws dynamodb execute-statement --statement "INSERT INTO \"$(aws cloudformation list-exports --query 'Exports[?Name==`InvoiceProcessorWorkflow-RulesTableName`].Value' --output text)\" VALUE {'ruleId': 1, 'type': 'regex', 'field': 'INVOICE_RECEIPT_ID', 'check': '(?i)[0-9]{3}[a-z]{3}[0-9]{3}$', 'errorTxt': 'Receipt number is not valid. It is of the format: 123ABC456'}"
aws dynamodb execute-statement --statement "INSERT INTO \"$(aws cloudformation list-exports --query 'Exports[?Name==`InvoiceProcessorWorkflow-RulesTableName`].Value' --output text)\" VALUE {'ruleId': 2, 'type': 'regex', 'field': 'PO_NUMBER', 'check': '(?i)[a-z0-9]+$', 'errorTxt': 'PO number is not present'}"
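To make the effect of these two rules concrete, here is a minimal sketch of how regex rules of this shape could be applied to extracted fields. The `validate` helper is hypothetical. It assumes that the `check` pattern is matched from the start of the value and that a missing field counts as a failure; the deployed workflow may evaluate rules differently.

```python
import re
from typing import Dict, List

# The two rules inserted above, as plain dictionaries.
RULES = [
    {
        "ruleId": 1,
        "type": "regex",
        "field": "INVOICE_RECEIPT_ID",
        "check": r"(?i)[0-9]{3}[a-z]{3}[0-9]{3}$",
        "errorTxt": "Receipt number is not valid. It is of the format: 123ABC456",
    },
    {
        "ruleId": 2,
        "type": "regex",
        "field": "PO_NUMBER",
        "check": r"(?i)[a-z0-9]+$",
        "errorTxt": "PO number is not present",
    },
]


def validate(fields: Dict[str, str], rules: List[dict]) -> List[str]:
    """Return the errorTxt of every regex rule that the fields fail."""
    errors = []
    for rule in rules:
        if rule["type"] != "regex":
            continue  # only the regex rule type is sketched here
        value = fields.get(rule["field"], "")
        # Assumption: match from the start of the value; a missing field fails.
        if not re.match(rule["check"], value):
            errors.append(rule["errorTxt"])
    return errors


fields = {"INVOICE_RECEIPT_ID": "123abc456"}
errors = validate(fields, RULES)
print("approved" if not errors else f"rejected: {errors}")
```

In this sketch a document is auto-approved only when the list of errors is empty, which mirrors the approve/reject split described earlier.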

We also need to create a folder named uploads in the bucket shown in the InvoiceProcessorWorkflow.DocumentUploadLocation output. This is where input receipts and invoices are placed. When a new document is placed under the uploads prefix of that bucket, a new Step Functions workflow execution is started for it.
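As one possible way to place a document under the uploads prefix, the sketch below splits an `s3://` location into bucket and prefix and uploads a local file with boto3. The helper names `split_s3_uri`, `object_key`, and `upload`, and the local file path, are hypothetical. The upload itself needs boto3 and AWS credentials, so only the path handling runs in the example.

```python
from pathlib import Path
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/prefix/' into (bucket, prefix without leading slash)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def object_key(prefix: str, local_file: str) -> str:
    """Build the object key for a local file under the given prefix."""
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return prefix + Path(local_file).name


def upload(uri: str, local_file: str) -> None:
    """Upload local_file under the S3 URI (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    bucket, prefix = split_s3_uri(uri)
    boto3.client("s3").upload_file(local_file, bucket, object_key(prefix, local_file))


# Value copied from the sample cdk deploy output above.
location = "s3://invoiceprocessorworkflow-invoiceprocessorbucketf1-lzei1g235krx/uploads/"
bucket, prefix = split_s3_uri(location)
print(bucket, object_key(prefix, "scans/receipt-001.png"))
```

Because S3 folders are just key prefixes, uploading an object whose key starts with `uploads/` also makes the folder appear in the console.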

To check the status of a document, open InvoiceProcessorWorkflow.StepFunctionFlowLink. It links to the list of Step Functions executions in the AWS Management Console and shows the processing status of each document uploaded to Amazon S3. The tutorial Viewing and debugging executions on the Step Functions console provides an overview of the components and views in the AWS Console.
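The same status information can also be read programmatically. The following sketch counts executions per status. It assumes you have looked up the state machine ARN yourself, since it is not among the sample outputs above, and the function names and sample entries are hypothetical. `fetch_executions` needs boto3 and AWS credentials, so the example runs only on hand-written data.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def status_counts(executions: Iterable[Dict[str, Any]]) -> Counter:
    """Count executions per status (RUNNING, SUCCEEDED, FAILED, ...)."""
    return Counter(execution["status"] for execution in executions)


def fetch_executions(state_machine_arn: str) -> list:
    """List executions of one state machine (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so status_counts works without it

    paginator = boto3.client("stepfunctions").get_paginator("list_executions")
    executions = []
    for page in paginator.paginate(stateMachineArn=state_machine_arn):
        executions.extend(page["executions"])
    return executions


# Hand-written sample entries in the shape of list_executions results.
sample = [
    {"name": "receipt-001", "status": "SUCCEEDED"},
    {"name": "receipt-002", "status": "FAILED"},
    {"name": "receipt-003", "status": "SUCCEEDED"},
]
print(dict(status_counts(sample)))
```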


Cleanup

  • Empty the S3 bucket
  • Get the Cognito user pool ID using:
    cognito_user_pool=$(aws cloudformation list-exports --query 'Exports[?Name==`InvoiceProcessorWorkflow-CognitoUserPoolId`].Value' --output text)
    echo $cognito_user_pool
  • Run cdk destroy
    cdk destroy
  • Delete the Cognito user pool, either from the console UI or with the CLI:
    aws cognito-idp delete-user-pool --user-pool-id $cognito_user_pool


See CONTRIBUTING for more information.


This library is licensed under the MIT-0 License. See the LICENSE file.

