implement Textract workflow with DynamoDB integration #9
Implements the Textract workflow, writing Textract line and word output to DynamoDB tables.
🔗 Asana, 4Help, and Related Links
Asana Ticket: link
Original Textract workflow repo: link
✨ What Does This Pull Request Do?
This PR integrates an AWS Textract pipeline into the ingest workflow. It extracts text from JPEG images, saves the results to S3 (in JSON format) and to DynamoDB tables, and can be triggered via a flag. Only JPEGs are currently supported; TIFFs and other formats are not. Updates to existing Textract metadata are not implemented.
🛠️ Summary of Changes
generic_metadata.py.
textract_lambda_handler.py for OCR and DynamoDB integration.
🧪 How Should PR Be Tested?
Local testing
Open test_event_example.json, save it as test_event.json, and update the following fields:
bucket: Set to your main S3 bucket name (e.g., "i---store").
key: Set to the full path of a JPEG image in the Access folder (e.g., "f--/testextract005001/Access/testextract005001001.jpeg").
Use one of the available test images in the testextract folder:
testextract005001001.jpeg, testextract005002001.jpeg, testextract005003001.jpeg, or testextract005004001.jpeg.
This JSON simulates an SQS event for the Lambda function (textract_lambda_handler.py), which is triggered by new objects in the Textract bucket.
To find the line and word tables, search for 'textract' in DynamoDB. To find the S3 bucket, search for 'textract' in S3 and choose t*****-w*****-p****.
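As a rough illustration of what such a test event can look like, the sketch below wraps an S3 object reference in an SQS-style envelope and reads it back. The layout is an assumption based on the standard S3-notification-over-SQS format; the actual fields in test_event_example.json take precedence, and `build_test_event` and `read_bucket_and_key` are hypothetical helper names, not functions from this PR.

```python
import json


def build_test_event(bucket: str, key: str) -> dict:
    """Build an SQS-style event carrying one S3 object reference (assumed layout)."""
    s3_notification = {
        "Records": [
            {"s3": {"bucket": {"name": bucket}, "object": {"key": key}}}
        ]
    }
    # SQS delivers the message payload as a JSON string in "body".
    return {"Records": [{"body": json.dumps(s3_notification)}]}


def read_bucket_and_key(event: dict) -> tuple:
    """Extract (bucket, key) from the first record of the event."""
    body = json.loads(event["Records"][0]["body"])
    s3_info = body["Records"][0]["s3"]
    return s3_info["bucket"]["name"], s3_info["object"]["key"]


if __name__ == "__main__":
    # Placeholder values copied from the instructions above.
    event = build_test_event(
        "i---store",
        "f--/testextract005001/Access/testextract005001001.jpeg",
    )
    print(read_bucket_and_key(event))
```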
Run for the following use cases:
Test 1. Test for Textract Response and DynamoDB Deduplication
For each of the following cases run the sh file: . run_vtdlp-testextract.sh
Check for Existing Textract Response: In the textract bucket t****-w****-p****/testextract/testextract005001001/textractResponse, verify if a Textract response JSON file already exists for your test image.
If the Textract response exists:
If the Textract response does not exist:
Test 2. Test for Deduplication
Test 3. Test for Validation
Test 4. Invalid Image Format Handling
Test 5. Verifying Textract Results and DynamoDB Entries
Confidence level for each line.
Field names in the DynamoDB table vs. the JSON response to ensure they match.
Confirm absence of the updated_at field (only present for initial extraction).
Check hardcoded values: isactive, visibility, and collection_category.
Verify the unique_key (combination of identifier and line/word number) is used for deduplication.
Inspect how boundingbox (Map type) and polygon (List type) are represented under the geometry field.
Ensure the id from the Textract JSON is mapped to output_id in the table.
Confirm that the unique_key is used to scan for existing content in the DynamoDB table and prevents duplicate entries.
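For reviewers checking these points, the following sketch shows one way a Textract LINE block could be mapped to a table item and deduplicated by unique_key. It is an illustration under assumptions, not the code in textract_lambda_handler.py: the `unique_key` format, the `text` and `confidence` field names, and the helper names are hypothetical, the hardcoded fields are omitted, and a plain set stands in for the DynamoDB scan. The block shape (`Id`, `Text`, `Confidence`, `Geometry.BoundingBox`, `Geometry.Polygon`) follows the Textract response format.

```python
from decimal import Decimal


def to_decimal(value):
    """Recursively convert floats to Decimal, which boto3's DynamoDB resource API requires."""
    if isinstance(value, float):
        return Decimal(str(value))
    if isinstance(value, dict):
        return {k: to_decimal(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_decimal(v) for v in value]
    return value


def build_line_item(identifier, line_number, block):
    """Map one Textract LINE block to a table item (field names partly assumed)."""
    geometry = block["Geometry"]
    return {
        "unique_key": f"{identifier}_{line_number}",  # assumed key format
        "identifier": identifier,
        "output_id": block["Id"],  # Textract "Id" stored as output_id
        "text": block.get("Text", ""),
        "confidence": to_decimal(block["Confidence"]),
        "geometry": {
            "boundingbox": to_decimal(geometry["BoundingBox"]),  # Map type
            "polygon": to_decimal(geometry["Polygon"]),  # List type
        },
    }


def filter_new_items(items, existing_keys):
    """Keep only items whose unique_key is not already known."""
    seen = set(existing_keys)
    fresh = []
    for item in items:
        if item["unique_key"] not in seen:
            seen.add(item["unique_key"])
            fresh.append(item)
    return fresh
```

In the real table, `existing_keys` would come from the scan on unique_key described above; here it is passed in so the logic can be exercised without AWS access.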
Test 6. Testing and verifying Tesseract
Tesseract, via pytesseract, is used to estimate the percentage of text in JPG images using the get_Text_Percentage_Images function. If the detected text exceeds 10%, the AWS Textract service is invoked; otherwise, processing is skipped to avoid unnecessary charges.
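The gating decision can be sketched as follows. How get_Text_Percentage_Images actually measures "percentage of text" is not stated here, so this example assumes an area-based measure: the share of the image covered by word boxes that contain text. The input dictionary mirrors the shape returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`; `text_area_percentage` and `should_call_textract` are hypothetical names.

```python
TEXT_PERCENTAGE_THRESHOLD = 10.0  # threshold stated in this PR


def text_area_percentage(ocr_data, image_width, image_height):
    """Estimate the percentage of the image area covered by detected text boxes.

    ocr_data is assumed to have parallel "text", "width", and "height" lists,
    as in pytesseract's image_to_data dictionary output.
    """
    image_area = image_width * image_height
    if image_area <= 0:
        return 0.0
    text_area = 0
    for text, width, height in zip(
        ocr_data["text"], ocr_data["width"], ocr_data["height"]
    ):
        # Rows with empty text are layout containers, not words.
        if text.strip():
            text_area += width * height
    return min(100.0, 100.0 * text_area / image_area)


def should_call_textract(percentage, threshold=TEXT_PERCENTAGE_THRESHOLD):
    """Invoke Textract only when the detected text exceeds the threshold."""
    return percentage > threshold
```

Keeping the decision in its own function makes the 10% cutoff easy to test without running Tesseract.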
In the AWS Lambda environment, the Tesseract binary and pytesseract dependencies are packaged in a Docker container and provided as a Lambda Layer. Due to Lambda Layer size limitations, the English language data file (eng.traineddata) is stored in S3 and downloaded to the tessdata directory at runtime to enable Tesseract OCR functionality.
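A minimal sketch of the runtime download, assuming the file is fetched once per execution environment into Lambda's writable /tmp directory. The S3 client is passed in (for example `boto3.client("s3")`), and the function name, bucket, key, and target directory are hypothetical; the handler in this PR may organize this differently.

```python
import os


def ensure_tessdata(s3_client, bucket, key, tessdata_dir="/tmp/tessdata"):
    """Download a traineddata file from S3 unless it is already present.

    s3_client is expected to provide download_file(Bucket, Key, Filename),
    as a boto3 S3 client does.
    """
    os.makedirs(tessdata_dir, exist_ok=True)
    target = os.path.join(tessdata_dir, os.path.basename(key))
    # Warm Lambda invocations reuse /tmp, so skip the download when possible.
    if not os.path.exists(target):
        s3_client.download_file(bucket, key, target)
    # Tesseract 4 and later read language data from the directory
    # named by TESSDATA_PREFIX.
    os.environ["TESSDATA_PREFIX"] = tessdata_dir
    return target
```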
Testing:
Testing in AWS Lambda Environment
Preparation:
Open the Textract bucket and make sure the textract005004 folder does not exist; delete it if it does. This ensures the Lambda calls the Textract service and creates a new textractResponse folder.
Note: If the textractResponse folder already exists, copying the same object from the source bucket will not overwrite or remove the existing textractResponse folder. The Lambda will load the existing response, and Textract will not be called again.
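The reuse behavior in the note can be pictured as a simple load-or-create step. In this sketch a dictionary stands in for the Textract bucket and `call_textract` is any callable that returns the response; both, along with the function name, are assumptions made for illustration rather than the Lambda's actual code.

```python
import json


def load_or_create_response(store, response_key, call_textract):
    """Return (response, created).

    store maps object keys to JSON strings and stands in for the S3 bucket.
    Textract is only called when no saved response exists.
    """
    if response_key in store:
        return json.loads(store[response_key]), False
    response = call_textract()
    store[response_key] = json.dumps(response)
    return response, True
```

This is why the folder must be removed before the test: with a saved response in place, the second branch is never reached.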
Trigger:
After deletion, navigate to the source bucket (e.g., i-----/f-----/testextract/textract005004).
Use the "Actions → Copy" feature to copy the image to the destination Textract bucket (e.g., t--w--p/f-----/testextract/textract005004).
This S3 copy action should trigger the Lambda function.
See the extract_collection_identifier function for this extraction logic.
This approach avoids hardcoding and ensures the Lambda function determines the collection identifier based on the S3 path structure.
There is already a 'testEvent' under 'TEST EVENTS' for the textract-dev-parse-sync Lambda function for your reference. Modify the image object as needed and run the test.
Integrating Textract with DLP-Ingest Workflow:
The generic_metadata.py module supports both AWS Lambda and local Textract processing, controlled by the LOCAL_TEXTRACT environment variable in the shell script.
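One way to picture the mode selection is the sketch below. The variable names PROCESS_TEXTRACT and LOCAL_TEXTRACT come from this PR, but the accepted values ("1", "true", "yes") and the helper names are assumptions; check the shell script for the values it actually sets.

```python
import os


def env_flag(name, environ=os.environ):
    """Interpret an environment variable as a boolean (accepted values assumed)."""
    return environ.get(name, "").strip().lower() in {"1", "true", "yes"}


def textract_mode(environ=os.environ):
    """Return "skip", "local", or "lambda" based on the two flags."""
    if not env_flag("PROCESS_TEXTRACT", environ):
        return "skip"
    return "local" if env_flag("LOCAL_TEXTRACT", environ) else "lambda"
```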
Testing Instructions:
Local Textract Processing integration (dlp-ingest/src/metadata/generic_metadata.py)
Preparation
Ensure testextract_archive_metadata.csv and testextract_collection_metadata.csv are present in the examples folder (attached).
Update the last line of the script to point to the correct CSV file.
Testing
AWS Lambda Triggering integration (dlp-ingest/src/metadata/generic_metadata.py)
In this mode, the run_textract_workflow method in generic_metadata.py will copy the identifier’s contents into the Textract bucket, which will trigger the Lambda function.
Perform Tests 1 through 6 from the 'Local testing' section.
Test to ensure the Textract workflow is skipped:
Notes:
Additional Notes:
The testextract archive and collection CSV files are attached for testing. The images they reference are already tiled.
Interested parties
@whunter
@goynejennifer
@otokama
@ervinkellym
(:star:) Required fields
PROCESS_TEXTRACT, LOCAL_TEXTRACT, TEXTRACT_BUCKET, TEXTRACT_LINE_TABLE, TEXTRACT_WORD_TABLE.
### run_vtdlp-textract_example.sh
- For local testing of the Textract code as standalone code. Environment variables: TEXTRACT_BUCKET, TEXTRACT_LINE_TABLE, TEXTRACT_WORD_TABLE.
- test_event.json for local Textract testing.