@padma012 commented Oct 3, 2025

Implements the Textract workflow, writing Textract line and word output to DynamoDB tables.

🔗 Asana, 4Help, and Related Links

Asana Ticket: link

Original Textract workflow repo: link

✨ What Does This Pull Request Do?

This PR integrates an AWS Textract pipeline into the ingest workflow. It extracts text from JPEG images, saves the results to S3 (as JSON) and to DynamoDB tables, and is enabled with the PROCESS_TEXTRACT flag. Only JPEGs are supported; TIFFs and other formats are not. Updates to existing Textract metadata are not implemented.

🛠️ Summary of Changes

  • Added Textract workflow logic to generic_metadata.py.
  • New textract_lambda_handler.py for OCR and DynamoDB integration.
  • Updated shell scripts with environment variables for configuration: run_vtdlp-ingest_example.sh and run_vtdlp-textract_example.sh (for local testing of the Textract code on its own).

🧪 How Should PR Be Tested?

Local testing

  1. Configure test_event.json with a valid image path:
  • Open test_event_example.json, save it as test_event.json, and update the following fields:

  • bucket: Set to your main S3 bucket name (e.g., "i---store").

  • key: Set to the full path of a JPEG image in the Access folder (e.g., "f--/testextract005001/Access/testextract005001001.jpeg").

  • Use one of the available test images in the testextract folder:
    testextract005001001.jpeg, testextract005002001.jpeg, testextract005003001.jpeg, or testextract005004001.jpeg.

  • This JSON simulates an SQS event for the Lambda function (textract_lambda_handler.py), which is triggered by new objects in the Textract bucket.

  2. Run the test script with the required environment variables.
  • Open run_vtdlp-testextract_example.sh and fill in the values for TEXTRACT_LINE_TABLE, TEXTRACT_WORD_TABLE, and TEXTRACT_BUCKET.
    (You can find the line and word tables by searching for 'textract' in DynamoDB, and select the appropriate S3 bucket by searching for 'textract' and choosing t*****-w*****-p****).
  • Save the updated script as run_vtdlp-ingest_testextract.sh.
    Run for the following use cases:

Test 1. Test for Textract Response and DynamoDB Deduplication
For each of the following cases, run the script: . run_vtdlp-testextract.sh

  • Check for Existing Textract Response: In the textract bucket t****-w****-p****/testextract/testextract005001001/textractResponse, verify if a Textract response JSON file already exists for your test image.

  • If the Textract response exists:

    • The code should skip Textract processing and use the existing response.
    • If the corresponding data is already present in the DynamoDB tables, the code should skip inserting/updating those entries and log a warning.
  • If the Textract response does not exist:

    • The code should preprocess the image (if needed), call the AWS Textract service, and save the results in the textractResponse folder under testextract00500X.
    • The results from this JSON response should also be written to the DynamoDB tables. If the corresponding entries are already present, the code should skip inserting/updating them and log a warning.
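
The expected behavior in Test 1 can be summarized as a small decision table. The sketch below is illustrative only: the function name and step labels are hypothetical and do not come from textract_lambda_handler.py. It is meant as a checklist of what to look for in the logs for each combination.

```python
from typing import List


def plan_textract_steps(response_exists: bool, rows_exist: bool) -> List[str]:
    """Return the steps a run is expected to take (hypothetical labels)."""
    steps: List[str] = []
    if response_exists:
        # An existing response JSON is reused, so Textract is not called again.
        steps.append("load_existing_response")
    else:
        steps.append("call_textract")
        steps.append("save_response_to_s3")
    if rows_exist:
        # Rows already in the line/word tables are left untouched.
        steps.append("skip_dynamodb_insert_and_warn")
    else:
        steps.append("insert_dynamodb_rows")
    return steps
```

For example, a run where the response JSON exists but the table rows were deleted should load the existing response and then insert rows.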

Test 2. Test for Deduplication

  • Delete the Textract response folder for the identifier and rerun the workflow.
    • Confirm that a new Textract response is created.
    • If the identifier does not already exist in the table, confirm that new entries are added.
  • Delete the DynamoDB table contents for the identifier (in both the line and word tables, using the filter 'identifier' contains 'testextract005001001'), and rerun the workflow.
    • Run this test both with and without an existing Textract response folder.
    • Confirm that the Textract JSON content is inserted into the tables.
    • If the Textract response already exists, confirm you see the "WARNING: Textract response already exists" and "Loaded existing Textract response" messages.

Test 3. Test for Validation

  • Confirm that when the Textract response and DynamoDB entries already exist, the workflow logs a warning and skips redundant processing or insertion.
  • After deleting the response or table entries, verify that the workflow recreates the response and table entries as expected.

Test 4: Invalid Image Format Handling

  1. Upload a TIFF image to i---store/f----/testextract/testextract005001/Access/.
    • You could copy the tiff image from the testss bucket to the testextract/testextract005001/Access folder.
    • Update the image path in test_event.json to reference the .tif file instead of the .jpg file.
  2. Run the Textract workflow.
  3. Expected Result:
  • The code should raise a ValueError indicating an invalid image format.
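
As a reference for what Test 4 exercises, here is a minimal sketch of an extension check that raises ValueError for non-JPEG keys. The function name, the error message, and the case-insensitive match are assumptions for illustration; the actual check in the handler may differ.

```python
import os

# Assumption: both spellings of the JPEG extension are accepted, case-insensitively.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg"}


def validate_image_format(key: str) -> str:
    """Return the key unchanged if it names a JPEG, otherwise raise ValueError."""
    ext = os.path.splitext(key)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(
            f"Invalid image format '{ext}' for '{key}': only JPEG images are supported"
        )
    return key
```

With this sketch, a key ending in .tif fails before any Textract call is made, which is the outcome Test 4 expects.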

Test 5: Verifying Textract Results and DynamoDB Entries

  1. After running the workflow on a valid JPEG image:
  • Open the JPEG file, the corresponding textractResponse JSON file in the Access folder, and the line and word DynamoDB tables.
  2. In VS Code, use Shift+Alt+F to format the JSON file for easier reading and comparison.
  3. Compare the contents of the JPEG image with the extracted text in the JSON response and the DynamoDB line and word table entries.
  4. For easier verification, filter DynamoDB items by the test identifier and the current date, then sort by line_no or word_no.
  5. Review the following details for each entry:
  • Confidence level for each line.

  • Field names in the DynamoDB table vs. the JSON response to ensure they match.

  • Confirm that the updated_at field is absent (expected for an initial extraction, since updates are not implemented).

  • Check hardcoded values: isactive, visibility, and collection_category.

  • Verify the unique_key (combination of identifier and line/word number) is used for deduplication.

  • Inspect how boundingbox (Map type) and polygon (List type) are represented under the geometry field.

  • Ensure the id from the Textract JSON is mapped to output_id in the table.

  • Confirm that the unique_key is used to scan for existing content in the DynamoDB table and prevents duplicate entries.
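
To make the unique_key check concrete, the sketch below shows one way deduplication on identifier plus line/word number could work. The key format (underscore separator) and both function names are assumptions; only the field names identifier, line_no, word_no, and unique_key come from this PR. The real code reads existing keys from DynamoDB, while this sketch takes them as a plain set.

```python
from typing import Dict, Iterable, List, Set


def make_unique_key(identifier: str, number: int) -> str:
    # Assumed format; the actual separator or padding may differ.
    return f"{identifier}_{number}"


def select_new_items(
    items: Iterable[Dict], existing_keys: Set[str], number_field: str
) -> List[Dict]:
    """Keep only items whose unique_key is not already present."""
    seen = set(existing_keys)
    new_items: List[Dict] = []
    for item in items:
        key = make_unique_key(item["identifier"], item[number_field])
        if key in seen:
            continue  # duplicate: the workflow would log a warning and skip it
        seen.add(key)
        new_items.append({**item, "unique_key": key})
    return new_items
```

Calling select_new_items with number_field="line_no" covers the line table, and number_field="word_no" covers the word table.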

Test 6. Testing and verifying Tesseract

Tesseract, via pytesseract, is used to estimate the percentage of text in JPG images using the get_Text_Percentage_Images function. If the detected text exceeds 10%, the AWS Textract service is invoked; otherwise, processing is skipped to avoid unnecessary charges.
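
The 10% gate can be sketched as follows. This is not the get_Text_Percentage_Images implementation: it assumes the percentage is the summed area of detected text boxes divided by the image area, and both function names are hypothetical. The box sizes would come from an OCR pass such as the width and height values that pytesseract reports per detected word.

```python
from typing import Iterable, Tuple

TEXT_THRESHOLD_PERCENT = 10.0  # from the description: Textract runs only above 10%


def text_area_percentage(
    boxes: Iterable[Tuple[int, int]], image_width: int, image_height: int
) -> float:
    """Estimate the share of the image covered by text boxes, as a percentage."""
    image_area = image_width * image_height
    if image_area <= 0:
        raise ValueError("image dimensions must be positive")
    text_area = sum(width * height for width, height in boxes)
    # Overlapping boxes could exceed the image area, so cap the result at 100.
    return min(100.0, 100.0 * text_area / image_area)


def should_run_textract(
    text_percentage: float, threshold: float = TEXT_THRESHOLD_PERCENT
) -> bool:
    """Textract is invoked only when detected text exceeds the threshold."""
    return text_percentage > threshold
```

A blank image yields no boxes and therefore 0%, so Textract is skipped and no charge is incurred.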

In the AWS Lambda environment, the Tesseract binary and pytesseract dependencies are packaged in a Docker container and provided as a Lambda Layer. Due to Lambda Layer size limitations, the English language data file (eng.traineddata) is stored in S3 and downloaded to the tessdata directory at runtime to enable Tesseract OCR functionality.

Testing:

  • Use two JPG images for testing: one blank and one containing text. For convenience, these are located in i***/federated/testextract005004001. For the image with text, Tesseract detects 100% text and the Textract pipeline is triggered. For the blank image, Tesseract detects less than 10% text, so the Textract pipeline is skipped as expected.

Testing in AWS Lambda Environment

  1. Deploy Lambda Function
  • The local code (excluding the local handler section) is already copied to the AWS Lambda function textract-dev-parse-sync.
  • Required environment variables are already set in the Lambda configuration under the AWS Console.
  2. Trigger Lambda via S3 Copy
  • Preparation:
    Open the Textract bucket and ensure that the textract005004 folder does not exist (delete it if it does). This ensures the Lambda will call the Textract service and create a new textractResponse folder.
    Note: If the textractResponse folder already exists, copying the same object from the source bucket will not overwrite or remove the existing textractResponse folder. The Lambda will load the existing response, and Textract will not be called again.

  • Trigger:

  • After deletion, navigate to the source bucket (e.g., i-----/f-----/testextract/textract005004).

  • Use the "Actions → Copy" feature to copy the image to the destination Textract bucket (e.g., t--w--p/f-----/testextract/textract005004).

  • This S3 copy action should trigger the Lambda function.

  3. Verify Execution
  • Check AWS CloudWatch logs for the Lambda function to confirm the trigger and successful execution.
  • Repeat the copy and trigger process as needed to validate results in step 4.
  4. Validate Outputs
  • Perform Tests 1 through 6 as you did during local testing. For AWS-triggered runs, instead of running the code locally, either copy the object into the Textract bucket or use the test event already configured in the Lambda console to simulate the trigger, then monitor CloudWatch logs for warnings and output.
  5. Collection Identifier Extraction
  • For local runs, the collection identifier is passed from the run_vtdlp-ingest.sh script.
  • For AWS Lambda triggers, the collection identifier is dynamically extracted from the S3 object key: it is the folder immediately after federated and two folders above Access.
    See the extract_collection_identifier function for this extraction logic.
    This approach avoids hardcoding and ensures the Lambda function determines the collection identifier based on the S3 path structure.
  6. Simulate Lambda Trigger with Test Event
  • In the AWS Lambda console, use the "Test" event feature to simulate a trigger by providing a valid event payload.
    There is already a 'testEvent' under 'TEST EVENTS' for the textract-dev-parse-sync Lambda function for your reference. Modify the image object as needed and run the test.
  • Check OUTPUT/CloudWatch logs for successful execution.
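
The collection identifier rule described above (the folder immediately after federated and two folders above Access) can be sketched as below. This is a hypothetical stand-in, not the extract_collection_identifier function from the PR, and the example key layout is an assumption based on the paths used in these test notes.

```python
def collection_identifier_from_key(key: str) -> str:
    """Derive the collection identifier from an S3 object key (illustrative)."""
    parts = [part for part in key.split("/") if part]
    if "federated" not in parts or "Access" not in parts:
        raise ValueError(f"Key '{key}' lacks a 'federated' or 'Access' folder")
    federated_index = parts.index("federated")
    access_index = parts.index("Access")
    # Both descriptions of the rule must point at the same folder.
    if access_index - 2 != federated_index + 1:
        raise ValueError(f"Unexpected folder layout in key '{key}'")
    return parts[federated_index + 1]
```

For a key shaped like federated/testextract/testextract005001/Access/testextract005001001.jpeg, the sketch returns testextract; keys with a different depth raise ValueError rather than guessing.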

Integrating Textract with DLP-Ingest Workflow:

The generic_metadata.py module supports both AWS Lambda and local Textract processing, controlled by the LOCAL_TEXTRACT environment variable in the shell script.

  • When Textract processing is enabled by setting PROCESS_TEXTRACT=true, the run_textract_workflow method at the end of generic_metadata.py scans the Textract bucket for existing identifiers.
  • If an identifier does not exist:
    • If LOCAL_TEXTRACT=false, the collection folder is copied to the Textract bucket, which triggers the Lambda function for processing.
    • If LOCAL_TEXTRACT=true, the event is passed to the local textract_lambda_handler.py for processing on your local machine.
  • If the identifier already exists in the Textract bucket, the workflow (including DynamoDB inserts) is skipped to prevent duplicate processing.
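
The branching above reduces to two flags plus one bucket lookup. The sketch below is an illustration under assumptions: the function name and returned labels are hypothetical, flags are treated as the strings "true"/"false" as in the shell scripts, and the bucket scan is replaced by a boolean argument.

```python
from typing import Mapping


def _flag(env: Mapping[str, str], name: str) -> bool:
    # Unset flags are treated as "false" in this sketch.
    return env.get(name, "false").strip().lower() == "true"


def choose_textract_action(env: Mapping[str, str], identifier_in_bucket: bool) -> str:
    """Pick what the ingest workflow should do for one identifier."""
    if not _flag(env, "PROCESS_TEXTRACT"):
        return "skip_disabled"
    if identifier_in_bucket:
        # Already processed: skip Textract and the DynamoDB inserts.
        return "skip_existing"
    if _flag(env, "LOCAL_TEXTRACT"):
        return "run_local_handler"
    # Copying into the Textract bucket is what triggers the Lambda function.
    return "copy_to_textract_bucket"
```

In a real run the first argument could be os.environ; passing a plain dict keeps the sketch easy to exercise for each combination.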

Testing Instructions:

Local Textract Processing integration (dlp-ingest/src/metadata/generic_metadata.py)

Preparation

  • Set environment variables in run_vtdlp-ingest.sh:
LOCAL_TEXTRACT="true"
PROCESS_TEXTRACT="true"
TEXTRACT_BUCKET="your-textract-bucket"
TEXTRACT_LINE_TABLE="your-line-table"
TEXTRACT_WORD_TABLE="your-word-table"
COLLECTION_IDENTIFIER="testextract"
  • Ensure testextract_archive_metadata.csv and testextract_collection_metadata.csv are present in the examples folder (attached).

  • Update the last line of the script to point to the correct CSV file.

Testing

  • Run the shell script. If the folders already exist in the Textract bucket, you should see a warning:
WARNING: Identifier 'testextract005004' already exists in Textract bucket. Skipping Textract workflow.
  • Testing Deduplication, Reprocessing, Validation and Verification:
    • Perform Tests 1 through 6 from the 'Local testing' section.

AWS Lambda Triggering integration (dlp-ingest/src/metadata/generic_metadata.py)

  • Set:
LOCAL_TEXTRACT="false"
PROCESS_TEXTRACT="true"
  • In this mode, the run_textract_workflow method in generic_metadata.py will copy the identifier’s contents into the Textract bucket, which will trigger the Lambda function.

  • Perform Tests 1 through 6 from the 'Local testing' section.

Test to ensure the Textract workflow is skipped:

  • Set:
LOCAL_TEXTRACT="false"
PROCESS_TEXTRACT="false"
  • Run the shell script to confirm that the Textract workflow is skipped.

Notes:

  • Ensure all environment variables, including the UPDATE_METADATA flag, are set correctly before running the tests.
  • For AWS Lambda triggering, monitor CloudWatch logs to verify execution and output.
  • Adjust CSV file paths and bucket names as needed for your environment.

Additional Notes:

The testextract archive and collection CSV files are attached for testing. These are already tiled.

Interested parties

@whunter
@goynejennifer
@otokama
@ervinkellym

(:star:) Required fields

  • Added environment variables: PROCESS_TEXTRACT, LOCAL_TEXTRACT, TEXTRACT_BUCKET, TEXTRACT_LINE_TABLE, TEXTRACT_WORD_TABLE.
  • run_vtdlp-textract_example.sh: for local testing of the Textract code on its own. Environment variables: TEXTRACT_BUCKET, TEXTRACT_LINE_TABLE, TEXTRACT_WORD_TABLE.
  • test_event.json: for local Textract testing.

…ages with <10% text, inclusion of tesseract binary and layers on Lambda