This tool is a package of Python scripts for uploading files, checking their processing status, and downloading and deleting files through the IBM Automation Document Processing (ADP) APIs. To get started, collect all the files into one folder, set up the configuration file, and trigger the processing. The tool then automatically uploads all the files to the server, checks the processing status, downloads the JSON output of each file to the output folder, and finally removes the resources from the server.
- config.json - Input file of your configuration settings
- start.py - Starting point of the tool that will upload, download, and delete
- reupload.py - Starting point of the tool that will redo the failed or unfinished processing: re-upload, download, and delete
- readConfigJSON.py - Verify the configuration file
- checkToken.py - Check or generate the token for authentication
- uploadFiles.py - Call directly just to do uploads
- downloadFiles.py - Call directly just to do downloads
- deleteFiles.py - Call directly just to do deletes
- updateReport.py - Writes out the output.json
- loggingHandler.py - Writes to the processing.log
- reUploadUnfinished.py - Call directly to reupload the failed or unfinished files, this overrides the previous output.json file.
- You must have access to an Automation Document Processing project.
- You might want to access the Automation Document Processing Knowledge Center web page as a reference. The link is in the Related Links section below.
- You will need to provide the Zen host, username, and password to generate a token for authentication.
- You should know what subset of JSON options you want to be included. Enter all or see documentation for details.
Update config.json with your server connection and options information as follows.
- directory_path: The directory containing the files to be processed. Files in nested subdirectories are also supported.
- output_directory_path: The directory to write the output files (JSON) after processing. If the directory does not exist, the script will create it.
- zen_host: The URL to the server for generating the Zen token and sending the file processing requests
- zen_username: The username that belongs to the appropriate group, such as captureadmins or projectadmins. For more information on roles, refer to the Automation Document Processing documentation.
- zen_password: Password
- adp_project_id: Automation Document Processing project ID
- output_options: Output options. ADP only supports json. The output will contain the extracted key-value pair information.
- json_options: List of json options. Available values : ocr, dc, kvp, sn, hr, th, mt, ai, ds, char (case does not matter)
- ssl_verification: Boolean indicating whether your system uses SSL certificates. The default is false.
- file_type: Optional. Specify this if you require only specific file types to be uploaded rather than all the BACA-accepted file types ('pdf', 'jpg', 'jpeg', 'tif', 'tiff', 'png', 'doc', 'docx').
- download_json_option: This option selects which version of the file JSON to download. CA supports two versions, verbose and basic, from the 22.0.2 release. If you don't specify this option, the downloaded file JSON is verbose.
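How directory_path and file_type work together can be sketched roughly as follows. This is an illustration only, not the tool's actual code: the names collect_files and ACCEPTED_TYPES are hypothetical, and the sketch assumes that leaving file_type out means all accepted types are uploaded.

```python
import os

# Accepted extensions, as listed for the file_type setting.
ACCEPTED_TYPES = {"pdf", "jpg", "jpeg", "tif", "tiff", "png", "doc", "docx"}


def collect_files(directory_path, file_type=None):
    """Return the sorted paths of all allowed files under directory_path.

    Nested subdirectories are walked as well. file_type is an optional
    list of extensions; when omitted, every accepted type is allowed
    (an assumption of this sketch).
    """
    if file_type:
        wanted = {t.lower().lstrip(".") for t in file_type}
    else:
        wanted = ACCEPTED_TYPES
    allowed = wanted & ACCEPTED_TYPES

    matches = []
    for root, _dirs, names in os.walk(directory_path):
        for name in names:
            # Compare extensions case-insensitively, without the leading dot.
            extension = os.path.splitext(name)[1].lstrip(".").lower()
            if extension in allowed:
                matches.append(os.path.join(root, name))
    return sorted(matches)
```

For example, collect_files("/sample/input", ["pdf"]) would return only the PDF files found anywhere below /sample/input.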
Note:
- Command to get zen_host:
  oc get route cpd -o jsonpath="{.spec.host}"
  Please check the Related Links section for the commands to get more details.
- zen_host and zen_password are used to generate the Zen token for authentication. If you would like to call a single API directly, you can use zen_api_key instead, which is easily obtained from the home page of IBM Cloud Pak for Business Automation and does not expire. After getting the Zen API key, encode it together with the username using base64 (username:api_key), and then add the prefix ZenApiKey for authentication: Authorization: ZenApiKey followed by the encoded value. Refer to the product documentation for more details.
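The encoding step described in the note can be sketched in a few lines. This is a minimal illustration under one assumption: the username and API key are joined as username:api_key before encoding. The function name zen_api_key_header is hypothetical and is not part of the tool.

```python
import base64


def zen_api_key_header(username, api_key):
    """Build the Authorization header for Zen API key authentication.

    Assumes the username and API key are joined with a colon and then
    base64-encoded, with the ZenApiKey prefix added in front.
    """
    raw = f"{username}:{api_key}".encode("utf-8")
    encoded = base64.b64encode(raw).decode("ascii")
    return {"Authorization": "ZenApiKey " + encoded}
```

The returned dictionary can be passed as the headers argument of a requests call when invoking a single API directly.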
{
    "zen_host": "route-host-without-https",
    "zen_username": "admin",
    "zen_password": "password",
    "adp_project_id": "your-project-name",
    "directory_path": "/sample/input",
    "output_directory_path": "/sample/output",
    "output_options": "json",
    "json_options": "ocr,dc,kvp,sn,hr,th,mt,ai,ds",
    "ssl_verification": false,
    "file_type": [
        "pdf"
    ],
    "download_json_option": "basic"
}
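Before starting a run, it can help to sanity-check a configuration like the sample above. The following is a rough sketch only and does not reproduce readConfigJSON.py: the names load_config, check_config, and REQUIRED_KEYS are hypothetical, and which keys count as required is an assumption.

```python
import json

# Keys treated as required here (an assumption of this sketch).
REQUIRED_KEYS = (
    "zen_host", "zen_username", "zen_password", "adp_project_id",
    "directory_path", "output_directory_path", "output_options", "json_options",
)
# Values listed for json_options in this README.
JSON_OPTIONS = {"ocr", "dc", "kvp", "sn", "hr", "th", "mt", "ai", "ds", "char"}


def load_config(path="config.json"):
    """Read the configuration file into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def check_config(config):
    """Return a list of problems found; an empty list means no problem."""
    problems = [f"missing or empty: {key}" for key in REQUIRED_KEYS if not config.get(key)]

    # json_options is a comma-separated string; case does not matter.
    for option in str(config.get("json_options", "")).split(","):
        option = option.strip().lower()
        # "all" is accepted as well, per the prerequisites above.
        if option and option not in JSON_OPTIONS and option != "all":
            problems.append(f"unknown json option: {option}")

    if str(config.get("output_options", "")).lower() not in ("", "json"):
        problems.append("output_options must be json")
    if not isinstance(config.get("ssl_verification", False), bool):
        problems.append("ssl_verification must be true or false")
    if str(config.get("download_json_option", "verbose")).lower() not in ("verbose", "basic"):
        problems.append("download_json_option must be verbose or basic")
    return problems
```

Calling check_config(load_config()) in the tool's folder would return an empty list for the sample configuration.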
Install the latest Python 3 and pip, then install these packages:
python -m pip install --upgrade pip
python -m pip install requests
python -m pip install python-dateutil
The tool will upload all the files found in the input directory and check for processing status. As the output files are ready, they will be downloaded to your output directory. Then the output files will be deleted from the server.
- Update the config.json with your configuration settings
- Make sure the directory_path contains all the files you want to process. Files in nested subdirectories will also be processed
- Run the script from the terminal command line:
  python start.py
- Monitor the console log.
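The upload, status check, download, and delete sequence can be pictured with the sketch below. It is not the code in start.py: the four callables are hypothetical stand-ins for the real API calls, and the status strings "done" and "failed" are assumptions made for illustration.

```python
import time


def process_files(paths, upload, get_status, download, delete,
                  poll_seconds=5.0, max_polls=60):
    """Upload every file, poll until each one finishes, then download and delete it.

    upload(path) returns a remote id; get_status(remote_id) returns a
    status string; download and delete take a remote id. All four are
    hypothetical stand-ins supplied by the caller.
    Returns a dict mapping each path to "done", "failed", or "unfinished".
    """
    pending = {upload(path): path for path in paths}  # remote id -> local path
    results = {}
    for _ in range(max_polls):
        for remote_id in list(pending):
            status = get_status(remote_id)
            if status == "done":
                download(remote_id)
                delete(remote_id)  # remove the resources from the server
                results[pending.pop(remote_id)] = "done"
            elif status == "failed":
                # Left on the server in this sketch, for a later retry or cleanup.
                results[pending.pop(remote_id)] = "failed"
        if not pending:
            break
        time.sleep(poll_seconds)
    # Anything still pending after the last poll counts as unfinished.
    for path in pending.values():
        results[path] = "unfinished"
    return results
```

Passing the API calls in as arguments keeps the loop testable without a server; files reported as failed or unfinished are the ones a later rerun would pick up.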
If there are any errors and files did not get processed, you can call reupload.py to redo the upload, download, and delete. The script will reprocess the unfinished files listed in the output.json. It will upload the files again in the input directory and check for processing status. As the output files are ready, they will be downloaded to your output directory. Then the output files will be deleted from the server.
- You may want to back up the processing.log and the output.json and delete these files before calling reupload.py.
- Run the script from the terminal command line:
  python reupload.py
- Monitor the console log.
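The backup step mentioned above could be scripted along these lines. This is a sketch under assumptions: the function name back_up_run_files and the backup folder naming are hypothetical, and the files are copied rather than moved because reupload.py works from the entries listed in output.json. Whether to delete the originals afterward is left to you.

```python
import os
import shutil
import time


def back_up_run_files(work_dir=".", names=("processing.log", "output.json")):
    """Copy the previous run's files into a timestamped backup folder.

    Returns the list of backup paths that were written. Files that do
    not exist are skipped, and the originals are left in place.
    """
    backup_dir = os.path.join(work_dir, "backup-" + time.strftime("%Y%m%d-%H%M%S"))
    copied = []
    for name in names:
        source = os.path.join(work_dir, name)
        if os.path.isfile(source):
            os.makedirs(backup_dir, exist_ok=True)
            # copy2 also preserves timestamps and returns the destination path.
            copied.append(shutil.copy2(source, os.path.join(backup_dir, name)))
    return copied
```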
You may want to rerun individual Python scripts, for example to download the output files again or to clean up the files on the server. These scripts rely on previous uploads and reference the output.json file that was generated.
- Run the scripts from the terminal command line:
  python downloadFiles.py
  python deleteFiles.py
- Monitor the console log.
- Terminal console
- Check the processing.log for processing details (same as console log)
- Check the output.json for upload and download results in json format, including HTTP return codes and errors
- Check the output_directory for the output files
- The output files are immediately deleted from the server
This code is sample code created by IBM Corporation. IBM grants you a nonexclusive copyright license to use this sample code example. This sample code is not part of any standard IBM product and is provided to you solely for the purpose of assisting you in the development of your applications. This example has not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee nor may you imply reliability, serviceability, or function of these programs. The code is provided "AS IS", without warranty of any kind. IBM shall not be liable for any damages arising out of your or any other party's use of the sample code, even if IBM has been advised of the possibility of such damages. If you do not agree with these terms, do not use the sample code.
Copyright IBM Corp. 2022 All Rights Reserved.