PDF Table Extractor

Use Case

PDF-Table-Extractor was built around a specific set of PDF files that are delivered monthly and contain data used in business reporting. Instead of extracting the data by hand each month, this script does it automatically and exports the tables to CSV so they can be imported into data visualization tools for transformation and reporting. There is also an option to lightly clean the resulting CSVs: setting different start and end points for the data, filtering a column, and removing NaN values.
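To make the cleaning options concrete, here is a minimal sketch of the three steps (start/end points, column filter, NaN removal) applied to rows that have already been read from a CSV. It is not the code from this repository: the function names, parameters, and the use of plain lists instead of a DataFrame are all assumptions made for illustration.

```python
import math


def is_missing(value):
    """Treat None, float NaN, empty strings, and the text 'nan' as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip() in ("", "nan", "NaN")


def clean_rows(rows, start=0, end=None, filter_column=None,
               filter_value=None, drop_missing=False):
    """Lightly clean a table given as a list of rows (hypothetical helper).

    start/end      slice the table to the rows that hold the actual data
    filter_column  index of a column to filter on; keeps rows equal to filter_value
    drop_missing   drop any row that contains a missing cell
    """
    cleaned = rows[start:end]
    if filter_column is not None:
        cleaned = [row for row in cleaned if row[filter_column] == filter_value]
    if drop_missing:
        cleaned = [row for row in cleaned
                   if not any(is_missing(cell) for cell in row)]
    return cleaned
```

For example, `clean_rows(rows, start=2, end=5, drop_missing=True)` would skip two header rows, stop before a trailing totals row, and discard rows with empty cells.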

While this was built for a specific type of PDF and a specific use case, it can likely be modified to suit other needs if the way it is set up isn't compatible with a given PDF.

How This Works

The main driver behind this project is Camelot, a Python library for extracting tables from PDFs. If you need to modify this code to suit your needs, reading through Camelot's documentation will be extremely helpful.
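For orientation, here is a rough sketch of what a Camelot-based extraction can look like. It is not taken from pdf_data_extract.py: the function names and the output file naming are hypothetical, and the only Camelot calls used are `camelot.read_pdf` (with its `pages` string argument) and each table's `to_csv` method.

```python
from pathlib import Path


def csv_name(pdf_path, table_number):
    """Build an output name such as march_table_1.csv (naming scheme is hypothetical)."""
    return f"{Path(pdf_path).stem}_table_{table_number}.csv"


def extract_tables(pdf_path, pages, out_dir="."):
    """Read the tables on the given pages and write one CSV per table.

    pages is a Camelot page string such as "5,7,9,15".
    """
    # Imported here so the naming helper above can be used without Camelot installed.
    import camelot

    tables = camelot.read_pdf(str(pdf_path), pages=pages)
    written = []
    for number, table in enumerate(tables, start=1):
        target = Path(out_dir) / csv_name(pdf_path, number)
        table.to_csv(str(target))  # each Camelot table also exposes a pandas DataFrame as table.df
        written.append(target)
    return written
```

A call such as `extract_tables("march.pdf", "5,7,9,15")` would then write one CSV per detected table into the current folder. Camelot's extraction settings (for example the table-detection flavor) are left at their defaults here and may need tuning for other PDFs.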

Set Up

Requires Python 3.7 and up
Requires the PDF interpreter Ghostscript. Download and install.
In a terminal/CMD window, navigate to the folder that contains the requirements.txt file (e.g. cd Drive:\Path\To\PDF Extractor Folder) and run pip install -r requirements.txt to install the required packages
Run pdf_data_extract.py
The script will generate a config.yaml file and exit
Edit the config.yaml file, following the comments within it to ensure the correct values are entered
Re-run pdf_data_extract.py when ready to extract
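The "generate a config and exit on first run" behaviour in the steps above can be sketched as follows. This is an illustration only: it assumes the config is created by copying config_template.yaml, which the README does not state, and `ensure_config` and `main` are hypothetical names.

```python
import shutil
import sys
from pathlib import Path


def ensure_config(folder, template_name="config_template.yaml",
                  config_name="config.yaml"):
    """Return True if the config already exists.

    Otherwise create it from the template (assumed mechanism) and return False.
    """
    folder = Path(folder)
    config = folder / config_name
    if config.exists():
        return True
    shutil.copyfile(folder / template_name, config)
    return False


def main(folder="."):
    """First run: create the config and stop. Later runs: carry on to extraction."""
    if not ensure_config(folder):
        print("Created config.yaml. Edit it, then run the script again.")
        sys.exit(0)
    # Extraction would start here once the config has been filled in.
```

Checking for the file on every run is also what makes recovery simple: if config.yaml is missing, the next run recreates it.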

Additional Info

The pages to extract from the PDF are declared in the configuration file, but there may be times when they need to be changed temporarily. For example, the PDF may be missing a page it normally contains, throwing off the page numbers where the tables usually are. If this happens, you can adjust the pages without updating the config file by using the pdf_extract_fix.py script. Instead of using the value from the configuration, this script asks you to input the page numbers. Enter the page numbers on a single line, separated by commas, like so: 5,7,9,15
If the config file is ever accidentally deleted or corrupted, a new one can be generated by re-running pdf_data_extract.py. Just make sure there is no config.yaml file in the script folder when you do (remove the corrupted one first).
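As an illustration of the comma-separated page input described above, the following sketch parses and validates a line such as 5,7,9,15. It is not the code from pdf_extract_fix.py; `parse_pages` and `to_page_string` are hypothetical helpers.

```python
def parse_pages(raw):
    """Turn a line like '5,7,9,15' into a list of page numbers.

    Raises ValueError for anything that is not a positive whole number.
    """
    pages = []
    for part in raw.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate a stray or trailing comma
        if not part.isdigit() or int(part) < 1:
            raise ValueError(f"not a valid page number: {part!r}")
        pages.append(int(part))
    if not pages:
        raise ValueError("no page numbers given")
    return pages


def to_page_string(pages):
    """Join page numbers back into the comma-separated form Camelot accepts."""
    return ",".join(str(page) for page in pages)
```

In an interactive script this could be used as `pages = parse_pages(input("Pages to extract: "))`, with the result of `to_page_string(pages)` passed to Camelot in place of the configured value.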
