PDF-Table-Extractor was built around a specific set of PDF files that are delivered monthly and contain data used in business reporting. Instead of extracting the data by hand each month, this script does it automatically and exports the tables to CSV so they can be imported into data visualization tools for transformation and reporting. There is also an option to lightly clean the resulting CSVs: setting different start and end points for the data, filtering a column, and removing NaN values.
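The light cleaning described above could look roughly like the following pandas sketch. The function name `clean_table` and all of its parameters are hypothetical and are not taken from the script. It also assumes that "filtering a column" means keeping rows whose value equals a given value, and that "removing NaN values" means dropping rows that contain any NaN.

```python
import pandas as pd


def clean_table(df, start_row=0, end_row=None,
                filter_column=None, filter_value=None, drop_nan=True):
    """Trim, filter, and de-NaN an extracted table (illustrative only)."""
    # Keep only the rows between the chosen start and end points.
    cleaned = df.iloc[start_row:end_row]

    # Assumption: filtering keeps rows where the column equals filter_value.
    if filter_column is not None:
        cleaned = cleaned[cleaned[filter_column] == filter_value]

    # Assumption: NaN removal drops any row that contains a NaN.
    if drop_nan:
        cleaned = cleaned.dropna()

    return cleaned.reset_index(drop=True)
```

For example, `clean_table(df, start_row=1, end_row=4, filter_column="region", filter_value="East")` would skip the first row, stop before the fifth, keep only the "East" rows, and drop any of those that still contain NaN.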
While this was built for a specific type of PDF and a specific use case, it can likely be modified to suit other needs if the default setup isn't compatible with a given PDF.
The main driver behind this project is Camelot. It is a Python library that allows for extracting tables from PDFs. If you need to modify this code to suit your needs, reading through Camelot's documentation will be extremely helpful.
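As a starting point for modifications, here is a minimal sketch of a Camelot extraction loop. It is not the code from pdf_data_extract.py: the helper names `csv_name` and `export_tables` and the output naming scheme are made up for illustration. `camelot.read_pdf` accepts a `pages` string such as "5,7,9,15", and each returned table exposes a pandas DataFrame as `.df` and its page number as `.page`.

```python
from pathlib import Path


def csv_name(pdf_path, page, index):
    """Build an output file name (hypothetical naming scheme)."""
    return f"{Path(pdf_path).stem}_page{page}_table{index}.csv"


def export_tables(pdf_path, pages, out_dir):
    """Extract tables from the given pages and write one CSV per table."""
    # Imported here so this module still loads where Camelot is not installed.
    import camelot

    tables = camelot.read_pdf(str(pdf_path), pages=pages)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for index, table in enumerate(tables):
        target = out_dir / csv_name(pdf_path, table.page, index)
        # Camelot columns are plain integers, so the header row is skipped.
        table.df.to_csv(target, index=False, header=False)
        written.append(target)
    return written
```

A call such as `export_tables("report.pdf", "5,7,9,15", "output")` would need Camelot and Ghostscript installed; the file name and folder here are placeholders.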
- Requires Python 3.7 and up
- Requires the PDF interpreter Ghostscript. Download and install it before running the script.
- In a terminal/CMD window, navigate to the folder where the requirements.txt file is (ex. cd Drive:\Path\To\PDF Extractor Folder) and run pip install -r requirements.txt to install the required packages
- Run pdf_data_extract.py
- The script will generate a config.yaml file and exit
- Edit the config.yaml file, following the comments within it to ensure the correct values are entered
- Re-run pdf_data_extract.py when ready to extract
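The generate-then-exit behavior on the first run can be sketched as below. This is an assumption about how such a step might be written, not the script's actual code. The template keys (`pdf_path`, `pages`) are placeholders, and the real config.yaml documents its own settings in its comments.

```python
from pathlib import Path

# Placeholder template; the real file has its own keys and comments.
CONFIG_TEMPLATE = """\
# Path to the PDF to extract from (hypothetical key)
pdf_path: ""
# Pages to extract, separated by commas (hypothetical key)
pages: "1"
"""


def ensure_config(config_path):
    """Write a template config if none exists. Return True if one was created."""
    path = Path(config_path)
    if path.exists():
        return False
    path.write_text(CONFIG_TEMPLATE, encoding="utf-8")
    return True


def main(config_path="config.yaml"):
    """Return an exit code: 0 after creating the config, else continue."""
    if ensure_config(config_path):
        print(f"Created {config_path}. Edit it, then run the script again.")
        return 0
    # Extraction would start here once a config file is present.
    return None
```

Because the file is only written when it is missing, an existing configuration is never overwritten, which is also why a missing config.yaml triggers a fresh one.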
- The pages being extracted from the PDF are declared in the configuration file, but there may be times when the pages need to be changed temporarily. For example, the PDF may be missing a page it normally contains, throwing off the page numbers where the tables usually are. If this happens, you can adjust without updating the config file by using the pdf_extract_fix.py script. Instead of using the value from the configuration, this script asks you to input the page numbers. Enter the page numbers on a single line, separated by commas, like so: 5,7,9,15
- If the config file is ever accidentally deleted or corrupted, a new one can be generated by re-running pdf_data_extract.py. Just ensure the existing config.yaml file is not in the script folder.
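For the temporary page override described above, the comma-separated input has to be validated before it is handed to the extractor. The sketch below shows one way to do that. The function name `parse_pages` is hypothetical and the actual pdf_extract_fix.py may handle input differently.

```python
def parse_pages(raw):
    """Turn input like '5,7,9,15' into a list of page numbers."""
    pages = []
    for piece in raw.split(","):
        piece = piece.strip()
        if not piece:
            # Ignore empty entries such as a trailing comma.
            continue
        if not piece.isdecimal() or int(piece) < 1:
            raise ValueError(f"Not a valid page number: {piece!r}")
        pages.append(int(piece))
    if not pages:
        raise ValueError("No page numbers were entered")
    return pages
```

The validated list can be joined back into the string form with `",".join(str(p) for p in pages)` for tools that expect the pages as a single comma-separated string.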