Extracts tables from PDFs and exports them to CSV files

timothy-fyi/PDF-Table-Extractor

PDF Table Extractor

Use Case

PDF-Table-Extractor was built around a specific set of PDF files that are delivered monthly and contain data used in business reporting. Instead of extracting the data by hand each month, this script does it automatically and exports the tables to CSV so that they can be imported into data visualization tools for transformation and reporting. There is also an option to lightly clean the resulting CSVs: setting different start and end points for the data, filtering a column, and removing NaN values.

While this was built for a specific type of PDF and a specific use case, it can likely be modified to suit other needs if the default setup isn't compatible with a given PDF.
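
To make the light cleaning options concrete, here is a minimal sketch of the three steps (start/end points, column filter, dropping missing values) using only the standard library. It is not the script's actual code: the function name `clean_rows`, its parameters, and the choice to drop any row that contains a missing cell are all assumptions made for illustration.

```python
import csv
import io

# Cell values treated as missing (assumption: blank cells and "nan"/"NaN" text)
MISSING = {"", "nan", "NaN"}

def clean_rows(rows, start=0, end=None, filter_col=None, keep_value=None):
    """Return a lightly cleaned copy of `rows` (a list of lists of strings)."""
    # 1. Keep only the rows between the chosen start and end points.
    kept = rows[start:end]
    # 2. Optionally keep only rows whose `filter_col` cell equals `keep_value`.
    if filter_col is not None:
        kept = [r for r in kept if r[filter_col] == keep_value]
    # 3. Drop rows that contain a missing value.
    return [r for r in kept if not any(c.strip() in MISSING for c in r)]

# Made-up CSV text standing in for an exported table
raw = (
    "junk,,\n"
    "region,month,sales\n"
    "East,Jan,10\n"
    "West,Jan,NaN\n"
    "East,Feb,12\n"
    "footer,,\n"
)
rows = list(csv.reader(io.StringIO(raw)))
cleaned = clean_rows(rows, start=2, end=5, filter_col=0, keep_value="East")
```

In this example the junk, header, and footer rows fall outside the start and end points, and the row with a `NaN` sales value is dropped even when no column filter is applied.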

How This Works

The main driver behind this project is Camelot. It is a Python library that allows for extracting tables from PDFs. If you need to modify this code to suit your needs, reading through Camelot's documentation will be extremely helpful.
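
As a rough idea of what working with Camelot looks like, the sketch below reads tables from selected pages and writes each one to its own CSV file. It uses Camelot's `read_pdf` function and the `to_csv` method on each table; the function names `pages_argument` and `extract_tables` and the output file naming are hypothetical and not taken from this project.

```python
from pathlib import Path

def pages_argument(pages):
    """Format page numbers as the comma-separated string Camelot expects, e.g. '5,7,9,15'."""
    return ",".join(str(p) for p in pages)

def extract_tables(pdf_path, pages, out_dir):
    """Read tables from the given pages and write each one to its own CSV file."""
    # Imported here so the helper above can be used without Camelot installed
    import camelot

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # read_pdf returns a list-like collection of tables found on those pages
    tables = camelot.read_pdf(str(pdf_path), pages=pages_argument(pages))

    written = []
    for i, table in enumerate(tables, start=1):
        # Hypothetical naming scheme: <pdf name>_table_<n>.csv
        csv_path = out_dir / f"{Path(pdf_path).stem}_table_{i}.csv"
        table.to_csv(str(csv_path))
        written.append(csv_path)
    return written
```

A call such as `extract_tables("report.pdf", [5, 7, 9, 15], "output")` would then produce one CSV per detected table, assuming Camelot and Ghostscript are installed and `report.pdf` exists.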

Set Up

  1. Requires Python 3.7 and up
  2. Requires the PDF interpreter Ghostscript. Download and install.
  3. In a terminal/CMD window, navigate to the folder where the requirements.txt file is (ex. cd Drive:\Path\To\PDF Extractor Folder) and run pip install -r requirements.txt to install required packages
  4. Run pdf_data_extract.py
  5. The script will generate a config.yaml file and exit
  6. Edit the config.yaml file, following the comments within it to ensure the correct values are entered
  7. Re-run pdf_data_extract.py when ready to extract
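
Steps 4 through 7 describe a "generate the config on first run, then exit" pattern. The sketch below shows one way such a pattern could be written; it is an illustration under assumptions, not the code in pdf_data_extract.py. The template keys (`pdf_path`, `pages`) and the function names are hypothetical, and the real config.yaml has its own keys and comments.

```python
import sys
from pathlib import Path

# Hypothetical template; the real config.yaml defines its own keys and comments
CONFIG_TEMPLATE = """\
# Path to the PDF to extract from (hypothetical key)
pdf_path: ""
# Pages that contain tables, separated by commas (hypothetical key)
pages: "1"
"""

def ensure_config(path):
    """Write a template config if none exists. Return True if one was created."""
    path = Path(path)
    if path.exists():
        return False
    path.write_text(CONFIG_TEMPLATE, encoding="utf-8")
    return True

def main(config_path="config.yaml"):
    # First run: create the template and stop so the user can fill it in
    if ensure_config(config_path):
        print(f"Created {config_path}. Edit it, then run the script again.")
        sys.exit(0)
    # Later runs: the config already exists, so extraction can proceed
    print(f"Using existing {config_path}")
```

Because `ensure_config` never overwrites an existing file, a config that has already been edited is left untouched on later runs.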

Additional Info

  • The pages to extract from the PDF are declared in the configuration file, but there may be times when they need to be changed temporarily. For example, the PDF may be missing a page it normally contains, throwing off the page numbers where the tables usually are. If this happens, you can adjust without updating the config file by using the pdf_extract_fix.py script. Instead of using the value from the configuration, this script asks you to input the page numbers. Enter them on a single line, separated with commas, like so: 5,7,9,15
  • If the config file is ever accidentally deleted or corrupted, a new one can be generated by re-running pdf_data_extract.py. Before doing so, make sure no config.yaml file is left in the script folder.
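
The comma-separated page input described in the first bullet can be validated before it is used. The following is a minimal sketch of such a check; `parse_pages` is a hypothetical helper, and pdf_extract_fix.py may handle the input differently.

```python
def parse_pages(text):
    """Turn '5,7,9,15' into [5, 7, 9, 15], rejecting anything that is not a positive whole number."""
    pages = []
    for part in text.split(","):
        part = part.strip()
        # Reject blanks, letters, negatives, and zero
        if not part.isdigit() or int(part) < 1:
            raise ValueError(f"Invalid page number: {part!r}")
        pages.append(int(part))
    return pages

pages = parse_pages("5,7,9,15")
```

Spaces around the commas are tolerated, while input such as `5,,7` or `5,x` raises a `ValueError` so the mistake is caught before extraction starts.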
