This tool explores interactive websites in search of downloadable information, such as PDF files.
The scraper recursively follows these general steps:
- download the current state of a website
- save all links that directly lead to .pdf files
- feed the state of the website into a prompted LLM
- let the LLM rank elements of the page by how interesting they look
- click on an interesting element and repeat from the first step with the new state of the website
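As an illustration of the link-saving step, here is a minimal sketch of how links to .pdf files could be collected from a downloaded page. It is not taken from main.py: the names `PdfLinkCollector` and `extract_pdf_links` are hypothetical, and it uses Python's standard-library HTML parser instead of beautifulsoup4 so that it runs without extra packages.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlparse


class PdfLinkCollector(HTMLParser):
    """Collects absolute URLs of <a href> targets whose path ends in .pdf."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.pdf_links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the URL of the page.
        absolute = urljoin(self.base_url, href)
        # Look at the path only, so query strings and fragments are ignored.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            if absolute not in self.pdf_links:
                self.pdf_links.append(absolute)


def extract_pdf_links(html: str, base_url: str) -> List[str]:
    """Hypothetical helper: return every distinct PDF link found in html."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_links
```

This only catches links whose URL path ends in .pdf. PDFs served from URLs without that extension would need a check of the response's content type instead.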
The scraper should track which states of the website it has already seen, so that it does not explore the same state twice. It stops after reaching a defined depth.
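The recursion with state tracking and a depth limit could look roughly like the sketch below. It is an assumption about the structure, not the code in main.py: `fingerprint` and `explore` are hypothetical names, `rank` stands in for the prompted LLM, and `click` stands in for the browser interaction. The sketch tries every ranked element in order, which may differ from how the actual scraper picks elements.

```python
import hashlib
from typing import Callable, Iterable, Optional, Set


def fingerprint(state: str) -> str:
    """Hash a page state so that already-seen states can be recognised."""
    return hashlib.sha256(state.encode("utf-8")).hexdigest()


def explore(
    state: str,
    rank: Callable[[str], Iterable[str]],
    click: Callable[[str, str], str],
    max_depth: int,
    depth: int = 0,
    seen: Optional[Set[str]] = None,
) -> Set[str]:
    """Depth-limited recursive exploration; returns fingerprints of visited states."""
    if seen is None:
        seen = set()
    key = fingerprint(state)
    if key in seen:
        return seen  # this state was already explored
    seen.add(key)
    if depth >= max_depth:
        return seen  # depth limit reached, do not click any further
    # rank() is assumed to yield element identifiers, most interesting first.
    for element in rank(state):
        new_state = click(state, element)
        explore(new_state, rank, click, max_depth, depth + 1, seen)
    return seen
```

Hashing the full page content is the simplest way to recognise a state, but pages with changing content (timestamps, ads) would produce a new hash on every visit, so a real implementation may need to normalise the page first.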
Install uv if you haven't already:
curl -LsSf https://astral.sh/uv/install.sh | sh
Create and source the virtual environment:
uv venv
source .venv/bin/activate
Install the required packages:
uv pip install requests beautifulsoup4 playwright
Install the browser binaries for Playwright:
playwright install
Run the scraper with Python. Make sure the link you want to try out is set in main.py first:
python main.py