ScrapeGPT is a web scraper builder that uses GPT-4 to automatically generate Python scripts for scraping websites based on user input. The app is built using Streamlit and allows users to input a URL and describe the data they want to scrape. The app then:
- Uses GPT-3.5 to iterate through the HTML to find the relevant information
- Summarizes its learnings
- Writes a web scraper
- Self-heals to debug itself
Demo video: `scrapegpt.mp4`
- Clone the repository:

  ```bash
  git clone https://github.com/davidhershey/ScrapeGPT
  ```

- Change to the project directory:

  ```bash
  cd ScrapeGPT
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Set the `OPENAI_API_KEY` environment variable to your OpenAI API key. You can also use a `.env` file, which should look like this:

  ```
  OPENAI_API_KEY="YOUR_KEY"
  ```
- (Optional, to set up PromptLayer) Set the `PROMPTLAYER_API_KEY` environment variable to your PromptLayer API key. You can also use a `.env` file, which should look like this:

  ```
  PROMPTLAYER_API_KEY="YOUR_KEY"
  ```
- Run the Streamlit app:

  ```bash
  streamlit run app.py
  ```
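How the app actually loads these keys isn't shown here; as an illustration of the environment-variable/`.env` fallback described above, here is a minimal stdlib sketch (`load_openai_key` is a hypothetical helper, not code from this repository):

```python
import os
from pathlib import Path
from typing import Optional

def load_openai_key(env_file: str = ".env") -> Optional[str]:
    """Return OPENAI_API_KEY from the environment, falling back to a .env file.

    Hypothetical helper for illustration; the app itself may use a library
    such as python-dotenv instead of parsing the file by hand.
    """
    key = os.environ.get("OPENAI_API_KEY")
    if key:
        return key
    path = Path(env_file)
    if path.exists():
        for line in path.read_text().splitlines():
            line = line.strip()
            if line.startswith("OPENAI_API_KEY="):
                # Strip the optional surrounding quotes, e.g. OPENAI_API_KEY="YOUR_KEY"
                return line.split("=", 1)[1].strip().strip('"').strip("'")
    return None
```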
- Open the app in your browser using the URL provided in the terminal.
- Enter a sample URL in the "Sample URL" field.
- Describe the data you want to scrape in the "Describe what you want to scrape from this page" field.
- Click the "Get Started!" button to generate a function name for your scraper.
- Optionally, edit the function name in the "Name of the function to generate" field.
- Click the "Build a scraper!" button to generate the scraper code.
- Edit the generated code if necessary.
- Use the "Automatic Debugger" section to iteratively debug the code. You can choose how many debugging rounds to run and which model to use.
- Once the code is working, the final scraper code will be displayed in the "Final Code" section.
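The repository's actual self-healing loop isn't reproduced here, but the general shape of the technique is: execute the generated scraper, capture any traceback, and hand the code plus the error back to the model for a repaired version. A toy sketch of that loop (`ask_model` is a stand-in for the real GPT call, and here it just "fixes" a known bug to make the loop observable):

```python
import traceback

def ask_model(code: str, error: str) -> str:
    """Stand-in for the GPT call that rewrites broken scraper code.

    A real implementation would send `code` and `error` to the model;
    this toy version repairs one known bug so the loop can be demonstrated.
    """
    return code.replace("1 / 0", "1")

def self_heal(code: str, rounds: int = 3) -> str:
    """Run `code`; on failure, ask the model for a fix, up to `rounds` times."""
    for _ in range(rounds):
        try:
            exec(compile(code, "<scraper>", "exec"), {})
            return code  # the code ran cleanly, so stop here
        except Exception:
            code = ask_model(code, traceback.format_exc())
    return code
```

The number of rounds plays the same role as the round count exposed in the "Automatic Debugger" section.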
There are a few examples of successful runs in the `successes` folder.
- All of the generated web scrapers depend on code that annotates and simplifies the HTML. That code lives in `simplify.py` and needs to be available to your scraper code. (The simplification is used to reduce the number of tokens sent to GPT-4.)
- Right now it can only write scrapers for a single page.
- Generate unit tests by looking at the simplified DOM, which should enable better auto-debugging.
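The annotate-and-simplify logic in `simplify.py` isn't reproduced here, but the core idea of stripping HTML down before sending it to the model can be sketched with the stdlib parser. This toy version only drops `<script>`/`<style>` blocks and all attributes, keeping tag structure and visible text; the real `simplify.py` (translated from TaxyAI's logic) is more sophisticated:

```python
from html.parser import HTMLParser

class Simplifier(HTMLParser):
    """Toy HTML simplifier for illustration: drops <script>/<style>
    subtrees and all attributes, keeping only tags and visible text."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.out = []
        self.skipping = 0  # depth inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skipping += 1
        elif not self.skipping:
            self.out.append(f"<{tag}>")  # attributes are deliberately dropped

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skipping = max(0, self.skipping - 1)
        elif not self.skipping:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping and data.strip():
            self.out.append(data.strip())

def simplify(html: str) -> str:
    """Return a stripped-down version of `html` with fewer tokens."""
    parser = Simplifier()
    parser.feed(html)
    return "".join(parser.out)
```

Even this crude version shrinks typical pages considerably, which is the point: fewer tokens per GPT-4 call.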
- I relied heavily on TaxyAI's browser-extension for the HTML simplification logic. I translated their logic from TypeScript to Python using GPT-4 to save some time :)
- PromptLayer made my life a lot easier when debugging the long chains of prompts used in this project.