This codebase allows you to scrape any website and extract relevant data points easily using OpenAI Functions and LangChain.
Create a schema in schemas.py, pick a url, and use them with scrape_with_playwright() in main.py to start scraping.
Tip: most websites keep the bulk of their content in <p>, <span>, or heading (<h1> to <h6>) tags. For best results, choose the combination of tags that works for the site you are scraping.
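To get a feel for how the choice of tags changes what gets extracted, here is a minimal sketch using only the Python standard library. It is not how this repo extracts content; `TagTextExtractor`, `extract_text`, and `sample_html` are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text found inside a chosen set of tags (hypothetical helper)."""

    def __init__(self, tags):
        super().__init__()
        self.wanted = set(tags)
        self.depth = 0  # number of wanted tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_text(html, tags):
    """Return the text chunks found inside any of the given tags."""
    parser = TagTextExtractor(tags)
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up page: headline and body are useful, the div is noise.
sample_html = """
<html><body>
  <h1>Markets rally</h1>
  <p>Stocks rose on Tuesday.</p>
  <span>Live updates</span>
  <div>Cookie banner</div>
</body></html>
"""

print(extract_text(sample_html, ["h1", "p"]))  # ['Markets rally', 'Stocks rose on Tuesday.']
print(extract_text(sample_html, ["span"]))     # ['Live updates']
```

Running a quick check like this against a saved copy of the page can help you decide which tags to pass before spending API calls on extraction.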
- Define the schema of the website you want to scrape in schemas.py (a Pydantic class or a dictionary are both fine):

    class SchemaNewsWebsites(BaseModel):
        news_headline: str
        news_short_summary: str
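For the dictionary option, a rough equivalent of the class above might look like the sketch below. It assumes the dictionary follows the JSON-Schema-style layout ("properties" plus "required") that LangChain's extraction chains accept; the exact shape this repo expects is not shown here, so check schemas.py. The name `news_schema_dict` is hypothetical.

```python
# Hypothetical dictionary counterpart of SchemaNewsWebsites.
# Assumption: a JSON-Schema-style layout with "properties" and "required".
news_schema_dict = {
    "properties": {
        "news_headline": {"type": "string"},
        "news_short_summary": {"type": "string"},
    },
    "required": ["news_headline", "news_short_summary"],
}

# Quick sanity check: every required field is declared under "properties".
missing = [
    field
    for field in news_schema_dict["required"]
    if field not in news_schema_dict["properties"]
]
print(missing)  # []
```

The Pydantic class gives you type checking in your own code; the dictionary is handy when the schema is built or loaded at runtime.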
- To start scraping, in main.py, run something like this:

    asyncio.run(scrape_with_playwright(
        url="https://www.bbc.com",
        tags=["span"],
        schema_pydantic=SchemaNewsWebsites
    ))
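If you want to scrape more than one site in a single run, one possible pattern is to loop over (url, tags) pairs inside one coroutine. The sketch below is self-contained, so it uses stand-ins: the `scrape_with_playwright` defined here is a stub that only echoes its inputs (the real one lives in this repo and its return value may differ), `SchemaNewsWebsites` is a placeholder class, and `targets` and the example.com URL are made up.

```python
import asyncio


async def scrape_with_playwright(url, tags, schema_pydantic):
    """Stand-in for the real function in this repo; it only echoes its inputs."""
    await asyncio.sleep(0)  # placeholder for the real browser and LLM work
    return {"url": url, "tags": tags, "schema": schema_pydantic.__name__}


class SchemaNewsWebsites:  # placeholder; the real class is a Pydantic BaseModel
    pass


# Hypothetical targets: each pairs a URL with the tags that suit that site.
targets = [
    ("https://www.bbc.com", ["span"]),
    ("https://example.com", ["p", "h1"]),
]


async def main():
    results = []
    # Scrape one site at a time to keep load on each site (and the API) low.
    for url, tags in targets:
        result = await scrape_with_playwright(
            url=url, tags=tags, schema_pydantic=SchemaNewsWebsites
        )
        results.append(result)
    return results


results = asyncio.run(main())
print(len(results))  # 2
```

In real use you would drop the two stand-ins and import the repo's own function and schema instead.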
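Create a Python virtual environment:
    python -m venv virtual-env    (or python3 -m venv virtual-env on Mac)
    py -m venv virtual-env        (Windows 11)
Activate it:
    .\virtual-env\Scripts\activate     (Windows)
    source virtual-env/bin/activate    (Mac)
Install the dependencies:
    poetry install --sync    (or poetry install)
Install the Playwright browsers:
    playwright install
Set your OpenAI API key:
    OPENAI_API_KEY=XXXXXX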
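Since extraction depends on OPENAI_API_KEY, it can help to fail early with a clear message when the key is missing or still set to the XXXXXX placeholder. This is a minimal sketch, not part of the repo; `require_openai_key` is a hypothetical helper, and it assumes the key is exposed to the process as an environment variable.

```python
import os


def require_openai_key(env=os.environ):
    """Hypothetical helper: return the key or fail early with a clear message."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key or key == "XXXXXX":
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add your real key before running main.py"
        )
    return key


# Demonstration with an explicit mapping instead of the real environment.
print(require_openai_key({"OPENAI_API_KEY": "sk-example"}))  # sk-example
```

Calling `require_openai_key()` with no arguments at the top of main.py would check the real environment before any browser is launched.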
Run the scraper:
    python main.py
- Add a FastAPI server on top of this to serve it as an API endpoint for ease of use.
- Use caution when scraping. Don't do anything I wouldn't do (illegal).
- P.S. I've added this functionality to LangChain in this PR. You can read the official docs here.