This repository is a starting point to test strategies for extracting structured information from a given URL using a language model.
You can visit this page to see an example dataset used to evaluate this agent, together with an experimental run using the default agent.
- A bare minimum implementation that can be used to extract structured information from a given URL (`src/agent` folder).
- A dataset and evaluation script to evaluate the performance of the agent doing the extraction (`src/eval` folder).
The agent is a LangGraph agent that uses a language model to extract structured information from a given URL. It is implemented in the `src/agent` folder.
The agent does the following:
- Accepts a URL and a JSON schema as input from a user.
- Fetches the HTML content of a given URL.
- Parses the HTML content into text.
- Uses a vanilla chat model capable of tool calling to extract structured information from the text that matches the schema.
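The fetch-and-parse steps above can be sketched with the Python standard library alone. Note that `html_to_text` below is a hypothetical helper for illustration, not the parser that ships in `src/agent`:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text from HTML, skipping script/style blocks."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Reduce an HTML document to newline-separated visible text."""
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```

The resulting text would then be passed to the chat model along with the user-supplied JSON schema (typically bound as a tool) to produce the structured output.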
Install the langgraph CLI:
pip install "langgraph-cli[inmem]==0.1.61"
Install dependencies:
pip install -e .
Load API keys into the environment for the LangSmith SDK and OpenAI API:
export LANGSMITH_API_KEY=<your_langsmith_api_key>
# Or configure another chat model
export OPENAI_API_KEY=<your_openai_api_key>
Launch the agent:
langgraph dev
If all is well, you should see the following output:
Ready!
Docs: http://127.0.0.1:2024/docs
LangGraph Studio Web UI: https://smith.langchain.com/studio/?baseUrl=http://127.0.0.1:2024
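Once the dev server is up, you can also call it over plain HTTP. The endpoint and input keys below (`/runs/wait`, `url`, `json_schema`, assistant id `agent`) are assumptions for illustration — check the generated docs at `/docs` for the actual request shape:

```python
import json
from urllib import request


def build_payload(page_url: str, schema: dict) -> bytes:
    # NOTE: the input keys and assistant id here are assumptions;
    # consult the server's /docs for the real schema.
    return json.dumps({
        "assistant_id": "agent",
        "input": {"url": page_url, "json_schema": schema},
    }).encode()


def extract(base_url: str, page_url: str, schema: dict) -> dict:
    """POST a stateless run to the dev server and wait for the result."""
    req = request.Request(
        f"{base_url}/runs/wait",
        data=build_payload(page_url, schema),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```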
You can try to improve the extraction strategy in a variety of ways. For example,
- Improving the HTML parsing strategy.
- Adding handling for large HTML documents and deduplicating extracted information.
- Adding reflection steps.
- Extending this to work with data URLs and other file formats like PDFs. (The `src/agent/parsing` module already has functionality to parse PDFs; you just need to hook it up.)
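As a starting point for handling large documents, one common approach is to split the parsed text into overlapping windows so each piece fits the model's context. A minimal sketch (not part of the repo):

```python
def chunk_text(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:
    """Split long text into overlapping windows of at most max_chars.

    Overlap reduces the chance that a fact is cut in half at a
    chunk boundary; extracted results still need deduplication.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks
```

Each chunk would be run through the extraction step independently, with the per-chunk results merged and deduplicated afterwards.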
Before attempting any optimization, it is important to establish baseline performance. This repository includes:
- A dataset consisting of a list of URLs and the expected structured information to be extracted from each URL.
- An evaluation script that can be used to evaluate the agent on this dataset.
Make sure you have the LangSmith CLI installed:
pip install langsmith
And set your API keys:
export LANGSMITH_API_KEY=<your_langsmith_api_key>
# We're using an LLM as a judge, so you'll also need an API key
export OPENAI_API_KEY=<your_openai_api_key>
A score between 0 and 1 is assigned to each extraction result by an LLM acting as a judge.
The judge assigns the score based on how closely the extracted information matches the expected information.
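For intuition, here is a crude deterministic stand-in for that scoring — exact per-field matching. This is only an illustration; the actual evaluation uses an LLM judge, which tolerates paraphrase and formatting differences:

```python
def field_match_score(expected: dict, extracted: dict) -> float:
    """Fraction of expected fields whose extracted values match exactly.

    A deterministic illustration of 0-to-1 scoring, NOT the repo's
    LLM-as-judge evaluator.
    """
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if extracted.get(key) == value)
    return hits / len(expected)
```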
Create a new dataset in LangSmith using the code in the `eval` folder:
python eval/create_dataset.py
To run the evaluation, use the `run_eval.py` script in the `eval` folder. This will create a new experiment in LangSmith for the dataset you created in the previous step.
python eval/run_eval.py --experiment-prefix "My custom prefix" --agent-url http://localhost:2024
- You can deploy it using LangGraph Platform.
- If you're deploying this agent yourself and the container is not network isolated (e.g., it can access other network resources), you should configure a proxy for outbound web requests.