This repo contains code for the Getting started with BAML for graphs workshop, hosted by GraphGeeks.
We'll use BAML, a domain-specific language to get structured outputs from LLMs, to extract nodes and relationships from the unstructured data via LLMs. And we'll also use Kuzu, an embedded graph database, to store and query the knowledge graph in Cypher.
Imagine you are a developer at a financial asset company, tasked with tracking the merger and acquisition (M&A) activity of companies in the mining industry. You have access to a news API feed that updates daily with articles about the various mergers and acquisitions occurring in the mining industry.
Due to the nature of the problem, you're inherently interested in the relationships between companies and their associated entities in the data. However, the raw data is not in a structured format, so you will need to use LLMs to help extract, transform and organize the data into a knowledge graph. The knowledge graph must contain nodes and edges that connect companies, commodities, countries, and the time periods in which the M&A events occur, so that you can power downstream applications like a signal alert bot or a question-answering system (based on Graph RAG).
Because the raw data (described below) is not in a structured format, we first need to create a knowledge graph that captures the relationships between the entities described in the articles.
The steps are as follows:
- Sketch a schema for the knowledge graph.
- Use a series of BAML functions to prompt an LLM to extract structured data from the raw articles.
- Function 1: Classify a news article as either a merger, acquisition, or neither.
- Function 2: Extract the information from articles that are about mergers into
data/mergers.json
. - Function 3: Extract the information from articles that are about acquisitions into
data/acquisitions.json
. - Function 4: Extract the commodities that are produced by the companies & countries mentioned in the articles.
- Create a knowledge graph in Kuzu using the data extracted by the LLMs.
- Query the knowledge graph using Cypher.
The following diagram shows the sequence of steps.
Each BAML function is orchestrated via Python logic. More sophisticated error handling and fallbacks ("agentic workflows") can be implemented as necessary by using your framework of choice on top of BAML -- however, for a lot of cases like this, simple Python code is all you need.
The data consists of publicly available news articles about M&A activity in the mining industry, from popular websites like mining.com, australianmining.com and northernminer.com. The data is semi-structured and in JSON format, with metadata about each article, as well as the article's content.
See the file data/articles.json
for the input data used for this project.
The graph schema being modelled is shown in the following diagram:
It's recommended to install uv to manage the dependencies.
uv sync
Alternatively, you can install the dependencies manually via pip.
pip install -r requirements.txt
To run the code, the default (recommended) way is to use the uv run
command. Alternatively,
you can run the code by replacing uv run
with python
in the commands below.
The BAML extraction pipeline is in 00_extractor.py
. See the baml_src/
directory for the
BAML prompts used in each stage.
uv run 00_extractor.py
The graph schema is defined in 01_create_schema.py
. This initializes a new, empty Kuzu database
into which we will load the data extracted in the previous step.
uv run 01_create_schema.py
The graph is created in stages, starting with the merger
subgraph.
uv run 02_create_merger_subgraph.py
The next step is to create the acquisition
subgraph.
uv run 03_create_acquisition_subgraph.py
The unstructured data is now transformed and stored as a knowledge graph in Kuzu!
That's it! We now have a knowledge graph stored in Kuzu that we can use to answer queries such as the following:
// Which companies produce gold and where are they located?
MATCH (c:Company)-[:PRODUCES]->(ct:Commodity {name: "Gold"}),
(c)-[:IS_FROM_COUNTRY]->(country:Country)
RETURN c.name, country.name
┌─────────────────────────┬──────────────┐
│ c.name ┆ country.name │
│ --- ┆ --- │
│ str ┆ str │
╞═════════════════════════╪══════════════╡
│ Northern Star Resources ┆ Australia │
│ De Grey Mining ┆ Australia │
│ Ramelius Resources ┆ Australia │
│ Newcrest Mining ┆ Australia │
│ Westgold ┆ Australia │
│ Calibre Mining ┆ Canada │
│ Pretium Resources ┆ Canada │
│ Equinox Gold ┆ Canada │
│ Newmont ┆ USA │
└─────────────────────────┴──────────────┘
There are many more queries that can be run on the graph, and we will explore them, as well as the methodology in much more detail in the workshop. See you there!