This project focuses on the automatic extraction of SHACL constraints from textual guides using Large Language Models (LLMs). By leveraging advanced prompt techniques and language models like LLaMA or GPT, it aims to generate accurate SHACL shapes for knowledge graphs, facilitating ontology validation and ensuring data consistency. The system supports various prompting strategies and provides a multi-agent framework to improve constraint extraction quality.
-
chroma/
Stores the vector embeddings generated from the PDF guide fragments (if you have already processed them). -
content/
Contains the source guide, both in PDF and in HTML format. The final version for use isRINF_Application_guide_V1.6.1.html. -
output/
Contains all generated SHACL constraints in Turtle (.ttl) format. Filenames follow the pattern:
{filename}_{model}_{temperature}_{prompting_technique}.ttl. -
plots/
Includes the code needed to generate plots and statistics related to the experiments. -
prompts/
Holds all prompt templates used during the project, organized in JSON files by prompting technique. -
validation/
Contains code and data used to validate the generated SHACL constraints. -
auxiliary_ontology_functions.py
Utility functions for ontology processing. -
cloudflare.py
Helper functions for interacting with Cloudflare R2. -
era-shapes.ttl
Gold standard of SHACL shapes that serves as the expected target. -
main.py
The main entry point to run the project. -
multiagent.py
Implements a multi-agent system to generate constraints collaboratively. -
ollama_functions.py
Helper functions to interface with the Ollama server. -
ontology.ttl
Base ontology used as the starting point to generate constraints. -
preprocess_html.py
Script to preprocess the HTML file converted from the PDF. (There is no need to run it again since the final HTML has already been generated.) -
prompts.py
Includes a function to load prompt templates from JSON files. -
rag.py
Implements the RAG (Retrieval-Augmented Generation) technique to build a document retriever. -
requirements.txt
Project dependencies. -
run_experiments.sh
Shell script to execute the entire pipeline, including experiments.
To install all dependencies and get the project running:
pip install -r requirements.txtIn addition, the following conditions and configurations are required:
-
Running Redis server
An active Redis server is essential for the proper functioning of certain modules in the project. -
.envconfiguration file
A.envfile must be created containing the following environment variables, required for interacting with Cloudflare R2:ACCOUNT_ID: Cloudflare account identifier.R2_ACCESS_KEY: Cloudflare R2 access key.R2_SECRET_KEY: Cloudflare R2 secret key.R2_BUCKET: Name of the R2 bucket, which must be set to public.PUB_URL: Public URL of the R2 bucket.
-
Ollama server (for the open-source version)
If using the open-source version of the system, an active Ollama server is required, and thellama3:8bmodel must be downloaded. -
OpenAI API (optional)
If you prefer to use the OpenAI API, the API key must be added to your shell environment. This can be done by adding the following line to your~/.bashrcfile:export OPENAI_API_KEY=your_openai_api_key_here
To validate the constraints against specific RDF data, the file ES.zip_combined-new.nq is used.
This file is not included in the repository due to security and privacy reasons.
If you need access to it, please request it by email at:
📧 [email protected]
To run the entire experimentation process, use:
chmod +x run_experiments.sh
./run_experiments.shYou can also run a single extraction using the main script with the following command-line arguments:
python3 main.py <file_path> [options]| Argument | Description | Default | Options |
|---|---|---|---|
file |
Path to the text or PDF file to be processed. | N/A | N/A |
--force_process |
Forces reprocessing of the PDF even if it has been processed before. | False | Flag (no value needed) |
--model |
LLM model to use for constraint extraction. | llama |
llama, gpt |
--temperature |
Temperature setting for the LLM, controls randomness in generation. | 0 |
Any float value |
--prompting_technique |
Prompting technique to use for the LLM query. | basic |
v1, basic, few-shot, cot, grounded-citing, all |
python3 main.py content/RINF_Application_guide_V1.6.1.html --model gpt --temperature 0.5 --prompting_technique few-shot