Evaluation

Evaluation config

To run the evaluation, you need to create an evaluation configuration file. Some examples can be found in the config directory. Each test suite, such as Wikidata_Tekgen or DBpedia_WebNLG, contains a set of ontologies, and the config file contains parameterized path patterns for the files corresponding to each ontology in the test suite.

The following shows one example.

{
  "onto_list" : [
    "1_movie",  "2_music", "3_sport", "4_book", "5_military", 
    "6_computer", "7_space", "8_politics", "9_nature", "10_culture"
  ],
  "path_patterns": {
    "sys": "../../data/wikidata_tekgen/baselines/Vicuna-13B/llm_responses/ont_$$onto$$_llm_responses.jsonl",
    "gt":"../../data/wikidata_tekgen/ground_truth/ont_$$onto$$_ground_truth.jsonl",
    "selected_ids": "../../data/wikidata_tekgen/manually_verified_sentences/selected_ont_$$onto$$.txt",
    "onto": "../../data/wikidata_tekgen/ontologies/$$onto$$_ontology.json",
    "output": "../../data/wikidata_tekgen/baselines/Vicuna-13B/eval_metrics/ont_$$onto$$_llm_stats.jsonl"
  },
  "avg_out_file": "../../data/wikidata_tekgen/baselines/Vicuna-13B/eval_metrics/ont_llm_avg_stats.jsonl"
}

Here is the description of each of the parameters in the config.

Parameter Description
onto_list List of ontology ids. Each id replaces the $$onto$$ placeholder in the path patterns to produce the concrete file paths.
path_patterns/sys The path pattern to system outputs for test sentences in each ontology.
path_patterns/gt The path pattern to ground truth triples for each ontology.
path_patterns/selected_ids (Optional) The path pattern to a selected list of manually validated test cases if applicable.
path_patterns/onto The path pattern to the ontology file.
path_patterns/output The path pattern for the detailed output file with metrics for each individual test sentence in each ontology.
avg_out_file The path of the output file with average metrics at the ontology level and globally for the whole dataset.
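The placeholder substitution described above can be sketched as follows. This is a minimal illustration, not the code used by run_eval.py: the helper name expand_patterns is hypothetical, and it assumes that $$onto$$ is replaced by plain string substitution with each id from onto_list.

```python
import json

def expand_patterns(config: dict) -> dict:
    """Return {onto_id: {key: concrete_path}} by substituting $$onto$$.

    Hypothetical helper: assumes plain string replacement of the placeholder.
    """
    expanded = {}
    for onto in config["onto_list"]:
        expanded[onto] = {
            key: pattern.replace("$$onto$$", onto)
            for key, pattern in config["path_patterns"].items()
        }
    return expanded

# A shortened version of the example config shown earlier.
config = json.loads("""
{
  "onto_list": ["1_movie", "2_music"],
  "path_patterns": {
    "gt": "../../data/wikidata_tekgen/ground_truth/ont_$$onto$$_ground_truth.jsonl",
    "onto": "../../data/wikidata_tekgen/ontologies/$$onto$$_ontology.json"
  }
}
""")

paths = expand_patterns(config)
print(paths["1_movie"]["gt"])
print(paths["2_music"]["onto"])
```

Under this assumption, every entry in onto_list yields one concrete path per key in path_patterns, which is why the ids must match the file names on disk.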

Running the evaluation script

In order to run the run_eval.py script, set the working directory to the Text2KGBench\src\evaluation directory:

cd Text2KGBench\src\evaluation
|   README.md
|   run_eval.py
|
\---config
        tekgen_vicuna_config.json
        tikgen_alpaca_config.json
        tikgen_unseen_alpaca_config.json
        tikgen_unseen_vicuna_config.json
        tikgen_vicuna_config.json
        webnlg_alpaca_config.json
        webnlg_vicuna_config.json

Running the run_eval.py script with the -h (or --help) option prints its usage:

python run_eval.py -h

It will generate an output similar to the following.

usage: run_eval.py [-h] --eval_config_path EVAL_CONFIG_PATH

options:
  -h, --help            show this help message and exit
  --eval_config_path EVAL_CONFIG_PATH

To run the evaluation, we need an evaluation configuration file as discussed in the previous section. You can find evaluation configurations for various setups in the config directory.

We run the run_eval.py script with a configuration file as a parameter:

python run_eval.py --eval_config_path config/tekgen_vicuna_config.json

The script generates one results file for each ontology, plus one file with average results for each ontology and for the whole dataset. You can find examples of the generated files in data\wikidata_tekgen\baselines\Vicuna-13B\eval_metrics. The output locations are defined in the configuration file.

File
ont_1_movie_llm_stats.jsonl
ont_2_music_llm_stats.jsonl
ont_3_sport_llm_stats.jsonl
ont_4_book_llm_stats.jsonl
ont_5_military_llm_stats.jsonl
ont_6_computer_llm_stats.jsonl
ont_7_space_llm_stats.jsonl
ont_8_politics_llm_stats.jsonl
ont_9_nature_llm_stats.jsonl
ont_10_culture_llm_stats.jsonl
ont_llm_avg_stats.jsonl

The results file for an individual ontology contains the results for each test sentence.

{
   "id":"ont_1_movie_test_1",
   "precision":"1.00",
   "recall":"0.50",
   "f1":"0.67",
   "onto_conf":"1.00",
   "rel_halluc":"0.00",
   "sub_halluc":"0.00",
   "obj_halluc":"0.00",
   "llm_triples":[
      [
         "Bleach: Hell Verse",
         "director",
         "Noriyuki Abe"
      ]
   ],
   "filtered_llm_triples":[
      [
         "Bleach: Hell Verse",
         "director",
         "Noriyuki Abe"
      ]
   ],
   "gt_triples":[
      [
         "Bleach : Hell Verse",
         "director",
         "Noriyuki Abe"
      ],
      [
         "Bleach : Hell Verse",
         "publication date",
         "01 January 2010"
      ]
   ],
   "sent":"Bleach: Hell Verse (Japanese: BLEACH , Hepburn: Bur\u00c4\u00abchi Jigoku-Hen) is a 2010 Japanese animated film directed by Noriyuki Abe."
}, 
...

Each entry in a per-ontology results file contains the following fields:

  • id: the test case identifier, e.g. "ont_1_movie_test_1"
  • precision: precision, P = correct_triples / predicted_triples
  • recall: recall, R = correct_triples / gold_triples
  • f1: F1 score, the harmonic mean of precision and recall: F1 = 2 * ((P * R) / (P + R))
  • onto_conf (OC): ontology conformance, the number of system triples whose relation is in the ontology divided by the total number of system triples
  • rel_halluc (RH): relation hallucination, the number of system triples whose relation is not in the ontology divided by the total number of system triples; it is the complement of OC, RH = 1 - OC
  • sub_halluc (SH): subject hallucination, calculated by checking whether the pre-processed triple subject is mentioned in the test sentence and/or the ontology concepts
  • obj_halluc (OH): object hallucination, calculated by checking whether the pre-processed triple object is mentioned in the test sentence and/or the ontology concepts
  • llm_triples: LLM system triples, e.g. [["Bleach: Hell Verse", "director", "Noriyuki Abe"]]
  • filtered_llm_triples: filtered LLM triples, e.g. [["Bleach: Hell Verse", "director", "Noriyuki Abe"]]
  • gt_triples: ground truth triples, e.g. [["Bleach : Hell Verse", "director", "Noriyuki Abe"], ["Bleach : Hell Verse", "publication date", "01 January 2010"]]
  • sent: original test sentence from which facts are extracted, e.g. "Bleach: Hell Verse (Japanese: BLEACH , Hepburn: Bur\u00c4\u00abchi Jigoku-Hen) is a 2010 Japanese animated film directed by Noriyuki Abe."
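As a rough illustration of the precision, recall, F1, OC, and RH formulas above, the sketch below scores the example entry. It is not the implementation in run_eval.py: the normalize function (lowercasing and dropping non-alphanumeric characters) is an assumed stand-in for the script's pre-processing, the function names are hypothetical, and the subject and object hallucination checks are omitted.

```python
import re

def normalize(text: str) -> str:
    # Assumed pre-processing: lowercase and drop non-alphanumeric characters,
    # so "Bleach: Hell Verse" and "Bleach : Hell Verse" compare equal.
    return re.sub(r"[^a-z0-9]", "", text.lower())

def norm_triple(triple):
    return tuple(normalize(part) for part in triple)

def score(system_triples, gold_triples, ontology_relations):
    sys_set = {norm_triple(t) for t in system_triples}
    gold_set = {norm_triple(t) for t in gold_triples}
    correct = len(sys_set & gold_set)

    p = correct / len(sys_set) if sys_set else 0.0
    r = correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * ((p * r) / (p + r)) if (p + r) > 0 else 0.0

    # Ontology conformance: share of system triples whose relation is in the ontology.
    # Returning 1.0 when there are no system triples is an assumption of this sketch.
    rels = {normalize(rel) for rel in ontology_relations}
    conforming = sum(1 for t in system_triples if normalize(t[1]) in rels)
    oc = conforming / len(system_triples) if system_triples else 1.0

    return {"precision": p, "recall": r, "f1": f1,
            "onto_conf": oc, "rel_halluc": 1 - oc}

llm = [["Bleach: Hell Verse", "director", "Noriyuki Abe"]]
gt = [["Bleach : Hell Verse", "director", "Noriyuki Abe"],
      ["Bleach : Hell Verse", "publication date", "01 January 2010"]]

metrics = score(llm, gt, ontology_relations=["director", "publication date"])
print({k: f"{v:.2f}" for k, v in metrics.items()})
```

With these assumptions the sketch reproduces the values in the example entry: one of the two ground truth triples is found, so recall is 0.50 while precision is 1.00, giving an F1 of 0.67.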

The average results file contains the results at the ontology level.

{
   "onto":"1_movie",
   "type":"all_test_cases",
   "avg_precision":"0.33",
   "avg_recall":"0.23",
   "avg_f1":"0.25",
   "avg_onto_conf":"0.89",
   "avg_sub_halluc":"0.26",
   "avg_rel_halluc":"0.11",
   "avg_obj_halluc":"0.26"
}, ...

The average metrics file contains the following fields:

  • onto: the ontology identifier, e.g. "1_movie"
  • type: type of average calculation, one of "all_test_cases", "selected_test_cases", or "global" (global average figures)
  • avg_precision (AP): average precision for the ontology
  • avg_recall (AR): average recall for the ontology
  • avg_f1 (AF1): average F1 score for the ontology
  • avg_onto_conf (AOC): average ontology conformance for the ontology
  • avg_sub_halluc (ASH): average subject hallucination for the ontology
  • avg_rel_halluc (ARH): average relation hallucination for the ontology, ARH = 1 - AOC
  • avg_obj_halluc (AOH): average object hallucination for the ontology
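The relationship between per-sentence entries and ontology-level averages can be sketched as below. This assumes the averages are plain arithmetic means over the per-sentence values, which the description above suggests but does not state. The function name average_metrics and the second sample row are made up for illustration.

```python
def average_metrics(rows, keys=("precision", "recall", "f1", "onto_conf", "rel_halluc")):
    """Average string-encoded per-sentence metrics and format them to two decimals.

    Hypothetical helper: assumes an unweighted mean over test sentences.
    """
    n = len(rows)
    return {f"avg_{k}": f"{sum(float(row[k]) for row in rows) / n:.2f}" for k in keys}

rows = [
    # The example entry shown earlier.
    {"precision": "1.00", "recall": "0.50", "f1": "0.67",
     "onto_conf": "1.00", "rel_halluc": "0.00"},
    # An invented second sentence: two system triples, one correct, one gold triple,
    # and one relation that is not in the ontology.
    {"precision": "0.50", "recall": "1.00", "f1": "0.67",
     "onto_conf": "0.50", "rel_halluc": "0.50"},
]

avg = average_metrics(rows)
print(avg)
```

Because RH = 1 - OC holds for every sentence, it also holds for the means, which matches the example above where avg_onto_conf is 0.89 and avg_rel_halluc is 0.11.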