GPFL (paper) is a probabilistic logical rule learner optimized for mining instantiated rules, which contain constants, from knowledge graphs. This repository contains the code necessary to run the GPFL system.
Features:
- Significantly faster than existing systems at mining quality instantiated rules
- Provides a validation mechanism that filters out overfitting rules
- Fully implemented on the Neo4j graph database
- Adaptive to machines with different specifications through parameter tuning
- Toolkits for data preparation and rule analysis
Requirements:
- Java >= 1.8
- Gradle >= 5.6.4
GPFL generates rules by abstracting paths sampled from knowledge graphs. A closed path from the figure above is:
ADVISED_BY(person314, person415), PUBLISHES(person314, title107), PUBLISHES(person415, title107)
which can be abstracted into the rule:
ADVISED_BY(X, Y), PUBLISHES(X, A), PUBLISHES(Y, A)
and used to explain the concept ADVISED_BY. Translated into a logical rule:
ADVISED_BY(X, Y) <- PUBLISHES(X, A), PUBLISHES(Y, A)
where PUBLISHES(X, A) and PUBLISHES(Y, A) are the premises and ADVISED_BY(X, Y) is the consequence. Rules of this kind, which contain no constants, are known as abstract rules.
GPFL also generates instantiated rules that contain constants. For instance, from path:
ADVISED_BY(person314, person415), PUBLISHES(person415, title0), PUBLISHES(person240, title0)
we can derive an instantiated rule that specifies a correlation pattern (used as a constraint during inference) between person314 and person240:
ADVISED_BY(person314, Y) <- PUBLISHES(Y, A), PUBLISHES(person240, A)
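The abstraction step above can be sketched as follows. This is a minimal illustration, not GPFL's actual implementation; the helper `abstract_path` and its variable-naming scheme are hypothetical.

```python
# Minimal sketch (not GPFL's implementation) of abstracting a sampled path
# into a rule. The first triple of the path is the head atom; entities in
# `keep` stay as constants, all others are replaced by variables.

def abstract_path(path, keep=()):
    head_rel, head_s, head_o = path[0]
    mapping = {}                 # entity -> variable name
    fresh = iter("ABCDE")        # variables for non-head entities
    if head_s not in keep:
        mapping[head_s] = "X"    # head subject becomes X
    if head_o not in keep:
        mapping[head_o] = "Y"    # head object becomes Y

    def term(entity):
        if entity in keep:
            return entity        # constant kept in an instantiated rule
        if entity not in mapping:
            mapping[entity] = next(fresh)
        return mapping[entity]

    atoms = [(r, term(s), term(o)) for r, s, o in path]
    head, *body = atoms
    return "{}({},{}) <- ".format(*head) + ", ".join(
        "{}({},{})".format(*a) for a in body)

# Abstract (closed) rule from the first path:
closed = [("ADVISED_BY", "person314", "person415"),
          ("PUBLISHES", "person314", "title107"),
          ("PUBLISHES", "person415", "title107")]
print(abstract_path(closed))
# ADVISED_BY(X,Y) <- PUBLISHES(X,A), PUBLISHES(Y,A)

# Instantiated rule from the second path, keeping two constants:
inst = [("ADVISED_BY", "person314", "person415"),
        ("PUBLISHES", "person415", "title0"),
        ("PUBLISHES", "person240", "title0")]
print(abstract_path(inst, keep={"person314", "person240"}))
# ADVISED_BY(person314,Y) <- PUBLISHES(Y,A), PUBLISHES(person240,A)
```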
We start by learning rules from the UWCSE knowledge graph, a small graph in the academia domain. In the data/UWCSE folder you can find the inputs GPFL requires for learning:
- data/<train/test/valid>.txt: triple files for training, testing, and validation.
- data/<annotated_train/test/valid>.txt: as GPFL runs on a Neo4j database, indexing is used to optimize data querying; these files contain the training, test, and validation triples annotated with Neo4j ids.
- data/databases: the Neo4j database. It can also be conveniently used for EDA with Cypher and the Neo4j Graph Data Science ecosystem.
- config.json: the GPFL configuration file.
Now we introduce some options in a GPFL configuration file:
- home: home directory of your data.
- out: output directory.
- ins_depth: max length of instantiated rules.
- car_depth: max length of closed abstract rules.
- conf: confidence threshold.
- support: support (number of correct predictions) threshold.
- head_coverage: head coverage threshold.
- saturation: template saturation threshold.
- batch_size: size of the batch over which saturation is evaluated.
- thread_number: number of running threads. Note that because each thread is responsible for specializing a template or grounding a rule, using a large number of threads may cause out-of-memory issues.
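As a sketch, a minimal config.json using the options above might look like the following; all values here are illustrative, not recommended defaults (see the shipped data/UWCSE/config.json for a working example).

```json
{
  "home": "data/UWCSE",
  "out": "ins3-car3",
  "ins_depth": 3,
  "car_depth": 3,
  "conf": 0.001,
  "support": 3,
  "head_coverage": 0.001,
  "saturation": 0.99,
  "batch_size": 5000,
  "thread_number": 4
}
```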
To learn rules for UWCSE, run:
gradle run --args="-c data/UWCSE/config.json -r"
where the option -c specifies the location of the GPFL configuration file, and -r executes the chain of rule learning, application, and evaluation for link prediction.
Once the program finishes, results will be saved in the folder data/UWCSE/ins3-car3. In the result folder, the file rules.txt records all learned rules. To get the top rules, run the following command to sort rules by quality:
gradle run --args="-or data/UWCSE/ins3-car3"
In the sorted rules.txt file, each line has the values:
Type Rule Conf HC VP Supp BG
CAR ADVISED_BY(X,Y) <- PUBLISHES(X,V1), PUBLISHES(Y,V1) 0.09333 0.31343 0.03015 21 220
where Conf is the confidence, HC the head coverage, VP the validation precision, Supp the support, and BG the body grounding (total predictions).
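As a sketch of how these measures relate, the following uses textbook definitions; GPFL's exact computation may differ (e.g. through confidence smoothing), so reported values need not match exactly.

```python
# Textbook-style rule measures (illustrative; GPFL may compute these with
# smoothing or other adjustments).

def head_coverage(support, head_groundings):
    # fraction of the target relation's groundings the rule predicts correctly
    return support / head_groundings

def confidence(support, body_groundings):
    # fraction of the rule's predictions that are correct
    return support / body_groundings

# For the CAR above, Supp = 21 and BG = 220; HC = 0.31343 implies the head
# relation ADVISED_BY has 21 / 0.31343 ~ 67 groundings in the training data.
print(round(head_coverage(21, 67), 5))  # 0.31343
print(round(confidence(21, 220), 5))    # 0.09545 (GPFL reports 0.09333)
```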
To check the quality/type/length distribution of the learned rules, run:
gradle run --args="-c data/UWCSE/config.json -ra"
To find explanations of predicted and existing facts (test triples) in terms of rules, check the verifications.txt file in the result folder, where an entry looks like this:
Head Query: person211 PUBLISHES title88
Top Answer: 1 person415 PUBLISHES title88
BAR PUBLISHES(person415,Y) <- PUBLISHES(V1,Y), PUBLISHES(V1,title12) 0.40426
BAR PUBLISHES(person415,Y) <- PUBLISHES(V1,Y), PUBLISHES(V1,V2), PUBLISHES(person211,V2) 0.40299
BAR PUBLISHES(person415,Y) <- PUBLISHES(V1,Y), PUBLISHES(V1,title182) 0.39583
Top Answer: 2 person211 PUBLISHES title88
BAR PUBLISHES(person211,Y) <- PUBLISHES(V1,Y), PUBLISHES(V1,V2), PUBLISHES(person284,V2) 0.35294
BAR PUBLISHES(person211,Y) <- PUBLISHES(V1,Y), PUBLISHES(V1,title259) 0.325
BAR PUBLISHES(person211,Y) <- PUBLISHES(V1,Y), PUBLISHES(V1,title241) 0.325
Top Answer: 3 person240 PUBLISHES title88
BAR PUBLISHES(person240,Y) <- PUBLISHES(V1,Y), PUBLISHES(V1,V2), PUBLISHES(person161,V2) 0.2459
BAR PUBLISHES(person240,Y) <- PUBLISHES(V1,Y), PUBLISHES(V1,title268) 0.2381
BAR PUBLISHES(person240,Y) <- PUBLISHES(V1,Y), PUBLISHES(V1,V2), PUBLISHES(person415,V2) 0.17647
Correct Answer: 2 person211 PUBLISHES title88
BAR PUBLISHES(person211,Y) <- PUBLISHES(V1,Y), PUBLISHES(V1,V2), PUBLISHES(person284,V2) 0.35294
BAR PUBLISHES(person211,Y) <- PUBLISHES(V1,Y), PUBLISHES(V1,title259) 0.325
BAR PUBLISHES(person211,Y) <- PUBLISHES(V1,Y), PUBLISHES(V1,title241) 0.325
The line Head Query: person211 PUBLISHES title88 means that GPFL corrupts the known fact person211 PUBLISHES title88 into the head query ? PUBLISHES title88 and asks the learned rules to suggest candidates to replace ?. If person211 is proposed in the answer set, it is considered a correct answer. In this example, the correct answer ranks 2, as in:
Top Answer: 2 person211 PUBLISHES title88
BAR PUBLISHES(person211,Y) <- PUBLISHES(V1,Y), PUBLISHES(V1,V2), PUBLISHES(person284,V2) 0.35294
BAR PUBLISHES(person211,Y) <- PUBLISHES(V1,Y), PUBLISHES(V1,title259) 0.325
BAR PUBLISHES(person211,Y) <- PUBLISHES(V1,Y), PUBLISHES(V1,title241) 0.325
where the rules listed below it are the top rules that suggest the candidate person211. These rules can therefore be used to explain, in a data-driven way, why person211 publishes the paper title88.
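The ranking in this example is consistent with ordering candidates by the confidence of the best rule that suggests each of them. The following sketch uses that assumed aggregation (not necessarily GPFL's exact scoring) and reproduces the ranks:

```python
# Rank candidates for the head query "? PUBLISHES title88" by the confidence
# of the strongest rule suggesting each candidate (assumed aggregation).

candidate_rules = {
    "person415": [0.40426, 0.40299, 0.39583],
    "person211": [0.35294, 0.325, 0.325],
    "person240": [0.2459, 0.2381, 0.17647],
}

ranking = sorted(candidate_rules,
                 key=lambda c: max(candidate_rules[c]),
                 reverse=True)
print(ranking)                         # ['person415', 'person211', 'person240']
print(ranking.index("person211") + 1)  # 2, the rank of the correct answer
```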
To find detailed evaluation results, please refer to the eval_log.txt file in the result folder.
In this section, we provide recipes for different scenarios. To print help info about GPFL, run:
gradle run --args="-h"
To build an executable jar, run:
gradle shadowjar
The generated jar file will be at build/libs.
Given a home folder Foo, place your triple file in Foo/data/ and rename it to train.txt, then run:
gradle run --args="-sbg Foo"
The generated database will be at Foo/databases.
Given a home folder Foo that contains a configuration file config.json with the home option set to Foo, place your splits in Foo/data/ and name the training file train.txt, the test file test.txt, and the validation file valid.txt, then run:
gradle run --args="-c Foo/config.json -bg"
The generated database will be at Foo/databases.
Given a home folder Foo that contains a configuration file config.json with the home option set to Foo, to create training/test/validation splits in a 6:2:2 ratio, add the line "split_ratio": [0.6,0.2] to config.json and run:
gradle run --args="-c Foo/config.json -sg"
The splits will be at Foo/data.
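What the ratio means can be sketched as follows: [0.6, 0.2] gives 60% training, 20% test, and the remaining 20% validation. This is an illustration, not GPFL's splitting code; shuffling and seeding are assumptions.

```python
import random

# Illustrative 6:2:2 split mirroring "split_ratio": [0.6, 0.2]; the last
# portion (validation) is implied.
def split_triples(triples, ratio=(0.6, 0.2), seed=42):
    triples = list(triples)
    random.Random(seed).shuffle(triples)
    n_train = int(len(triples) * ratio[0])
    n_test = int(len(triples) * ratio[1])
    return (triples[:n_train],
            triples[n_train:n_train + n_test],
            triples[n_train + n_test:])

train, test, valid = split_triples(range(100))
print(len(train), len(test), len(valid))  # 60 20 20
```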
Given a home folder Foo that contains a configuration file config.json with the home option set to Foo, if you only want to learn rules for a collection of n random targets, add the line "randomly_selected_relations": n to config.json and run:
gradle run --args="-c Foo/config.json -r"
Given a home folder Foo that contains a configuration file config.json with the home option set to Foo, if you only want to learn rules for specific targets, e.g., target1 and target2, add the line "target_relation": ["target1", "target2"] to config.json and run:
gradle run --args="-c Foo/config.json -r"
Given a home folder Foo that contains a configuration file config.json with the home option set to Foo, if you want to sample n relationship types as targets from the database, add the line "randomly_selected_relations": n to config.json and run:
gradle run --args="-c Foo/config.json -st"
Given a home folder Foo that contains a configuration file config.json with the home option set to Foo, run:
gradle run --args="-c Foo/config.json -l"
Given a home folder Foo that contains a configuration file config.json with the home option set to Foo and out set to output, and with the predictions.txt file containing predictions made by GPFL placed in output, run:
gradle run --args="-c Foo/config.json -a"
Given that your predictions.txt produced by GPFL is placed in Foo, run:
gradle run --args="-e Foo"
Given that your predictions.txt produced by AnyBURL is placed in Foo, run:
gradle run --args="-ea Foo"
Given a home folder Foo that contains a configuration file config.json with the home option set to Foo, and result folders r1, r2 and r3 (each containing an eval_log.txt file) placed in Foo, if you want to run in ensemble mode to aggregate the best-performing rules over different configurations, add the line "ensemble_bases": ["r1", "r2", "r3"], optionally change the out option to ensemble, and run:
gradle run --args="-c Foo/config.json -en"
The rules.txt file produced by running -r and -l contains overfitting rules regardless of the value of overfitting_factor; the overfitting rules are removed only when the rules are evaluated for precision or link prediction (in memory, while remaining persistent in the file). To create a view of the non-overfitting rules, set overfitting_factor to a value > 0, then run:
gradle run --args="-c Foo/config.json -ovf"
The generated view will be at Foo/out/refined.txt.
GPFL is memory-intensive in that the volume of instantiated rules is often inordinate, and discovering top rules to explain existing facts requires keeping rules in memory. Rule learning is also time-consuming on large knowledge graphs. GPFL therefore allows time and space constraints that adapt the system to various task requirements. Here we introduce options you can tune in the config.json file if the runtime is too long or out-of-memory errors happen too often:
- learn_groundings: max number of groundings for evaluating a rule during learning.
- apply_groundings: max number of groundings for evaluating a rule during application.
- random_walkers: number of random walkers used to sample paths.
- ins_rule_cap: max number of instantiated rules that can be derived from a template.
- suggestion_cap: max number of predictions a rule can make during application.
- gen_time: max time (in seconds) to run the generalization procedure (creating templates and CARs).
- essential_time: max time (in seconds) to run the essential rule generation procedure (creating instantiated rules of length 1).
- spec_time: max time (in seconds) to run the specialization procedure (creating instantiated rules).
- thread_number: number of running threads.
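A config.json fragment with these caps might look like the following; all values are illustrative, not tuned recommendations, and you should consult the shipped configuration files for exact keys and sensible defaults.

```json
{
  "learn_groundings": 10000,
  "apply_groundings": 10000,
  "random_walkers": 10,
  "ins_rule_cap": 500000,
  "suggestion_cap": 1000,
  "gen_time": 100,
  "essential_time": 100,
  "spec_time": 200,
  "thread_number": 4
}
```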
All experiments reported in the paper were carried out on AWS EC2 r5.2xlarge instances. Please download the experiment datasets here and unzip them into the data folder.
gradle run --args="-c data/<dataset>/config.json -ert"
gradle run --args="-c data/<dataset>/config.json -p"
gradle run --args="-c data/<dataset>/config.json -ov"
For evaluating FB15K-237 in the default setting, run:
gradle run --args="-c data/FB15K-237/config.json -r"
On WN18RR, run:
gradle run --args="-c data/WN18RR/config.json -r"
To evaluate the impact of removing overfitting rules on link prediction performance, we have re-split FB15K-237 and WN18RR in a 6:2:2 ratio for larger validation sets. The corresponding re-splits can be found in the folders data/FB15K-237-LV and data/WN18RR-LV. To evaluate FB15K-237 in the random setting with validation, change the value of the option overfitting_factor from 0 to 0.1 in the file data/FB15K-237-LV/config.json, then run:
gradle run --args="-c data/FB15K-237-LV/config.json -r"
For evaluation with validation on WN18RR, change the value of overfitting_factor to 0.1 in the file data/WN18RR-LV/config.json, then run:
gradle run --args="-c data/WN18RR-LV/config.json -r"
If you use this codebase, please cite our paper:
@misc{gu2020learning,
  title={Towards Learning Instantiated Logical Rules from Knowledge Graphs},
  author={Yulong Gu and Yu Guan and Paolo Missier},
  year={2020},
  eprint={2003.06071},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}