Hello~
It seems that the following code in preprocess.py (lines 116-121) may inadvertently leak test anchor pairs into the training set. The variable ILL contains all anchor pairs (labels) loaded from icews_wiki/ref_pairs, but the anchor pairs have already been divided into training and testing sets, saved in the files icews_wiki/sup_pair and icews_wiki/ref_pairs, respectively.
However, the re-divided train may include anchor pairs that belong to the testing set icews_wiki/ref_pairs, potentially leading to data leakage.
train = ILL[:1500]
test = ILL[1500:]
same_name = {}
for id_1, id_2 in train:
    name = id_1 + "-" + id_2
    same_name[name] = [id_1, id_2]
Here, train is used to create same_name, which subsequently generates node2same to assign identical structure embeddings to each pair of anchor nodes in train (in get_deep_emb.py). In short, the anchor pairs in train are treated as known alignments, so train must not include any testing data.
To prevent this, the code should be changed as follows:
train = load_file(self.path + 'sup_pairs')
test = load_file(self.path + 'ref_pairs')
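To confirm the leak before and after the change, one could check the overlap between the two splits. The snippet below is only a minimal sketch: find_leaked_pairs is a hypothetical helper (it is not part of the repository), and the toy ID pairs stand in for the real contents of sup_pairs and ref_pairs.

```python
def find_leaked_pairs(train_pairs, test_pairs):
    """Return the training pairs that also appear in the testing split."""
    test_set = {tuple(pair) for pair in test_pairs}
    return [tuple(pair) for pair in train_pairs if tuple(pair) in test_set]


# Toy stand-ins for the two files (hypothetical entity IDs, not real data).
sup_pairs = [("0", "100"), ("1", "101")]
ref_pairs = [("2", "102"), ("3", "103"), ("4", "104")]

# Current behaviour: train is sliced from a list that holds test pairs.
ILL = ref_pairs
leaky_train = ILL[:2]

# Proposed behaviour: train comes directly from the supervised split.
clean_train = sup_pairs

leaked = find_leaked_pairs(leaky_train, ref_pairs)
clean = find_leaked_pairs(clean_train, ref_pairs)
print("leaked with slicing:", leaked)
print("leaked with sup_pairs:", clean)
```

If the second list is empty on the real files, no test anchor pair reaches same_name.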