
Whether there is a risk of leaking test sets in preprocess.py #1

Description

@Longmeix

Hello~
It seems that the following code in preprocess.py (lines 116-121) may inadvertently leak test anchor pairs into the training set. The variable ILL contains all anchor pairs (labels) loaded from icews_wiki/ref_pairs, which have already been divided into training and testing sets, saved in the files icews_wiki/sup_pairs and icews_wiki/ref_pairs, respectively.

However, the code re-splits ILL by position, so the resulting train may include anchor pairs that belong to the testing set icews_wiki/ref_pairs, potentially leading to data leakage.

train = ILL[:1500]
test = ILL[1500:]
same_name = {}
for id_1,id_2 in train:
    name = id_1+"-"+id_2
    same_name[name] = [id_1,id_2]

Here, train is used to create same_name, which subsequently generates node2same to assign identical structure embeddings to each pair of anchor nodes in train (in get_deep_emb.py). In short, the anchor pairs in train are treated as known supervision, so they should not include any testing data.
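
To make the concern concrete, here is a minimal sketch of how the overlap could be checked. The toy pairs and the ordering of ILL below are assumptions for illustration only, and `leaked_pairs` is a hypothetical helper, not a function from the repository.

```python
def leaked_pairs(train, test_pairs):
    """Return the pairs in `train` that also appear in `test_pairs`."""
    test_set = {tuple(p) for p in test_pairs}
    return [tuple(p) for p in train if tuple(p) in test_set]


# Toy data (assumed): IDs are strings, as implied by id_1+"-"+id_2.
ref_pairs = [("3", "103"), ("4", "104")]   # official test pairs
sup_pairs = [("1", "101"), ("2", "102")]   # official training pairs

# Assumed ordering: if ILL does not start with the official training
# pairs, a positional slice pulls test pairs into train.
ILL = ref_pairs + sup_pairs
train = ILL[:3]

print(leaked_pairs(train, ref_pairs))
```

If a check like this returns a non-empty list on the real files, the positional split has placed test pairs in train.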

To prevent this, the split should be loaded from the existing files instead of being recomputed:

train = load_file(self.path + 'sup_pairs')
test = load_file(self.path + 'ref_pairs')
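
As a rough sketch of the same idea with an explicit guard, the split could be loaded and checked for overlap before same_name is built. The file format (two whitespace-separated IDs per line) is an assumption, and `load_pairs` and `load_split` are hypothetical stand-ins, not the repository's `load_file`.

```python
import os


def load_pairs(path):
    """Read one anchor pair per line (assumed: two whitespace-separated IDs)."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                pairs.append((parts[0], parts[1]))
    return pairs


def load_split(data_dir):
    """Load the predefined train/test pairs and refuse to continue on overlap."""
    train = load_pairs(os.path.join(data_dir, "sup_pairs"))
    test = load_pairs(os.path.join(data_dir, "ref_pairs"))
    overlap = set(train) & set(test)
    if overlap:
        raise ValueError(f"{len(overlap)} anchor pairs appear in both train and test")
    return train, test
```

Raising an error rather than silently dropping duplicates is a design choice here: it surfaces a leaking split immediately instead of hiding it.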
