This branch offers
- An initial test set with a small number of test examples for each dataset, together with their labels in the `exist` column. Note that this test set serves development purposes only, so:
  - This is not the intermediate dataset we will be using for ranking solutions.
  - The intermediate and final datasets will not contain the `exist` column.
- A simple baseline that trains on both datasets.
Download links to initial test set: Dataset A Dataset B
The baseline is only a minimal working example for both datasets, and it is certainly not optimal. You are encouraged to tweak it or propose your own solutions from scratch!
Here we summarize our baseline:
The baseline is an RGCN-like GNN model trained on the entire graph.
Event timestamps on the graph are encoded by decomposing the 10-digit decimal integers into 10-dimensional vectors, each element representing a digit.
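
The digit decomposition described above can be sketched as follows. This is a minimal illustration, not the baseline's actual code: the function name `encode_timestamp`, the most-significant-digit-first ordering, and the zero-padding of shorter values are all assumptions.

```python
from typing import List


def encode_timestamp(ts: int, num_digits: int = 10) -> List[int]:
    """Split a non-negative integer timestamp into its decimal digits.

    Assumption: digits are ordered most significant first, and values
    with fewer than `num_digits` digits are left-padded with zeros.
    """
    if ts < 0 or ts >= 10 ** num_digits:
        raise ValueError(f"timestamp must fit in {num_digits} decimal digits")
    return [int(ch) for ch in str(ts).zfill(num_digits)]


# A 10-digit Unix timestamp becomes a 10-dimensional vector of digits.
print(encode_timestamp(1609459200))  # [1, 6, 0, 9, 4, 5, 9, 2, 0, 0]
```

In a real model, a vector like this would typically be converted to a float tensor (and possibly rescaled) before being fed to the network; how the baseline does this is not specified here.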
We train the model as binary classification using a negative-sampling-like strategy.
Given a ground truth event (s, d, r, t) with source node s, destination node d, event type r and timestamp t, we perturb t to obtain a new value t'.
We label the perturbed quadruplet (s, d, r, t') with 1 if the new timestamp t' is larger than the original timestamp t, and 0 otherwise. The model is essentially trained to
predict p(t < t' | s, d, r), i.e. the probability that an edge of type r exists from source s to destination d before timestamp t'.
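
The labeling rule above can be illustrated with a short sketch. The helper `perturb_event`, the uniform perturbation, and the `max_shift` window are hypothetical choices made for illustration; the baseline's actual sampling distribution is not described here.

```python
import random


def perturb_event(event, max_shift=86_400, rng=random):
    """Perturb the timestamp of a ground-truth event and label the result.

    `event` is a quadruplet (s, d, r, t). Assumption: t' is drawn by adding
    a non-zero uniform integer shift in [-max_shift, max_shift] to t.
    Returns ((s, d, r, t'), label), where label is 1 if t' > t and 0 otherwise.
    """
    s, d, r, t = event
    shift = 0
    while shift == 0:  # make sure the timestamp actually changes
        shift = rng.randint(-max_shift, max_shift)
    t_new = t + shift
    label = 1 if t_new > t else 0
    return (s, d, r, t_new), label


# Example: generate one training sample from a ground-truth event.
sample, label = perturb_event((12, 345, 7, 1609459200), rng=random.Random(0))
print(sample, label)
```

A later perturbed timestamp yields a positive example because the true event has already happened by t'; an earlier one yields a negative example.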
To use the baseline you need to install DGL.
You also need at least 64GB of CPU memory. GPU is not required.
- Convert the CSV files to DGL graph objects:
  `python csv2DGLgraph.py --dataset [A or B]`
- Train the baseline:
  `python base_pipeline.py --dataset [A or B]`
The baseline achieves an AUC of 0.511 on Dataset A and 0.510 on Dataset B.