Non-rigorous test score calculation #102

MengjieZhao · 2024-08-30T15:42:13Z

In the test score calculation, the authors use the test distribution to scale the test score per node. This is not a common practice for anomaly detection as the test statistics are usually unknown at the time of evaluation.

I refactored the code to use the statistics of the validation set to scale both the validation score and test score, however, there is a huge performance drop, as shown in the attached screenshot. The blue one is the original implementation with the test score scaled by the test distribution. The purple one is scaled by the validation statistics. The black one is first scaled by the first 1000 test samples' statistics (roughly 1/9), generally increased to 8/9, but there is almost no change in terms of the f1 performance. The final jump of the black line occurred because I changed to use the entire test statistics instead of 8/9.

The problem persists with the ROC-AUC as well, with the test score scaled by the validation statistics, the AUC is around 0.52 for the best epoch.

This work is the foundation of many GNN for time series papers, I urge attention to this issue.

scores = get_err_scores(test_re_list, val_re_list)
normal_dist = get_err_scores(val_re_list, val_re_list)

def get_err_scores(test_res, val_res):
    test_predict, test_gt = test_res
    val_predict, val_gt = val_res

    n_err_mid, n_err_iqr = get_err_median_and_iqr(test_predict, test_gt)

    test_delta = np.abs(np.subtract(
                        np.array(test_predict).astype(np.float64), 
                        np.array(test_gt).astype(np.float64)
                    ))
    epsilon=1e-2

    err_scores = (test_delta - n_err_mid) / ( np.abs(n_err_iqr) +epsilon)

    smoothed_err_scores = np.zeros(err_scores.shape)
    before_num = 3
    for i in range(before_num, len(err_scores)):
        smoothed_err_scores[i] = np.mean(err_scores[i-before_num:i+1])

    
    return smoothed_err_scores

The text was updated successfully, but these errors were encountered:

hepingf100 · 2024-09-24T09:41:06Z

In what environment did you conduct the experiments? Could you share the versions of the setup?

MengjieZhao · 2024-09-24T11:43:09Z

In what environment did you conduct the experiments? Could you share the versions of the setup?

 python==3.9
 torch==1.12.1
 torch_geometric==2.2.0

hepingf100 · 2024-09-24T12:14:13Z

Did you encounter any issues during the run?for example index = broadcast(index, src, dim)
File "/home/hepingfu/anaconda3/envs/TGDN/lib/python3.8/site-packages/torch_scatter/utils.py", line 12, in broadcast
src = src.expand(other.size())
RuntimeError: The expanded size of the tensor (1) must match the existing size (192000) at non-singleton dimension 1. Target sizes: [192000, 1, 128]. Tensor sizes: [1, 192000, 1]

hepingf100 · 2024-09-24T12:18:59Z

Did you modify the code? Could you share a copy with me? Thank you very much! my email: [email protected]

MengjieZhao · 2024-09-24T13:18:37Z

Did you modify the code? Could you share a copy with me? Thank you very much! my email: [email protected]

I can run the code as it is with the torch and torch_geometric versions mentioned above. It looks like a package version problem

hepingf100 · 2024-09-26T08:53:37Z

This environment cannot run on an RTX 4090 with CUDA 12.2.

peerschuett · 2025-01-07T16:38:34Z

Hi @MengjieZhao, what parts of the code are you exactly talking about? Is it get_err_median_and_iqr?

MengjieZhao · 2025-01-07T16:44:20Z

Hi @MengjieZhao, what parts of the code are you exactly talking about? Is it get_err_median_and_iqr?

Yes, exactly.

In the line of n_err_mid, n_err_iqr = get_err_median_and_iqr(test_predict, test_gt)

For the anomaly score calculation, one should take n_err_mid, n_err_iqr = get_err_median_and_iqr(val_predict, val_gt) to get the median and the IQR for scaling.

peerschuett · 2025-01-08T08:55:32Z

But why should we do this? At time of the calculation, the network has predicted the sensor values for each of the test time steps and we have measured the ground truth. Therefore, we can calculate their difference and smooth based on that.

Then, we use the anomaly threshold calculated on the smoothed validation set error (https://github.com/d-ailin/GDN/blob/main/evaluate.py#L109) and make the prediction on the smoothed test set error.

MengjieZhao · 2025-01-08T10:25:51Z

But why should we do this? At time of the calculation, the network has predicted the sensor values for each of the test time steps and we have measured the ground truth. Therefore, we can calculate their difference and smooth based on that.

Then, we use the anomaly threshold calculated on the smoothed validation set error (https://github.com/d-ailin/GDN/blob/main/evaluate.py#L109) and make the prediction on the smoothed test set error.

From the nature of the task of anomaly detection, one would like to have the prediction in real time, and in real time, one only has the test data until the observed time. Using the statistics from the entire test dataset makes applying the model practically useless.

peerschuett · 2025-01-08T10:41:41Z

Ah okay, I understand that reason! But then we should use the statistics from the train and validation set combined, because they are our measure for normality.

MengjieZhao · 2025-01-08T12:14:50Z

Ah okay, I understand that reason! But then we should use the statistics from the train and validation set combined, because they are our measure for normality.

One has to account for the generalization error as well, since the anomaly score is based on the forecasting error, that of the training set will naturally be lower.

peerschuett · 2025-01-08T16:21:55Z

Good point, tanks for the insight! How big is your validation set? Maybe it did not cover enough of the data distribution.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-rigorous test score calculation #102

Non-rigorous test score calculation #102

MengjieZhao commented Aug 30, 2024 •

edited

Loading

hepingf100 commented Sep 24, 2024

MengjieZhao commented Sep 24, 2024

hepingf100 commented Sep 24, 2024

hepingf100 commented Sep 24, 2024

MengjieZhao commented Sep 24, 2024

hepingf100 commented Sep 26, 2024

peerschuett commented Jan 7, 2025

MengjieZhao commented Jan 7, 2025

peerschuett commented Jan 8, 2025

MengjieZhao commented Jan 8, 2025

peerschuett commented Jan 8, 2025

MengjieZhao commented Jan 8, 2025

peerschuett commented Jan 8, 2025

Non-rigorous test score calculation #102

Non-rigorous test score calculation #102

Comments

MengjieZhao commented Aug 30, 2024 • edited Loading

hepingf100 commented Sep 24, 2024

MengjieZhao commented Sep 24, 2024

hepingf100 commented Sep 24, 2024

hepingf100 commented Sep 24, 2024

MengjieZhao commented Sep 24, 2024

hepingf100 commented Sep 26, 2024

peerschuett commented Jan 7, 2025

MengjieZhao commented Jan 7, 2025

peerschuett commented Jan 8, 2025

MengjieZhao commented Jan 8, 2025

peerschuett commented Jan 8, 2025

MengjieZhao commented Jan 8, 2025

peerschuett commented Jan 8, 2025

MengjieZhao commented Aug 30, 2024 •

edited

Loading