
Filtering train set in test_avg_metrics #20

Open
bkj opened this issue Jun 15, 2018 · 5 comments

Comments

@bkj

bkj commented Jun 15, 2018

Hi all --

In some other recommender systems, there's a flag to filter the items in the training set from the test metrics -- is there something like that in qmf?

That is, it doesn't make sense to compute p@k on the test set if the top-k predictions are allowed to contain items we already observed for that user in the train set, since we know those items won't appear in the test set.
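
Roughly, I mean something like this (a made-up Python sketch of the idea, not qmf code -- `filtered_precision_at_k` and its arguments are names I just invented):

```python
import numpy as np

# Made-up sketch: mask the user's training items before taking top-k,
# so p@k only counts items the model could plausibly recommend.
def filtered_precision_at_k(scores, train_items, test_items, k=10):
    scores = scores.copy()                    # scores: (n_items,) for one user
    scores[list(train_items)] = -np.inf       # filter out train-set items
    top_k = np.argpartition(-scores, k)[:k]   # indices of the k highest scores
    return len(set(top_k) & set(test_items)) / k
```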

Thanks

@albietz
Contributor

albietz commented Jun 15, 2018

Hi @bkj

I'm not sure I understand precisely what you're asking, but in matrix factorization models it wouldn't make sense to include items in the test metrics that do not appear in training, since you wouldn't have any latent factors for those items to make test predictions with.

Instead, the test dataset should contain (user, item) pairs that do not appear in the training data, and evaluation computes ranking metrics per user on this subset (after filtering out all rows whose user or item did not appear in the training data). The metrics are averaged over all test users, and there's an option to evaluate on a smaller number of test users, since this can be costly when there are many users.
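
In rough pseudocode, the flow is something like this (a hand-written Python sketch of the logic just described, not qmf's actual API -- `avg_test_metric`, `per_user_metric`, and `max_test_users` are made-up names):

```python
# Hand-written sketch of the evaluation flow (not qmf's actual code).
def avg_test_metric(test_pairs, train_users, train_items, per_user_metric,
                    max_test_users=None):
    # Keep only test rows whose user AND item both appeared in training,
    # since otherwise there are no latent factors to score them with.
    by_user = {}
    for user, item in test_pairs:
        if user in train_users and item in train_items:
            by_user.setdefault(user, set()).add(item)
    users = list(by_user)
    if max_test_users is not None:       # optional cap: evaluating every user
        users = users[:max_test_users]   # is costly when there are many
    # Average the per-user ranking metric (e.g. p@k) over the test users.
    return sum(per_user_metric(u, by_user[u]) for u in users) / max(len(users), 1)
```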

Hope this helps.

Alberto

@bkj
Author

bkj commented Jun 15, 2018 via email

@albietz
Contributor

albietz commented Jun 15, 2018

Gotcha, I wasn't aware of this optimization. Do you have pointers to papers/implementations discussing this?

-Alberto

@bkj
Author

bkj commented Jun 15, 2018

No papers off the top of my head, but I know they do it in dsstne (and probably other places). A script to do the filtering is here.

On the example I'm running (MovieLens-1M), doing this filtering increases p@10 from ~0.1 to ~0.25 -- so it's a nontrivial difference, and I think it's the right way to do evaluation.

~ Ben

@albietz
Contributor

albietz commented Jun 15, 2018

Hmm, this might be worth including, but at the same time I'm not convinced it's the right way to do evaluation either. For example, it might artificially boost the p@k of different users in different ways, depending on how many positive items appear for each user (because you would only filter positive items, and not negatives that the user may have seen).
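
To make that concrete (a throwaway synthetic simulation I just made up, not a real benchmark): with purely random scores, a user with more training positives gets a smaller filtered candidate pool, so their measured p@10 comes out higher even though the model has learned nothing.

```python
import numpy as np

# Synthetic illustration: filtering only training positives shrinks the
# candidate pool more for users with many train items, inflating their p@10.
rng = np.random.default_rng(0)
n_items, k, n_test, n_trials = 1000, 10, 5, 2000

for n_train in (10, 500):                      # "light" vs. "heavy" user
    hits = []
    for _ in range(n_trials):
        scores = rng.random(n_items)
        train = rng.choice(n_items, n_train, replace=False)
        unseen = np.setdiff1d(np.arange(n_items), train)
        test = rng.choice(unseen, n_test, replace=False)
        scores[train] = -np.inf                # filter training positives only
        top_k = np.argpartition(-scores, k)[:k]
        hits.append(len(np.intersect1d(top_k, test)) / k)
    print(f"{n_train} train items -> filtered p@10 ≈ {np.mean(hits):.4f}")
```

Under random scoring the expected filtered p@10 is n_test / (n_items - n_train), so the heavy user's number comes out roughly double the light user's, purely as an artifact of the filtering.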

I'd be curious to know whether there's a way to estimate p@k on held-out data that is theoretically justified.

-Alberto
