data leakage: liveothello and wthor #8

@SimJeg

Hi,

I recently downloaded liveothello (11k games) and wthor (132k games) and noticed that all wthor transcripts start with the move f5. Taking symmetries into account (4 board symmetries preserve the Othello starting position), the overlap between the two datasets is 8k games (72% of liveothello is in wthor). Without symmetries, the overlap is 3k games (27%).
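For reference, here is a minimal sketch of how such an overlap could be counted. It assumes each game is a lowercase move string such as "f5d6c3" with no pass markers, and it normalizes every game so that its first move is f5. The names `canonicalize` and `overlap` are hypothetical and not taken from either dataset's tooling.

```python
# Sketch only: assumes transcripts like "f5d6c3" (column a-h, row 1-8, no passes).
# The four symmetries below are the ones that keep the starting position fixed;
# they map the four legal first moves (d3, c4, f5, e6) onto each other.
SYMMETRIES = (
    lambda c, r: (c, r),          # identity
    lambda c, r: (r, c),          # reflection across the a1-h8 diagonal
    lambda c, r: (7 - c, 7 - r),  # rotation by 180 degrees
    lambda c, r: (7 - r, 7 - c),  # reflection across the a8-h1 diagonal
)

F5 = (5, 4)  # zero-based (column, row) of the square f5


def parse(game):
    """Turn "f5d6" into [(5, 4), (3, 5)]."""
    return [(ord(game[i]) - ord("a"), int(game[i + 1]) - 1)
            for i in range(0, len(game), 2)]


def fmt(moves):
    """Inverse of parse."""
    return "".join(chr(c + ord("a")) + str(r + 1) for c, r in moves)


def canonicalize(game):
    """Apply the symmetry that sends the first move to f5."""
    moves = parse(game)
    for sym in SYMMETRIES:
        if sym(*moves[0]) == F5:
            return fmt([sym(c, r) for c, r in moves])
    raise ValueError(f"first move of {game!r} is not a legal opening move")


def overlap(small, large, use_symmetries=True):
    """Count games of `small` that also appear in `large`."""
    key = canonicalize if use_symmetries else (lambda g: g)
    large_keys = {key(g) for g in large}
    return sum(key(g) in large_keys for g in small)
```

With toy inputs such as `overlap(["d3c3c4", "f5d6c3"], ["f5f6e6", "f5d6c3"])`, both games match once symmetries are applied, but only the second matches with `use_symmetries=False`.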

The paper mentions:

> They [wthor and liveothello games] are combined and split randomly by 8 : 2 into training and validation sets

Hence I think there is a small data leakage between the training and validation sets (x4 larger if you take symmetries into account).
