I was doing some sanity checking and found a duplicate item in the train and test set:
DBRD/train/neg/2074_2.txt
DBRD/test/neg/20602_2.txt
Content-wise they are identical; the only difference is that the file in the train set has more newlines. But we filter out these newlines anyway when training our models (or at least I do, replacing them with single spaces).
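For anyone who wants to reproduce the check, here is a minimal sketch of how such a comparison could be done. It assumes the dataset is unpacked into a `DBRD/` folder with `train/` and `test/` subfolders of `.txt` files (as the paths above suggest); the function names `normalize` and `find_cross_split_duplicates` are my own for illustration and are not part of this repo.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def normalize(text: str) -> str:
    # Collapse every run of whitespace (newlines included) into a single space.
    return " ".join(text.split())


def _digest(path: Path) -> str:
    # Hash the normalized text so files can be compared without keeping them in memory.
    text = normalize(path.read_text(encoding="utf-8"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_cross_split_duplicates(root: Path) -> list[tuple[str, str]]:
    """Return (train_file, test_file) pairs whose normalized text is identical."""
    train_by_digest = defaultdict(list)
    for path in sorted((root / "train").rglob("*.txt")):
        train_by_digest[_digest(path)].append(path)

    pairs = []
    for path in sorted((root / "test").rglob("*.txt")):
        for train_path in train_by_digest.get(_digest(path), []):
            pairs.append(
                (train_path.relative_to(root).as_posix(),
                 path.relative_to(root).as_posix())
            )
    return pairs
```

Calling `find_cross_split_duplicates(Path("DBRD"))` would then list every train/test pair that matches after whitespace normalization. Note that this only catches exact matches after normalization; near-duplicates with small textual differences would need a fuzzier comparison.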
This seems important enough to warrant a revised version 3.1 with the duplicate removed, as it impacts model training. Together with language filtering (#2), this might even warrant a v4. Alternatively, I can make a fork and rework the whole thing, of course with acknowledgments to this repo.