0th is tokenized instead of 4th, 5th, 6th, etc. #7

@tbrodbeck


Here is an example where 0th appears instead of 5th (taken from the 2nd line of tifu_all_tokenized_and_filtered.json):

"selftext_html": "[...] Confuse a 5th grade girl for a boy in front of half of her class. Kids are mean. Sorry Sandra.</strong></p>\n</div><!-- SC_ON -->",
"tldr_tokenized": [
    "confuse",
    "a",
    "0th",
    "grade",
    "girl",
    "for",
    "a",
    "boy",
    "in",
    "front",
    "of",
    "half",
    "of",
    "her",
    "class",
    "kids",
    "are",
    "mean",
    "sorry",
    "sandra",
    "*"
  ],

I guess this is an error, or is it intended for some reason?

PS: I just realized that the trailing * token looks erroneous as well, doesn't it? It probably comes from the bold text in the original string (see https://www.reddit.com/r/tifu/comments/1ggydk/tifu_by_genderstereotyping/).
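
To make the two symptoms concrete, here is a minimal sketch of how they could arise. It is only a guess and not the dataset's actual preprocessing code: all three function names are hypothetical, and the assumption is that some step collapses every run of digits to 0 and that the tokenizer keeps punctuation such as Markdown bold markers as separate tokens.

```python
import re

def collapse_digits(token):
    # Hypothetical normalization step: replace every run of digits with "0".
    # Under this assumption "4th", "5th" and "6th" all become "0th".
    return re.sub(r"\d+", "0", token)

def tokenize(text):
    # Hypothetical tokenizer: lowercase, then split into word tokens and
    # single punctuation characters (so "**" yields two "*" tokens).
    return re.findall(r"\w+|[^\w\s]", text.lower())

def drop_non_word_tokens(tokens):
    # Possible cleanup: keep only tokens made entirely of word characters,
    # which removes leftovers such as "*" and ".".
    return [t for t in tokens if re.fullmatch(r"\w+", t)]

if __name__ == "__main__":
    raw = "Confuse a 5th grade girl for a boy. Sorry Sandra.**"
    tokens = tokenize(raw)
    print([collapse_digits(t) for t in tokens])  # digit collapsing applied
    print(drop_non_word_tokens(tokens))          # digits kept, "*" removed
```

If the digit collapsing is intentional (for example to shrink the vocabulary), only the second cleanup would be needed; if not, skipping a step like collapse_digits would keep 5th intact.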
