inaccurate action word recognition #6
Comments
Of course, ten sentences containing three instances (with two of them belonging together) are not enough to get a robust estimate of the tagging accuracy. Nevertheless, let's take a closer look at the data: Two out of three

Of the seven word forms involved, only one is known to the tagger (freu); the others are unknown, i.e. they do not occur in the training data (haeh, schluchz, heul, gabs, obwohls, bswp). Furthermore, a whole phenomenon (:haeh:) is unknown to the tagger, since the training data do not contain textual representations of emoticons in this format. Another phenomenon occurs only once (*schluchz, heul*): the only instance of comma-separated action words in the training data is *rupf, zerr, reiss, mich losmach*.

How could we improve performance? Ideally by providing the tagger with more training data. A quicker solution might be a custom post-processor. If you are reasonably sure that, in your data, a token between colons is always a textual representation of an emoticon and a token between asterisks is always an action word, you could assign the corresponding tags in a post-processing step. (Ideally that should be a pre-processing step, enabling the tagger to incorporate that information into the further analysis. Unfortunately, SoMeWeTa cannot tag partially annotated input at the moment, although it can be trained and evaluated on partially annotated data.) A sample post-processor for STTS_IBK is available in utils/STTS_IBK_postprocessor.

In a future version of SoMeWeTa, phenomena like the ones in that post-processor script (i.e. phenomena that can be deterministically recognized with high accuracy) might be dealt with by a model-specific pre-processor that is incorporated into the tagger model.
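To make the post-processing idea concrete, here is a minimal sketch in Python. It is not the script from utils/STTS_IBK_postprocessor; the function name postprocess, the regular expressions and the input format (a list of (token, tag) pairs for one sentence) are assumptions made for illustration. The emoticon tag EMOIMG is likewise an assumption and should be checked against the tagset documentation. The sketch also assumes that the tokenizer either keeps the delimiters attached (":haeh:", "*freu*") or emits the asterisks as separate "*" tokens.

```python
import re

# Assumed tag names; verify them against the STTS_IBK documentation.
EMOTICON_TAG = "EMOIMG"
ACTION_WORD_TAG = "AKW"

# A single token wrapped in colons, e.g. ":haeh:"
COLON_TOKEN = re.compile(r"^:[^\s:]+:$")
# A single token wrapped in asterisks, e.g. "*freu*"
STAR_TOKEN = re.compile(r"^\*[^\s*]+\*$")


def postprocess(tagged, emoticon_tag=EMOTICON_TAG, action_tag=ACTION_WORD_TAG):
    """Overwrite tags for colon- and asterisk-delimited tokens.

    `tagged` is a list of (token, tag) pairs for one sentence.
    Returns a new list; the input is not modified.
    """
    result = list(tagged)

    # Rule 1: delimiters attached to the token itself.
    for i, (token, tag) in enumerate(result):
        if COLON_TOKEN.match(token):
            result[i] = (token, emoticon_tag)
        elif STAR_TOKEN.match(token):
            result[i] = (token, action_tag)

    # Rule 2: asterisks as separate tokens; pair them up left to right.
    stars = [i for i, (token, _) in enumerate(result) if token == "*"]
    for start, end in zip(stars[0::2], stars[1::2]):
        for i in range(start + 1, end):
            token, tag = result[i]
            # Leave punctuation such as the comma in "*schluchz, heul*" alone.
            if any(ch.isalpha() for ch in token):
                result[i] = (token, action_tag)

    return result


if __name__ == "__main__":
    sentence = [("*", "$("), ("schluchz", "NN"), (",", "$,"),
                ("heul", "NN"), ("*", "$("), (":haeh:", "NN")]
    for token, tag in postprocess(sentence):
        print(token, tag, sep="\t")
```

Note that the second rule tags every alphabetic token between a pair of asterisks, so in a span like *rupf, zerr, reiss, mich losmach* it would also tag "mich" as an action word. It also does nothing about false positives such as gabs or obwohls, which are not delimited at all.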
SoMeWeTa uses the tagset STTS_IBK for tagging. One of the differences between STTS and STTS_IBK is the tag for action words (AKW), e.g. for German lach (Beißwenger, Bartz, Storrer and Westpfahl, 2015).
I tested the accuracy of AKW tagging with a small sample of tokens. As you can see from the attached results, the accuracy is about 33 %.
You can reproduce the wrong tagging with the following minimal working example containing 10 sample sentences:
The output list akws contains two correct action words ('heul' and 'freu'). 'Haeh' is an emoticon; 'gabs' and 'obwohls' are in fact contractions. 'bswp' is used as an abbreviation for German 'beispielsweise'. Is this serious enough to be considered an issue, or have I implemented something wrong? As far as I can see, this error is not part of the error table (Table 4) in Proisl (2018, p. 668).
Cited sources: