Having problem with "yetersizliği" #8
Comments
The problem is both in TRmorph and in foma's server mode. Foma uses UDP, so the response has to fit into a single packet, and the packet size is limited to 64 KB (give or take a few hundred bytes). One solution would be to modify flookup to use TCP (or to implement some other mechanism that can handle multiple packets).
The problem on the TRmorph side is that it simply generates too many analyses for this word. The main trouble is that the -(s)I suffix can be deleted after -sIz in a small number of cases. For example, 'at arabası' + sız can surface as 'at arabasız'. So, in your example, TRmorph hallucinates a -(s)I after -sIz, which triples the number of analyses. Besides that, 'yeter' and 'yetersiz' are also listed in the lexicon as adjectives. The equivalent derivations are among the analyses as well, but I think these are rather lexicalized, so they should stay there.
For now, here are a few workarounds that may solve the problem (at least for this word):
... and recompile TRmorph. If you do not need the full analysis, e.g., if you are only stemming, you should probably use the relevant .fst, which will probably produce much smaller output. If you do need analyses, you will eventually hit some form that generates enough analyses to cause the same problem in flookup. I will notify you if I find a better solution.
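To make the size limit and the "multiple packets" idea concrete, here is a minimal Python sketch. The 65507-byte figure is the usual maximum UDP payload over IPv4 (65535 bytes minus a 20-byte IP header and an 8-byte UDP header). The 4-byte index/count framing and the function names are hypothetical, chosen only for illustration; this is not how flookup works today.

```python
from typing import Iterable, List

# Largest UDP payload over IPv4: the 65535-byte datagram limit minus
# a 20-byte IP header (no options) and an 8-byte UDP header.
MAX_UDP_PAYLOAD = 65535 - 20 - 8  # 65507 bytes

HEADER_SIZE = 4  # hypothetical framing: 2-byte piece index + 2-byte piece count


def split_into_datagrams(response: bytes, limit: int = MAX_UDP_PAYLOAD) -> List[bytes]:
    """Split a response into pieces that each fit into one datagram."""
    body_size = limit - HEADER_SIZE
    if body_size <= 0:
        raise ValueError("limit is too small to hold the header")
    # An empty response still produces one (empty) piece.
    bodies = [response[i:i + body_size]
              for i in range(0, len(response), body_size)] or [b""]
    total = len(bodies)
    if total > 0xFFFF:
        raise ValueError("response needs more pieces than the header can count")
    return [index.to_bytes(2, "big") + total.to_bytes(2, "big") + body
            for index, body in enumerate(bodies)]


def reassemble(pieces: Iterable[bytes]) -> bytes:
    """Rebuild the response from pieces that may arrive in any order."""
    ordered = sorted(pieces, key=lambda piece: int.from_bytes(piece[:2], "big"))
    return b"".join(piece[HEADER_SIZE:] for piece in ordered)
```

A real protocol would also need to handle lost or duplicated datagrams, which is one reason switching to TCP may be the simpler option.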
Hi Çağrı,
Thanks for the quick response.
Yes, I need only stemming, and I am using trmorph.fst:
flookup -S -A 127.0.0.1 trmorph.fst
Which .fst should I use? stem.fst?
Erol Akarsu
On Fri, Sep 19, 2014 at 11:38 AM, Çağrı Çöltekin wrote:
If you are compiling the FST files with the Makefile,
You can get rid of the part-of-speech tags too, if you set the relevant switch in
Çağrı,
Excellent. The stemmer module is much better.
The only concern I have is finding the stem of a word. Here I have tested several
Thanks for your help
eakarsu@ubuntu:~/SolrTurkihsAnalysers/TRmorph-master$ echo "Fındıklı"
eakarsu@ubuntu:~/SolrTurkihsAnalysers/TRmorph-master$ echo
eakarsu@ubuntu:~/SolrTurkihsAnalysers/TRmorph-master$ echo "Sütlü"
eakarsu@ubuntu:~/SolrTurkihsAnalysers/TRmorph-master$ echo "girdiler"
eakarsu@ubuntu:~/SolrTurkihsAnalysers/TRmorph-master$ echo "omurgasız"
eakarsu@ubuntu:~/SolrTurkihsAnalysers/TRmorph-master$ echo "omurgasızlar"
On Fri, Sep 19, 2014 at 1:48 PM, Çağrı Çöltekin wrote:
Unfortunately, there is no easy way. The analyzer (and the stemmer) tries to produce all possible forms. The results should be disambiguated outside the finite-state tools. The TRmorph distribution has a simple Python script to select the most likely analysis ( To get the most likely stem (for some definition of "most likely"), one needs to analyze the input, pick the highest-scoring analysis, and strip off the analysis symbols. The provided Python script can be modified to do that. Alternatively, the disambiguation code is rather simple, so porting it to another language should not take much time.
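The three steps above (analyze, pick the highest-scoring analysis, strip the analysis symbols) could be sketched roughly as follows. This is not the script shipped with TRmorph. It assumes analyses are strings with tags in angle brackets, and the scoring function is a crude stand-in for a real disambiguation model; the function names and the example analyses are made up for illustration.

```python
import re
from typing import Callable, Optional, Sequence

# Assumption: analysis symbols look like <N>, <Adj>, <p3s>, ...
TAG = re.compile(r"<[^>]*>")


def strip_tags(analysis: str) -> str:
    """Remove every angle-bracket symbol, leaving only the stem."""
    return TAG.sub("", analysis)


def most_likely_stem(analyses: Sequence[str],
                     score: Callable[[str], float]) -> Optional[str]:
    """Pick the highest-scoring analysis and return its stem."""
    if not analyses:
        return None  # the word was not recognized
    best = max(analyses, key=score)
    return strip_tags(best)


def fewest_tags(analysis: str) -> float:
    """Placeholder score: prefer analyses with fewer symbols.

    A real disambiguator would use a trained model instead.
    """
    return -len(TAG.findall(analysis))


if __name__ == "__main__":
    # Hypothetical analyses, written by hand to show the format only.
    candidates = [
        "yeter<Adj><siz><Adj><lik><N><p3s>",
        "yetersiz<Adj><lik><N><p3s>",
    ]
    print(most_likely_stem(candidates, fewest_tags))
```

With this placeholder score the lexicalized reading wins simply because it has fewer symbols; swapping in a better `score` function does not change the rest of the code.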
Çağrı,
Thanks. Is it possible to create a multithreaded foma? I anticipate that I will have
Thanks
Erol Akarsu
On Fri, Sep 19, 2014 at 4:01 PM, Çağrı Çöltekin wrote:
I think it shouldn't be very difficult to modify the foma UDP server code to make it multi-threaded, but I do not know whether the foma libraries are thread-safe. As far as I can see, the documentation does not mention it.
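Since thread safety is unknown, one conservative option is to accept requests on several threads but guard the analyzer itself with a lock. The Python sketch below shows only that pattern. `lookup` stands for any function that maps a word to its analyses; it is a placeholder, not a real foma binding, and the class and function names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


class SerializedAnalyzer:
    """Wrap a lookup function so that only one thread calls it at a time."""

    def __init__(self, lookup: Callable[[str], str]) -> None:
        self._lookup = lookup
        self._lock = threading.Lock()

    def analyze(self, word: str) -> str:
        # The lock serializes access to the possibly non-thread-safe lookup.
        with self._lock:
            return self._lookup(word)


def analyze_all(analyzer: SerializedAnalyzer,
                words: Sequence[str],
                workers: int = 4) -> List[str]:
    """Handle many requests on worker threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyzer.analyze, words))
```

The lock means lookups themselves do not run in parallel, so this only helps when most of the time is spent on I/O. If the lookup is the bottleneck, running several independent server processes avoids the thread-safety question altogether.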
I am testing TRmorph in server mode like this:
flookup -S -A 127.0.0.1 trmorph.fst
When I send the word "yetersizliği" from a UDP client, the server runs into a problem:
sendto() failed: Message too long
client hung
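For reference, a UDP client with a receive timeout does not hang when the server fails to reply. The Python sketch below rests on assumptions: that the server accepts one word per datagram as UTF-8 text, and that it listens on port 6062 (believed to be flookup's default; adjust it to your setup). The function name is hypothetical.

```python
import socket
from typing import Optional


def query_flookup(word: str,
                  host: str = "127.0.0.1",
                  port: int = 6062,  # assumed default port; change if needed
                  timeout: float = 2.0) -> Optional[str]:
    """Send one word to the UDP server and return its reply.

    Returns None if no reply arrives within `timeout` seconds,
    instead of blocking forever.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(word.encode("utf-8"), (host, port))
        try:
            data, _ = sock.recvfrom(65535)
        except socket.timeout:
            return None
        return data.decode("utf-8", errors="replace")
```

A `None` result for "yetersizliği" would then show up as a handled timeout on the client side, while the server still logs the sendto() error.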