Fixes language detection when a word mixes more than one language #1

Open
wants to merge 2 commits into
base: master

@pravj pravj commented Mar 4, 2014

The previous method detected the language by checking only the first letter of a word, so it gave a wrong result for a word that mixes languages.
For example, with the previous method:

from langdetect import detect_lang
detect_lang(u'यह ಎಂದ') => {u'\u0c8e\u0c82\u0ca6': 'kn_IN', u'\u092f\u0939': 'hi_IN'}
detect_lang(u'यहಎಂದ') => {u'\u092f\u0939\u0c8e\u0c82\u0ca6': 'kn_IN'}

So I tried to fix this. With the change, the result is:

from langdetect import detect_lang
detect_lang(u'यहಎಂದ') => {u'\u092f\u0939\u0c8e\u0c82\u0ca6': 'mixing of more then one language found'}
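
One way to detect mixing is to collect the set of scripts that occur in a word and flag any word that contains more than one. The following is a minimal sketch under that assumption, not the code in this pull request: `SCRIPT_RANGES`, `scripts_in` and `detect_word` are hypothetical names, and only the Devanagari and Kannada Unicode blocks are covered.

```python
# Hypothetical sketch; not the actual silpa_common.langdetect code.
# Unicode block ranges for the two scripts used in the example above.
SCRIPT_RANGES = {
    'hi_IN': (0x0900, 0x097F),  # Devanagari block
    'kn_IN': (0x0C80, 0x0CFF),  # Kannada block
}

# Message text quoted from the pull request output.
MIXED = 'mixing of more then one language found'


def scripts_in(word):
    """Return the set of locale codes whose Unicode block covers a character of word."""
    found = set()
    for ch in word:
        for lang, (start, end) in SCRIPT_RANGES.items():
            if start <= ord(ch) <= end:
                found.add(lang)
    return found


def detect_word(word):
    """Return the single language of word, MIXED if several occur, None if none match."""
    found = scripts_in(word)
    if len(found) > 1:
        return MIXED
    return found.pop() if found else None


print(detect_word(u'यह'))      # hi_IN
print(detect_word(u'यहಎಂದ'))  # mixing of more then one language found
```

Because every character is inspected, the result no longer depends on which single letter the scan happens to look at.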

@jishnu7
Member

jishnu7 commented Mar 4, 2014

Can you make sure that this won't break anything? (I think it will.)

I think we should detect the language that has the most characters in the word, rather than returning an error message. What do you think, @copyninja?
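
A majority rule of this kind could be sketched as follows. This is only an illustration under stated assumptions: `SCRIPT_RANGES` and `majority_lang` are hypothetical names, only two Unicode blocks are listed, and every code point in a block (including vowel signs) counts as one character.

```python
from collections import Counter

# Hypothetical sketch; not the actual silpa_common.langdetect code.
SCRIPT_RANGES = {
    'hi_IN': (0x0900, 0x097F),  # Devanagari block
    'kn_IN': (0x0C80, 0x0CFF),  # Kannada block
}


def majority_lang(word):
    """Return the locale code with the most code points in word, or None."""
    counts = Counter()
    for ch in word:
        for lang, (start, end) in SCRIPT_RANGES.items():
            if start <= ord(ch) <= end:
                counts[lang] += 1
    if not counts:
        return None
    # On a tie, most_common() keeps the script that was seen first.
    return counts.most_common(1)[0][0]


# 2 Devanagari code points versus 3 Kannada code points.
print(majority_lang(u'यहಎಂದ'))  # kn_IN
```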

@copyninja
Member

It sure will. The detect_lang function is supposed to return a dictionary of words mapped to languages; here, words are separated by spaces or punctuation.

detect_lang(u'यहಎಂದ') => {u'\u092f\u0939\u0c8e\u0c82\u0ca6': 'kn_IN'}

This isn't a valid word, either in Hindi or in Kannada. What happened here is mixing.

@santhoshtr, as the original author of the module, I think you can give more insight on this issue. Can you please share your thoughts here? :-)

@jishnu7 What do you mean by language which has more characters?

@santhoshtr
Member

I would say that u'यहಎಂದ' is not a valid test case. If it returns kn_IN or hi_IN, it is not terribly wrong, so I won't recommend returning an error.

But now that you have pointed out this case, a related valid case we need to test is Kannada (just an example) text surrounded by punctuation such as quotes, parentheses, etc. I guess our current code needs some improvement there.
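
Such a test could look roughly like the sketch below. It uses a stand-in detector rather than the real `detect_lang`, so `strip_punctuation`, `detect_word` and `CASES` are hypothetical names; the same cases could be pointed at the real function instead. Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes would need separate handling.

```python
import string

# Hypothetical stand-in for the function under test; not silpa_common code.


def strip_punctuation(word):
    """Remove leading and trailing ASCII punctuation from word."""
    return word.strip(string.punctuation)


def detect_word(word):
    """Return 'kn_IN' if the stripped word is entirely in the Kannada block."""
    core = strip_punctuation(word)
    if core and all(0x0C80 <= ord(ch) <= 0x0CFF for ch in core):
        return 'kn_IN'
    return None


# Kannada text wrapped in quotes, parentheses, brackets and trailing marks.
CASES = [u'"ಎಂದ"', u"'ಎಂದ'", u'(ಎಂದ)', u'[ಎಂದ],', u'ಎಂದ!']

for case in CASES:
    assert detect_word(case) == 'kn_IN', case
```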

@pravj
Author

pravj commented Mar 5, 2014

Yes, you are all right: it will break some modules, since 'langdetect.py' is used in almost all of them.
Actually, I was trying out the 'katapayadi' module and got stuck there. The problem was that when a word mixes letters of more than one language, it gives a wrong result. For example:
katapayadi('ഭിപാല') returns 314, but katapayadi('ഭിसचिनപാല') returns 3100004; it puts 0 in place of the mixed-in letters.
I suggest that, for a particular word, 'detect_lang' should return a 'dict' object with two keys: the matched (majority) language, as @jishnu7 suggested, and 'error' (None if no mixing occurred).
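
A per-word result of that shape might look like the sketch below. It is an assumption-laden illustration, not a proposal for the final API: the key name 'language', the error text, and the names `SCRIPT_RANGES` and `detect_word` are all hypothetical, and only the Devanagari and Malayalam blocks are covered.

```python
from collections import Counter

# Hypothetical sketch; not the actual silpa_common.langdetect code.
SCRIPT_RANGES = {
    'hi_IN': (0x0900, 0x097F),  # Devanagari block
    'ml_IN': (0x0D00, 0x0D7F),  # Malayalam block
}


def detect_word(word):
    """Return {'language': majority locale or None, 'error': message or None}."""
    counts = Counter()
    for ch in word:
        for lang, (start, end) in SCRIPT_RANGES.items():
            if start <= ord(ch) <= end:
                counts[lang] += 1
    if not counts:
        return {'language': None, 'error': None}
    language = counts.most_common(1)[0][0]
    # Flag the word when more than one script contributed characters.
    error = 'mixed languages' if len(counts) > 1 else None
    return {'language': language, 'error': error}


print(detect_word(u'ഭിപാല'))
print(detect_word(u'ഭിसचिनപാല'))
```

A caller such as 'katapayadi' could then check `result['error']` before converting the word, instead of silently emitting 0 for the foreign letters.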

@pravj
Author

pravj commented Mar 5, 2014

For every module that depends on 'langdetect', the breakage can be handled with that 'error' dict key.

@copyninja
Member

@pravj, before jumping to implementation, I would suggest reading the reply from @santhoshtr. As I said, it's not really a valid word in either language, and as @santhoshtr said, returning an error is not recommended.

But since you brought up this test case, consider testing the case suggested by @santhoshtr, i.e. language text surrounded by punctuation, parentheses, etc., and see if you can fix that.

@pravj
Author

pravj commented Mar 5, 2014

@santhoshtr mentioned punctuation, but the 'langdetect' module already handles that fine:
https://github.com/Project-SILPA/silpa-common/blob/master/silpa_common/langdetect.py#L40
I tried to fix 'langdetect' because its existing version fails on those mixed characters when it is used in the 'katapayadi' module. Please try the test cases I mentioned for 'katapayadi'.
I think the 'katapayadi' module needs to return an error for mixed-language letters, and it doesn't do so as of now.

4 participants