Fixes language detection when a word mixes more than one language #1

pravj · 2014-03-04T16:35:54Z

The previous method of detecting the language checked only the first letter of a word, so it gave a wrong result for a word that mixes languages.
For example, with the previous method:

from langdetect import detect_lang
detect_lang(u'यह ಎಂದ') => {u'\u0c8e\u0c82\u0ca6': 'kn_IN', u'\u092f\u0939': 'hi_IN'}
detect_lang(u'यहಎಂದ') => {u'\u092f\u0939\u0c8e\u0c82\u0ca6': 'kn_IN'}

So I tried to fix this. With the change, the result is:

from langdetect import detect_lang
detect_lang(u'यहಎಂದ') => {u'\u092f\u0939\u0c8e\u0c82\u0ca6': 'mixing of more then one language found'}

jishnu7 · 2014-03-04T23:57:19Z

Can you make sure that this won't break anything? (I think it will.)

I think we should return the language that has more characters in the word, rather than an error message. What do you think, @copyninja?
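The majority idea above could be sketched roughly as follows. This is a minimal illustration, not the code in 'langdetect.py': the names `SCRIPT_RANGES`, `script_of` and `majority_lang` are hypothetical, and only three Unicode blocks are listed, on the assumption that each block maps to a single locale code.

```python
from collections import Counter

# Assumption: one locale code per Unicode block. Only three blocks are
# listed here for illustration; this table is not taken from langdetect.py.
SCRIPT_RANGES = {
    "hi_IN": (0x0900, 0x097F),  # Devanagari block
    "kn_IN": (0x0C80, 0x0CFF),  # Kannada block
    "ml_IN": (0x0D00, 0x0D7F),  # Malayalam block
}


def script_of(char):
    """Return the locale code whose block contains char, or None."""
    code = ord(char)
    for lang, (start, end) in SCRIPT_RANGES.items():
        if start <= code <= end:
            return lang
    return None


def majority_lang(word):
    """Return the locale code that covers the most characters in word."""
    counts = Counter(
        lang for lang in (script_of(ch) for ch in word) if lang is not None
    )
    if not counts:
        return None  # no character fell into a known block
    return counts.most_common(1)[0][0]


# Two Devanagari characters followed by three Kannada characters.
print(majority_lang(u'\u092f\u0939\u0c8e\u0c82\u0ca6'))  # kn_IN
```

Ties are not handled specially in this sketch, so a real implementation would need to decide what to return when two scripts have the same count.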

copyninja · 2014-03-05T03:38:08Z

It sure will. The detect_lang function is supposed to return a dictionary mapping words to languages, where words are separated by spaces or punctuation.

detect_lang(u'यहಎಂದ') => {u'\u092f\u0939\u0c8e\u0c82\u0ca6': 'kn_IN'}

This isn't a valid word in either Hindi or Kannada. What happened here is mixing.

@santhoshtr, as the original author of the module, I think you can give more insight on this issue. Can you please share your thoughts here? :-)

@jishnu7 What do you mean by the language which has more characters?

santhoshtr · 2014-03-05T04:05:31Z

I would say that u'यहಎಂದ' is not a valid test case. If it returns kn_IN or hi_IN, it is not terribly wrong. So I won't recommend returning an error.

But now that you have pointed out this case, a related valid case we need to test is Kannada (just an example) text surrounded by punctuation such as quotes and parentheses. I guess our current code needs some improvement there.
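One way to exercise the punctuation case is to check that a token with surrounding quotes or parentheses reduces to the bare word before detection. The helper below is a hypothetical sketch, not part of 'langdetect.py'; it assumes that any character whose Unicode general category starts with "P" counts as punctuation.

```python
import unicodedata


def strip_surrounding_punctuation(token):
    """Drop leading and trailing punctuation characters from token.

    Assumption: a character is punctuation when its Unicode general
    category starts with "P" (Po, Ps, Pe, Pi, Pf, ...).
    """
    start, end = 0, len(token)
    while start < end and unicodedata.category(token[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(token[end - 1]).startswith("P"):
        end -= 1
    return token[start:end]


word = u'\u0c8e\u0c82\u0ca6'  # the Kannada word from the examples above
for wrapped in (u'"' + word + u'"', u'(' + word + u')', u'\u201c' + word + u'\u201d'):
    # Each wrapped form should reduce to the bare word.
    assert strip_surrounding_punctuation(wrapped) == word
```

Punctuation inside a word is left untouched here; only the edges are trimmed.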

pravj · 2014-03-05T04:12:35Z

Yes, you are all right: it will break some modules, as 'langdetect.py' is used in almost all of them.
Actually, I was trying out the 'katapayadi' module and got stuck there. The problem was that when a word mixes letters from more than one language, it gives a wrong result. For example:
katapayadi('ഭിപാല') returns 314, but katapayadi('ഭിसचिनപാല') returns 3100004; it adds 0 in place of the mixed letters.
I suggest that, for a particular word, 'detect_lang' should return a 'dict' object with two keys: 'matched language (majority)' (as @jishnu7 suggested) and 'error' (None if no mixing occurred).
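To make the proposed return shape concrete, here is a rough sketch. It is only an illustration of the suggestion, not the change in this pull request: `detect_word` is a hypothetical name, the key names 'language' and 'error' are assumptions, and the script is guessed from the first word of each character's Unicode name rather than mapped to a locale code such as kn_IN.

```python
import unicodedata
from collections import Counter


def detect_word(word):
    """Return {'language': majority script, 'error': None or a message}.

    Assumption: the first word of a character's Unicode name
    (for example "KANNADA" in "KANNADA LETTER E") identifies its script.
    """
    scripts = Counter()
    for ch in word:
        name = unicodedata.name(ch, "")  # "" for unnamed characters
        if name:
            scripts[name.split()[0]] += 1
    if not scripts:
        return {"language": None, "error": "no recognizable script"}
    majority = scripts.most_common(1)[0][0]
    if len(scripts) == 1:
        error = None  # no mixing occurred
    else:
        error = "mixed scripts: " + ", ".join(sorted(scripts))
    return {"language": majority, "error": error}


# The mixed word from the katapayadi example: five Malayalam characters
# and four Devanagari characters.
print(detect_word(u'\u0d2d\u0d3f\u0938\u091a\u093f\u0928\u0d2a\u0d3e\u0d32'))
```

A caller such as 'katapayadi' could then check the 'error' key before computing a number, instead of silently treating foreign letters as 0.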

pravj · 2014-03-05T04:14:45Z

For every module that depends on 'langdetect', the breakage can be handled with that 'error' dict key.

copyninja · 2014-03-05T04:20:10Z

@pravj Before jumping to implementation, I would suggest reading the reply from @santhoshtr. As I said, it's not really a valid word in either language, and as @santhoshtr said, returning an error is not recommended.

But since you brought up this test case, consider testing the case suggested by @santhoshtr, i.e. language text surrounded by punctuation, parentheses, etc., and see if you can fix that.

pravj · 2014-03-05T13:34:45Z

@santhoshtr mentioned punctuation, but the 'langdetect' module already handles that fine:
https://github.com/Project-SILPA/silpa-common/blob/master/silpa_common/langdetect.py#L40
I tried to fix 'langdetect' because its existing version fails on such mixed characters when it is used in the 'katapayadi' module. Please try the test cases I mentioned for 'katapayadi'.
I think the 'katapayadi' module needs to return an error for mixed-language letters, and it doesn't do so as of now.

pravj added 2 commits March 4, 2014 21:59

fixes language detection when there are mixing of more then one language

c98ca9b

made changes to make python into pep8 style

456f565
