Fixes language detection when a word mixes more than one language #1
base: master
Can you make sure that this won't break anything? (I think it will.) I think we should detect the language that has more characters, rather than returning an error message. What do you think, @copyninja?
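One possible reading of "the language that has more characters" is a per-word majority vote over Unicode script blocks. The following is a minimal sketch under that assumption, written for Python 3; `SCRIPT_RANGES`, `lang_of_char` and `majority_lang` are hypothetical names and are not part of `langdetect.py`.

```python
from collections import Counter

# Hypothetical lookup table: Unicode block ranges mapped to locale codes.
# Only two scripts are listed here to keep the sketch short.
SCRIPT_RANGES = [
    (0x0900, 0x097F, "hi_IN"),  # Devanagari block
    (0x0C80, 0x0CFF, "kn_IN"),  # Kannada block
]


def lang_of_char(ch):
    """Return the locale code for one character, or None if unknown."""
    cp = ord(ch)
    for low, high, lang in SCRIPT_RANGES:
        if low <= cp <= high:
            return lang
    return None


def majority_lang(word):
    """Return the locale whose script contributes the most characters."""
    counts = Counter(
        lang for lang in map(lang_of_char, word) if lang is not None
    )
    if not counts:
        return None  # no character belonged to a known script
    # On a tie, Counter.most_common keeps first-seen order (Python 3.7+).
    return counts.most_common(1)[0][0]
```

With this sketch, a word made of two Devanagari and three Kannada characters resolves to `kn_IN`, and a word with no characters from a listed script resolves to `None`.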
It sure will: detect_lang(u'यहಎಂದ') => {u'\u092f\u0939\u0c8e\u0c82\u0ca6': 'kn_IN'}. This isn't a valid word in either Hindi or Kannada; what happened here is mixing. @santhoshtr, as the original author of the module, I think you can give more insight on this issue. Can you please share your thoughts here? :-) @jishnu7, what do you mean by the language which has more characters?
I would say that u'यहಎಂದ' is not a valid test case. If it returns kn_IN or hi_IN, it is not terribly wrong, so I won't recommend returning an error. But now that you have pointed out this case, a related valid case we need to test is Kannada (just an example) text surrounded by punctuation such as quotes, parentheses, etc. I guess our current code needs some improvement there.
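The punctuation case could be exercised with a helper that trims surrounding punctuation before detection. This is only a sketch in Python 3; `strip_outer_punctuation` is a hypothetical name, and whether the existing module needs such a step is not settled in this thread.

```python
import unicodedata


def strip_outer_punctuation(token):
    """Drop leading and trailing punctuation, keep everything in between.

    A character counts as punctuation when its Unicode general category
    starts with "P" (quotes, brackets, dashes, full stops, and so on).
    """
    start, end = 0, len(token)
    while start < end and unicodedata.category(token[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(token[end - 1]).startswith("P"):
        end -= 1
    return token[start:end]
```

A test for the suggested case would then pass `'"ಕನ್ನಡ"'` or `'(ಕನ್ನಡ)'` through the helper and check that the detector sees only the Kannada characters.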
Yes, you are all right; it will break some modules, as 'langdetect.py' is used in almost all of them.
For every module that depends on 'langdetect', the breakage can be handled with that 'error' dict key.
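A caller-side check along the following lines is one way such an 'error' key might be handled. The result shape is an assumption, since the return value proposed in this pull request is not shown in the thread, and `handle_detection` is a hypothetical name.

```python
def handle_detection(result):
    """Split a detection result into (ok, payload).

    Assumes, hypothetically, that a failed detection is reported as a
    dict containing an "error" key, and that a successful one maps each
    word to a locale code.
    """
    if "error" in result:
        return False, result["error"]
    return True, result
```

One weakness of this shape is that the 'error' key shares a namespace with the detected words, so the input word "error" would be misread as a failure. Every dependent module would also need this check added.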
@pravj, before jumping to implementation I would suggest reading the reply from @santhoshtr. As I said, it's not really a valid word in either language, and as @santhoshtr said, returning an error is not recommended. But since you brought up this test case, consider testing the case suggested by @santhoshtr, i.e. language text surrounded by punctuation, parentheses, etc., and see if you can fix that.
@santhoshtr mentioned punctuation, but the 'langdetect' module already handles that fine.
Actually, the previous method of detecting the language checked only the first letter of a word, and hence gave the wrong result for a word that mixes languages.
For example, in the previous method:
So I tried to fix this, and this is the change:
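The example and the diff are not reproduced in this conversation view. As a rough illustration only, and not the code from this pull request, the difference between a first-letter check and a whole-word check can be sketched as follows in Python 3. All names here are hypothetical.

```python
# Hypothetical lookup table: Unicode block ranges mapped to locale codes.
SCRIPT_RANGES = [
    (0x0900, 0x097F, "hi_IN"),  # Devanagari block
    (0x0C80, 0x0CFF, "kn_IN"),  # Kannada block
]


def script_of(ch):
    """Return the locale code for one character, or None if unknown."""
    cp = ord(ch)
    for low, high, lang in SCRIPT_RANGES:
        if low <= cp <= high:
            return lang
    return None


def first_letter_lang(word):
    """First-letter approach: only the first character is inspected."""
    return script_of(word[0]) if word else None


def is_mixed(word):
    """Whole-word approach: True when more than one known script occurs."""
    scripts = {s for s in map(script_of, word) if s is not None}
    return len(scripts) > 1
```

For a word that starts with Devanagari characters and ends with Kannada ones, `first_letter_lang` reports `hi_IN` and never sees the Kannada part, while `is_mixed` flags the word.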