tokenization of the data #10

Open
bhagat02 opened this issue Apr 2, 2018 · 1 comment

Comments

@bhagat02

bhagat02 commented Apr 2, 2018

Hi,

When I run the command below, it fails for one particular file only, with the error shown below. Can you please tell me why this happens only for this file?

File name: data_ps.decldesc.train

All the other files were tokenized without any problem, but this one was not. Why?

error:
gauravs-MBP:~ g$ /Downloads/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en </code-docstring-corpus/parallel-corpus/data_ps.decldesc.train >~/code-docstring-corpus/parallel-corpus/data_ps.decldesc.train1
Tokenizer Version 1.1
Language: en
Number of threads: 1
utf8 "\xFF" does not map to Unicode at /Users/gaurav/Downloads/mosesdecoder/scripts/tokenizer/tokenizer.perl line 180, line 1133.
Malformed UTF-8 character: \xff\xff\xff\xff\x5c\x27\x29\x20\x44\x43\x4e\x4c\x20 (overflows) in substitution (s///) at /Users/gaurav/Downloads/mosesdecoder/scripts/tokenizer/tokenizer.perl line 240, line 1133.
Malformed UTF-8 character: \xff\xff\xff\xff\x5c\x27\x29\x20\x44\x43\x4e\x4c\x20 (unexpected non-continuation byte 0xff, immediately after start byte 0xff; need 13 bytes, got 1) in substitution (s///) at /Users/gaurav/Downloads/mosesdecoder/scripts/tokenizer/tokenizer.perl line 240, line 1133.
Malformed UTF-8 character (fatal) at /Users/gaurav/Downloads/mosesdecoder/scripts/tokenizer/tokenizer.perl line 240, line 1133.

Thank you so much.

@kurtabela

I have the same issue. Did you ever manage to fix this?
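
The error log in the first comment reports raw `\xff` bytes, which can never appear in valid UTF-8, and the trailing "line 1133" in each message appears to be the input line number. One possible way to confirm this and work around it is to scan the file for lines that do not decode as UTF-8 and replace the offending bytes before running the tokenizer. The following is only a minimal sketch, assuming Python 3 is available; the function names `find_invalid_utf8_lines` and `replace_invalid_utf8` are hypothetical helpers written for this illustration, not part of Moses or of the corpus.

```python
from typing import List, Tuple


def find_invalid_utf8_lines(data: bytes) -> List[Tuple[int, int]]:
    """Return (line_number, byte_offset_in_line) for every line that is not valid UTF-8.

    Line numbers are 1-based; the offset is where decoding first failed on that line.
    """
    bad = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as exc:
            bad.append((lineno, exc.start))
    return bad


def replace_invalid_utf8(data: bytes) -> bytes:
    """Replace each undecodable byte with U+FFFD and re-encode the result as UTF-8.

    This is lossy: the original invalid bytes are discarded.
    """
    return data.decode("utf-8", errors="replace").encode("utf-8")
```

To apply it, read the file in binary mode (for example `open(path, "rb").read()`, where `path` points at the failing training file), check what `find_invalid_utf8_lines` reports, then write the output of `replace_invalid_utf8` to a new file and tokenize that copy. Whether replacing the bytes is acceptable, rather than dropping the affected lines from both sides of the parallel corpus, depends on how the data is used afterwards.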
