tokenization of the data #10

bhagat02 · 2018-04-02T11:19:05Z

Hi,

When I run this command on this particular file, I get the error shown below. Can you please tell me why this happens only for this file?

File name: data_ps.decldesc.train

All the other files were tokenized without problems, but this one was not. Why?

error:
gauravs-MBP:~ g$ ~/Downloads/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en <~/code-docstring-corpus/parallel-corpus/data_ps.decldesc.train >~/code-docstring-corpus/parallel-corpus/data_ps.decldesc.train1
Tokenizer Version 1.1
Language: en
Number of threads: 1
utf8 "\xFF" does not map to Unicode at /Users/gaurav/Downloads/mosesdecoder/scripts/tokenizer/tokenizer.perl line 180, line 1133.
Malformed UTF-8 character: \xff\xff\xff\xff\x5c\x27\x29\x20\x44\x43\x4e\x4c\x20 (overflows) in substitution (s///) at /Users/gaurav/Downloads/mosesdecoder/scripts/tokenizer/tokenizer.perl line 240, line 1133.
Malformed UTF-8 character: \xff\xff\xff\xff\x5c\x27\x29\x20\x44\x43\x4e\x4c\x20 (unexpected non-continuation byte 0xff, immediately after start byte 0xff; need 13 bytes, got 1) in substitution (s///) at /Users/gaurav/Downloads/mosesdecoder/scripts/tokenizer/tokenizer.perl line 240, line 1133.
Malformed UTF-8 character (fatal) at /Users/gaurav/Downloads/mosesdecoder/scripts/tokenizer/tokenizer.perl line 240, line 1133.

Thank you so much.

kurtabela · 2022-01-31T11:35:51Z

I have the same issue. Did you ever manage to fix this?
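
The log above says that byte 0xFF, read at input line 1133, does not map to Unicode, which suggests that this one file contains bytes that are not valid UTF-8. A minimal sketch for checking that, in Python rather than Perl: the function names and the sample bytes are made up for illustration, and replacing the bad bytes with U+FFFD is only one possible workaround, not something the thread confirms as the fix.

```python
def find_invalid_utf8_lines(raw: bytes):
    """Return (line_number, offset_in_line) for each line that is not valid UTF-8."""
    bad = []
    for number, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte in this line
            bad.append((number, err.start))
    return bad


def scrub_invalid_utf8(raw: bytes) -> bytes:
    """Replace every undecodable byte with U+FFFD and re-encode as UTF-8."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


# Hypothetical sample: the second line carries four 0xFF bytes,
# similar to the byte sequence quoted in the error messages.
sample = b"def f(x):\n    return '\xff\xff\xff\xff\\') DCNL \n"

print(find_invalid_utf8_lines(sample))                      # lines that fail to decode
print(find_invalid_utf8_lines(scrub_invalid_utf8(sample)))  # empty after scrubbing
```

To try this on the real data, read the file in binary mode (`open(path, "rb").read()`) and pass the bytes to `find_invalid_utf8_lines`. If line 1133 is reported, writing the output of `scrub_invalid_utf8` to a new file and tokenizing that copy may get past the error. Whether replacing or dropping those bytes is acceptable depends on how the corpus will be used.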
