Skip to content

Latest commit

 

History

History

parallel-corpus

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Dataset Descriptions

Special Tokens Used in the dataset

DCNL stands for newline

DCSP stands for a space that is either in the leading identation of a line(one token per nesting level) or inside a string constatn

Reasons for using special tokens

Most seq2seq tools and their tokenizers expect one example per line and collapse consecutive whitespaces, while in Python each code snippet can span multiple lines and consecutive whitespaces in the identation and inside string constants are syntactically or semantically different and therefore should not be collapsed.

Measures to escape the tokens above

There are cases that the token string appear in the source code themselves. In such case, a script is used for pre-processing: https://github.com/EdinburghNLP/code-docstring-corpus/blob/master/scripts/extract_funcdefs_and_docstrings.py#L30

Letter d and q escape in the data_ps.all dataset file

In this dataset, d' and 'd are used for mark the start and end of the description. In this way, we need to escape all the letter d in the dataset so that we can successfully split the dataset into a fixed number of columns.

Here we use qz to escape all the letter d in the dataset. You may wish to restore it back after splitting this dataset using 'd and d' as tokens. Since we acknowledge that string qz will appear in very low probability both in Python scripts and in common English language, we are safe to use this escape.

In addition, to make the conversion reversible, q is escaped to qq as well. For details of the conversion, you may refer to Line 5 in code-docstring-corpus/scripts/prepare_data_ps.sh and Line 3 in code-docstring-corpus/scripts/prepare_repo_split.sh.