Clarification in the jsonify.py code #4

Open
kurtespinosa opened this issue May 11, 2017 · 2 comments

@kurtespinosa

kurtespinosa commented May 11, 2017

Dear Butsugiri,

Thank you for sharing your code. I have a question about the input dataset that I need to jsonify. I downloaded the dataset and used the respective data partitions, for example WikiQA-test.tsv for the test set, which contains entries like the sample below.

QuestionID Question DocumentID DocumentTitle SentenceID Sentence Label
Q0 HOW AFRICAN AMERICANS WERE IMMIGRATED TO THE US D0 African immigration to the United States D0-0 African immigration to the United States refers to immigrants to the United States who are or were nationals of Africa . 0

Now I'm confused, because in the jsonify code the question would point to D0-0, which is the SentenceID. It seems that question_id and question were interchanged. Am I right, or did I miss something?

question_id = data[1]
....
question = data[-3]
answer = data[-2]
....
....
'question': question.lower().split(" "),
'answer': answer.lower().split(" "),

Should it have been the following?

question = data[1]
.....
question_id = data[-3]
answer = data[-2]
....
....
'question': question.lower().split(" "),
'answer': answer.lower().split(" "),
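
To make the index question concrete, here is a minimal sketch of how positive and negative indices land on the seven columns of the raw sample row above. It assumes the line is split on tabs (the raw file is a .tsv); `HEADER` and `describe` are hypothetical names used only for illustration and are not taken from jsonify.py.

```python
# Column names copied from the header line of the sample above.
HEADER = ["QuestionID", "Question", "DocumentID", "DocumentTitle",
          "SentenceID", "Sentence", "Label"]

# The sample row from WikiQA-test.tsv, joined with tabs (assumed separator).
line = "\t".join([
    "Q0",
    "HOW AFRICAN AMERICANS WERE IMMIGRATED TO THE US",
    "D0",
    "African immigration to the United States",
    "D0-0",
    "African immigration to the United States refers to immigrants to the "
    "United States who are or were nationals of Africa .",
    "0",
])

data = line.rstrip("\n").split("\t")


def describe(index):
    """Return the column name and the value that data[index] refers to."""
    return HEADER[index], data[index]


# The three indices used in the snippets quoted in this comment.
for i in (1, -3, -2):
    name, value = describe(i)
    print(f"data[{i}] -> {name}: {value}")
```

On the raw file, `data[1]` is the Question text, `data[-3]` is the SentenceID and `data[-2]` is the Sentence.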

Cheers,
Kurt

@butsugiri
Owner

Hi,

The indexing for some variables, such as question and question_id, seems interchanged because jsonify.py requires some extra preprocessing beforehand (I am sorry that it is not provided in this repo).
The preprocessing basically removes the questions that do not have a correct answer, as described in the original paper.
So please fix the code if you think it is necessary.
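
Since the preprocessing script is not in the repo, the following is only a rough sketch of the filtering step described above, not the actual script. It assumes the raw WikiQA column names from the TSV header and that a Label of "1" marks a correct answer sentence; `filter_answerable` and the sample rows are made up for illustration. It does not reproduce the column order that jsonify.py expects.

```python
import csv
import io


def filter_answerable(rows):
    """Keep only rows whose question has at least one sentence labelled "1"."""
    rows = list(rows)
    # Collect the IDs of questions that have a correct answer sentence.
    answerable = {row["QuestionID"] for row in rows if row["Label"] == "1"}
    return [row for row in rows if row["QuestionID"] in answerable]


# Tiny made-up sample in the raw WikiQA column layout (not real data).
SAMPLE = (
    "QuestionID\tQuestion\tDocumentID\tDocumentTitle\tSentenceID\tSentence\tLabel\n"
    "Q0\tq zero\tD0\tTitle 0\tD0-0\tsentence a\t0\n"
    "Q1\tq one\tD1\tTitle 1\tD1-0\tsentence b\t0\n"
    "Q1\tq one\tD1\tTitle 1\tD1-1\tsentence c\t1\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t", quoting=csv.QUOTE_NONE)
kept = filter_answerable(reader)
print([row["SentenceID"] for row in kept])
```

In this sample Q0 has no sentence labelled "1", so its row is dropped, while both rows of Q1 are kept.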

After preprocessing, the file should look like:

{"label": "0", "sentence_id": "D11-0", "question": ["how", "big", "is", "bmc", "software", "in", "houston", ",", "tx"], "title": "BMC Software", "answer": ["bmc", "software", ",", "inc.", "is", "an", "american", "company", "specializing", "in", "business", "service", "management", "(", "bsm", ")", "software", "."], "document_id": "D11", "question_id": "Q11"}
{"label": "0", "sentence_id": "D11-1", "question": ["how", "big", "is", "bmc", "software", "in", "houston", ",", "tx"], "title": "BMC Software", "answer": ["headquartered", "in", "houston", ",", "texas", ",", "bmc", "develops", ",", "markets", "and", "sells", "software", "used", "for", "multiple", "functions", ",", "including", "it", "service", "management", ",", "data", "center", "automation", ",", "performance", "management", ",", "virtualization", "lifecycle", "management", "and", "cloud", "computing", "management", "."], "document_id": "D11", "question_id": "Q11"}

Each line contains one QA pair in JSON format.
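
As a small sketch of how such a file could be read back, assuming one JSON object per line as in the example above: `read_qa_pairs` is a hypothetical helper, and the sample line is an abbreviated version of the first record shown.

```python
import io
import json


def read_qa_pairs(lines):
    """Parse an iterable of JSON lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


# Abbreviated copy of the first record above (token lists shortened).
SAMPLE = io.StringIO(
    '{"label": "0", "sentence_id": "D11-0", "question": ["how", "big"], '
    '"title": "BMC Software", "answer": ["bmc", "software"], '
    '"document_id": "D11", "question_id": "Q11"}\n'
)

pairs = read_qa_pairs(SAMPLE)
for pair in pairs:
    # "question" and "answer" are already lists of lowercased tokens.
    print(pair["question_id"], pair["sentence_id"], " ".join(pair["question"]))
```

With a real file, the same function can be given an open file object, since iterating over a text file yields one line at a time.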

@kurtespinosa
Author

Thank you for taking the time to answer my question. This clarifies it.
