User Documentation: Importing a Dataset
For the import system to read your dataset without you having to write any code, the documents must be in a specific format. The command-line utility takes the path to a folder, or directory, that holds your dataset. That folder must contain a file named dataset_metadata.txt and a directory named documents. The format of each is specified below.
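As a quick pre-flight check, the required layout can be verified before running the importer. The following is a minimal sketch, not part of the import system itself; the function name check_dataset_layout is hypothetical, and only the two names dataset_metadata.txt and documents come from this page.

```python
from pathlib import Path


def check_dataset_layout(dataset_dir):
    """Return a list of layout problems; an empty list means the folder looks importable.

    Hypothetical helper: it only checks the two entries this page requires.
    """
    root = Path(dataset_dir)
    problems = []
    # The dataset folder must contain a metadata file with this exact name.
    if not (root / "dataset_metadata.txt").is_file():
        problems.append("missing file: dataset_metadata.txt")
    # The dataset folder must contain a directory holding the document files.
    if not (root / "documents").is_dir():
        problems.append("missing directory: documents")
    return problems
```

Calling check_dataset_layout("path/to/my_dataset") on a correctly prepared folder should return an empty list.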
The dataset_metadata.txt file specifies metadata about the dataset itself, not to be confused with document metadata. Each line in the file represents a metadata key-value pair. The key is text naming the data, and the value is text, a date, or a number. The value is separated from the key by a colon (":"). Only the first colon on a line acts as the separator; any other colons on the same line are part of the value, not the key. White space is removed from both sides of the key and the value. Only two keys are special: readable_name and description. The readable_name key specifies the name of the dataset as it will appear on the website. The description key specifies the text that will be displayed to describe the dataset. For example:
readable_name: My Example Dataset
description: This is just a description. This dataset is a trivial example.
a number: 42
a date: 12/31/2011
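The parsing rules above (split on the first colon, trim white space from both sides) can be sketched as follows. This is an illustration under assumptions, not the importer's actual code: the function name parse_dataset_metadata is hypothetical, values are kept as plain strings because this page does not say how dates and numbers are recognized, and skipping blank lines or lines without a colon is an assumption.

```python
def parse_dataset_metadata(text):
    """Parse 'key: value' lines into a dict, splitting on the first colon only."""
    metadata = {}
    for line in text.splitlines():
        # Assumption: blank lines are ignored.
        if not line.strip():
            continue
        # partition() splits at the first colon, so later colons stay in the value.
        key, separator, value = line.partition(":")
        # Assumption: a line without any colon is skipped rather than rejected.
        if not separator:
            continue
        # White space is removed from both sides of the key and the value.
        metadata[key.strip()] = value.strip()
    return metadata


sample = (
    "readable_name: My Example Dataset\n"
    "description: This is just a description. This dataset is a trivial example.\n"
    "a number: 42\n"
    "a date: 12/31/2011\n"
)
parsed = parse_dataset_metadata(sample)
print(parsed["readable_name"])
```

Because only the first colon separates key from value, a line such as "source: http://example.com" would keep the full URL as its value.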
The documents directory must contain files, each of which represents one document. Each document file must be encoded in ASCII or UTF-8; otherwise some characters may be dropped during the import process. The top of each document must contain the document's metadata. The text of the document begins after a blank line. For example:
metadata: this is some metadata

The document text starts here.
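The document layout described above (metadata at the top, then a blank line, then the text) can be split apart as in the sketch below. The function name split_document is hypothetical. The sketch also assumes that document metadata uses the same "key: value" form as dataset_metadata.txt, which the example suggests but this page does not state, and that a file with no blank line has no body text.

```python
def split_document(text):
    """Split a document into (metadata dict, body text) at the first blank line."""
    lines = text.splitlines()
    metadata = {}
    for index, line in enumerate(lines):
        # The first blank line ends the metadata; everything after it is the body.
        if not line.strip():
            return metadata, "\n".join(lines[index + 1:])
        # Assumption: metadata lines are 'key: value', split on the first colon.
        key, separator, value = line.partition(":")
        if separator:
            metadata[key.strip()] = value.strip()
    # Assumption: with no blank line, the file is treated as metadata only.
    return metadata, ""


document = "metadata: this is some metadata\n\nThe document text starts here.\n"
doc_metadata, doc_body = split_document(document)
print(doc_metadata)
print(doc_body)
```

To read real files, opening them with open(path, encoding="utf-8") matches the ASCII or UTF-8 requirement, since ASCII is a subset of UTF-8.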