User Documentation: Importing a Dataset

craig8196 edited this page Feb 9, 2015 · 1 revision

For the import system to read your dataset without requiring you to write any code, the documents must be in a specific format. The command-line utility takes the folder, or directory, in which your documents are located. That folder must contain a file named dataset_metadata.txt and a directory named documents. The format of each is specified below.
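
Before running the import, it can help to confirm that a folder has the required layout. The following is a minimal sketch in Python, not part of the import utility itself; the function name check_dataset_layout is hypothetical and only the two required names come from the description above.

```python
from pathlib import Path


def check_dataset_layout(dataset_dir):
    """Return a list of layout problems; an empty list means the layout looks fine.

    Hypothetical helper: it only checks for the two entries the import
    system requires, and does not inspect their contents.
    """
    root = Path(dataset_dir)
    problems = []
    # The metadata file must exist as a regular file.
    if not (root / "dataset_metadata.txt").is_file():
        problems.append("missing file: dataset_metadata.txt")
    # The documents entry must be a directory.
    if not (root / "documents").is_dir():
        problems.append("missing directory: documents")
    return problems
```

Calling check_dataset_layout on your dataset folder returns an empty list when both required entries are present.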

Dataset Metadata Format

The dataset_metadata.txt file specifies metadata about the dataset itself, not to be confused with document metadata. Each line in the file represents one metadata key-value pair. The key is text naming the data, and the value is text, a date, or a number. The key is separated from the value by the first colon (":") on the line; any other colons on the same line are part of the value, not the key. Whitespace is removed from both sides of the key and the value. Only two keys are special: readable_name and description. The readable_name key specifies the name of the dataset as it will appear on the website. The description key specifies the text that will be displayed to describe the dataset. For example:

readable_name: My Example Dataset
description: This is just a description. This dataset is a trivial example.
a number: 42
a date: 12/31/2011
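
The rules above (split at the first colon, strip whitespace from both sides) can be sketched in a few lines of Python. This is an illustration rather than the import system's actual parser: the function name parse_dataset_metadata is hypothetical, values are left as strings because the conversion to dates and numbers is not specified here, and skipping lines without a colon is an assumption.

```python
def parse_dataset_metadata(text):
    """Parse the contents of dataset_metadata.txt into a dict of strings.

    Hypothetical sketch: values are kept as text, with no date or
    number conversion.
    """
    metadata = {}
    for line in text.splitlines():
        # Assumption: lines without a colon (such as blank lines) are skipped.
        if ":" not in line:
            continue
        # Split at the first colon only, so later colons stay in the value.
        key, value = line.split(":", 1)
        metadata[key.strip()] = value.strip()
    return metadata
```

Applied to the example file above, this yields the keys readable_name, description, a number, and a date, with "42" and "12/31/2011" as plain strings.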

Documents Directory and Format

The documents directory must contain files, each of which represents one document. Each document file must be encoded in ASCII or UTF-8; otherwise, some characters may be dropped during the import process. The top of each document must contain the document's metadata. The text of the document begins after the first blank line. For example:

metadata: this is some metadata

The document text starts here.
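
To illustrate how a document file divides into metadata and text, here is a minimal Python sketch. It is not the import system's own code: the function name parse_document is hypothetical, and it assumes that document metadata lines use the same key: value syntax as dataset_metadata.txt, which the example above suggests but this page does not state.

```python
def parse_document(text):
    """Split a document into (metadata dict, body text) at the first blank line.

    Hypothetical sketch. Assumption: metadata lines are key: value pairs,
    split at the first colon, as in dataset_metadata.txt.
    """
    # Normalize Windows line endings so the blank line is always "\n\n".
    text = text.replace("\r\n", "\n")
    # Everything before the first blank line is metadata; the rest is the body.
    # If there is no blank line, the whole file is treated as metadata.
    header, _, body = text.partition("\n\n")
    metadata = {}
    for line in header.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            metadata[key.strip()] = value.strip()
    return metadata, body
```

Reading the file with open(path, encoding="utf-8") before passing its contents to this function covers both of the accepted encodings, since ASCII is a subset of UTF-8.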