A Django app for the Litteraturlabbet project at CDH
The data for this application was collected from Litteraturbanken as `.html` files, which were subsequently processed and analyzed with the Passim tool. The data is organized as `Work`s belonging to `Author`s; each `Work` has multiple `Page`s.

The original data was delivered as directories, each corresponding to a work (or book) and containing each page as a separate `.html` file:
```
├── lb7598
│   ├── res_000000.html
│   ├── res_000001.html
│   ├── res_000002.html
│   ├── res_000003.html
│   ├── res_000004.html
│   ├── res_000005.html
│   └── ...
├── lb7598
└── ...
```
Methods to parse this structure are found in the `data/litteraturbanken.py` file.
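The actual helpers live in `data/litteraturbanken.py`; the general approach can be sketched as follows. The function names here are illustrative, not the project's API, and the sketch assumes the page text is simply the text content of each `res_*.html` file:

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects all text content from an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_page_text(html: str) -> str:
    """Strip tags from a single page file and return whitespace-normalized text."""
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.chunks).split())


def iter_work_pages(work_dir: Path):
    """Yield (page_number, text) for every page file in a work directory such as lb7598/."""
    for page_file in sorted(work_dir.glob("res_*.html")):
        number = int(page_file.stem.split("_")[1])  # res_000003 -> 3
        yield number, extract_page_text(page_file.read_text(encoding="utf-8"))
```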
To upload the works to Diana, use the methods in the `data/upload.py` file.
If creating the database from scratch, consider applying the custom Django migration in the `custom_migrations` folder, which optimizes the full-text search in the works. It should already be functional when re-uploading material.
To extract the metadata, use the Litteraturbanken API: `https://litteraturbanken.se/api/get_work_info?lbworkid=<lbworkid>`, where `lbworkid` is a (non-unique) identifier of a certain work in their collections. See `data/upload.py` for an example.
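As a sketch, the metadata call can be made with Python's standard library. The endpoint is the one above; the structure of the response is not specified here, so it is simply decoded as JSON:

```python
import json
import urllib.parse
import urllib.request

API = "https://litteraturbanken.se/api/get_work_info"


def work_info_url(lbworkid: str) -> str:
    """Build the metadata URL for a given lbworkid."""
    return f"{API}?{urllib.parse.urlencode({'lbworkid': lbworkid})}"


def fetch_work_info(lbworkid: str) -> dict:
    """Fetch and decode the work metadata as JSON."""
    with urllib.request.urlopen(work_info_url(lbworkid)) as response:
        return json.load(response)
```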
It is highly recommended to read the instructions and examples of Passim before using it. Also consult the model specification of this application to understand how to upload the results.
To run Passim, it is important that the data has the correct input format. The scope of each text should be the minimal context in which reuse can be identified, which is normally the page level of a work.
To make sure Passim works, create a single input file, e.g. `in.json`. It uses a non-standard JSON format (JSON Lines), where each entry is a piece of text represented as a JSON object on a single line, followed by a newline, like so:
```
{"id": 0, "page": 497812, "series": "lb1", "work": "lb1", "text": "..."}
{"id": 1, "page": 497811, "series": "lb1", "work": "lb1", "text": "..."}
{"id": 2, "page": 497810, "series": "lb1", "work": "lb1", "text": "..."}
{"id": 3, "page": 497810, "series": "lb2", "work": "lb2", "text": "..."}
{"id": 4, "page": 497810, "series": "lb3", "work": "lb3", "text": "..."}
```
Each entry has the following attributes:

- `id`: A unique ID for the entry
- `page`: A unique ID for the page
- `series`: The scope where *not* to look for reuse, e.g. within a book. Most often the ID of the work, e.g. `lbworkid`.
- `work`: The work ID
- `text`: The raw text on the page
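Producing such an input file can be sketched as follows, assuming the pages are already available as `(work_id, page_id, text)` tuples (the helper name is illustrative). Since `series` defines the scope where reuse is ignored, the work ID is reused for it, so that reuse within the same work is not reported:

```python
import json


def write_passim_input(pages, path="in.json"):
    """Write one JSON object per line, as Passim expects.

    `pages` is an iterable of (work_id, page_id, text) tuples; the work ID
    doubles as the `series`, so reuse within the same work is ignored.
    """
    with open(path, "w", encoding="utf-8") as f:
        for i, (work_id, page_id, text) in enumerate(pages):
            record = {"id": i, "page": page_id, "series": work_id,
                      "work": work_id, "text": text}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```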
To run Passim, please consult its instructions. If you have a powerful computer, consider launching it with:

```shell
SPARK_SUBMIT_ARGS='--master local[12] --driver-memory 32G --executor-memory 8G' passim in.json out/
```
The resulting output is organized in the `out/out.json/*.json` directory as multiple JSON files. The results in each file are found cases of reuse, so-called segments, which look like the following:
```
{"uid":-6649366733325186173,"cluster":298,"size":68,"bw":0,"ew":140,"id":255145,"page":500053,"series":"lb1","text":"...","work":"lb1","gid":6696630315996178597,"begin":0,"end":1148}
{"uid":-1031951694183583480,"cluster":298,"size":68,"bw":0,"ew":135,"id":255195,"page":509037,"series":"lb2","text":"...","work":"lb2","gid":276275582612511478,"begin":0,"end":1097}
```
with these attributes:

- `uid`: Unique identifier for the segment
- `cluster`: Cluster which the segment belongs to
- `size`: Number of segments in the cluster
- `bw`: Word location of the beginning
- `ew`: Word location of the end
- `id`: Incremental unique identifier
- `page`: ID of the page where the segment was found
- `series`: Scope where the segment was found
- `text`: The extracted textual reuse
- `gid`: Unique identifier for the segment
- `begin`: Character location of the beginning
- `end`: Character location of the end
The resulting data is converted to Django and Diana with simple mappings. One segment in the resulting files corresponds to a row in the `Segment` table in Diana. Each segment belongs to a `Cluster`, with an ID and size. A `Cluster` belongs to a `Page`, and a `Page` belongs to a `Work`. Each `Work` then has a foreign key to an `Author`.
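The chain of relations can be illustrated schematically. The actual application defines these as Django models; the field names below are assumptions for illustration only, so consult the model specification for the real definitions:

```python
from dataclasses import dataclass


@dataclass
class Author:
    name: str


@dataclass
class Work:
    title: str
    author: "Author"   # each Work has a foreign key to an Author


@dataclass
class Page:
    number: int
    work: "Work"       # each Page belongs to a Work


@dataclass
class Cluster:
    cluster_id: int
    size: int          # number of segments in the cluster
    page: "Page"       # a Cluster belongs to a Page


@dataclass
class Segment:
    uid: int
    cluster: "Cluster" # one segment in the output files -> one Segment row
    text: str
    begin: int         # character offsets of the reuse
    end: int
```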