(Best displayed with markdown formatting on)
Currently, the following refers to the "coal" subset of the data, although most of the structure is replicated and the scripts are the same.
"Civility" data is kept separate even though some of the records may be the same, because space is not an issue and downloading can occur in the background with little extra input. With just two datasets about rather different topics, this is more straightforward than devising a new structure that keeps them together (e.g. by date) but callable separately at will. In future, that option should be considered, but probably with a proper query structure (SQL-like).
The subfolders are structured in order of execution:
- `records`: obtained from the aph website as per the search string
- `full_text`: raw downloads as rectangular dataframes by year, including records
- `processed`: cleaned-up version of `full_text`
- `model_inputs`: generated for the model; includes the combined full texts
- `scan_parameters`: produced in bulk
The first three correspond to scripts in `scripts/download`.
The last two correspond to scripts in `scripts/modeling`.
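As a minimal sketch of working with this layout (the folder names are those listed above; the root path and the helper name are assumptions for illustration), the download-stage subfolders can be checked like this:

```python
from pathlib import Path

# Subfolders listed above, in order of execution.
STAGES = ["records", "full_text", "processed", "model_inputs", "scan_parameters"]

def check_layout(root):
    """Return the stage subfolders missing under `root` (hypothetical helper)."""
    root = Path(root)
    return [name for name in STAGES if not (root / name).is_dir()]
```

For example, `check_layout("data/coal")` (path assumed) returns an empty list when every stage folder is present.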
- `dtm`: contains raw output from Dynamic Topic Modelling
- `scan`: contains raw output from SCAN
- `cleaned`: contains processed output from either model in CSV format, which is used to more easily calculate coherence
- `scan_coherence_*.csv`: coherence calculations
All scripts generating the above are stored under `scripts/modeling`.
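A small sketch for gathering the coherence calculation files (the glob pattern follows the filename shown above; the root path and function name are assumptions):

```python
from pathlib import Path

def coherence_files(root):
    """Collect scan_coherence_*.csv files under `root`, sorted by name."""
    return sorted(Path(root).glob("scan_coherence_*.csv"))
```

This returns the CSVs in a stable order, which is convenient when comparing coherence runs side by side.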
Files needed to run SCAN: the Python `requirements.txt` and the R Project data.
Run `scan.sh` to run the SCAN modelling pipeline. It is on `$PATH`, so it can be called from anywhere and will run with the settings in the corresponding scripts.
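Since `scan.sh` is expected on `$PATH`, a hedged sketch of invoking it from Python (the wrapper function is hypothetical; only the script name comes from this README) could be:

```python
import shutil
import subprocess

def run_scan_pipeline():
    """Locate scan.sh on $PATH and run it, raising if it is not installed."""
    exe = shutil.which("scan.sh")
    if exe is None:
        raise FileNotFoundError("scan.sh not found on $PATH")
    # check=True raises CalledProcessError if the pipeline exits non-zero.
    return subprocess.run([exe], check=True)
```

Because the script carries its own settings, no arguments are passed here; any configuration changes go in the corresponding scripts.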