README v1.1

Predicting Radiology Prioritisation and Protocols Using Natural Language Processing and Machine Learning

Author: Stephen Lyen
Date: 18 August 2024

Dissertation for the MSc in Computer Science, University of Bath

Contents

  1. Submitted Files
  2. Extract Notebooks
  3. Download Files
  4. Dependencies
  5. Models
  6. Data
  7. Python Notebooks

Submitted Files (Engage)

A summarised list and brief descriptions of the files submitted on Engage. All of these files are also available on GitHub: <Github link>. The Google Drive link is available in the submitted Readme.md file.

  • Notebooks.zip
    Details in subsequent Python Notebooks section.

    1. DataProcessing.ipynb
    2. DataAugmentation.ipynb
    3. UrgencyBert.ipynb
    4. UrgencyRoberta.ipynb
    5. ProtocolBert.ipynb
    6. ProtocolRoberta.ipynb
    7. ProtocolBert-CNN.ipynb
    8. ProtocolRoberta-CNN.ipynb
    9. SVM.ipynb
    10. ModelTester.ipynb
  • Sample.xlsx
    This is a sample dataset file, provided because the full datasets exceed Engage's upload limits. It is not intended to be run with the provided notebooks.

  • Readme file

--PLEASE READ BEFORE RUNNING CODE--

1. Extract notebooks

Extract the contents of Notebooks.zip to your chosen folder.

2. Download files

The full anonymised datasets, study models, and pre-trained RoBERTa model are available for download from Google Drive (the Google Drive link is available in the submitted Readme.md file):

  1. Models (Folder)
  2. Data (Folder)
  3. RoBERTa-base-PM-M3-Voc-distill-align-hf (Folder)

IMPORTANT: For the code to run, please save the above folders inside the folder containing the .ipynb files extracted from Notebooks.zip.

RoBERTa-base-PM-M3-Voc-distill-align-hf contains the pre-trained RoBERTa model, downloaded from GitHub under a Creative Commons NonCommercial 4.0 licence (Lewis et al. 2020, link: https://github.com/facebookresearch/bio-lm). Please see manuscript Methods 3.7.1 for a description and reference.
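For illustration, the local folder can be loaded with the Hugging Face transformers library just like a hub model (a minimal sketch; the notebooks contain the authoritative loading code):

```python
from transformers import AutoModel, AutoTokenizer

# Load the pre-trained biomedical RoBERTa from the local folder,
# saved alongside the .ipynb files as described above.
model_dir = "RoBERTa-base-PM-M3-Voc-distill-align-hf"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModel.from_pretrained(model_dir)
```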

As the study was performed using Google Colaboratory, the data has been temporarily retained on Google Drive to keep the number of external data storage locations to a minimum, and to facilitate future research.


Dependencies

  • All code was run on Google Colaboratory using Python 3.10.12.
  • The external pre-trained RoBERTa model is provided in the download folder linked above; all other models and libraries are readily available (a sketch of a Colab setup cell follows).
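For reference, a sketch of a Colaboratory setup cell. The library list is inferred from this README (PyTorch .pth models, BERT/RoBERTa via transformers, CSV data via pandas, an SVM via scikit-learn, and nlpaug backtranslation); exact versions are whatever the Colab image provides:

```python
# Most of these ship with Colaboratory; nlpaug usually needs installing.
!pip install -q nlpaug

import torch, transformers, pandas, sklearn, nlpaug
print(torch.__version__, transformers.__version__)
```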

Models

  • File list:
  1. urgency_bert.pth
  2. urgency_roberta.pth
  3. urgency_bert_cnn.pth
  4. urgency_roberta_cnn.pth
  5. protocol_bert.pth
  6. protocol_roberta.pth
  7. protocol_bert_cnn.pth
  8. protocol_roberta_cnn.pth

(Google Drive link available in submitted Readme.md file.)

Before Running Code: please copy the Models folder to the same location as the .ipynb files.

This folder contains all of the final fine-tuned PyTorch models that achieved the best performance metrics.

The file name format is task_modeltype(_cnn). Tasks are urgency or protocol classification, and the model types are RoBERTa or BERT. For example, urgency_roberta_cnn.pth is the RoBERTa-CNN model applied to urgency classification. All provided protocol classification models were trained on augmented datasets, and all urgency classification models on non-augmented datasets.
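For example, a saved model can be restored with PyTorch (a minimal sketch, assuming the .pth files hold complete pickled models rather than bare state dicts; ModelTester.ipynb below contains the authoritative loading code):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# RoBERTa-CNN fine-tuned for urgency classification (naming: task_modeltype_cnn).
model = torch.load("Models/urgency_roberta_cnn.pth", map_location=device)
model.eval()  # switch to inference mode for testing
```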

ModelTester.ipynb

A ModelTester notebook has been provided for transformer and transformer-CNN models.

Before Running Code: there is a cell labelled 'SET PARAMETERS' following the imports and function declarations, where the user sets the following parameters (a sketch of this cell follows the list):

  • set_label: 'Protocol' or 'UrgBin'
  • set_cnn: True will apply the transformer-CNN, False will apply the transformer only
  • set_model: 'roberta' or 'bert'
  • dir_path: please replace with the file path to the submission folder
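Taken together, the cell looks along these lines (a sketch; the dir_path value is a hypothetical placeholder):

```python
# --- SET PARAMETERS ---
set_label = "UrgBin"   # 'Protocol' or 'UrgBin'
set_cnn   = True       # True: transformer-CNN, False: transformer only
set_model = "roberta"  # 'roberta' or 'bert'
dir_path  = "/content/drive/MyDrive/submission"  # replace with your submission folder
```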

Data

File list:

  • Non-augmented:
  1. Xtrain_70-30_no-aug.csv
  2. Xtest_70-30_no-aug.csv
  3. ytrain_70-30_no-aug.csv
  4. ytest_70-30_no-aug.csv
  • Augmented:
  1. Xtrain_aug_1000-400.csv
  2. Xtest_aug_1000-400.csv
  3. ytrain_aug_1000-400.csv
  4. ytest_aug_1000-400.csv
  • Misc:
  1. nlp_languages.csv
  2. SampleRaw.xlsm

(Google Drive link available in submitted Readme.md file.)

Before Running Code: please save the Data folder in the folder containing the .ipynb files.

This folder contains the full datasets for this project, divided into Xtrain, Xtest, ytrain, and ytest .csv files. These are either non-augmented (labelled no-aug) or augmented (labelled threshold-number).

For simplicity, only the optimal augmentation level has been provided, i.e. 1000-400, which is 400 cases below the 1000 threshold. Please see manuscript Methods 3.6 for details.

nlp_languages.csv contains a further list of language codes for the nlpaug backtranslation functions.
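For illustration, the splits can be read straight into pandas (a sketch; the notebooks' own loading cells are authoritative):

```python
import pandas as pd

# Non-augmented 70/30 split; swap the suffix for the augmented files.
X_train = pd.read_csv("Data/Xtrain_70-30_no-aug.csv")
X_test  = pd.read_csv("Data/Xtest_70-30_no-aug.csv")
y_train = pd.read_csv("Data/ytrain_70-30_no-aug.csv")
y_test  = pd.read_csv("Data/ytest_70-30_no-aug.csv")
```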


Python Notebooks

DataProcessing.ipynb

Cleans the pre-anonymised data and extracts data from the semi-structured 'Comment' column. Full details of the different datatypes are in the manuscript Methods section 3.5.

Note that the full raw data input for this notebook has not been provided; only an example spreadsheet containing 100 cases, SampleRaw.xlsm, is included. The full dataset is available upon request.

Before Running Code: replace file_path with the path to the folder containing the dataset file. If using the sample data, set set_sample = True.
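As a hedged illustration of the kind of extraction this notebook performs on the semi-structured 'Comment' column (the regular expression below is hypothetical; the actual fields are described in manuscript Methods 3.5):

```python
import pandas as pd

set_sample = True   # True when using the provided sample data
file_path = "Data"  # replace with the folder containing the dataset file
df = pd.read_excel(f"{file_path}/SampleRaw.xlsm")

# Hypothetical example: pull a 'Protocol: ...' field out of the free text.
df["Protocol"] = df["Comment"].str.extract(r"Protocol:\s*(\w+)", expand=False)
```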

DataAugmentation.ipynb

Augments the training dataset generated from DataProcessing.ipynb.

This notebook was executed in two sections due to GPU memory constraints, although depending on the user's GPU this may not apply. If necessary, please restart the kernel and then run section 2.

Before Running Code: replace file_path with the path to the folder containing the dataset file. If using the sample data, set set_sample = True.
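For orientation, nlpaug backtranslation looks like the following (a minimal sketch; the notebook's actual augmenter configuration, including the language codes from nlp_languages.csv, may differ):

```python
import nlpaug.augmenter.word as naw

# Backtranslation (English -> German -> English) paraphrases the request text.
aug = naw.BackTranslationAug(
    from_model_name="facebook/wmt19-en-de",
    to_model_name="facebook/wmt19-de-en",
)
augmented = aug.augment("CT head with contrast, query space-occupying lesion.")
```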

Transformer and Transformer-CNN Notebooks

  • File List:
  1. UrgencyRoberta.ipynb
  2. UrgencyBert.ipynb
  3. ProtocolRoberta.ipynb
  4. ProtocolBert.ipynb
  5. ProtocolRoberta-CNN.ipynb
  6. ProtocolBert-CNN.ipynb

Notebooks used to train, validate, and test the models. The file name format is task-model; e.g. ProtocolRoberta-CNN is the protocol classification task using the RoBERTa-CNN model.

Before Running Code: for all files, replace dir_path with the path to the folder containing the dataset files.

Load Dataset and Parameters Section:
Set set_aug = True to load the augmented data, or False for the non-augmented data (see the sketch below).
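In practice set_aug simply selects between the two file sets listed in the Data section (a sketch; the dir_path value is a hypothetical placeholder):

```python
import pandas as pd

dir_path = "/content/drive/MyDrive/submission"  # replace with your folder
set_aug = True  # True: augmented 1000-400 files, False: non-augmented 70/30 files

suffix = "aug_1000-400" if set_aug else "70-30_no-aug"
X_train = pd.read_csv(f"{dir_path}/Data/Xtrain_{suffix}.csv")
y_train = pd.read_csv(f"{dir_path}/Data/ytrain_{suffix}.csv")
```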

Urgency Classification with RoBERTa-CNN and BERT-CNN

To run the RoBERTa-CNN and BERT-CNN models on the urgency classification task, set set_cnn = True in the UrgencyRoberta and UrgencyBert notebooks respectively.

SVM Notebook

Before Running Code: replace dir_path with the path to the folder containing the dataset files.

Load Dataset and Parameters Section:
Set set_aug = True to load the augmented data, or False for the non-augmented data.
Set set_label = 'Protocol' or 'UrgBin' (a sketch of a baseline pipeline follows).
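For orientation, a baseline of this kind is commonly built with scikit-learn (a hedged sketch assuming TF-IDF features and a 'Comment' text column; the notebook's actual features and hyperparameters are described in the manuscript):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

set_label = "Protocol"  # 'Protocol' or 'UrgBin'
X_train = pd.read_csv("Data/Xtrain_70-30_no-aug.csv")
y_train = pd.read_csv("Data/ytrain_70-30_no-aug.csv")

# TF-IDF + SVM pipeline; the text column name is an assumption.
clf = make_pipeline(TfidfVectorizer(), SVC())
clf.fit(X_train["Comment"], y_train[set_label])
```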

Contact

If any issues are encountered, please contact the author Stephen Lyen.
email: [email protected]
