Multivariate logistic regression for multi-label classification of analytes of interest, using gas chromatography–mass spectrometry (GCMS) data and a library collected in-house. Context of the data collected:
- Three-dimensional data (x, y and z axes)
- The specific mass spectrum (m/z) and total ion current (TIC) distribution at a specific retention time (RT) allow a compound to be identified
A graphical illustration of the three dimensions of the collected data can be seen below:
Main parts of the code include:
- Preprocessing of data (data extraction from raw data files, normalisation within the same scan number, one-hot encoding for classification against the in-house library, etc.)
- Feature engineering of data (time-bin encoding: 35-minute run duration / 0.25-minute time interval -> 140 time bins; 140 time bins * 300 m/z values -> 42,000 features per row per sample)
- Modelling using logistic regression
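
The time-bin encoding described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the names `normalise_scan` and `encode_sample`, the lowest m/z channel `MZ_MIN`, the use of integer (nominal) m/z values, base-peak normalisation, and summing scans that fall in the same bin are all assumptions made for the example. Only the 35-minute run, the 0.25 interval and the 300 m/z channels come from the description.

```python
RUN_MINUTES = 35.0   # run duration, from the description above
BIN_MINUTES = 0.25   # time-bin width, assumed to be in minutes
N_MZ = 300           # number of m/z channels, from the description above
MZ_MIN = 1           # hypothetical lowest m/z channel (real range not stated)

N_BINS = round(RUN_MINUTES / BIN_MINUTES)  # 35 / 0.25 = 140 time bins
N_FEATURES = N_BINS * N_MZ                 # 140 * 300 = 42000 features


def normalise_scan(intensities):
    """Scale one scan so its largest peak is 1 (assumed normalisation)."""
    peak = max(intensities, default=0.0)
    return [v / peak for v in intensities] if peak > 0 else list(intensities)


def encode_sample(scans):
    """Flatten (retention_time_min, {mz: intensity}) scans into one feature row."""
    row = [0.0] * N_FEATURES
    for rt, spectrum in scans:
        if not 0.0 <= rt <= RUN_MINUTES:
            continue  # ignore scans outside the run window
        # Clamp so a scan at exactly 35.0 min lands in the last bin.
        bin_idx = min(int(rt / BIN_MINUTES), N_BINS - 1)
        mzs = [mz for mz in spectrum if MZ_MIN <= mz < MZ_MIN + N_MZ]
        norm = normalise_scan([spectrum[mz] for mz in mzs])
        for mz, value in zip(mzs, norm):
            # Scans sharing a time bin are summed (an assumption).
            row[bin_idx * N_MZ + (mz - MZ_MIN)] += value
    return row


# Two made-up scans, purely for illustration.
example_row = encode_sample([(0.1, {50: 10.0, 51: 5.0}), (10.0, {100: 2.0})])
```

Each sample then becomes a single row of 42,000 values, which is the shape a logistic regression model expects as input.
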
Features of the code include:
- Creating a new model from data
- Using an existing trained model
- Adding more data to existing models
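
As a rough illustration of these three workflows, the sketch below trains one independent binary logistic model per compound, saves the models, reloads them, and continues training on new data. It is a self-contained toy written with the standard library only. The function names, the JSON file format, the gradient-descent settings and the compound names are all hypothetical, and treating "adding more data" as continued training from the saved weights is an assumption. The actual code may well use a library such as scikit-learn and may retrain from scratch instead.

```python
import json
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def new_models(compounds, n_features):
    """Create one untrained binary model (weights and bias) per compound."""
    return {c: {"w": [0.0] * n_features, "b": 0.0} for c in compounds}


def _score(model, x):
    return sum(wi * xi for wi, xi in zip(model["w"], x)) + model["b"]


def fit(models, X, Y, epochs=200, lr=0.5):
    """Run stochastic gradient descent starting from the current weights.

    X is a list of feature rows; Y is a list of {compound: 0 or 1} labels.
    Calling fit again on a loaded model continues training with new data.
    """
    for _ in range(epochs):
        for x, y in zip(X, Y):
            for name, m in models.items():
                err = sigmoid(_score(m, x)) - y[name]  # log-loss gradient term
                m["w"] = [wi - lr * err * xi for wi, xi in zip(m["w"], x)]
                m["b"] -= lr * err
    return models


def predict_proba(models, x):
    """Return the probability of each compound being present in sample x."""
    return {name: sigmoid(_score(m, x)) for name, m in models.items()}


def save(models, path):
    with open(path, "w") as fh:
        json.dump(models, fh)


def load(path):
    with open(path) as fh:
        return json.load(fh)


# Toy data: two features, two made-up compounds (multi-label targets).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
Y = [{"cmpd_A": 1, "cmpd_B": 0}, {"cmpd_A": 0, "cmpd_B": 1},
     {"cmpd_A": 1, "cmpd_B": 1}, {"cmpd_A": 0, "cmpd_B": 0}]
models = fit(new_models(["cmpd_A", "cmpd_B"], n_features=2), X, Y)
```

Because each compound has its own model, a sample can score highly for several compounds at once, which is what makes the setup multi-label rather than multi-class.
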
An example of the generated report can be seen below, where each compound's model gives the probability that the compound is present in the sample:
Disclaimer: In-house data was removed entirely and compound names were replaced with arbitrary names to keep the contents private.