Miroslav Pakanec edited this page Mar 29, 2021 · 2 revisions

Config description:

  1. Fishnchips config files can be found in /configs (see, e.g., test_config.json).
  2. Most scripts (such as run_fishnchips.py, train_fishnchips, etc.) expect the path to a config file, passed with the --config/-c flag.
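
A minimal sketch of how a script might accept and load such a config, assuming the file is plain JSON. The helper names parse_args and load_config are hypothetical and are not taken from the Fishnchips code base; only the --config/-c flag comes from the text above.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse the command line; only the --config/-c flag is modelled here."""
    parser = argparse.ArgumentParser(description="config loading sketch")
    parser.add_argument("-c", "--config", required=True,
                        help="path to a JSON config file")
    return parser.parse_args(argv)


def load_config(path):
    """Read the JSON config file and return it as a dict."""
    with open(path) as handle:
        return json.load(handle)


if __name__ == "__main__":
    args = parse_args()
    config = load_config(args.config)
    # Print the top-level section names found in the file.
    print(sorted(config))
```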

Structure:

Model

  • signal_window_size - length of the windows into which the read signal is segmented (e.g. windows of 300 signal points)
  • label_window_size - cap on the label length for each signal window (e.g. 100)

! make sure that no signal window has a corresponding label longer than this cap

  • attention_blocks - number of transformer encoder/decoder blocks (e.g. if set to 4, there will be 4 encoder and 4 decoder blocks)
  • cnn_blocks - number of CNN blocks (each block has 3 convolution layers, ReLU activation, a residual connection and dropout layers)
  • maxpool_kernel - maxpool kernel size

! signal_window_size % maxpool_kernel == 0

  • maxpool_idx - defines the position of the maxpool layer between CNN blocks (e.g. if set to 3, there are 3 CNN blocks followed by a max pool layer, followed by the rest of the CNN blocks)

! maxpool_idx < cnn_blocks

  • d_model - depth of the model (e.g. if set to 250, each signal point becomes a 250-dimensional vector)
  • dff - the size of the point-wise feed-forward network after the self-attention
  • num_heads - number of attention heads

! d_model % num_heads == 0

  • dropout_rate
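
The three arithmetic constraints marked with "!" above can be checked before a model is built. The sketch below assumes the Model parameters are available as a plain dict keyed by the names listed here; check_model_config is a hypothetical helper, and the example values other than d_model = 250 and signal_window_size = 300 are made up for illustration.

```python
def check_model_config(model):
    """Return a list of violated constraints (empty if the config is consistent)."""
    errors = []
    if model["signal_window_size"] % model["maxpool_kernel"] != 0:
        errors.append("signal_window_size % maxpool_kernel must be 0")
    if not model["maxpool_idx"] < model["cnn_blocks"]:
        errors.append("maxpool_idx must be smaller than cnn_blocks")
    if model["d_model"] % model["num_heads"] != 0:
        errors.append("d_model % num_heads must be 0")
    return errors


# Illustrative values only; they are not taken from test_config.json.
example = {
    "signal_window_size": 300,
    "maxpool_kernel": 2,
    "maxpool_idx": 3,
    "cnn_blocks": 5,
    "d_model": 250,
    "num_heads": 5,
}

if __name__ == "__main__":
    for problem in check_model_config(example):
        print("config error:", problem)
```

The label_window_size cap is not covered here, because checking it requires looking at the labels in the training data rather than at the config alone.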

Training

  • data - path to training data hdf5 file
  • epochs
  • patience - number of epochs to keep training without improvement (e.g. 300); patience is reset whenever validation accuracy improves
  • warmup - number of warmup epochs (during warmup, patience is reset after every epoch)
  • batches - defines how many batches make up an epoch (e.g. 1000)

This should be adjusted together with signal_window_size for performance reasons

  • batch_size
  • buffer_size - how many reads should be loaded, segmented into windows, and shuffled to create training batches
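  • lr_mult - see the appendix of Improving base calling accuracy with Transformers
  • signal_window_stride - the step between the starts of consecutive signal windows during segmentation (a stride smaller than signal_window_size makes the windows overlap)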
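
How epochs, patience and warmup interact can be sketched as a small early-stopping loop. This is only one reading of the descriptions above, not the project's training code: the function name epochs_until_stop is hypothetical, and the exact stopping rule (stop once patience epochs have passed without improvement) is an assumption.

```python
def epochs_until_stop(val_accuracies, patience, warmup):
    """Return the number of epochs run before training stops.

    val_accuracies: validation accuracy per epoch, in order
                    (its length plays the role of the epochs setting).
    patience:       epochs allowed without improvement.
    warmup:         number of initial epochs during which patience is always reset.
    """
    best = float("-inf")
    remaining = patience
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        improved = accuracy > best
        if improved:
            best = accuracy
        if improved or epoch <= warmup:
            remaining = patience      # reset on improvement or during warmup
        else:
            remaining -= 1            # one more epoch without improvement
        if remaining <= 0:
            return epoch              # patience exhausted: stop early
    return len(val_accuracies)        # ran through all configured epochs
```

With a warmup of 2, two stagnant epochs at the start do not count against patience, so training runs longer than it would with a warmup of 0.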

Validation

  • data - path to validation data hdf5 file
  • batch_size
  • buffer_size - legacy parameter that can be omitted and does not influence any workflow
  • signal_window_stride - For validation, set this to signal_window_size
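
The effect of signal_window_stride on segmentation can be illustrated with a short sketch. It assumes the stride is the step between window start positions and that a trailing partial window is dropped; both points are assumptions, and window_starts is a hypothetical helper rather than a function from the project.

```python
def window_starts(signal_length, signal_window_size, signal_window_stride):
    """Return the start index of every full window in a signal."""
    if signal_window_stride <= 0:
        raise ValueError("signal_window_stride must be positive")
    last_start = signal_length - signal_window_size
    # Windows that would run past the end of the signal are dropped (assumption).
    return list(range(0, last_start + 1, signal_window_stride))


if __name__ == "__main__":
    # Stride equal to the window size: windows tile the signal without overlap.
    print(window_starts(1000, 300, 300))
    # Smaller stride: consecutive windows overlap by 200 signal points.
    print(window_starts(1000, 300, 100))
```

Setting the stride to signal_window_size, as recommended for validation, means each signal point is covered by at most one window.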

Testing

  • batch_size
  • buffer_size - legacy parameter that can be omitted and does not influence any workflow
  • signal_window_stride - window stride, which determines how predictions are assembled:
      • signal_window_stride < signal_window_size => the assembler will be used to compute the alignment and consensus of the predicted sequences
      • signal_window_stride == signal_window_size => predictions will be concatenated
      • signal_window_stride > signal_window_size => error
  • save_predictions - save predictions as a FASTA file
  • reads - number of reads to run within a data directory
  • bacteria - list describing the test data (the list is enumerated, so that e.g. if the reads parameter is 20, 20 reads will be inferred for each element). Each element has:
      • name - used for reports and evaluation filenames
      • data - folder containing fast5 files
      • reference - path to the reference genome of this particular bacterium
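
The three signal_window_stride cases for testing amount to a simple dispatch. The sketch below only restates those rules; assembly_mode and the returned strings are hypothetical names chosen for illustration and do not refer to the project's assembler code.

```python
def assembly_mode(signal_window_size, signal_window_stride):
    """Decide how per-window predictions would be combined."""
    if signal_window_stride < signal_window_size:
        # Overlapping windows: align predictions and build a consensus.
        return "assemble"
    if signal_window_stride == signal_window_size:
        # Non-overlapping windows: join predictions end to end.
        return "concatenate"
    # A stride larger than the window would skip parts of the signal.
    raise ValueError("signal_window_stride must not exceed signal_window_size")
```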
