Merge pull request #116 from MitraDarja/improve_doc

Improve doc
seqan · Sep 17, 2021 · f5b53e5 · f5b53e5
2 parents b3cb1c5 + e0694b1
commit f5b53e5
Show file tree

Hide file tree

Showing 14 changed files with 194 additions and 219 deletions.
diff --git a/.gitignore b/.gitignore
@@ -35,10 +35,6 @@
 test/data/IBF_0.000000
 test/data/IBF_0.500000
 test/data/IBF_2.000000
-test/data/Genome_mean/
-test/data/Genome_median/
-test/data/mean/
-test/data/median/
 test/data/IBF_Level_0
 test/data/IBF_Level_1
 test/data/IBF_Level_2

diff --git a/LICENSE.md b/LICENSE.md
@@ -1,6 +1,6 @@
 BSD 3-Clause License
 
-Copyright (c) 2020, Mitra Darvish
+Copyright (c) 2021, Mitra Darvish
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without

diff --git a/README.md b/README.md
@@ -1,8 +1,11 @@
 ## Needle
-Needle provides a space-efficient data structure to index a large amount of NGS data and allows fast searches through these indices.
-Due to the space-efficiency of one index, it is affordable to create multiple indices with different expression rates. Therefore, a semi-quantitative analysis of the data becomes possible. Needle is based on Interleaved Bloom Filters, which is a compact and efficient structure to store multiple Bloom Filters. Furthermore, Needle uses a windowing scheme (also called minimisers) to reduce the amount of data to store.
 
-## Build
+[![Build Status](https://github.com/seqan/app-template/workflows/App%20CI/badge.svg)](https://github.com/seqan/needle/actions?query=branch%3Amaster+workflow%3A%22App+CI%22)
+
+Needle is a tool for semi-quantitative analysis of very large collections of nucleotide sequences.
+Needle stores its data in multiple interleaved Bloom filter, a fast and space efficient probabilistic data structure and uses a windowing scheme (also called minimisers) to reduce the amount of data to store. How many interleaved Bloom filter are used is defined by the user. Each interleaved Bloom filter has a so called expression threshold and stores minimisers with an occurrence greater than or equal to its own expression threshold and smaller than the next biggest expression threshold (if there is no bigger expression threshold, all greater than or equal to the threshold are stored). These expression thresholds are then used during the query (called estimate) to approximate the expression values of given transcripts.
+
+## Install
 
 Needle can be built by following these commands:
 
@@ -21,26 +24,21 @@ make test
 
 If you are interested in building the documentation, just use the command: `make doc`
 
-## Create an IBF
-In order to create an IBF a number of sequence files have to be given. All sequence file formats from seqan3 are accepted as an input (fasta, fastq, embl,... and their compressed forms). With the parameter m can be defined, which of these sequence files belong together, either because they are the result of paired-end sequencing or they are multiple replicates of the same experiment. If no specification with m is given, every sequence file is seen as one experiment. For paired-end experiments one can use the flag '--paired' to indicate this, so two consecutive sequence files are seen as belonging together. (This is equivalent to using -m 2 for all experiments.)
-Besides, the false positive rate of the IBF has to be specified with parameter f.
-Use -h/--help for more information and to see further parameters.
+## Build
+In order to build a Needle index a number of sequence files have to be given. All sequence file formats supported by seqan3 are accepted as an input (fasta, fastq, embl,... and their compressed forms). The flag `--paired` in the example below indicates that the given sequence files are paired-end experiments. Furthermore, the false positive rate has to be specified with the parameter `f`.
+Use -h/--help for more information and to see further parameters. The flag `-c` can be used to build a compressed Needle index.
 
-The following example creates an IBF for two experiments for the expression levels 4 and 32. Both experiments had two replicates, therefore m is used to specify this. With c a compressed IBF is created.
+The following example creates a compressed Needle index for two paired-end experiments for the expression thresholds 4 and 32.
 
 ```
-./bin/needle ibf ../needle/test/data/exp_*.fasta --samples 2 --samples 2 -e 4 -e 32 -f 0.3 -c -o example
-
-// Or with flag paired
 ./bin/needle ibf ../needle/test/data/exp_*.fasta --paired -e 16 -e 32 -f 0.3 -c -o example
 ```
 
-## Calculate Minimisers
-In case one is only interested in the minimisers or wants to preprocess the data first before creating an IBF, the function minimiser can be used. It calculates the minimisers of given experiments and stores their hash values and their occurrences in a binary file named ".minimiser". Furthermore, a txt file is created where all used arguments are stored (like used k-mer size or window size), the used expression levels and the minimiser counts per expression level.
+Although, this works. It is recommended to calculate the minimisers beforehand by using the option `minimisers`. It calculates the minimisers of given experiments and stores their hash values and their occurrences in a binary file named ".minimiser".
 
 The following command calculates the minimisers in the two experiments.
 ```
-./bin/needle minimiser ../needle/test/data/exp_*.fasta -samples 2 -samples 2
+./bin/needle minimiser ../needle/test/data/exp_*.fasta --paired
 ```
 
 A minimiser file is a binary file containing the following data:
@@ -52,13 +50,13 @@ A minimiser file is a binary file containing the following data:
 - shape (uint64_t), if flag is false
 - all minimiser hashes (uint64_t) with their occurrences (uint16_t)
 
-Based on a minimiser file the ibfs can be computed by using the following command:
+Based on the minimiser files the Needle index can be computed by using the following command:
 ```
 ./bin/needle ibfmin exp*.minimiser -e 16 -e 32  -f 0.3 -c -o example
 ```
 
 ## Estimate
-To estimate the expression value of one transcript a sequence file has to be given. Use the parameter "-i" to define where the IBFs can be found (should be equal with "-o" in the previous commands).
+To estimate the expression value of one transcript a sequence file has to be given. Use the parameter "-i" to define where the Needle index can be found (should be equal with "-o" in the previous commands).
 Use -h/--help for more information and to see further parameters.
 The following example searches for one gene, which is expressed in the first experiment with expression 6 and in the second with expression 37. Therefore, it should be found only in the second experiment but not the first when using expression levels of 16 and 32.
 

diff --git a/include/ibf.h b/include/ibf.h
@@ -13,8 +13,8 @@
 
 struct minimiser_arguments
 {
-    std::filesystem::path include_file; // Needs to be defined when only minimizers appearing in this file should be stored
-    std::filesystem::path exclude_file; // Needs to be defined when minimizers appearing in this file should NOT be stored
+    std::filesystem::path include_file; // Needs to be defined when only minimisers appearing in this file should be stored
+    std::filesystem::path exclude_file; // Needs to be defined when minimisers appearing in this file should NOT be stored
     std::vector<int> samples{}; // Can be used to indicate that sequence files belong to the same experiment
     bool paired = false; // If true, than experiments are seen as paired-end experiments
     std::vector<uint8_t> cutoffs{};
@@ -33,42 +33,6 @@ struct RandomGenerator {
 	}
 };
 
-/*!\brief Calculate best bin size based on number of elements maximal inserted, false positive rate and number of
- *        hash functions. See: https://hur.st/bloomfilter/
- * \param count     The number of elements to be stored.
- * \param fpr       The false positive rate to use.
- * \param num_hash  The number of hash functions to use.
- * \returns bin_size
- */
-inline uint64_t get_bin_size(uint64_t count, float fpr, size_t num_hash)
-{
-    return std::ceil((count * std::log(fpr)) / std::log(1 / std::pow(2, std::log(2))));
-}
-
-/*!\brief Gets all sequences from a specified number of sequence files.
- * \param sequence_files A vector of paths to the sequence files.
- * \param sequences      The data strucuture, where the sequenecs should be stored in.
- * \param min_len        The minimum length a sequence should have to be stored.
- * \param first          The first position of in the sequence file vector, which should be considered. Default: 0.
- * \param num_exp        The number of sequence files that should be considered. Default: 1.
- */
-void get_sequences(std::vector<std::filesystem::path> const & sequence_files,
-                   seqan3::concatenated_sequences<seqan3::dna4_vector> & sequences, uint16_t min_len,
-                   unsigned first = 0, unsigned num_exp = 1);
-
-/*!\brief Gets all minimisers from all sequences in a given vector.
-* \param args                      The minimiser arguments to use (seed, shape, window size).
-* \param sequences                 The data strucuture, where the sequenecs are stored.
-* \param hash_table                The hash table, where minimisers should be stored.
-* \param genome_set_table          The minimisers found in a genome mask.
-* \param genome_file               The file to the genome mask. Default: "".
-* \param only_genome               True, if only minimisers found in the genome mask should be stored. Default: False.
-*/
-void get_minimisers(min_arguments const & args, seqan3::concatenated_sequences<seqan3::dna4_vector> const & sequences,
-                    robin_hood::unordered_node_map<uint64_t, uint16_t> & hash_table,
-                    robin_hood::unordered_set<uint64_t> const & genome_set_table,
-                    std::filesystem::path const & genome_file = "", bool only_genome = false);
-
 /*!\brief Get the concrete expression values (= median of all counts of one transcript) for given experiments.
 *         This function can be used to estimate how good the median approach can be, if all count values are available.
 * \param args               The minimiser arguments to use (seed, shape, window size).
@@ -95,31 +59,31 @@ void read_binary(std::filesystem::path filename, robin_hood::unordered_node_map<
 */
 void read_binary_start(min_arguments & args, std::filesystem::path filename, uint64_t & num_of_minimisers);
 
-/*! \brief Create IBF.
+/*! \brief Creates IBFs.
  * \param sequence_files  A vector of sequence file paths.
  * \param ibf_args        The IBF specific arguments to use (bin size, number of hash functions, ...). See
  *                        struct ibf_arguments.
  * \param minimiser_args  The minimiser specific arguments to use.
  * \param fpr             The average false positive rate that should be used.
- * \param expression_by_genome_file File that contains the only minimisers that should be comnsidered for the
- *                                  determination of the expression_levels.
+ * \param expression_by_genome_file File that contains the only minimisers that should be considered for the
+ *                                  determination of the expression thresholds.
  * \param num_hash        The number of hash functions to use.
- *  \returns The normalized expression values per experiment.
+ *  \returns The expression thresholds per experiment.
  */
 std::vector<uint16_t> ibf(std::vector<std::filesystem::path> const & sequence_files, estimate_ibf_arguments & ibf_args,
                           minimiser_arguments & minimiser_args, std::vector<double> & fpr,
                           std::filesystem::path const expression_by_genome_file = "",
                           size_t num_hash = 1);
 
-/*! \brief Create IBF based on the minimiser and header files
- * \param minimiser_files  A vector of minimiser file paths.
- * \param ibf_args         The IBF specific arguments to use (bin size, number of hash functions, ...). See
- *                         struct ibf_arguments.
+/*! \brief Creates IBFs based on the minimiser files
+ * \param minimiser_files A vector of minimiser file paths.
+ * \param ibf_args        The IBF specific arguments to use (bin size, number of hash functions, ...). See
+ *                        struct ibf_arguments.
  * \param fpr             The average false positive rate that should be used.
  * \param expression_by_genome_file File that contains the only minimisers that should be comnsidered for the
- *                                  determination of the expression_levels.
+ *                                  determination of the expression_thresholds.
  * \param num_hash        The number of hash functions to use.
- *  \returns The normalized expression values per experiment.
+ *  \returns The expression thresholds per experiment.
  */
 std::vector<uint16_t> ibf(std::vector<std::filesystem::path> const & minimiser_files,
                           estimate_ibf_arguments & ibf_args, std::vector<double> & fpr,
@@ -128,7 +92,7 @@ std::vector<uint16_t> ibf(std::vector<std::filesystem::path> const & minimiser_f
 
 /*! \brief Create minimiser and header files.
 * \param sequence_files  A vector of sequence file paths.
-* \param args             The minimiser arguments to use (seed, shape, window size).
+* \param args            The minimiser arguments to use (seed, shape, window size).
 * \param minimiser_args  The minimiser specific arguments to use.
 */
 void minimiser(std::vector<std::filesystem::path> const & sequence_files, min_arguments const & args,

diff --git a/include/shared.h b/include/shared.h
@@ -34,8 +34,8 @@ struct min_arguments : all_arguments
 struct estimate_ibf_arguments : min_arguments
 {
     bool compressed = false;
-    std::vector<uint16_t> expression_levels{}; // Expression levels which should be created
-    uint8_t number_expression_levels{}; // If set, the expression levels are determined by the program.
+    std::vector<uint16_t> expression_thresholds{}; // Expression levels which should be created
+    uint8_t number_expression_thresholds{}; // If set, the expression levels are determined by the program.
     bool samplewise{false};
 
     template<class Archive>
@@ -46,8 +46,8 @@ struct estimate_ibf_arguments : min_arguments
         archive(s.get());
         archive(shape);
         archive(compressed);
-        archive(number_expression_levels);
-        archive(expression_levels);
+        archive(number_expression_thresholds);
+        archive(expression_thresholds);
         archive(samplewise);
     }
 
@@ -59,8 +59,8 @@ struct estimate_ibf_arguments : min_arguments
         archive(s.get());
         archive(shape);
         archive(compressed);
-        archive(number_expression_levels);
-        archive(expression_levels);
+        archive(number_expression_thresholds);
+        archive(expression_thresholds);
         archive(samplewise);
     }
 };

diff --git a/src/estimate.cpp b/src/estimate.cpp
@@ -21,6 +21,8 @@
 #include <seqan3/io/sequence_file/all.hpp>
 
 #include "estimate.h"
+
+// Actual estimation
 template <class IBFType, bool last_exp, bool normalization, typename exp_t>
 void check_ibf(min_arguments const & args, IBFType const & ibf, std::vector<uint16_t> & estimations_i,
                seqan3::dna4_vector const seq, std::vector<uint32_t> & prev_counts,
@@ -156,13 +158,13 @@ void estimate(estimate_ibf_arguments & args, IBFType & ibf, std::filesystem::pat
     read_levels<double>(fprs, estimate_args.path_in.string() + "IBF_FPRs.fprs");
 
     // Make sure expression levels are sorted.
-    sort(args.expression_levels.begin(), args.expression_levels.end());
+    sort(args.expression_thresholds.begin(), args.expression_thresholds.end());
 
     // Initialse last expression
     if constexpr (samplewise)
-        load_ibf(ibf, estimate_args.path_in.string() + "IBF_Level_" + std::to_string(args.number_expression_levels-1));
+        load_ibf(ibf, estimate_args.path_in.string() + "IBF_Level_" + std::to_string(args.number_expression_thresholds-1));
     else
-        load_ibf(ibf, estimate_args.path_in.string() + "IBF_" + std::to_string(args.expression_levels[args.expression_levels.size()-1]));
+        load_ibf(ibf, estimate_args.path_in.string() + "IBF_" + std::to_string(args.expression_thresholds[args.expression_thresholds.size()-1]));
     counter.assign(ibf.bin_count(), 0);
     counter_est.assign(ibf.bin_count(), 0);
 
@@ -180,27 +182,27 @@ void estimate(estimate_ibf_arguments & args, IBFType & ibf, std::filesystem::pat
     {
         if constexpr (samplewise & normalization_method)
             check_ibf<IBFType, true, true>(args, ibf, estimations[i], seqs[i], prev_counts[i],
-                                           expressions,args.number_expression_levels - 1,
-                                           fprs[args.number_expression_levels - 1]);
+                                           expressions,args.number_expression_thresholds - 1,
+                                           fprs[args.number_expression_thresholds - 1]);
         else if constexpr (samplewise)
             check_ibf<IBFType, true, false>(args, ibf, estimations[i], seqs[i], prev_counts[i],
-                                            expressions, args.number_expression_levels - 1,
-                                            fprs[args.number_expression_levels - 1]);
+                                            expressions, args.number_expression_thresholds - 1,
+                                            fprs[args.number_expression_thresholds - 1]);
         else
             check_ibf<IBFType, true, false>(args, ibf, estimations[i], seqs[i], prev_counts[i],
-                                            args.expression_levels[args.expression_levels.size() - 1], prev_expression,
-                                            fprs[args.expression_levels.size() - 1]);
+                                            args.expression_thresholds[args.expression_thresholds.size() - 1], prev_expression,
+                                            fprs[args.expression_thresholds.size() - 1]);
     }
 
     if constexpr (!samplewise)
-        prev_expression = args.expression_levels[args.expression_levels.size() - 1];
+        prev_expression = args.expression_thresholds[args.expression_thresholds.size() - 1];
 
-    for (int j = args.number_expression_levels - 2; j >= 0; j--)
+    for (int j = args.number_expression_thresholds - 2; j >= 0; j--)
     {
         if constexpr (samplewise)
             load_ibf(ibf, estimate_args.path_in.string() + "IBF_Level_" + std::to_string(j));
         else
-            load_ibf(ibf, estimate_args.path_in.string() + "IBF_" + std::to_string(args.expression_levels[j]));
+            load_ibf(ibf, estimate_args.path_in.string() + "IBF_" + std::to_string(args.expression_thresholds[j]));
 
         // Go over the sequences
         #pragma omp parallel for
@@ -214,11 +216,11 @@ void estimate(estimate_ibf_arguments & args, IBFType & ibf, std::filesystem::pat
                                           expressions, j, fprs[j]);
             else
                 check_ibf<IBFType, false, false>(args, ibf, estimations[i], seqs[i], prev_counts[i],
-                                          args.expression_levels[j], prev_expression, fprs[j]);
+                                          args.expression_thresholds[j], prev_expression, fprs[j]);
         }
 
         if (!samplewise)
-            prev_expression = args.expression_levels[j];
+            prev_expression = args.expression_thresholds[j];
     }
 
     std::ofstream outfile;
@@ -235,6 +237,7 @@ void estimate(estimate_ibf_arguments & args, IBFType & ibf, std::filesystem::pat
 
 }
 
+// Calls the correct form of estimate
 void call_estimate(estimate_ibf_arguments & args, estimate_arguments & estimate_args)
 {
     load_args(args, std::string{estimate_args.path_in} + "IBF_Data");