Skip to content

Commit

Permalink
Merge pull request #116 from MitraDarja/improve_doc
Browse files Browse the repository at this point in the history
Improve doc
  • Loading branch information
MitraDarja authored Sep 17, 2021
2 parents b3cb1c5 + e0694b1 commit f5b53e5
Show file tree
Hide file tree
Showing 14 changed files with 194 additions and 219 deletions.
4 changes: 0 additions & 4 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -35,10 +35,6 @@
test/data/IBF_0.000000
test/data/IBF_0.500000
test/data/IBF_2.000000
test/data/Genome_mean/
test/data/Genome_median/
test/data/mean/
test/data/median/
test/data/IBF_Level_0
test/data/IBF_Level_1
test/data/IBF_Level_2
Expand Down
2 changes: 1 addition & 1 deletion LICENSE.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
BSD 3-Clause License

Copyright (c) 2020, Mitra Darvish
Copyright (c) 2021, Mitra Darvish
All rights reserved.

Redistribution and use in source and binary forms, with or without
Expand Down
30 changes: 14 additions & 16 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,8 +1,11 @@
## Needle
Needle provides a space-efficient data structure to index a large amount of NGS data and allows fast searches through these indices.
Due to the space-efficiency of one index, it is affordable to create multiple indices with different expression rates. Therefore, a semi-quantitative analysis of the data becomes possible. Needle is based on Interleaved Bloom Filters, which is a compact and efficient structure to store multiple Bloom Filters. Furthermore, Needle uses a windowing scheme (also called minimisers) to reduce the amount of data to store.

## Build
[![Build Status](https://github.com/seqan/app-template/workflows/App%20CI/badge.svg)](https://github.com/seqan/needle/actions?query=branch%3Amaster+workflow%3A%22App+CI%22)

Needle is a tool for semi-quantitative analysis of very large collections of nucleotide sequences.
Needle stores its data in multiple interleaved Bloom filter, a fast and space efficient probabilistic data structure and uses a windowing scheme (also called minimisers) to reduce the amount of data to store. How many interleaved Bloom filter are used is defined by the user. Each interleaved Bloom filter has a so called expression threshold and stores minimisers with an occurrence greater than or equal to its own expression threshold and smaller than the next biggest expression threshold (if there is no bigger expression threshold, all greater than or equal to the threshold are stored). These expression thresholds are then used during the query (called estimate) to approximate the expression values of given transcripts.

## Install

Needle can be built by following these commands:

Expand All @@ -21,26 +24,21 @@ make test

If you are interested in building the documentation, just use the command: `make doc`

## Create an IBF
In order to create an IBF a number of sequence files have to be given. All sequence file formats from seqan3 are accepted as an input (fasta, fastq, embl,... and their compressed forms). With the parameter m can be defined, which of these sequence files belong together, either because they are the result of paired-end sequencing or they are multiple replicates of the same experiment. If no specification with m is given, every sequence file is seen as one experiment. For paired-end experiments one can use the flag '--paired' to indicate this, so two consecutive sequence files are seen as belonging together. (This is equivalent to using -m 2 for all experiments.)
Besides, the false positive rate of the IBF has to be specified with parameter f.
Use -h/--help for more information and to see further parameters.
## Build
In order to build a Needle index a number of sequence files have to be given. All sequence file formats supported by seqan3 are accepted as an input (fasta, fastq, embl,... and their compressed forms). The flag `--paired` in the example below indicates that the given sequence files are paired-end experiments. Furthermore, the false positive rate has to be specified with the parameter `f`.
Use -h/--help for more information and to see further parameters. The flag `-c` can be used to build a compressed Needle index.

The following example creates an IBF for two experiments for the expression levels 4 and 32. Both experiments had two replicates, therefore m is used to specify this. With c a compressed IBF is created.
The following example creates a compressed Needle index for two paired-end experiments for the expression thresholds 4 and 32.

```
./bin/needle ibf ../needle/test/data/exp_*.fasta --samples 2 --samples 2 -e 4 -e 32 -f 0.3 -c -o example
// Or with flag paired
./bin/needle ibf ../needle/test/data/exp_*.fasta --paired -e 16 -e 32 -f 0.3 -c -o example
```

## Calculate Minimisers
In case one is only interested in the minimisers or wants to preprocess the data first before creating an IBF, the function minimiser can be used. It calculates the minimisers of given experiments and stores their hash values and their occurrences in a binary file named ".minimiser". Furthermore, a txt file is created where all used arguments are stored (like used k-mer size or window size), the used expression levels and the minimiser counts per expression level.
Although, this works. It is recommended to calculate the minimisers beforehand by using the option `minimisers`. It calculates the minimisers of given experiments and stores their hash values and their occurrences in a binary file named ".minimiser".

The following command calculates the minimisers in the two experiments.
```
./bin/needle minimiser ../needle/test/data/exp_*.fasta -samples 2 -samples 2
./bin/needle minimiser ../needle/test/data/exp_*.fasta --paired
```

A minimiser file is a binary file containing the following data:
Expand All @@ -52,13 +50,13 @@ A minimiser file is a binary file containing the following data:
- shape (uint64_t), if flag is false
- all minimiser hashes (uint64_t) with their occurrences (uint16_t)

Based on a minimiser file the ibfs can be computed by using the following command:
Based on the minimiser files the Needle index can be computed by using the following command:
```
./bin/needle ibfmin exp*.minimiser -e 16 -e 32 -f 0.3 -c -o example
```

## Estimate
To estimate the expression value of one transcript a sequence file has to be given. Use the parameter "-i" to define where the IBFs can be found (should be equal with "-o" in the previous commands).
To estimate the expression value of one transcript a sequence file has to be given. Use the parameter "-i" to define where the Needle index can be found (should be equal with "-o" in the previous commands).
Use -h/--help for more information and to see further parameters.
The following example searches for one gene, which is expressed in the first experiment with expression 6 and in the second with expression 37. Therefore, it should be found only in the second experiment but not the first when using expression levels of 16 and 32.

Expand Down
62 changes: 13 additions & 49 deletions include/ibf.h
Original file line number Diff line number Diff line change
Expand Up @@ -13,8 +13,8 @@

struct minimiser_arguments
{
std::filesystem::path include_file; // Needs to be defined when only minimizers appearing in this file should be stored
std::filesystem::path exclude_file; // Needs to be defined when minimizers appearing in this file should NOT be stored
std::filesystem::path include_file; // Needs to be defined when only minimisers appearing in this file should be stored
std::filesystem::path exclude_file; // Needs to be defined when minimisers appearing in this file should NOT be stored
std::vector<int> samples{}; // Can be used to indicate that sequence files belong to the same experiment
bool paired = false; // If true, than experiments are seen as paired-end experiments
std::vector<uint8_t> cutoffs{};
Expand All @@ -33,42 +33,6 @@ struct RandomGenerator {
}
};

/*!\brief Calculate best bin size based on number of elements maximal inserted, false positive rate and number of
* hash functions. See: https://hur.st/bloomfilter/
* \param count The number of elements to be stored.
* \param fpr The false positive rate to use.
* \param num_hash The number of hash functions to use.
* \returns bin_size
*/
inline uint64_t get_bin_size(uint64_t count, float fpr, size_t num_hash)
{
return std::ceil((count * std::log(fpr)) / std::log(1 / std::pow(2, std::log(2))));
}

/*!\brief Gets all sequences from a specified number of sequence files.
* \param sequence_files A vector of paths to the sequence files.
* \param sequences The data strucuture, where the sequenecs should be stored in.
* \param min_len The minimum length a sequence should have to be stored.
* \param first The first position of in the sequence file vector, which should be considered. Default: 0.
* \param num_exp The number of sequence files that should be considered. Default: 1.
*/
void get_sequences(std::vector<std::filesystem::path> const & sequence_files,
seqan3::concatenated_sequences<seqan3::dna4_vector> & sequences, uint16_t min_len,
unsigned first = 0, unsigned num_exp = 1);

/*!\brief Gets all minimisers from all sequences in a given vector.
* \param args The minimiser arguments to use (seed, shape, window size).
* \param sequences The data strucuture, where the sequenecs are stored.
* \param hash_table The hash table, where minimisers should be stored.
* \param genome_set_table The minimisers found in a genome mask.
* \param genome_file The file to the genome mask. Default: "".
* \param only_genome True, if only minimisers found in the genome mask should be stored. Default: False.
*/
void get_minimisers(min_arguments const & args, seqan3::concatenated_sequences<seqan3::dna4_vector> const & sequences,
robin_hood::unordered_node_map<uint64_t, uint16_t> & hash_table,
robin_hood::unordered_set<uint64_t> const & genome_set_table,
std::filesystem::path const & genome_file = "", bool only_genome = false);

/*!\brief Get the concrete expression values (= median of all counts of one transcript) for given experiments.
* This function can be used to estimate how good the median approach can be, if all count values are available.
* \param args The minimiser arguments to use (seed, shape, window size).
Expand All @@ -95,31 +59,31 @@ void read_binary(std::filesystem::path filename, robin_hood::unordered_node_map<
*/
void read_binary_start(min_arguments & args, std::filesystem::path filename, uint64_t & num_of_minimisers);

/*! \brief Create IBF.
/*! \brief Creates IBFs.
* \param sequence_files A vector of sequence file paths.
* \param ibf_args The IBF specific arguments to use (bin size, number of hash functions, ...). See
* struct ibf_arguments.
* \param minimiser_args The minimiser specific arguments to use.
* \param fpr The average false positive rate that should be used.
* \param expression_by_genome_file File that contains the only minimisers that should be comnsidered for the
* determination of the expression_levels.
* \param expression_by_genome_file File that contains the only minimisers that should be considered for the
* determination of the expression thresholds.
* \param num_hash The number of hash functions to use.
* \returns The normalized expression values per experiment.
* \returns The expression thresholds per experiment.
*/
std::vector<uint16_t> ibf(std::vector<std::filesystem::path> const & sequence_files, estimate_ibf_arguments & ibf_args,
minimiser_arguments & minimiser_args, std::vector<double> & fpr,
std::filesystem::path const expression_by_genome_file = "",
size_t num_hash = 1);

/*! \brief Create IBF based on the minimiser and header files
* \param minimiser_files A vector of minimiser file paths.
* \param ibf_args The IBF specific arguments to use (bin size, number of hash functions, ...). See
* struct ibf_arguments.
/*! \brief Creates IBFs based on the minimiser files
* \param minimiser_files A vector of minimiser file paths.
* \param ibf_args The IBF specific arguments to use (bin size, number of hash functions, ...). See
* struct ibf_arguments.
* \param fpr The average false positive rate that should be used.
* \param expression_by_genome_file File that contains the only minimisers that should be comnsidered for the
* determination of the expression_levels.
* determination of the expression_thresholds.
* \param num_hash The number of hash functions to use.
* \returns The normalized expression values per experiment.
* \returns The expression thresholds per experiment.
*/
std::vector<uint16_t> ibf(std::vector<std::filesystem::path> const & minimiser_files,
estimate_ibf_arguments & ibf_args, std::vector<double> & fpr,
Expand All @@ -128,7 +92,7 @@ std::vector<uint16_t> ibf(std::vector<std::filesystem::path> const & minimiser_f

/*! \brief Create minimiser and header files.
* \param sequence_files A vector of sequence file paths.
* \param args The minimiser arguments to use (seed, shape, window size).
* \param args The minimiser arguments to use (seed, shape, window size).
* \param minimiser_args The minimiser specific arguments to use.
*/
void minimiser(std::vector<std::filesystem::path> const & sequence_files, min_arguments const & args,
Expand Down
12 changes: 6 additions & 6 deletions include/shared.h
Original file line number Diff line number Diff line change
Expand Up @@ -34,8 +34,8 @@ struct min_arguments : all_arguments
struct estimate_ibf_arguments : min_arguments
{
bool compressed = false;
std::vector<uint16_t> expression_levels{}; // Expression levels which should be created
uint8_t number_expression_levels{}; // If set, the expression levels are determined by the program.
std::vector<uint16_t> expression_thresholds{}; // Expression levels which should be created
uint8_t number_expression_thresholds{}; // If set, the expression levels are determined by the program.
bool samplewise{false};

template<class Archive>
Expand All @@ -46,8 +46,8 @@ struct estimate_ibf_arguments : min_arguments
archive(s.get());
archive(shape);
archive(compressed);
archive(number_expression_levels);
archive(expression_levels);
archive(number_expression_thresholds);
archive(expression_thresholds);
archive(samplewise);
}

Expand All @@ -59,8 +59,8 @@ struct estimate_ibf_arguments : min_arguments
archive(s.get());
archive(shape);
archive(compressed);
archive(number_expression_levels);
archive(expression_levels);
archive(number_expression_thresholds);
archive(expression_thresholds);
archive(samplewise);
}
};
Expand Down
31 changes: 17 additions & 14 deletions src/estimate.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,8 @@
#include <seqan3/io/sequence_file/all.hpp>

#include "estimate.h"

// Actual estimation
template <class IBFType, bool last_exp, bool normalization, typename exp_t>
void check_ibf(min_arguments const & args, IBFType const & ibf, std::vector<uint16_t> & estimations_i,
seqan3::dna4_vector const seq, std::vector<uint32_t> & prev_counts,
Expand Down Expand Up @@ -156,13 +158,13 @@ void estimate(estimate_ibf_arguments & args, IBFType & ibf, std::filesystem::pat
read_levels<double>(fprs, estimate_args.path_in.string() + "IBF_FPRs.fprs");

// Make sure expression levels are sorted.
sort(args.expression_levels.begin(), args.expression_levels.end());
sort(args.expression_thresholds.begin(), args.expression_thresholds.end());

// Initialse last expression
if constexpr (samplewise)
load_ibf(ibf, estimate_args.path_in.string() + "IBF_Level_" + std::to_string(args.number_expression_levels-1));
load_ibf(ibf, estimate_args.path_in.string() + "IBF_Level_" + std::to_string(args.number_expression_thresholds-1));
else
load_ibf(ibf, estimate_args.path_in.string() + "IBF_" + std::to_string(args.expression_levels[args.expression_levels.size()-1]));
load_ibf(ibf, estimate_args.path_in.string() + "IBF_" + std::to_string(args.expression_thresholds[args.expression_thresholds.size()-1]));
counter.assign(ibf.bin_count(), 0);
counter_est.assign(ibf.bin_count(), 0);

Expand All @@ -180,27 +182,27 @@ void estimate(estimate_ibf_arguments & args, IBFType & ibf, std::filesystem::pat
{
if constexpr (samplewise & normalization_method)
check_ibf<IBFType, true, true>(args, ibf, estimations[i], seqs[i], prev_counts[i],
expressions,args.number_expression_levels - 1,
fprs[args.number_expression_levels - 1]);
expressions,args.number_expression_thresholds - 1,
fprs[args.number_expression_thresholds - 1]);
else if constexpr (samplewise)
check_ibf<IBFType, true, false>(args, ibf, estimations[i], seqs[i], prev_counts[i],
expressions, args.number_expression_levels - 1,
fprs[args.number_expression_levels - 1]);
expressions, args.number_expression_thresholds - 1,
fprs[args.number_expression_thresholds - 1]);
else
check_ibf<IBFType, true, false>(args, ibf, estimations[i], seqs[i], prev_counts[i],
args.expression_levels[args.expression_levels.size() - 1], prev_expression,
fprs[args.expression_levels.size() - 1]);
args.expression_thresholds[args.expression_thresholds.size() - 1], prev_expression,
fprs[args.expression_thresholds.size() - 1]);
}

if constexpr (!samplewise)
prev_expression = args.expression_levels[args.expression_levels.size() - 1];
prev_expression = args.expression_thresholds[args.expression_thresholds.size() - 1];

for (int j = args.number_expression_levels - 2; j >= 0; j--)
for (int j = args.number_expression_thresholds - 2; j >= 0; j--)
{
if constexpr (samplewise)
load_ibf(ibf, estimate_args.path_in.string() + "IBF_Level_" + std::to_string(j));
else
load_ibf(ibf, estimate_args.path_in.string() + "IBF_" + std::to_string(args.expression_levels[j]));
load_ibf(ibf, estimate_args.path_in.string() + "IBF_" + std::to_string(args.expression_thresholds[j]));

// Go over the sequences
#pragma omp parallel for
Expand All @@ -214,11 +216,11 @@ void estimate(estimate_ibf_arguments & args, IBFType & ibf, std::filesystem::pat
expressions, j, fprs[j]);
else
check_ibf<IBFType, false, false>(args, ibf, estimations[i], seqs[i], prev_counts[i],
args.expression_levels[j], prev_expression, fprs[j]);
args.expression_thresholds[j], prev_expression, fprs[j]);
}

if (!samplewise)
prev_expression = args.expression_levels[j];
prev_expression = args.expression_thresholds[j];
}

std::ofstream outfile;
Expand All @@ -235,6 +237,7 @@ void estimate(estimate_ibf_arguments & args, IBFType & ibf, std::filesystem::pat

}

// Calls the correct form of estimate
void call_estimate(estimate_ibf_arguments & args, estimate_arguments & estimate_args)
{
load_args(args, std::string{estimate_args.path_in} + "IBF_Data");
Expand Down
Loading

0 comments on commit f5b53e5

Please sign in to comment.