- Introduction
- Tools
- Installation
- Usage
- Use u4falign for manipulating alignment results
- Difference between read-SAM and frag-SAM
Falign is a sequence alignment toolkit for long noisy chromosome conformation capture (3C) reads, such as Pore-C. Falign is written in C and C++.
Three tools are released together in this toolkit.
- falign. The alignment tool.
- falign_ngf. Another alignment tool. It is used for benchmarking only.
- u4falign. A utility for manipulating SAM or PAF mapping results.
- The directory supplementary_source_code contains scripts for generating simulated Pore-C reads and analyzing mapping results.
To install the pre-built Linux release:
$ git clone https://github.com/xiaochuanle/Falign.git
$ cd Falign/release
$ tar xzvf Falign-0.0.1-20221010-Linux-amd64.tar.gz
$ cd Falign/Linux-amd64/bin/
$ export PATH=$PATH:$(pwd)
The last command, export PATH=$PATH:$(pwd), adds the directory containing falign to the system PATH so that you don't have to type the full path of falign (such as /data3/cy/map_test/Falign/Linux-amd64/bin/falign) every time you use it.
Alternatively, build Falign from source:
$ cd Falign/src
$ make -j
$ cd ../Linux-amd64/bin/
$ export PATH=$PATH:$(pwd)
Decompress the sample data and enter its directory.
$ cd Falign
$ tar xzvf sample-data.tar.gz
$ cd sample-data
We provide one sample reference, ara_2_4.fa, in the ./sample-data/ref directory:
$ ls ref
ara_2_4.fa
and three sample read files in the ./sample-data/reads/ directory:
$ ls reads
ara_reads_1.fq ara_reads_2.fq ara_reads_3.fq
By default, falign outputs mapping results in SAM format:
$ falign -num_threads 48 ^GATC ref/ara_2_4.fa \
reads/ara_reads_1.fq \
reads/ara_reads_2.fq \
reads/ara_reads_3.fq > map.sam
To output the mapping results in PAF format, use the -outfmt paf option:
$ falign -num_threads 48 -outfmt paf ^GATC ref/ara_2_4.fa \
reads/ara_reads_1.fq \
reads/ara_reads_2.fq \
reads/ara_reads_3.fq > map.paf
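When falign is launched from a script rather than an interactive shell, the same command line can be assembled as an argument list. The sketch below is a minimal illustration, not part of Falign: the helper names build_falign_command and run_falign are hypothetical, and only the options shown in the commands above (-num_threads and -outfmt) are used. Passing the arguments as a list means the ^ in the enzyme sequence is never interpreted by a shell.

```python
import subprocess


def build_falign_command(enzyme_seq, reference, reads, num_threads=48, outfmt=None):
    """Assemble the falign argument list in the order used in this README.

    Hypothetical helper: it only mirrors the command lines shown above.
    """
    cmd = ["falign", "-num_threads", str(num_threads)]
    if outfmt is not None:
        # "paf" selects PAF output; SAM is the default when the option is omitted.
        cmd += ["-outfmt", outfmt]
    cmd += [enzyme_seq, str(reference)]
    cmd += [str(read_file) for read_file in reads]
    return cmd


def run_falign(cmd, output_path):
    """Run falign and write its standard output to output_path.

    Requires falign to be on PATH; raises CalledProcessError on failure.
    """
    with open(output_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


if __name__ == "__main__":
    cmd = build_falign_command(
        "^GATC",
        "ref/ara_2_4.fa",
        ["reads/ara_reads_1.fq", "reads/ara_reads_2.fq", "reads/ara_reads_3.fq"],
        outfmt="paf",
    )
    # Print the command instead of running it, so the sketch works without falign.
    print(" ".join(cmd))
```

Calling run_falign(cmd, "map.paf") would then correspond to the PAF command above.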
In the commands above, ^GATC is the recognition sequence of the DpnII restriction enzyme used for generating the reads. falign lists the sequences of commonly used restriction enzymes (run the falign command without arguments to print them), so you don't have to look them up elsewhere:
*** Examples of familiar <enzyme_seq>:
Enzyme_Name Enzyme_Seq
DpnII ^GATC
HindIII A^AGCTT
NcoI C^CATGG
NlaIII CATG^
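To make the notation in the table concrete, here is a minimal sketch of how such an enzyme string could be interpreted. It assumes that ^ marks the cut position inside the recognition sequence, which is the usual convention for restriction enzymes; the README does not spell this out, and the code is not taken from falign. The function names and the toy sequence are made up for illustration.

```python
def parse_enzyme_seq(enzyme_seq):
    """Split an enzyme string such as 'A^AGCTT' into (motif, cut_offset).

    Assumption: '^' marks where the enzyme cuts within the motif.
    """
    if enzyme_seq.count("^") != 1:
        raise ValueError("expected exactly one '^' in the enzyme sequence")
    cut_offset = enzyme_seq.index("^")
    motif = enzyme_seq.replace("^", "").upper()
    return motif, cut_offset


def find_cut_positions(sequence, enzyme_seq):
    """Return the 0-based cut position of every motif occurrence in sequence."""
    motif, cut_offset = parse_enzyme_seq(enzyme_seq)
    sequence = sequence.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start + cut_offset)
        # Advance by one base so overlapping occurrences are also reported.
        start = sequence.find(motif, start + 1)
    return positions


if __name__ == "__main__":
    toy_sequence = "AAGATCTTTGATCAAGCTT"  # toy sequence, not real data
    for enzyme in ("^GATC", "A^AGCTT"):
        print(enzyme, find_cut_positions(toy_sequence, enzyme))
```

Only the forward strand is searched, which is sufficient for the four motifs in the table because each of them is palindromic.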
Besides listing the read files on the command line as in the examples above, falign also accepts reads in the following ways:
- Read list. You can list the paths of all the reads in a file:
$ cat read_list.lst
/Users/chenying/Desktop/mwj/temp/Falign/sample-data/reads/ara_reads_1.fq
/Users/chenying/Desktop/mwj/temp/Falign/sample-data/reads/ara_reads_2.fq
/Users/chenying/Desktop/mwj/temp/Falign/sample-data/reads/ara_reads_3.fq
Then pass read_list.lst to falign:
$ falign -num_threads 48 ^GATC ref/ara_2_4.fa.gz read_list.lst > map.sam
- Directory. If you have many FASTQ files of reads in a directory, just pass the name of the directory:
$ falign -num_threads 48 ^GATC ref/ara_2_4.fa.gz reads > map.sam
Note that only read files are allowed in the reads directory. If you put other files in the reads directory, falign will complain.
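Both input styles above can be prepared with a few lines of scripting. The following sketch writes a read list file from a directory and also reports any entries that do not look like read files, which is useful before passing the directory itself to falign. The set of accepted file extensions is an assumption made for this example (the README does not state which extensions falign recognizes), and build_read_list is a hypothetical helper name.

```python
from pathlib import Path

# Assumed read-file extensions; adjust to match your data.
READ_SUFFIXES = (".fq", ".fastq", ".fq.gz", ".fastq.gz")


def build_read_list(read_dir, list_path):
    """Write one absolute read-file path per line to list_path.

    Returns (read_files, other_entries); other_entries are the entries in
    read_dir that would have to be moved away before giving the directory
    to falign directly.
    """
    read_files = []
    other_entries = []
    for entry in sorted(Path(read_dir).iterdir()):
        if entry.is_file() and entry.name.endswith(READ_SUFFIXES):
            read_files.append(entry.resolve())
        else:
            other_entries.append(entry)
    Path(list_path).write_text("".join(f"{path}\n" for path in read_files))
    return read_files, other_entries
```

For the sample data, build_read_list("reads", "read_list.lst") would produce a list file with the three ara_reads_*.fq paths.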
falign supports the SAM output format:
$ falign -outfmt sam
and the PAF format:
$ falign -outfmt paf
Every alignment has four offsets: read start, read end, reference start, and reference end. To each output record, falign adds the following fields:
- qS:i: the nearest restriction enzyme site to the read start position
- qE:i: the nearest restriction enzyme site to the read end position
- vS:i: the nearest restriction enzyme site to the reference start position
- vE:i: the nearest restriction enzyme site to the reference end position
- pi:f: percentage of identity of the alignment
- gs:i: the global chain score of the alignment's candidate
- hm:Z: a homologous map of the fragment
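The fields listed above follow the usual TAG:TYPE:VALUE layout of SAM optional fields, so they can be read with a small parser. The sketch below is an illustration only: the function name is hypothetical, and the example record is invented to show the layout (its values are not real falign output, and the hm field is left out because its value format is not described here). In SAM, optional fields start after the 11 mandatory columns; PAF has 12 mandatory columns, so the same function can be used with n_mandatory=12, assuming falign's PAF output follows that convention.

```python
# Converters for the SAM optional-field type codes used by the fields above.
TAG_CASTS = {"i": int, "f": float, "Z": str}


def parse_optional_fields(line, n_mandatory=11):
    """Return a dict mapping tag name to typed value for one record.

    n_mandatory is the number of leading mandatory columns to skip
    (11 for SAM, 12 for PAF).
    """
    columns = line.rstrip("\n").split("\t")
    tags = {}
    for field in columns[n_mandatory:]:
        parts = field.split(":", 2)  # the value itself may contain ':'
        if len(parts) != 3:
            continue  # not a TAG:TYPE:VALUE field
        tag, type_code, raw_value = parts
        tags[tag] = TAG_CASTS.get(type_code, str)(raw_value)
    return tags


if __name__ == "__main__":
    # Invented record for illustration; all values are placeholders.
    example = "\t".join([
        "read1", "0", "2", "1733063", "60", "100M", "*", "0", "0", "*", "*",
        "qS:i:0", "qE:i:100", "vS:i:1733062", "vE:i:1733162",
        "pi:f:93.5", "gs:i:80",
    ])
    print(parse_optional_fields(example))
```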
u4falign is used for manipulating the output results of falign. It supports the following commands:
- sam2salsa2: Convert SAM results to pairwise contacts for SALSA2
- sam23ddna: Convert SAM results to pairwise contacts for 3DDNA
- paf2salsa2: Convert PAF results to pairwise contacts for SALSA2
- paf23ddna: Convert PAF results to pairwise contacts for 3DDNA
- sam2frag-sam: Convert SAM results output by falign to fragment SAM mapping results
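As background for the "pairwise contacts" commands: a Pore-C read with several aligned fragments describes a multi-way contact, while tools such as SALSA2 and 3DDNA work with pairs. A common way to bridge the two is to emit every pair of fragments from the same read. The sketch below shows that decomposition only; whether u4falign uses exactly this rule, and the file formats it writes, are not covered here, so treat it as an assumption-laden illustration. The function name is hypothetical.

```python
from itertools import combinations


def pairwise_contacts(fragments):
    """Return all unordered pairs of fragments from a single read.

    fragments is a list of (reference_name, position) tuples, one per
    aligned fragment. A read with n fragments yields n * (n - 1) / 2 pairs.
    """
    return list(combinations(fragments, 2))


if __name__ == "__main__":
    # Reference names and positions of one read with five alignments.
    fragments = [
        ("2", 1794202),
        ("2", 1791136),
        ("2", 1733063),
        ("4", 12727850),
        ("2", 1713207),
    ]
    for left, right in pairwise_contacts(fragments):
        print(left, right)
```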
By convention, a read may have multiple records in SAM mapping results. To save space, the read sequence is usually given only in the first record of the read; in its remaining records, the sequence field is filled with a star (*). falign outputs SAM mapping results in this manner, and we call such results read-SAM. Since a Pore-C read contains multiple fragments, it usually has many mapping results:
0cd79600-51cf-4255-a6f7-0e9660721e85 16 2 1794202 1 ...
0cd79600-51cf-4255-a6f7-0e9660721e85 16 2 1791136 60 ...
0cd79600-51cf-4255-a6f7-0e9660721e85 0 2 1733063 60 ...
0cd79600-51cf-4255-a6f7-0e9660721e85 16 4 12727850 60 ...
0cd79600-51cf-4255-a6f7-0e9660721e85 0 2 1713207 60 ...
In the example above, the read 0cd79600-51cf-4255-a6f7-0e9660721e85 has five alignment results.
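To see how many records each read has in a read-SAM file, the records can be grouped by read name (the first column). The sketch below is a minimal example with a hypothetical function name; it uses only the first five columns of the records shown above, since the remaining columns are elided in the excerpt.

```python
from collections import Counter


def count_alignments_per_read(sam_lines):
    """Count records per read name, skipping header lines that start with '@'."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        # The read name is the first whitespace-delimited column.
        read_name = line.split(None, 1)[0]
        counts[read_name] += 1
    return counts


if __name__ == "__main__":
    # First five columns of the records above; the rest is elided in the text.
    excerpt = [
        "0cd79600-51cf-4255-a6f7-0e9660721e85 16 2 1794202 1",
        "0cd79600-51cf-4255-a6f7-0e9660721e85 16 2 1791136 60",
        "0cd79600-51cf-4255-a6f7-0e9660721e85 0 2 1733063 60",
        "0cd79600-51cf-4255-a6f7-0e9660721e85 16 4 12727850 60",
        "0cd79600-51cf-4255-a6f7-0e9660721e85 0 2 1713207 60",
    ]
    for read_name, n_records in count_alignments_per_read(excerpt).items():
        print(read_name, n_records)
```

On a real file, the same function can be applied to the open file object, for example count_alignments_per_read(open("map.sam")).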
Some tools that take BAM as input, such as whatshap, will complain if there are too many duplicate read names. In this case we suggest converting read-SAM to frag-SAM. In frag-SAM, every fragment is treated as an individual read. Use the sam2frag-sam command of u4falign to convert read-SAM to frag-SAM:
$ u4falign sam2frag-sam map.sam frag-map.sam
After the conversion, we have:
0cd79600-51cf-4255-a6f7-0e9660721e85_0000000001:000:0000000000:0001794201 16 2 1794202 1 ...
0cd79600-51cf-4255-a6f7-0e9660721e85_0000000001:001:0000000000:0001791135 16 2 1791136 60 ...
0cd79600-51cf-4255-a6f7-0e9660721e85_0000000001:002:0000000000:0001733062 0 2 1733063 60 ...
0cd79600-51cf-4255-a6f7-0e9660721e85_0000000001:003:0000000001:0012727849 16 4 12727850 60 ...
0cd79600-51cf-4255-a6f7-0e9660721e85_0000000001:004:0000000000:0001713206 0 2 1713207 60 ...
In frag-SAM, the sequence field of every alignment record holds the fragment sequence (not the whole read sequence). Note that we add a suffix to every read name to avoid duplicate read names in the SAM file. The fields in the suffix string are:
read-id:fragment-id:reference-sequence-id:reference-mapping-position
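Downstream scripts often need to recover the original read name and the suffix fields from a frag-SAM read name. The sketch below splits a name at its last underscore and then at the colons, following the layout described above. The class and function names are hypothetical. Judging from the example records, the position stored in the suffix is one less than the SAM position column (for example 0012727849 versus 12727850), which suggests a 0-based coordinate; this is an observation from the sample, not a documented guarantee.

```python
from typing import NamedTuple


class FragName(NamedTuple):
    read_name: str    # original read name, before the added suffix
    read_id: int
    fragment_id: int
    ref_id: int       # reference-sequence-id
    ref_pos: int      # reference-mapping-position as stored in the suffix


def parse_frag_name(frag_read_name):
    """Split '<read name>_<read-id>:<fragment-id>:<ref-id>:<ref-pos>'."""
    # Split at the LAST underscore so read names containing '_' still work.
    read_name, separator, suffix = frag_read_name.rpartition("_")
    parts = suffix.split(":")
    if not separator or len(parts) != 4:
        raise ValueError(f"not a frag-SAM read name: {frag_read_name!r}")
    read_id, fragment_id, ref_id, ref_pos = (int(part) for part in parts)
    return FragName(read_name, read_id, fragment_id, ref_id, ref_pos)


if __name__ == "__main__":
    name = "0cd79600-51cf-4255-a6f7-0e9660721e85_0000000001:003:0000000001:0012727849"
    print(parse_frag_name(name))
```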
- Chuan-Le Xiao
- Ying Chen
GPLv3