#!/usr/bin/env perl
=pod
=head1 TODO
Speed up graph making by using memory (parallelizing doesn't work)
Version 2
* Network analysis
* add DESeq and make Venn diagram
=head1 NAME
dew - Differential Expression on the Web
=head1 DESCRIPTION
Requires salmon v1.4.0.
It accepts a reference gene FASTA and aligns readsets against it using single-end mode of BWA.
Since alignments take the longest, it sacrifices speed for low memory usage (uses SQLite), so it works well on a web server with a few GB of memory or with very large datasets (TODO: farm out alignments)
User can also provide sequences (as FASTA) to be used as a reference from the cmdline
Makes a PNG image of the coverage of a reference gene(s) for each readset provided
Files are recalculated only if non-existent/older
If a sequence already exists in the database with alignments, it is not re-aligned; sequences are identified by their MD5 checksum
created server to run these things on
Added kangade
added fpkm via express - removed it in favour of salmon
added bowtie2; made it the default; accepts fastq.bz2 and gz
when not in contextual alignment, salmon processes each reference sequence separately (yes, this is incorrect, but the database results would otherwise be inconsistent, and we are after the FPKM...)
added edgeR
added html5 interactive heatmap and scatterplot
housekeeping genes are found automatically using a p-value of 1 +/- 0.05
Currently, the highest memory use is from samtools sort (2 GB) and from the depth calculations (9 GB for the fungal dataset).
SQLite is fast because the database resides in memory and is then written out to a file, but this prevents parallel runs.
Sorting requires up to 20 GB of RAM, but this is configurable.
The samtools version used is 0.1.18
=head1 EXAMPLE CMD
See /demo and try:
dew.pl -in schizosaccharomyces_pombe_972h-_2_genes.fasta.renamed.nr98 -lib lib_alias.txt \
-1 Sp.plat.1M.left.fq.bz2 Sp.ds.1M.left.fq.bz2 Sp.hs.1M.left.fq.bz2 Sp.log.1M.left.fq.bz2 \
-2 Sp.plat.1M.right.fq.bz2 Sp.ds.1M.right.fq.bz2 Sp.hs.1M.right.fq.bz2 Sp.log.1M.right.fq.bz2 \
-threads 6 -contextual -correct_bias -over -uid dew_fungal_demo 2>&1 | tee dew_fungal_demo.dew.log
If there is an error, email us a compressed dew_fungal_demo.dew.log
test times:
3 fungal datasets all genes;4 threads;direct I/O
for standalone (new SQL database):
preliminary < 1'; 'each readset: 11.5' for aln (1.5')+express (9')+depth (<1') each readset; kangade=failed?;stats n graphs=18'; R=failed'; TOTAL=55'
real 55m43.152s
user 68m50.390s
sys 7m18.750s
for standalone (existing SQL database):
for context (new SQL database):
preliminary < 1'; 4.5' each readset: for aln (1.5')+express (2')+depth (<1') each readset; kangade=1.5';stats n graphs=18'; R=<1'; TOTAL=33:15min.
real 33m29.720s
user 46m22.530s
sys 1m16.300s
DB: size = dew_webserver.db 992K
for context (existing SQL database):
real 31m51.256s
user 43m29.010s
sys 1m14.580s
DB: size = dew_webserver.db 3.4M : why?
=head1 USAGE
Legend (:s => string; :i => integer; :f => float; {1,} => one or more); shortened names are valid if they are unique (i.e. -1 instead of -1read)
Mandatory:
-infile :s => Reference file of genes
-sequence :s => Instead of a file, provide a single sequence (in FASTA format, with \n as line separator)
-format :s => The reads can be in BAM or FASTQ format. With bowtie2, FASTQ can be .bz2 or .gz
-1read|readset1|r :s{1,}=> Sets of files (one per library). Tested with Phred33 FASTQ format
-2read|readset2 :s{1,} => If provided, do paired-end alignment. Sets of paired right files (synced to readset1). Optional.
File paths if not in $PATH:
-samtools_exec :s => Executable to samtools if not in your path
-bwa_exec :s => Executable to BWA if not in your path
-bowtie2_exec :s => Executable to Bowtie2 if not in your path
-bamtools_exec :s => Executable to bamtools if not in your path
-kangade_exec :s => Executable to biokanga if not in your path
Optional:
* => Recommended
-output or -uid :s => A UID for naming output files. Optional; otherwise one is generated
* -threads :i => Number of CPUs to use for alignment. BWA has no advantage over 4 threads
* -library_name_file :s => A tab-delimited tag-value file giving a friendly alias for each readset library. Needs a header line describing the columns ("file" and "name", in that order). Only include -1read files.
-sample_names :s{,} => A list of names to assign to the samples. Only if a -library_name_file is not provided. Must be same order/number as -readsets
-need_all_readsets => All sets of reads must have alignments against the gene in order for it to be processed. Otherwise, 1+ is sufficient.
-over => Allow overwriting of any files with the same name
-nographs => Do not produce any graphs. Graphs can take a very long time when there are many readsets (e.g. 30+ libraries and 30k+ genes).
-gene_coverage_graph => Only do enough work to get the gene depth/coverage graphs and then exit. No expression boxplots
-no_js_graphs => If producing edgeR graphs, don't produce JavaScript-based graphs.
* -contextual => Complete realignment of all genes in order to run a correction of biases properly. Does not read/store data in the cache
-use_bwa => Use BWA instead of Bowtie2. Much slower.
-prepare_only => Quit after post-processing readsets and writing initial DB
-seeddb :s => Initialize database using this database file (e.g. after -prepare_only)
-kanga => Use kanga instead of bowtie2 for alignments. It requires a LOT of memory (ca. 1 GB per million reads), and paired-end post-processing is much slower than bowtie2
-existing_aln :s{1,} => Use an existing bam file instead of doing a new alignment (must be read name sorted)
-coord_existing => Above existing alignments are co-ordinate sorted rather than read name sorted
-resume => Load existing data from the database and do not reprocess existing readsets (good for adding new readsets, even with -contextual; NB: assumes the same FASTA input, so DO NOT use it if you changed the FASTA reference gene file)
-do_kangade|dokangade => Use kangade to process pairwise libraries. Experimental
-db_use_file => Use a file for the SQL database rather than system memory (much slower, but makes it possible to analyze larger datasets)
-dispersion => For edgeR: if you have replicates, dispersion can be set to auto; otherwise set it to a float such as 0.4 (the default)
-fdr_cutoff => FDR cutoff for DE as a float (0.001 default)
-cpm_cutoff => Counts-per-million cutoff for parsing alignments to edgeR (default 2)
-library_cutoff => Number of libraries that need to reach cpm_cutoff to keep an alignment (default 1)
-binary_min_coverage :f => Minimum gene (length) coverage (0.00 ~ 1.00) for counting a gene as expressed or not (def. 0.3)
-binary_min_reads :i => Minimum number of aligned reads for counting a gene as expressed or not (def. 4)
-never_skip => Always process graph for gene/readset pair, regardless of cutoffs
* -sort_memory_gb :i => Max memory (in Gb) to use for sorting (def 10). Make sure this is less than the available/requested RAM
* -remove_redund => Remove redundant (identical) sequences in -infile
-only_alignments => Stop after all alignments are completed. Good for large data/alignments and HPC environments. Use without -contextual (and use with -nographs).
-cleanup => Delete alignments after successful completion
* -no_pairwise => Do not do pairwise comparisons (kangade and edgeR). Otherwise this can be VERY memory intensive genome-wide with many (20+) libraries (160 GB)
-no_check => When restarting, do not check whether every gene has been stored in the database. Do not use it if you're adding new genes or the database was incomplete (it will crash later), but do use it if you're restarting and have lots of genes.
-verbose => Print to the screen any system commands that are run. Caution: this produces a lot of output on the screen (the commands are kept in the .log file anyway).
-no_pdf => Do not convert gene coverage/expression images to multi-page PDFs. Otherwise, one PDF is written per 500 genes (slow for large genomes and dozens of readsets)
* -sort_tmp :s => Temporary directory to use for sorting files. It will need to be on a fast disk that has sufficient free space (depends on number of -threads)
* -isoforms => Use salmon to correct Illumina sequencing biases and transcript isoform assignments. Increases runtime. Use -contextual for accuracy
* -genomewide => Your input provides all the genes of the genome, i.e. all reads in the readset are expected to align. This influences salmon only. Probably needed for genome-wide analyses whose readsets contain large amounts of non-coding sequence (e.g. rDNA). Also stores data in the database cache
-is_quantseq => Tell the program that you've sequenced only the end(s) of the genes (3' UTR sequencing)
Salmon options:
-extra_options :s => Extra options for e.g. salmon; exclude any initial --dashes from the salmon options (e.g. for QuantSeq give "salmon:minAssignedFrags 5;salmon:noLengthCorrection;salmon:libType SF", including the "quotes"). I highly recommend you include salmon:libType (ISR for typical Illumina; if nothing is given it is set to Auto)
-readset_separation => Expected insert size for Salmon (approximately). Defaults to 500.
NB: if you are using an NFS, you can run -prepare_only on a fast server to prepare the database, then restart with -resume -no_check -over
NB2: I normally use -contextual -isoforms -genomewide -remove_redund -no_pairwise for genome-wide analyses
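The -library_name_file format described above can be sketched as follows (a hypothetical example; the filenames are placeholders borrowed from the demo data, and the aliases are illustrative):

```shell
# Build a minimal lib_alias.txt for -library_name_file.
# Tab-delimited, with a header line naming the "file" and "name" columns,
# and one row per -1read (left) file.
printf 'file\tname\n' > lib_alias.txt
printf 'Sp.plat.1M.left.fq.bz2\tplateau\n' >> lib_alias.txt
printf 'Sp.log.1M.left.fq.bz2\tlogarithmic\n' >> lib_alias.txt
cat lib_alias.txt
```

Any friendly label works as the alias; only the -1read files are listed, never the -2read mates.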
=head1 FAQ on errors
1. I get this error:
$VAR1 = bless( {}, 'DBI::st' );
Cannot find md5 data for Msex2.02821.2
You are trying to reuse an existing directory without using the existing SQLite database. This error happens because the files *.md5, *.checked and *.depth.completed exist in your result directory. Please delete them and restart (rm -f outdir/*md5 outdir/*checked). Do not set -resume
=head1 AUTHORS
Alexie Papanicolaou
CSIRO Ecosystem Sciences
alexie@butterflybase.org
Many thanks/credit to B. Haas (The Broad) for code bundled in Trinity-RNA-Seq
=head1 DISCLAIMER & LICENSE
Copyright 2013-2014 the Commonwealth Scientific and Industrial Research Organization.
Copyright 2015-2016 the Western Sydney University
This software is released under the Mozilla Public License v.2.
It is provided "as is" without warranty of any kind.
You can find the terms and conditions at http://www.mozilla.org/MPL/2.0.
=head1 BUGS & LIMITATIONS
BioPerl graph generation has a memory leak somewhere...
See TODO; otherwise none known so far, but probably lots
=cut
use strict;
use warnings;
use Carp;
use FindBin qw/$RealBin/;
use lib $RealBin. '/PerlLib';
use Pod::Usage;
use Data::Dumper;
use Getopt::Long;
use Digest::MD5 qw(md5_hex);
#use Digest::SHA qw/sha1 md5_hex/;
use Statistics::Descriptive;
use Time::Progress;
use Time::localtime;
use File::Basename;
use File::stat;
use File::Path qw(remove_tree);
use File::Copy;
use Text::CSV_XS;
use Fasta_reader;
use Cwd qw(abs_path getcwd);
use JSON::PP;
use Compress::LZ4;
#threads
use Thread_helper; # threads -1
use threads::shared;
#db
use DBI qw(:sql_types);
use Storable qw(freeze thaw);
#graphics
use SVG;
use GD::SVG; ## we don't want PNG anymore...
use Bio::Graphics;
use Bio::SeqFeatureI;
use Bio::SeqFeature::Generic;
use Bio::SeqFeature::Lite;
$| = 1;
$ENV{'PATH'} = "$RealBin:$RealBin/3rd_party/bin:$RealBin/util:$RealBin/3rd_party/salmon/bin:" . $ENV{'PATH'};
#################################
my (
$debug, $input_reference_file,
@readsets, @housekeeping_ids,
%housekeeping, $reference_file_md5sum,
@reference_sequence_list, %library_aliases,
$contextual_alignment, %library_metadata,
$extra_genes, %paired_readset_lookup,
$perform_bias_correction, %fold_changes,
%skipped_references, %user_alias,
%groups_readsets, $cleanup,
$md5_aliases_file, $remove_redund,
$kangade_exec, $kangax_exec,
$existing_bam_is_coordsorted,
$kanga_exec, %gene_aliases_threads, $do_galaxy_cleanup, @sample_names
);
my $db_hostname = 'localhost';
my $sqlite3_exec = &check_program('sqlite3');
my (
$ps2pdf_exec, $inkscape_exec,
$convert_imagemagick_exec, $pdfcrop_exec, $biokanga_exec,
$samtools_exec, $bowtie2_exec, $bwa_exec,
$salmon_exec, $bamtools_exec, $sort_exec
)
= &check_program_optional(
'gs', 'inkscape', 'convert',
'pdfcrop', 'biokanga', 'samtools', 'bowtie2',
'bwa', 'salmon', 'bedtools', 'sort'
);
$convert_imagemagick_exec .= " -limit disk 5gb " if $convert_imagemagick_exec;
if ($ps2pdf_exec) {
$ps2pdf_exec .=
" -sPAPERSIZE=a0 -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -q -sOutputFile=";
}
# TODO If there exists a sizeable number of housekeeping transcripts that should not be DE, then the dispersion could be estimated from them.
my $readset_separation = 500;
my $fragment_max_length = 800;
my $edgeR_dispersion = 0.4;
my $minCPM = 2;
my $minLibs = 1;
my $fdr_pval_cutoff = 0.001;
my $tree_clusters = 10;
my $read_format = 'fastq';
my $threads = 3;
my $min_housekeeping_genes = 5;
my $process_cutoff = 1;
my $binary_min_coverage = 0.3;
my $binary_min_reads = 4;
my $sort_memory = 10;
my $salmon_min_bias = 2 * 10**6; # 2 million, as we often have few reads (NB: 2*10^6 is XOR in Perl and evaluates to 18)
my (
$uid, $lib_alias_file, $demand_all_readsets,
$use_kanga, @use_existing_bam, $overwrite_results,
@readsets2, $db_file, $no_checks,
$db_use_file, $dbname, $no_graphs,
$user_ref_sequence, $do_kangade, $main_output_dir,
$use_bwa, $prepare_input_only, @sample_overlays,
$initial_db, $resume, $gene_coverage_graphs_only,
%binary_table, $never_skip,
$extra_options, $genomewide, $only_alignments,
$options_single, $no_js_graphs, $do_png,
$no_pdf, $no_pairwise, $verbose, $is_quantseq
);
my $cwd = getcwd;
my $given_cmd = $0 . " " . join( " ", @ARGV );
my $sort_tmp = $cwd.'/sort_tmp';
pod2usage $! unless GetOptions(
'sort_tmp:s' => \$sort_tmp,
'do_galaxy_cleanup' => \$do_galaxy_cleanup, # delete EVERYTHING EXCEPT PDF AND TSV and sqlite
'nocheck|no_check' => \$no_checks, #TMP for Debug; should not be needed
'infile:s' => \$input_reference_file,
'extra_genes:s' => \$extra_genes,
'sequence:s' => \$user_ref_sequence,
'format:s' => \$read_format,
'1read|readset1|r:s{1,}' => \@readsets, # if bowtie2, it can be /bz2
'2read|readset2:s{1,}' => \@readsets2, # if bowtie2, it can be /bz2
'samtools_exec:s' => \$samtools_exec,
'bowtie2_exec:s' => \$bowtie2_exec,
'bamtools_exec:s' => \$bamtools_exec, # SE bowtie2 only
'bwa_exec:s' => \$bwa_exec, # modified bwa
'biokanga_exec:s' => \$biokanga_exec,
'out|output|uid:s' => \$uid,
'outdir:s' => \$main_output_dir,
'threads|cpus:i' => \$threads,
'alias|library_name_file:s' => \$lib_alias_file,
'sample_names:s{,}' => \@sample_names,
'prepare_only' => \$prepare_input_only,
'need_all_readsets' => \$demand_all_readsets,
'over' => \$overwrite_results,
'no_graphs|nographs' => \$no_graphs,
'hostname:s' => \$db_hostname,
'dbname:s' => \$dbname,
'seeddb:s' => \$initial_db,
'contextual' => \$contextual_alignment,
'correct_bias|isoforms' => \$perform_bias_correction,
'use_bwa' => \$use_bwa,
'kanga' => \$use_kanga,
'existing_aln:s{1,}' => \@use_existing_bam,
'coord_existing' => \$existing_bam_is_coordsorted,
'debug:i' => \$debug, # should not be used by users
'verbose' => \$verbose,
'resume' => \$resume,
'do_kangade|dokangade' => \$do_kangade,
'db_use_file' => \$db_use_file,
'gene_coverage_graphs_only' => \$gene_coverage_graphs_only,
'dispersion:s' => \$edgeR_dispersion, #auto for bio.reps
'fdr_cutoff:f' => \$fdr_pval_cutoff,
'cpm_cutoff:i' => \$minCPM,
'library_cutoff|cutoff_library:i' => \$minLibs,
'binary_min_coverage:f' => \$binary_min_coverage,
'binary_min_reads:i' => \$binary_min_reads,
'never_skip' => \$never_skip,
'memory_gb|sort_memory_gb:i' => \$sort_memory,
'remove_redund' => \$remove_redund,
'readset_separation:i' => \$readset_separation,
'extra_options:s' => \$extra_options,
'genomewide' => \$genomewide,
'only_alignments' => \$only_alignments,
'cleanup' => \$cleanup,
'no_js_graphs' => \$no_js_graphs,
'png_graphs' => \$do_png,
'nopairwise|no_pairwise' => \$no_pairwise,
'options_single:s' => \$options_single,
'nopdf|no_pdf' => \$no_pdf,
'salmon_min_bias:i' => \$salmon_min_bias,
'is_quantseq' => \$is_quantseq,
);
die
"-binary_min_coverage has to be between 0 and 1 (not $binary_min_coverage)\n"
unless $binary_min_coverage > 0 && $binary_min_coverage <= 1;
$threads = 2 if $threads < 2; # no idea what happens with a single thread
my ($bunzip2_exec) = &check_program_optional('pbzip2');
my ($gunzip_exec) = &check_program_optional('pigz');
my $bunzip_threads = $threads <= 6 ? $threads : 6;
$bunzip2_exec .= " -p$bunzip_threads " if $bunzip2_exec;
($bunzip2_exec) = &check_program('bzip2') if !$bunzip2_exec;
#MEMORY & threads for sorts
die "Please provide memory as an integer of gigabytes, e.g. -sort_memory_gb 10\n" unless $sort_memory && $sort_memory=~/^\d+$/ && $sort_memory >=1;
$sort_memory = $sort_memory * 1024 * 1024 * 1024;
my $sort_memory_exec = $sort_memory;
my $sort_memory_sub = int($sort_memory/3).'b';
my $sort_version = `sort --version|head -1`;
if ($sort_version=~/^sort \(GNU coreutils\) (\d+)/){
my $sort_threads = $threads >= 4 ? 4 : 2; #same as alignment
# for pipelines
my $sort_threads_sub = int($threads / 3);
$sort_threads_sub = 5 if $sort_threads_sub > 5; # cap at 5
$sort_threads_sub = 1 if $sort_threads_sub < 1; # --parallel=0 is invalid
$sort_memory_exec .= " --parallel=$sort_threads" if ($1 && $1 >= 8);
$sort_memory_sub .= " --parallel=$sort_threads_sub" if ($1 && $1 >= 8);
}
mkdir($sort_tmp) if !-d $sort_tmp;
# parallelise alignments; each alignment uses 4 threads, or 2 if less available
my $alignment_threads = $threads >= 4 ? 4 : 2;
my $alignment_parallel = int($threads/$alignment_threads) - 1;
my $alignment_helper_threads = $alignment_parallel > 0
? $alignment_parallel : 1;
my $alignment_thread_helper = new Thread_helper($alignment_helper_threads);
my $samtools_threads = $threads >= 4 ? 4 : 2; # same as alignment
my $sam_sort_memory = int($sort_memory / $samtools_threads);
my ( $extra_salmon, %single_end_readsets, $salmon_options_single );
my $kanga_threads = $threads <= 8 ? $threads : 10;
my $R_threads = $threads <= 8 ? $threads : 10;
$uid = &get_uid_time('dew') unless $uid;
$dbname = $uid . '_transcriptome.sqlite.db' unless $dbname;
my $result_dir =
$main_output_dir
? $cwd . "/" . $main_output_dir . '/'
: $cwd . "/" . $uid . '_results/';
my $edgeR_dir = $result_dir . '/edgeR/';
my $counts_expression_level_matrix =
"$result_dir/$uid.counts.expression_levels.matrix";
my $total_timer = new Time::Progress;
&perform_checks_preliminary();
my (
$dbh,
$get_from_seqdb,
$get_hash_from_seqdb,
$add_seqhash_to_seqdb,
$init_expression_statistics,
$add_readset,
$update_readset,
$update_readset_size,
$add_depth_data,
$check_depth_data,
$check_depth_readset,
$delete_depth_data,
$get_readset,
$get_expression_statistics,
$check_expression_statistics,
$update_expression_statistics,
$set_housekeeping,
$set_housekeeping_fold,
$check_hash_from_seqdb,
$add_sequence_alias,
$get_sequence_alias,
$add_to_kangade_analysis,
$check_kangade_analysis,
$delete_kangade_analysis,
$start_fold_change,
$check_fold_change,
$add_kangade_fold_change,
$add_salmon_fold_change,
$add_raw_fold_change,
$get_kangade_fold_change,
$get_salmon_fold_change,
$get_raw_fold_change,
$update_salmon_expression_statistics,
$update_kangade_expression_statistics,
$update_rpkm_expression_statistics,
$get_readset_filename
) = &sqlite_init();
my $file_for_alignment = &prepare_input_data();
# this stores all the expression data for creating the coverage graphs without database access
my %expression_coverage_stats_hash;
if ( !-s $counts_expression_level_matrix
|| ( $contextual_alignment && $debug && $debug >= 2 ) )
{
&starts_alignments($file_for_alignment);
if ($only_alignments) {
print "User stop requested after alignments are completed.\n";
exit(0);
}
&perform_stats();
&process_expression_level();
&sqlite_backup() unless $db_use_file;
# from now on, we don't need to write to the database
close STATS;
close STATS_RATIO;
&print_binary_table();
}
else {
print "Expression data already exists ($counts_expression_level_matrix). Will not reprocess. Acquiring data from database...\n";
# we need to get all the expression coverage data
&get_all_expression_data();
&print_binary_table();
}
if ($only_alignments) {
print "User stop requested after alignments are completed.\n";
&process_completion();
}
# get TMM normalized expression. this is relatively fast.
my $check_lines = `wc -l < $counts_expression_level_matrix`;
confess "No expression data available (empty $counts_expression_level_matrix)!\n"
unless ( -s $counts_expression_level_matrix && $check_lines > 1 );
my ( $effective_expression_level_matrix_TMM_tpm )
= &perform_TMM_normalization_edgeR($counts_expression_level_matrix);
$check_lines = `wc -l < $effective_expression_level_matrix_TMM_tpm `
if $effective_expression_level_matrix_TMM_tpm;
confess
"edgeR did not produce any TMM output ($effective_expression_level_matrix_TMM_tpm)!\n"
unless ( -s $effective_expression_level_matrix_TMM_tpm && $check_lines > 1 );
# this needs the database in order to do pairwise comparisons...
if ( !$no_pairwise ) {
print
"\nPreparing edgeR differential expression data and graphs, This may take a long time if you have a large number of groups (use -no_pairwise to not do that in the future)\n";
&perform_edgeR_pairwise();
&prepare_edgeR_graphs($effective_expression_level_matrix_TMM_tpm);
my ( $html2d, $html3d ) =
&prepare_scatter_for_canvas($effective_expression_level_matrix_TMM_tpm);
}
else {
print "User requested no pairwise edgeR comparisons...\n";
}
## from now on, we have no DB access, so we can multithread.
&sqlite_destroy();
my $expression_coverage_stats_hashref =
shared_clone( \%expression_coverage_stats_hash );
undef(%expression_coverage_stats_hash);
&perform_coverage_graphs($expression_coverage_stats_hashref);
undef($expression_coverage_stats_hashref);
if ($gene_coverage_graphs_only) {
print "User stop requested after gene coverage graphs were completed.\n";
&process_completion();
}
unless ($no_graphs){
&process_edgeR_graphs_overview( $effective_expression_level_matrix_TMM_tpm,
$result_dir . '/gene_expression_tpm/','TPM' );
}
&process_completion();
#########################################################################################################
sub process_cmd {
my ( $cmd, $dir ) = @_;
print "CMD: $cmd\n" if $debug || $verbose;
print LOG "CMD: $cmd\n";
chdir($dir) if $dir;
my $ret = system($cmd);
if ( $ret && $ret != 256 ) {
confess "Error, cmd died with ret $ret\n";
}
chdir($cwd) if $dir;
return;
}
sub mytime() {
my @mabbr =
qw(January February March April May June July August September October November December);
my @wabbr = qw(Sunday Monday Tuesday Wednesday Thursday Friday Saturday);
my $sec = localtime->sec() < 10 ? '0' . localtime->sec() : localtime->sec();
my $min = localtime->min() < 10 ? '0' . localtime->min() : localtime->min();
my $hour =
localtime->hour() < 10 ? '0' . localtime->hour() : localtime->hour();
my $wday = $wabbr[ localtime->wday ];
my $mday = localtime->mday;
my $mon = $mabbr[ localtime->mon ];
my $year = localtime->year() + 1900;
return "$wday, $mon $mday, $year: $hour:$min:$sec\t";
}
sub sqlite_backup() {
if ($debug) {
warn "DEBUG mode: Will not backup database\n";
return;
}
my $keep = shift;
my $backup_db_file = shift;
$backup_db_file = $db_use_file ? $db_file . '.backup' : $db_file
if !$backup_db_file;
print "\nCheckpointing database ($backup_db_file). DO NOT halt...\r";
$dbh->do("VACUUM");
$dbh->do("PRAGMA shrink_memory");
# ensure no problems if it crashes while backing up
$dbh->sqlite_backup_to_file( $backup_db_file . ".tmp" );
if ( -s $backup_db_file . ".tmp" ) {
unlink( $backup_db_file . '.old' );
rename( $backup_db_file, $backup_db_file . '.old' ) if $keep;
unlink($backup_db_file);
rename( $backup_db_file . ".tmp", $backup_db_file );
print
"Checkpointing database ($backup_db_file). Done: SQL database checkpointed!\n";
}
else {
print
"Checkpointing database ($backup_db_file). Error: Checkpointing failed or there were no data to write out!\n";
}
}
sub sqlite_init() {
$db_file = $debug ? $cwd . "/$dbname.debug" : $cwd . "/$dbname";
my $active_db_file = $db_use_file ? $db_file . '.active' : ':memory:';
if ( !-s $db_file && $initial_db && $initial_db ne $db_file ) {
die "You asked for a seed database ($initial_db) but it doesn't exist!\n"
unless -s $initial_db;
warn "Will use $initial_db as seed database\n";
copy( $initial_db, $db_file );
}
my $db_existed = -s $db_file ? 1 : int(0);
my $dbh;
$dbh = DBI->connect(
"dbi:SQLite:" . $active_db_file,
"", "",
{
sqlite_see_if_its_a_number => 1,
AutoCommit => 1,
RaiseError => 1
}
);
my $db_version = $dbh->{sqlite_version};
confess "Cannot create SQLite database" unless $db_version;
print "\tUsing SQLite DB v$db_version";
if ($db_use_file) {
print " (with a file)\n";
}
else {
print " (in memory)\n";
}
if ($db_existed) {
$dbh->sqlite_backup_from_file($db_file);
my $res = `$sqlite3_exec $db_file "PRAGMA integrity_check"`;
die "Database integrity check failed for $db_file. Delete and try again\n" if ($res && $res!~/^ok/);
}
else {
if ($resume || $no_checks){
print
"Warning! Database $db_file does not exist or is empty; will recreate it and not resume.\n";
undef($resume);
undef($no_checks);
}
print "\tCreating database...\n";
$dbh->do("PRAGMA encoding = 'UTF-8'");
#$dbh->do("CREATE TABLE file_alignments(file_md5sum char(32),readset_id integer,bam blob)" );
#$dbh->do("CREATE UNIQUE INDEX file_alignments_idx1 ON file_alignments(file_md5sum,readset_id)" );
$dbh->do("CREATE TABLE sequence_data(seq_md5hash char(32) primary key,seq_length integer,housekeeping integer DEFAULT 0)" );
$dbh->do("CREATE TABLE sequence_aliases (seq_md5hash char(32), alias text)");
$dbh->do("CREATE INDEX sequence_aliases_idx ON sequence_aliases(seq_md5hash)");
$dbh->do("CREATE TABLE readsets (readset_id INTEGER PRIMARY KEY,readset_file varchar(255),total_reads integer, is_paired varchar, alias varchar(255), readlength_median integer, ctime timestamp)" );
$dbh->do("CREATE UNIQUE INDEX readsets_idx1 ON readsets(readset_file)");
$dbh->do("CREATE TABLE expression_statistics (seq_md5hash char(32), readset_id integer, gene_length_coverage REAL, gene_length_coverage_mean REAL, no_coverage integer, rpkm integer, aligned_reads_per_base REAL, gene_length_coverage_median integer, total_aligned_reads integer, gene_length_coverage_max integer, gene_length_coverage_sd REAL, salmon_eff_counts REAL, salmon_tpm REAL, kangade_counts integer)" );
$dbh->do("CREATE UNIQUE INDEX expression_statistics_idx1 ON expression_statistics(seq_md5hash,readset_id)" );
#tmp for uncached
$dbh->do("CREATE TABLE expression_statistics_tmp (seq_md5hash char(32), readset_id integer, gene_length_coverage REAL, gene_length_coverage_mean REAL, no_coverage integer, rpkm integer, aligned_reads_per_base REAL, gene_length_coverage_median integer, total_aligned_reads integer, gene_length_coverage_max integer, gene_length_coverage_sd REAL, salmon_eff_counts REAL, salmon_tpm REAL , kangade_counts integer)" );
$dbh->do("CREATE UNIQUE INDEX expression_statistics_tmp_idx1 ON expression_statistics_tmp(seq_md5hash,readset_id)" );
# r-tree?
$dbh->do("CREATE TABLE depth (seq_md5hash char(32), readset_id integer, data blob)");
$dbh->do("CREATE INDEX depth_idx1 ON depth(seq_md5hash,readset_id)");
$dbh->do("CREATE TABLE depth_tmp (seq_md5hash char(32), readset_id integer, data blob)" );
$dbh->do("CREATE INDEX depth_tmp_idx1 ON depth_tmp(seq_md5hash,readset_id)");
$dbh->do("CREATE TABLE kangade_analysis (seq_md5hash char(32), readset1_id INTEGER, readset2_id INTEGER,Classification INTEGER,Score INTEGER,DECntsScore INTEGER,PearsonScore INTEGER,CtrlUniqueLoci INTEGER,"
. "ExprUniqueLoci INTEGER,CtrlExprLociRatio INTEGER,PValueMedian REAL,PValueLow95 REAL,PValueHi95 REAL,TotCtrlCnts INTEGER,TotExprCnts INTEGER,TotCtrlExprCnts INTEGER,ObsFoldChange REAL,FoldMedian REAL,"
. "FoldLow95 REAL,FoldHi95 REAL,ObsPearson REAL,PearsonMedian REAL,PearsonLow95 REAL,PearsonHi95 REAL)"
);
$dbh->do("CREATE UNIQUE INDEX kangade_analysis_idx1 ON kangade_analysis(seq_md5hash,readset1_id,readset2_id)" );
$dbh->do("CREATE TABLE fold_changes (seq_md5hash char(32), readset1_id INTEGER, readset2_id INTEGER, raw_rpkm REAL, salmon_effective_counts REAL,salmon_tpm REAL , kangade_observed REAL,housekeeping integer DEFAULT 0)" );
$dbh->do("CREATE UNIQUE INDEX fold_changes_idx1 ON fold_changes(seq_md5hash,readset1_id,readset2_id)" );
}
### PRAGMAS for speed
$dbh->do("PRAGMA journal_mode = MEMORY");
$dbh->do("PRAGMA temp_store = 2 "); # memory
$dbh->do("PRAGMA cache_size = -448000 "); # ~450 MB of RAM for the sqlite page cache (negative value = KiB)
$dbh->do("PRAGMA synchronous = OFF") if $db_use_file;
$dbh->do("PRAGMA quick_check");
### SQL queries
my $start_fold_change = $dbh->prepare(
"INSERT INTO fold_changes (seq_md5hash,readset1_id,readset2_id) VALUES (?,(SELECT readset_id from readsets where readset_file=?),(SELECT readset_id from readsets where readset_file=?))"
);
my $check_fold_change = $dbh->prepare(
"SELECT seq_md5hash,readset1_id,readset2_id FROM fold_changes WHERE seq_md5hash=? AND readset1_id=(SELECT readset_id from readsets WHERE readset_file=?) AND readset2_id=(SELECT readset_id from readsets WHERE readset_file=?)"
);
my $add_kangade_fold_change = $dbh->prepare(
"UPDATE fold_changes SET kangade_observed=? WHERE seq_md5hash=? AND readset1_id=(SELECT readset_id from readsets WHERE readset_file=?) AND readset2_id=(SELECT readset_id from readsets WHERE readset_file=?)"
);
my $add_salmon_fold_change = $dbh->prepare(
"UPDATE fold_changes SET salmon_effective_counts=?,salmon_tpm=? WHERE seq_md5hash=? AND readset1_id=(SELECT readset_id from readsets where readset_file=?) AND readset2_id=(SELECT readset_id from readsets where readset_file=?)"
);
my $add_raw_fold_change = $dbh->prepare(
"UPDATE fold_changes SET raw_rpkm=? WHERE seq_md5hash=? AND readset1_id=(SELECT readset_id from readsets where readset_file=?) AND readset2_id=(SELECT readset_id from readsets where readset_file=?)"
);
my $get_kangade_fold_change = $dbh->prepare(
"SELECT kangade_observed FROM fold_changes WHERE seq_md5hash=? AND readset1_id=(SELECT readset_id from readsets where readset_file=?) AND readset2_id=(SELECT readset_id from readsets where readset_file=?)"
);
my $get_salmon_fold_change = $dbh->prepare(
"SELECT salmon_tpm, salmon_effective_counts FROM fold_changes WHERE seq_md5hash=? AND readset1_id=(SELECT readset_id from readsets where readset_file=?) AND readset2_id=(SELECT readset_id from readsets where readset_file=?)"
);
my $get_raw_fold_change = $dbh->prepare(
"SELECT raw_rpkm FROM fold_changes WHERE seq_md5hash=? AND readset1_id=(SELECT readset_id from readsets where readset_file=?) AND readset2_id=(SELECT readset_id from readsets where readset_file=?)"
);
my $add_to_kangade_analysis = $dbh->prepare(
"INSERT INTO kangade_analysis (seq_md5hash, readset1_id, readset2_id,Classification,Score,DECntsScore,PearsonScore,CtrlUniqueLoci,ExprUniqueLoci,CtrlExprLociRatio,PValueMedian,PValueLow95,PValueHi95,TotCtrlCnts,TotExprCnts,TotCtrlExprCnts,ObsFoldChange,FoldMedian,FoldLow95,FoldHi95,ObsPearson,PearsonMedian,PearsonLow95,PearsonHi95) "
. "VALUES (?,(SELECT readset_id from readsets where readset_file=?),(SELECT readset_id from readsets where readset_file=?),?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
);
my $check_kangade_analysis = $dbh->prepare(
"SELECT * FROM kangade_analysis WHERE seq_md5hash=? AND readset1_id=(SELECT readset_id from readsets where readset_file=?) AND readset2_id=(SELECT readset_id from readsets where readset_file=?) "
);
my $delete_kangade_analysis = $dbh->prepare(
"DELETE FROM kangade_analysis WHERE seq_md5hash=? AND readset1_id=(SELECT readset_id from readsets where readset_file=?) AND readset2_id=(SELECT readset_id from readsets where readset_file=?)"
);
#my $readset_id_to_filename = $dbh->prepare_cached("SELECT readset_file FROM readsets WHERE readset_id = ? ");
#my $readset_filename_to_id = $dbh->prepare_cached("SELECT readset_id FROM readsets WHERE readset_file = ? ");
#my $add_to_db = $dbh->prepare("INSERT INTO file_alignments (file_md5sum,readset_id,bam) VALUES (?, (SELECT readset_id from readsets where readset_file=?), ?)" );
#my $delete_from_db = $dbh->prepare("DELETE FROM file_alignments WHERE file_md5sum=? and readset_id=(SELECT readset_id from readsets where readset_file=?)" );
#my $check_db = $dbh->prepare("SELECT readset_id FROM file_alignments WHERE file_md5sum=? AND readset_id=(SELECT readset_id from readsets where readset_file=?)" );
#my $get_from_aligndb = $dbh->prepare("SELECT bam FROM file_alignments WHERE file_md5sum=? AND readset_id=(SELECT readset_id from readsets where readset_file=?)" );
my $get_from_seqdb = $dbh->prepare_cached(
"SELECT seq_length FROM sequence_data WHERE seq_md5hash=?");
my $get_hash_from_seqdb =
$dbh->prepare("SELECT seq_md5hash FROM sequence_aliases WHERE alias=?");
my $check_hash_from_seqdb =
$dbh->prepare("SELECT seq_md5hash FROM sequence_data WHERE seq_md5hash=?");
my $add_seqhash_to_seqdb = $dbh->prepare(
"INSERT INTO sequence_data (seq_md5hash,seq_length) VALUES (?,?)");
my $depth_table = 'depth';
$depth_table .= '_tmp' if $contextual_alignment && !$genomewide;
my $check_depth_data = $dbh->prepare(
"SELECT data from $depth_table WHERE seq_md5hash=? AND readset_id=(SELECT readset_id from readsets where readset_file=?)"
);
my $check_depth_readset = $dbh->prepare(
"SELECT count(*) from $depth_table WHERE readset_id=(SELECT readset_id from readsets where readset_file=?)"
);
my $delete_depth_data = $dbh->prepare(
"DELETE from $depth_table WHERE seq_md5hash=? AND readset_id=(SELECT readset_id from readsets where readset_file=?) "
);
my $add_depth_data = $dbh->prepare(
"INSERT INTO $depth_table (seq_md5hash,readset_id,data) VALUES (?,(SELECT readset_id from readsets where readset_file=?),?)"
);
my $expression_statistics_table = 'expression_statistics';
$expression_statistics_table .= '_tmp'
if $contextual_alignment && !$genomewide;
my $init_expression_statistics = $dbh->prepare(
"INSERT INTO $expression_statistics_table (seq_md5hash, readset_id) VALUES (?,(SELECT readset_id from readsets where readset_file=?)) "
);
my $update_rpkm_expression_statistics = $dbh->prepare(
"UPDATE $expression_statistics_table set rpkm=?, total_aligned_reads=? WHERE seq_md5hash=? AND readset_id=(SELECT readset_id from readsets where readset_file=?)"
);
my $update_expression_statistics = $dbh->prepare(
"UPDATE $expression_statistics_table set gene_length_coverage=?, gene_length_coverage_mean=?, no_coverage=?, aligned_reads_per_base=?, gene_length_coverage_median=?, gene_length_coverage_max=?, gene_length_coverage_sd=? WHERE seq_md5hash=? AND readset_id=(SELECT readset_id from readsets where readset_file=?)"
);
my $update_salmon_expression_statistics = $dbh->prepare(
"UPDATE $expression_statistics_table set salmon_eff_counts=?,salmon_tpm=? WHERE seq_md5hash=? AND readset_id=(SELECT readset_id from readsets where readset_file=?)"
);
my $update_kangade_expression_statistics = $dbh->prepare(
"UPDATE $expression_statistics_table set kangade_counts=? WHERE seq_md5hash=? AND readset_id=(SELECT readset_id from readsets where readset_file=?)"
);
my $get_expression_statistics = $dbh->prepare(
"SELECT * FROM $expression_statistics_table WHERE seq_md5hash=? AND readset_id=(SELECT readset_id from readsets where readset_file=?)"
);
my $check_expression_statistics = $dbh->prepare(
"SELECT seq_md5hash FROM $expression_statistics_table WHERE seq_md5hash=? AND readset_id=(SELECT readset_id from readsets where readset_file=?)"
);
my $set_housekeeping =
$dbh->prepare("UPDATE sequence_data set housekeeping=? WHERE seq_md5hash=?");
my $set_housekeeping_fold = $dbh->prepare(
"UPDATE fold_changes set housekeeping=? WHERE seq_md5hash=? AND readset1_id=? AND readset2_id=? "
);
my $get_readset = $dbh->prepare("SELECT * from readsets WHERE readset_file=?");
my $get_readset_filename =
$dbh->prepare("SELECT readset_file from readsets WHERE alias=?");
my $add_readset = $dbh->prepare(
"INSERT INTO readsets (readset_file,is_paired,alias,total_reads,readlength_median,ctime) VALUES (?,?,?,?,?,?)"
);
my $update_readset = $dbh->prepare( "UPDATE readsets SET is_paired=?, alias=? WHERE readset_file=?");
my $update_readset_size = $dbh->prepare( "UPDATE readsets SET total_reads=? WHERE readset_file=?");
my $add_sequence_alias = $dbh->prepare(
"INSERT INTO sequence_aliases (seq_md5hash,alias) VALUES (?,?)");
my $get_sequence_alias =
$dbh->prepare("SELECT alias from sequence_aliases WHERE seq_md5hash=?");
return (
$dbh,
$get_from_seqdb,
$get_hash_from_seqdb,
$add_seqhash_to_seqdb,
$init_expression_statistics,
$add_readset,
$update_readset,
$update_readset_size,
$add_depth_data,
$check_depth_data,
$check_depth_readset,
$delete_depth_data,
$get_readset,
$get_expression_statistics,
$check_expression_statistics,
$update_expression_statistics,
$set_housekeeping,
$set_housekeeping_fold,
$check_hash_from_seqdb,
$add_sequence_alias,
$get_sequence_alias,
$add_to_kangade_analysis,
$check_kangade_analysis,
$delete_kangade_analysis,
$start_fold_change,
$check_fold_change,
$add_kangade_fold_change,
$add_salmon_fold_change,
$add_raw_fold_change,
$get_kangade_fold_change,
$get_salmon_fold_change,
$get_raw_fold_change,
$update_salmon_expression_statistics,
$update_kangade_expression_statistics,
$update_rpkm_expression_statistics,
$get_readset_filename
);
}
sub sqlite_set_get_as_housekeeping() {
my $seq_md5hash = shift;
my $set = shift;
my $readset1 = shift;
my $readset2 = shift;
if ( $set && $set == 2 ) {
$set_housekeeping->execute( 1, $seq_md5hash );
}
if ( $readset1 && $readset2 ) {
$set_housekeeping_fold->execute( 1, $seq_md5hash, $readset1, $readset2 );
}
}
sub sqlite_destroy() {
return unless $dbh;
print &mytime() . "Closing SQL connections and backing up\n";
$get_readset_filename->finish();
$get_from_seqdb->finish();
$get_hash_from_seqdb->finish();
$add_seqhash_to_seqdb->finish();
$check_hash_from_seqdb->finish();
$check_depth_data->finish();
$check_depth_readset->finish();
$delete_depth_data->finish();
$add_depth_data->finish();
$set_housekeeping->finish();
$set_housekeeping_fold->finish();
$init_expression_statistics->finish();
$get_expression_statistics->finish();
$check_expression_statistics->finish();
$get_readset->finish();
$update_expression_statistics->finish();
$update_salmon_expression_statistics->finish();
$update_kangade_expression_statistics->finish();
$add_readset->finish();
$update_readset->finish();
$update_readset_size->finish();
$get_sequence_alias->finish();
$add_sequence_alias->finish();
$add_to_kangade_analysis->finish();
$check_kangade_analysis->finish();
$delete_kangade_analysis->finish();
$start_fold_change->finish();
$check_fold_change->finish();
$add_kangade_fold_change->finish();
$add_salmon_fold_change->finish();
$add_raw_fold_change->finish();
$get_kangade_fold_change->finish();
$get_salmon_fold_change->finish();
$update_rpkm_expression_statistics->finish();
$get_raw_fold_change->finish();
if ( $contextual_alignment && !$genomewide && !$debug ) {
print "Contextual non-genomewide alignment was requested; deleting the temporary tables from the database\n";
#empty temporary tables and re-create their schema
$dbh->do("DROP TABLE expression_statistics_tmp");
$dbh->do(
"CREATE TABLE expression_statistics_tmp (seq_md5hash char(32), readset_id integer, gene_length_coverage REAL, gene_length_coverage_mean REAL, no_coverage integer, rpkm integer, aligned_reads_per_base REAL, gene_length_coverage_median integer, total_aligned_reads integer, gene_length_coverage_max integer, gene_length_coverage_sd REAL, salmon_eff_counts REAL, salmon_tpm REAL , kangade_counts integer)"
);
$dbh->do(
"CREATE UNIQUE INDEX expression_statistics_tmp_idx1 ON expression_statistics_tmp(seq_md5hash,readset_id)"
);
$dbh->do("DROP TABLE depth_tmp");
$dbh->do(
"CREATE TABLE depth_tmp (seq_md5hash char(32), readset_id integer, data blob)"
);
$dbh->do("CREATE INDEX depth_tmp_idx1 ON depth_tmp(seq_md5hash,readset_id)");
$dbh->do("VACUUM");
}
#&sqlite_backup() unless $db_use_file;
my $backup_db_file = $db_use_file ? $db_file . '.backup' : $db_file;
print "\nCheckpointing database ($backup_db_file). DO NOT halt...\r";
# back up to a .tmp file first so a crash mid-backup cannot corrupt the existing backup
$dbh->sqlite_backup_to_file( $backup_db_file . ".tmp" );
if ( -s $backup_db_file . ".tmp" ) {
unlink( $backup_db_file . '.old' );
rename( $backup_db_file, $backup_db_file . '.old' ); # keep the previous backup as .old
rename( $backup_db_file . ".tmp", $backup_db_file ); # promote the freshly written backup
print
"Checkpointing database ($backup_db_file). Done: SQL database checkpointed!\n";
}
$dbh->disconnect();
undef($dbh);
unlink( $db_file . '.active' ) if $db_use_file;
}
sub not_used1() {
####
## How to store the entire alignment in the database. Not used.
#sub sqlite_check_align_old($) {
# my $readset = shift;
# $check_db->execute( $reference_file_md5sum, $readset );
# my $result = $check_db->fetchrow_arrayref();
# $result = $result ? int(1) : int(0);
# print "Readset $readset was already in DB. Not re-aligning.\n" if $result;
# return $result;
#}
#sub sqlite_add_align_old($) {
# my $readset = shift;
# my $bamfile = shift;
# my $bamcontent = `cat $bamfile`;
# $add_to_db->bind_param( 1, $reference_file_md5sum );
# $add_to_db->bind_param( 2, $readset );
# $add_to_db->bind_param( 3, $bamcontent, SQL_BLOB );
# $add_to_db->execute();
# $check_db->execute( $reference_file_md5sum, $readset );
# my $result = $check_db->fetchrow_arrayref();
# $result = $result ? int(1) : int(0);
# print "Readset $readset added to DB\n" if $result;
# return $result;
#}
#sub sqlite_get_align_old($) {
# my $readset = shift;
# my $bam = shift;
# $get_from_aligndb->execute( $reference_file_md5sum, $readset );
# my $row = $get_from_aligndb->fetchrow_arrayref();
# my $result = $row ? int(1) : int(0);
# open( BAM, ">$bam" );
# print BAM $row->[0];
# close BAM;
# &process_cmd("$samtools_exec index $bam")
# unless -s $bam . '.bai' && ( -s $bam . '.bai' ) > 200;
# return $result;
#}
}
sub sqlite_get_seq_length($) {
my $md5sum = shift;
$get_from_seqdb->execute($md5sum);
my $row = $get_from_seqdb->fetchrow_arrayref();
return $row ? $row->[0] : undef; # guard against sequences missing from sequence_data
}
sub sqlite_check_kangade_analysis($$$) {
my ( $md5_sum, $read1, $read2 ) = @_;
$check_kangade_analysis->execute( $md5_sum, $read1, $read2 );
my $res = $check_kangade_analysis->fetchrow_hashref();
return $res;
}
sub sqlite_get_md5($) {
my $id = shift;
$get_hash_from_seqdb->execute($id);
my $row = $get_hash_from_seqdb->fetchrow_arrayref();
unless ($row) {
confess "Cannot find md5 data for $id\n";
}
return $row->[0];
}
sub sqlite_add_seq_md5($$$) {
my $id = shift;
my $md5sum = shift;
my $seq_length = shift;
$check_hash_from_seqdb->execute($md5sum);
my $check = $check_hash_from_seqdb->fetchrow_arrayref();
$add_seqhash_to_seqdb->execute( $md5sum, $seq_length ) if !$check;
$get_sequence_alias->execute($md5sum);
my %existing_ids;
while ( my $result = $get_sequence_alias->fetchrow_arrayref() ) {
$existing_ids{ $result->[0] } = 1;
}
$add_sequence_alias->execute( $md5sum, $id ) unless $existing_ids{$id};
$gene_aliases_threads{$md5sum}{$id} = 1;
}
sub sqlite_get_seq_aliases($) {
my $md5sum = shift;
my %existing_ids;
$get_sequence_alias->execute($md5sum);
while ( my $result = $get_sequence_alias->fetchrow_arrayref() ) {
$existing_ids{ $result->[0] } = 1;
}
return keys %existing_ids;
}
sub sqlite_get_readset_metadata($) {
my $readset_filename = shift;
$get_readset->execute($readset_filename);
my $result = $get_readset->fetchrow_hashref();
return $result;
}