-
Notifications
You must be signed in to change notification settings - Fork 11
Description
Dropping more notes for posterity.
We recently got some data mapped via starsolo rather than cellranger, following the official --soloBarcodeMate 1 --clip5pNbases 39 0 instructions to get paired end data to work. This creates a soft-clip in the BAM, retaining the entire sequence with the cell/UMI barcodes at the start. Scafe's bam_to_ctss is unhappy with this - it detects the correct offset of ~40 to the TSO, but then promptly forgets it, looks at error rate with an offset of 8, sees a ~70% error rate and explodes. I don't have the exact error on hand anymore as this was a few runs ago.
One way to get the BAM to clear would be to run with --detect_TS_oligo trim and a meaningless 39bp sequence as --TS_oligo_seq. However, if wanting to make use of the power of the various modes, getting the BAM close to the cellranger standard with its 26bp hard clip gets the job done. There's still a 13bp softclip where the TSO is supposed to go, but that might actually be a blessing as it may prevent errant alignment to the reference?
Without further ado, an operational script to get the BAM into that shape:
import pysam
#check if a given bit (from the end, one-indexed) is set in a number's binary representation
#https://www.tutorialspoint.com/check-whether-the-bit-at-given-position-is-set-or-unset-in-python
#k=5 for revcomp (16)
#k=7 for first read (64)
def set_bin_flag(n,k):
temp = n >> (k - 1)
if temp & 1:
return True
return False
bam = "Aligned.sortedByCoord.out.bam"
infile = pysam.AlignmentFile(bam,"rb")
outfile = pysam.AlignmentFile("scafe.bam", "wb", template=infile)
#I don't think this needs until_eof=True as scafe only cares about aligned reads
for read in infile:
#is our thing R1, and needs 26bp clipped from somewhere?
if set_bin_flag(read.flag, 7):
#it is an R1!
#we'll need the quality stripped off to a separate variable
#as per pysam docs, modifying query_sequence resets qualities
q = read.query_qualities
#just trimming the sequence/quality isn't enough
#need to shave 26bp out of the correct bit of cigar
#this happens within the if
#is it a revcomped R1?
if set_bin_flag(read.flag, 5):
#it is revcomped. trim 26bp from end
read.query_sequence = read.query_sequence[:-26]
read.query_qualities = q[:-26]
#cigar modification time
#start by sanity checking taht we actually have a softclip (S=4 as per docs) there
if read.cigartuples[-1][0] != 4:
raise ValueError(read.query_name+" did not have S in the right place in the cigar string?!")
#alright, it's an S there, like it should be. shave the range by 26bp
#need to do a proper full blown assign, it doesn't do partial assigns somehow
read.cigartuples = read.cigartuples[:-1] + [(read.cigartuples[-1][0], read.cigartuples[-1][1]-26)]
else:
#it's forward stranded. trim 26bp from start
read.query_sequence = read.query_sequence[26:]
read.query_qualities = q[26:]
#same story with the cigar, but the first tuple needs to be modified
if read.cigartuples[0][0] != 4:
raise ValueError(read.query_name+" did not have S in the right place in the cigar string?!")
read.cigartuples = [(read.cigartuples[0][0], read.cigartuples[0][1]-26)] + read.cigartuples[1:]
#the read is now modded, if it needs to be
outfile.write(read)
#we're done
infile.close()
outfile.close()The idea is to find R1s, then check if they're aligned to the reverse strand (both of these using SAM flags), then trim 26bp either from the start or the end of the read+CIGAR as appropriate. This actually takes a while to run, think it might be taking longer than the subsequent scafe on my end.