Skip to content

Commit

Permalink
added thesis code, final version
Browse files Browse the repository at this point in the history
  • Loading branch information
kcajj committed Jul 21, 2024
1 parent 1917acb commit 52271d8
Show file tree
Hide file tree
Showing 21 changed files with 4,514 additions and 4 deletions.
3 changes: 1 addition & 2 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -2,5 +2,4 @@ data/
__pycache__/
results/
log/
.snakemake/
thesis/
.snakemake/
6 changes: 4 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,9 @@
# recombinant_population_analysis

This repository represents the largest portion of my [Bachelor thesis](/Thesis_Giacomo_Castagnetti.pdf) in Genomics at University of Bologna. This project started in summer 2023 during the Biozentrum research summer project, at [NeherLab](https://neherlab.org/). I graduated in July 2024, with a final score of 110/110 cum laude, along with a honourable mention.
This repository represents the largest portion of my [Bachelor thesis](/thesis/Thesis_Giacomo_Castagnetti.pdf) in Genomics at University of Bologna. This project started in summer 2023 during the Biozentrum research summer project, at [NeherLab](https://neherlab.org/). I graduated in July 2024, with a final score of 110/110 cum laude, along with a honourable mention.

This repository processes the data produced by an experiment performed with the Aionostat, a machine that allows automatic experimental evolution of phages.

The experiment consists in evolving 3 phages with a bacterium that they do not infect well initially. Through an initial [exploratory analysis](https://github.com/kcajj/rec_genome_analysis) that I performed, we know that two phages recombined and one got extinct. The objective of this repository is to create a more precise and high througput procedure to measure recombination in population sequencing data.
The experiment consists in evolving 3 phages with a bacterium that they do not infect well initially. Through an initial [exploratory analysis](https://github.com/kcajj/rec_genome_analysis) that I performed, we know that two phages recombined and one got extinct. The objective of this repository is to create a more precise and high througput procedure to measure recombination in population sequencing data.

We produced a pipeline that carries out the whole analysis, yielding the plots of the coverage of the genomes of the two phages for each timestep of the experiment.
1 change: 1 addition & 0 deletions notes/note3.md
Original file line number Diff line number Diff line change
Expand Up @@ -96,3 +96,4 @@ reads used 67274

we add a Snakefile, config.yaml file, cluster folder, conda_envs folder, local folder and log folder.

The scripts produced so far were adapted to work in a pipeline.
File renamed without changes.
4 changes: 4 additions & 0 deletions thesis/alignments/hybrid_on_bas51.bam

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions thesis/alignments/hybrid_on_bas54.bam

Large diffs are not rendered by default.

54 changes: 54 additions & 0 deletions thesis/coordinate_converter.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import pysam

def get_hyb_ref_map(bam_file):
map_hyb_ref={}
with pysam.AlignmentFile(bam_file, "rb") as bam:
for read in bam.fetch():
if not(read.is_secondary):
alignment=read.get_aligned_pairs()
for hyb,ref in alignment:
if hyb==None:
continue
if hyb in map_hyb_ref.keys():
if map_hyb_ref[hyb]==None:
map_hyb_ref[hyb]=ref
else:
continue
else:
map_hyb_ref[hyb]=ref
return map_hyb_ref

if __name__ == "__main__":
populations=["P2","P3"]
timesteps=["1","3","5","7"]

recombination_folder="results/genomewide_recombination"
coverage_folder="results/coverage_arrays"
#refs_msa_path=
#output_path="thesis/plots"

threshold_frequency=0.01

bas51_map=get_hyb_ref_map("thesis/alignments/hybrid_on_bas51.bam")
bas54_map=get_hyb_ref_map("thesis/alignments/hybrid_on_bas54.bam")

print(" ")
print("bas51")
print(bas51_map[98100])

print(" ")
print("hybrid ref")
print(list(bas51_map.keys())[list(bas51_map.values()).index(95262)])
print(list(bas51_map.keys())[list(bas51_map.values()).index(123100)])
print(list(bas51_map.keys())[list(bas51_map.values()).index(125434)])



print(" ")
print("bas54")
print(bas54_map[96213])
print(bas54_map[124146])
print(bas54_map[126481])
78 changes: 78 additions & 0 deletions thesis/handle_msa.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
from Bio import SeqIO,AlignIO
import numpy as np
import matplotlib.pyplot as plt

def read_msa(path):
alignment = AlignIO.read(open(path), "fasta")
print(alignment)
l=alignment.get_alignment_length()
msa_matrix=np.zeros([3,l],dtype=str)
for i,record in enumerate(alignment):
for pos,nuc in enumerate(record.seq):
msa_matrix[i][pos]=nuc
return msa_matrix

def get_evidences_distributions(msa_matrix,i_ref1=0,i_ref2=1,i_extra=2):
l=len(msa_matrix[0])
e_distribution = np.zeros(l, dtype=int)

for pos,array in enumerate(msa_matrix[0]):
nuc_extra=msa_matrix[i_extra,pos]
nuc_first_ref=msa_matrix[i_ref1,pos]
nuc_second_ref=msa_matrix[i_ref2,pos]
#if nuc_first_ref!='-' and nuc_second_ref!='-':
if nuc_extra!='-' and nuc_first_ref!='-' and nuc_second_ref!='-':
if nuc_extra==nuc_first_ref and nuc_extra==nuc_second_ref:
continue
elif nuc_extra!=nuc_first_ref and nuc_extra!=nuc_second_ref:
e_distribution[pos]=1
elif nuc_extra==nuc_first_ref and nuc_extra!=nuc_second_ref:
e_distribution[pos]=2
elif nuc_extra!=nuc_first_ref and nuc_extra==nuc_second_ref:
e_distribution[pos]=3
#elif nuc_first_ref=="-" and nuc_second_ref=="-":
# if nuc_extra==nuc_first_ref:
# continue
# elif nuc_extra!=nuc_first_ref:
# e_distribution[pos]=1

return e_distribution

def map_refcoord_msacoord(ref_path,refs_msa_path,i_ref_in_msa):
map={}
i=0
j=0
ref_seq=SeqIO.read(ref_path, "fasta").seq
alignment=AlignIO.read(open(refs_msa_path), "fasta")
while i<len(ref_seq):
if alignment[i_ref_in_msa,j]!='-':
map[i]=j
i+=1
j+=1
return map

def add_to_msa(msa_path,seq,mapping_start,mapping_end):
alignment = AlignIO.read(open(msa_path), "fasta")
l=alignment.get_alignment_length()

msa_matrix=np.zeros([3,l],dtype=str)
for i,record in enumerate(alignment):
for pos,nuc in enumerate(record.seq):
msa_matrix[i][pos]=nuc

cut_msa_matrix=msa_matrix[:,mapping_start:mapping_end]
for pos in range(len(cut_msa_matrix[2])):
cut_msa_matrix[2][pos]=seq[pos]

return cut_msa_matrix

def length_msa(msa_path):
alignment = AlignIO.read(open(msa_path), "fasta")
return alignment.get_alignment_length()

def extract_references_names(msa_path):
alignment = AlignIO.read(open(msa_path), "fasta")
names=[]
for record in alignment:
names.append(record.id)
return names
65 changes: 65 additions & 0 deletions thesis/plots/hot_sites_P2.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
t=0.1
P2 1
Values: []
Indices: []
P2 3
Values: [0.43 0.3683]
Indices: [ 95262 123100]
P2 5
Values: [0.5754 0.5578 0.1535]
Indices: [ 95262 123100 95418]
P2 7
Values: [0.7862 0.7257 0.5069]
Indices: [ 95262 123100 78670]

t=0.05
P2 1
Values: [0.1935 0.0928]
Indices: [ 95262 123100]
P2 3
Values: [0.43 0.3683 0.108 0.076 0.0555]
Indices: [ 95262 123100 95418 125434 96040]
P2 5
Values: [0.5754 0.5578 0.1535 0.1499 0.0741]
Indices: [ 95262 123100 95418 78670 125434]
P2 7
Values: [0.7862 0.7257 0.5069 0.1652 0.0635 0.0581 0.0577 0.0565]
Indices: [ 95262 123100 78670 79334 79623 78669 123096 95418]

t=0.01
P2 1
Values: [0.1935 0.0928 0.0531 0.036 0.028 0.026 0.0255 0.0251 0.02 0.0193
0.0188 0.0187 0.0164 0.0139]
Indices: [ 95262 123100 67722 96040 77864 94729 79179 96865 125434 76956
69844 95152 69631 79178]
P2 3
Values: [0.43 0.3683 0.108 0.076 0.0555 0.0511 0.0368 0.0308 0.0296 0.0295
0.0288 0.0269 0.0257 0.0224 0.0207 0.02 0.0178 0.0174 0.0158 0.0147
0.0144 0.014 0.0139 0.0139 0.0138 0.0134 0.0133 0.013 0.0129 0.0128
0.0125 0.012 0.0118 0.0118 0.0113 0.0111 0.011 0.0109 0.0107 0.0106
0.0104]
Indices: [ 95262 123100 95418 125434 96040 96865 77864 96726 123096 95718
95004 76956 97155 95893 95675 96230 96866 95462 97076 97156
96565 96655 96566 95585 96727 95676 96953 94554 95745 94727
96808 97404 96116 95294 123241 95419 97615 95152 95527 96307
95086]
P2 5
Values: [0.5754 0.5578 0.1535 0.1499 0.0741 0.0571 0.0489 0.0472 0.0461 0.0442
0.0406 0.0324 0.0316 0.0274 0.0273 0.0268 0.0267 0.025 0.0235 0.0219
0.0219 0.0218 0.0209 0.0199 0.0197 0.0184 0.0182 0.0181 0.0181 0.0181
0.0173 0.0171 0.0163 0.0162 0.0147 0.0144 0.014 0.0132 0.0132 0.0128
0.0127 0.0123 0.0113 0.0113 0.0109 0.0106 0.0106 0.0102]
Indices: [ 95262 123100 95418 78670 125434 96865 79334 96040 96726 123096
97155 95675 95893 77864 95462 97156 96565 95086 96727 95585
79623 95718 95294 79284 78669 96953 95419 76956 97076 95745
95676 97495 123241 96808 96566 97295 96866 96229 96230 96361
95817 95527 79774 96655 95774 78961 99036 78762]
P2 7
Values: [0.7862 0.7257 0.5069 0.1652 0.0635 0.0581 0.0577 0.0565 0.0565 0.0341
0.0317 0.0303 0.0291 0.0271 0.0249 0.0245 0.024 0.0238 0.0233 0.0204
0.0195 0.0159 0.0156 0.0152 0.0145 0.0136 0.013 0.0127 0.0119 0.0118
0.0116 0.0112 0.0111 0.0111 0.0108]
Indices: [ 95262 123100 78670 79334 79623 78669 123096 95418 79284 78961
79774 78762 96726 125434 96040 97155 79218 95893 123241 96865
95675 96565 79861 79611 96727 95745 77864 123242 97076 96655
78664 95462 80064 96229 95585]
36 changes: 36 additions & 0 deletions thesis/plots/hot_sites_P3.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
P3 1
Values: []
Indices: []
P3 3
Values: []
Indices: []
P3 5
Values: []
Indices: []
P3 7
Values: []
Indices: []
P3 1
Values: []
Indices: []
P3 3
Values: []
Indices: []
P3 5
Values: []
Indices: []
P3 7
Values: []
Indices: []
P3 1
Values: []
Indices: []
P3 3
Values: []
Indices: []
P3 5
Values: [0.0175 0.0156 0.0117]
Indices: [95262 78670 93362]
P3 7
Values: [0.0247 0.0212]
Indices: [124695 124732]
Binary file added thesis/plots/time_dynamics_coverage_P2.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added thesis/plots/time_dynamics_coverage_P3.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added thesis/plots/time_dynamics_recombination_P2.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added thesis/plots/time_dynamics_recombination_P3.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading

0 comments on commit 52271d8

Please sign in to comment.