sampling from subpopulations #2383
-
I have a tree sequence mts that was generated by applying pyslim's recapitation function to a three-subpopulation tree sequence ts generated in SLiM. Because I did not specify the ancestry and sampling in msprime, I'm unclear on what the most efficient way to sample nsamp haplotypes from each subpopulation would be. My three subpopulations have 20K, 4K, and 4K diploid individuals, and are labeled populations 1, 2, 3. I thought that I could do something along the lines of
to get nsamp haplotypes from population 1, but this doesn't work. Specifically, Pop_Array_A has no attribute genotype_matrix(). What is the simplest way to do this for an mts object for which I didn't define an ancestry in msprime?
-
Hi @mshpak76 👋 This isn't really an msprime-specific discussion, so I moved it to tskit. The reason your code above isn't working is because Getting out the full genotype matrix is usually not ideal, unless you're working with very small datasets. What would you like to do with this genotype matrix?
-
I am using the allele frequencies at each site to calculate Reynolds Fst. It seemed to me that getting the genotype matrix for each subpopulation would be the best way to do this. I found a work-around that converts the ts into a genotype matrix, which I then sample by generating random indices corresponding to the index range for every population. I'm sure there's a more efficient way of doing this.
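For readers following along, a work-around of the kind described above might look roughly like the sketch below. It is not the poster's actual code: the function name, the toy genotype matrix, and the scaled-down population sizes are all made up for illustration, and it assumes biallelic 0/1 genotypes with one column per haplotype.

```python
import numpy as np


def sample_columns_by_population(G, sample_pops, nsamp, rng):
    """Pick nsamp random haplotype columns of G for each population label.

    G is a (num_sites, num_haplotypes) genotype matrix and sample_pops holds
    the population label of each column. Hypothetical helper for illustration.
    """
    sampled = {}
    for pop in np.unique(sample_pops):
        cols = np.flatnonzero(sample_pops == pop)  # column indices of this population
        chosen = rng.choice(cols, size=nsamp, replace=False)
        sampled[int(pop)] = G[:, chosen]
    return sampled


# Toy data standing in for the real genotype matrix (assumed, scaled down).
rng = np.random.default_rng(1)
sample_pops = np.repeat([1, 2, 3], [40, 8, 8])  # haplotypes per population
G = rng.integers(0, 2, size=(50, sample_pops.size))  # 50 sites, 0/1 alleles

nsamp = 6
sampled = sample_columns_by_population(G, sample_pops, nsamp, rng)
# Per-site frequency of allele 1 within each population's sample.
freqs = {pop: sub.mean(axis=1) for pop, sub in sampled.items()}
```

This requires holding the full genotype matrix in memory, which is the cost the question is trying to avoid.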
-
If you want to pull out the subset tree sequences for the different populations you can do this:

```python
for pop_id in [1, 2, 3]:
    ts_subset = ts.simplify(ts.samples(population=pop_id), filter_sites=False)
    G = ts_subset.genotype_matrix()
    # G should be the per-population genotype matrix now
```

The filter_sites argument is required so that we don't remove any sites from the genotype matrices that don't have any mutations in the subset trees. See the documentation for simplify for details.

I haven't tested this, so beware!