VCF-to-23andMe

These scripts convert the output of the Sanger Imputation Service into the 23andMe V3 raw data format.
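To make the conversion concrete, the following is a minimal sketch of how one single-sample VCF data line could be turned into a 23andMe-style row (tab-separated rsid, chromosome, position, genotype). It is not the code from this repository: the function name `vcf_line_to_23andme` and the example record (`rs123` at position 1000) are hypothetical, and the sketch assumes the sample column carries a `GT` field.

```python
def vcf_line_to_23andme(line):
    """Convert one single-sample VCF data line into a 23andMe-style row.

    Hypothetical helper for illustration; assumes the 10-column layout
    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, rsid, ref, alt = fields[0], fields[1], fields[2], fields[3], fields[4]
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    gt = sample_values[format_keys.index("GT")]

    # Allele index 0 is REF, 1..n are the comma-separated ALT alleles.
    alleles = [ref] + alt.split(",")

    # Phased genotypes use "|", unphased ones use "/"; treat both alike here.
    calls = gt.replace("|", "/").split("/")

    # A missing call "." becomes "-", so "./." turns into the no-call "--".
    genotype = "".join("-" if c == "." else alleles[int(c)] for c in calls)
    return "\t".join([rsid, chrom, pos, genotype])


# Made-up example record, not real data.
example = "1\t1000\trs123\tG\tA\t.\tPASS\t.\tGT\t0|1\n"
row = vcf_line_to_23andme(example)
print(row)
```

In practice PLINK or the scripts in this repository do this work; the sketch only shows the mapping from allele indices to letters.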

data_to_db.py loads the VCF file and, optionally, a 23andMe raw data file (which adds 23andMe identifiers) into an indexed SQLite3 database for quick searching. db_to_23.py then fills a blank template file with genotypes retrieved from the database by chromosome, position, and identifier.
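The idea of an indexed lookup followed by template filling can be sketched as below. This is an illustration only, not the schema or logic of data_to_db.py or db_to_23.py: the table name `snps`, its columns, the `lookup` helper, the fallback order (identifier first, then chromosome and position), and all sample rows are assumptions.

```python
import sqlite3

# In-memory database standing in for genome.db (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snps (rsid TEXT, chrom TEXT, pos INTEGER, genotype TEXT)"
)
conn.executemany(
    "INSERT INTO snps VALUES (?, ?, ?, ?)",
    [
        ("rs111", "1", 1000, "AG"),  # made-up rows
        ("rs222", "2", 2000, "CC"),
    ],
)
# Indexes keep per-SNP lookups fast even with millions of imputed rows.
conn.execute("CREATE INDEX idx_rsid ON snps (rsid)")
conn.execute("CREATE INDEX idx_chrom_pos ON snps (chrom, pos)")


def lookup(conn, rsid, chrom, pos):
    """Return the genotype for a template row, or "--" if nothing matches."""
    row = conn.execute(
        "SELECT genotype FROM snps WHERE rsid = ?", (rsid,)
    ).fetchone()
    if row is None:
        # Fall back to the genomic coordinate, e.g. for 23andMe-internal IDs.
        row = conn.execute(
            "SELECT genotype FROM snps WHERE chrom = ? AND pos = ?",
            (chrom, pos),
        ).fetchone()
    return row[0] if row else "--"


# Blank template rows: identifier, chromosome, position (made up).
template = [("rs111", "1", 1000), ("i5000", "2", 2000), ("rs333", "3", 3000)]
filled = [
    (rsid, chrom, pos, lookup(conn, rsid, chrom, pos))
    for rsid, chrom, pos in template
]
for entry in filled:
    print("\t".join(str(value) for value in entry))
```

Building the indexes after the bulk insert, as here, is usually faster than maintaining them during loading.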

The data_to_db.py script accepts both compressed and uncompressed data files.
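One common way to accept both kinds of input is to check the first two bytes of the file for the gzip signature. The sketch below shows that approach; whether data_to_db.py detects compression this way (rather than, say, by file extension) is an assumption, and `open_text` is a hypothetical helper name.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed file in text mode.

    Hypothetical helper: detects gzip by its two-byte magic number,
    which bgzip-compressed .vcf.gz files also start with.
    """
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")
```

A caller would then iterate over lines the same way for either input, for example `with open_text("filtered_wgs.vcf.gz") as fh: ...`.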

Requirements

Preparation

If your autosomal file is not in the 23andMe format, convert it using DNA Kit Studio.

Impute your genome with the Sanger Imputation Service. In this project, the Haplotype Reference Consortium (release 1.1) reference panel and the "Pre-phasing and imputation with EAGLE2+PBWT" pipeline were used.

Usage

cd /path/to/imputed.vcfs
# Merge chromosomes
bcftools concat -Oz 1.vcf.gz 2.vcf.gz 3.vcf.gz 4.vcf.gz 5.vcf.gz 6.vcf.gz 7.vcf.gz 8.vcf.gz 9.vcf.gz 10.vcf.gz 11.vcf.gz 12.vcf.gz 13.vcf.gz 14.vcf.gz 15.vcf.gz 16.vcf.gz 17.vcf.gz 18.vcf.gz 19.vcf.gz 20.vcf.gz 21.vcf.gz 22.vcf.gz X.vcf.gz > wgs.vcf.gz
# Select identified SNPs
bcftools view -Oz -e 'ID=="."' -o filtered_wgs.vcf.gz wgs.vcf.gz
# Select good SNPs (optional)
# bcftools view -Oz -i 'INFO>0.95' -o filtered_second_wgs.vcf.gz filtered_wgs.vcf.gz
# Select good and common SNPs, i.e. minor allele frequency >= 0.05 (optional)
# bcftools view -Oz -i 'INFO>0.95' -q 0.05:minor -o filtered_second_wgs.vcf.gz filtered_wgs.vcf.gz

# Transform the whole genome into the 23andMe format (optional)
cd /path/to/plink
./plink --vcf /path/to/imputed.vcfs/filtered_wgs.vcf.gz  --snps-only --recode 23 --out imputed_23andme_full

cd /path/to/vcf-to-23andme
# Construct a short file in the 23andMe format
python data_to_db.py /path/to/original/23andme_v5_original.txt 23andme genome.db # (optional)
python data_to_db.py /path/to/imputed.vcfs/filtered_wgs.vcf.gz vcf genome.db
# Use all_templates_merged_blank.tsv to consider all SNPs from 23andMe v3,v4 and v5, AncestryDNA v1 and v2,
# FTDNA v1 and v2, Tellmegen v1 and v2, LivingDNA, SelfDecode v1 and MyHeritage v2.
# Alternatively use 23andme_v3_blank.tsv, 23andme_v5_blank.tsv or 23andme_merged_v3v4v5_blank.tsv
python db_to_23.py genome.db all_templates_merged_blank.tsv imputed_23andme_full_short.txt

# If phasing was performed before imputation (the genotype is then separated by | in the VCF file, e.g. G|C), you can run split_parents.py to split
# imputed_23andme_full_short.txt into parent 1 and parent 2 files.
# These files can later be used in DNAGenics's Admixture Studio and G25 Studio
python split_parents.py
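The splitting of a phased genotype can be illustrated with the short sketch below. It is not the code of split_parents.py: the `split_phased` helper, the tuple layout, and the sample rows are hypothetical, and how the real script writes each parent file is not shown here.

```python
def split_phased(genotype):
    """Split a phased genotype such as "G|C" into its two haplotype alleles."""
    if "|" not in genotype:
        # Unphased calls ("G/C") carry no information about which
        # allele sits on which haplotype.
        raise ValueError("genotype is not phased: " + genotype)
    first, second = genotype.split("|", 1)
    return first, second


# Made-up phased rows: identifier, chromosome, position, genotype.
rows = [("rs111", "1", 1000, "G|C"), ("rs222", "1", 2000, "A|A")]

# The left allele goes to parent 1, the right allele to parent 2.
parent1 = [(rsid, chrom, pos, split_phased(gt)[0]) for rsid, chrom, pos, gt in rows]
parent2 = [(rsid, chrom, pos, split_phased(gt)[1]) for rsid, chrom, pos, gt in rows]

print(parent1)
print(parent2)
```

Note that "parent 1" and "parent 2" are only labels for the two phased haplotypes; phasing alone does not say which one is maternal or paternal.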

If your original file contains SNPs that are missing from the result file, or you want to keep yDNA and mtDNA SNPs, merge the files using DNA Kit Studio.
