Personal Genome Simulator
PGsim is a simple, fast, memory-efficient personal genome simulator, developed by L.J. The program leverages comprehensive knowledge about the human genome, such as:
(1) all known variants, (2) overall allele frequencies of human variants, (3) variant AFs of a specific population, (4) common structural variations, (5) disease-related variants, (6) Ti/Tv ratio, (7) indel rate and length distribution, (8) variant distribution pattern in coding or other functional genomic regions, etc.
PGsim provides highly customizable options for realistic, reliable and flexible personal genome simulation.
First, the user needs to set a detailed configuration for the genome simulation. Although many simulation parameters are required, most of them can be assigned a value range, which ensures the randomness of the results. Users can also assign a precise value to any parameter by narrowing the ceiling and floor of its range to a single value.
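The range-based parameters described above can be sketched as follows. This is a minimal illustration, not PGsim's code: the parameter names are hypothetical, and uniform sampling between floor and ceiling is an assumption, since the distribution PGsim actually uses is not stated here.

```python
import random

def draw_parameter(floor, ceiling, rng):
    """Pick one value from [floor, ceiling]; a degenerate range fixes the value."""
    if floor > ceiling:
        raise ValueError("floor must not exceed ceiling")
    if floor == ceiling:
        return floor  # precise value: nothing left to randomize
    return rng.uniform(floor, ceiling)  # assumed distribution (uniform)

# Hypothetical parameter names, not PGsim's actual configuration keys.
ranges = {
    "titv_ratio": (2.0, 2.1),
    "indel_rate": (0.10, 0.15),
    "common_variant_threshold": (0.01, 0.01),  # floor == ceiling
}

rng = random.Random(42)  # fixed seed so the sketch is reproducible
chosen = {name: draw_parameter(lo, hi, rng) for name, (lo, hi) in ranges.items()}
```

Setting floor and ceiling to the same number, as for the last entry, removes the randomness for that parameter only.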
The databases for the reference genome, gene model, all known variants, common variants, structural variants, disease-related variants and populations can all be assigned in the configuration. The program does not require SQL database support. Bgzip-compressed, tabix-indexed VCF files are recommended.
The reference genome should be stored in a FASTA or gzipped FASTA file. The GRCh38 human genome can be obtained at: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
The gene model should contain non-overlapping coding regions in sorted BED format. The "hg38.cds.bed" file in the repository is generated from the GENCODE v32 annotations.
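For users preparing their own gene model, one way to obtain sorted, non-overlapping regions is to merge overlapping intervals. The sketch below is an assumption about how such a file could be prepared, not the procedure used to build "hg38.cds.bed"; it sorts chromosome names as plain strings, which may differ from the order PGsim expects.

```python
def merge_bed_intervals(intervals):
    """Merge overlapping or adjacent (chrom, start, end) records.

    BED coordinates are 0-based and half-open, so two records touch
    when the next start is <= the previous end.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous record instead of emitting an overlap.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(rec) for rec in merged]

cds = [("chr1", 100, 200), ("chr1", 150, 260), ("chr2", 5, 50), ("chr1", 300, 400)]
merged_cds = merge_bed_intervals(cds)
for chrom, start, end in merged_cds:
    print(f"{chrom}\t{start}\t{end}")
```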
The VCF file of dbSNP 151 can be obtained at: ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/All_20180418.vcf.gz The common variants of dbSNP 151 can be obtained at: ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/common_all_20180418.vcf.gz Users can choose to use the same database for both the "all known variants database" and the "common variants database", as long as the necessary background allele frequency information is available. Only variants whose AF is greater than the "Common Variant Threshold" value (0.01 by default; it can be set to another value) are regarded as "common variants".
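The threshold rule can be illustrated with a small filter on the VCF INFO column. This is a sketch under assumptions: the INFO key name "AF" is hypothetical (the key that carries the frequency depends on the database), and taking the highest alternate-allele frequency for multi-allelic records is a choice made here, not necessarily PGsim's.

```python
def info_value(info, key):
    """Return the raw string for KEY=value in a VCF INFO column, or None."""
    for field in info.split(";"):
        name, sep, value = field.partition("=")
        if sep and name == key:
            return value
    return None

def is_common(info, threshold=0.01, key="AF"):
    """True when the largest listed frequency is strictly above the threshold."""
    raw = info_value(info, key)
    if raw is None:
        return False  # no frequency information: cannot be called common
    freqs = [float(x) for x in raw.split(",") if x not in ("", ".")]
    return bool(freqs) and max(freqs) > threshold

# Illustrative INFO strings, not real dbSNP records.
examples = ["RS=1;AF=0.25", "RS=2;AF=0.01", "RS=3", "RS=4;AF=0.001,0.2"]
flags = [is_common(info) for info in examples]
```

Because the comparison is strict, a variant whose frequency equals the threshold exactly is not counted as common in this sketch.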
The recommended structural variation database is dbVar: ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_assembly/GRCh38/vcf/GRCh38.variant_call.vcf.gz
The disease-related variants database is ClinVar: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20191202.vcf.gz (Actually, this database can be replaced by any variant set, depending on the specific needs of users. The specified number of these variants, which are not limited to disease-related variants, will be randomly blended into the simulated genome.)
PGsim is composed of three components: PG_planner, PG_simulator and PG_generator.
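Drawing a specified number of variants at random from such a set can be sketched as below. The tuple layout and the helper name are hypothetical, and sampling without replacement with equal weights is an assumption about what "randomly blended" means.

```python
import random

def pick_variants(candidates, count, rng):
    """Draw `count` distinct variants and return them in positional order."""
    if count > len(candidates):
        raise ValueError("requested more variants than the set contains")
    # random.Random.sample draws without replacement.
    return sorted(rng.sample(candidates, count))

# Illustrative (chrom, pos, ref, alt) records, not real ClinVar entries.
candidates = [
    ("chr1", 1000, "A", "G"),
    ("chr1", 2500, "C", "T"),
    ("chr2", 300, "G", "A"),
    ("chr2", 7200, "T", "C"),
    ("chr3", 150, "A", "T"),
]

picked = pick_variants(candidates, 3, random.Random(7))
```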
PG_planner. Usage: perl PG_planner configuration_file Personal_genome_ID
PG_planner analyzes the databases and, based on the input parameters, randomly extracts candidate SVs and PVs from the structural variation database and the disease-related variant database and randomly generates novel variants. Once a database has been analyzed, a '.pgstat' file sharing the prefix of the database file is generated, so rerunning the program does not need to analyze that database again.
Based on the database analysis results and input parameters, PG_planner generates the following intermediate files:
Personal_genome_ID.SV_database.pgsim.vcf: the candidate SVs of the simulated personal genome.
Personal_genome_ID.PV_database.pgsim.vcf: the candidate PVs of the simulated personal genome.
Personal_genome_ID.nv.loc: the candidate novel variants of the simulated personal genome.
Personal_genome_ID.pgsim.allparams.conf: the comprehensive configuration file holding the whole parameter set of the simulation, including internal parameters, the randomly selected value of each parameter within its user-assigned range, and hidden parameters calculated from the database analysis results.
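The reuse of analysis results can be pictured as a simple file cache. The sketch below is illustrative only: appending ".pgstat" to the full database path and storing a plain-text summary are assumptions, and the real '.pgstat' naming and contents may differ.

```python
import os
import tempfile

def load_or_analyze(db_path, analyze):
    """Return (summary, freshly_computed), reusing a cached stat file if present."""
    stat_path = db_path + ".pgstat"  # assumed naming scheme
    if os.path.exists(stat_path):
        with open(stat_path) as fh:
            return fh.read(), False
    summary = analyze(db_path)
    with open(stat_path, "w") as fh:
        fh.write(summary)
    return summary, True

calls = []

def count_records(path):
    """Stand-in analysis: count non-header lines of a small text VCF."""
    calls.append(path)
    with open(path) as fh:
        return str(sum(1 for line in fh if not line.startswith("#")))

with tempfile.TemporaryDirectory() as tmp:
    db = os.path.join(tmp, "toy.vcf")
    with open(db, "w") as fh:
        fh.write("##fileformat=VCFv4.2\nchr1\t10\t.\tA\tG\n")
    first, first_fresh = load_or_analyze(db, count_records)
    second, second_fresh = load_or_analyze(db, count_records)  # served from cache
```

In this sketch the second call never invokes the analysis function, which mirrors the behaviour described for reruns.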
PG_simulator. Usage: perl PG_simulator Personal_genome_ID
PG_simulator simulates the personal genome variants based on the plan and parameters. The results are sorted by chromosome and stored in VCF format.
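Sorting variant records by chromosome and then by position can be sketched as follows. The ordering used here (numbered chromosomes first, then X, Y and M, then anything else alphabetically) is an assumption; the order PG_simulator actually writes is not specified in this document.

```python
SPECIAL_ORDER = {"X": 23, "Y": 24, "M": 25, "MT": 25}  # assumed ordering

def chrom_key(chrom):
    """Sort key that places chr2 before chr10, unlike plain string sorting."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    if name.isdigit():
        return (int(name), "")
    if name in SPECIAL_ORDER:
        return (SPECIAL_ORDER[name], "")
    return (26, name)  # unplaced or alternate contigs go last

# Illustrative (chrom, pos, ref, alt) records.
records = [
    ("chrX", 500, "A", "G"),
    ("chr10", 100, "C", "T"),
    ("chr2", 900, "G", "A"),
    ("chr2", 50, "T", "C"),
]

ordered = sorted(records, key=lambda rec: (chrom_key(rec[0]), rec[1]))
```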
The result file is: Personal_genome_ID.vcf
PG_generator. Usage: perl PG_generator Reference_genome PG_input_VCF
The diploid genome sequences are generated by PG_generator. Meanwhile, coordinate mapping files are also generated.
The result files are: Personal_genome_ID1.fasta, Personal_genome_ID2.fasta, Personal_genome_ID1.map, Personal_genome_ID2.map.
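To illustrate why coordinate mapping files are needed, the sketch below applies a few variants to a toy reference and records how unchanged blocks shift between reference and haplotype coordinates. It is a simplified illustration, not PG_generator's algorithm: the segment layout (ref_start, hap_start, length) is hypothetical and does not describe the actual '.map' format, and the variants are assumed to be sorted and non-overlapping on a single haplotype.

```python
def apply_variants(reference, variants):
    """Apply sorted, non-overlapping (pos, ref, alt) variants; pos is 1-based.

    Returns the haplotype sequence and a list of (ref_start, hap_start, length)
    segments, 1-based, covering the blocks copied unchanged from the reference.
    """
    pieces, segments = [], []
    ref_cursor = 0  # 0-based index of the next unread reference base
    hap_len = 0     # number of haplotype bases written so far
    for pos, ref, alt in variants:
        start = pos - 1
        if start < ref_cursor:
            raise ValueError("variants overlap or are unsorted")
        if reference[start:start + len(ref)] != ref:
            raise ValueError(f"REF mismatch at position {pos}")
        if start > ref_cursor:
            block = reference[ref_cursor:start]
            segments.append((ref_cursor + 1, hap_len + 1, len(block)))
            pieces.append(block)
            hap_len += len(block)
        pieces.append(alt)
        hap_len += len(alt)
        ref_cursor = start + len(ref)
    if ref_cursor < len(reference):
        tail = reference[ref_cursor:]
        segments.append((ref_cursor + 1, hap_len + 1, len(tail)))
        pieces.append(tail)
    return "".join(pieces), segments

reference = "ACGTACGTAC"
variants = [(3, "G", "T"), (6, "CG", "C"), (9, "A", "ATT")]  # SNV, deletion, insertion
haplotype, segments = apply_variants(reference, variants)
```

Here the deletion shifts later bases one position to the left and the insertion shifts them two positions to the right, so the final reference base (position 10) lands at haplotype position 11. Running the same routine once per haplotype, each with its own alleles, would yield two sequences and two maps.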
Please do not hesitate to address comments/questions/suggestions regarding this tool to: pgbrowser@gmail.com.
L.J. Dec. 20, 2019