Incorrect parsing in vcf-expression-annotator #80

Open
Stikus opened this issue Feb 7, 2025 · 2 comments

Stikus commented Feb 7, 2025

Hello @susannasiebert, it looks like vcf-expression-annotator has problems with the % sign in the SAMPLE columns and with some other symbols:

Input string (from VarScan):
chr1    1008988 .       C       T       .       PASS    DP=175;SOMATIC;SS=2;SSC=77;GPV=1E0;SPV=1.9783E-8;CSQ=T|upstream_gene_variant|MODIFIER||ENSG00000231702|Transcript|ENST00000423619.2|processed_pseudogene||||||||||rs369247934|1|759|-1|||SNV|1|||YES||||||||||||||chr1:g.1008988C>T|0.0062|0.0234|0|0|0|0||||||||||||0.0234|AFR|||||||||||,T|downstream_gene_variant|MODIFIER||ENSG00000224969|Transcript|ENST00000458555.1|lncRNA||||||||||rs369247934|1|3009|-1|||SNV|1|||YES|2|||||||||||||chr1:g.1008988C>T|0.0062|0.0234|0|0|0|0||||||||||||0.0234|AFR|||||||||||,T|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624652.1|protein_coding||2/2|ENST00000624652.1:c.-22+709C>T|||||||rs369247934|1||1|cds_end_NF||SNV|1|HGNC|HGNC:4053||3||ENSP00000485313||A0A096LNZ9.44|UPI00053BD5BA|||1|||||chr1:g.1008988C>T|0.0062|0.0234|0|0|0|0||||||||||||0.0234|AFR|||||||||||MLAGNEFQVSLSSSMSVSELKAQITQKIGVHAFQQRLAVHPSGVALQDRVPLASQGLGPGSTVLLVVDKCDEPLSILVRNNKGRSSTYEVRLTQTVAHLKQQVSGLEGVQDDLFWLTFEGKPLEDQLPLGEYGLKPLSTVFMN,T|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624697.4|protein_coding||2/2|ENST00000624697.4:c.-22+709C>T|||||||rs369247934|1||1|||SNV|1|HGNC|HGNC:4053||3||ENSP00000485643||A0A096LPJ4.39|UPI0004F23698|||1|||||chr1:g.1008988C>T|0.0062|0.0234|0|0|0|0||||||||||||0.0234|AFR|||||||||||MLAGNEFQVSLSSSMSVSELKAQITQKIGVHAFQQRLAVHPSGVALQDRVPLASQGLGPGSTVLLVVDKCDEPLSILVRNNKGRSSTYEVRLTQTVAHLKQQVSGLEGVQDDLFWLTFEGKPLEDQLPLGEYGLKPLSTVFMNLRLRGGGTEPGGRS,T|upstream_gene_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000649529.1|protein_coding||||||||||rs369247934|1|4509|1||1|SNV|1|HGNC|HGNC:4053|YES||CCDS6.1|ENSP00000496832|P05161.225||UPI0000048D70||NM_005101.4|1|||||chr1:g.1008988C>T|0.0062|0.0234|0|0|0|0||||||||||||0.0234|AFR|||||||||||MGWDLTVKMLAGNEFQVSLSSSMSVSELKAQITQKIGVHAFQQRLAVHPSGVALQDRVPLASQGLGPGSTVLLVVDKCDEPLSILVRNNKGRSSTYEVRLTQTVAHLKQQVSGLEGVQDDLFWLTFEGKPLEDQLPLGEYGLKPLSTVFMNLRLRGGGTEPGGRS GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:38:38:0:0%:17,21,0,0      
0/1:.:137:79:58:42.34%:40,39,36,22

Output string:
chr1	1008988	.	C	T	.	PASS	DP=175;SOMATIC;SS=2;SSC=77;GPV=1.0;SPV=1.9783e-08;CSQ=T|upstream_gene_variant|MODIFIER||ENSG00000231702|Transcript|ENST00000423619.2|processed_pseudogene||||||||||rs369247934|1|759|-1|||SNV|1|||YES||||||||||||||chr1%3Ag.1008988C>T|0.0062|0.0234|0|0|0|0||||||||||||0.0234|AFR|||||||||||,T|downstream_gene_variant|MODIFIER||ENSG00000224969|Transcript|ENST00000458555.1|lncRNA||||||||||rs369247934|1|3009|-1|||SNV|1|||YES|2|||||||||||||chr1%3Ag.1008988C>T|0.0062|0.0234|0|0|0|0||||||||||||0.0234|AFR|||||||||||,T|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624652.1|protein_coding||2/2|ENST00000624652.1%3Ac.-22+709C>T|||||||rs369247934|1||1|cds_end_NF||SNV|1|HGNC|HGNC%3A4053||3||ENSP00000485313||A0A096LNZ9.44|UPI00053BD5BA|||1|||||chr1%3Ag.1008988C>T|0.0062|0.0234|0|0|0|0||||||||||||0.0234|AFR|||||||||||MLAGNEFQVSLSSSMSVSELKAQITQKIGVHAFQQRLAVHPSGVALQDRVPLASQGLGPGSTVLLVVDKCDEPLSILVRNNKGRSSTYEVRLTQTVAHLKQQVSGLEGVQDDLFWLTFEGKPLEDQLPLGEYGLKPLSTVFMN,T|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624697.4|protein_coding||2/2|ENST00000624697.4%3Ac.-22+709C>T|||||||rs369247934|1||1|||SNV|1|HGNC|HGNC%3A4053||3||ENSP00000485643||A0A096LPJ4.39|UPI0004F23698|||1|||||chr1%3Ag.1008988C>T|0.0062|0.0234|0|0|0|0||||||||||||0.0234|AFR|||||||||||MLAGNEFQVSLSSSMSVSELKAQITQKIGVHAFQQRLAVHPSGVALQDRVPLASQGLGPGSTVLLVVDKCDEPLSILVRNNKGRSSTYEVRLTQTVAHLKQQVSGLEGVQDDLFWLTFEGKPLEDQLPLGEYGLKPLSTVFMNLRLRGGGTEPGGRS,T|upstream_gene_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000649529.1|protein_coding||||||||||rs369247934|1|4509|1||1|SNV|1|HGNC|HGNC%3A4053|YES||CCDS6.1|ENSP00000496832|P05161.225||UPI0000048D70||NM_005101.4|1|||||chr1%3Ag.1008988C>T|0.0062|0.0234|0|0|0|0||||||||||||0.0234|AFR|||||||||||MGWDLTVKMLAGNEFQVSLSSSMSVSELKAQITQKIGVHAFQQRLAVHPSGVALQDRVPLASQGLGPGSTVLLVVDKCDEPLSILVRNNKGRSSTYEVRLTQTVAHLKQQVSGLEGVQDDLFWLTFEGKPLEDQLPLGEYGLKPLSTVFMNLRLRGGGTEPGGRS	GT:GQ:DP:RD:AD:FREQ:DP4:TX	0/0:.:38:38:0:0%25:17%2C21%2C0%2C0:.	
0/1:.:137:79:58:42.34%25:40%2C39%2C36%2C22:ENST00000423619.2|0.0,ENST00000458555.1|0.0510985,ENST00000624652.1|0.0,ENST00000624697.4|0.0,ENST00000649529.1|4.35964

Here are the differences:

GPV=1E0;SPV=1.9783E-8;
GPV=1.0;SPV=1.9783e-08;

chr1:g.1008988C>T
chr1%3Ag.1008988C>T

0/0:.:38:38:0:0%:17,21,0,0
0/0:.:38:38:0:0%25:17%2C21%2C0%2C0

The first one is OK, but the second and, especially, the third are problematic. Could you fix them, please? If you need any additional information, we'll provide it.
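
For reference, the three differences above can be reproduced with the Python standard library alone. This is a minimal sketch, not VAtools code: it assumes the numeric INFO values are round-tripped through a Python float and that the other changes are plain percent-encoding. The helper name `decode` is made up for illustration.

```python
from urllib.parse import unquote


def roundtrip_float(text: str) -> str:
    """Parse a number and print it back the way Python formats floats."""
    return str(float(text))


def decode(value: str) -> str:
    """Undo percent-encoding such as %3A (':'), %25 ('%') and %2C (',')."""
    return unquote(value)


# First difference: only the notation changes, not the value.
print(roundtrip_float("1E0"))        # 1.0
print(roundtrip_float("1.9783E-8"))  # 1.9783e-08

# Second and third differences: decoding recovers the input text.
print(decode("chr1%3Ag.1008988C>T"))
print(decode("0/0:.:38:38:0:0%25:17%2C21%2C0%2C0"))
```

If these assumptions hold, no information is lost in the output, but downstream tools that do not decode percent-encoded values would see different strings.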

@susannasiebert
Collaborator

This is the expected behavior of the VCF parser that VAtools uses to handle special characters. They are encoded to ensure that they don't conflict with characters that have a native meaning in the VCF format. Saving values as percentages in VCF files is generally not recommended: use decimals instead, or, if you need percentages, leave off the % sign.
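
One way to follow this advice is to pre-process the VCF before annotation so that FREQ holds a decimal fraction instead of a percentage. The sketch below is an assumption about how that could be done, not part of VAtools; the function names `percent_to_fraction` and `fix_sample` are hypothetical, and the FREQ header line may also need its `Type` adjusted to match.

```python
from decimal import Decimal


def percent_to_fraction(value: str) -> str:
    """Convert '42.34%' to '0.4234'; leave values without '%' untouched."""
    if not value.endswith("%"):
        return value
    # Decimal avoids binary floating-point noise such as 0.42340000000000005.
    return format(Decimal(value[:-1]) / Decimal(100), "f")


def fix_sample(format_keys: str, sample: str, key: str = "FREQ") -> str:
    """Rewrite one sample column, converting the given key if present."""
    keys = format_keys.split(":")
    values = sample.split(":")
    if key in keys and keys.index(key) < len(values):
        i = keys.index(key)
        values[i] = percent_to_fraction(values[i])
    return ":".join(values)


print(fix_sample("GT:GQ:DP:RD:AD:FREQ:DP4", "0/1:.:137:79:58:42.34%:40,39,36,22"))
# 0/1:.:137:79:58:0.4234:40,39,36,22
```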

@susannasiebert
Collaborator

This specific difference:

17,21,0,0
17%2C21%2C0%2C0

could be explained by the header definition for the DP4 field not being set up correctly. What is the header line for the DP4 field?
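
To illustrate why the header definition matters, here is a simplified model of the suspected behavior. It is not the parser's actual code, and both header lines are hypothetical because the real DP4 header was not posted in this thread. The assumption is that a field declared with `Number=1` is treated as a single value, so its commas are escaped, while a field declared with `Number=4` is split on commas and written back unchanged.

```python
import re


def header_number(header_line: str) -> str:
    """Extract the Number= attribute from a ##FORMAT or ##INFO header line."""
    match = re.search(r"Number=([^,>]+)", header_line)
    if match is None:
        raise ValueError("no Number= attribute found")
    return match.group(1)


def serialize(raw: str, number: str) -> str:
    """Simplified model: escape commas only when a single value is declared."""
    if number == "1":
        return raw.replace(",", "%2C")
    return ",".join(raw.split(","))


# Hypothetical header lines, for illustration only.
single = '##FORMAT=<ID=DP4,Number=1,Type=String,Description="Strand read counts">'
multi = '##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="Strand read counts">'

print(serialize("17,21,0,0", header_number(single)))  # 17%2C21%2C0%2C0
print(serialize("17,21,0,0", header_number(multi)))   # 17,21,0,0
```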
