Fixed parsing bug for USI containing a colon in the interpretation #19

douweschulte · 2024-12-17T15:49:50Z

I changed the parsing to recognise the provenance identifiers and to split the last remaining piece (interpretation + provenance) on them. I first explored a stack-based approach, but that quickly grew in complexity. Additionally, the specification is very unclear about provenances (I think you know that as well), so I did not want to assume anything about the validity of any kind of bracket in the provenance. If the specification were updated to disallow every kind of bracket (all of "()[]{}<>") in the provenance, the code could be changed to look for the last colon: if, searching from the end, it finds any bracket before it finds a colon, it returns the full tail as the interpretation.
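A minimal sketch of the fallback described above, for the hypothetical case where the specification forbids brackets in the provenance. The function name `split_tail_at_last_colon` and the return shape are assumptions made for illustration; this is not the code in the PR.

```rust
/// Hypothetical helper: split `tail` (interpretation plus optional provenance)
/// at the last colon, unless a bracket is met first when scanning from the end.
/// Assumes the provenance may not contain any of "()[]{}<>".
fn split_tail_at_last_colon(tail: &str) -> (&str, Option<&str>) {
    const BRACKETS: &str = "()[]{}<>";
    for (idx, ch) in tail.char_indices().rev() {
        if ch == ':' {
            // Everything after the last colon is treated as the provenance.
            return (&tail[..idx], Some(&tail[idx + 1..]));
        }
        if BRACKETS.contains(ch) {
            // A bracket before any colon: the tail ends inside or right after
            // the interpretation, so no provenance is present.
            return (tail, None);
        }
    }
    // No colon and no bracket at all: the tail is only an interpretation.
    (tail, None)
}
```

Under that assumption, a tail such as `VLH[UNIMOD:1]PLEGAVVIIFK/2` is returned whole, because the `]` is reached before the colon inside the modification.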

src/io/usi.rs

mobiusklein · 2024-12-18T03:36:32Z

Thank you for adding this. I think the intent was that the string would be parsed from left to right with tokenization going the whole way, instead of hacky splitting on :, but provenance identifiers are a specific use-case that I don't really understand yet.

Your solution looks good, but please see my note about rsplit_once vs split_once.

douweschulte · 2024-12-18T19:49:57Z

My main reason for giving up on parsing colons is that the specification for the provenance identifiers is entirely unclear: it does not say whether the provenance ID can contain any kind of bracket. Additionally, parsing the bracket structure of ProForma is doable but quite complicated: you need to support arbitrarily deep nesting but also ignore any bracket type that is not the outer bracket. For example, [Info:<<{{(([[Have fun parsing!]]] is valid text in ProForma, but [Info:<<{{(([[Have fun parsing!] is not. So the only solid structure I see is the guarantee that :XX- is the start of the provenance ID. Yes, a ProForma entry containing this text somewhere could be made, but that goes awry less often than the current implementation (which fails as soon as any colon comes on stage).
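The `:XX-` marker idea can be sketched as follows. This sketch assumes `XX` means exactly two ASCII uppercase letters and takes the rightmost match. Both are assumptions: the commit titles indicate the PR uses a hardcoded list of provenance identifiers instead. The helper names are hypothetical.

```rust
/// Hypothetical helper: byte index of the last `:XX-` marker in `tail`,
/// where `XX` is assumed to be two ASCII uppercase letters.
fn find_provenance_marker(tail: &str) -> Option<usize> {
    let bytes = tail.as_bytes();
    if bytes.len() < 4 {
        return None;
    }
    // Walk candidate start positions from right to left.
    (0..=bytes.len() - 4).rev().find(|&i| {
        bytes[i] == b':'
            && bytes[i + 1].is_ascii_uppercase()
            && bytes[i + 2].is_ascii_uppercase()
            && bytes[i + 3] == b'-'
    })
}

/// Split the tail into (interpretation, optional provenance) at the marker.
fn split_on_provenance(tail: &str) -> (&str, Option<&str>) {
    match find_provenance_marker(tail) {
        // `:` is ASCII, so slicing at `i` and `i + 1` stays on char boundaries.
        Some(i) => (&tail[..i], Some(&tail[i + 1..])),
        None => (tail, None),
    }
}
```

As noted in the comment above, an interpretation that itself contains text shaped like `:XX-` (for example inside an `Info:` tag) would still be split in the wrong place by this approach.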

mobiusklein · 2024-12-19T13:15:59Z

Okay, so that short-circuit option won't work, but shouldn't it still use rsplit_once regardless in order to avoid carving up the first CURIE tag in the interpretation?

Splitting mzspec:PXD000561:Adult_Frontalcortex_bRP_Elite_85_f09:scan:17555:VLHPLEGAVVIIFK/2:PR-G47 is fine, you get ("VLHPLEGAVVIIFK/2", "PR-G47")

Splitting mzspec:PXD000561:Adult_Frontalcortex_bRP_Elite_85_f09:scan:17555:VLH[UNIMOD:1]PLEGAVVIIFK/2:PR-G47 will split the sequence, not the identifier, ("VLH[UNIMOD", "1]PLEGAVVIIFK/2:PR-G47"). With rsplit_once you get ("VLH[UNIMOD:1]PLEGAVVIIFK/2", "PR-G47")
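The difference between the two standard-library methods can be reproduced with a small sketch. The helper `interpretation_tail` is hypothetical and assumes the first five colon-separated fields of the USI never contain a colon themselves. `str::split_once` and `str::rsplit_once` are the real standard-library calls.

```rust
/// Hypothetical helper: everything after the first five colon-separated
/// fields (mzspec, collection, run, index type, index).
fn interpretation_tail(usi: &str) -> Option<&str> {
    usi.splitn(6, ':').nth(5)
}

/// Split at the FIRST colon: cuts through a CURIE such as `UNIMOD:1`.
fn split_first(tail: &str) -> Option<(&str, &str)> {
    tail.split_once(':')
}

/// Split at the LAST colon: keeps the interpretation intact.
fn split_last(tail: &str) -> Option<(&str, &str)> {
    tail.rsplit_once(':')
}
```

For the tail `VLH[UNIMOD:1]PLEGAVVIIFK/2:PR-G47`, `split_first` yields `("VLH[UNIMOD", "1]PLEGAVVIIFK/2:PR-G47")` while `split_last` yields `("VLH[UNIMOD:1]PLEGAVVIIFK/2", "PR-G47")`.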

douweschulte · 2024-12-19T14:15:21Z

Yes indeed it should be. I will update it.

mobiusklein · 2024-12-19T14:23:29Z

Okay, I can merge whenever you're ready then.

douweschulte · 2024-12-19T14:37:51Z

I realised I removed the repository code by splitting on it, so I took the liberty of making it into an enum and using that to better represent all possible repositories. With this I am ready for a merge.
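A rough sketch of what keeping the repository code in an enum could look like. All type, variant, and function names here are hypothetical, and only the `PR` code from the example in this thread is modelled. The actual enum in the PR may differ.

```rust
/// Hypothetical repository enum; only `PR` appears in this thread.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Repository {
    /// The `PR` code from the example provenance `PR-G47`.
    Pr,
    /// Any other code, kept verbatim so no information is lost.
    Other(String),
}

impl Repository {
    fn from_code(code: &str) -> Self {
        match code {
            "PR" => Self::Pr,
            other => Self::Other(other.to_string()),
        }
    }

    /// The textual code, so the provenance can be written back out.
    fn code(&self) -> &str {
        match self {
            Self::Pr => "PR",
            Self::Other(code) => code.as_str(),
        }
    }
}

/// Hypothetical provenance record: repository code plus identifier.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Provenance {
    repository: Repository,
    identifier: String,
}

/// Parse text such as `PR-G47`; returns `None` if there is no hyphen.
fn parse_provenance(text: &str) -> Option<Provenance> {
    let (code, identifier) = text.split_once('-')?;
    Some(Provenance {
        repository: Repository::from_code(code),
        identifier: identifier.to_string(),
    })
}
```

Storing the code in the enum rather than discarding it after the split means the original provenance text can be rebuilt from `code()` and the identifier.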

douweschulte added 2 commits December 17, 2024 16:22

Fixed parsing bug for USI containing a 'shielded' colon.

f50187b

Changed USI parsing of provenances to use hardcoded provenance identifiers

d513733

mobiusklein reviewed Dec 18, 2024

src/io/usi.rs (outdated)

Moved to rsplit_once

2fae995

mobiusklein approved these changes Dec 19, 2024

Handled provenance repositories better

16a938d

mobiusklein merged commit 417751a into mobiusklein:main Dec 19, 2024
3 checks passed

douweschulte deleted the usi-fix branch December 19, 2024 14:44
