Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed parsing bug for USI continaining a colon in the interpretation #19

Merged
merged 4 commits into from
Dec 19, 2024

Conversation

douweschulte
Copy link
Contributor

I changed the parsing to recognise the provinance identifiers to split the last remaining piece (interpretation+provinance) based on that. I first explored a stack based approach but that quickly grew in complexity. Additionally the specification is very unclear about provinances, but I think you know that as well, so I did not want to assume anything about the validity of using any kind of brackets in the provinance. If the specification was updated to disallow any kind of bracket (all of "()[]{}<>") the code could be changed to look for the last colon, but if it finds any bracket before it finds a colon it returns the full tail as the interpretation.

src/io/usi.rs Outdated Show resolved Hide resolved
@mobiusklein
Copy link
Owner

mobiusklein commented Dec 18, 2024

Thank you for adding this. I think the intent was that the string would be parsed from left to right with tokenization going the whole way, instead of hacky splitting on :, but provenance identifiers are a specific use-case that I don't really understand yet.

Your solution looks good, but please see my note about rsplit_once vs split_once.

@douweschulte
Copy link
Contributor Author

My main reason for giving up on parsing colons is that the specification for the provenance identifiers is fully unclear, so it is not specified if the provenance id can contain any kind of bracket. Additionally parsing the bracket structure of pro forma is doable but quite complicated, you need to support arbitrarily deep nesting but also ignore any bracket type that is not the outer bracket. For example [Info:<<{{(([[Have fun parsing!]]] is a valid text in proforma, but [Info:<<{{(([[Have fun parsing!] is not. So the only solid structure I see is the guarantee that :XX- is the start of the provenance id. Yes a pro forma entry with this text somewhere could be made, but that goes awry less often then the current implementation (which fails as soon as any colon comes on stage).

@mobiusklein
Copy link
Owner

Okay, so that short-circuit option won't work, but shouldn't it still use rsplit_once regardless in order to avoid carving up the first CURIE tag in the interpretation?

Splitting mzspec:PXD000561:Adult_Frontalcortex_bRP_Elite_85_f09:scan:17555:VLHPLEGAVVIIFK/2:PR-G47 is fine, you get ("VLHPLEGAVVIIFK/2", "PR-G47")

Splitting mzspec:PXD000561:Adult_Frontalcortex_bRP_Elite_85_f09:scan:17555:VLH[UNIMOD:1]PLEGAVVIIFK/2:PR-G47 will split the sequence, not the identifier, ("VLH[UNIMOD", "1]PLEGAVVIIFK/2:PR-G47"). With rsplit_once you get ("VLH[UNIMOD:1]PLEGAVVIIFK/2", "PR-G47")

@douweschulte
Copy link
Contributor Author

Yes indeed it should be. I will update it.

@mobiusklein
Copy link
Owner

Okay, I can merge whenever you're ready then.

@douweschulte
Copy link
Contributor Author

I realised I removed the repository code by splitting on it, so I took the liberty of making it into an enum and using that to better represent all possible repositories. With this I am ready for a merge.

@mobiusklein mobiusklein merged commit 417751a into mobiusklein:main Dec 19, 2024
3 checks passed
@douweschulte douweschulte deleted the usi-fix branch December 19, 2024 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants