Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data pipeline output is sometimes unreadable by RDKit #183

Open
NicklasOsterbacka opened this issue Jan 17, 2025 · 4 comments
Open

Data pipeline output is sometimes unreadable by RDKit #183

NicklasOsterbacka opened this issue Jan 17, 2025 · 4 comments

Comments

@NicklasOsterbacka
Copy link

The ring number reordering in the data pipeline occasionally breaks SMILES strings for molecules with many rings.

This SMILES...
COC1CC(=O)CC23C4=c5c6c7c8c9c%10c%11c%12c%13c%14c%15c%16c%17c%18c%19c%20c(c%21c%22c%23c(c5c5c6c6c8c%11c8c%11c%12c%12c%14c(c%15%19)c%14c%20c%22c%15c%19c%23c5c(c68)c%19c%11c%12c%14%15)C%2112)C%18C3C1(C4)C(OC)CC(=O)CC%171C(=C9C7)C%16C%10%13
...becomes this:
COC1CC(=O)CC23C4=c5c6c7c8c9c%10c%11c%12c%13c%14c%15c%16c%17c%18c%19c%20c(c%21c%22c%23c(c5c5c6c6c8c%11c8c%11c%12c%12c%14c(c%15%19)c%14c%20c%22c%15c%19c%23c5c(c68)c%19c%11c%12c%14%15)C%21%12)C%18C3C1(C4)C(OC)CC(=O)CC1%17C(=C9C7)C%16C%10%13

This occurs for 7 molecules in ChEMBL34, including the one above, when using RDKit 2024.09.4. The issue here is that %2112 is changed to %21%12 rather than 12%21.

@halx
Copy link
Contributor

halx commented Jan 17, 2025

The problem is that SMILES have an unfortunated variaty of number labelling (they are not principally for ring although that is the typical use case):
8%13 which can also be %813 or even %138 (I believe)
24%12 which can also be %2412 or %1224

RDKit will except all this form and I think there is no canonical form.

@halx
Copy link
Contributor

halx commented Jan 17, 2025

Also, the %21 probably means that you have that many rings... On the other hand, the percentage of such SMILES is fairly low, also in PubChem, I believe.

@NicklasOsterbacka
Copy link
Author

RDKit always parses %813 as [%81, 3] from what I can tell, so %2112 is equivalent to 12%21. The OpenSMILES specification (and the original Daylight specification, from what I can tell) disallows three-digit ring numbers, which should remove any ambiguity.

Take the SMILES c12ccccc1cccc2 as an example. RDKit parses this renumbering just fine:
c%813ccccc%81cccc3
This, on the other hand, has an unclosed ring:
c%813ccccc%13cccc8

Also, the %21 probably means that you have that many rings... On the other hand, the percentage of such SMILES is fairly low, also in PubChem, I believe.

Yes, it is not a huge problem in practice! It only affects certain SMILES for molecules with more than 10 rings. #184 should nevertheless fix the issue, small as it may be.

@halx
Copy link
Contributor

halx commented Jan 18, 2025

You have it backwards. Specifications do not matter. What matters is that he code can deal with number labels as they appear in the wild. The comment you have deleted in your PR actually hints to that.

The code needs to

  1. Normalize the labels such that RDKit doesn't fall over
  2. Extract tokens sensibly such that no extraneous tokens are generated

E.g. CC%139%2312%99 would need to be transformed to CC9%13%23%12%99 with tokens ["C", "9", "%13", "%23", "%12", "%99"]. The current code doesn't handle this yet but your suggestsions makes it even worse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants