-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Data pipeline output is sometimes unreadable by RDKit #183
Comments
The problem is that SMILES have an unfortunated variaty of number labelling (they are not principally for ring although that is the typical use case): RDKit will except all this form and I think there is no canonical form. |
Also, the %21 probably means that you have that many rings... On the other hand, the percentage of such SMILES is fairly low, also in PubChem, I believe. |
RDKit always parses %813 as [%81, 3] from what I can tell, so %2112 is equivalent to 12%21. The OpenSMILES specification (and the original Daylight specification, from what I can tell) disallows three-digit ring numbers, which should remove any ambiguity. Take the SMILES
Yes, it is not a huge problem in practice! It only affects certain SMILES for molecules with more than 10 rings. #184 should nevertheless fix the issue, small as it may be. |
You have it backwards. Specifications do not matter. What matters is that he code can deal with number labels as they appear in the wild. The comment you have deleted in your PR actually hints to that. The code needs to
E.g. |
The ring number reordering in the data pipeline occasionally breaks SMILES strings for molecules with many rings.
This SMILES...
COC1CC(=O)CC23C4=c5c6c7c8c9c%10c%11c%12c%13c%14c%15c%16c%17c%18c%19c%20c(c%21c%22c%23c(c5c5c6c6c8c%11c8c%11c%12c%12c%14c(c%15%19)c%14c%20c%22c%15c%19c%23c5c(c68)c%19c%11c%12c%14%15)C%2112)C%18C3C1(C4)C(OC)CC(=O)CC%171C(=C9C7)C%16C%10%13
...becomes this:
COC1CC(=O)CC23C4=c5c6c7c8c9c%10c%11c%12c%13c%14c%15c%16c%17c%18c%19c%20c(c%21c%22c%23c(c5c5c6c6c8c%11c8c%11c%12c%12c%14c(c%15%19)c%14c%20c%22c%15c%19c%23c5c(c68)c%19c%11c%12c%14%15)C%21%12)C%18C3C1(C4)C(OC)CC(=O)CC1%17C(=C9C7)C%16C%10%13
This occurs for 7 molecules in ChEMBL34, including the one above, when using RDKit 2024.09.4. The issue here is that %2112 is changed to %21%12 rather than 12%21.
The text was updated successfully, but these errors were encountered: