Update README for improved chemical formula parser #207

Open
bertiewooster opened this issue Mar 20, 2022 · 8 comments

@bertiewooster

I am working on updating README.rst for the improved formula parsing in #205. I have a few questions about the updated parser, which no longer accepts malformed chemical formulas such as "Ch4": ChemPy now raises a ParseError rather than silently stopping at the last valid element (that formula was previously parsed as C, i.e. carbon):

  • Should we call this a breaking change? I'm thinking not, because it doesn't break any valid chemical formulas (that I'm aware of). Maybe just call it an improved parser...
  • Should we note the version as of which this change was made? If so, what do we plan to name the next version?
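The behaviour change described above can be illustrated with a small standalone sketch. This is not ChemPy's parser: the element table `KNOWN`, the `FormulaError` class, and the functions `parse_lenient` and `parse_strict` are all hypothetical names made up for illustration. It only shows the difference between stopping at the last valid element and rejecting the whole formula.

```python
import re

# Hypothetical mini element table; a real parser knows the full periodic table.
KNOWN = {"C", "H", "O", "N", "Ca", "Fe", "Mg", "Zn"}
NUMBER = re.compile(r"\d+(?:\.\d+)?")


class FormulaError(ValueError):
    """Stand-in error type (hypothetical; not ChemPy's own exception class)."""


def _scan(formula):
    """Consume element/count pairs from the left; return (composition, stop index)."""
    comp, pos = {}, 0
    while pos < len(formula):
        # Prefer a two-letter symbol, then fall back to a one-letter symbol.
        sym = next(
            (formula[pos:pos + n] for n in (2, 1) if formula[pos:pos + n] in KNOWN),
            None,
        )
        if sym is None:
            break
        pos += len(sym)
        m = NUMBER.match(formula, pos)
        if m:
            count, pos = float(m.group()), m.end()
        else:
            count = 1.0
        comp[sym] = comp.get(sym, 0.0) + count
    return comp, pos


def parse_lenient(formula):
    """Old-style behaviour: silently keep whatever parsed before the bad part."""
    comp, _ = _scan(formula)
    return comp


def parse_strict(formula):
    """New-style behaviour: reject the formula unless all of it was consumed."""
    comp, pos = _scan(formula)
    if pos != len(formula):
        raise FormulaError(f"cannot parse {formula!r} at position {pos}")
    return comp


print(parse_lenient("Ch4"))  # {'C': 1.0}
print(parse_strict("CH4"))   # {'C': 1.0, 'H': 4.0}
```

With the strict variant, `parse_strict("Ch4")` raises instead of quietly returning carbon, which is why a valid formula is unaffected while a typo is now surfaced.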

Also, is it all right if I append to .gitignore

.vscode/

so that Visual Studio Code configuration files will be ignored?

@bertiewooster
Author

I'm assuming there is no doctest for README.rst? So I'm manually testing docstrings by running them in a temporary doc_testing.py file, having forked from bjodah/chempy after @jeremyagray merged the parsing improvements.

When I try to run an example from chempy/util/tests/test_parsing.py:

from chempy import Substance
Substance.from_formula("Ca2.832Fe0.6285Mg5.395(CO3)6").composition

I get a KeyError:

Exception has occurred: KeyError
'.'
  File "/Users/jemonat/Projects/chempy/chempy/util/parsing.py", line 650, in <genexpr>
    lambda x: "".join(_unicode_sub[str(_)] for _ in x),
  File "/Users/jemonat/Projects/chempy/chempy/util/parsing.py", line 650, in <lambda>
    lambda x: "".join(_unicode_sub[str(_)] for _ in x),
  File "/Users/jemonat/Projects/chempy/chempy/util/parsing.py", line 545, in <lambda>
    string += re.sub(r"([0-9]+\.[0-9]+|[0-9]+)", lambda m: sub(m.group(1)), stoich)
  File "/Users/jemonat/Projects/chempy/chempy/util/parsing.py", line 545, in _formula_to_format
    string += re.sub(r"([0-9]+\.[0-9]+|[0-9]+)", lambda m: sub(m.group(1)), stoich)
  File "/Users/jemonat/Projects/chempy/chempy/util/parsing.py", line 649, in formula_to_unicode
    return _formula_to_format(
  File "/Users/jemonat/Projects/chempy/chempy/chemistry.py", line 190, in from_formula
    unicode_name=formula_to_unicode(formula),
  File "/Users/jemonat/Projects/chempy/doc_testing.py", line 3, in <module>
    Substance.from_formula("Ca2.832Fe0.6285Mg5.395(CO3)6").composition
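The traceback suggests that each character of a matched number is looked up in a digit-to-subscript table, and that the table has no entry for the decimal point. A minimal standalone sketch of that failure mode follows; `SUBSCRIPT_DIGITS` and `to_subscript` are hypothetical stand-ins, and the real `_unicode_sub` table may hold different contents.

```python
# Hypothetical lookup table covering only the ten subscript digits.
SUBSCRIPT_DIGITS = dict(zip("0123456789", "₀₁₂₃₄₅₆₇₈₉"))


def to_subscript(number_text):
    """Map each character through the table; unknown characters raise KeyError."""
    return "".join(SUBSCRIPT_DIGITS[ch] for ch in number_text)


print(to_subscript("6"))  # ₆

try:
    to_subscript("2.832")
except KeyError as exc:
    # The "." in a non-integer stoichiometry has no subscript counterpart.
    print("KeyError:", exc)  # KeyError: '.'
```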

@jeremyagray
Collaborator

jeremyagray commented Mar 20, 2022

That test got left out of the unicode and HTML sections but not the composition and latex sections. The good news is that it works for HTML and it should be similarly fixable for unicode.

The bad news is that it appears impossible to represent in unicode as they have apparently neglected to include subscript and superscript punctuation like . and , despite the necessity for representing subscript and superscript decimals. It's not like two more characters would break unicode. The other solutions I've seen suggested for this problem are to hijack some other diacritical symbol that's good enough or use a space but I dislike both of those because they're hacky, not standard, and possibly difficult to read.

I suppose a solution could be to fail early with a "unicode is broken" exception or use a regular decimal point until a better course of action presents itself. Suggestions welcome; I'll push something to patch it.

EDIT: I have a working "use a regular decimal point" fix; any better solution can be quickly dropped in its place.
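As a rough sketch of the "use a regular decimal point" idea (not the actual patch, whose code is not shown in this thread), the lookup table can simply be given an entry for `.`. The names `SUBSCRIPT_DIGITS`, `to_subscript`, and the `point` parameter are assumptions made for illustration; `point` is there so that a different replacement could be dropped in later.

```python
# Hypothetical digit table; the real table in the library may differ.
SUBSCRIPT_DIGITS = dict(zip("0123456789", "₀₁₂₃₄₅₆₇₈₉"))


def to_subscript(number_text, point="."):
    """Subscript the digits and pass the decimal point through as `point`."""
    table = dict(SUBSCRIPT_DIGITS)
    table["."] = point  # Unicode has no subscript full stop, so substitute one
    return "".join(table[ch] for ch in number_text)


print(to_subscript("2.832"))  # ₂.₈₃₂
print(to_subscript("6"))      # ₆
```

The result mixes subscript digits with a full-height point, which is the readability trade-off discussed above.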

@bjodah
Owner

bjodah commented Mar 29, 2022

@bertiewooster sure, just add .vscode to the .gitignore file, no worries there!

I'm fine with hijacking a unicode character which looks "approximately like a subscript point".

@bertiewooster
Author

Thanks for the guidance, @bjodah. Will do; I'll add .vs to the .gitignore file too, just in case any contributor is using Visual Studio Professional.

Hopefully @jeremyagray also has the guidance needed to help push this across the finish line for a release!

@bertiewooster
Author

Hi, just following up to check whether @jeremyagray can address the last remaining coding issue (that I'm aware of) before a release by him or @bjodah: hijacking a unicode character that looks "approximately like a subscript point". I can then incorporate that code change into my forked branch and finish updating the README. Thanks!

@spizwhiz

Checking in on this - my code is currently broken because apparently v0.8.3 still cannot accept non-integer stoichiometry?

You state: "The good news is that it works for HTML and it should be similarly fixable for unicode"

How can I force HTML or LaTeX output so that I don't get the KeyError with unicode?
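For plotting or reporting, one possible stopgap is to render the formula outside the Unicode path. The sketch below is independent of ChemPy and uses a hypothetical helper, `formula_to_html_sketch`, built on the same number pattern that appears in the traceback above. It wraps every number in `<sub>` tags, so it is only suitable for simple formulas without charges or leading coefficients.

```python
import re

# Same pattern as in the traceback: a decimal number or a plain integer.
NUMBER = re.compile(r"([0-9]+\.[0-9]+|[0-9]+)")


def formula_to_html_sketch(formula):
    """Wrap each stoichiometric number in <sub>...</sub> (hypothetical helper)."""
    return NUMBER.sub(r"<sub>\1</sub>", formula)


print(formula_to_html_sketch("Ca2.832Fe0.6285Mg5.395(CO3)6"))
# Ca<sub>2.832</sub>Fe<sub>0.6285</sub>Mg<sub>5.395</sub>(CO<sub>3</sub>)<sub>6</sub>
```

Because HTML subscripts are markup rather than dedicated characters, the decimal point needs no special handling here.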

jeremyagray added a commit to jeremyagray/chempy that referenced this issue Sep 23, 2023
Translate a Unicode decimal point in a subscript as a text decimal
point.  Use `.` as the decimal point operator and `..` as the hydrate
operator.  Add zinc nitrate example from bjodah#207.

Signed-off-by: Jeremy A Gray <[email protected]>
@jeremyagray
Collaborator

I think this branch addresses your problem, with the above caveats about the lack of a proper Unicode symbol. There's some more related discussion in #223.

@spizwhiz

Thanks @jeremyagray. I ended up reverting to a much older version of chempy to solve my immediate problems. Looking at the notes in #223, I kind of like the "" identifier for crystal water, but ".." is also fine IMO. It's easy enough to replace the ".." or "" symbols for plotting and reporting.
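A minimal sketch of that replacement step, assuming the `..` hydrate separator from the commit message: the helper names `split_hydrate` and `display` are hypothetical, and the formula `Zn(NO3)2..6H2O` is an illustrative guess at a zinc nitrate hydrate rather than the exact example added in the commit.

```python
def split_hydrate(formula, sep=".."):
    """Split a formula into its main part and any hydrate parts."""
    main, *hydrates = formula.split(sep)
    return main, hydrates


def display(formula, sep="..", shown="\u00b7"):
    """Swap the hydrate separator for a middle dot for plots and reports."""
    return formula.replace(sep, shown)


print(split_hydrate("Zn(NO3)2..6H2O"))         # ('Zn(NO3)2', ['6H2O'])
print(display("Zn(NO3)2..6H2O"))               # Zn(NO3)2·6H2O
print(display("Ca2.832Fe0.6285Mg5.395(CO3)6")) # unchanged: single dots are decimals
```

Using two characters for the hydrate separator keeps a single `.` unambiguous as a decimal point, so non-integer stoichiometries pass through untouched.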

bjodah pushed a commit that referenced this issue Apr 23, 2024
Translate a Unicode decimal point in a subscript as a text decimal
point.  Use `.` as the decimal point operator and `..` as the hydrate
operator.  Add zinc nitrate example from #207.

Signed-off-by: Jeremy A Gray <[email protected]>