-
Notifications
You must be signed in to change notification settings - Fork 136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Case insensitivity of sheet, column, and setting names; plus more performance improvements #746
Open
lindsay-stevens
wants to merge
16
commits into
XLSForm:master
Choose a base branch
from
lindsay-stevens:pyxform-738
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
- validators/pyxform/choices.py: - rename function to make it clearer that data is being modified - add docstring, use constants / truthy-checks / tuple instead of list - xls2json.py: - use truthy-checks - use 'meta_children' list directly instead of creating 'survey_meta' list then unpacking into 'meta_children'. - xls2json_backends.py: - only make a new string (via replace) if nbsp is found
- move functions associated with dealias_and_group_headers to new module at pyxform/parsing/sheet_headers.py - avoid processing headers more than once - check for missing required headers - check for duplicates via aliases (e.g. name / value) - aliases.py - replace strings with "::" tokens, with tuples split "::", since that is all that the values are used for - since headers are normalised to snake_case, de-duplicate aliases - utils.py - use shared re_whitespace from xls2json in xls2json_backends.py - xls2json.py - use SurveyElement classes for expected columns, which helps skip some string processing if the header is already a match - skip processing external_choices and OSM sheet if no data provided - require "name" column on choices sheet - require "type" column on survey sheet - test_bug_missing_headers.py - move test cases over to similar scenarios in test_sheet_columns.py - test_translations.py - update test to match external-facing column names (e.g. errors/warnings should now say "image" instead of "media::image") - test_sheet_columns.py - add test cases for aliases and header processing
- xls2json_backends.py - xls and xlsx have provided sheet header info for a long time, but it was not used for xls2json header processing, and having it consistently available will allow for more reliable behaviour - all file types now include sheet names for spell checks - sheet names lowercased during this processing instead of xls2json - md and csv now provide header info - csv might need some further work but there are not many tests to support that refactoring, and csv is not the focus of this commit, and csv is not a primary (or well documented) file type. - add get_xlsform for more consistent file handling between xls2xform.convert (used by external API tests) and xls2json.parse_file_to_json (used by some internal API tests) - utils.py - update external_choices_to_csv to expect DefinitionData - move RE_WHITESPACE to xls2json_backends.py to avoid circular import - xls2json.py - update workbook_to_json to expect DefinitionData - remove parse_file_to_workbook_dict replaced by get_xlsform and associated functions - fold get_filename since this func is one line used in one place - test_xls2json.py - update SURVEY fixture md to include settings, because the markdown behaviour was updated to match xls/xlsx, which is to assume that a doc with one sheet and an unknown name is the "survey". So with "settings" there, and "survey" misspellings are warned about. - also, on line 336 the test is 'Surve' because the sheet_names passed to DefinitionData is verbatim (before lowercasing), which makes the warnings/errors more accurate.
- in these cases the type is markdown which skips parsing attempts for xlsx, xls, etc.
- use of value.strip() in is_empty and _md_strp_cell creates a new string but isspace() does the same without creating a new string. Also, pre-check empty strings before isspace(). - folded markdown row process functions into _md_table_to_ss_structure, since process_md_data did a split + iterate + join, only to pass it to _md_table_to_ss_structure which did split + iterate (twice). Now it should only be split + iterated once. There is still an extra iteration in list_to_dicts to pair row tuples with headers, which could potentially be folded in too.
- xls2json.py: - process optional sheets (all but survey) only if they have data - move entities spell check from entities_parsing.py because the entity declaration is only processed if there is data - move workbook_to_json regexes to module to compile only once - delete references to original workbook_dict to allow gc - parsing/sheet_headers.py: - use the provided sheet_header info instead of gathering from rows - if no sheet_headers provided (e.g. dict xlsform with no survey_header etc) then guess headers from first 100 rows. - add docstrings, spruce up error messages, add tests for each error - validators/pyxform/choices.py: - row may not always contain "__row" key so pop instead of del
- for truthiness checks on class instances, if there is no __bool__ defined then __len__ is used, and in this case __len__ calls get_slot_names. The latter returns a static value, but it's extra function calls vs. just True.
- the Question subclasses had almost the same code in each build_xml so the common parts are now in Question._build_xml. - avoids duplicate call to self.xml_label_and_hint in InputQuestion - avoids attempting insert_xpaths for control tags (see inline comment) - avoids modifying the control dict only to splat it into node, instead set the attributes directly as node() eventually does. - applies the same pattern to xml_instance and xml_action - removed UploadQuestion._get_media_type because this was just getting a value from the control_dict, which already would go into the node. - also collapse split string in test_builder.py
- type is checked during __init__, and it shouldn't change after that
- for choices, an Itemset class is introduced which allows calculating and storing metadata about the choice list more consistently. In particular, working out whether itext is required only once. - the initial motivation was the if block in survey.py line 391, which for a form with many choice lists resulted in a large slow down due to something approaching O(n*m) + costs from string join/split. - this allowed removing: - "_itemset*" properties from MultipleChoiceQuestion - _search_lists property from Survey - functions is_label_dynamic and has_dynamic_label - in the course of implementing this it became clear that actually the Tag class never has choices so code for that is removed. The Tag class is like an Option with respect to OsmUploadQuestion. Added some OSM-related test cases to check it's not broken. - similarly since choices/itext are handled by Survey, the MultipleChoiceQuestion class doesn't need choices children either, just a reference to the Itemset for XML and JSON output. - for instances generation, the above choices changes allowed simplification of _generate_static_instances, and otherwise the changes are mainly to avoid repeating checks or using intermediate lists by instead using generators/yield as much as possible. - test_j2x_creation.py / test_j2x_question.py / strings.ini - updated these tests since they are using internal APIs in a way that diverges significantly from what xls2json currently emits - For example test_select_one_question_multilingual had multi-lang choice labels but the expected XML string had a reference like "/test/qname/a:label" which implies choice itemsets aren't shared which has not been the case for a while. - tried to make these tests more useful by adding xpath assertions, and unlike other tests using ss_structure they may be useful for validating/showing what dict structure xlsforms can be provided as. - test_j2x_xform_build_preparation.py - removed this test since it's not clear what the expectation is. If it was intended to check that identical choice lists from separate questions are merged, then that functionality doesn't exist, and the choices should not be provided separately per question anyway. - test_dynamic_default.py / test_translations.py - updated performance test figures. - translation test benefits most from the choices changes because it has lots of choice lists. Increase in memory most likely due to Itemset instances that now wrap each choice list. - dynamic_default slightly faster due to the instances changes and earlier commits today (e.g. not calling xml_label_or_hint twice for input/text questions, etc).
- this test seems to be about serdes / round-tripping so calling validate twice for each form is probably unnecessary
- also re-saved text_and_integer.xlsx since openpyxl warns about an "unrecognised extension" that is ignored - seems to be something from the macos version of excel.
- new test to check for case-insensitive sheet names and headers - xlsjson_backends.py: fix csv processing to match other input types, i.e. strip values and lowercase sheet name keys. - xpath_helpers: - add helpers for XPaths for entities, questions, settings assertions used in new test_case_insensitivity that existed elsewhere already - apply these helpers to existing tests using the same XPaths - entities_parsing.py: use enum values so that to_json_dict repr uses the value rather than something like <EntityColumns.DATASET> etc. - xls2json.py: include EntityColumns since these are input columns, but they are placed into a "parameters" dict rather than slotted attrs
- these test modules are about the same topic
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Closes #738
Closes #731
Closes #572
Why is this the best possible solution? Were any other approaches considered?
The ambition is to fix various sheet name and header case-insensitivity / processing issues by:
DefinitionData
which has the sheet headers, sheet data, and original sheet names (for spell check).list[list[str]]
rather thanlist[dict]
. The xls2json_backends are convertinglist[list[str]]
tolist[dict]
, only for thoselist[dict]
to be re-keyed as a newlist[dict]
with processed headers via xls2json.choices
sheet need to be verbatim in order for choice_filter expressions to work.Test cases that justify closing issues:
Also includes a bunch of performance improvements and/or tidying found along the way, detailed in commit messages. These should help all types of XLSForms, but in particular ones a) with labels or hints, or b) lots of choice lists, or c) written in markdown.
What are the regression risks?
If someone is using workbook_to_json directly, or passing in dict to
xls2xform.convert
, then they'll now need to conform to the DefinitionData structure. This could be as simple asDefinitionDict(**workbook_dict)
or might require specifying the sheet headers.Does this change require updates to documentation? If so, please file an issue here and include the link below.
No
Before submitting this PR, please make sure you have:
tests
python -m unittest
and verified all tests passruff format pyxform tests
andruff check pyxform tests
to lint code