Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Case insensitivity of sheet, column, and setting names; plus more performance improvements #746

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

lindsay-stevens
Copy link
Contributor

@lindsay-stevens lindsay-stevens commented Jan 9, 2025

Closes #738
Closes #731
Closes #572

Why is this the best possible solution? Were any other approaches considered?

The ambition is to fix various sheet name and header case-insensitivity / processing issues by:

  • standardising the output data structure from all xls2json_backends processors (and input data to workbook_to_json) on DefinitionData which has the sheet headers, sheet data, and original sheet names (for spell check).
  • using the headers passed in from the backends, so they can be processed once only (rather than repeatedly as before), and then mapping original headers to processed headers.
    • it might make sense to go one step further and require that backends pass the raw content as list[list[str]] rather than list[dict]. The xls2json_backends are converting list[list[str]] to list[dict], only for those list[dict] to be re-keyed as a new list[dict] with processed headers via xls2json.
  • using the SurveyElement class attributes to determine which columns to process and which to leave as-is. For example extra columns in the choices sheet need to be verbatim in order for choice_filter expressions to work.
  • adding relatively exhaustive tests for:
    • each XLSForm input type with all supported sheet names and a sample of headers
    • all the possibilities for header splitting, normalisation, and nesting

Test cases that justify closing issues:

Also includes a bunch of performance improvements and/or tidying found along the way, detailed in commit messages. These should help all types of XLSForms, but in particular ones a) with labels or hints, or b) lots of choice lists, or c) written in markdown.

What are the regression risks?

If someone is using workbook_to_json directly, or passing in dict to xls2xform.convert, then they'll now need to conform to the DefinitionData structure. This could be as simple as DefinitionDict(**workbook_dict) or might require specifying the sheet headers.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

No

Before submitting this PR, please make sure you have:

  • included test cases for core behavior and edge cases in tests
  • run python -m unittest and verified all tests pass
  • run ruff format pyxform tests and ruff check pyxform tests to lint code
  • verified that any code or assets from external sources are properly credited in comments

- validators/pyxform/choices.py:
  - rename function to make it clearer that data is being modified
  - add docstring, use constants / truthy-checks / tuple instead of list
- xls2json.py:
  - use truthy-checks
  - use 'meta_children' list directly instead of creating 'survey_meta'
    list then unpacking into 'meta_children'.
- xls2json_backends.py:
  - only make a new string (via replace) if nbsp is found
- move functions associated with dealias_and_group_headers to new module
  at pyxform/parsing/sheet_headers.py
  - avoid processing headers more than once
  - check for missing required headers
  - check for duplicates via aliases (e.g. name / value)
- aliases.py
  - replace strings with "::" tokens, with tuples split "::", since
    that is all that the values are used for
  - since headers are normalised to snake_case, de-duplicate aliases
- utils.py
  - use shared re_whitespace from xls2json in xls2json_backends.py
- xls2json.py
  - use SurveyElement classes for expected columns, which helps skip
    some string processing if the header is already a match
  - skip processing external_choices and OSM sheet if no data provided
  - require "name" column on choices sheet
  - require "type" column on survey sheet
- test_bug_missing_headers.py
  - move test cases over to similar scenarios in test_sheet_columns.py
- test_translations.py
  - update test to match external-facing column names (e.g.
    errors/warnings should now say "image" instead of "media::image")
- test_sheet_columns.py
  - add test cases for aliases and header processing
- xls2json_backends.py
  - xls and xlsx have provided sheet header info for a long time, but it
    was not used for xls2json header processing, and having it
    consistently available will allow for more reliable behaviour
    - all file types now include sheet names for spell checks
    - sheet names lowercased during this processing instead of xls2json
    - md and csv now provide header info
      - csv might need some further work but there are not many tests
        to support that refactoring, and csv is not the focus of this
        commit, and csv is not a primary (or well documented) file type.
  - add get_xlsform for more consistent file handling between
    xls2xform.convert (used by external API tests) and
    xls2json.parse_file_to_json (used by some internal API tests)
- utils.py
  - update external_choices_to_csv to expect DefinitionData
  - move RE_WHITESPACE to xls2json_backends.py to avoid circular import
- xls2json.py
  - update workbook_to_json to expect DefinitionData
  - remove parse_file_to_workbook_dict replaced by get_xlsform and
    associated functions
  - fold get_filename since this func is one line used in one place
- test_xls2json.py
  - update SURVEY fixture md to include settings, because the markdown
    behaviour was updated to match xls/xlsx, which is to assume that a
    doc with one sheet and an unknown name is the "survey". So with
    "settings" there, and "survey" misspellings are warned about.
  - also, on line 336 the test is 'Surve' because the sheet_names passed
    to DefinitionData is verbatim (before lowercasing), which makes the
    warnings/errors more accurate.
- in these cases the type is markdown which skips parsing attempts for
  xlsx, xls, etc.
- use of value.strip() in is_empty and _md_strp_cell creates a new
  string but isspace() does the same without creating a new string.
  Also, pre-check empty strings before isspace().
- folded markdown row process functions into _md_table_to_ss_structure,
  since process_md_data did a split + iterate + join, only to pass it
  to _md_table_to_ss_structure which did split + iterate (twice). Now
  it should only be split + iterated once. There is still an extra
  iteration in list_to_dicts to pair row tuples with headers, which
  could potentially be folded in too.
- xls2json.py:
  - process optional sheets (all but survey) only if they have data
  - move entities spell check from entities_parsing.py because the
    entity declaration is only processed if there is data
  - move workbook_to_json regexes to module to compile only once
  - delete references to original workbook_dict to allow gc
- parsing/sheet_headers.py:
  - use the provided sheet_header info instead of gathering from rows
  - if no sheet_headers provided (e.g. dict xlsform with no
    survey_header etc) then guess headers from first 100 rows.
  - add docstrings, spruce up error messages, add tests for each error
- validators/pyxform/choices.py:
  - row may not always contain "__row" key so pop instead of del
- for truthiness checks on class instances, if there is no __bool__
  defined then __len__ is used, and in this case __len__ calls
  get_slot_names. The latter returns a static value, but it's extra
  function calls vs. just True.
- the Question subclasses had almost the same code in each build_xml
  so the common parts are now in Question._build_xml.
- avoids duplicate call to self.xml_label_and_hint in InputQuestion
- avoids attempting insert_xpaths for control tags (see inline comment)
- avoids modifying the control dict only to splat it into node, instead
  set the attributes directly as node() eventually does.
  - applies the same pattern to xml_instance and xml_action
- removed UploadQuestion._get_media_type because this was just getting
  a value from the control_dict, which already would go into the node.
- also collapse split string in test_builder.py
- type is checked during __init__, and it shouldn't change after that
- for choices, an Itemset class is introduced which allows calculating
  and storing metadata about the choice list more consistently. In
  particular, working out whether itext is required only once.
  - the initial motivation was the if block in survey.py line 391, which
    for a form with many choice lists resulted in a large slow down due
    to something approaching O(n*m) + costs from string join/split.
  - this allowed removing:
    - "_itemset*" properties from MultipleChoiceQuestion
    - _search_lists property from Survey
    - functions is_label_dynamic and has_dynamic_label
  - in the course of implementing this it became clear that actually
    the Tag class never has choices so code for that is removed. The
    Tag class is like an Option with respect to OsmUploadQuestion. Added
    some OSM-related test cases to check it's not broken.
  - similarly since choices/itext are handled by Survey, the
    MultipleChoiceQuestion class doesn't need choices children either,
    just a reference to the Itemset for XML and JSON output.
- for instances generation, the above choices changes allowed
  simplification of _generate_static_instances, and otherwise the
  changes are mainly to avoid repeating checks or using intermediate
  lists by instead using generators/yield as much as possible.
- test_j2x_creation.py / test_j2x_question.py / strings.ini
  - updated these tests since they are using internal APIs in a way that
    diverges significantly from what xls2json currently emits
    - For example test_select_one_question_multilingual had multi-lang
      choice labels but the expected XML string had a reference like
      "/test/qname/a:label" which implies choice itemsets aren't shared
      which has not been the case for a while.
  - tried to make these tests more useful by adding xpath assertions,
    and unlike other tests using ss_structure they may be useful for
    validating/showing what dict structure xlsforms can be provided as.
- test_j2x_xform_build_preparation.py
  - removed this test since it's not clear what the expectation is. If
    it was intended to check that identical choice lists from separate
    questions are merged, then that functionality doesn't exist, and
    the choices should not be provided separately per question anyway.
- test_dynamic_default.py / test_translations.py
  - updated performance test figures.
  - translation test benefits most from the choices changes because it
    has lots of choice lists. Increase in memory most likely due to
    Itemset instances that now wrap each choice list.
  - dynamic_default slightly faster due to the instances changes and
    earlier commits today (e.g. not calling xml_label_or_hint twice for
    input/text questions, etc).
- this test seems to be about serdes / round-tripping so calling
  validate twice for each form is probably unnecessary
- also re-saved text_and_integer.xlsx since openpyxl warns about an
  "unrecognised extension" that is ignored - seems to be something from
  the macos version of excel.
- new test to check for case-insensitive sheet names and headers
- xlsjson_backends.py: fix csv processing to match other input types,
  i.e. strip values and lowercase sheet name keys.
- xpath_helpers:
  - add helpers for XPaths for entities, questions, settings assertions
    used in new test_case_insensitivity that existed elsewhere already
  - apply these helpers to existing tests using the same XPaths
- entities_parsing.py: use enum values so that to_json_dict repr uses
  the value rather than something like <EntityColumns.DATASET> etc.
- xls2json.py: include EntityColumns since these are input columns, but
  they are placed into a "parameters" dict rather than slotted attrs
@lindsay-stevens lindsay-stevens marked this pull request as ready for review January 15, 2025 18:25
- these test modules are about the same topic
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant