Fix series length ordering for string[python] IDs in dataframe validation/conversion #470

Open
dario-fumarola wants to merge 1 commit into amazon-science:main from
dario-fumarola:fix/issue-440-string-python-id-order

@dario-fumarola
Summary

Fixes #440 by making series-length extraction deterministic and aligned with row order, including when id_column uses pandas string[python] dtype.

Root cause

After sorting by (id_column, timestamp_column), the code derived per-series lengths with:
df[id_column].value_counts(sort=False).to_list()
For some ID dtypes (notably string[python]), value_counts can return the counts in an order that does not match the contiguous row blocks of the sorted frame. The lengths are then applied to the wrong rows when slicing timestamps, which can trigger false frequency inference failures.
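To illustrate why the order of the lengths matters, here is a minimal sketch, not the project's actual code. It assumes the lengths are consumed positionally to cut the sorted timestamp column into one chunk per series. The helper names `series_lengths` and `split_timestamps` and the column names `item_id` and `timestamp` are hypothetical.

```python
import numpy as np
import pandas as pd


def series_lengths(df: pd.DataFrame, id_column: str) -> list[int]:
    # groupby(sort=False) yields groups in order of first appearance, so on a
    # frame already sorted by id the sizes follow the contiguous row blocks.
    return df.groupby(id_column, sort=False).size().to_list()


def split_timestamps(df: pd.DataFrame, id_column: str, timestamp_column: str):
    # Sort so that each series occupies one contiguous block of rows.
    df = df.sort_values([id_column, timestamp_column])
    lengths = series_lengths(df, id_column)
    # Cut the timestamp column at the cumulative block boundaries. If the
    # lengths were in a different order than the blocks, each chunk would mix
    # timestamps from neighbouring series.
    boundaries = np.cumsum(lengths)[:-1]
    chunks = np.split(df[timestamp_column].to_numpy(), boundaries)
    return lengths, chunks


example = pd.DataFrame(
    {
        "item_id": pd.array(["b", "b", "b", "a", "a"], dtype="string[python]"),
        "timestamp": pd.to_datetime(
            ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01", "2024-01-02"]
        ),
    }
)
lengths, chunks = split_timestamps(example, "item_id", "timestamp")
```

In this sketch, a chunk that spans two series would contain a jump back in time, which is the kind of input on which frequency inference can fail even though each individual series is regular.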

Changes

  • In validate_df_inputs, replaced
    • df[id_column].value_counts(sort=False).to_list()
    with
    • df.groupby(id_column, sort=False).size().to_list()
  • Applied the same replacement in convert_df_input_to_list_of_dicts_input on the validate_inputs=False path, for consistency.
  • Added regression tests:
    • test_validate_df_inputs_accepts_string_python_ids_with_unequal_lengths
    • test_validate_df_inputs_has_consistent_metadata_for_object_and_string_python_ids
    • test_convert_df_with_validate_inputs_false_handles_string_python_ids
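The dtype-parity check behind these tests can be sketched as follows. This is an illustrative stand-in, not the tests added in this PR: `lengths_in_row_order`, `make_frame`, and the column names are hypothetical, and the real tests exercise validate_df_inputs and convert_df_input_to_list_of_dicts_input rather than a standalone helper.

```python
import pandas as pd


def lengths_in_row_order(df: pd.DataFrame, id_column: str) -> list[int]:
    # Same expression as the replacement described above.
    return df.groupby(id_column, sort=False).size().to_list()


def make_frame(id_dtype: str) -> pd.DataFrame:
    # Three series of unequal length (3, 5, 2), already sorted by id and time.
    sizes = {"A": 3, "B": 5, "C": 2}
    ids, stamps = [], []
    for name, n in sizes.items():
        ids.extend([name] * n)
        stamps.extend(pd.date_range("2024-01-01", periods=n, freq="D"))
    return pd.DataFrame(
        {
            "item_id": pd.Series(ids, dtype=id_dtype),
            "timestamp": stamps,
            "target": range(len(ids)),
        }
    )


def test_lengths_match_for_object_and_string_python_ids() -> None:
    object_lengths = lengths_in_row_order(make_frame("object"), "item_id")
    string_lengths = lengths_in_row_order(make_frame("string[python]"), "item_id")
    # Both dtypes should report the lengths of the contiguous row blocks.
    assert object_lengths == [3, 5, 2]
    assert string_lengths == object_lengths
```

Using unequal series lengths matters here: with equal lengths, a reordering of the counts would go unnoticed.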

Validation

  • pytest test/test_df_utils.py (36 passed)
  • mypy src test (no issues)

Compatibility

No public API changes. Behavior is unchanged except for correcting dtype-dependent ordering/misalignment.

Development

Successfully merging this pull request may close these issues.

[BUG] Cannot infer frequency when id_column has string[python] dtype