@yashwantbezawada yashwantbezawada commented Dec 27, 2025

Reference Issues/PRs

Fixes #2800

What does this implement or fix?

When you try to write a DataFrame with np.str_ values, you get this error:

ArcticDbNotYetImplemented: Failed to normalize column 'col' with dtype 'object'. 
Found first non-null value of type '<class 'numpy.str_'>', but only strings, unicode, 
and Timestamps are supported.

The cause is in _accept_array_string(): it checks type(v) in (str, bytes), which is an exact type match. np.str_ is a distinct class that subclasses str, so the check fails even though the value behaves like an ordinary string.

Switched to isinstance(v, (str, bytes)), which also accepts subclasses. Made the equivalent change in coerce_string_column_to_fixed_length_array(): to_type == str became issubclass(to_type, str).
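
To make the difference concrete, here is a minimal sketch that contrasts the two checks on a numpy string scalar. The helper names accepts_exact and accepts_subclass are hypothetical and exist only for illustration; they are not the ArcticDB functions, which do more than this one test.

```python
import numpy as np


def accepts_exact(v):
    # Hypothetical helper: exact type match, as in the original check.
    return type(v) in (str, bytes)


def accepts_subclass(v):
    # Hypothetical helper: also accepts subclasses of str and bytes.
    return isinstance(v, (str, bytes))


value = np.str_("hello")

print(type(value) is str)       # False: the concrete type is np.str_
print(accepts_exact(value))     # False: np.str_ is not in (str, bytes)
print(accepts_subclass(value))  # True: np.str_ inherits from str
```

Plain str and bytes values pass both checks, so only subclass instances are affected by the change.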

Any other comments?

Quick sanity check that this makes sense:

>>> isinstance(np.str_("hello"), str)
True
>>> isinstance(np.bytes_(b"hello"), bytes)
True

The C++ side already does the right thing - PyUnicode_Check and PyBytes_Check both check the type hierarchy, so np.str_ values work fine once they get past the Python normalization.

Added some tests to cover this.

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

The normalization code was using exact type checks (type(v) in (str, bytes))
which failed for numpy string scalar types since np.str_ and np.bytes_ are
distinct classes that inherit from str and bytes respectively.

Changed to use isinstance() checks which properly handle the type hierarchy,
allowing numpy string types to be normalized correctly.

Also updated coerce_string_column_to_fixed_length_array to use issubclass()
for consistent handling when dynamic_strings=False.
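
The same distinction applies when the check is made on a type object rather than on a value. A small sketch, assuming to_type holds a class such as str or np.str_ (the helper names below are hypothetical, not ArcticDB code):

```python
import numpy as np


def is_str_type_exact(to_type):
    # Hypothetical helper: equality only matches the builtin str class itself.
    return to_type == str


def is_str_type_subclass(to_type):
    # Hypothetical helper: matches str and any class derived from it.
    return issubclass(to_type, str)


print(is_str_type_exact(np.str_))       # False
print(is_str_type_subclass(np.str_))    # True
print(is_str_type_subclass(np.bytes_))  # False: np.bytes_ derives from bytes
```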

Added regression tests for the fix.

Fixes man-group#2800

Signed-Off By: Yashwant Bezawada <[email protected]>. By including this sign-off line I agree to the terms of the Contributor License Agreement.
@phoebusm phoebusm force-pushed the fix-numpy-str-normalization branch from 4b2dc27 to 2380eea Compare December 29, 2025 18:30
Collaborator

@IvoDD IvoDD left a comment

Thank you for contributing! Looks good! Just one suggestion for an extra test

Per reviewer feedback referencing issue man-group#704, using isinstance() to
accept all str/bytes subclasses has been problematic before. Changed
to use strict type equality with explicit numpy types:
- type(v) in (str, bytes, np.str_, np.bytes_)
- to_type in (str, np.str_)

This supports numpy string types while avoiding issues with arbitrary
string subclasses.
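
As a rough illustration of the trade-off described in this commit message, the sketch below shows how a strict allow-list accepts the numpy scalar types while still rejecting an arbitrary str subclass. TaggedStr, accepts_strict, and is_str_target are hypothetical names introduced for this example only.

```python
import numpy as np

# Exact types accepted for values, mirroring the tuple in the commit message.
ACCEPTED_VALUE_TYPES = (str, bytes, np.str_, np.bytes_)


class TaggedStr(str):
    """Hypothetical user-defined str subclass, used only for illustration."""


def accepts_strict(v):
    # Exact match against the allow-list; other subclasses are rejected.
    return type(v) in ACCEPTED_VALUE_TYPES


def is_str_target(to_type):
    # Exact match for the string target types named in the commit message.
    return to_type in (str, np.str_)


print(accepts_strict(np.str_("a")))       # True
print(accepts_strict(np.bytes_(b"a")))    # True
print(accepts_strict(TaggedStr("a")))     # False, although isinstance(..., str) is True
print(is_str_target(np.str_))             # True
```

The cost of this design is that every additional string-like type has to be added to the allow-list explicitly.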
Development

Successfully merging this pull request may close these issues.

Support for np.str_ types

2 participants