Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH(string dtype): fallback for HDF5 with UTF-8 surrogates #60993

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Feb 23, 2025

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

One oddity I encountered here: only the index is currently encoded / decoded on write / read operations respectively. Columns on the other hand are written and read as strings. I haven't looked into why this is, and if we can avoid encode/decode for index.

It seemed best to only fallback when errors="surrogatepass", though that use is a bit odd since there is no actual decode occurring. If there are other approaches (perhaps always falling back?), I'm certainly open to them.

Another option here is to fallback to object instead of StringDtype(storage="python"), but it seems with infer_string=True the latter is more appropriate.

@rhshadrach rhshadrach added IO HDF5 read_hdf, HDFStore Strings String extension data type and string data labels Feb 23, 2025
@rhshadrach rhshadrach added this to the 2.3 milestone Feb 23, 2025
@@ -5224,7 +5246,7 @@ def _convert_string_array(data: np.ndarray, encoding: str, errors: str) -> np.nd
# encode if needed
if len(data):
data = (
Series(data.ravel(), copy=False)
Series(data.ravel(), copy=False, dtype="object")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We go immediately back to NumPy, so no reason to use string dtypes here.

Comment on lines +5289 to +5291
ser = Series(data, copy=False).str.decode(
encoding, errors=errors, dtype="object"
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We go immediately back to NumPy, so no reason to use string dtypes here.

and isinstance(values, np.ndarray)
and is_string_array(values, skipna=True)
):
result = result.astype(StringDtype(na_value=np.nan))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added back in #54431. I do not believe it is no longer necessary - we will infer string in the Series constructor.

@rhshadrach rhshadrach marked this pull request as ready for review February 23, 2025 15:41
@rhshadrach rhshadrach changed the title ENH(string dtype): object fallback for HDF5 with UTF-8 surrogates ENH(string dtype): fallback to Python storage for HDF5 with UTF-8 surrogates Feb 23, 2025
@rhshadrach rhshadrach changed the title ENH(string dtype): fallback to Python storage for HDF5 with UTF-8 surrogates ENH(string dtype): fallback for HDF5 with UTF-8 surrogates Feb 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO HDF5 read_hdf, HDFStore Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant