-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH(string dtype): fallback for HDF5 with UTF-8 surrogates #60993
base: main
Are you sure you want to change the base?
Conversation
@@ -5224,7 +5246,7 @@ def _convert_string_array(data: np.ndarray, encoding: str, errors: str) -> np.nd | |||
# encode if needed | |||
if len(data): | |||
data = ( | |||
Series(data.ravel(), copy=False) | |||
Series(data.ravel(), copy=False, dtype="object") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We go immediately back to NumPy, so no reason to use string dtypes here.
ser = Series(data, copy=False).str.decode( | ||
encoding, errors=errors, dtype="object" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We go immediately back to NumPy, so no reason to use string dtypes here.
and isinstance(values, np.ndarray) | ||
and is_string_array(values, skipna=True) | ||
): | ||
result = result.astype(StringDtype(na_value=np.nan)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was added back in #54431. I do not believe it is no longer necessary - we will infer string in the Series constructor.
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.One oddity I encountered here: only the index is currently encoded / decoded on write / read operations respectively. Columns on the other hand are written and read as strings. I haven't looked into why this is, and if we can avoid encode/decode for index.
It seemed best to only fallback when
errors="surrogatepass"
, though that use is a bit odd since there is no actual decode occurring. If there are other approaches (perhaps always falling back?), I'm certainly open to them.Another option here is to fallback to
object
instead ofStringDtype(storage="python")
, but it seems withinfer_string=True
the latter is more appropriate.