Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] [NFC] Fix docstrings for string.strip() function family #3843

Closed
wants to merge 3 commits into from

Conversation

martinvuyk
Copy link
Contributor

@martinvuyk martinvuyk commented Dec 7, 2024

Fix docstrings for string.strip() function family.

Was deleted in b0094d3. @lsh I really don't want that documentation deleted, this is very often not included and caused me quite a couple of headaches in finding the correct information and backward compatibility reasons for using those characters. And we should be explicit that this method does not yet support full Python spaces until we can run the methods with .isspace() at compile time.

Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
@martinvuyk martinvuyk requested a review from a team as a code owner December 7, 2024 18:47
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
@owenhilyard
Copy link
Contributor

+1, knowing this doesn't handle all the various whitespace-like characters in unicode is important information.

Does it make sense to have .strip() handle unicode and have a separate .strip_ascii() which can be more heavily vectorized for people who know they only have ascii text and want performance?

@martinvuyk
Copy link
Contributor Author

Does it make sense to have .strip() handle unicode and have a separate .strip_ascii() which can be more heavily vectorized for people who know they only have ascii text and want performance?

@owenhilyard We could add such a helper function inside string_slice.mojo, but I've found that for several cases having a sequence of scalar operations to check the value range is faster. I imagine it's because it's more easily pipelineable in the ALUs than the latency of doing any(asci_whitespace_vec == byte_val). I haven't benchmarked the vectorized approach for #3528 but I think it might give the same results as in #3825 where it seems like the amount of branching makes the latency of the ops be expensive on branch prediction fail or because of the amount of instructions when implementing a branchless version (I'm still waiting for #3528 to be merged to experiment further, split() might have a bigger proportion of separator/text than splitlines() so the results might be different).

Both the vectorized and scalar versions are quite trivial to implement so I think people would reimplement them anyway.

I was also thinking that one day we might implement (or an external library for that matter) an ASCIIString which has all of the optimizations.

@owenhilyard
Copy link
Contributor

@martinvuyk Given that we're likely going to eventually have hand-written versions of functions like this per arch, I wonder if it makes sense to have small and large string versions of the code, instead of trying to make one algorithm work for someone who just mmaped a 20 GB csv and someone else who wants to split a single line of 4 comma delimited floats. We can try to make a set of mostly branchless loops for various string sizes, and potentially different versions for SIMD implementations with masking like AVX-512 and SVE.

For instance, on ASCII text you could vectorize each individual compare and OR them together (ex: '\t' <= chunk <= '\r' or chunk == ' ') to make a mask, invert the mask, then use llvm.experimental.vector.compress and a bitcount (ex: popcnt) on the mask to figure out what your store width should be back into the string buffer (either the source string or an output buffer).

@martinvuyk
Copy link
Contributor Author

martinvuyk commented Dec 8, 2024

Given that we're likely going to eventually have hand-written versions of functions like this per arch, I wonder if it makes sense to have small and large string versions of the code, instead of trying to make one algorithm work for someone who just mmaped a 20 GB csv and someone else who wants to split a single line of 4 comma delimited floats.

@owenhilyard For very large mmaped files, I was thinking of eventually implementing a lazy version of the algorithm that works on _StringSliceIter and returns another iterator (I also want to do the same for splitlines).

The main problem of making this vectorized is the unicode separators (and the damned \r\n), unless we have UTF-32 encoding I don't know how to vectorize the iteration and checking for UTF-8. For checking whether they are separators I implemented a branchless version but it was 30% slower (using SIMD ops, might have to test again with scalar ones), to iterate you can have different length of characters so there will be a checking length of the last char and masking cost at each step. There might be a point at which the cost is worth it, but I'd need to invest quite some time into development + benchmarking to be sure (time which I'd prefer investing into vectorizing List once #3810 lands, and many other string methods which are suboptimal).

For instance, on ASCII text you could vectorize each individual compare and OR them together (ex: '\t' <= chunk <= '\r' or chunk == ' ') to make a mask, invert the mask, then use llvm.experimental.vector.compress and a bitcount (ex: popcnt) on the mask to figure out what your store width should be back into the string buffer (either the source string or an output buffer).

Wow, yeah that might be useful for csv parsers that know that the separators can be any ASCII whitespace char. The split(sep: String) algorithm is already vectorized. I think that can be left for an external library to implement though (unless we go all in on implemeting an ASCIIString type)

@JoeLoser
Copy link
Collaborator

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label Dec 10, 2024
@modularbot
Copy link
Collaborator

✅🟣 This contribution has been merged 🟣✅

Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours.

We use Copybara to merge external contributions, click here to learn more.

@modularbot modularbot added the merged-internally Indicates that this pull request has been merged internally label Dec 10, 2024
@modularbot
Copy link
Collaborator

Landed in 9930b67! Thank you for your contribution 🎉

@modularbot modularbot added the merged-externally Merged externally in public mojo repo label Dec 11, 2024
modularbot pushed a commit that referenced this pull request Dec 11, 2024
…n family (#52540)

[External] [stdlib] [NFC] Fix docstrings for `string.strip()` function
family

Fix docstrings for `string.strip()` function family.

Was deleted in b0094d3. @lsh I really
don't want that documentation deleted, this is very often not included
and caused me quite a couple of headaches in finding the correct
information and backward compatibility reasons for using those
characters. And we should be explicit that this method does not yet
support full Python spaces until we can run the methods with
`.isspace()` at compile time.

Co-authored-by: martinvuyk <110240700+martinvuyk@users.noreply.github.com>
Closes #3843
MODULAR_ORIG_COMMIT_REV_ID: 95456c663bc28381d8c9fc1dd98c4d03d38d0ca4
@modularbot modularbot closed this Dec 11, 2024
@martinvuyk martinvuyk deleted the fix-string-strip-docs branch December 11, 2024 17:44
modularbot pushed a commit that referenced this pull request Feb 13, 2025
…n family (#52540)

[External] [stdlib] [NFC] Fix docstrings for `string.strip()` function
family

Fix docstrings for `string.strip()` function family.

Was deleted in b0094d3. @lsh I really
don't want that documentation deleted, this is very often not included
and caused me quite a couple of headaches in finding the correct
information and backward compatibility reasons for using those
characters. And we should be explicit that this method does not yet
support full Python spaces until we can run the methods with
`.isspace()` at compile time.

Co-authored-by: martinvuyk <110240700+martinvuyk@users.noreply.github.com>
Closes #3843
MODULAR_ORIG_COMMIT_REV_ID: 95456c663bc28381d8c9fc1dd98c4d03d38d0ca4
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
imported-internally Signals that a given pull request has been imported internally. merged-externally Merged externally in public mojo repo merged-internally Indicates that this pull request has been merged internally
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants