-
-
Notifications
You must be signed in to change notification settings - Fork 31.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do we still need asciilib
for better performance?
#98229
Comments
I'd generally be in favor of removing almost-duplicate code if there is no performance benefit. Have you done any analysis with the pyperformance benchmarks? |
ASCII-only strings have multiple efficient properties. Examples:
Some optimizations are implemented outside stringlib. Like:
|
Only very limited ones in #98025 (comment) and #98228 (comment) |
Please perform tests also with longer strings (100, 1000, 10000 characters or more). Also test with strings containing non-ASCII whitespace characters |
If you want to work with some real-world ASCII data you can download a FASTQ file named SRR15096500_1.fastq.gz here https://www.ebi.ac.uk/ena/browser/view/SRR15096500. The data can be quite big. Whole genome sequencing can yield FASTQ files that are 100GB when compressed with gzip. @marcelm has created a nice library called dnaio for parsing the format. To add to @vstinner ASCII has some really interesting other properties to (to complement @vstinner):
Anyway. I am curious to see if asciilib works well when using real-world data such as posted above. A lot of formats in my field are ASCII, so Python's performance in this area matters to me. |
The utf8_decode() function of the Python "stringlib" internal library uses this property to check if a string is ASCII-only in an efficient way. It even works on size_t words (ex: 8 bytes on 64-bit systems) per loop iteration. See: Objects/stringlib/codecs.h. |
@vstinner I am aware that the current ASCII check exists and is quite efficient. However, it can be more efficient: #101289 The current ASCII check is per string and involves a lot of code. A quick check on the input buffer is almost cost-free. The copying can then proceed using only PyUnicode_New(127) and memcpy. That is much faster. Although it will require some reworking of the ASCII/UTF-8 code. |
Feature or enhancement
asciilib
was introduced in c3cec78 as a performance feature for ascii strings.11 years have passed since then.
While working on #98025 we with @vstinner experimented on
unicode_count
with and withoutasciilib_count
calls.Results were clear: it does not bring any benefits on our benchmarks.
And this commit was made: df3a6d9
Later, while working on #98228 I've noticed that
asciilib_rsplit_whitespace
also does not provide significant performance gains on my platform and my simple data input.So, maybe this should be analyzed deeply?
Pitch
PyUnicode_IS_ASCII(str)
I don't have access to Windows, but I can do the research for other platforms.
If it does not provide any performance benefits, it should be removed.
The text was updated successfully, but these errors were encountered: