Do we still need `asciilib` for better performance? #98229

sobolevn · 2022-10-12T23:11:37Z

Feature or enhancement

asciilib was introduced in c3cec78 as a performance feature for ascii strings.

11 years have passed since then.

While working on #98025 we with @vstinner experimented on unicode_count with and without asciilib_count calls.

Results were clear: it does not bring any benefits on our benchmarks.
And this commit was made: df3a6d9

Later, while working on #98228 I've noticed that asciilib_rsplit_whitespace also does not provide significant performance gains on my platform and my simple data input.

So, maybe this should be analyzed deeply?

Pitch

I think that these calls should be further investigated: do we really need this?
We should come up with better data to make a final decision, including:
- What pytohn methods / c-api functions are affected?
- Short and long ascii / non-ascii strings (because this check slows down all non-ascii strings by calling extra PyUnicode_IS_ASCII(str)
- Windows / MacOS / Linux platforms, maybe the results will be different

I don't have access to Windows, but I can do the research for other platforms.

If it does not provide any performance benefits, it should be removed.

The text was updated successfully, but these errors were encountered:

mdboom · 2022-10-13T14:34:38Z

I'd generally be in favor of removing almost-duplicate code if there is no performance benefit.

Have you done any analysis with the pyperformance benchmarks?

vstinner · 2022-10-13T15:30:27Z

ASCII-only strings have multiple efficient properties. Examples:

Looking for a non-ASCII character in an ASCII string has complexity of O(1): it's always false
_PyUnicode_FromASCII() doesn't have to scan the string to compute the maximum code path: it's known that the string is ASCII
Maximum code point is 127 (STRINGLIB_MAX_CHAR=127 for asciilib functions)

Some optimizations are implemented outside stringlib. Like:

static Py_ssize_t
any_find_slice(PyObject* s1, PyObject* s2, ...)
{
    ...
    kind1 = PyUnicode_KIND(s1);
    kind2 = PyUnicode_KIND(s2);
    if (kind1 < kind2)
        return -1;
    ...
}

sobolevn · 2022-10-13T17:08:44Z

Have you done any analysis with the pyperformance benchmarks?

Only very limited ones in #98025 (comment) and #98228 (comment)

serhiy-storchaka · 2022-10-14T07:39:41Z

Please perform tests also with longer strings (100, 1000, 10000 characters or more). Also test with strings containing non-ASCII whitespace characters \x85 and \xa0.

rhpvorderman · 2023-01-24T14:43:47Z

If you want to work with some real-world ASCII data you can download a FASTQ file named SRR15096500_1.fastq.gz here https://www.ebi.ac.uk/ena/browser/view/SRR15096500.
FASTQ is a format to store DNA (or RNA) sequencing reads. You can find a description of the format here: https://en.wikipedia.org/wiki/FASTQ_format.

The data can be quite big. Whole genome sequencing can yield FASTQ files that are 100GB when compressed with gzip.

@marcelm has created a nice library called dnaio for parsing the format. To add to @vstinner ASCII has some really interesting other properties to (to complement @vstinner):

The last 5 bits of lowercase and uppercase characters are the same. Allowing quick case-independent comparing.
The last 4 bits of 0, 1, 2, 3, are 0000, 0001, 0010, 0011.. you get the idea
The fact that the highest code point is at most 127 allows you to create reasonably small lookup tables with only 128 entries.
The most significant bit of ASCII is always 0. So you can do a very quick check if strings are ascii by doing a Bitwise OR on all characters and checking if the result's significant bit is also zero. That trick is used in dnaio and runs at approximately 50GB/s (using SSE2 instructions) or 20GB/s (without). This could potentially be used in UTF-8 parsing.

Anyway. I am curious to see if asciilib works well when using real-world data such as posted above. A lot of formats in my field are ASCII, so Python's performance in this area matters to me.

vstinner · 2023-04-05T18:21:02Z

This could potentially be used in UTF-8 parsing.

The utf8_decode() function of the Python "stringlib" internal library uses this property to check if a string is ASCII-only in an efficient way. It even works on size_t words (ex: 8 bytes on 64-bit systems) per loop iteration. See: Objects/stringlib/codecs.h.

rhpvorderman · 2023-04-07T07:12:31Z

The utf8_decode() function of the Python "stringlib" internal library uses this property to check if a string is ASCII-only in an efficient way.

@vstinner I am aware that the current ASCII check exists and is quite efficient. However, it can be more efficient: #101289

The current ASCII check is per string and involves a lot of code. A quick check on the input buffer is almost cost-free. The copying can then proceed using only PyUnicode_New(127) and memcpy. That is much faster. Although it will require some reworking of the ASCII/UTF-8 code.

sobolevn added type-feature A feature request or enhancement performance Performance or resource usage labels Oct 12, 2022

sobolevn mentioned this issue Oct 12, 2022

gh-94808: Cover str.rsplit for UCS1, UCS2 or UCS4 #98228

Merged

This was referenced Oct 22, 2022

Add regression tests for ancient str methods #98536

Closed

gh-94808: Improve coverage of unicode_find and unicode_rfind #98648

Merged

iritkatriel added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Nov 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do we still need `asciilib` for better performance? #98229

Do we still need `asciilib` for better performance? #98229

sobolevn commented Oct 12, 2022 •

edited

Loading

mdboom commented Oct 13, 2022

vstinner commented Oct 13, 2022

sobolevn commented Oct 13, 2022

serhiy-storchaka commented Oct 14, 2022

rhpvorderman commented Jan 24, 2023

vstinner commented Apr 5, 2023

rhpvorderman commented Apr 7, 2023

Do we still need asciilib for better performance? #98229

Do we still need asciilib for better performance? #98229

Comments

sobolevn commented Oct 12, 2022 • edited Loading

Feature or enhancement

Pitch

mdboom commented Oct 13, 2022

vstinner commented Oct 13, 2022

sobolevn commented Oct 13, 2022

serhiy-storchaka commented Oct 14, 2022

rhpvorderman commented Jan 24, 2023

vstinner commented Apr 5, 2023

rhpvorderman commented Apr 7, 2023

Do we still need `asciilib` for better performance? #98229

Do we still need `asciilib` for better performance? #98229

sobolevn commented Oct 12, 2022 •

edited

Loading