
gh-115999: Specialize LOAD_ATTR for instance and class receivers in free-threaded builds #128164

Merged: 75 commits into python:main on Jan 14, 2025

Conversation

@mpage (Contributor) commented on Dec 21, 2024

This PR finishes specialization for LOAD_ATTR in the free-threaded build by adding support for class and instance receivers.

The bulk of it is dedicated to making the specialized instructions and the specialization logic thread-safe. This consists of using atomics / locks in the appropriate places, avoiding races in specialization related to reloading versions, and ensuring that the objects stored in inline caches remain valid when accessed (by only storing deferred objects). See the section on "Thread Safety" below for more details.

Additionally, making this work required a few unrelated changes to fix existing bugs or to work around differences between the two builds that result from only storing deferred values (which causes specialization failures in the free-threaded build when a value that would otherwise be stored in the cache is not deferred):

  • Fixed a bug in the cases generator where it wasn't treating macro tokens as terminators when searching for the assignment target of an expression involving PyStackRef_FromPyObjectNew.
  • Refactored test_descr.MiscTests.test_type_lookup_mro_reference to work when specialization fails (and also be a behavioral test).
  • Temporarily skip test_capi.test_type.TypeTests.test_freeze_meta when running refleaks tests on free-threaded builds. Specialization failure triggers an existing bug.

Single-threaded Performance

  • Performance is improved by ~12% on free-threaded builds.
  • Performance is ~neutral on default builds. (Technically the benchmark runner reports a 0.2% improvement, but that looks like noise to me.)

We're leaving a bit of perf on the table by only storing deferred objects: we can't specialize attribute lookups that resolve to non-deferred class attributes (e.g. counters, settings). I haven't measured how much perf we're giving up, but I'd like to address that in a separate PR.

Scalability

The object_cfunction and pymethod benchmarks are improved (1.4x slower -> 14.3x faster and 1.8x slower -> 13.4x faster, respectively). Other benchmarks appear unchanged.

I would expect cmodule_function to improve as well, but it looks like that benchmark is bottlenecked on incref'ing the int.__floor__ method returned by the call to _PyObject_LookupSpecial in math_floor (the incref happens in _PyType_LookupRef, which is called by _PyObject_LookupSpecial):

PyObject *method = _PyObject_LookupSpecial(number, state->str___floor__);

Raw numbers:

Running benchmarks with 28 threads
                   Merge-base    This PR
object_cfunction    1.4x slower  14.3x faster
cmodule_function    1.7x slower   1.7x slower
mult_constant      12.8x faster  13.6x faster
generator          11.5x faster  12.2x faster
pymethod            1.8x slower  13.4x faster
pyfunction         12.2x faster  13.0x faster
module_function    14.5x faster  14.8x faster
load_string_const  13.0x faster  13.2x faster
load_tuple_const   12.7x faster  14.1x faster
create_pyobject    12.8x faster  13.5x faster
create_closure     13.9x faster  14.8x faster
create_dict        10.5x faster  13.3x faster
thread_local_read   3.9x slower   4.0x slower

Thread Safety

Thread safety of specialized instructions is addressed in a few different ways:

  • Use atomics where needed.
  • Lock the instance dictionary in _LOAD_ATTR_WITH_HINT.
  • Only store deferred objects in the inline cache. All uops that retrieve objects from the cache are preceded by type version guards, which ensure that the (deferred) value retrieved from the cache is valid. The type version will be mutated before destroying the reference in the type. If the type held the last reference, then GC will need to run in order to reclaim the object. GC requires stopping the world, so if GC reclaims the object then we will see the new type version and the guard will fail.
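A minimal sketch of that guard-then-load pattern, with illustrative names (this is not the actual uop body from Python/bytecodes.c, and the cache parameters are assumptions used only for illustration):

/* Sketch: return the cached descriptor as a borrowed reference, or NULL to
 * signal that the caller must deoptimize. cached_tp_version and cached_descr
 * stand in for values read from the inline cache. */
static PyObject *
load_cached_descr_or_deopt(PyObject *owner, uint32_t cached_tp_version,
                           PyObject *cached_descr)
{
    PyTypeObject *tp = Py_TYPE(owner);
    /* Atomic load in the free-threaded build; a plain load suffices under the GIL. */
    if (FT_ATOMIC_LOAD_UINT_RELAXED(tp->tp_version_tag) != cached_tp_version) {
        return NULL;  /* the type changed since specialization: deopt */
    }
    /* cached_descr is a deferred object: the type version is bumped before the
     * type drops its reference, and reclaiming the object requires a
     * stop-the-world GC, so once the guard passes the borrowed reference is
     * known to still be valid. */
    return cached_descr;
}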

Thread safety of specialization is addressed using similar techniques:

  • Atomics / locking is used to access shared state.
  • Specialization code is refactored to read versions at the start of specialization and store the same version in caches (as opposed to rereading the version). This ensures that the specialization is consistent with the version by avoiding races where the type or shared keys dictionary changes after we've queried it.
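A minimal sketch of the second point, again with illustrative names and an assumed cache layout (the real logic lives in Python/specialize.c):

/* Sketch: snapshot the type version once, perform the lookup, and store that
 * same snapshot in the cache together with the (deferred) descriptor. */
static int
specialize_attr_sketch(PyTypeObject *type, PyObject *name,
                       uint16_t *cache_version, PyObject **cache_descr)
{
    unsigned int tp_version = FT_ATOMIC_LOAD_UINT_RELAXED(type->tp_version_tag);
    PyObject *descr = _PyType_LookupRef(type, name);  /* strong reference */
    if (descr == NULL || !_PyObject_HasDeferredRefcount(descr)) {
        Py_XDECREF(descr);
        return -1;  /* only deferred objects may be stored in the cache */
    }
    /* Store the snapshot, not a fresh read: if the type changed between the
     * snapshot and the lookup, the runtime version guard will simply fail,
     * rather than pairing a newer version with a possibly stale descriptor. */
    write_u32(cache_version, (uint32_t)tp_version);
    *cache_descr = descr;  /* the cache keeps only a borrowed (deferred) reference */
    Py_DECREF(descr);
    return 0;
}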

Stats

Following the instructions in the comment preceding specialize_attr_loadclassattr, I collected stats for the default build for both this PR and its merge base using ./python -m test_typing test_re test_dis test_zlib and compared them using Tools/scripts/summarize_stats.py. The results for LOAD_ATTR are nearly identical and are consistent with the results of comparing the merge base against itself.

mpage added 30 commits December 18, 2024 14:13
Look up a unicode key in an all unicode keys object along with the
keys version, assigning one if not present.

We need a keys version that is consistent with presence of the key
for use in the guards.
Reading the shared keys version and looking up the key need to be performed
atomically. Otherwise, a key inserted after the lookup could race with reading
the version, causing us to incorrectly specialize that no key shadows the
descriptor.
* Check that the type hasn't changed across lookups
* Descr is now an owned ref
- Use atomic load for value
- Use _Py_TryIncrefCompareStackRef for incref
- Use atomics and _Py_TryIncrefCompareStackRef in _LOAD_ATTR_SLOT
- Pass type version during specialization
- Check that fget is deferred
- Pass tp_version
Macros should be treated as terminators when searching for the assignment
target of an expression involving PyStackRef_FromPyObjectNew
All instance loads are complete!
@mpage added the 🔨 test-with-refleak-buildbots label on Jan 2, 2025
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @mpage for commit 8b96368 🤖

If you want to schedule another build, you need to add the 🔨 test-with-refleak-buildbots label again.

@bedevere-bot removed the 🔨 test-with-refleak-buildbots label on Jan 2, 2025
@markshannon (Member) left a comment:

A lot of effort in this PR seems to go into LOAD_ATTR_WITH_HINT. There is a lot of locking and unlocking and extra machinery to support having a value on the stack between uops.
Is it worth it?

Generally, I think any effort spent on LOAD_ATTR_WITH_HINT is better spent on increasing the fraction of objects that use inlined values and can be handled by LOAD_ATTR_INSTANCE_VALUE.
It is not just that LOAD_ATTR_INSTANCE_VALUE is faster, it also means that the object is created faster and uses less memory.

@@ -1129,6 +1129,21 @@ dictkeys_generic_lookup(PyDictObject *mp, PyDictKeysObject* dk, PyObject *key, P
return do_lookup(mp, dk, key, hash, compare_generic);
}

static Py_hash_t
check_keys_and_hash(PyDictKeysObject *dk, PyObject *key)
Member:

I can't tell from the name what this is checking.
Maybe check_dict_is_unicode_only_and_key_has_hash_defined?
Which leads me to the question: why check both of these things in a single function? The two checks appear unrelated.

Contributor (author):

Which leads me to the question: why check both of these things in a single function.

This is factoring out some common code into a helper. I'll split it into two helpers with more descriptive names.

}

split op(_LOAD_ATTR_INSTANCE_VALUE, (offset/1, owner -- attr, null if (oparg & 1))) {
PyObject *owner_o = PyStackRef_AsPyObjectBorrow(owner);
PyObject **value_ptr = (PyObject**)(((char *)owner_o) + offset);
- PyObject *attr_o = *value_ptr;
+ PyObject *attr_o = FT_ATOMIC_LOAD_PTR_ACQUIRE(*value_ptr);
Member:

How do we know that the values array is still valid at this point?

Contributor:

There's a case that we are not handling correctly here: if, after we load the value from value_ptr, another thread invalidates the inline values, modifies an attribute, and reallocates another object at the same memory address as the previous attribute value, we can incorrectly return an object that was never the attribute at that offset.

We need to either:

  • Re-check _PyObject_InlineValues(owner_o)->valid at the end of this handler, and clean up and deopt if it's no longer true.
  • Set the contents of the inline values array to NULL when we mark it as invalid, so that _Py_TryIncrefCompareStackRef handles this case and correctly deopts.

I think the second option is probably simpler to implement and more efficient, given that _LOAD_ATTR_INSTANCE_VALUE is common and invalidating inline values is relatively rare.
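(Editorial aside: a minimal sketch of the second option, assuming the current PyDictValues layout and the FT_ATOMIC_* store wrappers; this is not the actual dictobject.c change.)

/* Sketch: clear every slot while invalidating, so a reader that already
 * loaded a stale pointer from the values array will see NULL (or a changed
 * pointer), fail _Py_TryIncrefCompareStackRef, and deoptimize. */
static void
invalidate_inline_values_sketch(PyDictValues *values)
{
    for (Py_ssize_t i = 0; i < values->capacity; i++) {
        FT_ATOMIC_STORE_PTR_RELEASE(values->values[i], NULL);
    }
    FT_ATOMIC_STORE_UINT8(values->valid, 0);
}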

Member:

Invalidating inline values might not be that rare.
I suggest gathering some stats before deciding how to handle this.

Contributor:

From
https://github.com/faster-cpython/benchmarking-public/blob/main/results/bm-20250104-3.14.0a3%2B-0cafa97/bm-20250104-azure-x86_64-python-0cafa97932c6574734bb-3.14.0a3%2B-0cafa97-pystats.md:

LOAD_ATTR_INSTANCE_VALUE: 4,922,474,147
Materialize dict (on request) | 4,444,396 | 2.4%
Materialize dict (new key) | 476,375 | 0.3%
Materialize dict (too big) | 4,884 | 0.0%
Materialize dict (str subclass) | 0 | 0.0%

So, LOAD_ATTR_INSTANCE_VALUE is somewhere between 1,000x and 10,000x more frequent than invalidating inline values: 4,922,474,147 / 4,444,396 ≈ 1,100 relative to dict materializations alone, and invalidating inline values requires first materializing the dictionary (although materializing the dictionary doesn't always invalidate the inline values).

DEOPT_IF(attr_o == NULL);
#ifdef Py_GIL_DISABLED
if (!_Py_TryIncrefCompareStackRef(value_ptr, attr_o, &attr)) {
Member:

What does _Py_TryIncrefCompareStackRef do?
Function names need to be self-explanatory. The result of this function does not depend on a comparison, but on whether something has been modified. Maybe _Py_TryIncrefIfPointersConsistentStackRef, as it only works if the PyObject ** and PyObject * pointers are consistent.

I know that _Py_TryIncrefCompareStackRef was not introduced in this PR, but it is critical to the understanding of this code.

Contributor:

It is an atomic compare-and-incref function, similar to how _Py_atomic_compare_exchange is an atomic compare-and-swap function. If the comparison succeeds, the value is incref'd and 1 is returned. 0 is returned on failure.

"Compare" is a more appropriate term than "consistent" here.

Member:

I still think "Compare" is misleading. The name should say what the function does, not how it does it.
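(Editorial aside: a behavioral sketch of the contract under discussion, written with a mutex instead of the real lock-free atomics; the lock and function name are illustrative.)

static PyMutex illustrative_lock;

/* Sketch of the contract: if *ptr still holds `expected`, take a new strong
 * reference into *out and return 1; otherwise return 0 so the caller can
 * deoptimize. (The real function can also fail if the object is already
 * being freed; that case is elided here.) */
static int
try_incref_compare_sketch(PyObject **ptr, PyObject *expected, _PyStackRef *out)
{
    int ok = 0;
    PyMutex_Lock(&illustrative_lock);  /* stands in for the real atomics */
    if (*ptr == expected) {
        *out = PyStackRef_FromPyObjectNew(expected);  /* takes a new reference */
        ok = 1;
    }
    PyMutex_Unlock(&illustrative_lock);
    return ok;
}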

PyObject *attr_o;
if (!LOCK_OBJECT(dict)) {
DEAD(dict);
POP_DEAD_INPUTS();
Member:

This seems rather obscure. I see why you need to pop dict, but DEAD(dict); POP_DEAD_INPUTS() seems a rather convoluted way to do this.

Since you're adding a new code generator macro, why not POP(dict)?
I think that most people would expect that an explicitly popped value is no longer live, so there should be no need for a kill and a pop.

Contributor (author):

Sure, I can look at that. POP is already taken by a macro in ceval_macros.h, but we could use POP_INPUT.

Member:

Aside:
We shouldn't be using POP, as the code generator should handle pops. We can remove POP in another PR.

mpage added 7 commits January 8, 2025 15:33
These are two logically separate operations
This avoids a race where another thread invalidates the values, overwrites
an attribute stored in the values, and allocates a new object at the address
present in the values.
@mpage (Contributor, author) commented on Jan 10, 2025

A lot of effort in this PR seems to go into LOAD_ATTR_WITH_HINT. There is a lot of locking and unlocking and extra machinery to support having a value on the stack between uops.
Is it worth it?

The extra machinery to support having a value on the stack is also necessary for LOAD_ATTR_MODULE, so the only net new effort is locking. Increasing the fraction of objects that use inlined values and can be handled by LOAD_ATTR_INSTANCE_VALUE does sound like useful work, but can be done separately.

@mpage requested a review from colesbury on January 10, 2025 at 01:43
@mpage requested a review from colesbury on January 10, 2025 at 21:53
@mpage added the 🔨 test-with-refleak-buildbots label on Jan 10, 2025
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @mpage for commit 7ab3ec6 🤖

If you want to schedule another build, you need to add the 🔨 test-with-refleak-buildbots label again.

@bedevere-bot removed the 🔨 test-with-refleak-buildbots label on Jan 10, 2025
@mpage merged commit b5ee025 into python:main on Jan 14, 2025
63 checks passed
@mpage deleted the gh-115999-load-attr-instance-merged branch on January 14, 2025 at 19:56