Faster startup -- Experiment B -- streamline and tweak #65
Comments
FWIW, in the 100 most popular PyPI packages, these were the 50 most common constants:
Crazy idea: combine these two instructions. It could do something like:

    switch (oparg) {
    case 0: value = Py_None; break;
    case 1: value = Py_False; break;
    case 2: value = Py_True; break;
    // Several more cases for the other singletons
    default:
        // PyLong_FromLong returns a new reference, so skip the INCREF
        value = PyLong_FromLong(oparg - 50);
        PUSH(value);
        DISPATCH();
    }
    Py_INCREF(value);
    PUSH(value);
    DISPATCH();

The idea would be that oparg values in the range 10 to 255 translate to -40 to 215 -- that should cover the most common integers. (Or we could use a different bias, e.g. only go down to -5, since cached small ints cover the range -5 to 256.) Then again, a dedicated MAKE_INT instruction would be faster. (We could still support negative values by using a bias value.)
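For comparison, a minimal sketch of what a dedicated MAKE_INT handler with a bias could look like (the name MAKE_INT_BIAS and the bias value are illustrative assumptions, not from this thread):

    /* Hypothetical MAKE_INT handler, following ceval.c conventions.
       MAKE_INT_BIAS is an assumed constant; oparg 0..255 maps to -5..250. */
    #define MAKE_INT_BIAS 5

    TARGET(MAKE_INT) {
        PyObject *value = PyLong_FromLong((long)oparg - MAKE_INT_BIAS);
        if (value == NULL) {
            goto error;
        }
        PUSH(value);  /* new reference from PyLong_FromLong */
        DISPATCH();
    }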
I missed empty bytes (b'').
Oh, it's ...
I started working on MAKE_INT here: iritkatriel/cpython#26. I think this opcode might be more appropriately named LOAD_SMALL_INT. Initial perf numbers show it to be slightly slower than LOAD_CONST (I'll post something here soon). It might help if I create a function that takes the index into the small-int cache and does the lookup directly. Right now we translate the oparg (which is the index) to the value, and then the lookup function transforms the value back into the index.
New:
Old:
Sanity check -- New:
Old:
Same dis as above.
I'm not getting very stable results from pyperformance, but here's a table anyway.
So I did the optimisation I mentioned above, and now I get results comparable to LOAD_CONST. The change is this:

and then changed in ceval.c:
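A plausible shape for that helper (hypothetical names and interpreter-state field; not the actual diff from the PR):

    /* Hypothetical: fetch a cached small int directly by cache index,
       avoiding the value -> index -> value round trip. */
    static inline PyObject *
    small_int_from_index(PyInterpreterState *interp, unsigned int index)
    {
        PyObject *v = interp->small_ints[index];  /* assumed field name */
        Py_INCREF(v);
        return v;
    }

    /* The LOAD_SMALL_INT handler in ceval.c then reduces to: */
    TARGET(LOAD_SMALL_INT) {
        PUSH(small_int_from_index(tstate->interp, oparg));
        DISPATCH();
    }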
Cool! So the compiler applies the bias? That makes the bytecode dependent on its value -- easy to miss changing the magic number if the bias changes. I also wonder whether it would be faster or slower if instead we just hard-coded a few popular values (e.g. -1 .. 3) in LOAD_COMMON_CONSTANT?
I think that might be tidier. There is too much Mark-style breaking of the rules in what I did.
Here's a rewrite with one common-consts array for ints and other types. For now I only populate ints; next I will uncomment the other types (and fix the test).
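A rough sketch of the shape such a combined table could take (names and layout are assumptions for illustration, not the actual rewrite; refcounting and error handling are elided):

    /* Hypothetical combined common-consts table: a few singletons and
       empty containers first, then the small ints, so a single opcode
       can index everything through its oparg. */
    enum {
        CC_NONE, CC_FALSE, CC_TRUE,
        CC_EMPTY_STR, CC_EMPTY_BYTES, CC_EMPTY_TUPLE,
        CC_FIRST_INT,               /* small ints start here */
    };

    static PyObject *common_consts[CC_FIRST_INT + 261];  /* ints -5..255 */

    static void
    init_common_consts(void)
    {
        common_consts[CC_NONE] = Py_None;
        common_consts[CC_FALSE] = Py_False;
        common_consts[CC_TRUE] = Py_True;
        common_consts[CC_EMPTY_STR] = PyUnicode_New(0, 0);
        common_consts[CC_EMPTY_BYTES] = PyBytes_FromStringAndSize(NULL, 0);
        common_consts[CC_EMPTY_TUPLE] = PyTuple_New(0);
        for (int i = -5; i <= 255; i++) {
            common_consts[CC_FIRST_INT + 5 + i] = PyLong_FromLong(i);
        }
    }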
I think we need to check the common consts again.
Uh, I found it. It is used for function defaults.
I think we should only use ... How many of the constants in #65 (comment) are in run-once code that would be better implemented as bytecode? The reason I ask is that creating objects in the interpreter is likely to be faster than in the marshal module. Small integer constants and single-character constants might be better implemented with their own instructions, ...
I would personally only put empty sequences in there (i.e., ...). There's a potentially clever ("Mark-style") trick to use this for small integers without having a large array -- in the interpreter state, allocate the array of singletons (see above) right before the array of small ints, so that over-indexing the singletons array gets you a small int. (To make it less hackish we could have a single array and adjust the macros for getting a small int from its value -- it's all fixed offsets relative to the start of the interpreter state struct anyway.) However, I wouldn't do that yet, because IIUC @ericsnowcurrently is planning to replace the array of small integer pointers with an array of small integer structs -- see #79. Extending that idea for ...
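A sketch of the adjacency trick being described (field names are hypothetical; the point is only the layout):

    /* Hypothetical fragment of interpreter state: the singletons array
       sits immediately before the small-int array, so indexing past
       the end of `singletons` lands in `small_ints`. This deliberately
       relies on the two arrays being contiguous -- the "Mark-style"
       rule-breaking mentioned above. */
    typedef struct {
        PyObject *singletons[6];    /* None, False, True, '', b'', () */
        PyObject *small_ints[261];  /* -5 .. 255 */
    } CommonConstState;

    /* One opcode can then index both regions off a single base: */
    #define COMMON_CONST(state, oparg) ((state)->singletons[(oparg)])
    /* oparg 0..5 -> singletons; oparg 6..266 -> small ints. */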
Right, so do we want one opcode for all common consts or do we want LOAD_INT, LOAD_CHAR and LOAD_COMMON_CONST for a few others?
I would start with just the special constants, and include (only) the integer zero. Next small integers, and forget about characters.

--Guido (mobile)
I started doing some prep work that we will need for any kind of LOAD_CONST split -- basically refactoring LOAD_CONST handling in a way that can be generalized later. dis module: python/cpython#28258. Also this, to make the tests easier to work with: python/cpython#28247.
Are there any objections to renaming the LOAD_CONST opcode to LOAD_CO_CONST? If we do that, then LOAD_CONST is a collective name for LOAD_CO_CONST, LOAD_NONE, LOAD_COMMON_CONST, etc., and it can appear as such in macros that generate one of those based on the value. If we leave it as LOAD_CONST, I think it's confusing.
I think renaming LOAD_CONST is not a great idea. It defies 30+ years of tradition.
I've done LOAD_NONE (on top of the three prep PRs, so I made a PR into my branch that contains the prep): iritkatriel/cpython#29. It's showing a small speedup; see the PR. Note that the docstring Nones are still in co_consts.
I wonder if the best compromise for the docstring isn't to define a new flag bit (we have plenty of spare bits) that means "co_consts[0] is the docstring". Leave it unset for code objects without docstrings (including classes, lambdas, comprehensions), set it for functions that have a docstring.
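A sketch of how such a flag bit might be consumed (CO_DOCSTRING is a made-up name, and the bit value is an assumed spare one):

    /* Hypothetical flag meaning "co_consts[0] is the docstring". */
    #define CO_DOCSTRING 0x8000000  /* assumed spare bit */

    static PyObject *
    code_docstring(PyCodeObject *co)
    {
        if (co->co_flags & CO_DOCSTRING) {
            PyObject *doc = PyTuple_GET_ITEM(co->co_consts, 0);
            Py_INCREF(doc);
            return doc;
        }
        Py_RETURN_NONE;  /* this code object stores no docstring */
    }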
We can try: iritkatriel/cpython#30
As I commented here, removing the docstring from ...
How about we try something super simple that makes None exceptional? See what the refleaks tests say.
The refleak tests rely on _Py_RefTotal, so it's just a matter of not updating it for None, like:
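Something along these lines (a sketch of the idea, not the exact snippet; Py_REF_DEBUG guards the _Py_RefTotal bookkeeping in debug builds):

    /* Sketch: don't count None in _Py_RefTotal, so an unpaired
       Py_INCREF(Py_None) can't register as a refleak. Illustrative
       stand-in for the real Py_INCREF macro. */
    static inline void
    incref(PyObject *op)
    {
    #ifdef Py_REF_DEBUG
        if (op != Py_None) {
            _Py_RefTotal++;
        }
    #endif
        op->ob_refcnt++;
    }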
I'll try to get some perf numbers for this tomorrow.
main branch:
LOAD_NONE:
LOAD_NONE + immortal None:
A RETURN_NONE opcode might also make sense.
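A minimal sketch of such a handler (illustrative; the return-path label is an assumption about the eval loop of the time):

    /* Hypothetical RETURN_NONE: fuses LOAD_NONE + RETURN_VALUE. */
    TARGET(RETURN_NONE) {
        Py_INCREF(Py_None);
        retval = Py_None;
        goto exiting;  /* assumed common return path */
    }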
FYI, regarding "immortal objects" see ... (UPDATE: this should be python/cpython#24828)
That's not the link you pasted in the chat (which I have now lost).
In the chat, Eric posted this: python/cpython#24828
The docstring would also be on the function object, so you could access it from there. Arguably it's a property of the function and not of the function's code object. Or am I missing something?
I am thinking of a situation (say scanning PYC files) where you don't want to execute any code. Without execution, you have no function objects. So the docstring as specified in the source has to be somewhere in some code object. Function objects exist primarily to store all the dynamic state wrapping code objects (e.g. __globals__, __defaults__, __closure__).
There are also a bunch of "metadata" attributes that are set on the function from the code object (e.g. __name__, __doc__).
I added total size of co_consts to the script: faster-cpython/tools#5. Here are some results. Total co_consts size goes from 735,736 to 718,616, so about 2.3% smaller. It will get even better if/when we move None docstrings out of co_consts. Old:
With LOAD_NONE:
Here are the top 30 opcode pairs with LOAD_NONE. Note: LOAD_NONE --> RETURN_VALUE 7,439 2.52%
While I do think we should do RETURN_NONE, I wouldn't expect the world of it -- dynamically, there can be only one RETURN_NONE instruction per call (by definition :-).
With this PR we got: total co_consts size goes from 48,822 to 46,682, about 4.3% less. If I add this change: iritkatriel/cpython#30 ...
With only the docstring change (but no LOAD_NONE), it's 48,302.
Either there's a typo or that last conclusion is wrong (48,302 is hardly less than 48,822)?
Sorry, I misread the numbers, you're right. So it's the other way around -- adding the docstring change to LOAD_NONE helps a lot, but the docstring change alone not so much.
So there are 520 code objects that have a None at position 0 that's only used to indicate there's no docstring (these code objects don't load None onto the stack at all). And there are 2,140 code objects that have a None at another position that's used only to load None (we can deduce that these code objects have a docstring at position 0). But 7,944 code objects have a None at position 0 that is used both to load None and to indicate there's no docstring. So we should do both things if we care about the size of co_consts. Counter to this, we don't measure a real speedup, we save only 64 kB if all those packages you counted are imported in one process, and the marshal encoding of None is a single byte ('N'). I say we give up. :-( It's too bad, I had high hopes for each of these until we did the work.
Yes, it doesn't seem to be worth another opcode. Is there a chance that a MAKE_INT opcode will do anything?
I expect it's the same story. The common small ints are already cached, so LOAD_CONST is more efficient than anything else. We'd save much less because there are just fewer of these than None loads (6.27% of the top 25 constants are ints), although the marshal size is more (5 bytes -- there's no "short int" type in marshal, and it probably wouldn't matter).
One more thing I haven't tried yet is LOAD_CONST(None)+STORE_FAST: ...
But if I read that correctly, it's only ~0.2% of opcode pairs that get improved. Maybe we should create a framework for measuring dynamic opcode pairs, using e.g. a benchmark suite like PyPerformance as a basis.
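A minimal sketch of the counter such a framework would need inside the dispatch loop (CPython's old DYNAMIC_EXECUTION_PROFILE + DXPAIRS build option collected similar data, exposed via sys.getdxp()):

    /* Sketch: count dynamic opcode pairs at dispatch time. */
    static long pair_counts[256][256];
    static int last_opcode;

    /* ... inside the eval loop, on every instruction dispatch: */
    pair_counts[last_opcode][opcode]++;
    last_opcode = opcode;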
There are still some ideas here that we might do, so I'm moving it to Discussions.
This is a placeholder for a description of Experiment B (see #32 (comment)):

- try to speed up unmarshalling of code objects by low-level optimizations
- streamline code objects (fewer redundant, computed fields; fewer small objects)
- opcodes for immediate values (MAKE_INT, LOAD_COMMON_CONSTANT, ...)

Benefit of MAKE_INT and LOAD_COMMON_CONSTANT is that they reduce the size of co_consts (and hence of PYC files) without adding to the size of the instruction array. Downside is that the speed and simplicity of LOAD_CONST is hard to beat, so it will be hard to measure the difference in benchmarks. But None, False, True and small ints make up a lot of the constants (we can easily do research on this).