Faster startup -- Experiment B -- streamline and tweak #65

Closed · gvanrossum opened this issue Jul 12, 2021 · 62 comments
@gvanrossum commented Jul 12, 2021

This is a placeholder for a description of Experiment B (see #32 (comment))

  • try to speed up unmarshalling of code objects by low-level optimizations

  • streamline code objects (fewer redundant, computed fields; fewer small objects)

  • opcodes for immediate values:

    • MAKE_INT creates an int from oparg; for 0-255 this is a sure "integer cache" hit that we can inline
    • LOAD_COMMON_CONSTANT: oparg indexes into a second switch (or array?) of shared immutable common constants, e.g.
      • 0: None
      • 1: False
      • 2: True
      • 3: Ellipsis
      • 4: AssertionError (subsumes LOAD_ASSERTION_ERROR)
      • 5: ''
      • 6: ()
      • 7-11 (or some other range): negative values -1 through -5 (also sure hits in the integer cache)

    The benefit of MAKE_INT and LOAD_COMMON_CONSTANT is that they reduce the size of co_consts (and hence of PYC files) without adding to the size of the instruction array. The downside is that the speed and simplicity of LOAD_CONST is hard to beat, so the difference will be hard to measure in benchmarks. But None, False, True and small ints make up a large share of the constants (we can easily do research on this); a rough sketch of the two handlers follows below.
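For illustration, a minimal sketch of the two handlers in the style of the ceval.c instruction switch. The common_constants table is hypothetical, and the small-int indexing is an assumption modeled on the per-interpreter cache; this is not an actual patch.

TARGET(MAKE_INT): {
    /* oparg 0..255 is a guaranteed hit in the per-interpreter
       small-int cache, so no allocation can happen here. */
    PyInterpreterState *interp = _PyInterpreterState_GET();
    PyObject *value = (PyObject *)interp->small_ints[_PY_NSMALLNEGINTS + oparg];
    Py_INCREF(value);
    PUSH(value);
    DISPATCH();
}

TARGET(LOAD_COMMON_CONSTANT): {
    /* oparg indexes a short table of shared immutable singletons:
       0=None, 1=False, 2=True, 3=Ellipsis, 4=AssertionError, 5='', 6=() */
    PyObject *value = common_constants[oparg];  /* hypothetical table */
    Py_INCREF(value);
    PUSH(value);
    DISPATCH();
}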

@gvanrossum changed the title from "Faster startup -- Experiment B" to "Faster startup -- Experiment B -- streamline and tweak" on Jul 12, 2021
@gvanrossum

FWIW, in the 100 most popular PyPI packages, these were the 50 most common constants:

Total: 239 errors; 9,992 files; 218,124 code objects; 3,369,978 lines; 1,614,918 constants

Top 50 constants:
                                    None :     208223 (12.89%)
                                       0 :      41652 (2.58%)
                                       1 :      34245 (2.12%)
                                    True :      21520 (1.33%)
                                       2 :      19403 (1.20%)
                                   False :      16121 (1.00%)
                                       3 :      10646 (0.66%)
                                       5 :       6599 (0.41%)
                                       4 :       6413 (0.40%)
                                      -1 :       6359 (0.39%)
                                      '' :       5680 (0.35%)
                              ('dtype',) :       5539 (0.34%)
                                      10 :       5538 (0.34%)
                                     'a' :       4088 (0.25%)
                              ('match',) :       3505 (0.22%)
                                   'foo' :       3393 (0.21%)
                                      () :       2992 (0.19%)
                                     1.0 :       2940 (0.18%)
                                     'b' :       2837 (0.18%)
                                       6 :       2639 (0.16%)
                                 (None,) :       2562 (0.16%)
                                     'A' :       2385 (0.15%)
                                       8 :       2217 (0.14%)
                                  'name' :       2149 (0.13%)
                                'return' :       2132 (0.13%)
                                       7 :       2124 (0.13%)
                                     0.5 :       2006 (0.12%)
                              ('index',) :       1956 (0.12%)
                                     'x' :       1949 (0.12%)
                               ('name',) :       1946 (0.12%)
                                       9 :       1940 (0.12%)
                                     0.0 :       1939 (0.12%)
                               (1, 2, 3) :       1892 (0.12%)
                                     100 :       1859 (0.12%)
                                     '/' :       1624 (0.10%)
                                   'bar' :       1608 (0.10%)
                               ('axis',) :       1605 (0.10%)
                                      12 :       1574 (0.10%)
                                      20 :       1557 (0.10%)
                                    '\n' :       1552 (0.10%)
                                     'B' :       1538 (0.10%)
                                    'id' :       1459 (0.09%)
                        ('primary_key',) :       1421 (0.09%)
                                     ' ' :       1405 (0.09%)
                                     2.0 :       1386 (0.09%)
                                     'c' :       1365 (0.08%)
                                     '.' :       1362 (0.08%)
                                  'data' :       1283 (0.08%)
                                 'utf-8' :       1213 (0.08%)
                              '__main__' :       1186 (0.07%)

@gvanrossum commented Jul 16, 2021

Crazy idea: combine these two instructions. It could do something like

switch (oparg) {
case 0: value = Py_None; break;
case 1: value = Py_False; break;
// Several more cases for the other singletons
default:
    // PyLong_FromLong returns a new (cached) reference,
    // so this path must not INCREF again.
    value = PyLong_FromLong(oparg - 50);
    PUSH(value);
    DISPATCH();
}
// Singleton path: take a reference before pushing.
Py_INCREF(value);
PUSH(value);
DISPATCH();

The idea would be that oparg values in the range 10 to 255 translate to -40 to 205 -- that should cover the most common integers. (Or we could use a different bias, e.g. only go down to -5, since cached small ints cover the range -5 to 256.)

Then again, a dedicated MAKE_INT instruction would be faster. (We could still support negative values by using a bias value.)

@gvanrossum

I missed empty bytes (b'') for LOAD_COMMON_CONSTANT. There are probably a few other empty immutable values I missed (empty frozenset?). I also believe that (None,) may be generated in some cases by the compiler, but I can't recall where exactly (closures? comprehensions?)

@methane commented Aug 12, 2021

(None,) is common for simple functions without a docstring.

@gvanrossum

Oh, it's co_consts itself. Maybe we should add the docstring as an explicit field in the code object. Then we can load it on demand -- another little saving on code load time, perhaps.

@iritkatriel

I started working on MAKE_INT here iritkatriel/cpython#26
Currently some test_modulefinder tests are still failing.

I think this opcode might be more appropriately named LOAD_SMALL_INT.

Initial perf numbers show this to be slightly slower than LOAD_CONST (I'll post something here soon). It might help if I create a function that takes the index into the small int cache and does the lookup. Right now we translate the oparg (which is the index) to the value, and then the lookup function transforms the value back to the index.

@iritkatriel

New:

% ./python.exe -m timeit "x=2"
20000000 loops, best of 5: 11.8 nsec per loop
>>> dis.dis("x=2")
  1           0 MAKE_INT                 7 (2)
              2 STORE_NAME               0 (x)
              4 LOAD_CONST               0 (None)
              6 RETURN_VALUE

Old:

% ./python.exe -m timeit "x = 2"
20000000 loops, best of 5: 10.8 nsec per loop

>>> dis.dis("x=2")
  1           0 LOAD_CONST               0 (2)
              2 STORE_NAME               0 (x)
              4 LOAD_CONST               1 (None)
              6 RETURN_VALUE

Sanity check:

New:

% ./python.exe -m timeit "x=300"
20000000 loops, best of 5: 10.7 nsec per loop

>>> dis.dis("x=300")
  1           0 LOAD_CONST               0 (300)
              2 STORE_NAME               0 (x)
              4 LOAD_CONST               1 (None)
              6 RETURN_VALUE

Old:

% ./python.exe -m timeit "x = 300"
20000000 loops, best of 5: 10.6 nsec per loop

same dis as above.

@iritkatriel

I'm not getting very stable results from pyperformance, but here's a table anyway.

+-------------------------+---------+-----------------------+
| Benchmark               | old     | new                   |
+=========================+=========+=======================+
| xml_etree_iterparse     | 124 ms  | 119 ms: 1.04x faster  |
+-------------------------+---------+-----------------------+
| richards                | 77.0 ms | 75.0 ms: 1.03x faster |
+-------------------------+---------+-----------------------+
| fannkuch                | 554 ms  | 540 ms: 1.03x faster  |
+-------------------------+---------+-----------------------+
| scimark_lu              | 186 ms  | 182 ms: 1.02x faster  |
+-------------------------+---------+-----------------------+
| python_startup          | 14.1 ms | 13.8 ms: 1.02x faster |
+-------------------------+---------+-----------------------+
| scimark_sparse_mat_mult | 6.52 ms | 6.41 ms: 1.02x faster |
+-------------------------+---------+-----------------------+
| python_startup_no_site  | 10.2 ms | 10.1 ms: 1.02x faster |
+-------------------------+---------+-----------------------+
| telco                   | 7.41 ms | 7.31 ms: 1.01x faster |
+-------------------------+---------+-----------------------+
| unpack_sequence         | 61.1 ns | 60.5 ns: 1.01x faster |
+-------------------------+---------+-----------------------+
| logging_format          | 9.25 us | 9.17 us: 1.01x faster |
+-------------------------+---------+-----------------------+
| sympy_integrate         | 25.8 ms | 25.6 ms: 1.01x faster |
+-------------------------+---------+-----------------------+
| sympy_expand            | 613 ms  | 609 ms: 1.01x faster  |
+-------------------------+---------+-----------------------+
| pickle_list             | 4.10 us | 4.12 us: 1.01x slower |
+-------------------------+---------+-----------------------+
| spectral_norm           | 183 ms  | 185 ms: 1.01x slower  |
+-------------------------+---------+-----------------------+
| scimark_sor             | 212 ms  | 214 ms: 1.01x slower  |
+-------------------------+---------+-----------------------+
| chameleon               | 10.4 ms | 10.6 ms: 1.01x slower |
+-------------------------+---------+-----------------------+
| scimark_monte_carlo     | 101 ms  | 102 ms: 1.01x slower  |
+-------------------------+---------+-----------------------+
| pickle                  | 11.2 us | 11.4 us: 1.01x slower |
+-------------------------+---------+-----------------------+
| unpickle_list           | 4.83 us | 4.90 us: 1.01x slower |
+-------------------------+---------+-----------------------+
| regex_dna               | 187 ms  | 190 ms: 1.02x slower  |
+-------------------------+---------+-----------------------+
| pathlib                 | 49.5 ms | 50.6 ms: 1.02x slower |
+-------------------------+---------+-----------------------+
| logging_silent          | 198 ns  | 205 ns: 1.03x slower  |
+-------------------------+---------+-----------------------+
| hexiom                  | 10.2 ms | 10.5 ms: 1.03x slower |
+-------------------------+---------+-----------------------+
| 2to3                    | 362 ms  | 380 ms: 1.05x slower  |
+-------------------------+---------+-----------------------+
| pickle_dict             | 24.0 us | 25.4 us: 1.06x slower |
+-------------------------+---------+-----------------------+
| Geometric mean          | (ref)   | 1.00x faster          |
+-------------------------+---------+-----------------------+


Benchmark hidden because not significant (33): xml_etree_generate, float, sqlalchemy_declarative, unpickle, logging_simple, django_template, nbody, mako, pidigits, meteor_contest, json_loads, raytrace, regex_compile, sqlite_synth, pyflate, chaos, sympy_sum, sqlalchemy_imperative, xml_etree_process, sympy_str, crypto_pyaes, deltablue, dulwich_log, scimark_fft, xml_etree_parse, json_dumps, nqueens, pickle_pure_python, regex_effbot, regex_v8, unpickle_pure_python, go, tornado_http

@iritkatriel

So I did the optimisation I mentioned above and now I get:
iritkatriel@Irits-MBP cpython % ./python.exe -m timeit "x = 2"
20000000 loops, best of 5: 11.1 nsec per loop
iritkatriel@Irits-MBP cpython % ./python.exe -m timeit "x = 2"
20000000 loops, best of 5: 10.8 nsec per loop
iritkatriel@Irits-MBP cpython % ./python.exe -m timeit "x = 2"
20000000 loops, best of 5: 10.9 nsec per loop

which is comparable to LOAD_CONST.

The change is this:
I added this to Include/internal/pycore_long.h:

+static inline PyObject* _PyLong_SmallIntLookup_internal(size_t index)
+{
+    PyInterpreterState *interp = _PyInterpreterState_GET();
+    assert(0 <= index && index < _PY_NSMALLNEGINTS + _PY_NSMALLPOSINTS);
+    PyObject *obj = (PyObject*)interp->small_ints[index];
+    // must not be called before _PyLong_Init() nor after _PyLong_Fini().
+    assert(obj != NULL);
+    return obj;
+}

and then changed in ceval.c:

         TARGET(MAKE_INT): {
-            PyObject *value = _PyLong_GetSmallInt(oparg - _PY_MAKE_INT_BIAS);
+            PyObject *value = _PyLong_SmallIntLookup_internal(oparg);
+            Py_INCREF(value);
             PUSH(value);
             DISPATCH();
         }

@gvanrossum

Cool! So the compiler applies the bias? That makes the bytecode dependent on its value — easy to miss changing the magic number if the bias changes.

I also wonder whether it would be faster or slower if instead we just hard-coded a few popular values (e.g. -1 .. 3) in LOAD_COMMON_CONSTANT?
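On the bias hazard above: a cheap way to avoid it is to define the bias exactly once and derive both the compiler's encoder and the interpreter's decoder from it. A self-contained sketch of the round-trip property (all names illustrative, not CPython API):

#include <assert.h>

/* One shared bias constant; change it in one place and both sides stay
 * in sync. oparg 0 encodes the int -MAKE_INT_BIAS. */
#define MAKE_INT_BIAS 5

static int encode_oparg(int value) { return value + MAKE_INT_BIAS; }
static int decode_value(int oparg) { return oparg - MAKE_INT_BIAS; }

int main(void) {
    /* Round-trip over the whole encodable range, -5 through 250. */
    for (int v = -MAKE_INT_BIAS; v <= 255 - MAKE_INT_BIAS; v++) {
        int oparg = encode_oparg(v);
        assert(0 <= oparg && oparg <= 255);
        assert(decode_value(oparg) == v);
    }
    return 0;
}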

@iritkatriel

I think that might be tidier. There is too much Mark-style breaking of the rules in what I did.

@iritkatriel

Here's a rewrite with one common-consts array for ints and other types:
https://github.com/iritkatriel/cpython/pull/27/files

For now I only populate ints, next I will uncomment the other types (and fix the test).

@methane commented Sep 2, 2021

I think we need to check the common consts again. (None,) is common in co_consts, but it is not a target of LOAD_CONST.
LOAD_CONST loads a code object that contains the co_consts, or None from within the co_consts, but it doesn't load the co_consts tuple itself.

@methane commented Sep 2, 2021

Uh, I found it. It is used for function defaults.

#    def truncate(self, pos=None):

401          32 LOAD_CONST              31 ((None,))
             34 LOAD_CONST               7 (<code object truncate at 0x7fd5084a4a40, file "Lib/_pyio.py", line 401>)
             36 MAKE_FUNCTION            1 (defaults)
             38 STORE_NAME               7 (truncate)

@markshannon

I think we should only use LOAD_COMMON_CONST for a small set of genuinely common (or necessary, like AssertionError) constants.

How many of the constants in #65 (comment) are in run-once code that would be better implemented as bytecode?

The reason I ask is that creating objects in the interpreter is likely to be faster than in the marshal module.
E.g. if the const ("x", "y") is only used once, then LOAD_CONST "x"; LOAD_CONST "y"; BUILD_TUPLE 2 is likely to be faster than LOAD_CONST ("x", "y"), as it moves the work of building the tuple from the marshal module to the interpreter.

Small integer constants and single character constants might be better implemented with their own instructions, LOAD_INT and LOAD_CHAR.
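On LOAD_CHAR: a handler could lean on the existing one-character string cache. A rough sketch, assuming the compiler emits the code point as oparg (PyUnicode_FromOrdinal returns the cached singleton for latin-1 characters):

TARGET(LOAD_CHAR): {
    /* oparg holds the code point; for latin-1 this hits the
       interpreter's cached 1-character strings, so no allocation. */
    PyObject *value = PyUnicode_FromOrdinal(oparg);
    if (value == NULL) {
        goto error;
    }
    PUSH(value);
    DISPATCH();
}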

@gvanrossum

I would personally only put empty sequences in there (i.e., (), b"", "", and frozenset()), some zeros (0, 0.0, maybe 0j), and Ellipsis and AssertionError. Even (None,) seems excessive -- I'd much sooner add 1 and -1 and maybe 2.

There's a potentially clever ("Mark-style") trick to use this for small integers without having a large array -- in the interpreter state, allocate the array of singletons (see above) right before the array of small ints, so that over-indexing the singletons array gets you a small int. (To make it less hackish we could have a single array and adjust the macros for getting a small int from its value -- it's all fixed offsets relative to the start of the interpreter state struct anyway.)

However, I wouldn't do that yet, because IIUC @ericsnowcurrently is planning to replace the array of small integer pointers with an array of small integer structs -- see #79. Extending that idea for None, (), "" etc. would seem cumbersome.
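For concreteness, a minimal sketch of the single-array idea anyway (all names hypothetical, not CPython API):

#include <Python.h>
#include <assert.h>

/* The common-constant singletons sit immediately before the small-int
 * cache in one contiguous array, so over-indexing the singleton region
 * lands on the ints without a second table or a second opcode. */
#define N_SINGLETONS 8     /* None, False, True, Ellipsis, ... */
#define N_SMALL_INTS 262   /* -5..256, as in CPython's int cache */

typedef struct {
    PyObject *consts[N_SINGLETONS + N_SMALL_INTS];
} common_consts_table;

static inline PyObject *
load_common(common_consts_table *t, unsigned oparg)
{
    /* oparg < N_SINGLETONS loads a singleton; larger opargs reach the
       small ints because the two regions are adjacent. */
    assert(oparg < N_SINGLETONS + N_SMALL_INTS);
    return t->consts[oparg];
}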

@iritkatriel

Small integer constants and single character constants might be better implemented with their own instructions, LOAD_INT and LOAD_CHAR.

Right, so do we want one opcode for all common consts or do we want LOAD_INT, LOAD_CHAR and LOAD_COMMON_CONST for a few others?

@gvanrossum commented Sep 3, 2021 via email

@iritkatriel

I started doing some prep work that we will need for any kind of LOAD_CONST split - basically refactoring LOAD_CONST handling into a form that can be generalized later.

dis module: python/cpython#28258
C code: python/cpython#28262

Also this to make the tests easier to work with: python/cpython#28247

@iritkatriel

Are there any objections to renaming the LOAD_CONST opcode to LOAD_CO_CONST?

If we do that then LOAD_CONST is a collective name for LOAD_CO_CONST, LOAD_NONE, LOAD_COMMON_CONST, etc., and it can appear as such in macros that generate one of those based on the value. If we leave it as LOAD_CONST I think it's confusing.

@gvanrossum commented Sep 10, 2021 via email

@iritkatriel

I've done LOAD_NONE (on top of the three prep PRs, so I made a PR into my branch that contains the prep): iritkatriel/cpython#29

It's showing a small speedup; see the PR.

Note that the docstring Nones are still in co_consts.

@gvanrossum

I wonder if the best compromise for the docstring isn't to define a new flag bit (we have plenty of spare bits) that means "co_consts[0] is the docstring". Leave it unset for code objects without docstrings (including classes, lambdas, comprehensions), set it for functions that have a docstring.
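A sketch of how readers of a code object might use such a bit; CO_HAS_DOCSTRING is an assumed name for the new flag, not an existing constant:

#include <Python.h>

#define CO_HAS_DOCSTRING 0x4000000   /* assumed spare CO_* bit */

/* With the flag set, co_consts[0] is the docstring; with it unset,
 * co_consts doesn't need to waste a slot on None at all. */
static PyObject *
code_get_docstring(PyCodeObject *co)
{
    if (co->co_flags & CO_HAS_DOCSTRING) {
        PyObject *doc = PyTuple_GET_ITEM(co->co_consts, 0);
        Py_INCREF(doc);
        return doc;
    }
    Py_RETURN_NONE;
}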

@iritkatriel

I wonder if the best compromise for the docstring isn't to define a new flag bit (we have plenty of spare bits) that means "co_consts[0] is the docstring". Leave it unset for code objects without docstrings (including classes, lambdas, comprehensions), set it for functions that have a docstring.

We can try: iritkatriel/cpython#30

@methane commented Sep 11, 2021

As I commented here, removing the docstring from co_consts makes it possible to merge the co_consts tuples of many simple methods that have a docstring.

The co_consts of most simple methods will then be the empty tuple, which is a singleton, so this is nearly zero-cost.

@gvanrossum commented Sep 12, 2021 via email

@iritkatriel commented Sep 12, 2021

The ref leak tests rely on _Py_RefTotal, so it's just a matter of not updating it for None, like:

 #ifdef Py_REF_DEBUG
-    _Py_RefTotal++;
+    if (!Py_IsNone(op)) {
+        _Py_RefTotal++;
+    }
 #endif

I'll try to get some perf numbers for this tomorrow.

@iritkatriel

main branch:

iritkatriel@Irits-MBP cpython % repeat 5 ./python.exe -m timeit "x=None"
20000000 loops, best of 5: 10.8 nsec per loop
20000000 loops, best of 5: 10.7 nsec per loop
20000000 loops, best of 5: 10.7 nsec per loop
20000000 loops, best of 5: 10.7 nsec per loop
20000000 loops, best of 5: 10.7 nsec per loop

LOAD_NONE:

iritkatriel@Irits-MBP cpython-1 % repeat 5 ./python.exe -m timeit "x=None"
20000000 loops, best of 5: 10.2 nsec per loop
20000000 loops, best of 5: 10.2 nsec per loop
20000000 loops, best of 5: 10.2 nsec per loop
20000000 loops, best of 5: 10.2 nsec per loop
20000000 loops, best of 5: 10.3 nsec per loop

LOAD_NONE + immortal None:

iritkatriel@Irits-MBP cpython-1 % repeat 5 ./python.exe -m timeit "x=None"
20000000 loops, best of 5: 9.77 nsec per loop
50000000 loops, best of 5: 9.79 nsec per loop
20000000 loops, best of 5: 9.86 nsec per loop
50000000 loops, best of 5: 9.81 nsec per loop
50000000 loops, best of 5: 9.78 nsec per loop

@iritkatriel

Before I try your branch, here are some quick numbers about the potential of optimizing LOAD_CONST(None) + STORE_FAST. In a fraction of the packages I have downloaded, 3.7% of all opcodes counted are STORE_FAST, and of those, 5.8% were preceded by LOAD_CONST(None).

A RETURN_NONE opcode might also make sense.
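For reference, a RETURN_NONE handler could be as small as this sketch (the exit label is simplified from the real ceval.c epilogue):

TARGET(RETURN_NONE): {
    /* Fuses LOAD_NONE + RETURN_VALUE into one instruction. */
    Py_INCREF(Py_None);
    retval = Py_None;
    goto exiting;   /* the same epilogue RETURN_VALUE jumps to */
}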

@ericsnowcurrently commented Sep 15, 2021

FYI, regarding "immortal objects" see python/cpython#28335.

(UPDATE: This should be python/cpython#24828)

@iritkatriel

@@ -497,17 +502,29 @@ static inline void _Py_DECREF(
    // Non-limited C API and limited C API for Python 3.9 and older access
    // directly PyObject.ob_refcnt.
#ifdef Py_REF_DEBUG
-    _Py_RefTotal--;
+    if (!Py_IsNone(op)) {
+        _Py_RefTotal--;
+    }
#endif
    if (--op->ob_refcnt != 0) {
#ifdef Py_REF_DEBUG
        if (op->ob_refcnt < 0) {
-            _Py_NegativeRefcount(filename, lineno, op);
+            if (Py_IsNone(op)) {
+                op->ob_refcnt = PY_SSIZE_T_MAX / 2;
+            }
+            else {
+                _Py_NegativeRefcount(filename, lineno, op);
+            }
        }
#endif
    }
    else {
-        _Py_Dealloc(op);
+        if (Py_IsNone(op)) {
+            op->ob_refcnt = PY_SSIZE_T_MAX / 2;
+        }
+        else {
+            _Py_Dealloc(op);
+        }
    }
#endif
}

@iritkatriel

FYI, regarding "immortal objects" see python/cpython#28335.

That's not the link you pasted in the chat (which I have now lost).

@gvanrossum

In the chat, Eric posted this: python/cpython#24828

@iritkatriel

His proposal is to set the function docstring explicitly in the bytecode that creates the function -- then the docstring would become part of the co_consts of the parent code object, most likely a class or module. I am beginning to warm up to this idea, though not completely -- I have a feeling there could be a tooling need for knowing the docstring of a (function) code object without looking at its parent, but it's just intuition. See bpo-36521 for further discussion.

The docstring would also be on the function object, so you could access it from there. Arguably it's a property of the function and not of the function's code object. Or am I missing something?

@gvanrossum

The docstring would also be on the function object, so you could access it from there. Arguably it's a property of the function and not of the function's code object. Or am I missing something?

I am thinking of a situation (say scanning PYC files) where you don't want to execute any code. Without execution, you have no function objects. So the docstring as specified in the source has to be somewhere in some code object.

Function objects exist primarily to store all the dynamic state wrapping code objects:

  • First and foremost, a pointer to the globals
  • Then, default argument values (which are computed dynamically when the function executes)
  • Next, "function attributes" (arbitrary things set on the function by attribute assignment)
  • Finally, annotations (also set dynamically, similarly to default argument values)

There are also a bunch of "metadata" attributes that are set on the function from the code object: __name__, __qualname__, and __doc__. Of these, __doc__ is unique in that it doesn't correspond to an attribute on the code object. (Then again, another metadata attribute is co_filename, which doesn't exist on the function object; and neither does co_firstlineno.)

@iritkatriel commented Sep 22, 2021

I added total size of co_consts to the script: faster-cpython/tools#5

Here are some results.

Total co_consts size goes from 735,736 to 718,616, so about 2.3% smaller.

It will get even better if/when we move None docstrings out of co_consts.

Old:

iritkatriel@Irits-MBP faster-cpython-tools % ../cpython/python.exe scripts/count_opcodes.py --singles 20 packages/*
packages/boto3-1.9.99.tar.gz: 38 files; 495 code objects; 7,218 lines; 14,273 opcodes; 13,778 opcode pairs; 7,127.0 cache_size; 1,401.0 cache wasted; 2,863 ops quickened; 37,616 co_consts size
packages/botocore-1.9.9.tar.gz: 253 files; 5,362 code objects; 68,939 lines; 191,786 opcodes; 186,424 opcode pairs; 92,619.0 cache_size; 20,519.0 cache wasted; 36,050 ops quickened; 371 prev extended args; 475,968 co_consts size
packages/s3transfer-0.5.0.tar.gz: 50 files; 1,839 code objects; 18,456 lines; 61,123 opcodes; 59,284 opcode pairs; 30,951.0 cache_size; 4,797.0 cache wasted; 13,077 ops quickened; 140,016 co_consts size
packages/six-1.9.0.tar.gz: 4 files; 227 code objects; 1,945 lines; 7,627 opcodes; 7,400 opcode pairs; 4,311.0 cache_size; 721.0 cache wasted; 1,795 ops quickened; 128 prev extended args; 18,736 co_consts size
packages/urllib3-1.9.1.tar.gz: 55 files; 706 code objects; 9,602 lines; 29,268 opcodes; 28,562 opcode pairs; 15,172.0 cache_size; 3,028.0 cache wasted; 6,072 ops quickened; 15 prev extended args; 63,400 co_consts size
Total: 400 files; 8,629 code objects; 106,160 lines; 304,077 opcodes; 295,448 opcode pairs; 150,180.0 cache_size; 30,466.0 cache wasted; 59,857 ops quickened; 514 prev extended args; 735,736 co_consts size

Top 20 single opcodes:
LOAD_CONST               60,316  19.84%
LOAD_FAST                46,660  15.34%
LOAD_ATTR                19,014   6.25%
POP_TOP                  18,111   5.96%
LOAD_METHOD              16,006   5.26%
STORE_NAME               15,350   5.05%
CALL_METHOD              14,113   4.64%
LOAD_GLOBAL              12,389   4.07%
STORE_FAST               11,721   3.85%
CALL_FUNCTION            10,747   3.53%
RETURN_VALUE             10,230   3.36%
MAKE_FUNCTION             8,229   2.71%
LOAD_NAME                 5,651   1.86%
STORE_ATTR                3,626   1.19%
BUILD_MAP                 3,030   1.00%
POP_JUMP_IF_FALSE         3,028   1.00%
BINARY_SUBSCR             2,686   0.88%
IMPORT_NAME               2,639   0.87%
IMPORT_FROM               2,385   0.78%
DUP_TOP                   2,242   0.74%

With LOAD_NONE:

iritkatriel@Irits-MBP faster-cpython-tools % ../cpython-1/python.exe scripts/count_opcodes.py --singles 20 packages/*
packages/boto3-1.9.99.tar.gz: 38 files; 495 code objects; 7,218 lines; 14,273 opcodes; 13,778 opcode pairs; 7,127.0 cache_size; 1,401.0 cache wasted; 2,863 ops quickened; 36,152 co_consts size
packages/botocore-1.9.9.tar.gz: 253 files; 5,362 code objects; 68,939 lines; 191,783 opcodes; 186,421 opcode pairs; 92,618.0 cache_size; 20,518.0 cache wasted; 36,050 ops quickened; 368 prev extended args; 466,104 co_consts size
packages/s3transfer-0.5.0.tar.gz: 50 files; 1,839 code objects; 18,456 lines; 61,123 opcodes; 59,284 opcode pairs; 30,951.0 cache_size; 4,797.0 cache wasted; 13,077 ops quickened; 136,816 co_consts size
packages/six-1.9.0.tar.gz: 4 files; 227 code objects; 1,945 lines; 7,626 opcodes; 7,399 opcode pairs; 4,310.0 cache_size; 720.0 cache wasted; 1,795 ops quickened; 127 prev extended args; 18,200 co_consts size
packages/urllib3-1.9.1.tar.gz: 55 files; 706 code objects; 9,602 lines; 29,268 opcodes; 28,562 opcode pairs; 15,172.0 cache_size; 3,028.0 cache wasted; 6,072 ops quickened; 15 prev extended args; 61,344 co_consts size
Total: 400 files; 8,629 code objects; 106,160 lines; 304,073 opcodes; 295,444 opcode pairs; 150,178.0 cache_size; 30,464.0 cache wasted; 59,857 ops quickened; 510 prev extended args; 718,616 co_consts size

Top 20 single opcodes:
LOAD_CONST               49,382  16.24%
LOAD_FAST                46,660  15.34%
LOAD_ATTR                19,014   6.25%
POP_TOP                  18,111   5.96%
LOAD_METHOD              16,006   5.26%
STORE_NAME               15,350   5.05%
CALL_METHOD              14,113   4.64%
LOAD_GLOBAL              12,389   4.07%
STORE_FAST               11,721   3.85%
LOAD_NONE                10,934   3.60%
CALL_FUNCTION            10,747   3.53%
RETURN_VALUE             10,230   3.36%
MAKE_FUNCTION             8,229   2.71%
LOAD_NAME                 5,651   1.86%
STORE_ATTR                3,626   1.19%
BUILD_MAP                 3,030   1.00%
POP_JUMP_IF_FALSE         3,028   1.00%
BINARY_SUBSCR             2,686   0.88%
IMPORT_NAME               2,639   0.87%
IMPORT_FROM               2,385   0.78%

@iritkatriel

Here are the top 30 opcode pairs with LOAD_NONE:

Note: LOAD_NONE --> RETURN_VALUE 7,439 2.52%

Top 30 opcode pairs:
LOAD_FAST            --> LOAD_ATTR                12,187   4.12%
LOAD_CONST           --> LOAD_CONST               11,575   3.92%
LOAD_FAST            --> LOAD_METHOD              10,756   3.64%
STORE_NAME           --> LOAD_CONST                8,541   2.89%
LOAD_CONST           --> MAKE_FUNCTION             8,229   2.79%
CALL_METHOD          --> POP_TOP                   8,171   2.77%
LOAD_NONE            --> RETURN_VALUE              7,439   2.52%
STORE_FAST           --> LOAD_FAST                 7,052   2.39%
LOAD_METHOD          --> LOAD_FAST                 6,355   2.15%
MAKE_FUNCTION        --> STORE_NAME                6,323   2.14%
LOAD_FAST            --> LOAD_FAST                 5,277   1.79%
POP_TOP              --> LOAD_FAST                 5,131   1.74%
POP_TOP              --> LOAD_NONE                 4,696   1.59%
LOAD_METHOD          --> LOAD_CONST                4,577   1.55%
LOAD_FAST            --> LOAD_CONST                4,147   1.40%
LOAD_GLOBAL          --> LOAD_FAST                 4,016   1.36%
LOAD_CONST           --> CALL_METHOD               3,984   1.35%
LOAD_ATTR            --> LOAD_METHOD               3,832   1.30%
LOAD_CONST           --> LOAD_FAST                 3,693   1.25%
LOAD_GLOBAL          --> LOAD_ATTR                 3,681   1.25%
LOAD_METHOD          --> CALL_METHOD               3,575   1.21%
LOAD_ATTR            --> LOAD_CONST                3,509   1.19%
LOAD_ATTR            --> LOAD_FAST                 3,203   1.08%
LOAD_FAST            --> STORE_ATTR                2,976   1.01%
LOAD_FAST            --> CALL_METHOD               2,849   0.96%
POP_TOP              --> POP_TOP                   2,755   0.93%
CALL_METHOD          --> STORE_FAST                2,517   0.85%
IMPORT_FROM          --> STORE_NAME                2,360   0.80%
LOAD_FAST            --> CALL_FUNCTION             2,295   0.78%
LOAD_CONST           --> STORE_NAME                2,167   0.73%

@gvanrossum

While I do think we should do RETURN_NONE, I wouldn't expect the world of it -- dynamically, at most one RETURN_NONE instruction can execute per call (by definition :-).

@iritkatriel

With this PR we got:

Total co_consts size goes from 48,822 to 46,682, about 4.3% less.

If I add this change: iritkatriel/cpython#30
(it adds a flag to the function to indicate whether its docstring is not None, and excludes None from the co_consts) then the total size of co_consts reduces to 40,878, which is consistent with the previous result that None is 20% of the consts.

@iritkatriel commented Sep 27, 2021

With only the docstring change (but no LOAD_NONE), it's 48,302. So most of the Nones are from docstrings.

@gvanrossum

Either there's a typo, or that last conclusion is wrong (48,302 is hardly less than 48,822)?

@iritkatriel

Sorry, I misread the numbers, you're right. So it's the other way around -- adding the docstring change to LOAD_NONE helps a lot, but the docstring change alone not so much.

@gvanrossum

So there are 520 code objects that have a None at position 0 that's only used to indicate there's no docstring (these code objects don't load None on the stack at all).

And there are 2,140 code objects that have a None at another position that is used only to load None (we can deduce that these code objects have a docstring at position 0).

But 5,284 code objects have a None in position 0 that is used both to load None and to indicate there's no docstring (the rest of the 7,944 Nones removed by doing both).

So we should do both things if we care about the size of co_consts.

Counter to this, we don't measure a real speedup; we save only 64 kB if all those packages you counted are imported in one process, and the marshal encoding of None is a single byte (TYPE_NONE), so it only saves 8 kB there.

I say we give up. :-( It's too bad, I had high hopes for each of these until we did the work.

@iritkatriel

Yes, it doesn't seem to be worth another opcode.

Is there a chance that a MAKE_INT opcode will do anything?

@gvanrossum

I expect it's the same story. The common small ints are already cached, so LOAD_CONST is more efficient than anything else. We'd save much less because there are just fewer of these than None loads (ints are 6.27% of the top 25 constants), although the marshal size is larger (5 bytes -- there's no "short int" type in marshal); it probably wouldn't matter.

@iritkatriel

One more thing I haven't tried yet is LOAD_CONST(None)+STORE_FAST:
#65 (comment)

@gvanrossum

But if I read that correctly, it's only ~0.2% of opcode pairs that gets improved.

Maybe we should create a framework for measuring dynamic opcode pairs, using e.g. a benchmark suite like PyPerformance as a basis.
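For what it's worth, ceval.c has long had an optional facility close to this: building with DYNAMIC_EXECUTION_PROFILE and DXPAIRS counts every executed opcode pair, exposed through sys.getdxp(). A simplified sketch of the counting logic (details vary by version):

#ifdef DXPAIRS
static long dxpairs[257][256];   /* [previous opcode][current opcode] */
#endif

/* ...inside the dispatch loop, as each instruction is decoded: */
dxpairs[lastopcode][opcode]++;
lastopcode = opcode;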

@gvanrossum

There are still some ideas here that we might do, so I'm moving it to Discussions.

@faster-cpython locked and limited conversation to collaborators on Dec 2, 2021
@gramster moved this to Todo in Fancy CPython Board on Jan 10, 2022
@gramster moved this from Todo to Other in Fancy CPython Board on Jan 10, 2022

This issue was moved to a discussion. You can continue the conversation there.
