Faster startup -- Experiment B -- streamline and tweak #65

Closed · gvanrossum opened this issue Jul 12, 2021 · 62 comments
@gvanrossum commented Jul 12, 2021

This is a placeholder for a description of Experiment B (see #32 (comment))

  • try to speed up unmarshalling of code objects by low-level optimizations

  • streamline code objects (fewer redundant, computed fields; fewer small objects)

  • opcodes for immediate values:

    • MAKE_INT creates an int from oparg; for 0-255 this is a sure "integer cache" hit that we can inline
    • LOAD_COMMON_CONSTANT: oparg indexes into a second switch (or array?) of shared immutable common constants, e.g.
      • 0: None
      • 1: False
      • 2: True
      • 3: Ellipsis
      • 4: AssertionError (subsumes LOAD_ASSERTION_ERROR)
      • 5: ''
      • 6: ()
      • 7-11 (or some other range): negative values -1 through -5 (also sure hits in the integer cache)

    The benefit of MAKE_INT and LOAD_COMMON_CONSTANT is that they reduce the size of co_consts (and hence of PYC files) without adding to the size of the instruction array. The downside is that the speed and simplicity of LOAD_CONST is hard to beat, so the difference will be hard to measure in benchmarks. But None, False, True and small ints make up a large share of the constants (we can easily do research on this); a rough sketch of the two handlers follows below.
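For illustration, a minimal sketch of the two handlers in the style of the ceval.c instruction switch. The common_constants table is hypothetical, and the small-int indexing is an assumption modeled on the per-interpreter cache; this is not an actual patch.

TARGET(MAKE_INT): {
    /* oparg 0..255 is a guaranteed hit in the per-interpreter
       small-int cache, so no allocation can happen here. */
    PyInterpreterState *interp = _PyInterpreterState_GET();
    PyObject *value = (PyObject *)interp->small_ints[_PY_NSMALLNEGINTS + oparg];
    Py_INCREF(value);
    PUSH(value);
    DISPATCH();
}

TARGET(LOAD_COMMON_CONSTANT): {
    /* oparg indexes a short table of shared immutable singletons:
       0=None, 1=False, 2=True, 3=Ellipsis, 4=AssertionError, 5='', 6=() */
    PyObject *value = common_constants[oparg];  /* hypothetical table */
    Py_INCREF(value);
    PUSH(value);
    DISPATCH();
}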

@gvanrossum changed the title from "Faster startup -- Experiment B" to "Faster startup -- Experiment B -- streamline and tweak" on Jul 12, 2021
@gvanrossum

FWIW, in the 100 most popular PyPI packages, these were the 50 most common constants:

Total: 239 errors; 9,992 files; 218,124 code objects; 3,369,978 lines; 1,614,918 constants

Top 50 constants:
                                    None :     208223 (12.89%)
                                       0 :      41652 (2.58%)
                                       1 :      34245 (2.12%)
                                    True :      21520 (1.33%)
                                       2 :      19403 (1.20%)
                                   False :      16121 (1.00%)
                                       3 :      10646 (0.66%)
                                       5 :       6599 (0.41%)
                                       4 :       6413 (0.40%)
                                      -1 :       6359 (0.39%)
                                      '' :       5680 (0.35%)
                              ('dtype',) :       5539 (0.34%)
                                      10 :       5538 (0.34%)
                                     'a' :       4088 (0.25%)
                              ('match',) :       3505 (0.22%)
                                   'foo' :       3393 (0.21%)
                                      () :       2992 (0.19%)
                                     1.0 :       2940 (0.18%)
                                     'b' :       2837 (0.18%)
                                       6 :       2639 (0.16%)
                                 (None,) :       2562 (0.16%)
                                     'A' :       2385 (0.15%)
                                       8 :       2217 (0.14%)
                                  'name' :       2149 (0.13%)
                                'return' :       2132 (0.13%)
                                       7 :       2124 (0.13%)
                                     0.5 :       2006 (0.12%)
                              ('index',) :       1956 (0.12%)
                                     'x' :       1949 (0.12%)
                               ('name',) :       1946 (0.12%)
                                       9 :       1940 (0.12%)
                                     0.0 :       1939 (0.12%)
                               (1, 2, 3) :       1892 (0.12%)
                                     100 :       1859 (0.12%)
                                     '/' :       1624 (0.10%)
                                   'bar' :       1608 (0.10%)
                               ('axis',) :       1605 (0.10%)
                                      12 :       1574 (0.10%)
                                      20 :       1557 (0.10%)
                                    '\n' :       1552 (0.10%)
                                     'B' :       1538 (0.10%)
                                    'id' :       1459 (0.09%)
                        ('primary_key',) :       1421 (0.09%)
                                     ' ' :       1405 (0.09%)
                                     2.0 :       1386 (0.09%)
                                     'c' :       1365 (0.08%)
                                     '.' :       1362 (0.08%)
                                  'data' :       1283 (0.08%)
                                 'utf-8' :       1213 (0.08%)
                              '__main__' :       1186 (0.07%)

@gvanrossum commented Jul 16, 2021

Crazy idea: combine these two instructions. It could do something like

switch (oparg) {
case 0: value = Py_None; break;
case 1: value = Py_False; break;
// Several more cases for the other singletons
default:
    // PyLong_FromLong returns a new (cached) reference,
    // so this path must not INCREF again.
    value = PyLong_FromLong(oparg - 50);
    PUSH(value);
    DISPATCH();
}
// Singleton path: take a reference before pushing.
Py_INCREF(value);
PUSH(value);
DISPATCH();

The idea would be that oparg values in the range 10 to 255 translate to -40 to 205 -- that should cover the most common integers. (Or we could use a different bias, e.g. only go down to -5, since cached small ints cover the range -5 to 256.)

Then again, a dedicated MAKE_INT instruction would be faster. (We could still support negative values by using a bias value.)

@gvanrossum

I missed empty bytes (b'') for LOAD_COMMON_CONSTANT. There are probably a few other empty immutable values I missed (empty frozenset?). I also believe that (None,) may be generated in some cases by the compiler, but I can't recall where exactly (closures? comprehensions?)

@methane commented Aug 12, 2021

(None,) is common for simple functions without a docstring.

@gvanrossum

Oh, it's co_consts itself. Maybe we should add the docstring as an explicit field in the code object. Then we can load it on demand -- another little saving on code load time, perhaps.

@iritkatriel

I started working on MAKE_INT here iritkatriel/cpython#26
Currently some test_modulefinder tests are still failing.

I think this opcode might be more appropriately named LOAD_SMALL_INT.

Initial perf numbers show this to be slightly slower than LOAD_CONST (I'll post something here soon). It might help if I create a function that takes the index into the small int cache and does the lookup. Right now we translate the oparg (which is the index) to the value, and then the lookup function transforms the value back to the index.

@iritkatriel

New:

% ./python.exe -m timeit "x=2"
20000000 loops, best of 5: 11.8 nsec per loop
>>> dis.dis("x=2")
  1           0 MAKE_INT                 7 (2)
              2 STORE_NAME               0 (x)
              4 LOAD_CONST               0 (None)
              6 RETURN_VALUE

Old:

% ./python.exe -m timeit "x = 2"
20000000 loops, best of 5: 10.8 nsec per loop

>>> dis.dis("x=2")
  1           0 LOAD_CONST               0 (2)
              2 STORE_NAME               0 (x)
              4 LOAD_CONST               1 (None)
              6 RETURN_VALUE

Sanity check:

New:

% ./python.exe -m timeit "x=300"
20000000 loops, best of 5: 10.7 nsec per loop

>>> dis.dis("x=300")
  1           0 LOAD_CONST               0 (300)
              2 STORE_NAME               0 (x)
              4 LOAD_CONST               1 (None)
              6 RETURN_VALUE

Old:

% ./python.exe -m timeit "x = 300"
20000000 loops, best of 5: 10.6 nsec per loop

same dis as above.

@iritkatriel

I'm not getting very stable results from pyperformance, but here's a table anyway.

+-------------------------+---------+-----------------------+
| Benchmark               | old     | new                   |
+=========================+=========+=======================+
| xml_etree_iterparse     | 124 ms  | 119 ms: 1.04x faster  |
+-------------------------+---------+-----------------------+
| richards                | 77.0 ms | 75.0 ms: 1.03x faster |
+-------------------------+---------+-----------------------+
| fannkuch                | 554 ms  | 540 ms: 1.03x faster  |
+-------------------------+---------+-----------------------+
| scimark_lu              | 186 ms  | 182 ms: 1.02x faster  |
+-------------------------+---------+-----------------------+
| python_startup          | 14.1 ms | 13.8 ms: 1.02x faster |
+-------------------------+---------+-----------------------+
| scimark_sparse_mat_mult | 6.52 ms | 6.41 ms: 1.02x faster |
+-------------------------+---------+-----------------------+
| python_startup_no_site  | 10.2 ms | 10.1 ms: 1.02x faster |
+-------------------------+---------+-----------------------+
| telco                   | 7.41 ms | 7.31 ms: 1.01x faster |
+-------------------------+---------+-----------------------+
| unpack_sequence         | 61.1 ns | 60.5 ns: 1.01x faster |
+-------------------------+---------+-----------------------+
| logging_format          | 9.25 us | 9.17 us: 1.01x faster |
+-------------------------+---------+-----------------------+
| sympy_integrate         | 25.8 ms | 25.6 ms: 1.01x faster |
+-------------------------+---------+-----------------------+
| sympy_expand            | 613 ms  | 609 ms: 1.01x faster  |
+-------------------------+---------+-----------------------+
| pickle_list             | 4.10 us | 4.12 us: 1.01x slower |
+-------------------------+---------+-----------------------+
| spectral_norm           | 183 ms  | 185 ms: 1.01x slower  |
+-------------------------+---------+-----------------------+
| scimark_sor             | 212 ms  | 214 ms: 1.01x slower  |
+-------------------------+---------+-----------------------+
| chameleon               | 10.4 ms | 10.6 ms: 1.01x slower |
+-------------------------+---------+-----------------------+
| scimark_monte_carlo     | 101 ms  | 102 ms: 1.01x slower  |
+-------------------------+---------+-----------------------+
| pickle                  | 11.2 us | 11.4 us: 1.01x slower |
+-------------------------+---------+-----------------------+
| unpickle_list           | 4.83 us | 4.90 us: 1.01x slower |
+-------------------------+---------+-----------------------+
| regex_dna               | 187 ms  | 190 ms: 1.02x slower  |
+-------------------------+---------+-----------------------+
| pathlib                 | 49.5 ms | 50.6 ms: 1.02x slower |
+-------------------------+---------+-----------------------+
| logging_silent          | 198 ns  | 205 ns: 1.03x slower  |
+-------------------------+---------+-----------------------+
| hexiom                  | 10.2 ms | 10.5 ms: 1.03x slower |
+-------------------------+---------+-----------------------+
| 2to3                    | 362 ms  | 380 ms: 1.05x slower  |
+-------------------------+---------+-----------------------+
| pickle_dict             | 24.0 us | 25.4 us: 1.06x slower |
+-------------------------+---------+-----------------------+
| Geometric mean          | (ref)   | 1.00x faster          |
+-------------------------+---------+-----------------------+


Benchmark hidden because not significant (33): xml_etree_generate, float, sqlalchemy_declarative, unpickle, logging_simple, django_template, nbody, mako, pidigits, meteor_contest, json_loads, raytrace, regex_compile, sqlite_synth, pyflate, chaos, sympy_sum, sqlalchemy_imperative, xml_etree_process, sympy_str, crypto_pyaes, deltablue, dulwich_log, scimark_fft, xml_etree_parse, json_dumps, nqueens, pickle_pure_python, regex_effbot, regex_v8, unpickle_pure_python, go, tornado_http

@iritkatriel

So I did the optimisation I mentioned above and now I get:
iritkatriel@Irits-MBP cpython % ./python.exe -m timeit "x = 2"
20000000 loops, best of 5: 11.1 nsec per loop
iritkatriel@Irits-MBP cpython % ./python.exe -m timeit "x = 2"
20000000 loops, best of 5: 10.8 nsec per loop
iritkatriel@Irits-MBP cpython % ./python.exe -m timeit "x = 2"
20000000 loops, best of 5: 10.9 nsec per loop

which is comparable to LOAD_CONST.

The change is this:
I added this to Include/internal/pycore_long.h:

+static inline PyObject* _PyLong_SmallIntLookup_internal(size_t index)
+{
+    PyInterpreterState *interp = _PyInterpreterState_GET();
+    assert(0 <= index && index < _PY_NSMALLNEGINTS + _PY_NSMALLPOSINTS);
+    PyObject *obj = (PyObject*)interp->small_ints[index];
+    // must not be called before _PyLong_Init() nor after _PyLong_Fini().
+    assert(obj != NULL);
+    return obj;
+}

and then changed in ceval.c:

         TARGET(MAKE_INT): {
-            PyObject *value = _PyLong_GetSmallInt(oparg - _PY_MAKE_INT_BIAS);
+            PyObject *value = _PyLong_SmallIntLookup_internal(oparg);
+            Py_INCREF(value);
             PUSH(value);
             DISPATCH();
         }

@gvanrossum

Cool! So the compiler applies the bias? That makes the bytecode dependent on its value — easy to miss changing the magic number if the bias changes.

I also wonder whether it would be faster or slower if instead we just hard-coded a few popular values (e.g. -1 .. 3) in LOAD_COMMON_CONSTANT?
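On the bias hazard above: a cheap way to avoid it is to define the bias exactly once and derive both the compiler's encoder and the interpreter's decoder from it. A self-contained sketch of the round-trip property (all names illustrative, not CPython API):

#include <assert.h>

/* One shared bias constant; change it in one place and both sides stay
 * in sync. oparg 0 encodes the int -MAKE_INT_BIAS. */
#define MAKE_INT_BIAS 5

static int encode_oparg(int value) { return value + MAKE_INT_BIAS; }
static int decode_value(int oparg) { return oparg - MAKE_INT_BIAS; }

int main(void) {
    /* Round-trip over the whole encodable range, -5 through 250. */
    for (int v = -MAKE_INT_BIAS; v <= 255 - MAKE_INT_BIAS; v++) {
        int oparg = encode_oparg(v);
        assert(0 <= oparg && oparg <= 255);
        assert(decode_value(oparg) == v);
    }
    return 0;
}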

@iritkatriel

I think that might be tidier. There is too much Mark-style breaking of the rules in what I did.

@iritkatriel

Here's a rewrite with one common-consts array for ints and other types:
https://github.com/iritkatriel/cpython/pull/27/files

For now I only populate ints, next I will uncomment the other types (and fix the test).

@methane commented Sep 2, 2021

I think we need to check the common consts again. (None,) is common in co_consts, but it is not a target of LOAD_CONST.
LOAD_CONST loads a code object that contains the co_consts, or None from within the co_consts, but it doesn't load the co_consts tuple itself.

@methane commented Sep 2, 2021

Uh, I found it. It is used for function defaults.

#    def truncate(self, pos=None):

401          32 LOAD_CONST              31 ((None,))
             34 LOAD_CONST               7 (<code object truncate at 0x7fd5084a4a40, file "Lib/_pyio.py", line 401>)
             36 MAKE_FUNCTION            1 (defaults)
             38 STORE_NAME               7 (truncate)

@markshannon

I think we should only use LOAD_COMMON_CONST for a small set of genuinely common (or necessary, like AssertionError) constants.

How many of the constants in #65 (comment) are in run-once code that would be better implemented as bytecode?

The reason I ask is that creating objects in the interpreter is likely to be faster than in the marshal module.
E.g. if the const ("x", "y") is only used once, then LOAD_CONST "x"; LOAD_CONST "y"; BUILD_TUPLE 2 is likely to be faster than LOAD_CONST ("x", "y"), as it moves the work of building the tuple from the marshal module to the interpreter.

Small integer constants and single character constants might be better implemented with their own instructions, LOAD_INT and LOAD_CHAR.
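On LOAD_CHAR: a handler could lean on the existing one-character string cache. A rough sketch, assuming the compiler emits the code point as oparg (PyUnicode_FromOrdinal returns the cached singleton for latin-1 characters):

TARGET(LOAD_CHAR): {
    /* oparg holds the code point; for latin-1 this hits the
       interpreter's cached 1-character strings, so no allocation. */
    PyObject *value = PyUnicode_FromOrdinal(oparg);
    if (value == NULL) {
        goto error;
    }
    PUSH(value);
    DISPATCH();
}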

@gvanrossum

I would personally only put empty sequences in there (i.e., (), b"", "", and frozenset()), some zeros (0, 0.0, maybe 0j), and Ellipsis and AssertionError. Even (None,) seems excessive -- I'd much sooner add 1 and -1 and maybe 2.

There's a potentially clever ("Mark-style") trick to use this for small integers without having a large array -- in the interpreter state, allocate the array of singletons (see above) right before the array of small ints, so that over-indexing the singletons array gets you a small int. (To make it less hackish we could have a single array and adjust the macros for getting a small int from its value -- it's all fixed offsets relative to the start of the interpreter state struct anyway.)

However, I wouldn't do that yet, because IIUC @ericsnowcurrently is planning to replace the array of small integer pointers with an array of small integer structs -- see #79. Extending that idea for None, (), "" etc. would seem cumbersome.
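For concreteness, a minimal sketch of the single-array idea anyway (all names hypothetical, not CPython API):

#include <Python.h>
#include <assert.h>

/* The common-constant singletons sit immediately before the small-int
 * cache in one contiguous array, so over-indexing the singleton region
 * lands on the ints without a second table or a second opcode. */
#define N_SINGLETONS 8     /* None, False, True, Ellipsis, ... */
#define N_SMALL_INTS 262   /* -5..256, as in CPython's int cache */

typedef struct {
    PyObject *consts[N_SINGLETONS + N_SMALL_INTS];
} common_consts_table;

static inline PyObject *
load_common(common_consts_table *t, unsigned oparg)
{
    /* oparg < N_SINGLETONS loads a singleton; larger opargs reach the
       small ints because the two regions are adjacent. */
    assert(oparg < N_SINGLETONS + N_SMALL_INTS);
    return t->consts[oparg];
}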

@iritkatriel

Small integer constants and single character constants might be better implemented with their own instructions, LOAD_INT and LOAD_CHAR.

Right, so do we want one opcode for all common consts or do we want LOAD_INT, LOAD_CHAR and LOAD_COMMON_CONST for a few others?

@gvanrossum commented Sep 3, 2021 via email

@iritkatriel

I started doing some prep work that we will need for any kind of LOAD_CONST split - basically refactoring LOAD_CONST handling into a form that can be generalized later.

dis module: python/cpython#28258
C code: python/cpython#28262

Also this to make the tests easier to work with: python/cpython#28247

@iritkatriel

Are there any objections to renaming the LOAD_CONST opcode to LOAD_CO_CONST?

If we do that then LOAD_CONST is a collective name for LOAD_CO_CONST, LOAD_NONE, LOAD_COMMON_CONST, etc., and it can appear as such in macros that generate one of those based on the value. If we leave it as LOAD_CONST I think it's confusing.

@gvanrossum commented Sep 10, 2021 via email

@iritkatriel

I've done LOAD_NONE (on top of the three prep PRs, so I made a PR into my branch that contains the prep): iritkatriel/cpython#29

It's showing a small speedup; see the PR.

Note that the docstring Nones are still in co_consts.

@gvanrossum

I wonder if the best compromise for the docstring isn't to define a new flag bit (we have plenty of spare bits) that means "co_consts[0] is the docstring". Leave it unset for code objects without docstrings (including classes, lambdas, comprehensions), set it for functions that have a docstring.
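A sketch of how readers of a code object might use such a bit; CO_HAS_DOCSTRING is an assumed name for the new flag, not an existing constant:

#include <Python.h>

#define CO_HAS_DOCSTRING 0x4000000   /* assumed spare CO_* bit */

/* With the flag set, co_consts[0] is the docstring; with it unset,
 * co_consts doesn't need to waste a slot on None at all. */
static PyObject *
code_get_docstring(PyCodeObject *co)
{
    if (co->co_flags & CO_HAS_DOCSTRING) {
        PyObject *doc = PyTuple_GET_ITEM(co->co_consts, 0);
        Py_INCREF(doc);
        return doc;
    }
    Py_RETURN_NONE;
}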

@iritkatriel

I wonder if the best compromise for the docstring isn't to define a new flag bit (we have plenty of spare bits) that means "co_consts[0] is the docstring". Leave it unset for code objects without docstrings (including classes, lambdas, comprehensions), set it for functions that have a docstring.

We can try: iritkatriel/cpython#30

@methane commented Sep 11, 2021

As I commented here, removing the docstring from co_consts makes it possible to merge the co_consts tuples of many simple methods that have a docstring.

The co_consts of most simple methods will then be the empty tuple, which is a singleton, so this is nearly zero-cost.

@gvanrossum commented Sep 12, 2021 via email

@iritkatriel commented Sep 12, 2021

The ref leak tests rely on _Py_RefTotal, so it's just a matter of not updating it for None, like:

 #ifdef Py_REF_DEBUG
-    _Py_RefTotal++;
+    if (!Py_IsNone(op)) {
+        _Py_RefTotal++;
+    }
 #endif

I'll try to get some perf numbers for this tomorrow.

@iritkatriel

main branch:

iritkatriel@Irits-MBP cpython % repeat 5 ./python.exe -m timeit "x=None"
20000000 loops, best of 5: 10.8 nsec per loop
20000000 loops, best of 5: 10.7 nsec per loop
20000000 loops, best of 5: 10.7 nsec per loop
20000000 loops, best of 5: 10.7 nsec per loop
20000000 loops, best of 5: 10.7 nsec per loop

LOAD_NONE:

iritkatriel@Irits-MBP cpython-1 % repeat 5 ./python.exe -m timeit "x=None"
20000000 loops, best of 5: 10.2 nsec per loop
20000000 loops, best of 5: 10.2 nsec per loop
20000000 loops, best of 5: 10.2 nsec per loop
20000000 loops, best of 5: 10.2 nsec per loop
20000000 loops, best of 5: 10.3 nsec per loop

LOAD_NONE + immortal None:

iritkatriel@Irits-MBP cpython-1 % repeat 5 ./python.exe -m timeit "x=None"
20000000 loops, best of 5: 9.77 nsec per loop
50000000 loops, best of 5: 9.79 nsec per loop
20000000 loops, best of 5: 9.86 nsec per loop
50000000 loops, best of 5: 9.81 nsec per loop
50000000 loops, best of 5: 9.78 nsec per loop

@iritkatriel

Before I try your branch, here are some quick numbers about the potential of optimizing LOAD_CONST(None) + STORE_FAST. In a fraction of the packages I have downloaded, 3.7% of all opcodes counted are STORE_FAST, and of those, 5.8% were preceded by LOAD_CONST(None).

A RETURN_NONE opcode might also make sense.
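For reference, a RETURN_NONE handler could be as small as this sketch (the exit label is simplified from the real ceval.c epilogue):

TARGET(RETURN_NONE): {
    /* Fuses LOAD_NONE + RETURN_VALUE into one instruction. */
    Py_INCREF(Py_None);
    retval = Py_None;
    goto exiting;   /* the same epilogue RETURN_VALUE jumps to */
}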

@ericsnowcurrently commented Sep 15, 2021

FYI, regarding "immortal objects" see python/cpython#28335.

(UPDATE: This should be python/cpython#24828)

@iritkatriel

@@ -497,17 +502,29 @@ static inline void _Py_DECREF(
    // Non-limited C API and limited C API for Python 3.9 and older access
    // directly PyObject.ob_refcnt.
#ifdef Py_REF_DEBUG
-    _Py_RefTotal--;
+    if (!Py_IsNone(op)) {
+        _Py_RefTotal--;
+    }
#endif
    if (--op->ob_refcnt != 0) {
#ifdef Py_REF_DEBUG
        if (op->ob_refcnt < 0) {
-            _Py_NegativeRefcount(filename, lineno, op);
+            if (Py_IsNone(op)) {
+                op->ob_refcnt = PY_SSIZE_T_MAX / 2;
+            }
+            else {
+                _Py_NegativeRefcount(filename, lineno, op);
+            }
        }
#endif
    }
    else {
-        _Py_Dealloc(op);
+        if (Py_IsNone(op)) {
+            op->ob_refcnt = PY_SSIZE_T_MAX / 2;
+        }
+        else {
+            _Py_Dealloc(op);
+        }
    }
#endif
}

@iritkatriel

FYI, regarding "immortal objects" see python/cpython#28335.

That's not the link you pasted in the chat (which I have now lost).

@gvanrossum

In the chat, Eric posted this: python/cpython#24828

@iritkatriel

His proposal is to set the function docstring explicitly in the bytecode that creates the function -- then the docstring would become part of the co_consts of the parent code object, most likely a class or module. I am beginning to warm up to this idea, though not completely -- I have a feeling there could be a tooling need for knowing the docstring of a (function) code object without looking at its parent, but it's just intuition. See bpo-36521 for further discussion.

The docstring would also be on the function object, so you could access it from there. Arguably it's a property of the function and not of the function's code object. Or am I missing something?

@gvanrossum

The docstring would also be on the function object, so you could access it from there. Arguably it's a property of the function and not of the function's code object. Or am I missing something?

I am thinking of a situation (say scanning PYC files) where you don't want to execute any code. Without execution, you have no function objects. So the docstring as specified in the source has to be somewhere in some code object.

Function objects exist primarily to store all the dynamic state wrapping code objects:

  • First and foremost, a pointer to the globals
  • Then, default argument values (which are computed dynamically when the function executes)
  • Next, "function attributes" (arbitrary things set on the function by attribute assignment)
  • Finally, annotations (also set dynamically, similarly to default argument values)

There are also a bunch of "metadata" attributes that are set on the function from the code object: __name__, __qualname__, and __doc__. Of these, __doc__ is unique in that it doesn't correspond to an attribute on the code object. (Then again, another metadata attribute is co_filename, which doesn't exist on the function object; and neither does co_firstlineno.)

@iritkatriel commented Sep 22, 2021

I added total size of co_consts to the script: faster-cpython/tools#5

Here are some results.

Total co_consts size goes from 735,736 to 718,616, so about 2.3% smaller.

It will get even better if/when we move None docstrings out of co_consts.

Old:

iritkatriel@Irits-MBP faster-cpython-tools % ../cpython/python.exe scripts/count_opcodes.py --singles 20 packages/*
packages/boto3-1.9.99.tar.gz: 38 files; 495 code objects; 7,218 lines; 14,273 opcodes; 13,778 opcode pairs; 7,127.0 cache_size; 1,401.0 cache wasted; 2,863 ops quickened; 37,616 co_consts size
packages/botocore-1.9.9.tar.gz: 253 files; 5,362 code objects; 68,939 lines; 191,786 opcodes; 186,424 opcode pairs; 92,619.0 cache_size; 20,519.0 cache wasted; 36,050 ops quickened; 371 prev extended args; 475,968 co_consts size
packages/s3transfer-0.5.0.tar.gz: 50 files; 1,839 code objects; 18,456 lines; 61,123 opcodes; 59,284 opcode pairs; 30,951.0 cache_size; 4,797.0 cache wasted; 13,077 ops quickened; 140,016 co_consts size
packages/six-1.9.0.tar.gz: 4 files; 227 code objects; 1,945 lines; 7,627 opcodes; 7,400 opcode pairs; 4,311.0 cache_size; 721.0 cache wasted; 1,795 ops quickened; 128 prev extended args; 18,736 co_consts size
packages/urllib3-1.9.1.tar.gz: 55 files; 706 code objects; 9,602 lines; 29,268 opcodes; 28,562 opcode pairs; 15,172.0 cache_size; 3,028.0 cache wasted; 6,072 ops quickened; 15 prev extended args; 63,400 co_consts size
Total: 400 files; 8,629 code objects; 106,160 lines; 304,077 opcodes; 295,448 opcode pairs; 150,180.0 cache_size; 30,466.0 cache wasted; 59,857 ops quickened; 514 prev extended args; 735,736 co_consts size

Top 20 single opcodes:
LOAD_CONST               60,316  19.84%
LOAD_FAST                46,660  15.34%
LOAD_ATTR                19,014   6.25%
POP_TOP                  18,111   5.96%
LOAD_METHOD              16,006   5.26%
STORE_NAME               15,350   5.05%
CALL_METHOD              14,113   4.64%
LOAD_GLOBAL              12,389   4.07%
STORE_FAST               11,721   3.85%
CALL_FUNCTION            10,747   3.53%
RETURN_VALUE             10,230   3.36%
MAKE_FUNCTION             8,229   2.71%
LOAD_NAME                 5,651   1.86%
STORE_ATTR                3,626   1.19%
BUILD_MAP                 3,030   1.00%
POP_JUMP_IF_FALSE         3,028   1.00%
BINARY_SUBSCR             2,686   0.88%
IMPORT_NAME               2,639   0.87%
IMPORT_FROM               2,385   0.78%
DUP_TOP                   2,242   0.74%

With LOAD_NONE:

iritkatriel@Irits-MBP faster-cpython-tools % ../cpython-1/python.exe scripts/count_opcodes.py --singles 20 packages/*
packages/boto3-1.9.99.tar.gz: 38 files; 495 code objects; 7,218 lines; 14,273 opcodes; 13,778 opcode pairs; 7,127.0 cache_size; 1,401.0 cache wasted; 2,863 ops quickened; 36,152 co_consts size
packages/botocore-1.9.9.tar.gz: 253 files; 5,362 code objects; 68,939 lines; 191,783 opcodes; 186,421 opcode pairs; 92,618.0 cache_size; 20,518.0 cache wasted; 36,050 ops quickened; 368 prev extended args; 466,104 co_consts size
packages/s3transfer-0.5.0.tar.gz: 50 files; 1,839 code objects; 18,456 lines; 61,123 opcodes; 59,284 opcode pairs; 30,951.0 cache_size; 4,797.0 cache wasted; 13,077 ops quickened; 136,816 co_consts size
packages/six-1.9.0.tar.gz: 4 files; 227 code objects; 1,945 lines; 7,626 opcodes; 7,399 opcode pairs; 4,310.0 cache_size; 720.0 cache wasted; 1,795 ops quickened; 127 prev extended args; 18,200 co_consts size
packages/urllib3-1.9.1.tar.gz: 55 files; 706 code objects; 9,602 lines; 29,268 opcodes; 28,562 opcode pairs; 15,172.0 cache_size; 3,028.0 cache wasted; 6,072 ops quickened; 15 prev extended args; 61,344 co_consts size
Total: 400 files; 8,629 code objects; 106,160 lines; 304,073 opcodes; 295,444 opcode pairs; 150,178.0 cache_size; 30,464.0 cache wasted; 59,857 ops quickened; 510 prev extended args; 718,616 co_consts size

Top 20 single opcodes:
LOAD_CONST               49,382  16.24%
LOAD_FAST                46,660  15.34%
LOAD_ATTR                19,014   6.25%
POP_TOP                  18,111   5.96%
LOAD_METHOD              16,006   5.26%
STORE_NAME               15,350   5.05%
CALL_METHOD              14,113   4.64%
LOAD_GLOBAL              12,389   4.07%
STORE_FAST               11,721   3.85%
LOAD_NONE                10,934   3.60%
CALL_FUNCTION            10,747   3.53%
RETURN_VALUE             10,230   3.36%
MAKE_FUNCTION             8,229   2.71%
LOAD_NAME                 5,651   1.86%
STORE_ATTR                3,626   1.19%
BUILD_MAP                 3,030   1.00%
POP_JUMP_IF_FALSE         3,028   1.00%
BINARY_SUBSCR             2,686   0.88%
IMPORT_NAME               2,639   0.87%
IMPORT_FROM               2,385   0.78%

@iritkatriel

Here are the top 30 opcode pairs with LOAD_NONE:

Note: LOAD_NONE --> RETURN_VALUE 7,439 2.52%

Top 30 opcode pairs:
LOAD_FAST            --> LOAD_ATTR                12,187   4.12%
LOAD_CONST           --> LOAD_CONST               11,575   3.92%
LOAD_FAST            --> LOAD_METHOD              10,756   3.64%
STORE_NAME           --> LOAD_CONST                8,541   2.89%
LOAD_CONST           --> MAKE_FUNCTION             8,229   2.79%
CALL_METHOD          --> POP_TOP                   8,171   2.77%
LOAD_NONE            --> RETURN_VALUE              7,439   2.52%
STORE_FAST           --> LOAD_FAST                 7,052   2.39%
LOAD_METHOD          --> LOAD_FAST                 6,355   2.15%
MAKE_FUNCTION        --> STORE_NAME                6,323   2.14%
LOAD_FAST            --> LOAD_FAST                 5,277   1.79%
POP_TOP              --> LOAD_FAST                 5,131   1.74%
POP_TOP              --> LOAD_NONE                 4,696   1.59%
LOAD_METHOD          --> LOAD_CONST                4,577   1.55%
LOAD_FAST            --> LOAD_CONST                4,147   1.40%
LOAD_GLOBAL          --> LOAD_FAST                 4,016   1.36%
LOAD_CONST           --> CALL_METHOD               3,984   1.35%
LOAD_ATTR            --> LOAD_METHOD               3,832   1.30%
LOAD_CONST           --> LOAD_FAST                 3,693   1.25%
LOAD_GLOBAL          --> LOAD_ATTR                 3,681   1.25%
LOAD_METHOD          --> CALL_METHOD               3,575   1.21%
LOAD_ATTR            --> LOAD_CONST                3,509   1.19%
LOAD_ATTR            --> LOAD_FAST                 3,203   1.08%
LOAD_FAST            --> STORE_ATTR                2,976   1.01%
LOAD_FAST            --> CALL_METHOD               2,849   0.96%
POP_TOP              --> POP_TOP                   2,755   0.93%
CALL_METHOD          --> STORE_FAST                2,517   0.85%
IMPORT_FROM          --> STORE_NAME                2,360   0.80%
LOAD_FAST            --> CALL_FUNCTION             2,295   0.78%
LOAD_CONST           --> STORE_NAME                2,167   0.73%

@gvanrossum

While I do think we should do RETURN_NONE, I wouldn't expect the world of it -- dynamically, at most one RETURN_NONE instruction can execute per call (by definition :-).

@iritkatriel

With this PR we got:

Total co_consts size goes from 48,822 to 46,682, about 4.3% less.

If I add this change: iritkatriel/cpython#30
(it adds a flag to the function to indicate whether its docstring is not None, and excludes None from the co_consts) then the total size of co_consts reduces to 40,878, which is consistent with the previous result that None is 20% of the consts.

@iritkatriel commented Sep 27, 2021

With only the docstring change (but no LOAD_NONE), it's 48,302. So most of the Nones are from docstrings.

@gvanrossum

Either there's a typo, or that last conclusion is wrong (48,302 is hardly less than 48,822)?

@iritkatriel

Sorry, I misread the numbers, you're right. So it's the other way around -- adding the docstring change to LOAD_NONE helps a lot, but the docstring change alone not so much.

@gvanrossum

So there are 520 code objects that have a None at position 0 that's only used to indicate there's no docstring (these code objects don't load None on the stack at all).

And there are 2,140 code objects that have a None at another position that is used only to load None (we can deduce that these code objects have a docstring at position 0).

But 5,284 code objects have a None in position 0 that is used both to load None and to indicate there's no docstring (the rest of the 7,944 Nones removed by doing both).

So we should do both things if we care about the size of co_consts.

Counter to this, we don't measure a real speedup; we save only 64 kB if all those packages you counted are imported in one process, and the marshal encoding of None is a single byte (TYPE_NONE), so it only saves 8 kB there.

I say we give up. :-( It's too bad, I had high hopes for each of these until we did the work.

@iritkatriel

Yes, it doesn't seem to be worth another opcode.

Is there a chance that a MAKE_INT opcode will do anything?

@gvanrossum

I expect it's the same story. The common small ints are already cached, so LOAD_CONST is more efficient than anything else. We'd save much less because there are just fewer of these than None loads (ints are 6.27% of the top 25 constants), although the marshal size is larger (5 bytes -- there's no "short int" type in marshal); it probably wouldn't matter.

@iritkatriel

One more thing I haven't tried yet is LOAD_CONST(None)+STORE_FAST:
#65 (comment)

@gvanrossum

But if I read that correctly, it's only ~0.2% of opcode pairs that gets improved.

Maybe we should create a framework for measuring dynamic opcode pairs, using e.g. a benchmark suite like PyPerformance as a basis.
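For what it's worth, ceval.c has long had an optional facility close to this: building with DYNAMIC_EXECUTION_PROFILE and DXPAIRS counts every executed opcode pair, exposed through sys.getdxp(). A simplified sketch of the counting logic (details vary by version):

#ifdef DXPAIRS
static long dxpairs[257][256];   /* [previous opcode][current opcode] */
#endif

/* ...inside the dispatch loop, as each instruction is decoded: */
dxpairs[lastopcode][opcode]++;
lastopcode = opcode;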

@gvanrossum

There are still some ideas here that we might do, so I'm moving it to Discussions.

@faster-cpython locked and limited conversation to collaborators on Dec 2, 2021
@gramster moved this to Todo in Fancy CPython Board on Jan 10, 2022
@gramster moved this from Todo to Other in Fancy CPython Board on Jan 10, 2022

This issue was moved to a discussion. You can continue the conversation there.
