pylint -j0 always uses 1 core on real hardware in Linux #7338

Open
dsilakov opened this issue Aug 22, 2022 · 7 comments
Labels
Bug 🪲 · multiprocessing · Needs PR (this issue is accepted, sufficiently specified and now needs an implementation)

Comments

@dsilakov

Bug description

-j0 should automatically detect the number of available processors. However, since the changes that introduced the _cpu_count() function to account for resource limits in cloud environments, this detection works incorrectly on real hardware.

A potential problem lies here: https://github.com/PyCQA/pylint/blob/main/pylint/lint/run.py#L59
If we reach this code (i.e. cpu.cfs_quota_us is not set), we always divide /sys/fs/cgroup/cpu/cpu.shares by 1024. Maybe this is correct for AWS, but in Ubuntu 20.04, for example, cpu.shares is set to 1024 by default. So no matter how many CPU cores you have, pylint -j0 will always use only one of them.

The whole idea of using cpu.shares this way looks questionable, since it only declares a relative "share" of CPU time for the process, not a core count.
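
For reference, the logic behind the linked code boils down to roughly the following (a paraphrased sketch of pylint's cgroup v1 handling, not the exact source):

import os

def _query_cpu():
    # If a CFS quota is set (e.g. by Kubernetes or docker --cpus),
    # derive the core count from quota / period.
    if os.path.isfile("/sys/fs/cgroup/cpu/cpu.cfs_quota_us"):
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read().rstrip())
        if quota != -1 and os.path.isfile("/sys/fs/cgroup/cpu/cpu.cfs_period_us"):
            with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
                period = int(f.read().rstrip())
            return int(quota / period)
    # Otherwise fall back to cpu.shares -- the problematic branch: on a
    # bare-metal Ubuntu host shares defaults to 1024, so this always
    # yields 1 no matter how many cores the machine has.
    if os.path.isfile("/sys/fs/cgroup/cpu/cpu.shares"):
        with open("/sys/fs/cgroup/cpu/cpu.shares") as f:
            shares = int(f.read().rstrip())
        return int(shares / 1024)
    return None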

Configuration

Ubuntu 20.04 x86_64, real hardware, 12 cores

Command used

$ time pylint -j0 pylint/*
$ time pylint -j1 pylint/*
$ time pylint -j12 pylint/*

Pylint output

$ time pylint -j0 pylint/*
30 sec

$ time pylint -j1 pylint/*
30 sec

$ time pylint -j12 pylint/*
8 sec

Expected behavior

$ time pylint -j0 pylint/*
8 sec

$ time pylint -j1 pylint/*
30 sec

$ time pylint -j12 pylint/*
8 sec

Pylint version

pylint 2.14.5
astroid 2.11.7
Python 3.8.10

OS / Environment

Ubuntu 20.04 x86_64

Additional dependencies

No response

@dsilakov dsilakov added the Needs triage 📥 Just created, needs acknowledgment, triage, and proper labelling label Aug 22, 2022
@Pierre-Sassoulas Pierre-Sassoulas added Bug 🪲 multiprocessing and removed Needs triage 📥 Just created, needs acknowledgment, triage, and proper labelling labels Aug 22, 2022
@DanielNoord
Collaborator

@dsilakov I must say I'm no expert in containers and CPUs, so I'll likely need some help fixing this.

Based on the discussion in https://bugs.python.org/issue36054 and the implementation of CPU counting in the JVM in https://github.com/openjdk/jdk/blob/d5686b87f31d6c57ec6b3e5e9c85a04209dbac7a/src/hotspot/os/linux/os_linux.cpp#L5304-L5336
https://github.com/openjdk/jdk/blob/2d5137e403e16b694800b2ffe18c3640396b757e/src/hotspot/os/linux/osContainer_linux.cpp#L517-L591

Am I correct in thinking that we incorrectly determine whether cgroups are active? The third step in the JVM process is what we mimic with _cpu_count, but I guess in your case we shouldn't even end up there?

@Pierre-Sassoulas
Member

Related to #6098 and #6903

@dsilakov
Author

To be more precise: in the JVM, the magic you are referring to happens inside the OSContainer class, and the JVM seems to have trickier logic for detecting whether it is running "containerized" (under cgroup control) or not.

Having /sys/fs/cgroup/cpu/cpu.shares and other cpu.shares values for particular processes on the host system is completely legitimate; for example, systemd uses them to limit CPU resources. However, I don't see a generic way to translate these "shares" into real cores, since that depends on how many other processes you have. Maybe division by 1024 indeed works inside AWS, but it still looks like a heuristic.

A good explanation is provided, for example, in https://www.batey.info/cgroup-cpu-shares-for-docker.html:

The cpu_share has no meaning in isolation. For example, setting a cpu_share of 512 gives you no information about how much CPU time the container will get. Setting one container’s cpu_share to 512 and another container’s to 1024 means that the second container will get double the amount of CPU time as the first. That said, it still gives no information about the actual amount of CPU time each container will get, only the relative amount between them.

@dsilakov
Author

dsilakov commented Aug 23, 2022

Heh, after a closer look, the JVM also contains a set of hacks for handling cpu.shares equal to 1024:

https://github.com/openjdk/jdk/blob/master/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp#L278

https://github.com/openjdk/jdk/blob/master/src/java.base/linux/classes/jdk/internal/platform/cgroupv1/CgroupV1Subsystem.java#L247

Honestly, I don't know what the proper fix for pylint should be here. For our scenarios, I've just commented out the elif os.path.isfile('/sys/fs/cgroup/cpu/cpu.shares') branch. In our K8s cluster and Docker containers we do see /sys/fs/cgroup/cpu/cpu.cfs_quota_us, so the code still works fine inside containers. But I can't check AWS, for which the problematic elif was intended.
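
In other words, the workaround amounts to something like this (a sketch of the patched detection, trusting only the CFS quota; not the exact diff we applied):

import os

def _query_cpu():
    quota_path = "/sys/fs/cgroup/cpu/cpu.cfs_quota_us"
    period_path = "/sys/fs/cgroup/cpu/cpu.cfs_period_us"
    # Only the CFS quota, which docker --cpus and K8s limits actually
    # set, is trusted; the cpu.shares heuristic is dropped entirely.
    if os.path.isfile(quota_path) and os.path.isfile(period_path):
        with open(quota_path) as f:
            quota = int(f.read().rstrip())
        with open(period_path) as f:
            period = int(f.read().rstrip())
        if quota != -1 and period > 0:
            return int(quota / period)
    # No quota set: return None so the caller falls back to the host
    # CPU count instead of guessing from cpu.shares.
    return None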

@DanielNoord
Collaborator

Yeah, I'm not an AWS user myself, but I'm wondering whether there is some other way to detect that we are in an AWS environment instead of what we currently use in the elif.

@Pierre-Sassoulas
Member

We had a duplicate opened in #9537, and the OP said pytest does things correctly. Here's how pytest-xdist does it:

https://github.com/pytest-dev/pytest-xdist/blob/470bc83a253d6a05d2f282e981c61e438dbd0302/src/xdist/plugin.py#L14-L52
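
For context, that hook boils down to roughly this (paraphrased from pytest-xdist's plugin.py, not a verbatim copy):

import os

def auto_detect_cpus():
    # Prefer psutil when available; it can count physical cores.
    try:
        import psutil
    except ImportError:
        pass
    else:
        count = psutil.cpu_count(logical=False) or psutil.cpu_count()
        if count:
            return count
    # Fall back to the scheduler affinity mask (respects taskset/cpuset
    # restrictions), then to the plain CPU count.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        return os.cpu_count() or 1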

@Pierre-Sassoulas Pierre-Sassoulas added the Needs PR This issue is accepted, sufficiently specified and now needs an implementation label Apr 4, 2024
@dsilakov
Author

dsilakov commented Apr 5, 2024

We had a duplicate opened in #9537, and the OP said pytest does things correctly. Here's how pytest-xdist does it:

https://github.com/pytest-dev/pytest-xdist/blob/470bc83a253d6a05d2f282e981c61e438dbd0302/src/xdist/plugin.py#L14-L52

That solution is still not perfect: if you run it inside a Docker container, it won't take container limitations into account.

E.g., inside a container launched via docker --cpus 2 ..., all functions used in that patch (psutil.cpu_count(), len(os.sched_getaffinity(0)), and multiprocessing.cpu_count()) will return the actual number of CPU cores on the host, not 2. So this brings us back to the original issue: the cgroup-based code was an attempt to fix -j0 being broken by container limitations.
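
A more robust approach would presumably have to combine both signals, e.g. cap the affinity-based count by the cgroup quota when one is set. A hypothetical sketch (not an agreed-upon fix, and covering cgroup v1 paths only):

import os

def effective_cpu_count():
    # CPUs visible to this process, respecting the affinity mask.
    try:
        count = len(os.sched_getaffinity(0))
    except AttributeError:
        count = os.cpu_count() or 1
    # Cap by the cgroup CFS quota, as set by e.g. docker --cpus 2.
    quota_path = "/sys/fs/cgroup/cpu/cpu.cfs_quota_us"
    period_path = "/sys/fs/cgroup/cpu/cpu.cfs_period_us"
    if os.path.isfile(quota_path) and os.path.isfile(period_path):
        with open(quota_path) as f:
            quota = int(f.read().rstrip())
        with open(period_path) as f:
            period = int(f.read().rstrip())
        if quota != -1 and period > 0:
            count = min(count, max(1, quota // period))
    return count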
