Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

node e2e: fix CPUManager related jobs. #30545

Merged
merged 1 commit into from
Sep 6, 2023

Conversation

ruquanzhao
Copy link
Member

@ruquanzhao ruquanzhao commented Aug 29, 2023

This PR fixes failing/not working jobs of CPU manager:

  • ci-kubernetes-node-kubelet-serial-cpu-manager
  • pull-kubernetes-node-kubelet-serial-cpu-manager
  • pull-kubernetes-node-kubelet-serial-cpu-manager-kubetest2
  • kubernetes-node-kubelet-containerd-resource-managers

Changes:

  • Use n2d-standard-32 to test complicated NUMA-related options of CPUManager.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. area/config Issues or PRs related to code in /config area/jobs sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Aug 29, 2023
@ruquanzhao
Copy link
Member Author

I have verified on my environment using:

go run /root/.gvm/pkgsets/go1.20/global/src/k8s.io/kubernetes/test/e2e_node/runner/remote/run_remote.go \
  --cleanup -vmodule=*=4 --ssh-env=gce --results-dir=/tmp/e2e-node-results --project=MYPROJECT \
  --use-dockerized-build=true --target-build-arch=linux/amd64 \
   --zone=asia-southeast1-b  --ssh-user=root --ssh-key=/root/.ssh/id_rsa \
   --ginkgo-flags='--nodes=1 --focus="\[Feature:CPUManager\]"' \
   --test_args='--container-runtime-endpoint=unix:///run/containerd/containerd.sock --container-runtime-process-name=/usr/bin/containerd --container-runtime-pid-file= --kubelet-flags="--cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/containerd.service" --extra-log="{\"name\": \"containerd.log\", \"journalctl\": [\"-u\", \"containerd*\"]}"' \
   --test-timeout=5h0m0s --image-config-file=/home/prow/go/src/k8s.io/test-infra/jobs/e2e_node/containerd/image-config-serial-cpu-manager.yaml

Here is the tailed output:

[SynchronizedAfterSuite]
test/e2e_node/e2e_node_suite_test.go:265
  I0829 07:33:28.998974    2651 e2e_node_suite_test.go:268] Stopping node services...
  I0829 07:33:28.998996    2651 server.go:260] Kill server "services"
  I0829 07:33:28.999004    2651 server.go:297] Killing process 3083 (services) with -TERM
  I0829 07:33:29.039252    2651 server.go:260] Kill server "kubelet"
  I0829 07:33:29.042957    2651 server.go:321] Stopping systemd unit for server "kubelet" with unit name: "kubelet-20230829T152345.service"
  I0829 07:33:29.110906    2651 services.go:153] Fetching log files...
  I0829 07:33:29.110999    2651 services.go:162] Get log file "containerd-installation.log" with journalctl command [-u containerd-installation].
  I0829 07:33:29.114397    2651 services.go:162] Get log file "kern.log" with journalctl command [-k].
  I0829 07:33:29.120312    2651 services.go:162] Get log file "cloud-init.log" with journalctl command [-u cloud*].
  I0829 07:33:29.125917    2651 services.go:162] Get log file "containerd.log" with journalctl command [-u containerd*].
  I0829 07:33:29.155218    2651 e2e_node_suite_test.go:273] Tests Finished
[SynchronizedAfterSuite] PASSED [0.156 seconds]
------------------------------
[ReportAfterSuite] Kubernetes e2e JUnit report
test/e2e/framework/test_context.go:585
[ReportAfterSuite] PASSED [0.003 seconds]
------------------------------

Ran 5 of 544 Specs in 564.727 seconds
SUCCESS! -- 5 Passed | 0 Failed | 0 Pending | 539 Skipped
PASS

Ginkgo ran 1 suite in 9m24.820041049s
Test Suite Passed

And below is the information of lscpu in n2d-standard-32:

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  32
  On-line CPU(s) list:   0-31
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7B12
    CPU family:          23
    Model:               49
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           2
    Stepping:            0
    BogoMIPS:            4499.99
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 h
                         t syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_ap
                         icid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdra
                         nd hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd i
                         brs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_
                         ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save umip rdpid
Virtualization features: 
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   512 KiB (16 instances)
  L1i:                   512 KiB (16 instances)
  L2:                    8 MiB (16 instances)
  L3:                    64 MiB (4 instances)
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-7,16-23
  NUMA node1 CPU(s):     8-15,24-31
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
  Srbds:                 Not affected
  Tsx async abort:       Not affected

@ruquanzhao
Copy link
Member Author

/cc @ffromani @dims @SergeyKanzhelev

@pacoxu
Copy link
Member

pacoxu commented Aug 29, 2023

Added in #29717
/assign @swatisehgal

@bart0sh
Copy link
Contributor

bart0sh commented Aug 29, 2023

/test pull-kubernetes-node-kubelet-serial-cpu-manager-kubetest2
/test pull-kubernetes-node-kubelet-serial-containerd-kubetest2

@k8s-ci-robot
Copy link
Contributor

@bart0sh: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-test-infra-gubernator
  • /test pull-test-infra-integration
  • /test pull-test-infra-prow-checkconfig
  • /test pull-test-infra-prow-image-build-test
  • /test pull-test-infra-unit-test
  • /test pull-test-infra-verify-cri-o
  • /test pull-test-infra-verify-lint

The following commands are available to trigger optional jobs:

  • /test pull-test-infra-bazel
  • /test pull-test-infra-prow-checkconfig-loose
  • /test pull-test-infra-unit-test-race-detector-nonblocking

Use /test all to run the following jobs that were automatically triggered:

  • pull-test-infra-gubernator
  • pull-test-infra-integration
  • pull-test-infra-prow-checkconfig
  • pull-test-infra-prow-checkconfig-loose
  • pull-test-infra-unit-test
  • pull-test-infra-unit-test-race-detector-nonblocking
  • pull-test-infra-verify-lint

In response to this:

/test pull-kubernetes-node-kubelet-serial-cpu-manager-kubetest2
/test pull-kubernetes-node-kubelet-serial-containerd-kubetest2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bart0sh
Copy link
Contributor

bart0sh commented Aug 29, 2023

/cc

@k8s-ci-robot k8s-ci-robot requested a review from bart0sh August 29, 2023 09:30
@swatisehgal
Copy link
Contributor

/cc

Copy link
Contributor

@swatisehgal swatisehgal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, appreciate you taking the time to look into this.

Currently the config file names seem to indicate that the config is cpu manager specific even though that is not the case. Perhaps we can rename jobs/e2e_node/containerd/image-config-serial-cpu-manager.yaml and jobs/e2e_node/containerd/ubuntu-init-cpu-manager.yaml to something more generic so we can use it for testing other components that also need multi-numa test infrastructure.

@ruquanzhao
Copy link
Member Author

@swatisehgal Thanks for your review !

How about we create somethings like jobs/e2e_node/containerd/image-config-serial-resources-managers.yaml andjobs/e2e_node/containerd/image-config-serial-resources-managers.yaml, and use them to test cpu_manager, memory_manager and topology manager?

@swatisehgal
Copy link
Contributor

@swatisehgal Thanks for your review !

How about we create somethings like jobs/e2e_node/containerd/image-config-serial-resources-managers.yaml andjobs/e2e_node/containerd/image-config-serial-resources-managers.yaml, and use them to test cpu_manager, memory_manager and topology manager?

That sounds good. Thanks!

Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for this fix! seems fine to me but I must admit I have few gaps I need to close about test-infra config in general. A question inside.

@@ -0,0 +1,13 @@
#cloud-config
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this is derived from ubuntu-init-hugepages-1G-allocation.yaml which in turn seems a fixed and a stripped down version of init.yaml. Is that the case?
Two followup questions: do we need cni-plugins 0.7.5 or can we use 1.0.1? if we can use 1.0.1, can we just use init.yaml?

The reason I'm asking is because unless we have clearly different needs (like hugepages has!) I think is best to reuse existing common config and avoid proliferation of lane-specific config if we can help it, to keep the complexity at bay.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@ffromani
Copy link
Contributor

thanks for reverting to the init.yaml! once renamed the files as discussed in #30545 (comment) , LGTM.

Use n2d-standard-32 in resource-managers node e2e tests for NUMA-related options.

Signed-off-by: Ruquan Zhao <[email protected]>
@ruquanzhao
Copy link
Member Author

ruquanzhao commented Sep 1, 2023

thanks for reverting to the init.yaml! once renamed the files as discussed in #30545 (comment) , LGTM.

Serveral testcases skipped in ci-kubernetes-node-kubelet-containerd-resource-managers, since the machine type n1-stardard-4(in image-config-serial-resource-managers.yaml) doen't support multi numa. Changed it to n2d-standard-32, to make less testcases skip.
And use the existing image-config-serial-resource-managers.yaml to run CPU manager tests.

@ffromani
Copy link
Contributor

ffromani commented Sep 1, 2023

thanks for reverting to the init.yaml! once renamed the files as discussed in #30545 (comment) , LGTM.

Serveral testcases skipped in ci-kubernetes-node-kubelet-containerd-resource-managers, since the machine type n1-stardard-4(in image-config-serial-resource-managers.yaml) doen't support multi numa. Changed it to n2d-standard-32, to make less testcases skip. And use the existing image-config-serial-resource-managers.yaml to run CPU manager tests.

thanks, I think this is fine. I'll review shortly.

@ruquanzhao
Copy link
Member Author

/test pull-test-infra-unit-test

Copy link
Contributor

@swatisehgal swatisehgal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 1, 2023
@swatisehgal
Copy link
Contributor

/hold
to ensure final review from @ffromani
/assign @SergeyKanzhelev
for approval

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 1, 2023
@ffromani
Copy link
Contributor

ffromani commented Sep 1, 2023

/lgtm
/unhold

thanks!

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 1, 2023
swatisehgal added a commit to swatisehgal/test-infra that referenced this pull request Sep 1, 2023
Also, removing `jobs/e2e_node/image-config-serial-multi-numa.yaml`
as it is no longer needed.

Follow-up from: test-infra kubernetes/pull/30545
Signed-off-by: Swati Sehgal <[email protected]>
swatisehgal added a commit to swatisehgal/test-infra that referenced this pull request Sep 6, 2023
Also, removing `jobs/e2e_node/image-config-serial-multi-numa.yaml`
as it is no longer needed.

Follow-up from: test-infra kubernetes/pull/30545
Signed-off-by: Swati Sehgal <[email protected]>
Copy link
Member

@SergeyKanzhelev SergeyKanzhelev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ruquanzhao, SergeyKanzhelev, swatisehgal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 6, 2023
@k8s-ci-robot k8s-ci-robot merged commit b9e84c9 into kubernetes:master Sep 6, 2023
@k8s-ci-robot
Copy link
Contributor

@ruquanzhao: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

  • key node-kubelet.yaml using file config/jobs/kubernetes/sig-node/node-kubelet.yaml
  • key sig-node-presubmit.yaml using file config/jobs/kubernetes/sig-node/sig-node-presubmit.yaml

In response to this:

This PR fixes failing/not working jobs of CPU manager:

  • ci-kubernetes-node-kubelet-serial-cpu-manager
  • pull-kubernetes-node-kubelet-serial-cpu-manager
  • pull-kubernetes-node-kubelet-serial-cpu-manager-kubetest2
  • kubernetes-node-kubelet-containerd-resource-managers

Changes:

  • Use n2d-standard-32 to test complicated NUMA-related options of CPUManager.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/config Issues or PRs related to code in /config area/jobs cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
Development

Successfully merging this pull request may close these issues.

7 participants