Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates for Neuron testing #561

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Updates for Neuron testing #561

wants to merge 1 commit into from

Conversation

vigh-m
Copy link

@vigh-m vigh-m commented Jan 13, 2025

Description of changes:
With the merge of bottlerocket-os/bottlerocket#4345, Bottlerocket is adding a new setting device-ownership-from-security-context. This setting will be available on all kubernetes variants. This is to support the same setting in the containerd CRI plugin as described here

With this change in Bottlerocket, containers are able to get ownership of requested devices without needing root or privileged capabilities.

This PR is to add this setting to the neuron test cases and update the test Jobs specfiles to specify the user ID and Group ID for ownership of the neuron devices.


This change also bumps the Neuron SDK versions to use the latest available from upstream.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@mattcjo
Copy link
Contributor

mattcjo commented Jan 13, 2025

@vigh-m Could you please pull in the latest changs? The Neuron versions should have been bumped in a recent PR that was merged - #559

@vigh-m
Copy link
Author

vigh-m commented Jan 13, 2025

@mattcjo I missed that change. Rebased and pushed. Thanks!

Comment on lines +87 to +90
securityContext:
runAsUser: 1000
runAsGroup: 2000
fsGroup: 3000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an example test run utilizing these new parameters/values?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran the test with command: kubetest2 eksapi --up --static-cluster-name=$CLUSTER_NAME --instance-types=trn1.32xlarge --kubernetes-version=1.30 --emit-metrics --nodes=2 --region=us-west-2 --test=multi -- --fail-fast=false -- exec neuron.test --efaEnabled=false --installDevicePlugin=false --test.v --feature=multi-node --neuronTestImage=$IMAGE

Truncated result:

               size(B)    count(elems)    type    time:avg(us)    algbw(GB/s)    busbw(GB/s)
                     8               2    fp32          743.61           0.00           0.00
                    16               4    fp32          965.83           0.00           0.00
                    32               8    fp32          738.93           0.00           0.00
                    64              16    fp32          270.23           0.00           0.00
                   128              32    fp32           47.65           0.00           0.01
                   256              64    fp32           49.33           0.01           0.01
                   512             128    fp32           55.19           0.01           0.02
                  1024             256    fp32          957.89           0.00           0.00
                  2048             512    fp32          519.55           0.00           0.01
                  4096            1024    fp32           518.2           0.01           0.02
                  8192            2048    fp32          729.97           0.01           0.02
                 16384            4096    fp32          745.17           0.02           0.04
                 32768            8192    fp32          963.45           0.03           0.07
                 65536           16384    fp32          983.57           0.07           0.13
                131072           32768    fp32         1200.48           0.11           0.21
                262144           65536    fp32          748.51           0.35           0.69
                524288          131072    fp32           83.31           6.29          12.39
               1048576          262144    fp32          117.81           8.90          17.52
               2097152          524288    fp32           190.9          10.99          21.63
               4194304         1048576    fp32          340.06          12.33          24.28
               8388608         2097152    fp32          573.91          14.62          28.78
              16777216         4194304    fp32          785.65          21.35          42.04
              33554432         8388608    fp32         1062.72          31.57          62.16
              67108864        16777216    fp32         2240.49          29.95          58.97
             134217728        33554432    fp32         4142.66          32.40          63.79
             268435456        67108864    fp32         7670.62          35.00          68.90
             536870912       134217728    fp32        14565.23          36.86          72.57
            1073741824       268435456    fp32        28590.03          37.56          73.94
            2147483648       536870912    fp32        56862.62          37.77          74.35
        Avg bus bandwidth:	21.4670GB/s

--- PASS: TestNeuronNodes (800.27s)
    --- SKIP: TestNeuronNodes/single-node (0.00s)
    --- PASS: TestNeuronNodes/multi-node (800.27s)
        --- PASS: TestNeuronNodes/multi-node/NCCOM_test_succeeds (800.25s)
PASS
I0207 19:09:26.180036     211 cloudwatch.go:79] emitted 1 metrics to namespace: kubetest2/eksapi

Comment on lines +30 to +33
securityContext:
runAsUser: 1000
runAsGroup: 2000
fsGroup: 3000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing... Do you have an example test run utilizing these new parameters/values?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran the test with kubetest2 eksapi --up --static-cluster-name=$CLUSTER_NAME --instance-types=inf2.xlarge --kubernetes-version=1.30 --emit-metrics --nodes=2 --region=us-west-2 --test=multi -- --fail-fast=false -- exec neuron.test --efaEnabled=false --test.v --skip-features=multi-node --neuronTestImage=$IMAGE

Truncated result:

=== RUN   TestNeuronNodes/multi-node
    env.go:438: Skipping feature: "multi-node": name matched
--- PASS: TestNeuronNodes (395.07s)
    --- PASS: TestNeuronNodes/single-node (395.07s)
        --- PASS: TestNeuronNodes/single-node/Single_node_test_Job_succeeds (395.01s)
    --- SKIP: TestNeuronNodes/multi-node (0.00s)
PASS

@@ -20,15 +21,17 @@ func generateUserData(format string, cluster *Cluster) (string, bool, error) {
case "bottlerocket":
t = templates.UserDataBottlerocket
userDataIsMimePart = false
deviceOwnershipFromSecurityContext = "true"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is off by default on Bottlerocket, but we want it default on in all tests? If so, let's just hard-code it in userdata_bottlerocket.toml

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. Updated

@vigh-m vigh-m force-pushed the add-feature branch 2 times, most recently from 7422afb to 7ba4b5c Compare February 7, 2025 22:25
@vigh-m
Copy link
Author

vigh-m commented Feb 7, 2025

⬆️ Addressed comments and rebased to latest upstream

Copy link
Member

@cartermckinnon cartermckinnon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants