-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updates for Neuron testing #561
base: main
Are you sure you want to change the base?
Conversation
@mattcjo I missed that change. Rebased and pushed. Thanks! |
securityContext: | ||
runAsUser: 1000 | ||
runAsGroup: 2000 | ||
fsGroup: 3000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an example test run utilizing these new parameters/values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ran the test with command: kubetest2 eksapi --up --static-cluster-name=$CLUSTER_NAME --instance-types=trn1.32xlarge --kubernetes-version=1.30 --emit-metrics --nodes=2 --region=us-west-2 --test=multi -- --fail-fast=false -- exec neuron.test --efaEnabled=false --installDevicePlugin=false --test.v --feature=multi-node --neuronTestImage=$IMAGE
Truncated result:
size(B) count(elems) type time:avg(us) algbw(GB/s) busbw(GB/s)
8 2 fp32 743.61 0.00 0.00
16 4 fp32 965.83 0.00 0.00
32 8 fp32 738.93 0.00 0.00
64 16 fp32 270.23 0.00 0.00
128 32 fp32 47.65 0.00 0.01
256 64 fp32 49.33 0.01 0.01
512 128 fp32 55.19 0.01 0.02
1024 256 fp32 957.89 0.00 0.00
2048 512 fp32 519.55 0.00 0.01
4096 1024 fp32 518.2 0.01 0.02
8192 2048 fp32 729.97 0.01 0.02
16384 4096 fp32 745.17 0.02 0.04
32768 8192 fp32 963.45 0.03 0.07
65536 16384 fp32 983.57 0.07 0.13
131072 32768 fp32 1200.48 0.11 0.21
262144 65536 fp32 748.51 0.35 0.69
524288 131072 fp32 83.31 6.29 12.39
1048576 262144 fp32 117.81 8.90 17.52
2097152 524288 fp32 190.9 10.99 21.63
4194304 1048576 fp32 340.06 12.33 24.28
8388608 2097152 fp32 573.91 14.62 28.78
16777216 4194304 fp32 785.65 21.35 42.04
33554432 8388608 fp32 1062.72 31.57 62.16
67108864 16777216 fp32 2240.49 29.95 58.97
134217728 33554432 fp32 4142.66 32.40 63.79
268435456 67108864 fp32 7670.62 35.00 68.90
536870912 134217728 fp32 14565.23 36.86 72.57
1073741824 268435456 fp32 28590.03 37.56 73.94
2147483648 536870912 fp32 56862.62 37.77 74.35
Avg bus bandwidth: 21.4670GB/s
--- PASS: TestNeuronNodes (800.27s)
--- SKIP: TestNeuronNodes/single-node (0.00s)
--- PASS: TestNeuronNodes/multi-node (800.27s)
--- PASS: TestNeuronNodes/multi-node/NCCOM_test_succeeds (800.25s)
PASS
I0207 19:09:26.180036 211 cloudwatch.go:79] emitted 1 metrics to namespace: kubetest2/eksapi
securityContext: | ||
runAsUser: 1000 | ||
runAsGroup: 2000 | ||
fsGroup: 3000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing... Do you have an example test run utilizing these new parameters/values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ran the test with kubetest2 eksapi --up --static-cluster-name=$CLUSTER_NAME --instance-types=inf2.xlarge --kubernetes-version=1.30 --emit-metrics --nodes=2 --region=us-west-2 --test=multi -- --fail-fast=false -- exec neuron.test --efaEnabled=false --test.v --skip-features=multi-node --neuronTestImage=$IMAGE
Truncated result:
=== RUN TestNeuronNodes/multi-node
env.go:438: Skipping feature: "multi-node": name matched
--- PASS: TestNeuronNodes (395.07s)
--- PASS: TestNeuronNodes/single-node (395.07s)
--- PASS: TestNeuronNodes/single-node/Single_node_test_Job_succeeds (395.01s)
--- SKIP: TestNeuronNodes/multi-node (0.00s)
PASS
@@ -20,15 +21,17 @@ func generateUserData(format string, cluster *Cluster) (string, bool, error) { | |||
case "bottlerocket": | |||
t = templates.UserDataBottlerocket | |||
userDataIsMimePart = false | |||
deviceOwnershipFromSecurityContext = "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is off by default on Bottlerocket, but we want it default on in all tests? If so, let's just hard-code it in userdata_bottlerocket.toml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. Updated
7422afb
to
7ba4b5c
Compare
⬆️ Addressed comments and rebased to latest upstream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Description of changes:
With the merge of bottlerocket-os/bottlerocket#4345, Bottlerocket is adding a new setting
device-ownership-from-security-context
. This setting will be available on all kubernetes variants. This is to support the same setting in the containerd CRI plugin as described hereWith this change in Bottlerocket, containers are able to get ownership of requested devices without needing root or privileged capabilities.
This PR is to add this setting to the neuron test cases and update the test Jobs specfiles to specify the user ID and Group ID for ownership of the neuron devices.
This change also bumps the Neuron SDK versions to use the latest available from upstream.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.