Propagate physical CPU count and RAM size up to Nexus #2299

smklein · 2023-02-01T20:00:46Z

Propagate "CPU count" and "Physical RAM" information from Sled Agents up to Nexus, and through the external API.

This information will be used in the future to make provisioning decisions (e.g., do I have enough CPU / RAM to run the services I expect to run on a particular Sled).

It's also currently being plumbed through the operator API for visibility, though that representation of hardware may change to align with a more "holistic view of inventory".

rmustacc

I know that you're still working on this as a draft @smklein, but I wanted to add a bunch of notes about the current data we're grabbing from the system, what it represents, and the corresponding nomenclature. There are a few unrelated comments in the external JSON just because this is being updated in this change. If it's more appropriate to handle that a different way, that's fine. I just noticed it because it was updated in this change.

Happy to talk more if it'd help on the specifics live if that'd help. Thanks for working on this.

rmustacc · 2023-02-06T17:00:27Z

nexus/db-model/src/schema.rs

+        cpus -> Int8,
+        physical_ram -> Int8,


See the notes elsewhere, but I strongly suggest changing these names to reflect what they actually are. cpus is too generic and physical_ram doesn't reflect that it's actually not all physical dram in the system, but rather what is raw usable by the OS due to complications.

Okay, I've updated these to online_logical_cpus and usable_physical_ram.

rmustacc · 2023-02-06T17:24:54Z

openapi/nexus-internal.json

+          "physical_ram": {
+            "description": "Amount of RAM used by the sled.",
+            "allOf": [
+              {
+                "$ref": "#/components/schemas/ByteCount"
+              }
+            ]
+          },


A few notes on this and things that I think we want to think about. Commenting here versus the public API bit below because this is the more detailed one.

Right now this isn't actually used RAM. The sysconf commands are telling you the amount of subset of VA space covered by DRAM that the OS thinks it can actually use.

It's not clear what we want to use by the term physical here. I think we want to be more precise.

Fundamentally when we're thinking about provisioning and the breakdown of memory we need to consider the following different aspects:

There is a total amount of DRAM installed the system which is determined by how many sticks of DRAM are installed in the system and summing the size of that.

A subset of that is actually trained and usable by the DRAM controller. This may not happen due to post-package repair, through some other action in the APCB , or because DRAM training has failed.

DRAM is laid out in three contiguous VA ranges. Subsets of that DRAM overlap with other items in the processor leading to it not being usable by the OS for various purposes. This is represented by the number of pages that we have created in the system and this is the number that sled agent is currently grabbing.

Some amount of DRAM is being put into a reservoir for use by bhyve (at some point). This probably is what represents provisionable DRAM for use by end users.

How we end up breaking apart the remaining pieces between what's in the reservoir versus not.

So all in all, I guess this is a long way of saying physical_ram doesn't really feel like the right term here and while useful for starting to worry about overprovisioning (assuming nexus will manually assume some reservation for the kernel and set appropriate controls for other zones), but it also isn't the most useful for a customer immediately without the surrounding context which is going to be hard to educate. They'd expect to see 1 TiB and then perhaps understand what percent of that is being used directly for guests versus others.

I realize that this is very pedantic, but if I assume we're only going to get this one field in for the MVP (I mean optimistically more, but assuming the worst case), I think it's important that we communicate what it actually reflects.

One positive thing, I think it's good that we're using ram and not dram here as eventually I suspect we'll have systems with mixes of non-DRAM technology for this, e.g. HBM2.

The wording "used" was a mistake on my part; I'm aware this is not supposed to be the in-usage value, but rather, the amount the OS is capable of using in totality. I've updated the naming and documentation.

rmustacc · 2023-02-06T17:42:56Z

openapi/nexus.json

        "type": "object",
        "properties": {
          "baseboard": {
            "$ref": "#/components/schemas/Baseboard"
          },
+          "cpus": {


Note the comments on the internal API are all valid here, just that they were written before this was updated. Though the ram here gets it correct that it's what's available.

openapi/nexus.json

rmustacc · 2023-02-06T17:56:35Z

sled-agent/src/hardware/illumos/sysconf.rs

+
+/// Returns the amount of RAM on this sled, in bytes.
+pub fn physical_ram_bytes() -> Result<u64, Error> {
+    let phys_pages = sysconf("physical pages", libc::_SC_PHYS_PAGES)?;


So I think this number is useful for the control plane, but it'll be less than the total DRAM that's installed and eventually we'll want to distinguish these two items. A customer would be confused to see 1011 GiB (using the sn21 example of prtconf -m showing 1036271 MiB). Just a note that we'll want to be doing something different later here.

Agreed that we will want to distinguish between the two, but for provisioning and accounting, we care about "what can the OS and all services on the OS use". I've updated the name here to usable_physical_ram and documented the intent in the APIs.

openapi/nexus.json

rmustacc · 2023-02-06T18:07:36Z

openapi/nexus.json

          "rack_id": {
+            "description": "The rack to which this Sled is currently attached.",


I think we may want to think carefully about the term rack_id as we have no good way of actually relating this to a physical rack. It's really a logical control plane instance id for a rack.

I mention this because this suggests that it can change due to moving the sled, but if we move all the sleds from rack A to rack B, it won't change. So I think we may want to be more precise. This may be something that should come up as a different follow up along with the bit below.

Is there a better identifier for us to use for a rack?

The control plane attaches UUID labels to nearly every object it's aware of for uniformity, even if there is some other key value that can uniquely identify it (for example, baseboard/version/revision uniquely identifies a sled, but it still has a UUID).

I personally find it confusing to call this a rack ID when it's not tied to the identity of the rack, which is not self-discoverable. However, this also came up in oxidecomputer/rfd#563. It may be that it's just me that find this term confusing as opposed to some of the others that @andrewjstone mentioned in that context there.

Fundamentally we can't accurately indicate a physical rack that a sled is in because we don't have a way to determine that. We can make approximations and similar, but that can also be out of date based on certain service operations. Either way, I think it's probably best to just not worry about this in this PR. I should have probably just filed a separate issue, but I suspect if the rack ID is clear to everyone else, then this is mostly a documentation problem about making it clear what it can and can't represent. Though I do still think we want to make sure during initial set up we scan information about the rack nameplate.

Meghan actually if there was something we needed to do during manufacturing or setup to include information from the rack's nameplate.

@smklein this came up during the RFD-267 huddle too. Essentially it seems like we're going to be mostly unable to tell what sled occupies which rack (accurately, anyway). I believe this is going to be fairly problematic in a multi-rack deployment in that it both breaks some of the assumptions in the relational hierarchy and it'll also be difficult to communicate to an operator the physical location of a piece of hardware in their overall installation.

rmustacc · 2023-02-06T18:14:08Z

sled-agent/src/hardware/illumos/sysconf.rs

+
+/// Returns the number of physical processors on this sled.
+pub fn online_processor_count() -> Result<u32, Error> {
+    Ok(u32::try_from(sysconf(


Separate to this, sysconf returns a long. In an ILP64 this is an i64. I realize we're unlikely to have more than a u32 of processors, but everywhere else you treat cpus as an i64.

I updated to the following:

I'm converting the value to a u32 immediately, throwing an error if it is not convertible, and using that to represent processor count.

There is a small wrinkle where I need to convert it to an i64 to store it in the DB (Postgres does not like unsigned types) but it's u32 in-memory and in the API everywhere esle.

Hmm. I guess if the db is requiring it to be a singed type, almost makes me think we should force i64 the rest of the way just so we don't have to ask the question what happens if the db ends up with an unrepresentable value. But I defer to you on this as I don't know how diesel and the general error handling here is expected to flow. I realize there is a constraint being placed on the column, so it probably isn't worth worrying about that.

I thought we didn't want to enable negative values for CPUs? By using a u32 here, I validate the value from the sled. If I propagated an i64 all the way up to Nexus, it would be possible to use an "invalid" value for CPU count for much longer. Shouldn't we be throwing an error as early as we reasonably can?

I'm honestly not sure what the right answer is to the fact unsigned types are for chumps in the db. Seems like probably best to just hope the constraint we have works and is never violated and so it can always serialize / de-serialize as a u32 and then what you're saying makes sense. Just whenever I see a constraint on that, my mind goes to what happens if that gets violated because of issue XXX and the wrong thing is on disk. As I said above, this probably isn't worth worrying about.

sled-agent/src/hardware/illumos/sysconf.rs

openapi/nexus.json

smklein

Thanks for the feedback @rmustacc ; I've tried to update both the names, types, and documentation to be more accurate.

sled-agent/src/hardware/illumos/sysconf.rs

smklein · 2023-02-07T20:14:09Z

sled-agent/src/hardware/illumos/sysconf.rs

+
+/// Returns the number of physical processors on this sled.
+pub fn online_processor_count() -> Result<u32, Error> {
+    Ok(u32::try_from(sysconf(


I updated to the following:

I'm converting the value to a u32 immediately, throwing an error if it is not convertible, and using that to represent processor count.

There is a small wrinkle where I need to convert it to an i64 to store it in the DB (Postgres does not like unsigned types) but it's u32 in-memory and in the API everywhere esle.

smklein · 2023-02-07T20:15:10Z

sled-agent/src/hardware/illumos/sysconf.rs

+
+/// Returns the amount of RAM on this sled, in bytes.
+pub fn physical_ram_bytes() -> Result<u64, Error> {
+    let phys_pages = sysconf("physical pages", libc::_SC_PHYS_PAGES)?;


Agreed that we will want to distinguish between the two, but for provisioning and accounting, we care about "what can the OS and all services on the OS use". I've updated the name here to usable_physical_ram and documented the intent in the APIs.

openapi/nexus.json

sled-agent/src/hardware/illumos/sysconf.rs

openapi/nexus.json

smklein · 2023-02-07T20:23:30Z

sled-agent/src/hardware/illumos/sysconf.rs

+    Ok(r.try_into()?)
+}
+
+/// Returns the number of physical processors on this sled.


Updated to "online processors", and the names of the variables to "online_logical_cpus" to be pedantic.

smklein · 2023-02-07T20:24:32Z

openapi/nexus-internal.json

+          "cpus": {
+            "description": "Number of CPUs online on the sled.",
+            "type": "integer",
+            "format": "int64"
+          },


Updated the name to online_logical_cpus to encapsulate this a little better.

In the limit, we should be using online_physical_cpus as a better factor for provisioning decisions.

smklein · 2023-02-07T20:25:22Z

openapi/nexus.json

          "rack_id": {
+            "description": "The rack to which this Sled is currently attached.",


Is there a better identifier for us to use for a rack?

The control plane attaches UUID labels to nearly every object it's aware of for uniformity, even if there is some other key value that can uniquely identify it (for example, baseboard/version/revision uniquely identifies a sled, but it still has a UUID).

openapi/nexus.json

rmustacc · 2023-02-07T23:21:16Z

openapi/nexus.json

          "rack_id": {
+            "description": "The rack to which this Sled is currently attached.",


I personally find it confusing to call this a rack ID when it's not tied to the identity of the rack, which is not self-discoverable. However, this also came up in oxidecomputer/rfd#563. It may be that it's just me that find this term confusing as opposed to some of the others that @andrewjstone mentioned in that context there.

Fundamentally we can't accurately indicate a physical rack that a sled is in because we don't have a way to determine that. We can make approximations and similar, but that can also be out of date based on certain service operations. Either way, I think it's probably best to just not worry about this in this PR. I should have probably just filed a separate issue, but I suspect if the rack ID is clear to everyone else, then this is mostly a documentation problem about making it clear what it can and can't represent. Though I do still think we want to make sure during initial set up we scan information about the rack nameplate.

rmustacc · 2023-02-07T23:22:49Z

sled-agent/src/hardware/illumos/sysconf.rs

+
+/// Returns the number of physical processors on this sled.
+pub fn online_processor_count() -> Result<u32, Error> {
+    Ok(u32::try_from(sysconf(


Hmm. I guess if the db is requiring it to be a singed type, almost makes me think we should force i64 the rest of the way just so we don't have to ask the question what happens if the db ends up with an unrepresentable value. But I defer to you on this as I don't know how diesel and the general error handling here is expected to flow. I realize there is a constraint being placed on the column, so it probably isn't worth worrying about that.

rmustacc · 2023-02-08T00:15:14Z

openapi/nexus.json

@@ -13717,13 +13717,17 @@
            "type": "string",
            "format": "uuid"
          },
+          "online_logical_cpus": {


I've been struggling with the nomenclature here. This is probably fine, but I'd like to write up the different thoughts here as part of trying to think about future things. Part of my concern is what does the term cpu or processor mean and trying to come up with something that is understandable to lots of people, but I suspect this is a fool's errand in the end.

Fundamentally in the world that we are operating in, we have CPUs (referring in this instance to thing that we buy from someone) that are installed into sockets. They have some number of cores and some number of logical units beneath that that represent schedulable work (trying to avoid terms for a moment). Those logical units are called different things. For example Intel calls it "hyper-threading" and sometimes threads. In cpuid leaf 0xb and 0x1f they describe it as SMT. The AMD PPR is more willing to use the term threads in family 0x17 and 0x19. The amd64 level calls them threads.

But amd64 also had a thing they did in family 0x15 that lead the next level to be "logical cores" because of how they used CMT (clustered multi-threading). Which means the question of what is or isn't a core is weird, though this also isn't a CPU or thing that we expect to see again.

What we currently call a logical CPU corresponds to threads, which is what most folks seem to agree on. I've tried to look this up in ARM, but don't see great consensus. RISC-V does call this same thing a HART, or hardware thread, an says a core is made up of one or more HARTs.

From a provisioning perspective in the short and longer term we're interested in how many usable logical CPUs and cores there are. From an operator inventory perspective and overall capacity planning (versus in the moment), where we assume all hardware is usable, the things I think about are: asking how many sockets does a server have (more informational), how many total cores are there and how many threads/logical CPUs here are.

Finally, the last thing with nomenclature here is what do we think of when we're provisioning guests, which is the fundamental customer-facing abstraction. Right now the API file describes folks as passing ncpus and defines it as a CPU which right now maps to logical CPUs / threads in the system.

So playing this out, it seems like we'd expect a server object if we continue this trend to have:

online_logical_cpus (still not sure if we should maybe call this cores)

online_physical_cores

total_logical_cpus

total_physical_cpus

total_physical_sockets

We may opt to leave out sockets here because it's not immediately useful. It's also worth noting that in a core-aware scheduling world, online_logical_cpus is not the same as what's usable. It's also not clear to me what we want to present with respect to what the actual provisionable CPU is from an operator's perspective, that is what is usable for guests, which is something that they care about, because it's probably not going to be all 128 even on a good day. This almost suggests hat we want a third category of usable_xxx, though this is starting to get complicated and more nuanced. Anyways, that's probably more on this than is useful.

I'm worried that if we get too nuanced here it'll make the API harder to document and to reason about. I don't want to derail progress and perhaps it's something we can just put a pin in, but do other platforms define these terms in some way we could borrow? What would an operator typically see if they were using VMWare or Open stack or something?

I have an uninformed perspective of this so my opinion counts for little here, but I wouldn't understand what online_logical_cpus means without reading documentation. Does online imply that there are offline versions too?

@zephraph : Yes, the online does imply that it's possible for some of these to be offline.

https://illumos.org/man/8/psrinfo is one such way of inspecting this info, though that man page also has an opinion about naming of CPUs / processors / cores.

psrinfo displays information about processors. Each physical socket may contain multiple cores, which in turn can contain multiple virtual processors (also referred to as CPUs). Each virtual processor is an entity with its own ID, capable of executing independent threads.

For example:

gimlet-sn21 # psrinfo -v | head -n 10 Status of virtual processor 0 as of: 02/10/2023 16:21:15 on-line since 10/01/2000 06:11:08. The i386 processor operates at 2000 MHz, and has an i387 compatible floating point processor. Status of virtual processor 1 as of: 02/10/2023 16:21:15 on-line since 10/01/2000 06:11:08. The i386 processor operates at 2000 MHz, and has an i387 compatible floating point processor. ...

(It shows 128 of these "virtual processors")

Using a command line psradm, these CPUs can be turned "offline": https://illumos.org/man/8/psradm

But to summarize some of this info on a real Gimlet, here's what I'm seeing:

"report the total number of CPU cores"

gimlet-sn21 # psrinfo -ct 64

"report the total number of 'things capable of executing hardware threads'"

gimlet-sn21 # psrinfo -t 128

"report the number of virtual processors seen through sysconf"

gimlet-sn21 # getconf -a | grep NPROCESSORS_ONLN NPROCESSORS_ONLN: 128

Thanks for the detailed info!

zephraph · 2023-02-08T16:16:18Z

common/src/sql/dbinit.sql

@@ -86,6 +86,10 @@ CREATE TABLE omicron.public.sled (
    part_number STRING(63) NOT NULL,
    revision INT8 NOT NULL,

+    /* CPU & RAM summary for the sled */
+    online_logical_cpus INT8 CHECK (online_logical_cpus BETWEEN 0 AND 4294967295) NOT NULL,


Just to clarify... online in this context means present and available for use?

Yes, though this gets nuanced in the future. See #2299 (comment) and the fact that we could have online logical cpus that aren't usable for VM guests in a core-aware scheduling world because you wouldn't have a working twin (but it could be usable for others).

nexus/db-model/src/unsigned.rs

smklein · 2023-02-10T17:56:56Z

I went ahead and updated the Nexus-level tracking of compute resources as "usable_hardware_threads", since that seems to be the unit we're going to care about in the near-term.

The terms "CPUs" and "Processors" are a little too vague for my liking - perhaps it'll be more clear when we're describing inventory, but from an accounting / overprovisioning point-of-view, I think we care about "how many things can we run at the same time".

We will likely add "usable_cores" in the future, in a core-aware world, but I think it will be reasonable to make that jump when we get there.

smklein · 2023-02-20T16:58:49Z

Whenever y'all have time, this is ready for another look - I think I've addressed all feedback that I can.

smklein · 2023-03-15T00:30:24Z

One more ping on this

rmustacc · 2023-03-16T23:16:22Z

Sorry for the delay in getting back to you @smklein. I think the approach we have for threads makes sense for now. I'm still not really convinced that the DRAM value we're providing is going to be actionable and is going to need a lot more docs to make sense of because it doesn't help an operator answer the question of how much capacity is actually available for VMs on this sled. All that said, I don't see this as a reason to hold this up at all. It's a value we need in nexus, regardless of whether or not it's something we want to expose to the operator.

Propagate physical CPU count and RAM size up to Nexus

741604a

smklein added Sled Agent Related to the Per-Sled Configuration and Management nexus Related to nexus labels Feb 1, 2023

smklein added 3 commits February 6, 2023 10:07

Cleanup test utilities

c85cbdd

Openapi cleanup, refactoring

7710f79

Remove unused file

85f69fb

smklein marked this pull request as ready for review February 6, 2023 17:47

smklein requested review from jclulow and zephraph February 6, 2023 17:47

rmustacc reviewed Feb 6, 2023

View reviewed changes

smklein added 2 commits February 7, 2023 15:09

Review feedback

15d87dc

Remove service address

376900d

smklein commented Feb 7, 2023

View reviewed changes

smklein added 2 commits February 7, 2023 15:32

Quench clippy's insatiable thirst

224b54e

Fix DB name for logical CPUs

bfdbbaa

rmustacc reviewed Feb 8, 2023

View reviewed changes

Merge branch 'main' into cpus-and-ram

56ae5ab

zephraph reviewed Feb 8, 2023

View reviewed changes

nexus/db-model/src/unsigned.rs Outdated Show resolved Hide resolved

smklein added 3 commits February 10, 2023 10:47

Merge branch 'main' into cpus-and-ram

fa2cbae

remove u16.rs

2476e98

usable_hardware_threads

a56ea41

Merge branch 'main' into cpus-and-ram

c8018b0

smklein mentioned this pull request Feb 24, 2023

Tracking Issue for Managing Low Disk Space Conditions #1700

Open

13 tasks

smklein added 2 commits March 20, 2023 12:54

Merge branch 'main' into cpus-and-ram

2b2b14c

fix mismerge

67343e3

Move libc dependency

df5e4f3

smklein merged commit 52d815e into main Mar 20, 2023

smklein deleted the cpus-and-ram branch March 20, 2023 20:41

		"rack_id": {
		"description": "The rack to which this Sled is currently attached.",

Propagate physical CPU count and RAM size up to Nexus #2299

Propagate physical CPU count and RAM size up to Nexus #2299

Conversation

smklein commented Feb 1, 2023 • edited Loading

rmustacc left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smklein left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smklein Feb 10, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smklein commented Feb 10, 2023

smklein commented Feb 20, 2023

smklein commented Mar 15, 2023

rmustacc commented Mar 16, 2023

smklein commented Feb 1, 2023 •

edited

Loading

smklein Feb 10, 2023 •

edited

Loading