-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Propagate physical CPU count and RAM size up to Nexus #2299
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that you're still working on this as a draft @smklein, but I wanted to add a bunch of notes about the current data we're grabbing from the system, what it represents, and the corresponding nomenclature. There are a few unrelated comments in the external JSON just because this is being updated in this change. If it's more appropriate to handle that a different way, that's fine. I just noticed it because it was updated in this change.
Happy to talk more if it'd help on the specifics live if that'd help. Thanks for working on this.
nexus/db-model/src/schema.rs
Outdated
cpus -> Int8, | ||
physical_ram -> Int8, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the notes elsewhere, but I strongly suggest changing these names to reflect what they actually are. cpus is too generic and physical_ram doesn't reflect that it's actually not all physical dram in the system, but rather what is raw usable by the OS due to complications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I've updated these to online_logical_cpus
and usable_physical_ram
.
openapi/nexus-internal.json
Outdated
"physical_ram": { | ||
"description": "Amount of RAM used by the sled.", | ||
"allOf": [ | ||
{ | ||
"$ref": "#/components/schemas/ByteCount" | ||
} | ||
] | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few notes on this and things that I think we want to think about. Commenting here versus the public API bit below because this is the more detailed one.
- Right now this isn't actually
used
RAM. Thesysconf
commands are telling you the amount of subset of VA space covered by DRAM that the OS thinks it can actually use. - It's not clear what we want to use by the term physical here. I think we want to be more precise.
Fundamentally when we're thinking about provisioning and the breakdown of memory we need to consider the following different aspects:
- There is a total amount of DRAM installed the system which is determined by how many sticks of DRAM are installed in the system and summing the size of that.
- A subset of that is actually trained and usable by the DRAM controller. This may not happen due to post-package repair, through some other action in the APCB , or because DRAM training has failed.
- DRAM is laid out in three contiguous VA ranges. Subsets of that DRAM overlap with other items in the processor leading to it not being usable by the OS for various purposes. This is represented by the number of pages that we have created in the system and this is the number that sled agent is currently grabbing.
- Some amount of DRAM is being put into a reservoir for use by bhyve (at some point). This probably is what represents provisionable DRAM for use by end users.
- How we end up breaking apart the remaining pieces between what's in the reservoir versus not.
So all in all, I guess this is a long way of saying physical_ram
doesn't really feel like the right term here and while useful for starting to worry about overprovisioning (assuming nexus will manually assume some reservation for the kernel and set appropriate controls for other zones), but it also isn't the most useful for a customer immediately without the surrounding context which is going to be hard to educate. They'd expect to see 1 TiB and then perhaps understand what percent of that is being used directly for guests versus others.
I realize that this is very pedantic, but if I assume we're only going to get this one field in for the MVP (I mean optimistically more, but assuming the worst case), I think it's important that we communicate what it actually reflects.
One positive thing, I think it's good that we're using ram
and not dram
here as eventually I suspect we'll have systems with mixes of non-DRAM technology for this, e.g. HBM2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wording "used" was a mistake on my part; I'm aware this is not supposed to be the in-usage value, but rather, the amount the OS is capable of using in totality. I've updated the naming and documentation.
openapi/nexus.json
Outdated
"type": "object", | ||
"properties": { | ||
"baseboard": { | ||
"$ref": "#/components/schemas/Baseboard" | ||
}, | ||
"cpus": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note the comments on the internal API are all valid here, just that they were written before this was updated. Though the ram here gets it correct that it's what's available.
|
||
/// Returns the amount of RAM on this sled, in bytes. | ||
pub fn physical_ram_bytes() -> Result<u64, Error> { | ||
let phys_pages = sysconf("physical pages", libc::_SC_PHYS_PAGES)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think this number is useful for the control plane, but it'll be less than the total DRAM that's installed and eventually we'll want to distinguish these two items. A customer would be confused to see 1011 GiB (using the sn21 example of prtconf -m
showing 1036271 MiB). Just a note that we'll want to be doing something different later here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that we will want to distinguish between the two, but for provisioning and accounting, we care about "what can the OS and all services on the OS use". I've updated the name here to usable_physical_ram
and documented the intent in the APIs.
openapi/nexus.json
Outdated
"rack_id": { | ||
"description": "The rack to which this Sled is currently attached.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we may want to think carefully about the term rack_id as we have no good way of actually relating this to a physical rack. It's really a logical control plane instance id for a rack.
I mention this because this suggests that it can change due to moving the sled, but if we move all the sleds from rack A to rack B, it won't change. So I think we may want to be more precise. This may be something that should come up as a different follow up along with the bit below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better identifier for us to use for a rack?
The control plane attaches UUID labels to nearly every object it's aware of for uniformity, even if there is some other key value that can uniquely identify it (for example, baseboard/version/revision uniquely identifies a sled, but it still has a UUID).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally find it confusing to call this a rack ID when it's not tied to the identity of the rack, which is not self-discoverable. However, this also came up in oxidecomputer/rfd#563. It may be that it's just me that find this term confusing as opposed to some of the others that @andrewjstone mentioned in that context there.
Fundamentally we can't accurately indicate a physical rack that a sled is in because we don't have a way to determine that. We can make approximations and similar, but that can also be out of date based on certain service operations. Either way, I think it's probably best to just not worry about this in this PR. I should have probably just filed a separate issue, but I suspect if the rack ID is clear to everyone else, then this is mostly a documentation problem about making it clear what it can and can't represent. Though I do still think we want to make sure during initial set up we scan information about the rack nameplate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meghan actually if there was something we needed to do during manufacturing or setup to include information from the rack's nameplate.
@smklein this came up during the RFD-267 huddle too. Essentially it seems like we're going to be mostly unable to tell what sled occupies which rack (accurately, anyway). I believe this is going to be fairly problematic in a multi-rack deployment in that it both breaks some of the assumptions in the relational hierarchy and it'll also be difficult to communicate to an operator the physical location of a piece of hardware in their overall installation.
|
||
/// Returns the number of physical processors on this sled. | ||
pub fn online_processor_count() -> Result<u32, Error> { | ||
Ok(u32::try_from(sysconf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separate to this, sysconf
returns a long. In an ILP64 this is an i64. I realize we're unlikely to have more than a u32 of processors, but everywhere else you treat cpus as an i64
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated to the following:
I'm converting the value to a u32 immediately, throwing an error if it is not convertible, and using that to represent processor count.
There is a small wrinkle where I need to convert it to an i64 to store it in the DB (Postgres does not like unsigned types) but it's u32 in-memory and in the API everywhere esle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I guess if the db is requiring it to be a singed type, almost makes me think we should force i64 the rest of the way just so we don't have to ask the question what happens if the db ends up with an unrepresentable value. But I defer to you on this as I don't know how diesel and the general error handling here is expected to flow. I realize there is a constraint being placed on the column, so it probably isn't worth worrying about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we didn't want to enable negative values for CPUs? By using a u32 here, I validate the value from the sled. If I propagated an i64 all the way up to Nexus, it would be possible to use an "invalid" value for CPU count for much longer. Shouldn't we be throwing an error as early as we reasonably can?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm honestly not sure what the right answer is to the fact unsigned types are for chumps in the db. Seems like probably best to just hope the constraint we have works and is never violated and so it can always serialize / de-serialize as a u32 and then what you're saying makes sense. Just whenever I see a constraint on that, my mind goes to what happens if that gets violated because of issue XXX and the wrong thing is on disk. As I said above, this probably isn't worth worrying about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback @rmustacc ; I've tried to update both the names, types, and documentation to be more accurate.
|
||
/// Returns the number of physical processors on this sled. | ||
pub fn online_processor_count() -> Result<u32, Error> { | ||
Ok(u32::try_from(sysconf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated to the following:
I'm converting the value to a u32 immediately, throwing an error if it is not convertible, and using that to represent processor count.
There is a small wrinkle where I need to convert it to an i64 to store it in the DB (Postgres does not like unsigned types) but it's u32 in-memory and in the API everywhere esle.
|
||
/// Returns the amount of RAM on this sled, in bytes. | ||
pub fn physical_ram_bytes() -> Result<u64, Error> { | ||
let phys_pages = sysconf("physical pages", libc::_SC_PHYS_PAGES)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that we will want to distinguish between the two, but for provisioning and accounting, we care about "what can the OS and all services on the OS use". I've updated the name here to usable_physical_ram
and documented the intent in the APIs.
Ok(r.try_into()?) | ||
} | ||
|
||
/// Returns the number of physical processors on this sled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to "online processors", and the names of the variables to "online_logical_cpus" to be pedantic.
openapi/nexus-internal.json
Outdated
"cpus": { | ||
"description": "Number of CPUs online on the sled.", | ||
"type": "integer", | ||
"format": "int64" | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the name to online_logical_cpus
to encapsulate this a little better.
In the limit, we should be using online_physical_cpus
as a better factor for provisioning decisions.
openapi/nexus.json
Outdated
"rack_id": { | ||
"description": "The rack to which this Sled is currently attached.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better identifier for us to use for a rack?
The control plane attaches UUID labels to nearly every object it's aware of for uniformity, even if there is some other key value that can uniquely identify it (for example, baseboard/version/revision uniquely identifies a sled, but it still has a UUID).
openapi/nexus.json
Outdated
"rack_id": { | ||
"description": "The rack to which this Sled is currently attached.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally find it confusing to call this a rack ID when it's not tied to the identity of the rack, which is not self-discoverable. However, this also came up in oxidecomputer/rfd#563. It may be that it's just me that find this term confusing as opposed to some of the others that @andrewjstone mentioned in that context there.
Fundamentally we can't accurately indicate a physical rack that a sled is in because we don't have a way to determine that. We can make approximations and similar, but that can also be out of date based on certain service operations. Either way, I think it's probably best to just not worry about this in this PR. I should have probably just filed a separate issue, but I suspect if the rack ID is clear to everyone else, then this is mostly a documentation problem about making it clear what it can and can't represent. Though I do still think we want to make sure during initial set up we scan information about the rack nameplate.
|
||
/// Returns the number of physical processors on this sled. | ||
pub fn online_processor_count() -> Result<u32, Error> { | ||
Ok(u32::try_from(sysconf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I guess if the db is requiring it to be a singed type, almost makes me think we should force i64 the rest of the way just so we don't have to ask the question what happens if the db ends up with an unrepresentable value. But I defer to you on this as I don't know how diesel and the general error handling here is expected to flow. I realize there is a constraint being placed on the column, so it probably isn't worth worrying about that.
openapi/nexus.json
Outdated
@@ -13717,13 +13717,17 @@ | |||
"type": "string", | |||
"format": "uuid" | |||
}, | |||
"online_logical_cpus": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been struggling with the nomenclature here. This is probably fine, but I'd like to write up the different thoughts here as part of trying to think about future things. Part of my concern is what does the term cpu or processor mean and trying to come up with something that is understandable to lots of people, but I suspect this is a fool's errand in the end.
Fundamentally in the world that we are operating in, we have CPUs (referring in this instance to thing that we buy from someone) that are installed into sockets. They have some number of cores and some number of logical units beneath that that represent schedulable work (trying to avoid terms for a moment). Those logical units are called different things. For example Intel calls it "hyper-threading" and sometimes threads. In cpuid leaf 0xb and 0x1f they describe it as SMT. The AMD PPR is more willing to use the term threads in family 0x17 and 0x19. The amd64 level calls them threads.
But amd64 also had a thing they did in family 0x15 that lead the next level to be "logical cores" because of how they used CMT (clustered multi-threading). Which means the question of what is or isn't a core is weird, though this also isn't a CPU or thing that we expect to see again.
What we currently call a logical CPU corresponds to threads, which is what most folks seem to agree on. I've tried to look this up in ARM, but don't see great consensus. RISC-V does call this same thing a HART, or hardware thread, an says a core is made up of one or more HARTs.
From a provisioning perspective in the short and longer term we're interested in how many usable logical CPUs and cores there are. From an operator inventory perspective and overall capacity planning (versus in the moment), where we assume all hardware is usable, the things I think about are: asking how many sockets does a server have (more informational), how many total cores are there and how many threads/logical CPUs here are.
Finally, the last thing with nomenclature here is what do we think of when we're provisioning guests, which is the fundamental customer-facing abstraction. Right now the API file describes folks as passing ncpus
and defines it as a CPU which right now maps to logical CPUs / threads in the system.
So playing this out, it seems like we'd expect a server object if we continue this trend to have:
- online_logical_cpus (still not sure if we should maybe call this cores)
- online_physical_cores
- total_logical_cpus
- total_physical_cpus
- total_physical_sockets
We may opt to leave out sockets here because it's not immediately useful. It's also worth noting that in a core-aware scheduling world, online_logical_cpus
is not the same as what's usable. It's also not clear to me what we want to present with respect to what the actual provisionable CPU is from an operator's perspective, that is what is usable for guests, which is something that they care about, because it's probably not going to be all 128 even on a good day. This almost suggests hat we want a third category of usable_xxx
, though this is starting to get complicated and more nuanced. Anyways, that's probably more on this than is useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried that if we get too nuanced here it'll make the API harder to document and to reason about. I don't want to derail progress and perhaps it's something we can just put a pin in, but do other platforms define these terms in some way we could borrow? What would an operator typically see if they were using VMWare or Open stack or something?
I have an uninformed perspective of this so my opinion counts for little here, but I wouldn't understand what online_logical_cpus
means without reading documentation. Does online
imply that there are offline
versions too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zephraph : Yes, the online
does imply that it's possible for some of these to be offline
.
https://illumos.org/man/8/psrinfo is one such way of inspecting this info, though that man page also has an opinion about naming of CPUs / processors / cores.
psrinfo displays information about processors. Each physical socket may
contain multiple cores, which in turn can contain multiple virtual
processors (also referred to as CPUs). Each virtual processor is an
entity with its own ID, capable of executing independent threads.
For example:
gimlet-sn21 # psrinfo -v | head -n 10
Status of virtual processor 0 as of: 02/10/2023 16:21:15
on-line since 10/01/2000 06:11:08.
The i386 processor operates at 2000 MHz,
and has an i387 compatible floating point processor.
Status of virtual processor 1 as of: 02/10/2023 16:21:15
on-line since 10/01/2000 06:11:08.
The i386 processor operates at 2000 MHz,
and has an i387 compatible floating point processor.
...
(It shows 128 of these "virtual processors")
Using a command line psradm
, these CPUs can be turned "offline": https://illumos.org/man/8/psradm
But to summarize some of this info on a real Gimlet, here's what I'm seeing:
"report the total number of CPU cores"
gimlet-sn21 # psrinfo -ct
64
"report the total number of 'things capable of executing hardware threads'"
gimlet-sn21 # psrinfo -t
128
"report the number of virtual processors seen through sysconf"
gimlet-sn21 # getconf -a | grep NPROCESSORS_ONLN
NPROCESSORS_ONLN: 128
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed info!
common/src/sql/dbinit.sql
Outdated
@@ -86,6 +86,10 @@ CREATE TABLE omicron.public.sled ( | |||
part_number STRING(63) NOT NULL, | |||
revision INT8 NOT NULL, | |||
|
|||
/* CPU & RAM summary for the sled */ | |||
online_logical_cpus INT8 CHECK (online_logical_cpus BETWEEN 0 AND 4294967295) NOT NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify... online in this context means present and available for use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, though this gets nuanced in the future. See #2299 (comment) and the fact that we could have online logical cpus that aren't usable for VM guests in a core-aware scheduling world because you wouldn't have a working twin (but it could be usable for others).
I went ahead and updated the Nexus-level tracking of compute resources as "usable_hardware_threads", since that seems to be the unit we're going to care about in the near-term. The terms "CPUs" and "Processors" are a little too vague for my liking - perhaps it'll be more clear when we're describing inventory, but from an accounting / overprovisioning point-of-view, I think we care about "how many things can we run at the same time". We will likely add "usable_cores" in the future, in a core-aware world, but I think it will be reasonable to make that jump when we get there. |
Whenever y'all have time, this is ready for another look - I think I've addressed all feedback that I can. |
One more ping on this |
Sorry for the delay in getting back to you @smklein. I think the approach we have for threads makes sense for now. I'm still not really convinced that the DRAM value we're providing is going to be actionable and is going to need a lot more docs to make sense of because it doesn't help an operator answer the question of how much capacity is actually available for VMs on this sled. All that said, I don't see this as a reason to hold this up at all. It's a value we need in nexus, regardless of whether or not it's something we want to expose to the operator. |
Propagate "CPU count" and "Physical RAM" information from Sled Agents up to Nexus, and through the external API.
This information will be used in the future to make provisioning decisions (e.g., do I have enough CPU / RAM to run the services I expect to run on a particular Sled).
It's also currently being plumbed through the operator API for visibility, though that representation of hardware may change to align with a more "holistic view of inventory".