[Umbrella issue] How we monitor k8s-infra ? #2588

ameukam · 2021-08-20T14:10:15Z

We initially had this conversation in #401.

Also kubernetes/test-infra#23317 (comment):

FYI @ameukam we don't have this feature enabled in kubernetes.io at the moment but will want to take a look at it soon

Some questions from thockin:

Cluster monitoring
a) What should we use?
GKE Workload metrics: https://cloud.google.com/stackdriver/docs/solutions/gke/managing-metrics#workload-metrics
Managed Service for Prometheus: https://cloud.google.com/stackdriver/docs/managed-prometheus
b) How do we set it up with git-ops?
- #1376
- #1624
c) What exactly are we concerned about (signals)?
d) How are alerts delivered to a group of people?
e) How do we manage that group?
f) Do we need an on-call rotation?

App monitoring
a) Same tool as cluster monitoring?
b) What is the minimum expectation for an app to be deployed into community space?
c) How do we manage groups of alerts for each app (ggroups?)
d) How do we manage on-call for each app?

GCP quotas monitoring
How do we monitor them?

More questions can be added.

/milestone v1.23
/are infra


ameukam · 2021-08-20T14:37:15Z

For this milestone, I would like to focus on fleshing out the methods and practices for monitoring k8s-infra.

/area infra

spiffxp · 2021-09-02T19:25:14Z

/priority important-longterm

jimdaga · 2021-09-16T00:48:20Z

If Prometheus is the tool picked, I'm happy to jump in and help; I have a decent amount of experience with it.

Related to: - Ref: kubernetes#2588 Bootstrap a new subrepo that will host the Terraform resources consuming the GCP monitoring API. Signed-off-by: Arnaud Meukam <[email protected]>

Related to: - Ref: kubernetes#2588 Bootstrap a new subrepo that will host the Terraform resources consuming the GCP monitoring API. I also bumped the Terraform provider for this subrepo and will bump the other declarations of the provider in a follow-up PR. Signed-off-by: Arnaud Meukam <[email protected]>

Related: - Part of: kubernetes#2588 - Fixes: kubernetes#2942 - followup of: kubernetes#2898 Ensure service account tf-monitoring-deployer can be used in build cluster prow-build-trusted Signed-off-by: Arnaud Meukam <[email protected]>

ameukam · 2021-12-06T17:33:45Z

/milestone v1.24

k8s-triage-robot · 2022-03-06T17:43:57Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

ameukam · 2022-03-06T18:13:57Z

/remove-lifecycle stale

ameukam · 2022-05-12T02:59:56Z

/milestone clear

k8s-triage-robot · 2022-08-10T04:01:35Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

ameukam · 2022-08-10T14:27:35Z

/remove-lifecycle stale

k8s-triage-robot · 2022-11-08T15:20:31Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

riaankleinhans · 2022-11-08T17:23:44Z

/remove-lifecycle stale

k8s-triage-robot · 2023-02-06T17:31:23Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2023-03-08T17:48:20Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

riaankleinhans · 2023-03-12T21:23:40Z

/remove-lifecycle rotten

k8s-triage-robot · 2023-06-10T22:04:57Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

riaankleinhans · 2023-06-11T16:42:37Z

/remove-lifecycle stale

ameukam · 2023-07-27T21:17:05Z

/lifecycle frozen

ameukam · 2024-12-06T20:51:24Z

There is an effort to deploy a unified stack for monitoring.

See: #7377

/close

k8s-ci-robot · 2024-12-06T20:51:29Z

@ameukam: Closing this issue.

In response to this:

There is an effort to deploy a unified stack for monitoring.

See: #7377

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ameukam added the wg/k8s-infra label Aug 20, 2021

k8s-ci-robot added this to the v1.23 milestone Aug 20, 2021

k8s-ci-robot added the area/infra Infrastructure management, infrastructure design, code in infra/ label Aug 20, 2021

k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Sep 2, 2021

k8s-ci-robot added sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. and removed wg/k8s-infra labels Sep 29, 2021

ameukam mentioned this issue Oct 8, 2021

Add Terraform subrepo for monitoring #2898

Merged

spiffxp mentioned this issue Oct 14, 2021

terraform/k8s-infra-prow-build: add dashboards #2938

Merged

ameukam mentioned this issue Oct 15, 2021

kubernetes-public: Fix workload identity binding #2951

Merged

ameukam added this to SIG K8S Infra Dec 6, 2021

k8s-ci-robot modified the milestones: v1.23, v1.24 Dec 6, 2021

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 6, 2022

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 6, 2022

k8s-ci-robot removed this from the v1.24 milestone May 12, 2022

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 10, 2022

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 10, 2022

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 8, 2022

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 8, 2022

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 6, 2023

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 8, 2023

k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 12, 2023

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 10, 2023

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2023

k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jul 27, 2023

k8s-ci-robot closed this as completed Dec 6, 2024

github-project-automation bot moved this to Done in SIG K8S Infra Dec 6, 2024
