
kops 1.31 not reading cluster server API endpoint from local "~/.kube/config" #17262

Open
salavessa opened this issue Feb 10, 2025 · 5 comments

@salavessa

/kind bug

1. What kops version are you running? The command kops version will display this information.

Client version: 1.31.0 (git-v1.31.0)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

n/a

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
kops rolling-update cluster
or
kops reconcile cluster --yes

5. What happened after the commands executed?

Using cluster from kubectl context: dev.private-dns.com

Unable to reach the kubernetes API.
Use --cloudonly to do a rolling-update without confirming progress with the k8s API

Error: error listing nodes in cluster: Get "https://api.dev.private-dns.com/api/v1/nodes": dial tcp: lookup api.dev.private-dns.com on 127.0.0.53:53: no such host

6. What did you expect to happen?
The kubernetes cluster server API endpoint used by kops should be the one configured in the local ~/.kube/config file (under .clusters[]|select(.name=="<cluster_name>").cluster.server) instead of the "default" value inferred from the cluster name (e.g. https://api.dev.public-dns.com:8443 vs https://api.dev.private-dns.com).
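For reference, the locally configured endpoint can be inspected with kubectl and jq — a minimal sketch using this report's example cluster name, with the jq path quoted above:

$ kubectl config view --raw -o json \
    | jq -r '.clusters[]|select(.name=="dev.private-dns.com").cluster.server'
https://api.dev.public-dns.com:8443    # the endpoint kops was expected to use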

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

n/a

8. Please run the commands with the most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

$ kops rolling-update cluster
I0210 11:33:55.071763    2246 loader.go:395] Config loaded from file:  /home/user/.kube/config
Using cluster from kubectl context: dev.private-dns.com

I0210 11:33:55.072646    2246 factory.go:98] state store s3://some-s3-bucket
I0210 11:33:55.073014    2246 s3context.go:350] unable to read /sys/devices/virtual/dmi/id/product_uuid, assuming not running on EC2: open /sys/devices/virtual/dmi/id/product_uuid: permission denied
I0210 11:33:55.073252    2246 s3context.go:192] defaulting region to "us-east-1"
I0210 11:33:55.423361    2246 s3context.go:228] found bucket in region "us-west-2"
I0210 11:33:55.424839    2246 s3fs.go:405] Reading file "s3://some-s3-bucket/dev.private-dns.com/config"
I0210 11:33:56.045827    2246 clientset.go:114] Using keystore path: "s3://some-s3-bucket/dev.private-dns.com/pki"
I0210 11:33:56.046624    2246 aws_utils.go:52] Querying EC2 for all valid regions
I0210 11:33:56.478535    2246 aws_cloud.go:2023] Querying EC2 for all valid zones in region "us-west-2"
I0210 11:33:57.192625    2246 s3fs.go:405] Reading file "s3://some-s3-bucket/dev.private-dns.com/pki/private/kubernetes-ca/keyset.yaml"
I0210 11:33:57.379893    2246 certificate.go:104] Parsing pem block: "CERTIFICATE"
I0210 11:33:57.380479    2246 privatekey.go:194] Parsing pem block: "RSA PRIVATE KEY"
I0210 11:33:58.065080    2246 envvar.go:172] "Feature gate default state" feature="WatchListClient" enabled=false
I0210 11:33:58.065453    2246 envvar.go:172] "Feature gate default state" feature="InformerResourceVersion" enabled=false
I0210 11:33:58.065894    2246 round_trippers.go:466] curl -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kops" 'https://api.dev.private-dns.com/api/v1/nodes'
I0210 11:33:58.085164    2246 round_trippers.go:495] HTTP Trace: DNS Lookup for api.dev.private-dns.com resolved to []
I0210 11:33:58.085590    2246 round_trippers.go:553] GET https://api.dev.private-dns.com/api/v1/nodes  in 19 milliseconds
I0210 11:33:58.085702    2246 round_trippers.go:570] HTTP Statistics: DNSLookup 18 ms Dial 0 ms TLSHandshake 0 ms Duration 19 ms
I0210 11:33:58.085796    2246 round_trippers.go:577] Response Headers:
Unable to reach the kubernetes API.
Use --cloudonly to do a rolling-update without confirming progress with the k8s API

Error: error listing nodes in cluster: Get "https://api.dev.private-dns.com/api/v1/nodes": dial tcp: lookup api.dev.private-dns.com on 127.0.0.53:53: no such host

9. Anything else do we need to know?
This undesired behavior has only been happening since kops 1.31, during the rolling-update phase (either via kops rolling-update cluster or kops reconcile cluster --yes). All earlier kops versions (including kops 1.30.4) work as expected.

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 10, 2025
@rifelpet
Member

cc @justinsb

@zetaab
Member

zetaab commented Feb 20, 2025

This is somewhat the same thing: we are also facing issues with rolling-update on 1.31.

// Generate a relatively short-lived certificate / kubeconfig
createKubecfgOptions := kubeconfig.CreateKubecfgOptions{
	Admin: 1 * time.Hour,
}

It seems the kubeconfig lifetime is hardcoded to one hour. Why doesn't it check whether an existing kubeconfig is present and use that? We are facing issues with big cluster updates that take longer than one hour; now we need to run updates multiple times.

This was also reported in Slack by rekiri:

We have an issue with kOps 1.31 in large clusters where an upgrade takes more than an hour with rolling update: the upgrade process fails with an Unauthorized error when validating and listing nodes in the cluster. kops export kubecfg --admin=... seems to export credentials just fine, but the rolling-update command still seems to hit the one-hour token timeout. Is there something I'm missing, or what could be causing it?
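For reference, the workaround attempted in that Slack report — re-exporting longer-lived admin credentials — looks roughly like this, with an illustrative duration value; per the report, rolling-update still generates its own one-hour kubeconfig and ignores the exported credentials:

$ kops export kubecfg --name=<cluster_name> --admin=4h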

@zetaab
Member

zetaab commented Feb 20, 2025

This change was added by @justinsb in #17154 and #17225.

It's causing these issues.

@justinsb
Member

Ah, so we changed this to automatically use its own kubeconfig, because otherwise reconcile cluster would fail when run before the cluster was created (i.e. from the start).

I think there are two issues here:

  • The 1-hour timeout is too short (or we need to regenerate the credentials automatically).
  • The original problem, where the value in ~/.kube/config is not the same as we are generating.

The second one is slightly surprising, in that we should be calling into the same code. Maybe there are some additional flags being passed when creating the ~/.kube/config that we need to honor, @salavessa? If not, do you know why the hostname is different?

@salavessa
Author

Hi @justinsb

Every time we perform a kops operation which "overwrites" the cluster's server endpoint in the local ~/.kube/config (e.g. kops update cluster --name=<cluster_name> --yes or kops export kubecfg --name=<cluster_name> --admin), we also run a kubectl config step to point the server back at a "valid/visible" endpoint for the host where kops is being executed, along the lines of: kubectl config set-cluster <cluster_name> --server=<visible_cluster_api_endpoint>.

It's not ideal, and it would be good if kops stopped updating the corresponding cluster's "server" endpoint in ~/.kube/config when the configuration is already present, but that extra kubectl config ... step isn't something that really causes any pain.
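Putting the two commands together, the workaround sequence looks like this — a sketch using this report's example cluster name and endpoint:

$ kops export kubecfg --name=dev.private-dns.com --admin
$ kubectl config set-cluster dev.private-dns.com --server=https://api.dev.public-dns.com:8443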

If reading the cluster API endpoint from the local file is not possible for the reconcile cluster functionality (at least during cluster creation), maybe it would be alright if the API endpoint could be specified/overridden when running kops, e.g. through a new --kubectl-cluster-server command option, a KOPS_<XXX> environment variable, or similar?

For a bit more context and so you know our use case:

  • kops clusters are created with private topology and use API DNS (instead of the default "load balancer")
  • additional SANs for the <visible_cluster_api_endpoint> are included in the kops cluster config
  • the host from which the kops CLI is executed doesn't have direct access (nor DNS resolution capabilities) to the cluster network and needs to "proxy" through a different private network (via <visible_cluster_api_endpoint>) in order to access the kubernetes API

Additionally to the use case above, though not related to this issue, the only thing that's far from ideal in kops is that, although we use private topology and DNS for API access, kops still creates the public "utility" subnets, which will always be empty/unused. This becomes somewhat of an issue for our model because we have limited subnet addressing space due to "peering" across many other networks (and overlapping is not possible), but that's "another story".
