
Tested 2.0.0-rc.3 on Nomad, test failed #330

Closed
resmo opened this issue Nov 21, 2022 · 9 comments

resmo commented Nov 21, 2022

Hi, I just wanted to give some feedback about the 2.0.0-rc.3 on Nomad. We have a working CSI plugin running version 1.6.0.

Nomad job:

job "hcloud-csi" {
  datacenters = ["dc1"]
  namespace   = "default"
  type        = "system"

  group "monolith" {
    task "plugin" {
      driver = "docker"

      config {
        image      = "hetznercloud/hcloud-csi-driver:1.6.0"
        privileged = true
      }

      env {
        CSI_ENDPOINT   = "unix://csi/csi.sock"
        ENABLE_METRICS = true
        HCLOUD_TOKEN   = "..."
      }

      csi_plugin {
        id        = "csi.hetzner.cloud"
        type      = "monolith"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}

After switching to 2.0.0-rc.3, the job fails with:

Error: CSI plugin failed probe: timeout while connecting to gRPC socket: failed to stat socket: stat /opt/nomad/data/client/csi/plugins/f6e8b9ec-de9c-329d-35d1-0289cc0fa182/csi.sock: no such file or directory

Switching back to 1.6.0 recovered the job.

Let me know if I can help out to find any issues related to this.

apricote self-assigned this Nov 21, 2022

resmo commented Nov 21, 2022

with 1.6.0:

# ls -l /opt/nomad/data/client/csi/plugins/*/
srwxr-xr-x 1 root root 0 Nov 21 10:52 csi.sock
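
For reference, the plugin registration can also be checked from the Nomad side, e.g.:

$ nomad plugin status csi.hetzner.cloud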

apricote (Member) commented:

I am no Nomad expert, so please take the following with a grain of salt. I will continue to investigate this with a coworker.

I assume this happens because we changed our image to include two separate binaries for node and controller. See this PR for details of the change: #266

To fix this, you would need to change your manifest to have separate jobs for csi_plugin.type = controller and csi_plugin.type = node, and to specify the binary, similar to how we do it in our Kubernetes manifests:

For node:

command: [/bin/hcloud-csi-driver-node]

For controller:

https://github.com/hetznercloud/csi-driver/blob/ce413a923c316b5e7a67f6bb1c4d98366c4d1fe7/deploy/kubernetes/controller/deployment.yaml#LL39


resmo commented Nov 21, 2022

Thanks for the hint!

I was able to get it running:

controller job (only 1 alloc needed -> service):

job "hcloud-csi-controller" {
  datacenters = ["dc1"]
  namespace   = "default"
  type        = "service"

  group "controller" {
    task "plugin" {
      driver = "docker"

      config {
        image   = "hetznercloud/hcloud-csi-driver:2.0.0"
        command = "bin/hcloud-csi-driver-controller"
      }

      env {
        CSI_ENDPOINT   = "unix://csi/csi.sock"
        ENABLE_METRICS = true
        HCLOUD_TOKEN   = "..."
      }

      csi_plugin {
        id        = "csi.hetzner.cloud"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}

node job:

job "hcloud-csi-node" {
  datacenters = ["dc1"]
  namespace   = "default"
  type        = "system"

  group "node" {
    task "plugin" {
      driver = "docker"

      config {
        image      = "hetznercloud/hcloud-csi-driver:2.0.0"
        command    = "bin/hcloud-csi-driver-node"
        privileged = true
      }

      env {
        CSI_ENDPOINT   = "unix://csi/csi.sock"
        ENABLE_METRICS = true
      }

      csi_plugin {
        id        = "csi.hetzner.cloud"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}

volume definition:

type      = "csi"
id        = "volume-mixed"
name      = "volume-mixed"
plugin_id = "csi.hetzner.cloud"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

mount_options {
  fs_type     = "ext4"
  mount_flags = ["discard", "defaults"]
}

create volume:

$ nomad volume create vol-mixed.hcl
Created external volume 24844804 with ID volume-mixed

Verify with the hcloud CLI:

$ hcloud volume list
ID         NAME                          SIZE    SERVER                   LOCATION
...
24844804   volume-mixed                  10 GB   -                        fsn1
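
For illustration, a job would consume this volume via a group-level volume stanza and a task-level volume_mount, roughly like this (group, task, and mount path are hypothetical):

group "db" {
  volume "data" {
    type            = "csi"
    source          = "volume-mixed"
    access_mode     = "single-node-writer"
    attachment_mode = "file-system"
  }

  task "db" {
    volume_mount {
      volume      = "data"
      destination = "/srv/data"   # hypothetical mount path inside the task
    }
    # driver, config, resources omitted
  }
}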

resmo closed this as completed Nov 21, 2022
apricote (Member) commented:

General note: We currently do not explicitly support Nomad. We do not test against Nomad. I will open an issue for this support, so we can make sure to test for breakages and document upgrade procedures in the future.


The deployment should work with v2.0.0 using the following manifests:

job "hcloud-csi-node" {
  datacenters = ["dc1"]
  namespace   = "default"
  type        = "system"

  group "hcloud-csi-node" {
    task "plugin" {
      driver = "docker"

      config {
        image      = "hetznercloud/hcloud-csi-driver:2.0.0"
        privileged = true
        command    = "/bin/hcloud-csi-driver-node"
      }

      env {
        CSI_ENDPOINT   = "unix://csi/csi.sock"
      }

      csi_plugin {
        id        = "csi.hetzner.cloud"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}

job "hcloud-csi-controller" {
  datacenters = ["dc1"]
  namespace   = "default"

  group "hcloud-csi-controller" {
    task "plugin" {
      driver = "docker"

      config {
        image      = "hetznercloud/hcloud-csi-driver:2.0.0"
        privileged = true
        command    = "/bin/hcloud-csi-driver-controller"
      }

      env {
        CSI_ENDPOINT   = "unix://csi/csi.sock"
        HCLOUD_TOKEN   = "..."
      }

      csi_plugin {
        id        = "csi.hetzner.cloud"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}
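
Assuming the two jobs are saved as hcloud-csi-controller.nomad and hcloud-csi-node.nomad (filenames are illustrative), deploying and checking them would look roughly like:

$ nomad job run hcloud-csi-controller.nomad
$ nomad job run hcloud-csi-node.nomad
$ nomad plugin status csi.hetzner.cloud   # should report healthy controller and node plugins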

I am going to close the issue. If you still have problems with the deployment, please feel free to reopen or create a new issue.


resmo commented Nov 21, 2022

Perfect, if you continue to create RCs, I will test them and give feedback.

apricote (Member) commented:

> Perfect, if you continue to create RCs, I will test them and give feedback.

Actually, v2.0.0 is already released!


resmo commented Nov 21, 2022

Yes, I know. ;) I meant future versions.


resmo commented Nov 23, 2022

Just for the record.

I had an issue with CSI 2.0.0 today (after upgrading Nomad to 1.4.3). I was not able to deploy a job with a CSI volume:

    2022-11-23T15:58:04+01:00: Task Group "db" (failed to place 1 allocation):
      * Class "public": 3 nodes excluded by filter
      * Constraint "did not meet topology requirement": 3 nodes excluded by filter
    2022-11-23T15:58:04+01:00: Evaluation "fac96ff2" waiting for additional capacity to place remainder
==> 2022-11-23T15:58:04+01:00: Monitoring deployment "a1048165"

I tried a few things:

  • downgrade Nomad / Consul
  • reboot Nomad nodes
  • uninstall/reinstall the CSI plugin
  • purge/recreate the job having the volume
  • deploy a new Nomad client

But what helped was downgrading to hcloud CSI 1.6.0...

apricote (Member) commented:

Thanks for the report @resmo, this is a result of #302, where we changed the topology labels.

It looks like we need to revert that change, but this will require manual user intervention for all volumes created with 2.x.

If you want, you can create a separate issue for this.
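
For anyone debugging this, the topology segments Nomad recorded for a volume should show up in its status output, which makes it possible to compare volumes created with 1.6.x and 2.x (illustrative, using the volume ID from above):

$ nomad volume status volume-mixed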
