Skip to main content

Part 4: Maintaining a DeepSquare cluster

As you can see, ClusterFactory is quite heavy. So here are the recommended practices to maintain your cluster with ClusterFactory.

Updating, Backup, Restore, Ejecting a controller

About maintaining the Kubernetes Cluster, read that here.

Upgrading is safe and seamless so don't hesitate.

To update ClusterFactory, git fetch upstream and git merge upstream/<ref>. Always solve the merge conflits. Updates to the core and argo directories will go to core.example and argo.example.

Updating the software stack

To update the containers:

In each values.yaml file, you can override the tag of the container.

If you are using dev containers, just kill the pods. If the imagePullPolicy is Always, it will pull a new image.

To update a helm subchart:

You can update the Helm subchart by bumping the version inside the Chart.yaml.

To update the smart-contract address:

Update the LDAP connector secrets and supervisor values.

If you are using dev containers, the smart-contract address is always the latest smart-contract release.

If you are using a stable version, the smart-contract address will be the one used on app.deepsquare.run.

To update the compute plane:

If you are using root=live:https://sos-ch-dk-2.exo.io/osimages/squareos-9.2/squareos-9.2.squashfs in your kernel command line parameters, rebooting the nodes will always update the OS image.

To update the postscripts:

Update the git repository that you've created. Rebooting the nodes will execute the latest postscripts.

In case of major upgrade (new OS image + new container images):

Make sure the SLURM version matches (slurmd --version) between the SLURM controller and SLURM daemon. Besides that, everything is decoupled enough that there is no problem to update each component individually.

To customize the OS image:

ClusterFactory provides the Packer recipe for building the SquareOS image. Edit and use it to add your software.

Monitoring

We recommend to deploy a prometheus stack on Kubernetes and deploy node exporters on the compute nodes.

Use the stack with the Prometheus Operator so you can use the ServiceMonitor resources to easily configure the services to be monitored instead of using a static configuration.

Single Responsibility

ClusterFactory tries to follow GitOps. You must force yourself to not use kubectl and cfctl directly.

Core resources (Volumes, Namespaces, AppProject) will always need to be deployed with kubectl. The rest can be integrated inside a Kustomization bundle or a Helm application.

You can expose the ArgoCD Dashboard so that Developpers can use the ArgoCD Dashboard to deploy their application.

You can also share the public certificate of sealed-secret by running:

kubeseal --fetch-cert --controller-namespace sealed-secrets --controller-name sealed-secrets > tls.crt

Then, developers can seal a secret by using kubeseal with tls.crt:

kubeseal --cert tls.crt -o yaml -f my-secret.yaml.local

Having issues ?

Feel free to ask us questions or ask for help on Discord!

The product is still young and the documentation certainly needs some fine-tuning.