
Overview

About ClusterFactory

ClusterFactory is a modern, Kubernetes-based cluster manager.

It brings together best-in-class solutions from the HPC, Cloud, and DevOps industries to deploy and manage compute clusters declaratively, following the GitOps practice.

ClusterFactory is the easiest method to make your infrastructure ready to join the DeepSquare Grid.

To learn more about ClusterFactory, please have a look at its dedicated documentation.

ClusterFactory allows DeepSquare to maintain a common software stack to facilitate provisioning, maintenance and writing infrastructure as code for bare-metal and cloud clusters.

What is deployed with ClusterFactory

Remember that ClusterFactory acts as the control plane of the whole cluster: it hosts the various components necessary for the proper functioning of DeepSquare's compute plane.

ClusterFactory will be primarily used to deploy the following stacks:

  1. The vanilla Kubernetes stack: This stack provides the foundation for running Kubernetes services within the cluster.

  2. The network stack:

    • Traefik: It acts as the main entry point for Kubernetes services, functioning as a Layer 7 router and load balancer.
    • MetalLB: This component announces IP addresses for Kubernetes LoadBalancer services.
    • Multus: It enables multiple network interfaces with CNI plugins and serves as a secondary entry point for Kubernetes services that require the L2 network layer of the local network.
    • CoreDNS: This is the primary domain name server for the entire cluster.
  3. The GitOps stack:

    • ArgoCD: It facilitates continuous deployment based on Git as the source of truth.
    • cert-manager: A powerful tool for generating and managing TLS certificates.
    • sealed-secrets: This component allows encrypted secrets to be stored on Git.
  4. The provisioning stack:

    • Grendel: It is an all-in-one bare-metal provisioner that incorporates a DHCP server, a PXE server, an HTTP server, a TFTP server, and an IPMI controller.
  5. The software stack:

    • NFS CSI driver and local-path provisioner: These components enable storage provisioning for Kubernetes.
    • MariaDB: The database used by SLURM, a batch job scheduler for High-Performance Computing (HPC).
    • 389ds: The LDAP server utilized by SLURM and the compute nodes.
    • Provider LDAP connector: This DeepSquare solution automatically registers DeepSquare users with the LDAP server.
    • SLURM: The batch job scheduler responsible for managing HPC workloads.
      • A login container: It serves as the main entry point for submitting SLURM batch scripts.
      • The controller container: It manages the SLURM system.
      • The database container: This connects MariaDB with the SLURM controller.
    • The provider Supervisor: This DeepSquare solution bridges DeepSquare with SLURM.
    • CVMFS Stratum 1: A CVMFS server that replicates the software exported to the compute nodes for the DeepSquare Grid.

Take time to learn about these tools, as they will be used during the deployment process.

Architecture of a deployed cluster with ClusterFactory

The goal is to deploy these stacks, but it can be hard to grasp how everything is connected at first, so let's start with the basics.

At the beginning, there is a bare-metal Kubernetes Cluster

When setting up a Kubernetes Cluster, there are two types of nodes that can be deployed: the Kubernetes controllers and the Kubernetes workers. The controllers are like the brain of the Kubernetes system and are responsible for managing different aspects of the cluster such as the deployment, scaling, and rollout of applications. On the other hand, the workers are the ones responsible for running the actual workloads and executing the containers.

[Diagram: k0s controller processes]

[Diagram: k0s worker processes]

Usually, it's recommended to have an odd number of controllers for better fault tolerance. However, for simplicity, in this guide, we will combine the controllers and workers. If you want to separate the controllers and workers, it's recommended to follow the K0s control plane high availability guide.
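For reference, one common way to describe such a topology declaratively is a k0sctl configuration file. The sketch below is only illustrative; the cluster name, host address, SSH credentials, and k0s version are placeholder assumptions, not values from this guide.

```yaml
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster                 # placeholder cluster name
spec:
  hosts:
    # A single node acting as both controller and worker (placeholder address and key)
    - role: controller+worker
      ssh:
        address: 10.10.2.11
        user: root
        keyPath: ~/.ssh/id_rsa
  k0s:
    version: 1.27.5+k0s.0          # placeholder k0s version
```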

After setting up the controllers and workers, the Kubernetes API will be available, which you can connect to by fetching the kubeconfig. You can then use tools like Lens and kubectl to manage the cluster.

The Kubernetes workers will have pods (groups of containers) running on them, controlled by either a ReplicaSet/Deployment, a StatefulSet, or a DaemonSet.

If you don't know what these resources are, these are the different types of Kubernetes workloads! To learn more, check out the Kubernetes documentation.
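As a quick illustration of one of these workload types, here is a minimal Deployment sketch; the names and container image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-web                  # placeholder name
spec:
  replicas: 2                      # the underlying ReplicaSet keeps 2 pods running
  selector:
    matchLabels:
      app: hello-web
  template:
    metadata:
      labels:
        app: hello-web
    spec:
      containers:
        - name: web
          image: nginx:1.25        # placeholder container image
          ports:
            - containerPort: 80
```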

Then, we add the network stack for inter-pod and external-to-service communications

On Kubernetes, the smallest deployable unit of workload is a Pod, which is itself a group of containers. Each pod is assigned its own IP address. To facilitate inter-pod and external-to-pod communication, it is necessary to use a Kubernetes Service object.

Why is this necessary? In a Kubernetes cluster, pods are ephemeral and their IP addresses can change frequently. To ensure stable network communication, Kubernetes introduces the concept of a Service. A Service assigns a consistent IP address and DNS name to a group of pods, ensuring continuous connectivity even as pods change. This abstraction eliminates the need for clients to connect directly to individual pods, providing a reliable and scalable solution for intra-cluster communication.

For inter-pod communication, ClusterIP Services are commonly used, which is the default type of Service. To further facilitate communication between pods, Kubernetes automatically creates DNS records for services and pods.
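For example, a minimal ClusterIP Service could look like the sketch below (names are placeholders). Inside the cluster, it would then be reachable at the DNS name hello-web.default.svc.cluster.local.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: hello-web                  # placeholder name
  namespace: default
spec:
  type: ClusterIP                  # default Service type, used for inter-pod communication
  selector:
    app: hello-web                 # targets pods carrying this label
  ports:
    - port: 80                     # stable port exposed by the Service
      targetPort: 80               # port opened by the pods
```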

Therefore, we deploy CoreDNS first. CoreDNS serves as the domain name server within the Kubernetes cluster, providing names to Kubernetes pods and services. It can also act as the main DNS for the compute plane.
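As an illustration, CoreDNS is typically configured through a Corefile stored in a ConfigMap. The sketch below is an assumption of what such a configuration might look like, serving the cluster.local zone and forwarding everything else upstream; it is not the exact configuration shipped with ClusterFactory.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        # Answer queries for Kubernetes services and pods
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        # Forward everything else to the node's upstream resolvers
        forward . /etc/resolv.conf
        cache 30
    }
```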

We also intend to expose Traefik, the primary L7 router and entry point. Instead of a simple ClusterIP Service, we need a Kubernetes LoadBalancer Service, which allows us to attach an external IP to the Service.

To achieve this, we use MetalLB, which advertises that IP to the router so that packets destined for Traefik are routed to the cluster.
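Conceptually, exposing Traefik then boils down to a LoadBalancer Service for which MetalLB announces the external IP. A minimal sketch, assuming a recent MetalLB version; the IP address, namespace, and labels are placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: traefik
  namespace: traefik               # assumed namespace
  annotations:
    # Ask MetalLB for a specific external IP (placeholder; the annotation depends on the MetalLB version)
    metallb.universe.tf/loadBalancerIPs: 192.168.0.100
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: traefik   # assumed label on the Traefik pods
  ports:
    - name: websecure
      port: 443
      targetPort: 443
```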

Lastly, some pods require direct access to a local network. For example, the Grendel provisioning system uses DHCP, an L2 protocol that requires the DHCP server to be on the same network as the client broadcasting a DHCP discovery message. If the DHCP server is not on the same network as the client (e.g., Grendel is inside the Kubernetes network), the broadcast cannot traverse the router and the discovery message never reaches it.

To overcome this, we utilize Multus in combination with a CNI IPVLAN plugin. This setup allows the Grendel pod to be directly connected to the local network, permitting DHCP communication.
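In practice, Multus describes such an attachment with a NetworkAttachmentDefinition. The sketch below uses the IPVLAN CNI plugin with static IPAM; the parent interface, subnet, names, and namespace are placeholder assumptions:

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: provisioning-net           # placeholder name, referenced by the Grendel pod
  namespace: provisioning          # placeholder namespace
spec:
  config: |
    {
      "cniVersion": "0.4.0",
      "type": "ipvlan",
      "master": "eth1",
      "mode": "l2",
      "ipam": {
        "type": "static",
        "addresses": [
          { "address": "192.168.1.10/24" }
        ]
      }
    }
```

A pod opts into this extra interface by referencing the definition through the k8s.v1.cni.cncf.io/networks annotation.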

The network stack is necessary to expose the different workloads of the Kubernetes cluster, whether pod-to-pod, external-to-service, or external-to-pod.

[Diagram: architecture-cf-de-Page-2.drawio]

We add GitOps to ease the continuous deployment and infrastructure automation

The foundation of GitOps starts with ArgoCD. ArgoCD automates the deployment and lifecycle management of applications and configurations in Kubernetes clusters, ensuring they are always in the desired state by leveraging GitOps principles. It synchronizes the desired state defined in Git repositories with the actual state of the target environment, providing a reliable and scalable solution for managing complex deployments, promoting collaboration, and ensuring consistency across multiple clusters and environments.
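With ArgoCD, each component is typically declared as an Application object pointing at a Git repository. A minimal sketch; the repository URL, path, and namespaces are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: traefik
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/my-cluster.git   # placeholder Git repository
    targetRevision: main
    path: argo/traefik                                    # placeholder path to the manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: traefik
  syncPolicy:
    automated:
      prune: true        # delete resources that were removed from Git
      selfHeal: true     # revert manual drift back to the state described in Git
```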

To always follow the GitOps principles, we must also manage our secrets and not store them in plaintext in the Git repository. Therefore, there are two solutions (a minimal cert-manager sketch follows the list):

  • cert-manager: A solution to generate and manage TLS certificates using Kubernetes annotations and objects
  • sealed-secrets: A solution to encrypt secrets, which permits the storage of secrets in the Git repository
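As an illustration, a cert-manager Certificate is itself just another declarative object that can live in Git. In this sketch, the domain name and the referenced ClusterIssuer are placeholder assumptions:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: grid-example-tls
  namespace: traefik
spec:
  secretName: grid-example-tls     # Secret where the signed certificate will be stored
  dnsNames:
    - grid.example.com             # placeholder domain
  issuerRef:
    name: letsencrypt-prod         # assumed ClusterIssuer created beforehand
    kind: ClusterIssuer
```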

With these three solutions, your configuration is:

  • Declarative: the entire system is described declaratively.
  • Versioned and immutable: the configuration is version-controlled, promoting infrastructure-as-code practices and making rollbacks as simple as checking out a commit.
  • Collaborative: instead of separating operators from developers, both can use the collaboration tools around Git, such as pull requests, fostering collaboration among teams.
  • Pulled automatically: any approved change to the Git repository is automatically applied to the system.
  • Continuously reconciled: ArgoCD continuously detects differences between the desired configuration and the actual state of the cluster.

**GitOps is an approach that combines version control, automation, and collaboration to manage configurations and deployments in a declarative and auditable manner, ensuring consistency and scalability in modern software delivery pipelines.**

[Diagram: architecture-cf-de-Page-3.drawio]

We add Grendel to network boot the Compute Nodes

To network boot the compute nodes, we use Grendel, which combines a PXE (Preboot Execution Environment) server, a TFTP (Trivial File Transfer Protocol) server, and a DHCP (Dynamic Host Configuration Protocol) server.

The procedure of a network boot using Grendel is the following:

  1. Client Machine Initialization: The network boot process begins when a client machine, configured for network booting, powers on or restarts. Instead of booting from its local storage, the client sends a DHCP request to the network to obtain an IP address and other necessary network configuration parameters.
  2. DHCP Server Response: Upon receiving the DHCP request, the DHCP server, which is configured with the necessary options for network booting, responds to the client with an IP address, the IP address of the TFTP server, and the location of the initial firmware file to be fetched. In this case, the initial firmware file is the iPXE firmware.
  3. iPXE Firmware Download: The client, armed with the IP address of the TFTP server and the filename of the iPXE firmware, initiates a TFTP (Trivial File Transfer Protocol) request to download the iPXE firmware from the TFTP server. The TFTP server responds by sending the iPXE firmware file to the client.
  4. iPXE Execution: Once the iPXE firmware is successfully downloaded, the client executes the iPXE firmware, which takes control of the network stack of the client machine. Within the iPXE firmware, an iPXE script is executed to fetch the initramfs (initial RAM filesystem) and kernel files required for booting. The script defines the location of the initramfs and kernel files, typically specified as URLs or network paths, and instructs iPXE to download these files.
  5. Initramfs Boot: Once the initramfs and kernel files are successfully downloaded, iPXE transfers control to the initramfs. The initramfs is a compressed file system that contains essential tools and drivers necessary for booting the system.
  6. Dracut Configuration: The initramfs, upon initialization, reads the Dracut configuration. Dracut is a modular initramfs infrastructure used in many Linux distributions. The configuration defines the steps required to boot the system, including the detection and loading of hardware drivers, mounting of the root file system, and execution of necessary scripts or hooks.
  7. Live Image Boot: Based on the Dracut configuration, the initramfs mounts the root file system, which can be a live image of an operating system. The live image allows the client machine to boot into an operating system environment without modifying the local hard drive. The live image then boots up, initializing the operating system and presenting the user with a fully functional environment. The client machine can now be used.
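To tie this procedure back to configuration, Grendel keeps an inventory of the hosts it provisions. The sketch below is a rough, YAML-rendered approximation of such a host entry; the field names follow our understanding of the upstream Grendel host format, and every value is a placeholder.

```yaml
# Illustrative Grendel host entry (Grendel itself usually stores its inventory as JSON).
# All field names and values below are assumptions for illustration only.
name: cn01
provision: true
boot_image: deepsquare-compute     # assumed name of a registered boot image
interfaces:
  - fqdn: cn01.example.com
    mac: "aa:bb:cc:dd:ee:01"
    ip: 192.168.1.101/24
    bmc: false
```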

Because Grendel uses DHCP, we use Multus and the IPVLAN CNI plugin to connect the Grendel pod to the local network.

IPVLAN is a Container Network Interface (CNI) plugin that leverages the IPVLAN technology provided by the Linux kernel. IPVLAN allows the creation of a virtual network interface with its own IP address. In this case, the IPVLAN CNI plugin is used in conjunction with Multus to connect the Grendel pod to the local network.

[Diagram: architecture-cf-de-Page-4.drawio]

Lastly, we deploy the software stack that allows DeepSquare to work

At this point, we have deployed all the solutions needed for the software stack to work. Here is the diagram that links all the components of the software stack:

[Diagram: architecture-cf-de-Page-5.drawio]

This diagram is also valid for a cluster without ClusterFactory.

From left to right:

  1. The LDAP Connector creates DeepSquare users in the 389ds LDAP server.
  2. The Supervisor retrieves jobs from the smart contract and forwards them to a SLURM login.
  3. The SLURM login submits the batch job to the SLURM controller and starts accounting with the SLURM DB, which is connected to a MariaDB.
  4. The SLURM controller transmits the batch job to a compute node via the SLURM daemon and starts running the job.
  5. Job statuses are reported to the Supervisor via the SLURM SPANK plugin on the compute node and the SLURM completion plugin on the SLURM Controller.

Everything is authenticated via 389ds and SSSD (SSSD is also running inside the SLURM login and controller containers).

The CVMFS server is exposed so that the compute nodes can mount the DeepSquare software.

Finally, the ArgoCD dashboard is exposed so that Kubernetes engineers can view the status of the infrastructure.

Using all of these standard tools keeps your infrastructure maintainable and scalable in the long term.

What's next

Now that you know what we are going to deploy, this will be the approach to deployment:

  • Install the prerequisites.
  • Deploy the ClusterFactory core.
  • Deploy most of the control plane services.
  • Deploy the control plane with Grendel.
  • Learn how to maintain ClusterFactory.

Follow the next chapter for the prerequisites.