
Hands-on Kubernetes Operator Development: Reconcile loop


In the first part of our series, we introduced the concept of Kubernetes Operators and walked through setting up a new project using Kubebuilder. We ended by defining the structure of our custom resource, the Tenant, and discussed how this resource will be used to manage multi-tenant environments in a Kubernetes cluster.

In this second part, we're diving deep into the core of our operator - the reconciliation loop.

What is the Reconciliation Loop?

At the heart of every operator is the reconciliation loop. This is a function that observes the current state of the system and compares it to the desired state, as defined by our Tenant custom resource. If the current and desired states differ, the reconcile function makes the necessary changes to bring the system to its desired state.

Kubernetes Reconciliation Loop Sequence Diagram

The reconciliation loop is called every time a watch event is triggered for the operator's primary resources, in our case, the Tenant custom resources.

To react to changes to our CRD, the operator watches create, update, and delete events on Tenant objects. Whenever a Tenant is created, updated, or deleted in the Kubernetes cluster, an event fires and our operator's Reconciler is triggered. Kubebuilder already takes care of this wiring (in internal/controller/tenant_controller.go):

// SetupWithManager sets up the controller with the Manager.
func (r *TenantReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&multitenancyv1.Tenant{}).
		Complete(r)
}

The .For(&multitenancyv1.Tenant{}) call specifies the resource type to watch, telling the controller to trigger reconciliation whenever a Tenant resource changes.

Understanding the Reconcile function

In our project, the main reconciler is represented by the TenantReconciler struct. The Client field in this struct is used to read and write Kubernetes objects, and the Scheme field is used to convert between different API versions.

type TenantReconciler struct {
	client.Client
	Scheme *runtime.Scheme
}

The Reconcile function is what the Operator will execute whenever a Tenant object changes in the Kubernetes cluster. This function is where we define how our Operator should react to these events and take corrective measures to ensure the actual state matches the desired state defined in the Tenant object.

The Reconcile method has the following signature:

// +kubebuilder:rbac:groups=multitenancy.codereliant.io,resources=*,verbs=*
// +kubebuilder:rbac:groups="",resources=namespaces,verbs=*
// +kubebuilder:rbac:groups=rbac.authorization.k8s.io,resources=*,verbs=*
func (r *TenantReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
  // reconcile implementation
}

Let's dissect the components of this function:

  • // +kubebuilder:rbac...: markers for the kubebuilder tool, used to generate the RBAC rules the operator needs in order to function (they are turned into RBAC manifests when you run make manifests). In our case we need access to manage Tenants, Namespaces, and RoleBindings.
  • ctx context.Context: The first parameter is a context, which is commonly used in Go to control the execution of functions that might take some time to complete. This can be used to handle timeouts or cancel long-running tasks.
  • req ctrl.Request: The request object, which holds the NamespacedName (name and namespace) of the object that triggered the reconciliation.
  • ctrl.Result: The result that the function returns. It can be used to specify that the request should be requeued and executed again after some time, which is useful when not all conditions for a state transition can be met in a single execution (see the sketch after this list).

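For example, if not all conditions for a tenant could be satisfied yet, Reconcile could ask to be retried later instead of returning an error. This is just an illustrative fragment (it would live inside Reconcile and assumes the time package is imported); our implementation below simply returns an empty result:

// Illustrative only: ask controller-runtime to call Reconcile for this request
// again after 5 minutes, even if no watch event fires in the meantime.
// Returning ctrl.Result{Requeue: true} would requeue immediately instead,
// and returning a non-nil error requeues with exponential backoff.
return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
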
It's important to remember that the reconciliation function must be idempotent: it can be called multiple times for the same resource and should produce the same result each time. It also needs to handle all edge cases that might occur and recover from possible errors.

Reconciliation implementation

As we discussed in the previous post, our controller will create Namespaces and RoleBindings based on the Tenant spec. So let's start by implementing the Reconcile function:

func (r *TenantReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := log.FromContext(ctx)

	tenant := &multitenancyv1.Tenant{}

	log.Info("Reconciling tenant")

	// Fetch the Tenant instance
	if err := r.Get(ctx, req.NamespacedName, tenant); err != nil {
		// Tenant object not found, it might have been deleted
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Loop through each namespace defined in the Tenant Spec
	// Ensure the namespace exists, and if not, create it
	// Then ensure RoleBindings for each namespace
	for _, ns := range tenant.Spec.Namespaces {
		log.Info("Ensuring Namespace", "namespace", ns)
		if err := r.ensureNamespace(ctx, tenant, ns); err != nil {
			log.Error(err, "unable to ensure Namespace", "namespace", ns)
			return ctrl.Result{}, err
		}

		log.Info("Ensuring Admin RoleBinding", "namespace", ns)
		if err := r.EnsureRoleBinding(ctx, ns, tenant.Spec.AdminGroups, "admin"); err != nil {
			log.Error(err, "unable to ensure Admin RoleBinding", "namespace", ns)
			return ctrl.Result{}, err
		}

		if err := r.EnsureRoleBinding(ctx, ns, tenant.Spec.UserGroups, "edit"); err != nil {
			log.Error(err, "unable to ensure User RoleBinding", "namespace", ns)
			return ctrl.Result{}, err
		}
	}

	// Update the Tenant status with the current state
	tenant.Status.NamespaceCount = len(tenant.Spec.Namespaces)
	tenant.Status.AdminEmail = tenant.Spec.AdminEmail
	if err := r.Status().Update(ctx, tenant); err != nil {
		log.Error(err, "unable to update Tenant status")
		return ctrl.Result{}, err
	}

	return ctrl.Result{}, nil
}

This function attempts to fetch the Tenant instance and, if it exists, loops over each namespace defined in the Tenant spec, ensuring that the namespace and its corresponding RoleBindings exist and creating them if they do not. Once that's done, it updates the Tenant status with the namespace count and admin email, mirroring the current state of the resource in the cluster. If an error occurs at any of these steps, the function logs it and returns the error, which causes controller-runtime to requeue the request.

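As a reminder, this code relies on the Tenant spec and status fields defined in part one. They look roughly like this (field names are taken from the code above; the json tags are assumptions):

// Sketch of the Tenant API types used by the Reconcile code above.
type TenantSpec struct {
	AdminEmail  string   `json:"adminEmail,omitempty"`
	AdminGroups []string `json:"adminGroups,omitempty"`
	UserGroups  []string `json:"userGroups,omitempty"`
	Namespaces  []string `json:"namespaces,omitempty"`
}

type TenantStatus struct {
	NamespaceCount int    `json:"namespaceCount,omitempty"`
	AdminEmail     string `json:"adminEmail,omitempty"`
}
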
Now let's implement the corresponding "ensure" functions, starting with ensureNamespace:

const (
	tenantOperatorAnnotation = "tenant-operator"
)

func (r *TenantReconciler) ensureNamespace(ctx context.Context, tenant *multitenancyv1.Tenant, namespaceName string) error {
	log := log.FromContext(ctx)

	// Define a namespace object
	namespace := &corev1.Namespace{}

	// Attempt to get the namespace with the provided name
	err := r.Get(ctx, client.ObjectKey{Name: namespaceName}, namespace)
	if err != nil {
		// If the namespace doesn't exist, create it
		if apierrors.IsNotFound(err) {
			log.Info("Creating Namespace", "namespace", namespaceName)
			namespace := &corev1.Namespace{
				ObjectMeta: metav1.ObjectMeta{
					Name: namespaceName,
					Annotations: map[string]string{
						"adminEmail": tenant.Spec.AdminEmail,
						"managed-by": tenantOperatorAnnotation,
					},
				},
			}

			// Attempt to create the namespace
			if err = r.Create(ctx, namespace); err != nil {
				return err
			}
		} else {
			return err
		}
	} else {
		// If the namespace already exists, check for required annotations
		log.Info("Namespace already exists", "namespace", namespaceName)

		// Logic for checking annotations
	}

	return nil
}

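The annotation check itself is elided above. One possible version (an assumption, not necessarily the exact code from the repo) reconciles drifted annotations back to their desired values:

// Possible sketch of the "check annotations" branch: ensure the expected
// annotations are present and correct, and update the Namespace if not.
desired := map[string]string{
	"adminEmail": tenant.Spec.AdminEmail,
	"managed-by": tenantOperatorAnnotation,
}

if namespace.Annotations == nil {
	namespace.Annotations = map[string]string{}
}

changed := false
for key, value := range desired {
	if namespace.Annotations[key] != value {
		namespace.Annotations[key] = value
		changed = true
	}
}

if changed {
	log.Info("Updating Namespace annotations", "namespace", namespaceName)
	if err := r.Update(ctx, namespace); err != nil {
		return err
	}
}
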
Similarly, the RoleBinding management function has the following signature (a possible implementation is sketched below):

func (r *TenantReconciler) EnsureRoleBinding(ctx context.Context, namespaceName string, groups []string, clusterRoleName string) error {
  // roleBinding management implementation
}
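
The full implementation is left out for brevity, but a minimal sketch could look like the following. The RoleBinding naming convention and the managed-by annotation are assumptions (based on the kubectl output further down), and it relies on the rbacv1 (k8s.io/api/rbac/v1) and fmt imports:

func (r *TenantReconciler) EnsureRoleBinding(ctx context.Context, namespaceName string, groups []string, clusterRoleName string) error {
	log := log.FromContext(ctx)

	// Deterministic name, e.g. "tenant-sample-ns1-admin-rb".
	rbName := fmt.Sprintf("%s-%s-rb", namespaceName, clusterRoleName)

	// If the RoleBinding already exists, there is nothing to do in this simple sketch.
	roleBinding := &rbacv1.RoleBinding{}
	err := r.Get(ctx, client.ObjectKey{Name: rbName, Namespace: namespaceName}, roleBinding)
	if err == nil {
		return nil
	}
	if !apierrors.IsNotFound(err) {
		return err
	}

	// Build one Group subject per entry from the Tenant spec.
	subjects := make([]rbacv1.Subject, 0, len(groups))
	for _, group := range groups {
		subjects = append(subjects, rbacv1.Subject{
			Kind:     rbacv1.GroupKind,
			APIGroup: rbacv1.GroupName,
			Name:     group,
		})
	}

	log.Info("Creating RoleBinding", "namespace", namespaceName, "roleBinding", rbName)
	return r.Create(ctx, &rbacv1.RoleBinding{
		ObjectMeta: metav1.ObjectMeta{
			Name:      rbName,
			Namespace: namespaceName,
			Annotations: map[string]string{
				"managed-by": tenantOperatorAnnotation,
			},
		},
		Subjects: subjects,
		RoleRef: rbacv1.RoleRef{
			APIGroup: rbacv1.GroupName,
			Kind:     "ClusterRole",
			Name:     clusterRoleName,
		},
	})
}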

Verifying the Operator

Now that we're done with the first-pass implementation, let's test it out! With Kubebuilder and kind it's as simple as executing make run:

$ make run
go fmt ./...
go vet ./...
go run ./cmd/main.go
2023-07-06T20:21:45-07:00	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": ":8080"}
2023-07-06T20:21:45-07:00	INFO	setup	starting manager

And the controller is up and running in your local environment. By checking the logs we can see that it's already doing some work, based on the sample Tenant custom resource we created earlier:

2023-07-06T20:21:45-07:00	INFO	Reconciling tenant	{"controller": "tenant", "controllerGroup": "multitenancy.codereliant.io", "controllerKind": "Tenant", "Tenant": {"name":"tenant-sample"}, "namespace": "", "name": "tenant-sample", "reconcileID": "b6b1864b-7b6e-46ab-baa1-e3e4bc741f67"}
2023-07-06T20:21:45-07:00	INFO	Ensuring Namespace	{"controller": "tenant", "controllerGroup": "multitenancy.codereliant.io", "controllerKind": "Tenant", "Tenant": {"name":"tenant-sample"}, "namespace": "", "name": "tenant-sample", "reconcileID": "b6b1864b-7b6e-46ab-baa1-e3e4bc741f67", "namespace": "tenant-sample-ns1"}
2023-07-06T20:21:45-07:00	INFO	Ensuring Admin RoleBinding	{"controller": "tenant", "controllerGroup": "multitenancy.codereliant.io", "controllerKind": "Tenant", "Tenant": {"name":"tenant-sample"}, "namespace": "", "name": "tenant-sample", "reconcileID": "b6b1864b-7b6e-46ab-baa1-e3e4bc741f67", "namespace": "tenant-sample-ns1"}

We can confirm that resources were successfully created:

$ kubectl get namespaces
...
tenant-sample-ns1    Active   28s
tenant-sample-ns2    Active   28s
tenant-sample-ns3    Active   28s

$ kubectl get rolebinding -n tenant-sample-ns1
NAME                         ROLE                AGE
tenant-sample-ns1-admin-rb   ClusterRole/admin   28s
tenant-sample-ns1-edit-rb    ClusterRole/edit    28s

$ kubectl get tenants
NAME            EMAIL                  NAMESPACECOUNT
tenant-sample   admin@yourdomain.com   3
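
Incidentally, the EMAIL and NAMESPACECOUNT columns in the Tenant output come from kubebuilder printcolumn markers on the Tenant type from part one. They would look roughly like this (the JSONPaths are assumptions based on the json tags; the markers live next to the existing ones above the Tenant type, typically in api/v1/tenant_types.go):

// +kubebuilder:printcolumn:name="Email",type="string",JSONPath=".spec.adminEmail"
// +kubebuilder:printcolumn:name="NamespaceCount",type="integer",JSONPath=".status.namespaceCount"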

Congrats, your controller is up and running!

Wrapping Up

That concludes the second part of our series on creating a Kubernetes operator from scratch.

But the journey doesn't stop here. There are plenty of aspects we will explore to enhance our operator's functionality and reliability. In the next post we'll talk about cleaning up the resources once the custom resource is deleted.

The code discussed in this blog series can be found in this GitHub repo.