
Replicated on two nodes with K3S

Overview

Kolab is deployed in a two node K3S cluster, using local storage on the nodes (no shared storage required).

All data is replicated from the primary to the secondary node via DRBD, so it is possible to fail over to the secondary node if the primary is lost.
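For illustration, a DRBD resource tying the two nodes together might look like the following sketch; the resource name, node names, disk device, and addresses are assumptions, not taken from the actual deployment:

```
# Hypothetical DRBD resource definition, e.g. /etc/drbd.d/kolab.res
resource kolab {
  protocol C;              # synchronous replication: a write completes on both nodes
  device    /dev/drbd0;
  disk      /dev/vdb;      # dedicated local disk on each node
  meta-disk internal;
  on kolab1 {
    address 192.168.122.10:7789;
  }
  on kolab2 {
    address 192.168.122.11:7789;
  }
}
```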

All external connections go through HAProxy, with IP failover provided by keepalived, so a failover is transparent to external clients.

Some load balancing can be achieved by running services on both nodes, but to be able to restore service after a node failure, a single node must be able to handle the complete load.
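The IP failover boils down to both nodes running keepalived with a shared virtual IP that HAProxy binds to. A minimal sketch of such a configuration; the interface name, VIP, and password are assumptions for illustration:

```
# Hypothetical keepalived.conf fragment for the shared service IP
vrrp_instance kolab {
    state MASTER            # BACKUP on the secondary node
    interface eth0
    virtual_router_id 51
    priority 150            # use a lower priority (e.g. 100) on the secondary
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme
    }
    virtual_ipaddress {
        192.168.122.100/24  # the VIP external connections are directed at
    }
}
```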

Both MariaDB and HAProxy run outside of k3s in this deployment.

This deployment is more complex in that not all components run in k3s, and workloads need to be explicitly assigned to nodes, but it achieves redundancy and a limited amount of load balancing with minimal resources (only two nodes, using local disks).

Storage

Local disk storage is consumed via local-storage PersistentVolumes, which pins each workload to a specific node.
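A sketch of such a PersistentVolume; the volume name, path, capacity, and node name are assumptions for illustration:

```yaml
# Hypothetical local-storage PersistentVolume pinned to one node
apiVersion: v1
kind: PersistentVolume
metadata:
  name: imap-data
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /var/lib/kolab/imap
  nodeAffinity:               # required for local volumes; pins consuming pods to this node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - kolab1
```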

Failure modes

One of the two nodes can fail without data loss, and service can be restored with manual intervention, assuming there is enough capacity on a single node.

Deployment method

The deployment is managed by:

  • Ansible playbooks (configured via inventory.yaml) to manage the virtualized nodes, which run k3s and other dependencies.
  • A helm chart (configured via values.yaml) to manage the kolab deployment running on k3s.

The ansible playbooks assume that one or more KVM hypervisors are available to run the virtualized nodes, and the following topology is assumed:

  • Control host: A workstation with access to the two hypervisors is used to manage the environment.
  • Hypervisor A:
    • Infrastructure VM 1
    • Worker VM 1
  • Hypervisor B:
    • Infrastructure VM 2
    • Worker VM 2

The ansible playbooks are executed from the control host. Passwordless ssh logins must be configured (for the user executing the ansible scripts) from the control host to the hypervisors for the virtual machine setup.

  • First an inventory.yaml file must be created to describe the deployment.
  • An initial run of the ansible playbook will provision the VMs on the hypervisors, set up haproxy, mariadb and a k3s cluster, and deploy kolab in the cluster.
  • An initial values.yaml will be generated from a template next to the inventory file.
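The inventory is a standard ansible YAML inventory. A minimal sketch of its shape follows; the group and host names are hypothetical, and inventory.example.yaml is the authoritative reference for the variables the playbooks actually expect:

```yaml
# Hypothetical sketch of an ansible YAML inventory for this topology
all:
  children:
    hypervisors:
      hosts:
        hypervisor-a.example.com:
        hypervisor-b.example.com:
  vars:
    local_registry: true   # variable mentioned later in this guide; others omitted
```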

Going forward, the inventory.yaml and values.yaml files should be kept under version control. Changes to the environment can be applied either by executing the full ansible script, or by only redeploying the helm chart (configured via values.yaml).

Setup Instructions

Executed from the control host.

  • Create a directory from which to manage the kolab installation.
  • Download the latest release here: https://mirror.apheleia-it.ch/pub/kolab-kubernetes-latest.tar.gz
  • Extract the tarball.
  • Copy deployments/k3s-replicated/inventory.example.yaml to inventory.yaml
  • Adjust the new inventory.yaml; by default it will deploy everything on localhost.
  • Run the ansible provisioning process: ansible-playbook -v -i inventory.yaml -D deployments/k3s-replicated/ansible/playbook.yaml
  • Navigate to the configured URL (if DNS is already prepared; otherwise via an /etc/hosts entry).
  • Login with the admin@YOURDOMAIN user
  • Future changes can be applied via: ./kolabctl apply

Local registry

It is possible to run a local registry on the primary node, to which locally built images can be pushed.

  • Enable it in inventory.yaml via the local_registry variable.
  • Build images locally with podman:

    build/build.sh
    
  • Push images to the registry, e.g. via podman:

    env PUSH_ARGS="--tls-verify=false" REGISTRY="$PRIMARYIP:5001" build/publish.sh
    
  • Use the local registry for the images in the values.yaml file.
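The exact image keys depend on the kolab helm chart; the following values.yaml fragment is only a hypothetical illustration of pointing an image at the local registry (the IP and tag are placeholders for the primary node's address):

```yaml
# Hypothetical values.yaml fragment referencing the local registry
roundcube:
  image: 192.168.122.10:5001/roundcube:latest
```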

Troubleshooting

    $ kubectl -n kolab logs pod/roundcube-57d5bdfd6d-8hl48
    Defaulted container "roundcube" out of: roundcube, metricsexporter, roundcube-db-wait (init)
    Error from server: Get "https://192.168.122.25:10250/containerLogs/kolab/roundcube-57d5bdfd6d-8hl48/roundcube": proxy error from 127.0.0.1:6443 while dialing 192.168.122.25:10250, code 502: 502 Bad Gateway

This seems to happen when connecting to e.g. kolab1 while the pod is running on kolab2.

There's an error like this on kolab1:

    Jan 02 17:35:49 kolab1 k3s[102505]: time="2025-01-02T17:35:49+01:00" level=error msg="Sending HTTP/1.1 502 response to 127.0.0.1:46022: failed to find Session for client kolab2"
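As a workaround, one can check which node the pod is actually scheduled on and then query the API server on that node directly; this sketch uses standard kubectl flags, with the node name kolab2 as an assumption:

```shell
# Show which node the pod is running on (the NODE column)
kubectl -n kolab get pod roundcube-57d5bdfd6d-8hl48 -o wide

# Then fetch the logs via the k3s API server on that node
ssh kolab2 kubectl -n kolab logs pod/roundcube-57d5bdfd6d-8hl48
```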

MariaDB Galera recovery

All nodes down

When all nodes have been shut down, it is necessary to bootstrap a new cluster. In this scenario the node with the bootstrap flag will be used to bootstrap the cluster. If the nodes could be out of sync, mysqld --wsrep-recover should be run on both nodes to figure out which node is ahead; that node should then be used to bootstrap the cluster (set the bootstrap flag accordingly in inventory.yaml).
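The recovery steps above can be sketched as follows; galera_new_cluster is the standard systemd helper shipped with MariaDB Galera, but the exact commands may vary by distribution:

```shell
# On both nodes: recover the last committed transaction position.
# The log contains a line "WSREP: Recovered position: <uuid>:<seqno>";
# the node with the highest seqno is the most up to date.
mysqld --wsrep-recover

# On the most up-to-date node (the one given the bootstrap flag):
galera_new_cluster

# On the other node: join the freshly bootstrapped cluster
systemctl start mariadb
```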

One node still active

If one node is still up, it remains operational, and the failed node can be rejoined to the cluster by starting the mariadb service.