test Rook / Ceph for better HA #85

Open
opened 2025-12-08 12:53:13 +00:00 by viceice · 13 comments

Currently we use local / NFS storage, which is replicated between hetzner05 and hetzner06.
There are manual steps to move that to a different node.

We should try Rook[^1] / Ceph on our three-node cluster and check whether it has enough performance and resilience.

[^1]: https://rook.io

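As a sketch of the first step, the Rook operator can be installed from its Helm chart (repo and chart names per the Rook docs; not validated against this cluster):

```sh
# Sketch: install the Rook operator into its own namespace.
helm repo add rook-release https://charts.rook.io/release
helm install --create-namespace --namespace rook-ceph \
  rook-ceph rook-release/rook-ceph
```

The operator alone does nothing until a `CephCluster` resource is created, so this step carries little risk by itself.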
Author

we can use `/dev/nvme1n1p5` as the storage volume; it's only used on hetzner06 for backups, which can be moved to `hetzner03:/dev/nvme0n1` (which is totally unused)

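A minimal `CephCluster` manifest targeting those partitions could look roughly like this; the node/device mapping follows the discussion above, everything else is an illustrative sketch, not a validated configuration:

```sh
# Sketch: point Rook at the proposed partitions only, nothing else.
kubectl apply -f - <<'EOF'
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
  storage:
    useAllNodes: false
    useAllDevices: false
    nodes:
      - name: hetzner05
        devices:
          - name: nvme1n1p5
      - name: hetzner06
        devices:
          - name: nvme1n1p5
      - name: hetzner03
        devices:
          - name: nvme0n1
EOF
```

`useAllDevices: false` keeps Rook from claiming anything beyond the listed partitions, which matters on machines that are otherwise in production.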
Author

Another alternative is a three-node `glusterfs` on `xfs`. We use it at work and it's pretty stable.
This would also allow a node to fail without much interruption.
All nodes would be equal and we could use all nodes for scheduling.

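Setting up such a volume is only a handful of commands with the gluster CLI; hostnames follow the thread, the brick paths are hypothetical:

```sh
# Sketch: replica-3 gluster volume across the three nodes,
# run from hetzner03 after glusterfs-server is installed everywhere.
gluster peer probe hetzner05
gluster peer probe hetzner06
gluster volume create gv0 replica 3 \
  hetzner03:/data/brick1 hetzner05:/data/brick1 hetzner06:/data/brick1
gluster volume start gv0
```

With `replica 3` every file exists on all three nodes, which is what allows one node to fail without interruption.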

I agree it should be tested. But testing anything on hardware that is in production does not strike me as prudent.

Author

@earl-warren wrote in https://invisible.forgejo.org/infrastructure/k8s-cluster/issues/662#issuecomment-5829:

> I agree it should be tested. But testing anything on hardware that is in production does not strike me as prudent.

the good thing is that we can test it without interrupting running services. we can simply use a copy of forgejo-code to test performance

I'll be much more comfortable once we have such a replicated system, because the downtime of a node will only cause a few seconds of service outage (after an admin reacts, of course)


@viceice wrote in https://invisible.forgejo.org/infrastructure/k8s-cluster/issues/662#issuecomment-5831:

> the good thing is that we can test it without interrupting running services. we can simply use a copy of forgejo-code to test performance

What worries me most is what needs to be installed on the existing machines and the potential impact on the cluster. It involves kernel modules, new services and they all require a specific network setup.

> I'll be much more comfortable once we have such a replicated system, because the downtime of a node will only cause a few seconds of service outage (after an admin reacts, of course)

Me too. As much as I like the robustness and reliability of the current system, the need for manual intervention when storage fails makes it more difficult to deal with in times of crisis.

Here is an idea.

Three new machines are rented to prepare for this with zero risk for the current cluster. They can be set up as three entirely new nodes, using an entirely different storage network (i.e. not the current VLAN dedicated to DRBD/NFS, but two new ones for the client side and the backend side). It will need different firewall rules too. The other VLAN, dedicated to k8s, will be shared between all nodes. This is a level of compartmentalization that sounds very secure, even during the early stages of experimentation.

![image](/attachments/821f81e8-9857-42f1-a896-f3a9c625f154)

When these new nodes are operating in a way that feels reliable and fast enough (can't be slower than NFS over DRBD over two countries, that's a given 😁), deployments can be migrated to it, gradually. With the option to move the deployment back to the other nodes should something go sideways for some reason.

Assuming this can be completed over three months of calendar time, the cost would be €450, plus an estimated time of between 100h / ten days and 200h / twenty days of work. You work a lot faster than I do, so maybe less if you are the one doing most of the work. In any case I don't believe this can fit in less than 50h / five days of work between installation, documentation, experimentation, mistakes, getting used to new problems, bumping into dead ends, monitoring, stabilizing, finalizing, and decommissioning the old nodes. So even in the most optimistic case, at a rate of €60 / hour, that's €3,000 of human labor, and more likely around €6,000 altogether.
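Plugging the figures above into a quick back-of-the-envelope check (the 100h "likely" figure is an assumption consistent with the stated €6,000):

```sh
# Back-of-the-envelope check of the estimate above.
RATE=60          # € per hour
SERVERS=450      # € for three extra machines over three months
OPTIMISTIC=50    # hours of work, best case
LIKELY=100       # hours of work, more realistic (assumed)

echo "optimistic: $(( OPTIMISTIC * RATE ))€ labor + ${SERVERS}€ servers = $(( OPTIMISTIC * RATE + SERVERS ))€"
echo "likely:     $(( LIKELY * RATE ))€ labor + ${SERVERS}€ servers = $(( LIKELY * RATE + SERVERS ))€"
```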

I can cover the costs of these additional nodes for the period. However I cannot spend the time to lead this effort: it is too much given my availability and what I am set to do before I go on sabbatical on 1 January 2026. But I can certainly be in the passenger seat.

If you are willing to work on that sooner rather than wait for 2027, I think choosing the technology stack you are most familiar with (glusterfs) is the most sensible choice. It makes a world of difference if you are familiar with how that can be set up and repaired.

Author

rook / ceph would not be installed on the host. they would run as pods[^1] inside the cluster, so not many headaches to think about 😉

glusterfs would need some system packages[^2]. the daemons run in userspace, so no kernel issues. It's primarily an overlay filesystem over multiple normal ext4 / xfs file systems, so I also see no big deal there.

We can of course build the glusterfs on 3 new servers and then add those as new k3s nodes when they're running.
We can mask them with custom taints and transfer some workloads to validate.
Benefit of extending: all manual k8s secrets are already there.
When all runs fine, we can simply move the floating IPs and decommission the old servers.

[^1]: https://rook.io/docs/rook/latest-release/Helm-Charts/operator-chart
[^2]: https://packages.debian.org/bookworm/glusterfs-server

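The taint-based masking could be sketched like this; the node names and the taint key are hypothetical:

```sh
# Sketch: keep existing workloads off the new nodes until explicitly
# tolerated. Node names and the "storage-test" key are hypothetical.
kubectl taint nodes hetzner07 storage-test=true:NoSchedule
kubectl taint nodes hetzner08 storage-test=true:NoSchedule
kubectl taint nodes hetzner09 storage-test=true:NoSchedule

# A workload chosen for validation then opts in with a matching
# toleration in its pod spec:
#   tolerations:
#     - key: storage-test
#       operator: Equal
#       value: "true"
#       effect: NoSchedule
```

Removing the taints later (`kubectl taint nodes hetzner07 storage-test-`) opens the nodes up for general scheduling.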
Author

if we fully separate them, then we (me) should do #242 first


@viceice wrote in https://invisible.forgejo.org/infrastructure/k8s-cluster/issues/662#issuecomment-5850:

> rook / ceph would not be installed on the host. they would run as pods inside the cluster, so not much headaches to think about 😉

Interesting! I did not know it was even possible. In that case my concerns do not apply.


The way I understand it, this will still need to load the ceph/rbd kernel module to mount the [ceph block device](https://docs.ceph.com/en/reef/rbd/) that a Ceph-backed pod will use. However, given that those drivers have been stable and in the kernel for the past ten years or so, I think it is not too much of a risk.

I feel very good about this 😁

P.S. Codeberg also has some Ceph experience which may help figure out the gory details.

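Whether a node can serve RBD volumes can be checked up front, before any workload depends on it (run on the host, needs root):

```sh
# Check that the rbd kernel module is available and loads cleanly.
sudo modprobe rbd
lsmod | grep -w rbd
```

If `modprobe` succeeds, the CSI driver's mounts should work; on Debian stable the module ships with the default kernel.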

The partitions currently mounted on /srv can be emptied and used for Ceph. Upgrade backups can be sent to hetzner01, which has a **lot** of free space and a connection fast enough that it does not make much of a difference compared to local storage (in the context of recovering files, that is).

Author

@earl-warren wrote in https://invisible.forgejo.org/infrastructure/k8s-cluster/issues/662#issuecomment-5855:

> The partitions currently mounted on /srv can be emptied and used for Ceph. Upgrade backups can be sent to hetzner01 which has a **lot** of free space and has a connection fast enough that it does not make a lot of difference compared to local storage. In the context of recovering files that is.

The second disk on hetzner03 is also not used.

Author

If ceph works fine, I plan to use the second disk of each server as additional storage, for hopefully better performance.

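A spare disk like hetzner03's `nvme0n1` must be free of old partition tables and filesystem signatures before Ceph will consume it; a common (destructive!) preparation sketch:

```sh
# DESTRUCTIVE: wipes all data on the device.
# Device name per the discussion above (unused second disk).
DISK=/dev/nvme0n1
sudo wipefs --all "$DISK"        # remove filesystem signatures
sudo sgdisk --zap-all "$DISK"    # clear GPT/MBR structures
sudo blkdiscard "$DISK" || true  # optional: TRIM the whole SSD
```

Double-check with `lsblk` that the device is the intended one before running this on a production host.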

## hetzner05

```sh
debian@hetzner05:~$ sudo lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
nvme0n1     259:0    0 476.9G  0 disk 
`-drbd0     147:0    0 476.9G  0 disk /var/lib/kubelet/pods/8313d01d-92e1-4377-8179-dc23ccb71abf/volumes/kubernetes.io~local-volume/forgejo-next-local-v12
                                      /var/lib/kubelet/pods/e410cfb2-99e1-4af8-a4f0-5504740b2962/volumes/kubernetes.io~local-volume/forgejo-next-local-v13
                                      /var/lib/kubelet/pods/1f782280-fe29-43bf-ae88-4b9f641d16cf/volumes/kubernetes.io~local-volume/forgejo-code-local
                                      /var/lib/kubelet/pods/a6e6a387-74a9-4e25-aa1c-42100966e1b9/volumes/kubernetes.io~local-volume/forgefriends-local-a
                                      /var/lib/kubelet/pods/49c843bd-9433-4877-a77d-932da29a467a/volumes/kubernetes.io~local-volume/forgejo-next-local-v11
                                      /var/lib/kubelet/pods/efaae6ee-8031-4f71-b4a0-90b8745cee30/volumes/kubernetes.io~local-volume/forgejo-code-local-invisible
                                      /precious
nvme1n1     259:1    0 476.9G  0 disk 
|-nvme1n1p1 259:2    0   256M  0 part /boot/efi
|-nvme1n1p2 259:3    0    32G  0 part [SWAP]
|-nvme1n1p3 259:4    0     1G  0 part /boot
|-nvme1n1p4 259:5    0   100G  0 part /var/lib/kubelet/pods/8313d01d-92e1-4377-8179-dc23ccb71abf/volume-subpaths/anubis-bot-policy/anubis/1
|                                     /var/lib/kubelet/pods/8313d01d-92e1-4377-8179-dc23ccb71abf/volume-subpaths/anubis-bot-policy/anubis/0
|                                     /var/lib/kubelet/pods/e410cfb2-99e1-4af8-a4f0-5504740b2962/volume-subpaths/anubis-bot-policy/anubis/0
|                                     /var/lib/kubelet/pods/1f782280-fe29-43bf-ae88-4b9f641d16cf/volume-subpaths/anubis-bot-policy/anubis/0
|                                     /var/lib/kubelet/pods/1f782280-fe29-43bf-ae88-4b9f641d16cf/volume-subpaths/sshd-config/forgejo/2
|                                     /var/lib/kubelet/pods/49c843bd-9433-4877-a77d-932da29a467a/volume-subpaths/anubis-bot-policy/anubis/1
|                                     /var/lib/kubelet/pods/49c843bd-9433-4877-a77d-932da29a467a/volume-subpaths/anubis-bot-policy/anubis/0
|                                     /var/lib/kubelet/pods/d3c72246-43ca-4830-ac0c-cd5e73da07d9/volume-subpaths/anubis-bot-policy/anubis/1
|                                     /var/lib/kubelet/pods/d3c72246-43ca-4830-ac0c-cd5e73da07d9/volume-subpaths/anubis-bot-policy/anubis/0
|                                     /var/lib/kubelet/pods/79aea2df-9036-4418-b492-322e3fd7056e/volume-subpaths/empty-dir/prometheus/3
|                                     /var/lib/kubelet/pods/2193b580-9230-44c5-a1c7-b09742cf9f49/volume-subpaths/empty-dir/prometheus-operator/0
|                                     /var/lib/kubelet/pods/6c2e63a7-22d5-4fda-84a0-5bca589cc215/volume-subpaths/empty-dir/blackbox-exporter/1
|                                     /var/lib/kubelet/pods/4a7b92fc-87d5-4c34-86c4-726bf954f866/volume-subpaths/empty-dir/kube-state-metrics/0
|                                     /var/lib/kubelet/pods/5b62b734-c568-4532-b94b-62ed0f36e7b8/volume-subpaths/empty-dir/node-exporter/0
|                                     /
`-nvme1n1p5 259:6    0 343.7G  0 part /srv
```

## hetzner06

```sh
debian@hetzner06:~$ sudo lsblk 
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
nvme1n1     259:0    0 476.9G  0 disk 
|-nvme1n1p1 259:1    0   256M  0 part /boot/efi
|-nvme1n1p2 259:2    0    32G  0 part [SWAP]
|-nvme1n1p3 259:3    0     1G  0 part /boot
|-nvme1n1p4 259:4    0   100G  0 part /var/lib/kubelet/pods/9a95db42-79c0-483a-aea5-ced93a98f579/volume-subpaths/empty-dir/node-exporter/0
|                                     /
`-nvme1n1p5 259:5    0 343.7G  0 part /srv
nvme0n1     259:6    0 476.9G  0 disk 
`-drbd0     147:0    0 476.9G  1 disk 
```

## hetzner03

```sh
debian@hetzner03:~$ sudo lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
nvme0n1     259:0    0 476.9G  0 disk 
nvme1n1     259:1    0 476.9G  0 disk 
|-nvme1n1p1 259:2    0   256M  0 part /boot/efi
|-nvme1n1p2 259:3    0    32G  0 part [SWAP]
|-nvme1n1p3 259:4    0     1G  0 part /boot
|-nvme1n1p4 259:5    0   100G  0 part /var/lib/kubelet/pods/31561828-95f6-49cc-b137-54ad3203738e/volume-subpaths/empty-dir/node-exporter/0
|                                     /
`-nvme1n1p5 259:6    0 343.7G  0 part 
  `-drbd0   147:0    0 343.7G  1 disk 
```