Planned disaster recovery exercise - 24 July 2025 #81

Closed
opened 2025-12-08 12:53:04 +00:00 by earl-warren · 2 comments

Warning

do not switch invisible to NFS, it [makes it much too slow](https://invisible.forgejo.org/infrastructure/k8s-cluster/issues/664). Let it be shut down and re-scheduled with local storage instead.


  • announce https://codeberg.org/forgejo/discussions/issues/374
  • sudo apt update && sudo apt upgrade to get the latest kernel on hetzner05 and hetzner06 => linux-image-amd64 (6.1.140-1)
  • full backup out of the cluster of /precious with sudo rsync --inplace --progress --delete -HSzva root@hetzner06.forgejo.org:/precious/ /srv/precious/
  • upgrade the hetzner05 kernel (see the node-upgrade sketch after this list)
    • kubectl drain --ignore-daemonsets --delete-emptydir-data hetzner05
    • reboot hetzner05
    • verify the newer kernel is installed
    • kubectl uncordon hetzner05
    • kubectl rollout restart deployment/traefik -n kube-system # needed because externalTrafficPolicy: Local requires that a traefik pod run on the node that has the failover IP
  • move the failover IP (./k8s-maintenance.md#routing-the-failover-ip; see the Robot API sketch after this list)
    • on https://robot.hetzner.com/server select hetzner05, which is the machine owning the failover IP
    • click on the arrows next to the failover IP (v4 & v6) to see "Failover configuration for 2a01:4f8:fff2:48::/64" or "Failover configuration for 188.40.16.47/32"
    • in "New routing target" select hetzner05 to replace hetzner06
    • click on "Set routing target"
    • check k3s for failures: https://hl.forgejo.org/c/main
    • wait 15 minutes
    • check that https://kuma.forgejo.ovh/dashboard is all good
  • move some pods to NFS
    • switch all next.* to using NFS (https://invisible.forgejo.org/infrastructure/k8s-cluster/pulls/654) and forgefriends (https://invisible.forgejo.org/infrastructure/k8s-cluster/pulls/658)
    • switch invisible to NFS (https://invisible.forgejo.org/infrastructure/k8s-cluster/pulls/656) # if it does not come back, check out "When invisible.forgejo.org is down" in ./k8s-maintenance.md#when-invisible-forgejo-org-is-down
    • ~~switch code to NFS (https://invisible.forgejo.org/infrastructure/k8s-cluster/pulls/657)~~ not done, because it may be as good as a shutdown due to slowness that is not yet understood (https://invisible.forgejo.org/infrastructure/k8s-cluster/issues/664)
  • move all pods out of hetzner06
    • kubectl label node --all forgejo.org/drbd- # PVC with an affinity to drbd-primary will not find a node, meaning code.forgejo.org will be stopped
    • kubectl drain --ignore-daemonsets --delete-emptydir-data hetzner06
  • switch DRBD primary to hetzner05 (./k8s-maintenance.md#manual-boot-operations) as if hetzner06 crashed and burned (see the sync-wait sketch after this list)
    • hetzner06
      • sudo systemctl stop nfs-server # gracefully (hopefully) terminate NFS I/O to hetzner05
      • sudo ip addr del 10.53.101.100/24 dev enp5s0.4001 # remove NFS server IP
      • sudo umount /precious
      • sudo drbdadm secondary r1 # Switch the DRBD to secondary
      • sudo drbdadm status
    • hetzner05
      • sudo drbdadm primary r1 # Switch the DRBD to primary
      • sudo drbdadm status # wait until it is in sync
      • sudo mount /precious # DRBD volume shared via NFS
      • sudo ip addr add 10.53.101.100/24 dev enp5s0.4001 # add NFS server IP
  • start all instances that rely on local storage
    • kubectl label node hetzner05 forgejo.org/drbd=primary # hetzner05 is where local storage can be found
  • upgrade the hetzner06 kernel
    • reboot hetzner06
    • verify the newer kernel is installed
  • allow hetzner06 to run pods and receive failover IPs
    • kubectl uncordon hetzner06
    • kubectl rollout restart deployment/traefik -n kube-system # needed because externalTrafficPolicy: Local requires that a traefik pod run on the node that has the failover IP
  • revert back to using local storage where possible, to improve performance
    • revert invisible (https://invisible.forgejo.org/infrastructure/k8s-cluster/pulls/656)
    • revert next (https://invisible.forgejo.org/infrastructure/k8s-cluster/pulls/654) and forgefriends (https://invisible.forgejo.org/infrastructure/k8s-cluster/pulls/658)
  • verify all is as it should be
    • https://hl.forgejo.org/c/main/ is healthy
    • https://kuma.forgejo.ovh/dashboard is all good
  • add this scenario to the documentation
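
For future runs, the drain / upgrade / uncordon cycle above can be collapsed into one parameterized sequence. A minimal node-upgrade sketch, assuming kubectl access from an admin host and root SSH to the nodes (the `ssh root@...` form and the apt `-y` flag are assumptions; the individual commands are the ones from the checklist):

```sh
NODE=hetzner05  # or hetzner06

# evict pods, then upgrade and reboot the node
kubectl drain --ignore-daemonsets --delete-emptydir-data "$NODE"
ssh "root@$NODE.forgejo.org" 'apt update && apt upgrade -y && reboot'

# once the node is back, confirm the new kernel is the one running
ssh "root@$NODE.forgejo.org" uname -r

# let pods schedule again and re-spread traefik: externalTrafficPolicy: Local
# requires that a traefik pod run on the node that has the failover IP
kubectl uncordon "$NODE"
kubectl rollout restart deployment/traefik -n kube-system
```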
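The failover IP move above goes through the Robot web UI. The same change can in principle be scripted against the Hetzner Robot webservice; a hedged Robot API sketch, where ROBOT_USER/ROBOT_PASS and the hetzner05 main IP are placeholders, not values from this issue:

```sh
# route the IPv4 failover address to hetzner05; repeat for the IPv6 range
curl -u "$ROBOT_USER:$ROBOT_PASS" \
  "https://robot-ws.your-server.de/failover/188.40.16.47" \
  -d "active_server_ip=<hetzner05-main-ip>"
```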
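The "wait until it is in sync" step of the DRBD switchover can be polled instead of eyeballed. A sync-wait sketch, assuming DRBD 9 status output (the `peer-disk:UpToDate` pattern is an assumption about that format):

```sh
# poll until the peer disk reports UpToDate before mounting /precious
until sudo drbdadm status r1 | grep -q 'peer-disk:UpToDate'; do
  sleep 5
done
```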

Not in scope

  • Invisible is down (./k8s-maintenance.md#when-invisible-forgejo-org-is-down) - because it has already been done in 2025, when invisible was created around March
  • Forgejo Actions runners that service the https://codeberg.org/forgejo organization - because it has already been done in 2025
  • https://forgejo.org - because it is under the responsibility of uberspace
  • https://status.forgejo.org/ - because it is under the responsibility of @crystal

switching the NFS server will require us to re-create all pods, because the NFS mounts all go stale (one way to do that is sketched below)

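A minimal sketch of one way to do that, restarting every restartable workload so its pods come back with fresh NFS mounts; the namespace loop and the chosen resource types are assumptions, not something this issue prescribes:

```sh
# restart all deployments, statefulsets and daemonsets in every namespace so
# each pod is re-created and mounts the new NFS server instead of a stale one
for ns in $(kubectl get ns -o name | cut -d/ -f2); do
  kubectl -n "$ns" rollout restart deployment,statefulset,daemonset || true
done
```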

It seems switching to NFS isn't required at all when doing this next time. If the primary (DRBD) node is down and not recoverable, then switching to NFS won't fix anything.

So next time I would do the following (a sketch follows the list):

  • install updates on the secondary node, if applicable
  • move the failover IPs
  • remove the drbd label and drain the primary
  • move DRBD to the secondary and add the drbd label
  • install updates on the old primary, if applicable
  • uncordon the old primary
  • rollout restart traefik
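A rough consolidation of that proposal into one sequence, as a sketch: it assumes the old primary (hetzner06 in this exercise) is still reachable over SSH, and reuses the commands from the checklist above, only reordered:

```sh
# 0. install updates on the secondary (future primary) first, while it is
#    still secondary: drain / reboot / uncordon as in the node-upgrade sketch

# 1. move the failover IPs to the future primary (Robot UI or API)

# 2. stop scheduling onto the old primary and evict its pods
kubectl label node --all forgejo.org/drbd-
kubectl drain --ignore-daemonsets --delete-emptydir-data hetzner06

# 3. demote the old primary -- skip this block if the node is dead
ssh root@hetzner06.forgejo.org '
  systemctl stop nfs-server
  ip addr del 10.53.101.100/24 dev enp5s0.4001
  umount /precious
  drbdadm secondary r1
'

# 4. promote the new primary and re-attach storage
ssh root@hetzner05.forgejo.org '
  drbdadm primary r1
  mount /precious
  ip addr add 10.53.101.100/24 dev enp5s0.4001
'
kubectl label node hetzner05 forgejo.org/drbd=primary

# 5. install updates on the old primary if applicable, then let it rejoin and
#    re-spread traefik so a pod runs on the node holding the failover IP
kubectl uncordon hetzner06
kubectl rollout restart deployment/traefik -n kube-system
```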