Planned disaster recovery exercise - 24 July 2025 #81

Closed
opened 2025-12-08 12:53:04 +00:00 by earl-warren · 2 comments

Warning

do not switch invisible to NFS, it [makes it much too slow](https://invisible.forgejo.org/infrastructure/k8s-cluster/issues/664). Let it be shut down and re-scheduled with local storage instead.


  • announce https://codeberg.org/forgejo/discussions/issues/374
  • sudo apt update && sudo apt upgrade to get the latest kernel on hetzner05 and hetzner06 => linux-image-amd64 (6.1.140-1)
  • full backup out of the cluster of /precious with sudo rsync --inplace --progress --delete -HSzva root@hetzner06.forgejo.org:/precious/ /srv/precious/
  • upgrade the hetzner05 kernel (see the node-upgrade sketch after this list)
    • kubectl drain --ignore-daemonsets --delete-emptydir-data hetzner05
    • reboot hetzner05
    • verify the newer kernel is installed
    • kubectl uncordon hetzner05
    • kubectl rollout restart deployment/traefik -n kube-system # needed because externalTrafficPolicy: Local requires that a traefik pod run on the node that has the failover IP
  • move the failover IP (./k8s-maintenance.md#routing-the-failover-ip; see the Robot API sketch after this list)
    • on https://robot.hetzner.com/server select hetzner05, which is the machine owning the failover IP
    • click on the arrows next to the failover IP (v4 & v6) to see "Failover configuration for 2a01:4f8:fff2:48::/64" or "Failover configuration for 188.40.16.47/32"
    • in "New routing target" select hetzner05 to replace hetzner06
    • click on "Set routing target"
    • check k3s for failures: https://hl.forgejo.org/c/main
    • wait 15 minutes
    • check that https://kuma.forgejo.ovh/dashboard is all good
  • move some pods to NFS
    • switch all next.* to using NFS (https://invisible.forgejo.org/infrastructure/k8s-cluster/pulls/654) and forgefriends (https://invisible.forgejo.org/infrastructure/k8s-cluster/pulls/658)
    • switch invisible to NFS (https://invisible.forgejo.org/infrastructure/k8s-cluster/pulls/656) # if it does not come back, check out "When invisible.forgejo.org is down" in ./k8s-maintenance.md#when-invisible-forgejo-org-is-down
    • ~~switch code to NFS (https://invisible.forgejo.org/infrastructure/k8s-cluster/pulls/657)~~ not done, because it may be as good as a shutdown due to slowness that is not yet understood (https://invisible.forgejo.org/infrastructure/k8s-cluster/issues/664)
  • move all pods out of hetzner06
    • kubectl label node --all forgejo.org/drbd- # PVC with an affinity to drbd-primary will not find a node, meaning code.forgejo.org will be stopped
    • kubectl drain --ignore-daemonsets --delete-emptydir-data hetzner06
  • switch DRBD primary to hetzner05 (./k8s-maintenance.md#manual-boot-operations) as if hetzner06 crashed and burned (see the sync-wait sketch after this list)
    • hetzner06
      • sudo systemctl stop nfs-server # gracefully (hopefully) terminate NFS I/O to hetzner05
      • sudo ip addr del 10.53.101.100/24 dev enp5s0.4001 # remove NFS server IP
      • sudo umount /precious
      • sudo drbdadm secondary r1 # Switch the DRBD to secondary
      • sudo drbdadm status
    • hetzner05
      • sudo drbdadm primary r1 # Switch the DRBD to primary
      • sudo drbdadm status # wait until it is in sync
      • sudo mount /precious # DRBD volume shared via NFS
      • sudo ip addr add 10.53.101.100/24 dev enp5s0.4001 # add NFS server IP
  • start all instances that rely on local storage
    • kubectl label node hetzner05 forgejo.org/drbd=primary # hetzner05 is where local storage can be found
  • upgrade the hetzner06 kernel
    • reboot hetzner06
    • verify the newer kernel is installed
  • allow hetzner06 to run pods and receive failover IPs
    • kubectl uncordon hetzner06
    • kubectl rollout restart deployment/traefik -n kube-system # needed because externalTrafficPolicy: Local requires that a traefik pod run on the node that has the failover IP
  • revert back to using local storage where possible, to improve performance
    • revert invisible (https://invisible.forgejo.org/infrastructure/k8s-cluster/pulls/656)
    • revert next (https://invisible.forgejo.org/infrastructure/k8s-cluster/pulls/654) and forgefriends (https://invisible.forgejo.org/infrastructure/k8s-cluster/pulls/658)
  • verify all is as it should be
    • https://hl.forgejo.org/c/main/ is healthy
    • https://kuma.forgejo.ovh/dashboard is all good
  • add this scenario to the documentation
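
For future runs, the drain / upgrade / uncordon cycle above can be collapsed into one parameterized sequence. A minimal node-upgrade sketch, assuming kubectl access from an admin host and root SSH to the nodes (the `ssh root@...` form and the apt `-y` flag are assumptions; the individual commands are the ones from the checklist):

```sh
NODE=hetzner05  # or hetzner06

# evict pods, then upgrade and reboot the node
kubectl drain --ignore-daemonsets --delete-emptydir-data "$NODE"
ssh "root@$NODE.forgejo.org" 'apt update && apt upgrade -y && reboot'

# once the node is back, confirm the new kernel is the one running
ssh "root@$NODE.forgejo.org" uname -r

# let pods schedule again and re-spread traefik: externalTrafficPolicy: Local
# requires that a traefik pod run on the node that has the failover IP
kubectl uncordon "$NODE"
kubectl rollout restart deployment/traefik -n kube-system
```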
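The failover IP move above goes through the Robot web UI. The same change can in principle be scripted against the Hetzner Robot webservice; a hedged Robot API sketch, where ROBOT_USER/ROBOT_PASS and the hetzner05 main IP are placeholders, not values from this issue:

```sh
# route the IPv4 failover address to hetzner05; repeat for the IPv6 range
curl -u "$ROBOT_USER:$ROBOT_PASS" \
  "https://robot-ws.your-server.de/failover/188.40.16.47" \
  -d "active_server_ip=<hetzner05-main-ip>"
```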
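The "wait until it is in sync" step of the DRBD switchover can be polled instead of eyeballed. A sync-wait sketch, assuming DRBD 9 status output (the `peer-disk:UpToDate` pattern is an assumption about that format):

```sh
# poll until the peer disk reports UpToDate before mounting /precious
until sudo drbdadm status r1 | grep -q 'peer-disk:UpToDate'; do
  sleep 5
done
```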

Not in scope

  • Invisible is down (./k8s-maintenance.md#when-invisible-forgejo-org-is-down) - because it has already been done in 2025, when invisible was created around March
  • Forgejo Actions runners that service the https://codeberg.org/forgejo organization - because it has already been done in 2025
  • https://forgejo.org - because it is under the responsibility of uberspace
  • https://status.forgejo.org/ - because it is under the responsibility of @crystal

switching the NFS server will require us to re-create all pods, because the NFS mounts all go stale (one way to do that is sketched below)

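A minimal sketch of one way to do that, restarting every restartable workload so its pods come back with fresh NFS mounts; the namespace loop and the chosen resource types are assumptions, not something this issue prescribes:

```sh
# restart all deployments, statefulsets and daemonsets in every namespace so
# each pod is re-created and mounts the new NFS server instead of a stale one
for ns in $(kubectl get ns -o name | cut -d/ -f2); do
  kubectl -n "$ns" rollout restart deployment,statefulset,daemonset || true
done
```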

It seems switching to NFS isn't required at all when doing this next time. If the primary (DRBD) node is down and not recoverable, then switching to NFS won't fix anything.

So next time I would do the following (a sketch follows the list):

  • install updates on the secondary node, if applicable
  • move the failover IPs
  • remove the drbd label and drain the primary
  • move DRBD to the secondary and add the drbd label
  • install updates on the old primary, if applicable
  • uncordon the old primary
  • rollout restart traefik
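A rough consolidation of that proposal into one sequence, as a sketch: it assumes the old primary (hetzner06 in this exercise) is still reachable over SSH, and reuses the commands from the checklist above, only reordered:

```sh
# 0. install updates on the secondary (future primary) first, while it is
#    still secondary: drain / reboot / uncordon as in the node-upgrade sketch

# 1. move the failover IPs to the future primary (Robot UI or API)

# 2. stop scheduling onto the old primary and evict its pods
kubectl label node --all forgejo.org/drbd-
kubectl drain --ignore-daemonsets --delete-emptydir-data hetzner06

# 3. demote the old primary -- skip this block if the node is dead
ssh root@hetzner06.forgejo.org '
  systemctl stop nfs-server
  ip addr del 10.53.101.100/24 dev enp5s0.4001
  umount /precious
  drbdadm secondary r1
'

# 4. promote the new primary and re-attach storage
ssh root@hetzner05.forgejo.org '
  drbdadm primary r1
  mount /precious
  ip addr add 10.53.101.100/24 dev enp5s0.4001
'
kubectl label node hetzner05 forgejo.org/drbd=primary

# 5. install updates on the old primary if applicable, then let it rejoin and
#    re-spread traefik so a pod runs on the node holding the failover IP
kubectl uncordon hetzner06
kubectl rollout restart deployment/traefik -n kube-system
```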