Planned disaster recovery exercise - 24 July 2025 #81
infrastructure/k8s-cluster#81
- `sudo apt update && sudo apt upgrade` to get the latest kernel on hetzner05 and hetzner06 => `linux-image-amd64 (6.1.140-1)`
- Sync `/precious` with `sudo rsync --inplace --progress --delete -HSzva root@hetzner06.forgejo.org:/precious/ /srv/precious/`
- `kubectl drain --ignore-daemonsets --delete-emptydir-data hetzner05`
- `kubectl uncordon hetzner05`
- `kubectl rollout restart deployment/traefik -n kube-system` # needed because externalTrafficPolicy: Local requires that a traefik pod runs on the node that has the failover IP
- switch code to NFS https://invisible.forgejo.org/infrastructure/k8s-cluster/pulls/657
  - not doing that because it may be as good as a shutdown, due to slowness that is not yet understood https://invisible.forgejo.org/infrastructure/k8s-cluster/issues/664
- `kubectl label node --all forgejo.org/drbd-` # PVC with an affinity to drbd-primary will not find a node, meaning code.forgejo.org will be stopped
- `kubectl drain --ignore-daemonsets --delete-emptydir-data hetzner06`
- `sudo systemctl stop nfs-server` # gracefully (hopefully) terminate NFS I/O to hetzner05
- `sudo ip addr del 10.53.101.100/24 dev enp5s0.4001` # remove NFS server IP
- `sudo umount /precious`
- `sudo drbdadm secondary r1` # switch the DRBD to secondary
- `sudo drbdadm status`
- `sudo drbdadm primary r1` # switch the DRBD to primary
- `sudo drbdadm status` # wait until it is in sync
- `sudo mount /precious` # DRBD volume shared via NFS
- `sudo ip addr add 10.53.101.100/24 dev enp5s0.4001` # add NFS server IP
- `kubectl label node hetzner05 forgejo.org/drbd=primary` # hetzner05 is where local storage can be found
- `kubectl uncordon hetzner06`
- `kubectl rollout restart deployment/traefik -n kube-system` # needed because externalTrafficPolicy: Local requires that a traefik pod runs on the node that has the failover IP

Not in scope
Switching the NFS server will require us to re-create all pods, because their NFS mounts all become stale.
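Re-creating the pods after an NFS server switch could be scripted roughly as follows. This is only a minimal sketch, not something that was run during the exercise: the `restart_workloads` helper, the `DRY_RUN` variable and the namespace used in the usage line are hypothetical, and it assumes the affected pods are managed by deployments or statefulsets. Only `kubectl rollout restart` itself is an existing command.

```shell
#!/bin/sh
# Hypothetical helper: restart every deployment and statefulset in the
# given namespaces so that their pods are re-created with fresh NFS mounts.
# With DRY_RUN=1 the kubectl commands are printed instead of executed.
restart_workloads() {
    for ns in "$@"; do
        for kind in deployment statefulset; do
            if [ "${DRY_RUN:-0}" = "1" ]; then
                echo "kubectl rollout restart $kind -n $ns"
            else
                kubectl rollout restart "$kind" -n "$ns"
            fi
        done
    done
}
```

Running `DRY_RUN=1 restart_workloads some-namespace` (the namespace name is a placeholder) first shows which commands would be issued, which is probably a sensible check before restarting anything on the live cluster.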
It seems that switching to NFS will not be required at all the next time we do this.
If the primary node (DRBD) is down and not recoverable, then switching to NFS won't fix anything.
So next time I would do the following: