How I physically moved my server between locations without outage

This is a story about how I moved my server between locations 250 kilometers apart. The primary reason for making this move was to perform the initial backup to a new off-site storage location on a faster connection than what I have available at home. The second and better reason was, of course, curiosity. The server in this case is my Thinkpad T430.

The setup

This is what my setup looks like:

[ASCII diagram of the setup: two frontend servers, each with public IPv4 and IPv6 addresses, connected via Wireguard/VPN and the LTE or home network to the backend server - the Thinkpad T430 - which runs the actual services in Docker containers.]

For more details on all the services running on my server, see this article.

The frontend servers change every now and then; this is what was in place when I did the move. The backend server itself switches networks regularly - it's just one of many devices on my home network, and it was connected via LTE while on the move. Communication with the frontend servers, to which my domains point, happens via a Wireguard tunnel.
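
The tunnel itself is plain Wireguard. As a rough sketch of what the backend side of such a config can look like - the addresses, hostname, port and keys are placeholders, not my real values, and I'm assuming the interface is managed by wg-quick:

# /etc/wireguard/wg0.conf on the backend (illustrative values only)
[Interface]
Address = 10.0.0.2/24
PrivateKey = <backend-private-key>

[Peer]
# one of the frontend servers
PublicKey = <frontend-public-key>
Endpoint = frontend.example.org:51820
AllowedIPs = 10.0.0.1/32
# keeps the tunnel alive while roaming across home WiFi, LTE and NAT
PersistentKeepalive = 25

Wireguard updates a peer's endpoint from the source address of authenticated incoming packets, and the keepalive keeps NAT mappings open, so the roaming backend can hop between networks without the frontends needing any reconfiguration.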

Fun fact: I occasionally use this Thinkpad (the server) as my workstation, and it has even run a full-blown Windows VM under libvirt/KVM with a passed-through GPU.

The move

So first I made sure that my server's and my phone's batteries were fully charged. I didn't expect the poor old Thinkpad to last through the whole journey; I just wanted to minimize downtime. My ride came in the morning, I packed the server, started tethering from the phone, and off we went. I didn't check whether the server was up until about 30 minutes into the ride. When I took my phone out of the bag, I immediately noticed that an e-mail had arrived 2 minutes earlier. I honestly didn't have any expectations, but everything worked flawlessly.

I set up a simple restarter script that checked the connection between one of the frontend nodes and my server and restarted Wireguard only after 3 consecutive failed checks, so only significant outages were caught. Here is the list of timestamps when it had to step in (a rough sketch of the script follows the list):

04:59:02 AM UTC
05:00:36 AM UTC
05:43:12 AM UTC
05:53:31 AM UTC
08:15:31 AM UTC
08:16:32 AM UTC
08:17:31 AM UTC
08:31:32 AM UTC
10:59:32 AM UTC
11:00:31 AM UTC
11:01:31 AM UTC
11:02:32 AM UTC
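
The script itself is nothing fancy. Here is a minimal sketch of the idea - the peer address, interface name, log path and timings are illustrative, and I'm again assuming wg-quick manages the interface:

#!/bin/sh
# Watchdog sketch: ping the frontend's tunnel address and restart
# Wireguard after 3 consecutive failures (all values illustrative).
PEER=10.0.0.1
FAILS=0
while true; do
    if ping -c 1 -W 5 "$PEER" > /dev/null 2>&1; then
        FAILS=0
    else
        FAILS=$((FAILS + 1))
    fi
    if [ "$FAILS" -ge 3 ]; then
        # log a timestamp in the same format as the list above
        date -u '+%I:%M:%S %p %Z' >> /var/log/wg-restarts.log
        systemctl restart wg-quick@wg0
        FAILS=0
    fi
    sleep 20
done

With timings like these, a restart is only logged after roughly a minute of lost connectivity, which is why only the significant outages show up in the list.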

The move itself happened between 5 and 8 AM UTC, which was 7 to 10 AM CEST in my life (I use UTC on my servers). I actually caught more outages at the destination (my friend's apartment) due to his unexpectedly poor WiFi connection. Eventually we switched over to a cable.

Running the backups

At home this would have taken me at least 7 days, which is not a problem in itself, but my (free) remote backup service would time out. I had to throttle the backup program (kopia) so that it wouldn't fill the local cache of data waiting to be uploaded - I ran out of disk space because my connection couldn't upload the prepared data fast enough. At some points this slowed everything down to a crawl, uploading at a rate of 1 Mbps and causing the timeouts. There was simply no easy way to upload the whole dataset from home, due to how the backup program worked in tandem with the backup service. I actually tried it at home first and it failed on the 7th day, so I gave up. None of this is kopia's or rclone's fault, but the fault of the service I use and how it implements the fuse storage.
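
To illustrate the kind of stack this is - not my exact invocation; the remote name, paths and the bandwidth cap are made up, and capping at the rclone layer is just one way to throttle - the service's fuse storage can be mounted with rclone and used by kopia as a plain filesystem repository:

# Mount the backup service's storage via rclone's FUSE mount, capping
# bandwidth so locally prepared data cannot outrun the uplink.
rclone mount backupsvc:backups /mnt/backup --bwlimit 8M --vfs-cache-mode writes --daemon

# Point kopia at the mounted path as a filesystem repository.
kopia repository connect filesystem --path /mnt/backup/kopia-repo

# Run the (initial) snapshot of the dataset.
kopia snapshot create /tank/data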

While at my friend's, I helped him with some stuff and we spent some good time together. My initial backup was uploaded over the 100 Mbps uplink within the first 2 days. Subsequent backups are much faster and work flawlessly from home. The reason I didn't go with a service like S3 Glacier, despite the acceptable storage cost, is that the recovery costs are huge. And while I haven't needed to recover anything yet, I obviously want to verify that my backups work by doing a test recovery. Since that would also take a long time on my home connection, I tested these backups immediately, and all was fine. In the future I will publish an article on how I have this set up. It involves kopia, rclone, ZFS and 2 remote locations with a total capacity of 6 TB for less than 5 EUR per month.
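
A test recovery with kopia boils down to restoring a snapshot into a scratch directory and comparing it with the live data. The snapshot ID and paths here are placeholders:

# List snapshots and pick one to test (ID and paths are placeholders).
kopia snapshot list

# Restore the chosen snapshot into a scratch directory.
kopia restore k1234567890abcdef /tmp/restore-test

# Spot-check the restored tree against the live data.
diff -r /tank/data /tmp/restore-test | head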

Trouble on the way back

On my way back I accidentally tested one more aspect of the setup's reliability: I bumped the laptop in such a way that one of the SSDs in the RAID1 array disconnected. This was no big deal, of course, and it was easily fixed when I came home. Here is the dmesg log from that moment, for posterity:

[1634427.631042] ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
[1634427.632466] ata2.00: irq_stat 0x00000040, connection status changed
[1634427.633809] ata2: SError: { CommWake DevExch }
[1634427.635220] ata2.00: failed command: FLUSH CACHE EXT
[1634427.636584] ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
                          res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
[1634427.639229] ata2.00: status: { DRDY }
[1634427.640528] ata2: hard resetting link
[1634428.367072] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[1634428.369917] ata2.00: model number mismatch 'Crucial_CT275MX300SSD1' != 'CT1000MX500SSD1'
[1634428.372093] ata2.00: revalidation failed (errno=-19)
[1634428.374277] ata2: limiting SATA link speed to 3.0 Gbps
[1634433.522914] ata2: hard resetting link
[1634433.837084] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[1634433.838980] ata2.00: model number mismatch 'Crucial_CT275MX300SSD1' != 'CT1000MX500SSD1'
[1634433.840395] ata2.00: revalidation failed (errno=-19)
[1634433.841800] ata2.00: disabled
[1634438.894797] ata2: hard resetting link
[1634439.209017] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[1634439.210822] ata2.00: supports DRM functions and may not be fully accessible
[1634439.211869] ata2.00: ATA-10: CT1000MX500SSD1, M3CR033, max UDMA/133
[1634439.212877] ata2.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA
[1634439.214467] ata2.00: supports DRM functions and may not be fully accessible
[1634439.216017] ata2.00: configured for UDMA/133
[1634439.217030] ata2.00: retrying FLUSH 0xea Emask 0x10
[1634439.228158] ata2.00: device reported invalid CHS sector 0
[1634439.229233] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=43s
[1634439.230300] sd 1:0:0:0: [sdb] tag#0 Sense Key : Illegal Request [current]
[1634439.231361] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unaligned write command
[1634439.232627] sd 1:0:0:0: [sdb] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[1634439.233679] blk_update_request: I/O error, dev sdb, sector 2056 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[1634439.234745] md: super_written gets error=-5
[1634439.234791] sd 1:0:0:0: rejecting I/O to offline device
[1634439.235800] md/raid1:md1: Disk failure on sdb1, disabling device.
                 md/raid1:md1: Operation continuing on 1 devices.
[1634439.237400] blk_update_request: I/O error, dev sdb, sector 50829080 op 0x1:(WRITE) flags 0x700 phys_seg 6 prio class 0
[1634439.239573] ata2: EH complete
[1634439.241200] zio pool=liberty_crypt vdev=/dev/mapper/liberty_cryptb error=5 type=2 offset=4531826688 size=4096 flags=180880
[1634439.243033] blk_update_request: I/O error, dev sdb, sector 40193984 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1634439.244757] zio pool=liberty_crypt vdev=/dev/mapper/liberty_cryptb error=5 type=2 offset=4531830784 size=20480 flags=180880
[1634439.244804] blk_update_request: I/O error, dev sdb, sector 40194040 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 0
[1634439.244807] md/raid1:md1: sdb1: rescheduling sector 40157176
[1634439.244811] blk_update_request: I/O error, dev sdb, sector 41978384 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1634439.244815] zio pool=liberty_crypt vdev=/dev/mapper/liberty_cryptb error=5 type=1 offset=270336 size=8192 flags=b08c1
[1634439.244836] blk_update_request: I/O error, dev sdb, sector 537233424 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1634439.244840] zio pool=liberty_crypt vdev=/dev/mapper/liberty_cryptb error=5 type=1 offset=253570850816 size=8192 flags=b08c1
[1634439.244849] blk_update_request: I/O error, dev sdb, sector 537233936 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1634439.244851] zio pool=liberty_crypt vdev=/dev/mapper/liberty_cryptb error=5 type=1 offset=253571112960 size=8192 flags=b08c1
[1634439.245136] blk_update_request: I/O error, dev sdb, sector 50829128 op 0x1:(WRITE) flags 0x700 phys_seg 4 prio class 0
[1634439.245140] zio pool=liberty_crypt vdev=/dev/mapper/liberty_cryptb error=5 type=2 offset=4531851264 size=16384 flags=40080c80
[1634439.246236] md/raid1:md1: sdb1: rescheduling sector 40157120
[1634439.264436] ata2.00: detaching (SCSI 1:0:0:0)
[1634439.264635] blk_update_request: I/O error, dev sdb, sector 41978384 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1634439.265028] md/raid1:md1: redirecting sector 40157176 to other mirror: sda1
[1634439.265421] md/raid1:md1: redirecting sector 40157120 to other mirror: sda1
[1634439.271708] zio pool=liberty_crypt vdev=/dev/mapper/liberty_cryptb error=5 type=1 offset=270336 size=8192 flags=b0ac1
[1634439.275269] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[1634439.275519] zio pool=liberty_crypt vdev=/dev/mapper/liberty_cryptb error=5 type=1 offset=253570850816 size=8192 flags=b0ac1
[1634439.278922] sd 1:0:0:0: [sdb] Stopping disk
[1634439.279303] zio pool=liberty_crypt vdev=/dev/mapper/liberty_cryptb error=5 type=1 offset=253571112960 size=8192 flags=b0ac1
[1634439.311273] scsi 1:0:0:0: Direct-Access     ATA      CT1000MX500SSD1  033  PQ: 0 ANSI: 5
[1634439.314036] sd 1:0:0:0: [sdc] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
[1634439.316337] sd 1:0:0:0: [sdc] Write Protect is off
[1634439.318271] sd 1:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[1634439.318356] sd 1:0:0:0: Attached scsi generic sg1 type 0
[1634439.320451] sd 1:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[1634439.348965]  sdc: sdc1 sdc2
[1634439.427079] sd 1:0:0:0: [sdc] supports TCG Opal
[1634439.429153] sd 1:0:0:0: [sdc] Attached SCSI disk
[1635038.672540] md: recovery of RAID array md1
[1635205.054464] md: md1: recovery done.

The system part of my SSDs is ext4 on LVM on LUKS on mdadm. The bigger data part is ZFS on LUKS. Both recovered without a hitch. Later down the line I learned that something is wrong with either one of the SSDs or the slot itself - then again, this setup could be called janky at best. The Thinkpad is physically damaged (by the previous owner) in many places, and the drive caddy I use in the ultrabay is held together with duct tape because the original holding mechanism fell apart. I didn't build this with portability in mind, and when buying it I wanted it to be cheaper than a Raspberry Pi.
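
For the curious, bringing such a stack back after a kicked drive is mostly a matter of re-adding the member on each layer. A rough sketch of the kind of commands involved - not a transcript of what I ran, and the partition layout is inferred from the log above:

# mdadm layer: check the array state and re-add the kicked member
cat /proc/mdstat
mdadm --manage /dev/md1 --re-add /dev/sdb1
# (if re-add is refused, remove the failed member and --add it instead)

# ZFS-on-LUKS layer: reopen the LUKS container and bring the vdev back
cryptsetup open /dev/sdb2 liberty_cryptb
zpool online liberty_crypt /dev/mapper/liberty_cryptb
zpool status liberty_crypt   # watch the resilver complete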

In both directions the laptop's battery was enough to last the whole way (3-4 hours due to delays), and the lack of significant outages can be attributed to decent LTE coverage along the highways in my country.

That's the end of the story: everything is fine, and the incremental backups have run regularly ever since. The next step in my backup journey will be retesting the backups at least once a year, but as it goes with all things: if it's not automated, I'll forget.