Lately we have been running production on tux04. Unfortunately Debian got broken and I don't see a way to fix it (something with python versions that break apt!). Also mariadb is giving problems:
and that is alarming. We might as well try an upgrade. I created a new partition, /dev/sda4, and installed a fresh Debian on it using debootstrap.
Luckily not too much is running on this machine and if we mount things again, most should work.
blkid /dev/sda4
/dev/sda4: UUID="4aca24fe-3ece-485c-b04b-e2451e226bf7" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="2e3d569f-6024-46ea-8ef6-15b26725f811"
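The UUID reported by blkid is what the new root's /etc/fstab should refer to. A minimal sketch, assuming the new root gets the same mount options as the old one; `fstab_root_line` is a hypothetical helper and /mnt/new-root is an assumed mount point:

```shell
# Hypothetical helper: print an fstab line for an ext4 root file system.
# Assumes the options the old root used (errors=remount-ro, fsck pass 1).
fstab_root_line() {
    uuid=$1
    printf 'UUID=%s / ext4 errors=remount-ro 0 1\n' "$uuid"
}

# Example, as root, with the new partition mounted on /mnt/new-root (assumed):
#   fstab_root_line "$(blkid -s UUID -o value /dev/sda4)" >> /mnt/new-root/etc/fstab
```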
After debootstrap there are two things to take care of: the /dev directory and grub. For good measure I also capture some state:
cd ~
ps xau > cron.log
systemctl > systemctl.txt
cp /etc/network/interfaces .
cp /boot/grub/grub.cfg .
We should still have access to the old root partition, so I don't need to capture everything.
I ran MAKEDEV, though that may not be needed with udev.
We need to tell grub to boot into the new partition. The old root is on UUID=8e874576-a167-4fa1-948f-2031e8c3809f /dev/sda2.
Next I ran
tux04:~$ update-grub2 /dev/sda
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.10.0-32-amd64
Found initrd image: /boot/initrd.img-5.10.0-32-amd64
Found linux image: /boot/vmlinuz-5.10.0-22-amd64
Found initrd image: /boot/initrd.img-5.10.0-22-amd64
Warning: os-prober will be executed to detect other bootable partitions.
Its output will be used to detect bootable binaries on them and create new boot entries.
Found Debian GNU/Linux 12 (bookworm) on /dev/sda4
Found Windows Boot Manager on /dev/sdd1@/efi/Microsoft/Boot/bootmgfw.efi
Found Debian GNU/Linux 11 (bullseye) on /dev/sdf2
Very good. Do a diff against the saved grub.cfg and you see it even picked up the serial configuration. The diff only shows added menu entries for the new boot. Very nice.
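The diff check can be narrowed down to just the menu entry titles. This is a rough sketch, assuming the copy saved earlier sits in the home directory; `new_menu_entries` is a hypothetical helper name:

```shell
# Hypothetical helper: list the titles of menu entries that appear in the
# new grub.cfg but not in the saved copy. Arguments: old file, new file.
new_menu_entries() {
    # Lines added in the new file start with "> " in plain diff output.
    diff "$1" "$2" | sed -n "s/^>[[:space:]]*menuentry '\([^']*\)'.*/\1/p"
}

# Example: new_menu_entries ~/grub.cfg /boot/grub/grub.cfg
```

If anything other than new entries shows up in the full diff, that is worth a look before rebooting.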
At this point I feel safe to boot as we should be able to get back into the old partition.
The old fstab looked like
UUID=8e874576-a167-4fa1-948f-2031e8c3809f / ext4 errors=remount-ro 0 1
# /boot/efi was on /dev/sdc1 during installation
UUID=998E-68AF /boot/efi vfat umask=0077 0 1
# swap was on /dev/sdc3 during installation
UUID=cbfcd84e-73f8-4cec-98ee-40cad404735f none swap sw 0 0
UUID="783e3bd6-5610-47be-be82-ac92fdd8c8b8" /export2 ext4 auto 0 2
UUID="9e6a9d88-66e7-4a2e-a12c-f80705c16f4f" /export ext4 auto 0 2
UUID="f006dd4a-2365-454d-a3a2-9a42518d6286" /export3 auto auto 0 2
/export2/gnu /gnu none defaults,bind 0 0
# /dev/sdd1: PARTLABEL="bulk" PARTUUID="b1a820fe-cb1f-425e-b984-914ee648097e"
# /dev/sdb4 /export ext4 auto 0 2
# /dev/sdd1 /export2 ext4 auto 0 2
Next we are going to reboot, and for that we need a serial connection to the Dell out-of-band controller using racadm:
ssh IP
console com2
racadm getsel
racadm serveraction powercycle
racadm serveraction powerstatus
The main trick is to hit ESC, wait 2 seconds, and then press 2 when you want the BIOS boot menu. Use Ctrl-\ to escape the console. Otherwise use ESC, (wait), ! to get to the boot menu.
It still boots by default into the old root. That gave an error:
[FAILED] Failed to start File Syste…a-2365-454d-a3a2-9a42518d6286
This is /export3. We can fix that later.
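systemd truncates the unit name, so only the tail of the UUID is visible in the error. A small sketch for mapping such a fragment back to its fstab entry; `mount_point_for_uuid_fragment` is a hypothetical helper:

```shell
# Hypothetical helper: print the mount point of the fstab entry whose UUID
# ends with the given fragment. Arguments: fragment, fstab file (optional).
mount_point_for_uuid_fragment() {
    awk -v frag="$1" '
        /^[[:space:]]*#/ { next }          # skip comments
        {
            spec = $1
            gsub(/"/, "", spec)            # the old fstab quotes some UUIDs
            n = length(frag)
            if (spec ~ /^UUID=/ && substr(spec, length(spec) - n + 1) == frag)
                print $2
        }
    ' "${2:-/etc/fstab}"
}

# Example: mount_point_for_uuid_fragment a-2365-454d-a3a2-9a42518d6286
```

Adding `nofail` to the options of such an entry is one way to keep a missing or broken disk from interrupting the boot. Whether that is the right choice for /export3 is an assumption on my part.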
When I booted into the proper partition the console clapped out. Also the racadm password did not work in tmux -- I had to switch to a standard console to log in again. Not sure why that is, but next I got:
Give root password for maintenance (or press Control-D to continue):
and giving the root password I was in maintenance mode on the correct partition!
To rerun grub I had to add `GRUB_DISABLE_OS_PROBER=false` to /etc/default/grub.
Once booted, it is a matter of mounting partitions and ticking the check boxes above.
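The setting lives in /etc/default/grub. Here is a sketch of setting it without opening an editor, assuming the file uses plain KEY=VALUE lines; `set_default` is a hypothetical helper:

```shell
# Hypothetical helper: set KEY=VALUE in a shell-style defaults file,
# replacing an existing (possibly commented-out) assignment or appending one.
# Arguments: key, value, file.
set_default() {
    key=$1 value=$2 file=$3
    if grep -q "^#\{0,1\}[[:space:]]*$key=" "$file"; then
        tmp=$(mktemp) || return 1
        # Rewrite through a temp file, then copy back to keep the file's mode.
        sed "s|^#\{0,1\}[[:space:]]*$key=.*|$key=$value|" "$file" > "$tmp" &&
            cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example, as root:
#   set_default GRUB_DISABLE_OS_PROBER false /etc/default/grub && update-grub
```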
The following contained errors:
/dev/sdd1 3.6T 1.8T 1.7T 52% /export2
Getting guix going is a bit tricky because we want to keep the store!
cp -vau /mnt/old-root/var/guix/ /var/
cp -vau /mnt/old-root/usr/local/guix-profiles /usr/local/
cp -vau /mnt/old-root/usr/local/bin/* /usr/local/bin/
cp -vau /mnt/old-root/etc/systemd/system/guix-daemon.service* /etc/systemd/system/
cp -vau /mnt/old-root/etc/systemd/system/gnu-store.mount* /etc/systemd/system/
Also had to add guixbuild users and group by hand.
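For reference, the Guix manual's build environment setup creates a `guixbuild` system group and a set of `guixbuilderNN` users. The sketch below prints those commands instead of running them, so they can be reviewed first. The count of 10, the helper name `guixbuild_commands`, and the nologin path are assumptions:

```shell
# Hypothetical helper: print the commands that create the build group and
# N build users, following the recipe in the Guix manual.
# /usr/sbin/nologin is an assumed path (Debian); adjust if it lives elsewhere.
guixbuild_commands() {
    count=${1:-10}
    echo "groupadd --system guixbuild"
    for i in $(seq -w 1 "$count"); do
        echo "useradd -g guixbuild -G guixbuild -d /var/empty -s /usr/sbin/nologin -c 'Guix build user $i' --system guixbuilder$i"
    done
}

# Review the output, then run it as root:
#   guixbuild_commands 10 | sh
```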
We use the streaming facility of nginx (the stream module). Check that
nginx -V
lists --with-stream=static, see
and load at the start of nginx.conf:
load_module /usr/lib/nginx/modules/ngx_stream_module.so;
and
nginx -t
passes
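A small sketch for checking how the stream module was built. `nginx -V` prints its configure arguments on stderr, hence the redirect in the usage line. As far as I know a module built with `--with-stream=dynamic` is the case that needs the `load_module` line, while a built-in module does not. `stream_module_mode` is a hypothetical helper:

```shell
# Hypothetical helper: read `nginx -V` output on stdin and report how the
# stream module was built: "dynamic", "static" (built in), or "missing".
stream_module_mode() {
    flags=$(cat)
    if printf '%s\n' "$flags" | grep -q -e '--with-stream=dynamic'; then
        echo dynamic
    elif printf '%s\n' "$flags" | grep -Eq -e '--with-stream( |=|$)'; then
        echo static
    else
        echo missing
    fi
}

# Usage: nginx -V 2>&1 | stream_module_mode
```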
Now the container responds to the browser with `Internal Server Error`.
Visit the container with something like
nsenter -at 2838 /run/current-system/profile/bin/bash --login
The nginx log in the container has many
2025/02/22 17:23:48 [error] 136#0: *166916 connect() failed (111: Connection refused) while connecting to upstream, client: 127.0.0.1, server: genenetwork.org, request: "GET /gn3/gene/aliases/st%2029:1;o;s HTTP/1.1", upstream: "http://127.0.0.1:9800/gene/aliases/st%2029:1;o;s", host: "genenetwork.org"
that is interesting. Acme/https is working because GN2 is working:
curl https://genenetwork.org/api3/version
"1.0"
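To see which upstream services are refusing connections, the error log can be summarised per host and port. This is a sketch based only on the log line format shown above; `refused_upstreams` is a hypothetical helper and the log path in the usage line is an assumption:

```shell
# Hypothetical helper: read an nginx error log on stdin and count
# "Connection refused" errors per upstream host:port.
refused_upstreams() {
    grep 'Connection refused' |
        sed -n 's|.*upstream: "[a-z]*://\([^/"]*\).*|\1|p' |
        sort | uniq -c | sort -rn
}

# Usage (log path is an assumption):
#   refused_upstreams < /var/log/nginx/error.log
```

For the line above this reports 127.0.0.1:9800, which is the port to check inside the container.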
Looking at the logs, the first problem for GN2 appears to be redis.
Fred builds the container with `/home/fredm/opt/guix-production/bin/guix`. Machines are defined in
fredm@tux04:/export3/local/home/fredm/gn-machines
The shared dir for redis is at
--share=/export2/guix-containers/genenetwork/var/lib/redis=/var/lib/redis
with
root@genenetwork-production /var# ls lib/redis/ -l
-rw-r--r-- 1 redis redis 629328484 Feb 22 17:25 dump.rdb
In production.scm it is defined as
(service redis-service-type
         (redis-configuration
          (bind "127.0.0.1")
          (port 6379)
          (working-directory "/var/lib/redis")))
These values are the same as the defaults of redis-service-type in guix, so I am not sure why we are duplicating them.
After starting redis by hand I get another error `500 DatabaseError: The following exception was raised while attempting to access http://auth.genenetwork.org/auth/data/authorisation: database disk image is malformed`. The problem is it created a DB in the wrong place. Alright, the logs in the container say:
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # Redis version=7.0.12, bits=64, commit=00000000, modified=0, pid=3977, just started
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # Configuration loaded
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * Increased maximum number of open files to 10032 (it was originally set to 1024).
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * monotonic clock: POSIX clock_gettime
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * Running mode=standalone, port=6379.
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Server initialized
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Wrong signature trying to load DB from file
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Fatal error loading the DB: Invalid argument. Exiting.
Feb 23 14:04:31 genenetwork-production shepherd[1]: Service redis (PID 3977) exited with 1.
At first this looked like it was caused by a newer version of redis. That would be odd, because we are using the same version from the container?!
Actually it turned out the redis DB was corrupted on the SSD! Same for some other databases (ugh).
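The "Wrong signature" message is what redis prints when the dump does not start with the `REDIS` magic string that every RDB file begins with, so a quick look at the first bytes already tells a corrupted file from a version problem. A minimal sketch; `rdb_signature_ok` is a hypothetical helper:

```shell
# Hypothetical helper: succeed if the file starts with the "REDIS" magic
# string that RDB dumps begin with. Argument: path to the dump.
rdb_signature_ok() {
    # tr drops NUL bytes so the shell does not complain about binary data.
    [ "$(head -c 5 "$1" | tr -d '\0')" = "REDIS" ]
}

# Usage:
#   rdb_signature_ok /var/lib/redis/dump.rdb || echo "dump.rdb looks corrupted"
```

This only checks the header. The `redis-check-rdb` tool that ships with redis does a full pass over the file.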
Fred copied all data to enterprise-level storage, and we rolled back to some older DBs, so hopefully we'll be OK for now.
In the next step we need to restore backups as described in
I already created an ibackup user. Next we test the backup script for mariadb.
One important step is to check the database:
/usr/bin/mariadb-check -c -u user -p* db_webqtl
A successful mariadb backup consists of multiple steps:
2025-02-27 11:48:28 +0000 (ibackup@tux04) SUCCESS 0 <32m43s> mariabackup-dump
2025-02-27 11:48:29 +0000 (ibackup@tux04) SUCCESS 0 <00m00s> mariabackup-make-consistent
2025-02-27 12:16:37 +0000 (ibackup@tux04) SUCCESS 0 <28m08s> borg-tux04-sql-backup
2025-02-27 12:16:46 +0000 (ibackup@tux04) SUCCESS 0 <00m07s> drop-rsync-balg01
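A status log in this shape is easy to check automatically. The sketch below relies only on the column layout shown above and treats anything other than SUCCESS in the fifth column as a failure. What a failed step actually prints is an assumption, and `failed_backup_steps` is a hypothetical helper:

```shell
# Hypothetical helper: read the backup status log on stdin and print the
# name of every step whose status column is not SUCCESS.
# Exits non-zero if at least one such step is found.
failed_backup_steps() {
    awk 'NF && $5 != "SUCCESS" { print $NF; bad = 1 } END { exit bad + 0 }'
}

# Usage (log file name is an assumption):
#   failed_backup_steps < backup-status.log || echo "backup needs attention"
```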