Edit this page | Blame

Fallbacks and backups

A revisit to previous work on backups etc. The sheepdog hosts are no longer responding and we should really run sheepdog on a machine that is not physically with the other machines. In time sheepdog should also move away from redis and run in a system container, but that is for later. I did most of the work late 2021 when I wrote:

As a hurricane is barreling towards our machine room in Memphis we are checking our fallbacks and backups for GeneNetwork. For years we have been making backups on Amazon - both S3 and a running virtual machine. The latter was expensive, so I replaced it with a bare metal server which earns itself (if it hadn't been down for months, but that is a different story).

As we are introducing an external sheepdog server we may give it a DNS entry as sheepdog.genenetwork.org.

See also

Tags

  • type: enhancement
  • assigned: pjotrp
  • keywords: systems, fallback, backup, deploy
  • status: in progress
  • priority: critical

Tasks

  • [X] fix redis queue and sheepdog server
  • [X] check backups on tux01
  • [ ] drop tux02 backups off-site
  • [ ] backup ratspub, r/shiny, bnw, covid19, hegp, pluto services
  • [ ] /etc /home/shepherd backups for Octopus
  • [ ] /etc /home/shepherd /home/git CI-CD GN-QA backups on Tux02
  • [ ] Get backups running again on fallback
  • [ ] fix bacchus large backups
  • [ ] mount bacchus on HPC

Backup and restore

We are using borg for backing up data. Borg is excellent at deduplication and compression of data and is pretty fast too. Incremental copies work with rsync - so that is fast. To restore the full MariaDB database from a local borg repo takes a few minutes:

wrk@epysode:/export/restore_tux01$ time borg extract -v /export2/backup/tux01/borg-tux01::BORG-TUX01-MARIADB-20210829-04:20-Sun
real    17m32.498s
user    8m49.877s
sys     4m25.934s

This all contrasts heavily with restoring 300GB from Amazon S3.

Next restore the GN2 home dir

root@epysode:/# borg extract export2/backup/tux01/borg-genenetwork::TUX01_BORG_GN2_HOME-20210830-04:00-Mon

Get backups running on fallback

Recently epysode was reinstated after hardware failure. I took the opportunity to reinstall the machine. The backups are described in the repo (genenetwork org members have access)

As epysode was one of the main sheepdog messaging servers I need to reinstate:

  • [X] scripts for sheepdog
  • [ ] Check tunnel on tux01 is reinstated
  • [ ] enable trim
  • [ ] reinstate monitoring web services
  • [ ] reinstate daily backups
  • [ ] CRON
  • [ ] make sure messaging works through redis
  • [ ] fix and propagate GN1 backup
  • [ ] fix and propagate fileserver and git backups
  • [ ] add GN1 backup
  • [ ] other backups
  • [ ] email on fail

Tux01 is backed up now. Need to make sure it propagates to

  • [ ] rabbit
  • [ ] Tux02
  • [ ] balg01
  • [ ] bacchus
(made with skribilo)