Tux04/Tux05 disk issues

We are facing some disk issues with Tux04:

May 02 20:57:42 tux04 kernel: Buffer I/O error on device sdf1, logical block 859240457

and the same happened to tux05 (same batch). The controllers themselves report no issues. Just to be sure we added a copy of the boot partition.
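A minimal sketch of how such a boot partition copy can be made (the device names here are hypothetical, not the actual ones on tux04):

# clone the boot partition to a spare partition on another disk
dd if=/dev/sdX1 of=/dev/sdY1 bs=4M conv=fsync status=progress
# or, with both partitions mounted, copy at the file level instead
rsync -a /boot/ /mnt/boot-copy/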

Tags

Info

journalctl |grep mega
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_00], opened
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_00], [NVMe     Dell DC NVMe PE8 .2.0], lu id: 0x9a9ad026002ee4ac, S/N: SSBBN7299I250C41H, 960 GB
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_01], opened
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_01], [NVMe     Dell Ent NVMe FI .0.0], lu id: 0x3655523054a001820025384500000002, S/N: S6URNE0TA00182, 1.60 TB
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_02], opened
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_02], [NVMe     UMIS RPJTJ512MGE 0630], lu id: 0x8a13205102504a04, S/N: SS0L25210X8RC25E14WA, 512 GB
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_03], opened
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_03], [NVMe     CT4000P3SSD8     R30A], lu id: 0x550000f077a77964, S/N: 2314E6C3E33E, 4.00 TB
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_04], opened
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_04], [NVMe     CT4000P3SSD8     R30A], lu id: 0x830000f077a77964, S/N: 2314E6C3E2E2, 4.00 TB
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_05], opened
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_05], [NVMe     CT4000P3SSD8     R30A], lu id: 0x4d0000907da77964, S/N: 2327E6E9CB05, 4.00 TB

Switched on smartmontools.

smartctl -a /dev/sdf -d megaraid,0

shows no errors.
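To keep smartd watching the individual drives behind the controller, the pass-through devices can be listed explicitly in /etc/smartd.conf. A sketch, assuming six drives on megaraid,0 through megaraid,5 as in the journal above:

# /etc/smartd.conf: monitor the drives behind the MegaRAID controller explicitly
/dev/sdf -d megaraid,0 -a -m root
/dev/sdf -d megaraid,1 -a -m root
# ...repeat up to megaraid,5

and restart smartd afterwards (systemctl restart smartd).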

tux04:/$ lspci |grep RAID
41:00.0 RAID bus controller: Broadcom / LSI MegaRAID 12GSAS/PCIe Secure SAS39xx

Download and install megacli:

apt-get update
apt-get install megacli
megacli -LDInfo -L5 -a0

tux04:/$  megacli -PDList -a0|grep -i S.M
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
tux04:/$  megacli -PDList -a0|grep -i Firm
Firmware state: Online, Spun Up
Device Firmware Level: .2.0
Firmware state: Online, Spun Up
Device Firmware Level: .0.0
Firmware state: Online, Spun Up
Device Firmware Level: 0630
Firmware state: Online, Spun Up
Device Firmware Level: R30A
Firmware state: Online, Spun Up
Device Firmware Level: R30A
Firmware state: Online, Spun Up
Device Firmware Level: R30A

So the drives are OK and the controller is not complaining.

Smartctl self tests do not work on this controller:

tux04:/$ smartctl -t short -d megaraid,0 /dev/sdf -c
Short Background Self Test has begun
Use smartctl -X to abort test

and nothing happens ;). megacli is actually the tool to use:

megacli -AdpAllInfo -aAll
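Besides the S.M.A.R.T flag, megacli keeps per-drive error counters that are worth eyeballing; a sketch (the grep pattern just picks out the relevant PDList fields):

megacli -PDList -a0 | grep -iE 'slot|media error|other error|predictive'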

Database

During a backup the DB shows this error:

2025-03-02 06:28:33 Database page corruption detected at page 1079428, retrying...
[01] 2025-03-02 06:29:33 Database page corruption detected at page 1103108, retrying...

Interestingly the DB recovered on a second backup.

The database is hosted on /dev/sde, a Dell Ent NVMe FI solid-state drive. The kernel log says:

kernel: I/O error, dev sde, sector 2136655448 op 0x0:(READ) flags 0x80700 phys_seg 40 prio class 2

One suggestion found for this kind of error:

The errors that you see are interface errors, they are not coming from the disk itself but rather from the connection to it. It can be the cable or any of the ports in the connection. Since the CRC errors on the drive do not increase I can only assume that the problem is on the receive side of the machine you use. You should check the cable and try a different SATA port on the server.

and someone wrote:

most of these cases are caused by intensive reading and writing. This was a CDN cache node: under that kind of read load the NVMe temperature gets relatively high and, if it continues, the drive starts to throttle and then slowly collapses.

and the temperature on that drive has been 70°C.
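The drive temperature can be checked through the same megaraid pass-through; a sketch (the device index for /dev/sde is an assumption and may need adjusting):

# read the temperature of the DB drive via the controller pass-through
smartctl -a -d megaraid,1 /dev/sde | grep -i temperature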

The MariaDB log is showing errors:

2025-03-02  6:54:47 0 [ERROR] InnoDB: Failed to read page 449925 from file './db_webqtl/SnpAll.ibd': Page read from tablespace is corrupted.
2025-03-02  7:01:43 489015 [ERROR] Got error 180 when reading table './db_webqtl/ProbeSetXRef'
2025-03-02  8:10:32 489143 [ERROR] Got error 180 when reading table './db_webqtl/ProbeSetXRef'

Let's try and dump those tables when the backup is done.

mariadb-dump -uwebqtlout db_webqtl SnpAll
mariadb-dump: Error 1030: Got error 1877 "Unknown error 1877" from storage engine InnoDB when dumping table `SnpAll` at row: 0
mariadb-dump -uwebqtlout db_webqtl ProbeSetXRef > ProbeSetXRef.sql

Eeep:

tux04:/etc$ mariadb-check -uwebqtlout -c db_webqtl ProbeSetXRef
db_webqtl.ProbeSetXRef
Warning  : InnoDB: Index ProbeSetFreezeId is marked as corrupted
Warning  : InnoDB: Index ProbeSetId is marked as corrupted
error    : Corrupt
tux04:/etc$ mariadb-check -uwebqtlout -c db_webqtl SnpAll
db_webqtl.SnpAll
Warning  : InnoDB: Index PRIMARY is marked as corrupted
Warning  : InnoDB: Index SnpName is marked as corrupted
Warning  : InnoDB: Index Rs is marked as corrupted
Warning  : InnoDB: Index Position is marked as corrupted
Warning  : InnoDB: Index Source is marked as corrupted
error    : Corrupt

On tux01 we have a working database; we can test with:

mysqldump --no-data --all-databases > table_schema.sql
mysqldump -uwebqtlout db_webqtl SnpAll > SnpAll.sql
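Once a clean dump exists it can be used to replace the corrupted tables on the target host, since mariadb-dump includes the DROP/CREATE statements by default. A sketch with assumed paths:

# copy the dump over and re-import it, replacing the corrupted table
scp tux01:SnpAll.sql .
mariadb -uwebqtlout db_webqtl < SnpAll.sql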

Running the backup with rate limiting, starting from:

Mar 02 17:09:59 tux04 sudo[548058]: pam_unix(sudo:session): session opened for user root(uid=0) by wrk(uid=1000)
Mar 02 17:09:59 tux04 sudo[548058]:      wrk : TTY=pts/3 ; PWD=/export3/local/home/wrk/iwrk/deploy/gn-deploy-servers/scripts/tux04 ; USER=roo>
Mar 02 17:09:55 tux04 sudo[548058]: pam_unix(sudo:auth): authentication failure; logname=wrk uid=1000 euid=0 tty=/dev/pts/3 ruser=wrk rhost= >
Mar 02 17:04:26 tux04 su[548006]: pam_unix(su:session): session opened for user ibackup(uid=1003) by wrk(uid=0)
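The rate limiting itself can be done at the I/O and network level; a purely illustrative sketch (the real backup runs through borg, so the paths and limit here are assumptions):

# idle I/O priority plus a ~50 MB/s bandwidth cap for the copy
ionice -c3 nice -n19 rsync -a --bwlimit=50000 /var/lib/mysql/ backup-host:/backup/mysql/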

Oh oh

Tux04 is showing errors on all disks. We have to bail out. I am copying the potentially corrupted files to tux01 right now. We have backups, so nothing serious I hope. I am only worried about the MyISAM files we have, because they have no strong internal validation:

2025-03-04  8:32:45 502 [ERROR] db_webqtl.ProbeSetData: Record-count is not ok; is 5264578601   Should be: 5264580806
2025-03-04  8:32:45 502 [Warning] db_webqtl.ProbeSetData: Found 28665 deleted space.   Should be 0
2025-03-04  8:32:45 502 [Warning] db_webqtl.ProbeSetData: Found       2205 deleted blocks       Should be: 0
2025-03-04  8:32:45 502 [ERROR] Got an error from thread_id=502, ./storage/myisam/ha_myisam.cc:1120
2025-03-04  8:32:45 502 [ERROR] MariaDB thread id 502, OS thread handle 139625162532544, query id 837999 localhost webqtlout Checking table
CHECK TABLE ProbeSetData
2025-03-04  8:34:02 79695 [ERROR] mariadbd: Table './db_webqtl/ProbeSetData' is marked as crashed and should be repaired
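MyISAM tables can also be checked (and, if need be, repaired) offline with myisamchk against the .MYI files; a sketch, to be run only with mariadbd stopped:

# extended check of the MyISAM index and data files
myisamchk -e /var/lib/mysql/db_webqtl/ProbeSetData.MYI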

See also

Tux04 will require open heart 'disk controller' surgery and some severe testing before we move back. We'll also look at tux05-8 to see if they have similar problems.

Recovery

According to the logs tux04 started showing serious errors on March 2nd - when I introduced sanitizing the mariadb backup:

Mar 02 05:00:42 tux04 kernel: I/O error, dev sde, sector 2071078320 op 0x0:(READ) flags 0x80700 phys_seg 16 prio class 2
Mar 02 05:00:58 tux04 kernel: I/O error, dev sde, sector 2083650928 op 0x0:(READ) flags 0x80700 phys_seg 59 prio class 2
...

The log started on Feb 23 when we had our last reboot. It probably is a good idea to turn on persistent logging (see the journald sketch below)! Anyway, it is likely the files were fine until March 2nd. Similarly the mariadb logs also show:

2025-03-02  6:53:52 489007 [ERROR] mariadbd: Index for table './db_webqtl/ProbeSetData.MYI' is corrupt; try to repair it
2025-03-02  6:53:52 489007 [ERROR] db_webqtl.ProbeSetData: Can't read key from filepos: 2269659136

So, if we can restore a backup from March 1st we should be reasonably confident it is sane.
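On the persistent logging mentioned above, a minimal journald sketch:

# keep the systemd journal across reboots
mkdir -p /var/log/journal
sed -i 's/^#*Storage=.*/Storage=persistent/' /etc/systemd/journald.conf
systemctl restart systemd-journald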

The first step is to back up the existing database(!). Next, restore the new DB by changing the DB location (symlink in /var/lib/mysql; also check /etc/mysql/mariadb.cnf).
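A sketch of that switch-over (the path of the restored copy is hypothetical):

# point mariadb at the restored copy
systemctl stop mariadb
mv /var/lib/mysql /var/lib/mysql.corrupt
ln -s /export/mysql-restore /var/lib/mysql
# double-check datadir in /etc/mysql/mariadb.cnf (and its conf.d includes)
systemctl start mariadb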

When upgrading it is a good idea to switch on these settings in mariadb.cnf:

# forcing recovery with these two lines:
innodb_force_recovery=3
innodb_purge_threads=0

Make sure to disable (and restart) once it is up and running!

So the steps are:

  • [X] install updated guix version of mariadb in /usr/local/guix-profiles (don't use Debian!!)
  • [X] repair borg backup
  • [X] Stop old mariadb (on new host tux02)
  • [X] backup old mariadb database
  • [X] restore 'sane' version of DB from borg March 1st (see the borg sketch after this list)
  • [X] point to new DB in /var/lib/mysql and cnf file
  • [X] update systemd settings
  • [X] start mariadb new version with recovery setting in cnf
  • [X] check logs
  • [X] once running revert on recovery setting in cnf and restart
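A sketch of the borg restore step (repository path and archive name are hypothetical):

# list the archives, then extract the March 1st database snapshot
borg list /export/backup/borg-repo
borg extract /export/backup/borg-repo::tux04-mariadb-2025-03-01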

Other servers

Jun 09 03:10:05 tux05 kernel: blk_update_request: I/O error, dev sda, sector 2364120624 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 0
Dec 13 12:31:55 tux06 kernel: I/O error, dev sdb, sector 3909239837 op 0x9:(WRITE_ZEROES) flags 0x8000000 phys_seg 0 prio class 2