It happens, be prepared!
Yes, sometimes what (or who) you trust betrays you!
It happens to hard disks too, more often than you would expect.
But the risk of data (and money) loss can be dramatically reduced
by using some kind of redundancy, i.e. a RAID array. And it works!
Here I report my experience on my LAN server,
on which I set up two 120 GB hard disks configured as a RAID-1 (mirroring) array using Linux software RAID.
M/B       | ABIT KT7A-RAID (with HighPoint HPT370 onboard IDE controller)
Processor | AMD Duron 850 MHz
Memory    | 640 MB SDRAM
Storage   | IBM-DJNA-371350, QUANTUM FIREBALL CR8.4A, Maxtor 6Y120L0, Maxtor 6Y120P0
System    | SuSE Linux 8.1, Linux kernel 2.4.19-4GB
The DJNA is used as the system disk, while the two Maxtor units are configured in mirroring using RAID-1.
Originally they were two 6Y120L0, bought together, apparently from an unlucky series, as both died: the
first after just a week (and I replaced it with the 6Y120P0), the second after one year and ten days.
Here is how /etc/raidtab
looks:
raiddev /dev/md0
raid-level 1
nr-raid-disks 2
nr-spare-disks 0
persistent-superblock 1
chunk-size 4
device /dev/hde1
raid-disk 0
device /dev/hdg1
raid-disk 1
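For reference, with the raidtools package an array described by this file would normally be created with something like:
# initialise /dev/md0 as described in /etc/raidtab (raidtools)
mkraid /dev/md0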
Both disks are partitioned into one big partition of type 0xfd
(raid autodetect):
Disk /dev/hdg: 16 heads, 63 sectors, 238216 cylinders
Units = cylinders of 1008 * 512 bytes
Device Boot Start End Blocks Id System
/dev/hdg1 1 238216 120060832+ fd Linux raid autodetect
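Output like the above can be printed at any time with:
# list the partition table of the second mirror disk
fdisk -l /dev/hdg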
It would have been wiser not to use the whole disk space for the partition,
keeping it a few MB smaller. Suppose a disk fails and it is impossible to find another
one of the same model: a disk of the same nominal size but from a different manufacturer or series
may not be exactly the same size,
and due to a different geometry it may be slightly smaller,
preventing us from using it to replace the failed one and forcing us to buy a bigger (and more expensive) one.
A partition to be added as a spare disk to a RAID-1 array must be at least as large as the
existing RAID-1 array. Partitions taking part in a RAID-1 array can be of different sizes,
but the size of the resulting device will be equal to the size of the smallest one,
so we can add a partition bigger than the array's size, but not a smaller one.
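As an illustration only (not what was done here), a new disk could be partitioned slightly short of its full capacity with sfdisk; the size below is a made-up figure for a ~120 GB drive:
# create one partition of type 0xfd, about 150 MB smaller than the disk,
# so a slightly smaller replacement disk can still join the array later
# (-uM = sizes in megabytes; adjust the figure to the actual disk)
echo "0,117100,fd" | sfdisk -uM /dev/hde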
One of the two hard disks of the array (/dev/hde
) crashed abruptly, with mechanical noise.
The fault was detected by the system and the disk was removed from the array.
The RAID device continued to work, but in degraded mode, with only one disk.
If a spare disk had been available, reconstruction of the array would have
begun at once.
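For the record, a spare is declared in /etc/raidtab with an extra device entry; a sketch assuming a hypothetical third disk /dev/hdi:
raiddev /dev/md0
raid-level 1
nr-raid-disks 2
nr-spare-disks 1
persistent-superblock 1
chunk-size 4
device /dev/hde1
raid-disk 0
device /dev/hdg1
raid-disk 1
# hypothetical third disk used as hot spare
device /dev/hdi1
spare-disk 0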
Here is the sequence of messages from syslog,
from the detection of the disk failure to the
removal of the failed disk from the array:
Oct 17 17:24:02 starbase kernel: hde: timeout waiting for DMA
Oct 17 17:24:02 starbase kernel: hde: 0 bytes in FIFO
Oct 17 17:24:02 starbase kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Oct 17 17:24:02 starbase kernel: hde: status timeout: status=0xd0 { Busy }
Oct 17 17:24:02 starbase kernel: hde: drive not ready for command
Oct 17 17:24:37 starbase kernel: ide2: reset timed-out, status=0x80
Oct 17 17:24:37 starbase kernel: hde: status timeout: status=0x80 { Busy }
Oct 17 17:24:37 starbase kernel: hde: drive not ready for command
Oct 17 17:25:07 starbase kernel: end_request: I/O error, dev 21:01 (hde), sector 199141632
Oct 17 17:25:07 starbase kernel: end_request: I/O error, dev 21:01 (hde), sector 199141640
Oct 17 17:25:07 starbase kernel: end_request: I/O error, dev 21:01 (hde), sector 199141648
...
...
Oct 17 17:25:07 starbase kernel: end_request: I/O error, dev 21:01 (hde), sector 199143760
Oct 17 17:25:07 starbase kernel: end_request: I/O error, dev 21:01 (hde), sector 199143768
Oct 17 17:25:07 starbase kernel: end_request: I/O error, dev 21:01 (hde), sector 119544
Oct 17 17:25:07 starbase kernel: raid1: hde1: rescheduling block 119544
Oct 17 17:25:07 starbase kernel: md: updating md0 RAID superblock on device
Oct 17 17:25:07 starbase kernel: md: hdg1 [events: 00000277]<6>(write) hdg1's sb offset: 120060736
Oct 17 17:25:07 starbase kernel: md: recovery thread got woken up ...
Oct 17 17:25:07 starbase kernel: md0: no spare disk to reconstruct array! -- continuing in degraded mode
Oct 17 17:25:07 starbase kernel: md: recovery thread finished ...
Oct 17 17:25:07 starbase kernel: end_request: I/O error, dev 21:01 (hde), sector 199143256
Oct 17 17:25:07 starbase kernel: end_request: I/O error, dev 21:01 (hde), sector 199143776
Oct 17 17:25:08 starbase kernel: md: (skipping faulty hde1 )
Oct 17 17:25:08 starbase kernel: raid1: hdg1: redirecting sector 119544 to another mirror
The system continued to work perfectly and the fault was handled transparently.
/proc/mdstat
reported that only one disk of the array was still active:
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 hdg1[1]
120060736 blocks [2/1] [_U]
At this point the system was shut down and the faulty drive was replaced.
The downtime was just the time needed to replace the broken disk,
that is, about 5 minutes.
After the reboot the RAID device was working exactly as before, still in degraded mode.
A surface scan of the new disk was made to check it.
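A read-only scan can be done with something like:
# non-destructive read test of the whole disk, with progress output
badblocks -sv /dev/hde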
The next step was to repartition the new disk with a partition of the same size and
type as on the other disk (see above). Then the partition was added to the
RAID array with the command:
raidhotadd /dev/md0 /dev/hde1
The disk was integrated into the array and reconstruction began immediately.
Meanwhile the RAID device was still perfectly working and usable.
/proc/mdstat
reported the progress of the reconstruction.
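The reconstruction can be followed continuously with something like:
# re-read /proc/mdstat every 30 seconds
watch -n 30 cat /proc/mdstat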
Here are some snapshots of the process:
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 hde1[2] hdg1[1]
120060736 blocks [2/1] [_U]
[>....................] recovery = 0.8% (981952/120060736) finish=70.5min speed=28134K/sec
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 hde1[2] hdg1[1]
120060736 blocks [2/1] [_U]
[===>.................] recovery = 19.4% (23349504/120060736) finish=56.8min speed=28343K/sec
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 hde1[2] hdg1[1]
120060736 blocks [2/1] [_U]
[================>....] recovery = 82.9% (99550592/120060736) finish=11.9min speed=28667K/sec
Once the reconstruction finished, the RAID configuration in the superblocks of the
devices participating in the array was updated, and the new disk was fully integrated into the array:
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 hde1[0] hdg1[1]
120060736 blocks [2/2] [UU]
unused devices: <none>
Here is an extract from the syslog
messages generated by the kernel md driver
during the reconstruction of the array:
Oct 19 00:18:56 starbase kernel: md: trying to hot-add hde1 to md0 ...
Oct 19 00:18:56 starbase kernel: md: bind<hde1,2>
Oct 19 00:18:56 starbase kernel: RAID1 conf printout:
Oct 19 00:18:56 starbase kernel: --- wd:1 rd:2 nd:1
Oct 19 00:18:56 starbase kernel: disk 0, s:0, o:0, n:0 rd:0 us:1 dev:[dev 00:00]
Oct 19 00:18:56 starbase kernel: disk 1, s:0, o:1, n:1 rd:1 us:1 dev:hdg1
Oct 19 00:18:56 starbase kernel: disk 2, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Oct 19 00:18:56 starbase kernel: disk 3, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
...
...
Oct 19 00:18:56 starbase kernel: disk 25, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Oct 19 00:18:56 starbase kernel: disk 26, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Oct 19 00:18:56 starbase kernel: md: updating md0 RAID superblock on device
Oct 19 00:18:56 starbase kernel: md: hde1 [events: 00000282]<6>(write) hde1's sb offset: 120060736
Oct 19 00:18:56 starbase kernel: md: hdg1 [events: 00000282]<6>(write) hdg1's sb offset: 120060736
Oct 19 00:18:56 starbase kernel: md: recovery thread got woken up ...
Oct 19 00:18:56 starbase kernel: md0: resyncing spare disk hde1 to replace failed disk
Oct 19 00:18:56 starbase kernel: RAID1 conf printout:
Oct 19 00:18:56 starbase kernel: --- wd:1 rd:2 nd:2
Oct 19 00:18:56 starbase kernel: disk 0, s:0, o:0, n:0 rd:0 us:1 dev:[dev 00:00]
Oct 19 00:18:56 starbase kernel: disk 1, s:0, o:1, n:1 rd:1 us:1 dev:hdg1
Oct 19 00:18:56 starbase kernel: disk 2, s:1, o:0, n:2 rd:2 us:1 dev:hde1
Oct 19 00:18:56 starbase kernel: disk 3, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Oct 19 00:18:56 starbase kernel: disk 4, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
...
...
Oct 19 00:18:56 starbase kernel: disk 25, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Oct 19 00:18:56 starbase kernel: disk 26, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Oct 19 00:18:56 starbase kernel: md: syncing RAID array md0
Oct 19 00:18:56 starbase kernel: md: minimum _guaranteed_ reconstruction speed: 100 KB/sec/disc.
Oct 19 00:18:56 starbase kernel: md: using maximum available idle IO bandwith
(but not more than 100000 KB/sec) for reconstruction.
Oct 19 00:18:56 starbase kernel: md: using 508k window, over a total of 120060736 blocks.
...
...
Oct 19 01:31:27 starbase kernel: md: md0: sync done.
Oct 19 01:31:27 starbase kernel: RAID1 conf printout:
Oct 19 01:31:27 starbase kernel: --- wd:1 rd:2 nd:2
Oct 19 01:31:27 starbase kernel: disk 0, s:0, o:0, n:0 rd:0 us:1 dev:[dev 00:00]
Oct 19 01:31:27 starbase kernel: disk 1, s:0, o:1, n:1 rd:1 us:1 dev:hdg1
Oct 19 01:31:27 starbase kernel: disk 2, s:1, o:1, n:2 rd:2 us:1 dev:hde1
Oct 19 01:31:27 starbase kernel: disk 3, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Oct 19 01:31:27 starbase kernel: disk 4, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
...
...
Oct 19 01:31:27 starbase kernel: disk 25, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Oct 19 01:31:27 starbase kernel: disk 26, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Oct 19 01:31:27 starbase kernel: md: updating md0 RAID superblock on device
Oct 19 01:31:27 starbase kernel: md: hde1 [events: 00000283]<6>(write) hde1's sb offset: 120060736
Oct 19 01:31:27 starbase kernel: md: hdg1 [events: 00000283]<6>(write) hdg1's sb offset: 120060736
Oct 19 01:31:27 starbase kernel: md: recovery thread finished ...
As a last step, fsck
was run on the filesystem on the RAID device,
which showed no corruption.
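Assuming an ext2/ext3 filesystem mounted on a hypothetical mount point /data, the check could look something like:
umount /data        # the filesystem must not be mounted while checking
e2fsck -f /dev/md0  # force a full check even if the filesystem looks clean
mount /data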
The whole recovery process was straightforward and painless.
The system was down only for the time needed to replace the failed disk,
while the RAID device kept working perfectly during the whole process.
RAID can help a lot in avoiding data loss, but it is not the solution to every
problem, and a good backup policy must always be adopted.