URGENT ADVICE SOUGHT: RAID 5 Degraded (Thanks!)
#1
Hi all,

Of course this happened on Saturday at 6:35PM, so I am a bit nervous. I cannot go out and buy an HDD immediately. I might have to wait until Monday.

So I initiated a file system check on my N7510, just as a precaution. After it finished I rebooted, and that is when the "adventure began".

After the reboot the N7510 started buzzing continuously and Disk 6's bottom light is flashing orange.

So the first thing I did was a Short SMART Test on Disk 6. It passed without problems: "Healthy after short test".

Then I disabled the buzzer. :x

I am currently doing a Long SMART Test.

When that completes I plan to do a Detect Bad Block test.

So what should I do next?

To me it looks like the disk has not failed but simply has a bad block on it. This is normal on disks and does not mean an outright failure, right?

1. Do I need to replace the disk?

2. Or could a simple reboot flush it out of the system? If it's simply a bad block, the HDD should stop using that block and use a spare block instead, no? (Obviously I need to wait for the tests to complete.)

3. Or should I shut down the NAS immediately, go out and buy another 4TB HDD, swap it in and rebuild the RAID before I lose another disk?

(Can I replace it with any other HDD on HCL? I was wondering if I should swap it with a supported 5TB or 6TB HDD? And then individually replace the remaining 6 HDDs with the new 5/6TB ones.)

4. Or should I not shut down no matter what and nervously wait till Monday to buy a replacement HDD and then do a hot-swap?

What I don't know is how Thecus's firmware O/S behaves.

Clearly the HDD has not catastrophically failed, but is the Thecus N7510 no longer using it? Is it no longer reading from or writing to it? If so, that seems WRONG, because I am now vulnerable to a true second HDD failure, and all because of a bad block.

So am I vulnerable to second disk failure now?

Obviously I am no longer going to copy files to it, etc.

For Disk 6 the web UI is showing the following:

Disk No.: 6
Model: ST4000DM000-1F21
Power On Hours: 10762 Hours
Temperature: 35°C/95°F 34°C/93.2°F(Last)
Reallocated Sector Count: 0 0(Last)
Current Pending Sector: 0 0(Last)
End-to-End error: 0 0(Last)

N7510:
RAID: 5
HDDs: 7 x Seagate 4TB HDDs (ST4000DM000). <-- They are on the HCL
Firmware: Should be latest: v2.05.06 (I cannot confirm as web screen is stuck on Long SMART Test.)

N7510 Emails:
6:33PM SMART(HD Tray No: 6) testing start. <-- Long SMART Test
6:28PM SMART(HD Tray No: 6) testing has completed, the result is normal. <-- Short SMART Test
6:27PM SMART(HD Tray No: 6) testing start. <-- Short SMART Test
6:18PM The RAID [RAID5] on system [N7510] is suffering from disk problem.
RAID status is DEGRADED now. However, the data access is still functional.
Please solve the disk problem (e.g. replacing with a new hard disk).
The system will bring RAID back to the healthy state automatically.

6:18PM Hard Disk 6 on [N7510] has an I/O error.
This hard disk might have bad sectors on it.
Please replace the hard disk as soon as possible.

6:17PM The system [N7510] boot successfully..
6:17PM The RAID [RAID5] on system [N7510] is healthy now.
6:15PM The system N7510 reboot.
6:02PM The FileSystem Check RAID [ 1,2,3,4,5,6,7 ] is Success and NO Error to be found.
5:49PM The system [N7510] boot successfully..
5:49PM The RAID [RAID5] on system [N7510] is healthy now.
5:47PM The system N7510 reboot.

N7510 Events Log:
Date Level Event
21/03/2015 18:18 Info [N7510] : User admin logged in from 192.168.0.9
21/03/2015 18:18 Error [N7510] : User admin logged in fail from 192.168.0.9
21/03/2015 18:17 Error [N7510] : The RAID [RAID5] on system [N7510] change to degrade mode.
21/03/2015 18:17 Error [N7510] : Disk 6 on [N7510] has failed.
21/03/2015 18:17 Info [N7510] : [N7510] boot successfully.
21/03/2015 18:17 Info [N7510] : Healthy: The RAID [RAID5] on system [N7510] is healthy now.
21/03/2015 18:15 Info [N7510] : The system N7510 reboot.
21/03/2015 18:01 Info [N7510] : End of FileSystem Check
21/03/2015 18:01 Info [N7510] : The FileSystem Check RAID [ 1,2,3,4,5,6,7 ] is Success and NO Error to be found.
21/03/2015 18:01 Info [N7510] : Filesystem Check : RAID [ 1,2,3,4,5,6,7 ] No errors..
21/03/2015 17:55 Info [N7510] : Starting FileSystem Check.
21/03/2015 17:49 Info [N7510] : [N7510] boot successfully.
21/03/2015 17:49 Info [N7510] : Healthy: The RAID [RAID5] on system [N7510] is healthy now.
21/03/2015 17:47 Info [N7510] : The system N7510 reboot.
21/03/2015 17:46 Info [N7510] : User admin logged in from 192.168.0.9
Reply
#3
I just thought of something...

Somewhere back in the Web UI, I noticed that Disk 6 has the Spare checkbox next to it, unlike the rest of the disks. I forgot where in the GUI...

So I am wondering if the N7510 firmware just "threw a dummy spit" and kicked out the disk for no "real good reason".

If so, perhaps all I need to do is add Disk 6 back as a Spare Disk and tell the N7510 to rebuild the RAID, if you know what I mean.

I got this idea from the Restoring Normal Operation After Degraded [N4800] discussion: http://forum.thecus.com/viewtopic.php?f=50&t=4748 where Thecus - Yvon posted:
Quote: Post by Thecus - Yvon » Sat Dec 01, 2012 11:09 am

Dear Sir,

About RAID rebuild:
Please kindly go to [Storage] -> [RAID Management] page, select your RAID and click [Edit] button.
In the RAID information window, please kindly check the "Spare" box on the HDD that should join the RAID, then click the [Add Spare] button, and the system will start to rebuild your RAID.

As I said above, Disk 6 passed the Short SMART Test. It is less than 2 years old. I bought all the disks on 28/May/2013, but configured the N7510 to spin them up and down on a schedule.

It is not showing any SMART errors:

SMART INFO
Info
Disk No.: 6
Model: ST4000DM000-1F21
Power On Hours: 10763 Hours
Temperature: 35°C/95°F 34°C/93.2°F(Last)
Reallocated Sector Count: 0 0(Last)
Current Pending Sector: 0 0(Last)
End-to-End error: 0 0(Last)
Test
Test Type: Long
Test Result: Testing (20%)
Test Time:
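
For anyone who wants to check these counters automatically rather than by eye, here is a minimal Python sketch. It assumes the web UI text keeps the "name: current previous(Last)" layout pasted above; the function names are made up for illustration and are not part of any Thecus tool. Note that all-zero counters only mean the drive has not logged reallocated or pending sectors; they do not explain the I/O error the NAS reported.

```python
import re

# Counter lines copied from the web UI's SMART INFO panel for Disk 6.
SMART_TEXT = """\
Reallocated Sector Count: 0 0(Last)
Current Pending Sector: 0 0(Last)
End-to-End error: 0 0(Last)
"""

# Matches "<name>: <current> <previous>(Last)" for integer counters only.
LINE_RE = re.compile(r"^(?P<name>[^:]+):\s*(?P<now>\d+)\s+(?P<last>\d+)\(Last\)\s*$")


def parse_counters(text):
    """Return {attribute name: (current, previous)} for each counter line."""
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counters[match["name"].strip()] = (int(match["now"]), int(match["last"]))
    return counters


def nonzero_counters(counters):
    """Names of counters whose current or previous value is above zero."""
    return [name for name, (now, last) in counters.items() if now > 0 or last > 0]


counters = parse_counters(SMART_TEXT)
print(counters)
print("Counters above zero:", nonzero_counters(counters))
```

Lines that are not plain integer counters (model, temperature, power-on hours) are simply skipped by the pattern.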

It is going to take about 10 hours for the Long SMART Test to complete. Or perhaps it was the Bad Block Test that took that long last time. I cannot remember which, as the SYSLOG only says that "testing started" and that it completed 10 hours later, without saying which test it was.

1. So I am still going to finish the Long SMART Test.
2. Then run the Bad Block test.
3. And then perhaps I can add this disk as a spare and rebuild the RAID.

This might be the answer?
Reply
#5
Hi platypus,

I'm interested to know if you fixed your RAID and what you learned along the way.

As a general rule, any hard drive that shows bad blocks should be replaced without hesitation; it's just not worth the risk of losing your valuable data. Bad blocks are not outright drive failure, but they are a sign that the drive is approaching end of life, and it should be replaced before the situation escalates to something more serious. In the past I have repaired bad blocks/sectors only for the drive to generate even more bad blocks/sectors, resulting in data loss or corruption.

So, to answer your questions I would do the following:

1. Do I need to replace the disk? - Replace the HDD, preferably with one of the same spec, although I haven't had any issues with larger supported drives.

2. Or could a simple reboot flush it out of the system? - Do not reboot; a hot-swap is the better solution.

3. Or should I shut down the NAS immediately, go out and buy another 4TB HDD, swap it in and rebuild the RAID before I lose another disk? - In future, always have a spare drive to hand. Do not shut down the NAS box; hot-swap the drive ASAP.

(Can I replace it with any other HDD on HCL? I was wondering if I should swap it with a supported 5TB or 6TB HDD? And then individually replace the remaining 6 HDDs with the new 5/6TB ones.)
Yes, you can replace the faulty HDD with a larger supported HDD. Replacing all the HDDs is a little more complicated because of the file system and RAID size, and will largely depend on how you have configured your array. If this is your long-term plan, I would back up your data and start over with fresh drives.

4. Or should I not shut down no matter what and nervously wait till Monday to buy a replacement HDD and then do a hot-swap? - Hold your nerve and keep the NAS box running.
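
On the larger-drive question, a rough capacity calculation may help. This is only a sketch of the standard RAID 5 rule (every member contributes as much as the smallest disk, and one disk's worth goes to parity). It assumes seven disks, as the RAID [ 1,2,3,4,5,6,7 ] log line suggests, and ignores file system overhead. Whether the Thecus firmware can actually grow the array after all disks are swapped is not something this shows.

```python
def raid5_usable_tb(disk_sizes_tb):
    """Usable capacity of a classic RAID 5 set: (n - 1) * smallest member."""
    if len(disk_sizes_tb) < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (len(disk_sizes_tb) - 1) * min(disk_sizes_tb)


scenarios = {
    "seven 4TB disks (current)": [4] * 7,
    "six 4TB disks + one 6TB replacement": [4] * 6 + [6],
    "seven 6TB disks (all replaced)": [6] * 7,
}

for label, disks in scenarios.items():
    print(f"{label}: {raid5_usable_tb(disks)} TB usable")
```

So a single 6TB replacement works but adds no usable space; the extra 2TB on that disk sits idle until every member is the larger size.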

What I don't know is how Thecus's Firmware O/S behaves?

Clearly the HDD has not catastrophically failed, but is the Thecus N7510 no longer using it? Is it no longer reading/writing to it? Which is WRONG because I am now vulnerable to a true second HDD failure in that case, and all because of a bad block. - The question here is the health of the RAID: if the RAID is healthy, the drive is still functioning and is still part of the array. If the RAID is degraded, the drive is no longer part of the array, and if a second drive becomes faulty you will suffer total data loss. The simple answer in both cases is to replace the drive ASAP.

Hopefully you managed to hot-swap the faulty drive and rebuild the RAID array, but please let the community know how you got on.
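
To illustrate why a degraded RAID 5 survives one missing disk but not two, here is a small Python sketch of XOR parity on a single stripe. It is a simplified model (six data blocks plus one parity block, with made-up 4-byte contents), not how the Thecus firmware is implemented.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe across a 7-disk set: six data blocks plus one parity block.
data = [bytes([i] * 4) for i in range(1, 7)]
parity = xor_blocks(data)
stripe = data + [parity]


def rebuild(stripe, missing):
    """Recompute the block at index `missing` from all surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != missing]
    return xor_blocks(survivors)


# Disk 6 (index 5) dropped out: its block is recoverable from the other six.
print("one disk missing, rebuilt correctly:", rebuild(stripe, 5) == stripe[5])

# With disks 3 and 6 both gone, the survivors only yield the XOR of the two
# missing blocks, which cannot be split back into the individual blocks.
survivors = [blk for i, blk in enumerate(stripe) if i not in (2, 5)]
combined = xor_blocks(survivors)
print("two disks missing, only their XOR is known:",
      combined == xor_blocks([stripe[2], stripe[5]]))
```

This is also why the array keeps serving data while degraded: every read of the missing disk's block is recomputed from the remaining members, with no redundancy left over.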

Regards Lee.
Reply