Degraded RAID Array: Symptoms, Diagnosis & Fixes

The safety net survived the fall — but it caught the drive, not you. The next one lands on concrete.

What It Is

A degraded RAID array is one that is still serving your data perfectly — and has quietly lost the one thing it was built to give you: a safety margin. How much margin depends on your RAID level:

RAID 0 (striping, no redundancy) — splits data across disks for speed, but any single disk failure destroys the entire array instantly. There is no degraded state — just dead. Don't use RAID 0 on a server; it turns every disk into a single point of failure for all your data.
RAID 1 (mirror) — two copies of every byte. Loses one disk, runs on the survivor. A second failure kills it.
RAID 5 (single parity) — needs at least 3 disks, survives exactly one loss. A second failure destroys the array.
RAID 6 (double parity) — survives two simultaneous failures. A single degraded disk still leaves you with one parity disk to spare — roughly the safety margin a healthy RAID 5 has.
RAID 10 (mirrored stripes) — each mirror pair tolerates one loss independently. The array only dies if both halves of the same pair go down.

All of these are explained in depth on the RAID page. Here's what matters right now: when a member drops out — fails, gets pulled, stops answering — the array does exactly what it promised: it keeps running on what's left. That's the good news, and it's genuinely good. The bad news is the part nobody feels, because nothing broke: your safety margin just shrank. On a RAID 1 or RAID 5 that margin is now zero — the very next failure means total loss. On a RAID 6 you still have one disk of headroom, but you're now where a healthy RAID 5 lives, which is not where you want to stay.

This is the single most dangerous state a storage system can be in, precisely because it doesn't hurt. A disk full screams. A failing disk throws errors. A degraded array does neither. The website loads, the database answers, df looks fine — and a four-disk RAID 5 that should read [UUUU] is sitting there reading [UUU_], one underscore from the edge, waiting for someone to notice. Most people notice when the second disk goes, which is to say: too late.

So let's frame the job clearly, because it's simpler than the panic suggests. A degraded array is not an emergency in the "the building is on fire" sense — your data is intact and reachable right now. It's an emergency in the "you have one parachute left and you're still falling" sense. The work is: confirm you're degraded, identify exactly which physical disk dropped, replace it without touching the good ones, and add the new disk so the array rebuilds itself back to full redundancy. By the end of this page you'll read /proc/mdstat and mdadm --detail like a map, name the failed member with certainty, and — the part that saves careers — know which disk not to pull. Then, at the end, the part that makes it all click: how parity actually lets four disks survive as three, which is one of the cleaner pieces of magic in all of computing.

How You Notice

A degraded array is the quiet one, so you mostly notice it through the one tool whose entire job is to tell you the array's state out loud. Here's each signal, with the command to see it on your own box right now. (Don't worry if the output looks cryptic at first glance — the How I Read It section below walks through every token line by line, and Reading It by Example shows the most common patterns.)

/proc/mdstat shows an underscore where a U should be. This is the rawest, most honest symptom there is — the kernel exposes the live state of every software array as a plain text file. Read it:
```
cat /proc/mdstat
```
In a healthy array, the status line ends with a string of Us, one per member, all up: [UU] for a two-disk mirror, [UUUU] for a four-disk array. A degraded array shows an underscore for each missing member: [U_], [UUU_], [_U]. That single character is the whole story. If /proc/mdstat lists no md devices at all (or you get "No such file"), there are no software arrays on this box, so there is nothing here to degrade.
mdadm --detail says "State : clean, degraded". The verbose view spells it out in words and, crucially, names the slot that's now empty:
```
mdadm --detail /dev/md0
```
Look for the State : line — clean, degraded (or active, degraded) — and a member listed as removed or faulty in the device table at the bottom. This is the command that turns "a disk is gone" into "this disk is gone."
A drive vanished from the kernel log. When a member fails, the kernel narrates it. Look:
```
dmesg -T | grep -iE "md/raid|kicking|Disk failure|I/O error|ata[0-9]"
journalctl -k | grep -iE "md/raid|fail"
```
Lines like md/raid1:md0: Disk failure on sdb1, disabling device name the exact moment and member the array gave up on — often alongside the I/O error lines from the dying disk that triggered it.
An mdadm --monitor email — if you set it up. The mdadm daemon can watch your arrays and email root the instant a member fails (Fail and DegradedArray events). Most boxes never turn it on, which is exactly why so many people meet their degraded array weeks late, when the second disk follows the first. We'll fix that at the end.

Any one of these means the same thing: you have lost a disk and your safety margin has shrunk. Don't wait for the array to "settle" — degraded is the settled state, and it stays that way forever until you rebuild it.
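If you'd rather not eyeball the file on every box, the underscore check is easy to script. Here's a minimal sketch, assuming the usual mdstat layout (an "mdN : ..." line followed by a status line that ends in the [UU]-style map). degraded_arrays is a made-up helper name, not an mdadm feature, and it reads mdstat text on stdin so you can try it on a saved copy first.

```shell
# Hypothetical helper: print the name of every array whose status map
# contains at least one underscore. Reads mdstat-formatted text on stdin.
degraded_arrays() {
    awk '
        /^md[^ ]* : / { name = $1 }                        # remember the current array
        /\[[U_]*_[U_]*\]/ && name != "" { print name }     # a map such as [U_UU]
    '
}
```

Run it as degraded_arrays < /proc/mdstat. No output means no array is showing an underscore.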

How I Read It

Two sources tell you everything, and you read them in order: /proc/mdstat for the shape of the damage, then mdadm --detail for the name of the culprit. Start with mdstat, because it's instant and it's the truth: it's the kernel's own live scoreboard, not a tool's interpretation of it.

First, so we have something to compare against, here's a healthy array — a real four-disk RAID 5 from one of our backup servers, all four members present and correct:

```
Personalities : [raid1] [raid6] [raid5] [raid4]
md3 : active raid5 sdc4[2] sdd4[4] sdb4[1] sda4[0]
      5860147200 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]

md0 : active raid1 sdb1[1] sda1[0]
      33520640 blocks super 1.2 [2/2] [UU]
```

Read that bottom-right corner like a status light. [4/4] means four members expected, four present; [UUUU] means all four up. The mirror below it: [2/2] [UU], two for two, both up. This is what calm looks like — and it's what the vast majority of your arrays look like, including the RAID 1 NVMe mirrors that most servers boot from: [2/2] [UU], copy and copy, either disk can die and you'd never know.

Now the degraded version — the same kind of array after one member has dropped. This is the real shape of trouble, straight from the kernel's RAID documentation:

```
Personalities : [raid1] [raid6] [raid5] [raid4]
md3 : active raid5 sdc4[2] sdd4[4] sda4[0]
      5860147200 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [U_UU]

md0 : active raid1 sda1[0]
      33520640 blocks super 1.2 [2/1] [U_]
```

Walk it line by line, because every token is load-bearing:

md3 : active means the array is active. It's serving reads and writes right now. Degraded does not mean offline. (If this said inactive, you'd have a different, worse problem; see RAID array offline.)
raid5 sdc4[2] sdd4[4] sda4[0] is the level, then the members still present, each with its device number in brackets. That number usually matches the raid slot but doesn't have to: sdd4[4] sits in slot 3 of this four-slot array, as the mdadm --detail output further down shows. Compare to the healthy line above: it listed sdb4[1] too. Here sdb4[1] is gone from the list entirely. That's your failed disk, named by its absence. The slot it held, slot 1, is the empty one.
[4/3] means four members configured, only three present. The missing one is the gap between those two numbers, and it's the whole reason this page exists.
[U_UU] is the live map, one character per slot in order. U is up, _ is the hole. Read left to right: slot 0 up, slot 1 missing, slot 2 up, slot 3 up. The underscore sits exactly where sdb4[1] used to be. The position of the underscore is the slot number of the dead disk, counting from zero.

The mirror tells the same story in miniature: [2/1] [U_] means two configured, one present, and slot 1 is the hole, the slot sdb1[1] held in the healthy listing. One half of the mirror is gone; the survivor, sda1, is carrying the whole load alone.
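The count-from-zero rule is the one people slip on, so here's a small sketch that does the counting for you. missing_slots is a hypothetical helper, not part of mdadm; it takes the map exactly as mdstat prints it, brackets included.

```shell
# Hypothetical helper: print the zero-based slot numbers that show an
# underscore in an mdstat status map such as "[U_UU]".
missing_slots() {
    printf '%s\n' "$1" | awk '{
        gsub(/\[|\]/, "")                       # strip the surrounding brackets
        out = ""
        for (i = 1; i <= length($0); i++)
            if (substr($0, i, 1) == "_")
                out = out (out == "" ? "" : " ") (i - 1)   # awk counts from 1, slots from 0
        print out
    }'
}
```

missing_slots '[U_UU]' prints 1, and missing_slots '[UU__]' prints 2 3. An all-U map prints an empty line.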

That's the shape. Now get the name, because sdb4 versus sda4 is the difference between fixing the problem and causing a second one. Ask mdadm for the full detail:

```
mdadm --detail /dev/md3
```

```
/dev/md3:
           Version : 1.2
     Creation Time : Tue Mar 19 02:41:08 2024
        Raid Level : raid5
        Array Size : 5860147200 (5.46 TiB 6.00 TB)
     Used Dev Size : 1953382400 (1.82 TiB 2.00 TB)
      Raid Devices : 4
     Total Devices : 3
       Persistence : Superblock is persistent

       Update Time : Sat Jun  7 03:14:52 2026
             State : clean, degraded
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

              Name : fileserver-01:3
              UUID : a3f1c2d4:5e6f7a8b:9c0d1e2f:3a4b5c6d
            Events : 184237

    Number   Major   Minor   RaidDevice State
       0       8        4        0      active sync   /dev/sda4
       -       0        0        1      removed
       2       8       36        2      active sync   /dev/sdc4
       4       8       52        3      active sync   /dev/sdd4
```

Here's how I read this, top to bottom:

State : clean, degraded — the verdict in words. clean means there's no pending resync, the data is consistent; degraded means a member is missing. (You'll also see active, degraded if the array was mid-write — same meaning, the array just hadn't flushed when it lost the disk.)
Raid Devices : 4 vs Total Devices : 3 — four slots, three disks. The arithmetic of the missing member, spelled out. Failed Devices : 0 here is a quirk worth knowing: the disk wasn't just failed, it was kicked out and the slot removed, so it no longer counts as a failed device — it counts as an absence. (If the disk had errored but not yet been removed, you'd see Failed Devices : 1 and a faulty line instead.)
The device table is the payoff. Four RaidDevice slots, 0 through 3. Slots 0, 2, 3 read active sync and name a real device: /dev/sda4, /dev/sdc4, /dev/sdd4. Slot 1 reads removed with a dash for its Number — that's the hole. The array is telling you, in plain text: the disk that belongs in raid-slot 1 is gone. From the healthy listing you know that slot was sdb4[1] — so the failed physical disk is /dev/sdb.

And there it is: /dev/sdb is the disk to replace, with certainty, in writing, from two independent sources that agree. Do not move to the fix until both sources name the same disk. Guessing here is how good disks get pulled.
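If you want the device table read for you, the same logic fits in a few lines of awk. This is a sketch that assumes the column layout shown in the output above (Number, Major, Minor, RaidDevice, State); empty_slots is a made-up name, and it only reports what mdadm already printed.

```shell
# Hypothetical helper: read `mdadm --detail` output on stdin and report
# slots marked "removed" and members marked "faulty" in the device table.
empty_slots() {
    awk '
        $1 == "Number" && $4 == "RaidDevice" { in_table = 1; next }   # table header
        in_table && $5 == "removed" { print "empty raid slot: " $4 }
        in_table && $5 == "faulty"  { print "faulty member: " $NF }
    '
}
```

Use it as mdadm --detail /dev/md3 | empty_slots. On the output above it prints empty raid slot: 1, which is the number you then match against the healthy listing.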

Danger

The number-one way to turn a recoverable degraded array into a destroyed one is pulling the wrong drive. Linux device names (/dev/sda, /dev/sdb) are assigned at boot and can reorder between reboots — the sdb you read today may be a different physical bay tomorrow, and is not etched on the drive caddy. Before you touch a single screw, tie the failed mdadm slot to a physical disk by its serial number — smartctl -i /dev/sdb or lsblk -o NAME,SERIAL,MODEL — write that serial down, and pull the drive whose serial matches. On a hosted box, give the serial to your provider and let them pull it. And do not reboot to "see if it comes back": a degraded array survives reboots fine, but a reboot can renumber your disks mid-crisis and is the last thing you want while you're already down a member.

Reading It by Example

Train the pattern-match. The mdstat readout on the left, what I'd conclude on the right:

md0 : active raid1 ... [2/2] [UU] → Healthy mirror, both copies up. Nothing to do. The happy, and by far most common, case — most of your arrays look exactly like this.
md3 : active raid5 ... [4/4] [UUUU] → Healthy parity array, all four present and full redundancy. Calm.
md0 : active raid1 ... [2/1] [U_] → Degraded mirror. One copy gone, running on the survivor alone. Still serving data, but a single disk failure now means total loss. Identify the dead half, replace it, mdadm --add.
md3 : active raid5 ... [4/3] [UUU_] → Degraded RAID 5. Slot 3 missing. Whatever lived on the missing member is now recomputed from the surviving data and parity on every read that touches it, instead of being read straight off a disk. It works, but it's slower and naked. One more disk and the array is gone. Replace the named disk now, not this weekend.
md3 : active raid6 ... [4/3] [UUU_] → Degraded RAID 6, but still has one parity disk to spare — RAID 6 survives two failures, so a single missing member leaves you at the redundancy a healthy RAID 5 has. Less of a knife-edge, but fix it anyway: you're now where you never wanted to be permanently.
md3 : active raid5 ... [4/2] [UU__] → Two members gone from a RAID 5. This array is failed, not degraded — RAID 5 cannot survive two losses. Your data is offline. Stop, don't write anything, and go to RAID array offline and your backup.
md3 : active raid5 sdb4[1](F) ... [4/3] [U_UU] → A member marked (F) for faulty — the kernel has flagged it but not yet dropped it. Same situation as removed, one step earlier; the disk is on its way out. Confirm with smartctl (failing disk) and replace it.
md3 : ... [4/3] [U_UU] recovery = 34.5% (...) → Already rebuilding — someone (or a hot-spare) added a disk and the array is reconstructing. That's the destination, not a problem; see RAID rebuilding.
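The healthy / degraded / failed call in those examples is plain arithmetic on the level and the [n/m] pair. Here's a rough sketch of it. array_health is a hypothetical helper, and it deliberately handles only raid1, raid4, raid5 and raid6: RAID 10 depends on which disks are missing, so a bare count can't decide it.

```shell
# Hypothetical helper: array_health LEVEL CONFIGURED PRESENT
# e.g. array_health raid5 4 3   (from "raid5 ... [4/3]")
array_health() {
    case $1 in
        raid1)       tolerance=$(( $2 - 1 )) ;;   # a mirror survives on its last copy
        raid4|raid5) tolerance=1 ;;               # single parity
        raid6)       tolerance=2 ;;               # double parity
        *)           echo "unknown level: $1"; return 0 ;;
    esac
    missing=$(( $2 - $3 ))
    if [ "$missing" -le 0 ]; then
        echo "healthy"
    elif [ "$missing" -le "$tolerance" ]; then
        echo "degraded: can survive $(( tolerance - missing )) more"
    else
        echo "failed"
    fi
}
```

array_health raid5 4 3 prints degraded: can survive 0 more, array_health raid6 4 3 prints degraded: can survive 1 more, and array_health raid5 4 2 prints failed.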

How to Fix It

The fix is a clean sequence: back up first, then four steps: identify, fail and remove, swap, rebuild. Done in that order, a degraded array is a chore. Done out of order, especially skipping the backup or fumbling the identification, it's how data dies. Let's walk it.

Danger

While the array is degraded it has reduced or zero redundancy — on RAID 1/5, a second disk failure now means total, unrecoverable loss, and the rebuild you're about to start is the single most strenuous thing you'll ever ask of the surviving disks (every one gets read cover to cover). Tired, same-age siblings sometimes pick exactly that moment to fail too. So before you add anything: confirm your backup is current and restorable. If you can't, take a fresh one now, while the data is still reachable. No rebuild is more urgent than the backup that makes a second failure survivable instead of fatal.

Step 1 — Identify the failed disk by serial, not by sdX. You already named the slot from mdadm --detail (raid-slot 1 → /dev/sdb). Now pin that name to a physical drive:

```
lsblk -o NAME,SERIAL,MODEL,SIZE
smartctl -i /dev/sdb
```

Write the serial number down. That is what you (or your hoster) pull — never the bay you assume sdb lives in.
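Because the sdX name can move and the serial can't, it helps to be able to look the mapping up in both directions. A minimal sketch, assuming you feed it the two-column output of lsblk -dn -o NAME,SERIAL (whole disks only, no header); serial_for and device_for are made-up helper names.

```shell
# Hypothetical helpers. Both read "NAME SERIAL" pairs on stdin,
# as printed by: lsblk -dn -o NAME,SERIAL

# Print the serial recorded for a device name such as "sdb".
serial_for() {
    awk -v dev="$1" '$1 == dev { print $2 }'
}

# Print the device name that currently carries a given serial.
device_for() {
    awk -v serial="$1" '$2 == serial { print $1 }'
}
```

Before the swap, lsblk -dn -o NAME,SERIAL | serial_for sdb gives you the serial to write down. After any reboot, pipe the same lsblk output into device_for with that serial to see which name the disk has now.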

Step 2 — Fail and remove the member from the array (if it isn't already removed). A disk the kernel marked (F)/faulty is still nominally a member; tell mdadm to let it go cleanly:

```
mdadm /dev/md3 --fail /dev/sdb4
mdadm /dev/md3 --remove /dev/sdb4
```

(If the slot already shows removed in the detail output, the disk dropped itself and you can skip straight to the swap.) Marking it failed first is the polite, deterministic way — it ensures the array isn't mid-write to a dying disk when you yank it.

Step 3 — Physically swap the disk. Pull the drive whose serial matches what you wrote down, slot in a replacement of equal or greater capacity, and let the kernel see it. On hot-swap hardware that's a clean pull-and-insert; on a hosted box, hand your provider the serial and let them do it. Then partition the new disk to exactly match its siblings — a RAID member is usually a partition (sdb4), not a whole disk, so the new partition table must line up. The classic one-liner copies the layout from a surviving disk:

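```
# Source is the surviving disk (/dev/sda); target is the new, empty disk (/dev/sdb).
# Getting these backwards overwrites a good disk's partition table.
sfdisk -d /dev/sda | sfdisk /dev/sdb     # MBR: dump sda's table, write it to sdb
sgdisk /dev/sda -R /dev/sdb && sgdisk -G /dev/sdb   # GPT: replicate sda's table onto sdb, then give sdb new GUIDs
```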

Step 4 — Add the new member and let it rebuild. One command, and the array starts pulling itself back to full redundancy:

```
mdadm /dev/md3 --add /dev/sdb4
```

The instant you run that, /proc/mdstat sprouts a recovery line and a progress bar:

```
md3 : active raid5 sdb4[5] sdc4[2] sdd4[4] sda4[0]
      5860147200 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [U_UU]
      [=====>...............]  recovery = 27.8% (543042176/1953382400) finish=131.9min speed=178291K/sec
```

Note the new disk gets a fresh device number (sdb4[5]). That bracketed number is not the raid slot: the rebuild is filling raid-slot 1, and mdadm --detail will show the new member there once it finishes. The [U_UU] map will flip the underscore back to a U the moment recovery completes, and you're whole again. That rebuild deserves its own page (what's safe to do during it, why it's the riskiest hour, how long it takes): RAID rebuilding.
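If you want the progress as one short line per array (for a status script, say) rather than watching the bar, the recovery line splits cleanly on whitespace. A sketch, assuming the line format shown above; rebuild_progress is a hypothetical helper and reads mdstat text on stdin.

```shell
# Hypothetical helper: for each array that is rebuilding or resyncing,
# print "<array> <percent> <estimated time left>".
rebuild_progress() {
    awk '
        /^md[^ ]* : / { name = $1 }                   # remember the current array
        /(recovery|resync) *=/ {
            pct = ""; eta = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/)       pct = $i             # e.g. 27.8%
                if ($i ~ /^finish=/) eta = substr($i, 8)  # text after "finish="
            }
            print name, pct, eta
        }
    '
}
```

Run it as rebuild_progress < /proc/mdstat. On the snapshot above it prints md3 27.8% 131.9min, and it prints nothing once no array is rebuilding.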

Pro Tip

If the "failed" disk has no real defect — it dropped out from a cable wobble, a CRC storm, or a transient bus reset rather than dying sectors — you don't necessarily need a new drive at all. Re-add the same member with mdadm /dev/md3 --re-add /dev/sdb4; if the array still has its write-intent bitmap, it rebuilds only the blocks that changed while the disk was away, which can finish in seconds instead of hours. But confirm the disk is actually healthy first with smartctl -a — re-adding a genuinely failing disk just buys you the same outage again, soon.

How to Avoid It

You cannot avoid losing disks — that's physics, and a few years of bare metal makes it a certainty. What you can avoid is a disk loss turning into a silent, redundancy-less limbo that lasts until the second disk lands the killing blow. The goal isn't "never degrade"; it's "be told the instant you do, and be back to full redundancy fast." A short list, in order of importance:

Backup. RAID is not a backup, and believing it is has ended more companies than disk failure ever did. RAID survives a dead disk; it does nothing against rm -rf, a bad deploy, ransomware, or a fire — every one of which it faithfully mirrors to all your disks at once, instantly. A degraded array is the moment you'll be grateful a real backup exists, because it's the moment you're one failure from needing it.
A hot spare. Add a spare disk to the array (mdadm --add an extra one as a spare) and mdadm rebuilds onto it automatically the instant a member fails — turning the degraded window from "however long until a human notices" into "minutes." On parity arrays especially, an idle spare is the cheapest insurance you can buy.
mdadm --monitor. Run the monitor daemon so a failed member emails you immediately — mdadm --monitor --scan --mail you@example.com. The whole danger of this problem is that it's silent; the monitor is what gives it a voice.
Periodic scrubs. Schedule a check action (echo check > /sys/block/md3/md/sync_action, or the distro's mdadm cron job) so the array reads every block across every member on a timer. A scrub finds a silently rotting sector on one disk while the others can still rebuild it — catching the rot before it becomes the second failure.
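A scrub only helps if someone looks at what it found. The md driver exposes the current action and a mismatch counter next to the sync_action file mentioned above. Here's a small sketch that reads both. scrub_status is a made-up helper, and the optional second argument (an alternate sysfs root) exists only so you can test it against a dummy directory.

```shell
# Hypothetical helper: scrub_status ARRAY [SYSFS_ROOT]
# Prints the array's current sync action and its mismatch counter.
scrub_status() {
    dir="${2:-/sys}/block/$1/md"
    printf '%s: sync_action=%s mismatch_cnt=%s\n' "$1" \
        "$(cat "$dir/sync_action")" "$(cat "$dir/mismatch_cnt")"
}
```

scrub_status md3 while a check is running reports sync_action=check; once it's back to idle, a mismatch_cnt above zero means the scrub found blocks that didn't agree across members. Treat that as a prompt to investigate rather than a verdict: some mismatches are harmless, but it's the kind of number you want to hear about before a rebuild, not during one.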

Note

The cruelest property of RAID 5 is the correlated second failure during rebuild. Disks bought as a batch and aged in the same hot rack like to fail near each other; the rebuild then hammers every survivor end-to-end, and a tired sibling picks that moment to throw an unreadable sector — which, on RAID 5, means the rebuild fails and the array is lost. This is exactly why large arrays run RAID 6 (two parity disks, survives two failures) instead of RAID 5, and why a tested backup is rule 1, not rule 5. The bigger your disks, the longer the rebuild, the wider that window — and modern disks are very big.

How RAID Actually Survives a Missing Disk

Now the part you don't need in the emergency — but that lets you reason about RAID instead of just following steps. Once you can picture how an array reads data off a disk that isn't there anymore, every underscore in /proc/mdstat stops being a scary symbol and becomes something you can reason about. There are two tricks, and they're completely different.

Mirroring: The Boring, Beautiful One

A RAID 1 mirror is exactly what it sounds like — every byte written to disk A is written, identical, to disk B. There's no cleverness to it, and that's the point: when one disk dies, the other is a complete, working copy of your data, no reconstruction required. The array drops the dead half, keeps reading from the survivor, and performance barely flinches. The cost is honest and obvious: you pay for two disks and get the capacity of one. You're buying a spare tyre that's already mounted and spinning. The rebuild, when you add a fresh disk, is just a block-for-block copy — slow but simple, and nothing can go subtly wrong with "copy this disk to that one."

That degraded [2/1] mirror you read earlier? It's a mirror running on one leg. The data is all there, intact, on the surviving disk. The array isn't reconstructing anything; it's just down to its last copy. Which is precisely why it's so urgent: there's nothing to fall back on if that last copy hiccups.

Parity: Reconstructing a Disk That Isn't There

RAID 5 is where the real magic lives, and the maths is surprisingly clean. Instead of keeping a full second copy (which would cost you half your disks), it keeps a much smaller insurance policy: parity. Here's the whole idea, and once it clicks you'll never forget it.

Take a row of bits across your data disks — say three data disks holding 1, 0, 1. The parity disk stores the XOR of them: 1 XOR 0 XOR 1 = 0. Now four values sit across four disks: 1, 0, 1, 0. The beautiful property of XOR is that any one of those four can be reconstructed from the other three, by XOR-ing them back together. Lose the second disk (the 0)? Compute 1 XOR 1 XOR 0 = 0 — and you've recovered it exactly. The missing disk isn't stored anywhere; it's recomputed on the fly from its siblings, every single read, for as long as the array is degraded.

That's what [4/3] [UUU_] actually is: the array is now doing arithmetic on every read to fill in the disk that isn't there. It works flawlessly — and it's why a degraded parity array feels a touch slower, and why it's so exposed. With one disk gone, every surviving disk is load-bearing. There's no spare maths left; lose one more and there aren't enough numbers in the equation to solve for the missing ones. The insurance has paid out, and the policy is now void until you rebuild.
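You can watch the reconstruction happen with nothing but shell arithmetic. This sketch uses whole bytes instead of single bits so the result is less easy to get by accident. xor_all is a made-up helper and the byte values are arbitrary; a real array does the same thing across whole chunks.

```shell
# Hypothetical helper: XOR all arguments together and print the result.
xor_all() {
    acc=0
    for v in "$@"; do
        acc=$(( $acc ^ $v ))
    done
    echo "$acc"
}

# Three data "disks", one byte each (arbitrary values), plus their parity.
d0=165 d1=60 d2=15
parity=$(xor_all "$d0" "$d1" "$d2")

# Lose d1, then rebuild it from the two surviving data bytes and the parity.
rebuilt=$(xor_all "$d0" "$d2" "$parity")
echo "parity=$parity rebuilt_d1=$rebuilt"
```

This prints parity=150 rebuilt_d1=60: the missing byte comes back exactly, without ever having been stored twice. Drop a second value and there's nothing left to XOR it out of, which is the [UU__] case in one line.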

And here's the small piece of genius that makes RAID 5 practical: the parity isn't all dumped on one dedicated disk (that was RAID 4, and the parity disk became a bottleneck, because every write had to touch it). RAID 5 scatters the parity across all the disks: one stripe's parity on disk D, the next on disk C, the next on disk B, round and round. That's the left-symmetric layout you saw in mdadm --detail: the rotation pattern. So there's no single parity disk to fail or to bottleneck; every disk carries a share of data and a share of insurance. RAID 6 just does the same trick with two independent parity calculations (a second one using clever Galois-field maths, not plain XOR), so it can solve for two missing disks at once, which is the whole reason it shrugs off a single failure where RAID 5 sweats.
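The rotation itself is a one-line formula. The exact order is an assumption on my part: this sketch starts on the last disk and walks one disk to the left per stripe, which is how the left-symmetric layout is usually described. parity_disk is a hypothetical helper and counts disks from zero.

```shell
# Hypothetical helper: parity_disk STRIPE DISKS
# Prints which disk (0-based) holds the parity for a given stripe,
# assuming the rotation starts on the last disk and moves left.
parity_disk() {
    echo $(( ($2 - 1) - ($1 % $2) ))
}

# Show the first five stripes of a four-disk array.
for stripe in 0 1 2 3 4; do
    echo "stripe $stripe: parity on disk $(parity_disk "$stripe" 4)"
done
```

On four disks that walks 3, 2, 1, 0 and then wraps back to 3, so over any four consecutive stripes each disk holds parity exactly once.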

So: two ways to survive a missing disk. Mirroring keeps a whole spare copy and reads it directly: simple, expensive, bulletproof. Parity throws away the copy and keeps only enough maths to rebuild the missing piece: clever, cheap, and naked the moment one disk is already gone. Hold both pictures and every line of /proc/mdstat reads itself: count the Us, find the _, and you know exactly how much net you have left. Right now, on a degraded RAID 1 or RAID 5, the answer is none, which is the one fact this whole page exists to make you act on.