RAID Array Rebuilding: Symptoms, Diagnosis & Fixes
The array is healing itself. It's working exactly as designed — and it's never been more fragile.
What It Is
A rebuilding RAID array is one that lost a disk, got a fresh one to replace it, and is now in the middle of copying enough data onto the newcomer to make the array whole again. For a RAID 1 mirror that means cloning the surviving disk block-for-block onto its new twin; for RAID 5 or 6 it means reading every surviving disk cover to cover and recomputing the missing disk's contents from parity. (RAID 0, for the record, cannot rebuild at all — it has no redundancy, so a single disk failure destroys the array. If you're running RAID 0 on a server, replace it with RAID 1, 5, 6, or 10 — any of those give you the rebuild you'll eventually need.) Either way, the kernel's software-RAID layer (md, the thing behind mdadm) is grinding away in the background, and your data is still fully there and fully readable the whole time.
So let's start with the reassuring part, because this is the page people land on at 2 a.m. convinced something is on fire: a rebuild is not a fault. It is the array doing precisely the job you bought it to do. A disk died, the array shrugged and kept serving every read and write without dropping a byte, and now — on its own, no downtime — it is rebuilding the redundancy you lost. This is the happy ending of a disk failure, not the start of a disaster. CleverUptime flags it as RAID:ResyncRunning at INFO/WARNING severity for exactly that reason: it's informational — here's what your box is busy doing — with a gentle nudge, because for the hours this takes, you are running with less safety net than usual.
That last clause is the whole story of this page, and it's worth saying plainly before anything else: a rebuilding array is working as designed, but it is also at its single most vulnerable. While it reconstructs the missing disk, the surviving disks are being read harder than they are ever read in normal life — and on a degraded array, until the rebuild finishes, you have no spare redundancy left to lose. By the end of this page you'll read the progress line like a dashboard, know resync from recovery from check (three words the array uses that mean three different things), tell a healthy rebuild from a stalled one, know how to speed it up or slow it down safely, and — the part that actually matters — understand why right now, mid-rebuild, is the most important moment in the array's life to have a working backup. We'll lead with how to see it and read it, and save the lovely "how does it even know what the dead disk held" mechanics for the end.
How You Notice
A rebuild is quiet by design — it's meant to heal without bothering you — so the symptoms are mostly things you only see if you look. Here's each, with the exact command to see it on your own box:
- /proc/mdstat shows a progress bar. This is the ground truth, the file the kernel updates live as it works. One cat and you see every array and exactly what each is doing:

  cat /proc/mdstat

  A rebuilding array carries an extra line the healthy ones don't — a little ASCII progress bar with a percentage, an ETA, and a speed. We'll take it apart line by line in the next section; for now, the presence of that recovery = (or resync =) line is the symptom.

- mdadm --detail says recovering. The richer, human-readable view of one array, including a plain-English state and a rebuild percentage:

  mdadm --detail /dev/md0

  Look for State : clean, degraded, recovering and a Rebuild Status : NN% complete line. The word degraded there is not a second alarm — during a recovery the array is by definition still short its full disk count until the copy finishes, so degraded and recovering travel together right up to the last percent.

- Everything feels a touch slower, and wa ticks up. A rebuild is a firehose of sequential I/O across every disk in the array, and it competes with your real workload for those same spindles and that same SATA bus. In top you may see the I/O wait figure (wa) sit higher than usual while the rebuild runs. This is expected and temporary — the array deliberately throttles itself so your applications keep priority (more on that knob below), and the moment it finishes, the wa settles back down.

  top

- A mdadm --monitor mail — if you set it up. The mdadm daemon (mdadm --monitor, usually running as a service on Debian/Ubuntu out of the box) reports every array event, but it only sends mail for the serious ones: Fail, FailSpare, DegradedArray, and SparesMissing. Quieter events such as SpareActive, RebuildStarted, and RebuildFinished go to syslog (when monitoring runs with --syslog) or to the alert program configured in mdadm.conf, not to your inbox. If you got a "DegradedArray" mail an hour ago, the rebuild line in /proc/mdstat is the sequel.
None of these is bad news. They're the array narrating a recovery in progress. The one thing they do mean is: don't reboot for fun, don't pull another disk, and make sure your backup is current — because for the next stretch of time, this array is doing the most demanding work it ever does.
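As a rough sketch of the first symptom, the function below scans mdstat-style text for a progress line and reports which array it belongs to. The name mdstat_busy is made up for this page (it is not part of mdadm), and the parsing assumes the layout shown in this article.

```shell
# List every md array that currently has a sync operation in flight.
# Reads /proc/mdstat-style text on stdin; prints "<array> <operation> <percent>".
# mdstat_busy is a hypothetical helper name, not an mdadm command.
mdstat_busy() {
    awk '
        # Remember which array the following lines belong to.
        /^md[0-9a-zA-Z_]* :/ { array = $1 }

        # A progress line contains one of the verbs followed by "=".
        /(recovery|resync|check|reshape) *=/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^(recovery|resync|check|reshape)$/ && $(i + 1) == "=") {
                    print array, $i, $(i + 2)
                }
            }
        }
    '
}
```

On a live box you would feed it the real file: mdstat_busy < /proc/mdstat. No output means no array is rebuilding, resyncing, or scrubbing right now.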
How I Read It
The file I open first, every time, is /proc/mdstat — a tiny virtual file the kernel writes on the fly that shows the live state of every software-RAID array on the box. It costs nothing to read and it never lies. Here's a real RAID 1 mirror partway through a recovery, exactly as the kernel prints it:
Personalities : [raid1]
md0 : active raid1 sdc1[2] sdb1[0]
2095040 blocks super 1.2 [2/1] [U_]
[===>.................] recovery = 37.4% (784384/2095040) finish=112.3min speed=98000K/sec
unused devices: <none>
Four lines, and each one tells you something. Let's walk them top to bottom, because once you've parsed this block once you'll parse it forever.
Line 1 — Personalities : [raid1]. The RAID levels this kernel knows how to drive. It's just a capabilities list — [raid1], [raid6] [raid5] [raid4], [linear], and so on — telling you which personalities the md layer has loaded. Glance at it once and move on; it almost never matters during a rebuild.
Line 2 — md0 : active raid1 sdc1[2] sdb1[0]. The array's name and roster. md0 is the array device. active is its state — good; the alternative you don't want is inactive, which is a different and worse problem. raid1 is the level. Then the members: sdc1 and sdb1, each with a number in square brackets. That number is the device's index in the array's metadata (the Number column in mdadm --detail), not a partition number and not a guarantee of which slot the disk occupies. The old rule of thumb says that on an array that needs n disks, 0..n-1 are the working members and anything numbered n or higher is a spare, and it reads correctly here: sdb1[0] is the survivor, while sdc1[2] carries a number above the working set, which marks it as a disk added after the array was created. With 1.x metadata, though, a replacement usually keeps that higher number after the rebuild completes, so a healthy [UU] mirror can still list a member as [2]. Trust the (S) and (F) suffixes md appends to idle spares and failed members, and the [U_] glyph on the next line, over the number itself.
Line 3 — 2095040 blocks super 1.2 [2/1] [U_]. The size and, crucially, the health summary. 2095040 blocks is the array's capacity in 1 KB blocks (~2 GB here — a small test mirror). super 1.2 is the metadata format. Then the two pieces that matter most:
- [2/1] — disks expected / disks currently in sync. This array wants 2 disks and has 1 fully synced right now. That /1 is why it's degraded: one member isn't carrying its share yet.
- [U_] — the same fact drawn as a picture, one character per slot: U = up (in sync), _ = not up. [U_] means slot 0 is healthy and slot 1 is missing-or-rebuilding. This little glyph is the fastest health check in all of Linux storage — [UU] is a happy mirror, [U_] is one down, [UUUU] a happy RAID 5/6, [UU_U] one down on a four-disk array. When the rebuild finishes, you'll watch [U_] flip to [UU] and [2/1] become [2/2]. That transition is the finish line.
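The two health fields above are easy to read mechanically. Here is a minimal sketch, assuming the mdstat layout shown on this page; md_health is a hypothetical name chosen for illustration.

```shell
# Print "<array> <state> <in-sync>/<expected>" for each array found in
# mdstat-style text on stdin. md_health is a hypothetical helper name.
md_health() {
    awk '
        /^md[0-9a-zA-Z_]* :/ { array = $1 }

        # The size line carries the counts field, e.g. [2/1].
        / blocks / {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^\[[0-9]+\/[0-9]+\]$/) {
                    counts = substr($i, 2, length($i) - 2)   # strip brackets: "2/1"
                    split(counts, n, "/")                    # n[1] expected, n[2] in sync
                    state = (n[2] + 0 < n[1] + 0) ? "degraded" : "in-sync"
                    print array, state, n[2] "/" n[1]
                }
            }
        }
    '
}
```

Run as md_health < /proc/mdstat. For the mirror in this article it would report the array as degraded with 1 of 2 members in sync; once the rebuild completes the same call reports in-sync 2/2.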
Line 4 — the progress bar. The line that only exists while work is happening, and the one you actually came for:
[===>.................] recovery = 37.4% (784384/2095040) finish=112.3min speed=98000K/sec
Read it left to right:
- [===>.................] — a 20-character ASCII gauge. The = is done, the > is the write head's current position, the . is yet to come. It maps to the percentage; it's there so you can eyeball progress without reading the number.
- recovery = 37.4% — how far along, and which kind of sync this is. The verb is load-bearing: recovery, resync, and check are three different operations (we'll pin down the difference in a moment).
- (784384/2095040) — the raw counters behind the percentage: blocks done / blocks total. 784384 ÷ 2095040 = 37.4%. Useful when you want to watch the numerator climb in real time.
- finish=112.3min — the kernel's own ETA, computed from the current speed. It's an estimate and it will wander as the speed changes, but it's honest about the trend.
- speed=98000K/sec — the current throughput, in KB/sec. 98000K/sec is about 96 MB/s — a healthy clip for spinning disks. This number is also the dial you watch to tell a smooth rebuild from a struggling one.
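To make the left-to-right reading concrete, this sketch splits a progress line into its fields and recomputes the percentage from the raw counters. parse_progress is a hypothetical name, and the pattern assumes the exact field order shown above.

```shell
# Split one progress line (on stdin) into named fields and recompute the
# percentage from the raw counters. parse_progress is a hypothetical helper.
parse_progress() {
    # sed pulls out: verb, percent, blocks done, blocks total, ETA, speed.
    sed -n -E 's/.*(recovery|resync|check|reshape) *= *([0-9.]+)% *\(([0-9]+)\/([0-9]+)\) *finish=([0-9.]+)min *speed=([0-9]+)K\/sec.*/\1 \2 \3 \4 \5 \6/p' |
    # awk recomputes the percentage as done / total rather than trusting the printed one.
    awk '{ printf "op=%s done=%s total=%s pct=%.1f finish=%smin speed=%sK/sec\n", $1, $3, $4, 100 * $3 / $4, $5, $6 }'
}
```

For example, grep recovery /proc/mdstat | parse_progress prints one line of key=value pairs that is easier to log or compare than the raw gauge.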
Pro Tip
To watch a rebuild progress without re-typing the command, run watch -n5 cat /proc/mdstat — it refreshes the whole block every five seconds, so the percentage climbs and the ETA narrows in front of you. It's oddly satisfying, and it's the single best way to confirm a rebuild is actually moving and not wedged. (watch -d even highlights the digits that changed.)
Now the mdadm --detail view of the same array — the prose version, which adds the device-by-device roster:
/dev/md0:
Version : 1.2
Creation Time : Sat Feb 14 09:18:21 2026
Raid Level : raid1
Array Size : 2095040 (2046.34 MiB 2145.32 MB)
Used Dev Size : 2095040 (2046.34 MiB 2145.32 MB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Sat Feb 14 11:02:54 2026
State : clean, degraded, recovering
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
Consistency Policy : bitmap
Rebuild Status : 37% complete
Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
2 8 33 1 spare rebuilding /dev/sdc1
The headline is State : clean, degraded, recovering — three words that look alarming together but say something precise. clean means the data is consistent (no in-flight writes lost). degraded means the array is short a fully-synced member. recovering means it's actively fixing that. Rebuild Status : 37% complete matches the /proc/mdstat percentage. And the device table spells out roles: /dev/sdb1 is active sync (the survivor doing the reading), /dev/sdc1 is spare rebuilding (the newcomer being filled). When the rebuild ends, sdc1 graduates from spare rebuilding to active sync, and the state drops to a plain clean.
Note
Spare Devices : 1 here does not mean you have a spare sitting idle. During a recovery the disk being rebuilt is counted as a "spare" right up until it's fully synced and promoted to "active" — so a recovering array always shows one spare that's really your replacement disk mid-fill, not a free hot-spare on standby. Read Rebuild Status and the device line (spare rebuilding) together and the picture is unambiguous.
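Reading the state, the rebuild percentage, and the device line together can be sketched as a small filter over mdadm --detail output. detail_summary is a hypothetical name, and the parsing assumes the "Key : value" layout shown above.

```shell
# Summarise `mdadm --detail` output read from stdin: the State line, the
# Rebuild Status (if present), and the member currently being rebuilt.
# detail_summary is a hypothetical helper name.
detail_summary() {
    awk -F' : ' '
        $1 ~ /^ *State$/          { print "state=" $2 }
        $1 ~ /^ *Rebuild Status$/ { print "rebuild=" $2 }

        # Device table rows have no " : ", so split the whole row on whitespace
        # and take the last column, which is the device path.
        /spare rebuilding/        { n = split($0, col, " "); print "rebuilding=" col[n] }
    '
}
```

Usage would be mdadm --detail /dev/md0 | detail_summary. If the rebuilding= line is missing while Spare Devices is non-zero, that spare really is an idle hot spare.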
Resync vs Recovery vs Check
Three words show up in that progress line, and people treat them as synonyms. They aren't, and knowing which one you're seeing tells you why the array is busy:
- recovery — rebuilding a missing disk's contents onto a replacement. This is the one this page is about: a disk died (or you added one), and the array is reconstructing it from the survivors. The array is degraded the whole time ([U_]), because the new disk isn't pulling its weight yet.
- resync — making all present disks consistent with each other when the array isn't missing anyone. You see this right after creating an array (the initial sync), or after an unclean shutdown when the kernel can't be sure every mirror matched. The array is not degraded — all disks are there ([UU]); md is just double-checking they agree. (A bitmap, the Consistency Policy : bitmap above, makes this fast by tracking only the regions that were being written — so an unclean reboot resyncs megabytes, not the whole array.)
- check — a scrub: read every block on every disk, compare against parity/mirror, and tally (or fix) any mismatches, with nothing wrong and nothing missing. This is the preventive sweep a healthy array runs on a schedule (Debian/Ubuntu kick one off the first Sunday of each month via /etc/cron.d/mdadm) to catch a silently rotting sector while the others can still rebuild it. A check line in /proc/mdstat on a [UU] array is good hygiene, not trouble. (CleverUptime treats a running check the same way it treats a recovery — RAID:ResyncRunning — because from the kernel's point of view it's the same machinery; the verb tells you which intent.)
The mechanical engine is identical for all three — read disks, write disks, walk the array start to finish — which is why they share one progress line and one set of speed knobs. The verb is what tells you whether you're watching a heal (recovery), a sanity pass (resync), or a checkup (check).
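The kernel also exposes the current operation per array in sysfs, in the sync_action file, with slightly different spellings (recover rather than recovery, plus idle and repair). A minimal sketch that maps those values back to the intents described above; both function names are hypothetical, and the optional second argument exists only so the function can be pointed at a scratch directory.

```shell
# Translate a sync_action value into the intent it represents.
# explain_sync_action and current_sync_action are hypothetical helper names.
explain_sync_action() {
    case "$1" in
        recover) echo "recovery: rebuilding a missing member onto a replacement" ;;
        resync)  echo "resync: making the present members agree with each other" ;;
        check)   echo "check: scrub that counts mismatches" ;;
        repair)  echo "repair: scrub that also rewrites mismatches" ;;
        idle)    echo "idle: no sync operation running" ;;
        *)       echo "unknown: $1" ;;
    esac
}

# Read the action for one array, e.g. current_sync_action md0.
# $2 optionally overrides the sysfs directory (assumption: used for testing only).
current_sync_action() {
    md_dir=${2:-/sys/block/$1/md}
    explain_sync_action "$(cat "$md_dir/sync_action")"
}
```

On a rebuilding mirror, current_sync_action md0 would be expected to print the recovery line; on a quiet array it prints the idle line.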
Reading the ETA and Speed
The finish= and speed= numbers are a single story told two ways, and both are governed by two kernel tunables most people never meet:
cat /proc/sys/dev/raid/speed_limit_min # default 1000 (KB/sec)
cat /proc/sys/dev/raid/speed_limit_max # default 200000 (KB/sec)
These are the floor and ceiling on rebuild throughput, in KB/sec, and they exist to solve a genuine tension. speed_limit_max (default 200000, i.e. ~195 MB/s) is the cap when the array is otherwise idle — go as fast as you like, but no faster than this, so the rebuild can't completely monopolise the disks. speed_limit_min (default 1000, i.e. ~1 MB/s) is the guaranteed minimum the rebuild gets even when your applications are hammering the same disks — so a busy server can't starve the rebuild down to nothing and leave you degraded for a week. Between those bounds, md backs the rebuild off automatically whenever real I/O shows up and lets it sprint when the disks go quiet. That speed=98000K/sec you read is md's live answer to "how much spare I/O is there right now."
So when you see the speed sag and the ETA balloon, it usually isn't a fault — it's the array politely yielding to your workload, exactly as it's supposed to. And when you genuinely need it done faster (more on that below), these two files are the throttle.
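The relationship between the counters, the speed, and the ETA is plain division, which the sketch below spells out. eta_minutes is a hypothetical name, and the figures in the usage note are invented round numbers for a large member disk, not measurements.

```shell
# Estimate minutes remaining from the progress counters and the current speed.
# Arguments: blocks done, blocks total, speed in K/sec (units as printed in /proc/mdstat).
# eta_minutes is a hypothetical helper name.
eta_minutes() {
    awk -v d="$1" -v t="$2" -v s="$3" 'BEGIN {
        if (s <= 0) { print "stalled"; exit }      # no throughput means no meaningful ETA
        printf "%.1f\n", (t - d) / s / 60          # remaining KB / (KB per second) / 60
    }'
}
```

With 1000000000 of 4000000000 blocks done, eta_minutes 1000000000 4000000000 100000 gives 500.0 minutes at roughly 100 MB/s, while the same rebuild pinned at the default floor, eta_minutes 1000000000 4000000000 1000, gives 50000.0 minutes. That hundredfold gap is the whole reason the two limits matter.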
Reading It by Example
Train the pattern-match. The /proc/mdstat (or mdadm --detail) readout on the left, what I'd conclude on the right:
- [2/1] [U_] with recovery = 37.4% ... speed=98000K/sec, ETA shrinking → A healthy RAID 1 rebuild, right on track. Nothing to do but let it run — and make sure your backup is current while it does.
- [4/3] [UU_U] with recovery = 8.5% ... finish=148.6min speed=152340K/sec → A RAID 5 rebuilding one failed member onto a replacement; three of four disks healthy, parity being recomputed onto the slot marked _. Normal, if slow — RAID 5 rebuilds read all survivors, so the bigger the disks the longer this takes.
- [2/2] [UU] with resync = 19.7% ... speed=138218K/sec → Not a degraded array at all — an initial sync or a post-crash consistency pass. No disk is missing ([UU]); md is just making the mirrors agree. Nothing died.
- [2/2] [UU] with check = 92.9% ... speed=125392K/sec → A scheduled scrub on a fully-healthy array. Pure hygiene. Let it finish; if it reports a non-zero mismatch_cnt afterward, that's a separate (and rare) conversation.
- speed=1000K/sec and the ETA in the thousands of minutes → The rebuild has been throttled to its floor by heavy competing I/O — speed_limit_min's default. Not broken, but crawling. If the array is degraded and you want it safe sooner, this is the case where you raise the limits (below).
- recovery = 0.0% and the percentage not moving across two cats → A genuinely stuck rebuild — almost always because the surviving disk is itself failing and a read is hanging. Stop, check the survivor's SMART now: see disk failing. This is the scenario the Danger box below exists for.
- resync=DELAYED on a second array → Two arrays share the same physical disks, so md queues their rebuilds one at a time rather than thrashing the spindles. The delayed one starts automatically the moment the first finishes. Nothing to fix.
- resync=PENDING → The array was assembled in auto-read-only mode, and the resync is waiting for the first write before it starts. It begins on its own as soon as something writes to the array, or when you switch it over by hand with mdadm --readwrite. Nothing to fix here either.
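The "not moving across two cats" test from the list above can be sketched as a comparison of the blocks-done counter in two snapshots. Both function names are hypothetical, and the parsing assumes the progress-line layout used throughout this page.

```shell
# Pull the "blocks done" counter for one array out of mdstat-style text on stdin.
# blocks_done and rebuild_verdict are hypothetical helper names.
blocks_done() {
    awk -v want="$1" '
        /^md[0-9a-zA-Z_]* :/ { array = $1 }
        array == want && /(recovery|resync|check|reshape) *=/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^\([0-9]+\/[0-9]+\)$/) {
                    split(substr($i, 2), part, "/")   # "(784384/2095040)" gives 784384
                    print part[1]
                }
            }
        }
    '
}

# Compare two snapshots taken some time apart.
# Arguments: array name, earlier snapshot text, later snapshot text.
# Prints "moving", "stalled", or "no-sync" (no progress line in the later snapshot).
rebuild_verdict() {
    before=$(printf '%s\n' "$2" | blocks_done "$1")
    after=$(printf '%s\n' "$3" | blocks_done "$1")
    if [ -z "$after" ]; then
        echo "no-sync"
    elif [ "$after" = "$before" ]; then
        echo "stalled"
    else
        echo "moving"
    fi
}
```

A possible use: a=$(cat /proc/mdstat); sleep 60; b=$(cat /proc/mdstat); rebuild_verdict md0 "$a" "$b". Pick an interval long enough that even a rebuild throttled to its floor would advance the counter, otherwise a slow rebuild can look stalled.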
How to Fix It
Here's the part that surprises people: most of the time, the fix is to do nothing. A rebuild is self-driving. It started on its own when the replacement disk was added, it throttles itself around your workload, and it will finish on its own and flip [U_] to [UU] without you touching a key. The "fix" is patience plus one piece of preparation. But there are real cases where you act, so let's cover the urgent one first.
Danger
A rebuild reads every surviving disk end to end — the heaviest, most sustained load they ever bear — and on a degraded array you have zero redundancy left to lose. If a second disk dies mid-rebuild (and tired, same-age, same-batch siblings genuinely do pick this moment), the array is gone and so is your data. So before you lean on the array — and ideally the instant it went degraded — get a current backup off it while every read still works. And do not answer a slow rebuild by yanking and reseating disks or rebooting to "speed things up": on an array with one foot already off the curb, every extra disturbance is a fresh chance to lose the second foot. The rebuild is the recovery. Let it recover.
With that said, here's what to do, by situation:
- Healthy rebuild, just running: leave it, and back up. Let /proc/mdstat climb to 100%. Confirm your backup is current — this is the one moment it earns its entire existence. When the bar hits 100% and [U_] becomes [UU], you're whole again; no action needed.

- You replaced a failed disk and the rebuild didn't auto-start. Some setups need you to add the new disk to the array by hand. Partition it to match, then add it and md begins recovery immediately:

  mdadm --add /dev/md0 /dev/sdc1

  Watch /proc/mdstat — within seconds you should see the recovery = line appear. (If the array is still showing a failed old member, remove it first with mdadm --remove /dev/md0 /dev/sdX1.) For the full disk-swap walkthrough, see RAID degraded.

- The rebuild is genuinely too slow and the array is degraded. When you need redundancy back now and the box can spare the I/O, raise the throttle. These take effect live, mid-rebuild:

  echo 100000 > /proc/sys/dev/raid/speed_limit_min   # floor: ~100 MB/s
  echo 500000 > /proc/sys/dev/raid/speed_limit_max   # ceiling: ~500 MB/s

  The speed will jump toward whatever the disks can actually sustain. (To make it survive a reboot, drop it in /etc/sysctl.d/. But on a degraded array you usually want it done this boot, not next.)

  Warning
  Raising speed_limit_min doesn't conjure free bandwidth — it just shifts who wins the fight for the disks, away from your live applications and toward the rebuild. On a busy production box that can make user-facing requests crawl while the rebuild sprints. It's the right trade when finishing the rebuild safely matters more than latency for the next hour; it's the wrong one if the box is serving customers and the array can afford to take its time. Raise it, watch the rebuild finish, then set it back.

- The rebuild is stuck at a fixed percentage and not moving. This is the dangerous case masquerading as the boring one. A rebuild that flatlines is almost always a surviving disk hitting a sector it can't read — the rebuild can't proceed past a block it can't reconstruct from. Don't poke the array; check the health of the disks it's reading from right now:

  smartctl -a /dev/sdb   # the SURVIVOR, not the new disk
  dmesg -T | grep -iE "I/O error|ata[0-9]|md/raid"

  If the survivor shows real defects (pending or reallocated sectors climbing), you are one bad block away from losing the array — get your backup finished first, then deal with the survivor. The whole reading is on disk failing.
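The "raise it, then set it back" advice for the throttle is easy to forget halfway through an incident, so here is a minimal sketch that remembers the old values for you. The function names and the RAID_TUNABLES variable are hypothetical; the variable exists so the functions can be tried against a scratch directory before touching the live kernel files.

```shell
# Directory holding the md throttle files. Overridable (an assumption made for
# this sketch) so the functions can be exercised without touching the kernel.
RAID_TUNABLES=${RAID_TUNABLES:-/proc/sys/dev/raid}

# Remember the current limits, then raise them.
# Arguments: new speed_limit_min, new speed_limit_max (both in KB/sec).
raise_limits() {
    saved_min=$(cat "$RAID_TUNABLES/speed_limit_min")
    saved_max=$(cat "$RAID_TUNABLES/speed_limit_max")
    echo "$1" > "$RAID_TUNABLES/speed_limit_min"
    echo "$2" > "$RAID_TUNABLES/speed_limit_max"
}

# Put back whatever raise_limits found. Must run in the same shell session,
# because the saved values live in shell variables.
restore_limits() {
    echo "$saved_min" > "$RAID_TUNABLES/speed_limit_min"
    echo "$saved_max" > "$RAID_TUNABLES/speed_limit_max"
}
```

As root, raise_limits 100000 500000 applies the values from the list above, and restore_limits returns the box to whatever it had before once /proc/mdstat shows the array whole again.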
How to Avoid It
You don't avoid a rebuild — you want it. A rebuild happening means your array did its job: it survived a dead disk and is restoring your safety margin automatically. The thing actually worth minimising is the time spent degraded (the window where a second failure is fatal) and the risk of a rebuild going wrong. A short list, in order of payoff:
- Backup — and especially before and during a rebuild. RAID survives a dead disk; it does nothing against rm -rf, a bad deploy, ransomware, or the second disk dying mid-rebuild. The rebuild is the highest-risk window in the array's life, which makes a current, restore-tested backup most valuable at exactly the moment you're most tempted to assume the array has it covered. Do it now. Yes, even mid-rebuild — gentle reads don't meaningfully slow the recovery.

- Keep a hot spare in the array. Add a disk as a spare (mdadm --add, with the array already full) and md will automatically pull it in the instant a member fails — the rebuild starts itself, unattended, often before you've even read the alert email. It collapses the degraded window from "however long until a human notices and swaps a disk" down to "seconds." For anything you can't babysit, this is the single best upgrade.

- Use RAID 6 (double parity) on large arrays. RAID 5 survives one disk; during its rebuild it survives zero, and the rebuild is precisely when a second disk is most likely to give out — the reason "RAID 5 is dead on big disks" became a sysadmin proverb. RAID 6 keeps a second parity disk, so even mid-rebuild you can lose another disk and live. On big, slow drives with long rebuilds, that second layer is not paranoia; it's the math.

- Run scheduled scrubs and watch SMART. A periodic check (the monthly scrub above) reads every block while all disks are present, so it finds a weak sector and rebuilds it from redundancy before a real failure forces a rebuild that then trips over that same sector. A scrub is a rehearsal for the rebuild you hope never comes — and the disk that fails the scrub is the one to replace on your own calm schedule, via disk failing.
Note
The cruellest, most common rebuild failure is the latent bad sector. A disk can sit in a healthy array for a year with one block it can no longer read — and nothing notices, because nothing read that block. Then a sibling dies, the rebuild starts reading everything, hits that block, and can't reconstruct the dead disk because the data it needs is on the unreadable sector. Two failures, one of them silent for months. Scrubs exist to surface that latent block while the array can still heal it — which is why a healthy array that scrubs on a schedule rebuilds far more reliably than one that never does.
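For checking on a scrub by hand, md exposes two sysfs files per array: sync_action (what is running) and mismatch_cnt (how many sectors disagreed during the last check). A rough sketch that reads both; scrub_report is a hypothetical name, and the directory is passed as an argument so the function can be tried on a scratch copy.

```shell
# Report the outcome of the most recent scrub for one array.
# $1 is the array's md sysfs directory, e.g. /sys/block/md0/md.
# scrub_report is a hypothetical helper name.
scrub_report() {
    action=$(cat "$1/sync_action")
    mismatches=$(cat "$1/mismatch_cnt")
    if [ "$action" != "idle" ]; then
        echo "busy: $action still running"
    elif [ "$mismatches" -eq 0 ]; then
        echo "clean: last scrub found no mismatches"
    else
        echo "attention: mismatch_cnt=$mismatches"
    fi
}

# To start a scrub by hand (as root): echo check > /sys/block/md0/md/sync_action
```

Call it as scrub_report /sys/block/md0/md after the scheduled check has finished. A non-zero count is a prompt to look closer, not proof of a dying disk.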
How a Rebuild Actually Works
Now the part you don't need in a crisis — but that turns "watch the bar go up" into genuine understanding. The question that nags everyone the first time: if a disk died, how does the array know what was on it? The answer is one of the quietly beautiful ideas in computing, and it differs by RAID level.
RAID 1: Just Copy the Survivor
The mirror is the easy case, and worth starting with because the intuition is free. In RAID 1, every disk holds an identical copy of the data. Lose one, and the survivor still has the complete picture — so a rebuild is, conceptually, dd from the good disk to the new one, block for block, beginning to end. That's why a mirror rebuild's speed is simply "how fast can one disk read and the other write," and why [U_] becomes [UU] in roughly the time it takes to read the whole disk once. No cleverness, just copying. The only subtlety is the bitmap: md keeps a tiny on-disk record (Consistency Policy : bitmap) of which regions had writes in flight, so after a transient drop-and-readd it can resync only the dirty regions in seconds, instead of recopying terabytes that already match. A small file that saves enormous time — the kind of detail that, once you know it, makes you trust the system more.
RAID 5 and 6: Reconstruct from Parity
Here's where it gets clever, and where the magic lives. RAID 5 doesn't keep a second copy of your data — that would cost you half your disks. Instead it keeps parity, and the trick rests on one humble operation you already know from a logic class: XOR.
XOR has a magic property: it's reversible. If you XOR together the data blocks from every disk in a stripe and store the result as a parity block, then any single missing block can be reconstructed by XOR-ing all the blocks you do still have. Picture a stripe across four disks: D1 ⊕ D2 ⊕ D3 = P. Lose disk 2, and you recover it as D1 ⊕ D3 ⊕ P = D2. The parity you stored "anticipated" every possible single loss, all at once, with a single block instead of a full copy. That's the whole idea of RAID 5 in one line — and it's why losing a second disk in RAID 5 is unrecoverable: XOR can solve for one unknown, not two. (RAID 6 adds a second, mathematically-different parity — a Reed–Solomon syndrome, not just another XOR — so it can solve for two unknowns. That's the entire reason it survives a second failure.)
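The D1 ⊕ D2 ⊕ D3 = P identity is small enough to try in a shell. This toy example shrinks each block to a single byte; the three byte values are arbitrary, picked only for illustration.

```shell
# One "stripe" of three data blocks, shrunk to a single byte each.
# The values are arbitrary illustration data.
d1=$((0xA7))
d2=$((0x3C))
d3=$((0x5E))

# Parity is the XOR of every data block in the stripe.
parity=$(( d1 ^ d2 ^ d3 ))

# Disk 2 is lost: XOR everything that survived to get its block back.
rebuilt_d2=$(( d1 ^ d3 ^ parity ))

printf 'parity=0x%02X rebuilt_d2=0x%02X original_d2=0x%02X\n' "$parity" "$rebuilt_d2" "$d2"
```

This prints parity=0xC5 rebuilt_d2=0x3C original_d2=0x3C. Try removing two of the data values instead of one and there is no expression left that recovers either, which is the single-unknown limit described above.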
This is also why a parity rebuild is so much heavier than a mirror rebuild, and why it stresses the survivors so hard. To reconstruct one missing block, md must read the corresponding block from every surviving disk and XOR them together. Multiply that across the whole array and the rebuild reads every survivor, in full, start to finish — there is no shortcut, because every stripe needs every other disk to rebuild its missing piece. The new disk barely breaks a sweat (it only writes); it's the old, surviving, same-age disks that get read cover to cover, harder than they've ever been read. Now you can feel why the danger box is worded the way it is: the rebuild is, quite literally, a full-surface stress test run simultaneously on every tired disk you have left, at the exact moment you can least afford one to fail. The math that saves you is the same math that strains you.
Why It Throttles, and Why the ETA Wanders
The last piece ties back to those speed knobs. md runs the rebuild as a background citizen of the I/O system: it watches for real application I/O and yields to it, sprinting only when the disks would otherwise be idle. That's the speed_limit_min/speed_limit_max band in action — a guaranteed minimum so the rebuild can't be starved forever, a ceiling so it can't hog an idle box's disks pointlessly, and automatic backoff in between. It's a small, civilised piece of engineering: the array heals as fast as it can without ever making your server feel broken to the people using it. Which is exactly why the finish= estimate breathes in and out — it's not confused, it's reactive, recomputing the ETA from a speed that rises and falls with how busy you keep the disks. A rebuild that speeds up at 3 a.m. and slows at noon isn't malfunctioning; it's reading the room.
Hold all of that together and the progress line stops being a mystery bar and becomes a status report you can reason about: which operation (the verb), how far (the percent and raw counts), how fast right now (the speed, i.e. how much spare I/O exists), and how exposed you are until it's done (the [U_]). One little file, the whole story.
See Also
- RAID — how mirrors and parity actually protect your data, from zero
- mdadm — the tool that builds, monitors, and repairs software RAID
- /proc/mdstat — the live status file every rebuild writes to
- smartctl — check the surviving disk's health before you trust it to a rebuild
- top — where the rebuild's I/O load shows up as wa
- I/O wait — why the box feels slow while disks are saturated
- RAID degraded — the state before this one: a disk is gone and needs replacing
- RAID array inactive — when the array won't assemble at all, a far worse problem
- disk failing — read the SMART diary; the disk that triggered all this, and the survivor you're now leaning on
- disk full — the other storage emergency, about space rather than redundancy
- backup — the only thing that saves you if a second disk dies mid-rebuild; the most important link here