Failing Disk: Symptoms, Diagnosis & Fixes
Every drive is slowly dying. The trick is telling the one failing tonight from the one just getting old.
What It Is
A failing disk is a drive — a spinning HDD or a solid-state SSD — that has begun losing the ability to store your bytes reliably. Sectors that read back cleanly last month now come back wrong, or not at all. The drive (usually) notices before you do, quietly walls off the bad spots, and keeps going. For a while you'd never know. Then one day the spare pool runs dry, the filesystem hits a read it can't satisfy, and the kernel throws the whole mount into read-only to stop the bleeding. The server's still up; it just can't write anymore. That's the moment most people first hear the word "SMART," usually at an unkind hour.
So let's start with the single most reassuring thing on this page, and then spend the rest of it making sure the next one never catches you out:
This is completely normal. A failing disk is not a sign you did something wrong. Drives are consumables — like tyres, like brake pads — and they wear out on a schedule the laws of physics wrote, not you. If you run a handful of bare-metal servers for a few years, you will be back on this page, more than once, and that's not bad luck; that's just the arithmetic of owning spinning metal and aging silicon. In a big datacenter this is so routine that there are people whose entire job is to walk the aisles with a cart of fresh disks, pull the amber-lit ones, and slot in replacements — all day, every day, forever. Somewhere, someone is doing it right now.
What separates a calm admin from a panicked one isn't avoiding failing disks. It's understanding them — and being prepared. By the end of this page you'll read the warning the drive has been broadcasting all along, tell the genuinely-dying disk from the merely-old one, know exactly what to do about each, and — the part that actually saves you — have a backup in place so the next failure is a shrug instead of an outage. We'll start where it hurts — how to spot it, read it, fix it, and prepare for it — and save the fascinating physics of why disks die for the end, because once the fire's out it's one of the better stories in computing.
How You Notice
A failing disk announces itself in the places you'd expect and a few you wouldn't. Here's each one, with the command to see it on your own box right now — so you can tell a real symptom from a scare:
- I/O errors in the kernel log. The kernel narrates hardware trouble in plain text. Look:
    dmesg -T | grep -iE "I/O error|ata[0-9]|medium error"
    journalctl -k -p err
  Lines like
    blk_update_request: I/O error, dev sda, sector 1234567
  or
    ata1.00: failed command: READ FPDMA QUEUED
  name the exact device and sector that refused to cooperate. This is the rawest, most honest symptom there is — and an empty result here is genuinely good news.
- The filesystem flips read-only. ext4 and XFS would rather freeze than corrupt your data, so on a serious write error they remount read-only. Check it:
    mount | grep -w ro
    touch /some/path/testfile    # fails with "Read-only file system"
  Suddenly every write fails even though df swears there's plenty of space. (That's the tell that separates this from disk full: full is about capacity, this is about hardware. Same panic, opposite cause.)
- Everything gets slow and wa climbs. A drive retrying a bad read can stall for whole seconds. In top the wa (I/O wait) figure climbs while the CPU sits idle — the box isn't busy, it's waiting, while a dying disk asks the platter "are you sure?" five times before giving up. And here's the cruel twist: a process stuck on a read that never returns drops into D state — uninterruptible sleep — and you can't kill it, not even with kill -9. It's the one process on the whole box that's beyond your reach: sulking in the corner, arms folded, waiting on a disk that may never call back — and if the disk is truly gone, nothing short of a reboot will pry it loose. List them:
    ps -eo state,pid,comm | awk '$1 ~ /D/'
  A little column of D-state processes is exactly what a dying disk looks like from across the room. (Sustained, that whole picture is high I/O wait.)
- A smartd email — if you set it up. The smartmontools daemon can watch your drives and email root the moment an attribute moves. Most servers never turn it on, which is exactly why the read-only remount is the first many admins ever hear of the problem. We'll fix that at the end.
Any one of these means: stop reading about disk space and go read the disk's health. They're completely different problems with completely different tools — and the tool for health is one command.
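To turn the kernel-log symptom into a quick answer to "which disk?", here is a minimal sketch that tallies I/O-error lines per device. The function name io_error_devices is made up for this page, and it only recognises the "I/O error, dev X" wording quoted above; other kernel versions may phrase errors differently.

```shell
# Count kernel I/O-error lines per block device.
# Reads kernel log text on stdin and prints "<count> <device>", busiest first.
# Only matches the "I/O error, dev <name>" wording; other messages are ignored.
io_error_devices() {
    grep -oE 'I/O error, dev [a-z0-9-]+' \
        | awk '{ print $NF }' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}
```

Feed it the log you already looked at, for example dmesg -T | io_error_devices. No output means no matching lines, which is the good case.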
How I Read It
Every modern drive keeps a running diary of its own condition called SMART — Self-Monitoring, Analysis and Reporting Technology — and the tool that reads that diary is smartctl, from the smartmontools package. The command I reach for first asks the drive for everything it knows:
smartctl -a /dev/sda
(-a is "all"; on an NVMe drive the device is /dev/nvme0 and the report looks a little different — we'll get there.) Here's roughly what comes back — a real, healthy 6 TB drive, trimmed of a few of its more verbose blocks:
=== START OF INFORMATION SECTION ===
Model Family: HGST Ultrastar 7K6000
Device Model: HGST HUS726060ALE610
User Capacity: 6,001,175,126,016 bytes [6.00 TB]
Rotation Rate: 7200 rpm
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail - 0
9 Power_On_Hours 0x0012 088 088 000 Old_age - 85576
193 Load_Cycle_Count 0x0012 098 098 000 Old_age - 3478
194 Temperature_Celsius 0x0002 139 139 000 Old_age - 43 (Min/Max 21/53)
197 Current_Pending_Sector 0x0022 100 100 000 Old_age - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age - 0
It looks like a wall of numbers, but it's really just three blocks. First, who the drive is: the information section, with model, capacity, and rotation rate. Second, the one-line verdict: PASSED, a word we're about to learn to distrust. Third, the attribute table: the diary itself, the part you actually read. It is also where the drive's age lives, as Power_On_Hours, here a frankly heroic 85576 (this disk has been awake for nearly ten years). Here's the thing it takes most people years to learn, so let's lead with it: you read the table, not the verdict. Let's take the table apart, then come back to that headline and why we walk straight past it.
What to Actually Look At
One trap to clear before the numbers, because it snares absolutely everyone. The VALUE / WORST / THRESH columns are normalized, and they count down, not up. A healthy attribute starts life at 100 (or 200, or 253) and falls toward THRESH as things get worse; the drive has "failed" an attribute when VALUE sinks to meet THRESH. So a VALUE of 100 does not mean "100% used up" — it means pristine. (SMART grades like a strict professor: everyone starts at 100, and the only direction is down.) The number you actually want is the last column, RAW_VALUE — the honest count. In the healthy drive above, every one of those reads a calm 0; on a dying drive, it's where the damage shows up first. When in doubt, read the raw.
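The countdown logic is easier to trust once you see it computed. The sketch below prints, for each attribute row, the normalized headroom (VALUE minus THRESH) next to the raw count. The name smart_headroom is hypothetical. One assumption worth flagging: real smartctl output carries an extra UPDATED column that the trimmed tables on this page omit, so the function finds RAW_VALUE from the header row instead of hard-coding a column number.

```shell
# Print normalized headroom (VALUE - THRESH) and the raw count for each
# attribute row of a smartctl attribute table read from stdin.
smart_headroom() {
    awk '
        $1 == "ID#" { raw_col = NF; next }   # header row: RAW_VALUE is its last column
        NF == 0     { raw_col = 0; next }    # a blank line ends the table
        raw_col && $1 ~ /^[0-9]+$/ && NF >= raw_col {
            printf "%s headroom=%d raw=%s\n", $2, $4 - $6, $raw_col
        }'
}
```

Typical use would be smartctl -A /dev/sda | smart_headroom. A large headroom with a non-zero raw count is exactly the case where the normalized columns look calm and the raw column does not.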
Now the attributes themselves. A full table runs forty rows of vendor trivia; you only need to recognise a handful — these, in rough order of how seriously I take a non-zero, and especially a growing, value:
- 5 — Reallocated_Sector_Ct. Sectors that failed and got remapped to the spare pool. A few, accumulated over years, is just age. A number that climbs between two readings is a drive actively shedding surface — the single most useful HDD attribute there is.
- 197 — Current_Pending_Sector. Sectors that read badly and are awaiting their fate. The freshest possible evidence: they failed now, this read, and haven't been remapped yet. Pending sectors either recover or graduate to reallocated/uncorrectable.
- 198 — Offline_Uncorrectable. Sectors the drive tried to recover and couldn't. Whatever lived there is gone. Non-zero is serious.
- 187 — Reported_Uncorrect. Errors the drive handed back up to the operating system unfixed — the count behind those I/O error lines in dmesg. The big fleet operators lean on this one hard as a death predictor, and so should you.
- 199 — UDMA_CRC_Error_Count. The odd one out, and the one that saves you from a costly mistake: this is not the disk's fault. CRC errors are corruption on the cable between drive and controller. A non-zero, non-growing value is usually a one-time blip; a climbing one means reseat or replace the SATA cable — not the drive.
Warning
Swapping a perfectly good disk because of CRC errors (attribute 199) is one of the classic rookie misreads — you bin a healthy drive, reconnect the new one through the same bad cable, and watch the errors come right back. When 199 is climbing but the real-defect attributes (5, 197, 198, 187) are all zero, the cable is the suspect. Power down, reseat both ends or swap the cable, note the current number, and watch. You can't usually reset a SMART counter — the drive owns it, not you — so "fixed" here means "stopped climbing," not "back to zero."
And then the one that isn't a defect at all: 202 / 233 / Wear_Leveling_Count / Media_Wearout_Indicator, the SSD endurance counter. This is age, not damage — it's the drive counting down its finite write budget — so it's worth knowing, but it's never a 3 a.m. emergency. (You're about to see exactly how loudly a drive can cry about this while being perfectly fine.)
On an NVMe drive the names change but the ideas are identical. smartctl -a /dev/nvme0 gives you a tidier health log:
Critical Warning: 0x00
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 7%
Media and Data Integrity Errors: 0
Read it the same way. Critical Warning is a bitmask — anything but 0x00 is the drive flagging something specific (spare low, gone read-only, overheating). Media and Data Integrity Errors is the NVMe cousin of reported-uncorrect: data served that failed the drive's own integrity check; non-zero is a real defect. Available Spare falling toward its threshold means the spare reserve is draining. And Percentage Used is, once again, just the endurance countdown — 7% means this drive has spent 7% of its rated writes and has a long life ahead.
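Since Critical Warning is a bitmask, a small decoder makes a non-zero value readable. This is a sketch: the function name nvme_critical_warning is invented here, and the bit meanings are paraphrased from the NVMe SMART / Health Information log definition (bits 0 to 4 only).

```shell
# Decode the NVMe "Critical Warning" bitmask into plain words.
# Usage: nvme_critical_warning 0x04
nvme_critical_warning() {
    local mask=$(( $1 ))
    if [ "$mask" -eq 0 ]; then
        echo "none"
        return 0
    fi
    [ $(( mask & 1 ))  -ne 0 ] && echo "available spare below threshold"
    [ $(( mask & 2 ))  -ne 0 ] && echo "temperature outside its thresholds"
    [ $(( mask & 4 ))  -ne 0 ] && echo "reliability degraded by media or internal errors"
    [ $(( mask & 8 ))  -ne 0 ] && echo "media placed in read-only mode"
    [ $(( mask & 16 )) -ne 0 ] && echo "volatile memory backup failed"
    return 0
}
```

For the healthy log above, nvme_critical_warning 0x00 prints none. A value such as 0x09 would report both a low spare pool and a read-only drive.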
Why You Ignore the Headline
Notice that I walked you through the entire table without once trusting that big PASSED/FAILED line at the top. That was deliberate, and it's the most important habit on this page: the overall-health verdict is the least reliable line in the entire report. Two real drives from our own racks show exactly why.
First, an SSD whose headline could not be more terrifying:
Device Model: Micron_1100_MTFDDAK512TBN
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.
"FAILED!" "SAVE ALL DATA." — and then, in the very same breath, one line down: "No failed Attributes found." The report is arguing with itself. (That blood-curdling middle line, by the way, is hardcoded boilerplate: smartctl has printed those exact words, unchanged, on every drive that trips the failed bit for about a quarter-century. It's not a real 24-hour estimate — it's a 1990s default that has panicked three generations of admins identically.) So we ignore the shouting and read the table:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE WHEN_FAILED RAW_VALUE
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age - 69172
187 Reported_Uncorrect 0x0032 100 100 000 Old_age - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age - 0
202 Percent_Lifetime_Used 0x0030 000 000 001 Old_age FAILING_NOW 100
Every attribute that signals an actual defect reads a clean 0. The only one tripping the alarm is 202 Percent_Lifetime_Used, at 100 with the flag FAILING_NOW — the endurance counter from a moment ago, reaching the end of its countdown. This SSD has been powered on for 69172 hours — call it 7.9 years — and over its life has soaked up roughly 470 terabytes of writes. It hasn't broken; it has retired. It will keep serving reads and writes perfectly well — it's a smoke alarm chirping because the battery is low, not because the house is burning. The honest verdict, the one a panicked beginner reading "SAVE ALL DATA" would never reach, is: plan a relaxed replacement whenever it's convenient, and sleep fine tonight.
Now the mirror image — and this is the dangerous one. A 6 TB hard drive whose headline could not be calmer:
Device Model: HGST HUS726060ALE610
SMART overall-health self-assessment test result: PASSED
PASSED. Move along, nothing to see. Except — read the table:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail - 11
9 Power_On_Hours 0x0012 089 089 000 Old_age - 82228
197 Current_Pending_Sector 0x0022 100 100 000 Old_age - 8
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age - 0
5 Reallocated_Sector_Ct = 11: eleven sectors already failed and got swapped out to spares. 197 Current_Pending_Sector = 8: eight more are suspect right now — read badly, not yet remapped. Bad sectors on a hard drive, once they start, tend to arrive in an accelerating rush (the physics at the end of the page explains why) — so treat the first few as the opening frames of an avalanche, not a stable state. This drive is 9.4 years old (82228 hours) and quietly coming apart — and its overall health says PASSED, because none of those raw counts has crossed the manufacturer's threshold yet.
Put the two side by side and the lesson writes itself: the verdict screamed FAILED at the healthy SSD and whispered PASSED at the rotting HDD. That one-line result is a single bit, computed against thresholds the manufacturer chose to suit their warranty math — not your uptime. The truth was in the attribute table the whole time. So, the rule that should outlive everything else on this page: read the attributes, never the verdict.
Reading It by Example
Train the pattern-match. Readout on the left, what I'd actually conclude on the right:
- overall-health: PASSED, all four defect attrs 0, Percentage Used: 12% → Healthy. Nothing to do. The happy, and by far most common, case — most of your drives look exactly like this.
- overall-health: FAILED, defect attrs 0, Percent_Lifetime_Used: 100 / Percentage Used: 100% → Worn out, not broken. The "SAVE ALL DATA" SSD. Schedule a calm replacement; it's living on borrowed warranty, not borrowed time. (Don't bother asking your hoster to swap it — it still works, so it won't meet their defect bar.)
- overall-health: PASSED, Reallocated_Sector_Ct: 11, Current_Pending_Sector: 8 → The avalanche, just starting. The verdict is lying by omission. Back up, watch the counts, plan to replace — and if it's a RAID member, this is the disk to swap.
- Reallocated and Pending both climbing across two days → A drive in active free-fall. A trajectory beats any single snapshot every time. Replace now, not this weekend.
- UDMA_CRC_Error_Count: 240 and rising, every defect attr 0 → Cable, not disk. Reseat or replace the SATA cable, then watch the number stop climbing. Do not RMA the drive.
- A short self-test that fails — smartctl -t short /dev/sda, wait two minutes, then smartctl -l selftest /dev/sda shows Completed: read failure ... LBA_of_first_error 0x... → Confirmed bad surface at a known block. That drive is done; back up and replace.
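The pattern-match above can be sketched as a tiny classifier over the attribute table. Treat it as an illustration of the reading order, not a replacement for judgement. The name smart_triage and its four output words are made up for this page; it looks only at one snapshot, so it cannot tell a rising CRC count from a stale one; and it assumes a wear attribute is recognisable by "Wear" or "Lifetime_Used" in its name, which is vendor-specific.

```shell
# Classify a smartctl attribute table read from stdin.
# Prints one of: defect, cable, worn, healthy (or unknown if no rows parsed).
smart_triage() {
    awk '
        $1 == "ID#" { raw_col = NF; next }                    # header locates RAW_VALUE
        !raw_col || $1 !~ /^[0-9]+$/ || NF < raw_col { next } # skip non-attribute lines
        {
            rows++
            raw = $raw_col + 0
            if ($1 == 5 || $1 == 187 || $1 == 197 || $1 == 198) defects += raw
            else if ($1 == 199) crc = raw
            else if ($2 ~ /Wear|Lifetime_Used/ && $(raw_col - 1) == "FAILING_NOW") worn = 1
        }
        END {
            if (!rows)            print "unknown"
            else if (defects > 0) print "defect"
            else if (crc > 0)     print "cable"
            else if (worn)        print "worn"
            else                  print "healthy"
        }'
}
```

Run as smartctl -A /dev/sda | smart_triage. Note the ordering: real defects outrank everything else, so a drive with both CRC errors and pending sectors is reported as a defect.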
How to Fix It
The right move depends on which of those you're looking at — but the very first step is always the same, and it isn't optional.
Danger
When the real-defect attributes (reallocated, pending, uncorrectable) are non-zero or climbing, assume the drive can die completely at any moment — a hard drive especially, because of the avalanche. Get your irreplaceable data off first, while reads still work, with something gentle like rsync -a to another machine. Do not kick off a heavy badblocks scan or a full fsck on a dying drive before you've backed up — the extra reads can be the straw that finishes it. No fix is more urgent than your backup.
Then, by cause:
- Real defect (reallocated / pending / uncorrectable growing): replace the drive. Bad sectors don't heal, and a drive that's started shedding them keeps going — faster, if it's spinning rust. If it's part of a RAID array, fail and remove the bad member (mdadm --fail /dev/md0 /dev/sdd, then --remove), slot in a fresh disk, and let the array rebuild onto it — which is the entire reason the array exists. On a rented or hosted box, open a ticket with the smartctl -a output pasted in; if the defect attributes are non-zero, a decent hoster swaps it, usually free and often same-day. (Their bar for "defective" is real-defect attributes, not the endurance counter — which is exactly the distinction you now know how to make.)
- Worn-out SSD (endurance maxed, no defects): replace it on your own schedule. No emergency. Order a drive, swap it when convenient, retire the old one. If it's mirrored, you've got all the time in the world to do it cleanly.
- CRC errors only (199 climbing, defects zero): fix the cable, not the disk. Power down, reseat the SATA data and power connectors at both ends, or swap the cable outright. You can't reset the counter (the drive owns it), so just note the current number and watch: if it stops climbing, you're done, and the disk was never the problem.
- A few stubborn pending sectors: writing fresh data over those blocks forces the drive to either recover or remap them, which a RAID scrub does as a side effect. Honestly, though, on a drive old enough to be growing pending sectors, the right answer is usually still "replace it" — your time is worth more than the disk.
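For the RAID case in the list above, it can help to write the swap down before touching anything. The sketch below only prints the steps; it runs nothing. The function name raid_swap_plan and the three device paths in the example are placeholders, and the commands use the array-first spelling (mdadm /dev/md0 --fail ...) from the mdadm man page. It assumes a Linux software RAID (md) array; hardware controllers have their own tools.

```shell
# Print (do not run) the mdadm steps for swapping a failing RAID member.
# Usage: raid_swap_plan <array> <failing_member> <replacement>
raid_swap_plan() {
    local array=$1 bad=$2 fresh=$3
    printf '%s\n' \
        "mdadm $array --fail $bad" \
        "mdadm $array --remove $bad" \
        "# physically swap the disk, then:" \
        "mdadm $array --add $fresh" \
        "cat /proc/mdstat"
}
```

For example, raid_swap_plan /dev/md0 /dev/sdd /dev/sde prints five lines to review, ending with the /proc/mdstat check that shows rebuild progress.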
How to Avoid Them
Let's be blunt, because it's kinder than false comfort: you cannot prevent failing disks. Run a few bare-metal servers for a few years and dead drives are not a risk, they're a certainty — the physics at the end of this page guarantees it. So the goal is never "avoid"; it's "be completely unbothered when it happens." A short list gets you there, in order of importance:
- Backup. The only thing that turns a dead disk from a catastrophe into a chore. A backup you've actually tested a restore from — not just configured and assumed — is the difference between a shrug and a resignation letter. (Our backup guide covers the how.)
- Backup. Yes, seriously. It's on this list twice because nothing else on it actually saves your data — RAID and SMART below just buy you warning and uptime. Disks will die; backup is the one thing that copes with it when they do. So do it now, before you read another section — it's worth it, and it's easier than you think. Our backup guide shows the best practices and how to set one up without a big effort.
- RAID. Mirror or parity your data across multiple drives so that any one of them can die without taking your service down — the failing disk becomes a calm, hot-swap replacement instead of an outage. One sharp edge worth knowing: the rebuild after a disk dies is the single most strenuous thing you'll ever ask of the survivors — every one gets read cover to cover to reconstruct the missing disk, and tired, same-age siblings sometimes pick exactly that moment to fail too. It's why large arrays run double parity, and why it deserves its own page: RAID.
- SMART self-tests. Turn on smartd and schedule a long self-test (smartctl -t long) so the drive walks its entire surface on a timer and finds weak sectors before your workload does. A healthy RAID setup does the equivalent automatically — a periodic scrub that reads every block across the array, catching a silently rotting sector on one disk while the others can still rebuild it.
Note
RAID is not a backup, and believing it is has ended more companies than disk failure ever did. RAID survives a dead disk. It does nothing against rm -rf, a buggy deploy, ransomware, or a fire — every one of which it faithfully mirrors to all your drives at once. And drives bought as a batch and aged in the same hot rack like to fail near each other, so "two disks dead in one week" is less rare than the statistics naively suggest. That's why backup is rules 1 and 2, and RAID is only rule 3.
And the deepest version of the SMART rule isn't a one-off command at all — it's watching the trend. One reallocated sector is noise; one reallocated sector becoming twenty over a week is the avalanche, and you only see it if something reads the diary every day and compares. A single manual smartctl run, weeks apart, misses exactly the signal that matters most.
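Watching the trend needs nothing fancier than two saved snapshots and a comparison. Here is a minimal sketch, assuming you save the attribute table to a file each day; the function name smart_trend and the file naming in the usage note are illustrative only. It reports just the defect and cable counters (5, 187, 197, 198, 199) whose raw value grew.

```shell
# Compare two saved smartctl attribute tables and print counters that grew.
# Usage: smart_trend <older_snapshot> <newer_snapshot>
smart_trend() {
    awk '
        $1 == "ID#" { raw_col = NF; next }                    # header locates RAW_VALUE
        !raw_col || $1 !~ /^[0-9]+$/ || NF < raw_col { next } # skip non-attribute lines
        $1 != 5 && $1 != 187 && $1 != 197 && $1 != 198 && $1 != 199 { next }
        FNR == NR { old[$1] = $raw_col + 0; next }            # first file: remember values
        ($1 in old) && (($raw_col + 0) > old[$1]) {
            printf "%s %s: %d -> %d\n", $1, $2, old[$1], $raw_col
        }' "$1" "$2"
}
```

A daily cron job could write smartctl -A /dev/sda > smart-$(date +%F).txt and then call smart_trend on yesterday's and today's files. No output means nothing grew.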
How Disks Actually Die
Now the part you don't need in an emergency — but that turns this from a checklist into a sense, and happens to be one of the better stories in computing. The two kinds of drive in your server die in two completely different ways, for reasons that go all the way down to physics. Once you can picture the machinery, every number above stops being trivia to memorise and becomes something you can simply reason out.
Flash: A Few Thousand Writes, and Out
An SSD or NVMe drive has no moving parts at all. It stores each bit as a tiny trapped electric charge in a microscopic cell — picture a bucket that holds a few electrons, with a switch that reads "on" if the bucket's full and "off" if it's empty. Billions of buckets, and your data is just which ones are full. Reading a bucket is gentle; you can do it forever. But writing one means forcing charge across a thin insulating wall, and erasing it means dragging that charge back out — and every single time, that wall wears down a little. It's not a metaphor for wear; it is literally wearing through an insulator a few atoms thick.
Which means a flash cell has a hard, countable lifespan: a few thousand erase cycles for typical consumer flash, then the wall gets too leaky to reliably tell "full" from "empty." The cell doesn't explode. It just gets unreliable — and the drive, which has been expecting this since the day it was made, retires it and moves on.
And here's the first piece of real magic, the thing that makes SSDs work at all: individual cells fail constantly, and the drive is built entirely around hiding that from you. Every SSD ships with more flash inside than it advertises — a private reserve of spare cells (over-provisioning). It runs relentless internal bookkeeping that spreads your writes evenly so no cell wears out years before its neighbours (wear leveling), wraps every block in error-correcting codes that can rebuild data from a cell gone slightly wrong (ECC), and quietly relocates data off any block getting flaky. An SSD is, under the hood, a tiny civilization constantly moving residents out of crumbling buildings into fresh ones, demolishing the old blocks, and never once mentioning it to the city above. "No moving parts" is true — but it's furiously busy in there.
This is also where the price tags come from, and it directly predicts how long a drive lives. The buckets can hold more than one bit if you measure the charge level precisely enough:
- SLC — one bit per cell. Bucket's full or empty: dead simple, fast, survives tens of thousands of writes. Also expensive, so it mostly lives in enterprise gear.
- MLC / TLC / QLC — two, three, or four bits per cell. To fit four bits the drive must distinguish sixteen different charge levels in one tiny leaky bucket — far denser and cheaper, but slower and much less durable, because as the wall wears those sixteen levels blur into each other fast. Consumer drives are mostly TLC and QLC, rated for hundreds to a low thousand writes per cell.
Now look at one of those QLC buckets and tell me: is it 68.75% full, or just 62.5%? Those are two different stored values — eleven sixteenths versus ten sixteenths — and the drive has to tell them apart, reliably, billions of times. Now picture billions of these leaky buckets packed tight, side by side and stacked in towers more than two hundred layers deep — yes, modern flash is built upward now, microscopic skyscrapers of cells — every one of them slowly weeping a little charge into its neighbours, and all of your data resting on the controller's ability to still read the exact fill level of each one. Let that picture sink in for a second. … Now is an excellent moment to double-check that your backup script actually runs.
(There's a lovely sleight of hand that keeps these drives fast despite all that: many cheap QLC drives run a small slice of their flash in fast, durable SLC mode as a write cache, absorbing bursts at SLC speed and trickling them back to dense QLC later. It's why a budget SSD feels quick right up until you copy something huge and watch it fall off a cliff mid-transfer — you've overrun the cache and hit the raw QLC underneath.)
The upshot for failure: an SSD mostly dies by wearing out — predictably, gracefully, with plenty of warning, because it's been counting down its write budget the whole time and will happily tell you exactly how much is left. Remember that word graceful. The other kind of drive doesn't do graceful at all.
Spinning Rust: A 747 Flying a Couple of Millimetres Off the Ground
A hard drive — affectionately, spinning rust — is one of the most absurd feats of engineering you own, and once you picture what's happening inside, its failure modes stop being mysterious and start being obvious.
Inside the sealed case: a stack of up to ten glass-or-aluminium platters coated in magnetic film, on a spindle that whirls them at 7,200 revolutions a minute (or 10k, or 15k in the fast ones). A hair's breadth above each surface — and below it, since both faces are used — floats a read/write head on a swing arm, riding a cushion of air dragged along by the spinning platter. It does not touch the surface. It flies. And the flying height is the part that breaks your brain: a few nanometres. Scaled up to something human, imagine a Boeing 747 at full speed a couple of millimetres above the runway, for years, never once scraping. That's the tolerance a hard drive holds, billions of times, while you forget it exists.
And the newest high-capacity drives go one better: they're filled with helium instead of air, then welded shut. Helium is far less dense, so there's less turbulence buffeting the heads and less drag on the platters — which lets the maker stack in more, thinner platters, run cooler, and sip less power. Yes: the 12- and 20-terabyte drives in big storage arrays are, quite literally, full of the same gas that makes party balloons float.
Now you can feel how a hard drive dies. At that flying height the enemy is anything that closes the gap: a speck of dust, a knock to the chassis, a worn spindle bearing letting the platter wobble a few nanometres. The head brushes the surface — a head crash — and gouges a microscopic scratch in the magnetic film. The data there is gone, which is bad enough. But here's the vicious part, the thing that makes hard-drive failure behave the way it does: that scratch throws off fresh debris — more particles now tumbling around a sealed box at the speed of a spinning platter, each one a fresh chance for the next scratch. One bad spot seeds the next.
That's why, when a hard drive starts failing, the bad sectors don't trickle — they avalanche. A drive that read clean last week can show a dozen reallocated sectors today and a hundred next week, because each failure is busy manufacturing the next. This is the single most important practical fact about HDD failure, and now you know it isn't folklore — it's physics. The drive defends itself the only way it can: when it finds a sector it can't read cleanly, it copies the data (if it still can) to a hidden spare and remaps the address, so your filesystem never sees the wound. That remap is a reallocated sector — the very attribute from the diagnosis above. A handful gathered slowly over years is age. A handful that appeared this week is the avalanche starting.
So: two drives, two deaths. Flash wears out like a pencil, shorter every day in a way you can measure. Rust fails like ice cracking — quiet, then all at once. Hold both pictures and the whole SMART table reads itself.
What Makes Them Die Faster
The lifespan is set by physics, but a few things you control move the needle hard:
Heat. The quiet killer of both kinds. Bearings, lubricants, and magnetic film all degrade faster hot; flash cells leak charge faster hot. As a rough rule storage engineers carry around, sustained high temperature can roughly double the failure rate for every 10–15 °C — so a disk baking at 55 °C year-round is living a much shorter life than the identical drive at 35 °C. Airflow is cheap; disks are not.
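To put numbers on that rule of thumb, here is a small sketch of the arithmetic. It takes the doubling rule at face value, which is itself only a rough rule; the function name heat_factor is made up for this page.

```shell
# Relative failure rate under a "doubles every N degrees" rule of thumb.
# Usage: heat_factor <cool_C> <hot_C> <doubling_step_C>
heat_factor() {
    awk -v cool="$1" -v hot="$2" -v step="$3" \
        'BEGIN { printf "%.1f\n", 2 ^ ((hot - cool) / step) }'
}
```

For the 35 °C versus 55 °C example, heat_factor 35 55 10 gives 4.0 and heat_factor 35 55 15 gives 2.5, so the hot drive is running at somewhere between two and a half and four times the failure rate under this rule.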
Writes — especially from databases. Reads are nearly free; it's writes that spend an SSD's finite life and keep a hard drive's heads busy. And the biggest, most surprising write amplifier in most stacks is a database configured to be maximally safe. Which brings us to a story. We once (mis)configured a database to do the most cautious thing it possibly could: persist every single write straight to the drive the instant it happened — no batching, no buffering, each transaction individually forced down to disk before the database would say "done." It was, predictably, agonisingly slow. But the part that stuck with us was what it did to the SSD underneath: we watched the drive's wear indicator — the percentage of its rated write life burned through — climb from 0 to 80 in a matter of weeks. A drive built to last most of a decade was on track to wear out before the next quarter, purely because of one configuration line. We relaxed the setting, the writes batched up sanely, and the counter went back to crawling. Lesson seared in: a database's durability setting is a knob wired directly to your SSD's lifespan. This is also why a write-heavy database belongs on a drive rated for it — SSD endurance is sold as TBW (terabytes written) or DWPD (drive writes per day: full-capacity rewrites per day over the warranty period); put a busy database on a cheap QLC drive and you're not buying storage, you're buying a countdown.
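Both endurance figures reduce to simple arithmetic, sketched below. The function names are hypothetical, the DWPD conversion assumes 365-day years, and the days-left estimate assumes the wear counter keeps climbing at the same steady rate, which a changed workload will break.

```shell
# Rated terabytes written implied by a DWPD rating.
# Usage: dwpd_to_tbw <dwpd> <capacity_tb> <warranty_years>
dwpd_to_tbw() {
    awk -v d="$1" -v c="$2" -v y="$3" \
        'BEGIN { printf "%.0f\n", d * c * 365 * y }'
}

# Days until the wear counter reaches 100, from two readings.
# Usage: wear_days_left <percent_then> <percent_now> <days_between>
wear_days_left() {
    awk -v a="$1" -v b="$2" -v n="$3" 'BEGIN {
        if (b <= a) { print "not wearing measurably"; exit }
        printf "%.0f\n", (100 - b) / ((b - a) / n)
    }'
}
```

A 1 TB drive rated at 1 DWPD for 5 years works out to dwpd_to_tbw 1 1 5, or 1825 TB. And a counter that went from 0 to 80 in, say, 40 days has wear_days_left 0 80 40, about 10 days, before it reads 100.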
Spinning up and down. Here's the question worth its own answer: should you save power and wear by parking idle hard drives, or does switching them on and off do more harm than just letting them run? For anything you touch regularly, let them run. The single most stressful moment in a hard drive's life is spin-up — peak current, bearings going from cold and still to 7,200 rpm, heads loading onto the platter — and drives are rated for only a finite number of start/stop and head-load cycles. Constant gentle spinning at a sane temperature beats a life of cold starts. And there's a famous massacre to prove it: a generation of "green," power-thrifty consumer drives shipped with aggressive head-parking on by default, unloading the heads after as little as eight seconds of idle to save a trickle of power. Drop one into a Linux server that sighs to disk every couple of minutes and it would park and reload its heads hundreds of times an hour, burning a load-cycle rating meant to last years in a matter of months. People watched SMART attribute 193 Load_Cycle_Count tick into the hundreds of thousands and their disks die young — all to save a fraction of a watt. (hdparm -B 254 disarms it.) So spin-down genuinely helps only for disks that are truly idle for long stretches — an archival or backup drive touched once a day. For your live data: keep it spinning, keep it cool.
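A quick way to spot a drive that parks too eagerly is to divide its load cycles by its powered-on hours, using the raw values of attributes 193 and 9. This is a sketch with an invented function name; what counts as "too many" depends on the drive's rated load-cycle figure, which is in its datasheet rather than in SMART.

```shell
# Average head load cycles per powered-on hour.
# Usage: load_cycles_per_hour <Load_Cycle_Count raw> <Power_On_Hours raw>
load_cycles_per_hour() {
    awk -v lc="$1" -v poh="$2" 'BEGIN {
        if (poh <= 0) { print "n/a"; exit }
        printf "%.2f\n", lc / poh
    }'
}
```

The healthy 6 TB drive earlier on this page shows 3478 load cycles over 85576 hours, so load_cycles_per_hour 3478 85576 prints 0.04. A drive parking hundreds of times an hour would stand out immediately.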
Sudden power loss. Yank power mid-write and you can corrupt the block being written; do it often and some drives wear in odd ways defending themselves. (They count it — there's a SMART attribute for unexpected power loss.) A UPS is cheap insurance for a machine that matters.
There are more threads to pull — vibration in dense disk shelves, the gulf between consumer and enterprise duty ratings, the way an SSD can slowly lose data left unpowered for many months — but those are pages of their own. For now you have the whole shape of it: heat, writes, cold starts, power, time. The drive keeps an honest diary through all of it — but a diary only ever buys you warning. The one thing that actually saves you is the backup you made before you needed it. Sysadmins have a blunter way of saying it, half joke and half creed, and they're dead right: no backup, no pity.
See Also
- backup — the one habit that makes every dead disk a non-event; the most important link on this page
- smartctl — the tool that reads the drive's diary
- smartmontools — the package, plus the smartd daemon that watches it for you
- dmesg — where the kernel's raw I/O errors surface first
- hdparm — tune power management, spindown, and head-parking
- RAID — surviving a dead disk without losing sleep or data
- SSD — how flash actually stores a bit, and why it wears out
- disk full — the other disk emergency, about space rather than health
- high I/O wait — what a stalling drive does to the rest of the box
- degraded RAID array — when the failing disk was part of an array