Worn-Out SSD: Symptoms, Diagnosis & Fixes
An SSD doesn't break when it wears out — it retires. The trick is telling the two apart.
What It Is
A worn-out SSD is a solid-state drive that has written everything it was ever rated to write. Flash storage doesn't last forever: every cell can only be erased and rewritten a finite number of times before it stops holding charge reliably, and the drive counts down that budget from the day it's switched on. When the counter hits the end, the drive has reached its rated endurance — and the way it announces that is, frankly, a public-relations disaster. It trips the same overall-health bit a genuinely-dying disk would, smartctl prints FAILED!, and a perfectly serviceable drive sends its owner into a 3 a.m. panic over a problem that is, in plain terms, the warranty running out.
So let's lead with the single most reassuring fact on this page, because it's the whole point: a worn-out SSD is not a broken SSD. Wearing out is the designed, expected, graceful end of a flash drive's life — the equivalent of a car rolling past its rated mileage. It has not lost your data, it is not shedding sectors, and in almost every case it will keep serving reads and writes perfectly well for a good while yet. This is the polar opposite of a disk failing, where the media itself is physically defective — scratched platters, dead cells, sectors that read back wrong. Wear is age. A defect is damage. The entire skill this page teaches is reading the SMART report well enough to tell, at a glance, which one is staring back at you — so you replace the dying disk tonight and let the merely-old one keep earning its keep until a convenient Tuesday.
By the end you'll read the NVMe health log line by line, understand exactly what "Percentage Used: 245%" means (yes, a percentage can sail past 100, and we have a real one that did), tell wear-out from a real defect without a moment's doubt, and know precisely how relaxed to be about each. We'll start where it counts — spotting it, reading it, fixing it, preparing for it — and save the remarkable why (how a flash cell physically wears through an insulating wall a few atoms thick) for the end, where it belongs once nobody's panicking.
How You Notice
A worn-out SSD rarely breaks anything — that's what makes it confusing. It usually surfaces not as a failure but as a scary-looking report you went looking at for another reason. Here's each place it shows up, with the command to see it on your own box right now:
- A smartctl overall-health of FAILED, on a drive that's working fine. This is the classic. You run a health check, expecting PASSED, and get the most alarming words in the entire monitoring vocabulary:
  smartctl -H /dev/nvme0
  SMART overall-health self-assessment test result: FAILED!
  And yet the server is up, the filesystem is happily read-write, nothing is slow, nothing is erroring. The report is contradicting the evidence in front of you. Hold that thought: resolving exactly this contradiction is the heart of this page.
- An NVMe critical-warning bit set. The richer NVMe health log carries a one-byte bitmask of warnings. Read it:
  smartctl -a /dev/nvme0 | grep -i "Critical Warning"
  nvme smart-log /dev/nvme0 | grep -i critical_warning
  A value of 0x00 is the all-clear. Anything else is the drive flagging something specific, and the bit that wear trips, 0x04, means "reliability degraded," which the drive sets the moment Percentage Used crosses 100%. Scary label, mundane cause.
- A wear indicator climbing toward (or past) 100%. The honest, quantitative symptom, and the line you actually want:
  smartctl -a /dev/nvme0 | grep -iE "Percentage Used|Available Spare"
  Percentage Used: 92% is a drive in its final stretch; Percentage Used: 100% is at its rated end; and, as you're about to see, Percentage Used: 245% is a drive 2.5× past its rated write life and still going. On older SATA SSDs the same idea hides under the attribute table as 177 Wear_Leveling_Count, 202 Percent_Lifetime_Used, or 233 Media_Wearout_Indicator: same countdown, different name.
- A smartd email, if you've set it up. The smartmontools daemon can email root the instant the health bit flips. Most servers never turn it on, which is exactly why the surprise FAILED! is the first many admins ever hear of endurance at all. We'll fix that at the end.
None of these is an emergency by itself. Every one of them is a cue to stop reacting to the headline and go read the numbers underneath it — because the numbers are where wear and defect part ways, and they part ways cleanly.
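If you'd rather pull those numbers from a script than grep for them, here is a minimal sketch. It assumes smartmontools 7.0 or later (which added JSON output via -j) and that the NVMe health log sits under the nvme_smart_health_information_log key, as it does in current releases. The function names read_health and wear_summary are my own, not part of any tool.

```python
import json
import subprocess


def read_health(device: str) -> dict:
    """Run smartctl with JSON output and return the parsed report (needs root)."""
    # smartctl's exit status is a bit mask, so a non-zero code does not
    # necessarily mean the command failed; parse stdout regardless.
    result = subprocess.run(
        ["smartctl", "-j", "-a", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(result.stdout)


def wear_summary(report: dict) -> dict:
    """Pick out the handful of fields that separate wear from defect."""
    log = report.get("nvme_smart_health_information_log", {})
    return {
        "passed": report.get("smart_status", {}).get("passed"),
        "critical_warning": log.get("critical_warning"),
        "available_spare": log.get("available_spare"),
        "available_spare_threshold": log.get("available_spare_threshold"),
        "percentage_used": log.get("percentage_used"),
        "media_errors": log.get("media_errors"),
    }
```

Typical use would be wear_summary(read_health("/dev/nvme0")); missing fields come back as None rather than raising, so the same helper can be pointed at a non-NVMe report without crashing.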
How I Read It
Every modern drive keeps a running diary of its own condition called SMART — Self-Monitoring, Analysis and Reporting Technology — and the tool that reads it is smartctl, from the smartmontools package. On an NVMe drive the report is a clean, tidy health log rather than the sprawling attribute table of an older SATA disk, and -a ("all") dumps the lot:
smartctl -a /dev/nvme0
Here's a real one — verbatim from app-01, a production box whose two Samsung 1 TB NVMe drives have been writing hard for years. I've trimmed a couple of the more verbose blocks, but every number below is exactly as the drive reports it:
=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZVLB1T0HALR-00000
Total NVM Capacity: 1,024,209,543,168 [1.02 TB]
Namespace 1 Utilization: 812,415,725,568 [812 GB]
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- available spare has fallen below threshold
- percentage used exceeds capacity
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
Temperature: 43 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 245%
Data Units Read: 1,103,994,521 [565 TB]
Data Units Written: 1,442,713,734 [738 TB]
Host Read Commands: 18,442,031,995
Host Write Commands: 24,901,773,402
Controller Busy Time: 92,418
Power Cycles: 37
Power On Hours: 45,818
Unsafe Shutdowns: 1
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Take a breath, because this drive is the perfect teacher. The headline screams FAILED! and helpfully adds two red-letter reasons. The first — "available spare has fallen below threshold" — is, on this drive, a flat-out lie (read the spare line: 100% against a 10% floor; the spare reserve is completely untouched). The second — "percentage used exceeds capacity" — is true but harmless. This is a drive 2.5× past its rated endurance, still holding a full spare pool, with zero media errors and zero integrity errors. In plain language: worn well past warranty, not remotely broken. Let's walk every line and prove it, because once you can read this log, every wear-vs-defect call you'll ever make collapses into about four numbers.
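On a box whose smartctl is too old for JSON output, the text log above is regular enough to parse directly. The following is a rough sketch, not a full parser: it keeps only lines of the form "Name: value" whose value starts with a number, and the function name parse_health_log is hypothetical.

```python
import re

# A value is either a hex literal (0x04) or a decimal with thousands commas.
_NUMBER = re.compile(r"(0x[0-9A-Fa-f]+|\d[\d,]*)")


def parse_health_log(text: str) -> dict:
    """Map field names to integers for every numeric 'Name: value' line."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # headings and the "- reason" lines have no colon
        match = _NUMBER.match(value.strip())
        if not match:
            continue  # non-numeric values such as the model name
        token = match.group(1).replace(",", "")
        base = 16 if token.lower().startswith("0x") else 10
        fields[key.strip()] = int(token, base)
    return fields
```

Units and bracketed summaries such as "%" or "[738 TB]" are simply ignored, so "Percentage Used: 245%" becomes the integer 245.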
The Lines That Decide Wear vs Defect
Read the NVMe health log top to bottom, but weight it like this. The five lines below settle the verdict; the rest is context:
- Critical Warning: a one-byte bitmask, the drive's own summary of what's wrong. 0x00 is the all-clear; any other value is one or more bits set. Each bit means a specific thing (full breakdown below). On app-01 it reads 0x04, which is bit 2 alone (reliability degraded), and the drive sets that bit automatically the moment Percentage Used passes 100%. So 0x04 here isn't reporting a fault; it's reporting that the warranty clock ran out. That's the single most important misread on this page, and now you'll never make it.
- Available Spare / Available Spare Threshold: the reserve of spare flash blocks the drive keeps to swap in as cells fail, expressed as a percentage of that reserve still intact, against the manufacturer's failure floor. This is the line that separates "worn but fine" from "actually dying." 100% against a 10% threshold, as on app-01, means the drive has not yet had to spend a single spare block: its physical media is pristine. Spare falling toward the threshold is the drive consuming its safety margin; spare at or below the threshold is a genuine defect and a real emergency (that's NVMe spare exhausted, a different page with a much shorter fuse).
- Percentage Used: the endurance odometer, and the star of this page. It's the fraction of the drive's rated write endurance consumed, and here's the line that breaks brains: it can exceed 100%. It's not "percent of the drive full" and it's not a health bar that pins at 100; it's a literal odometer that keeps counting. 100% means "you've written everything the warranty promised"; 245% means "you've written two and a half times that and the drive is still serving." The NVMe spec even says the value saturates at 255, which is exactly why app-01's sibling drive reads 255% rather than something higher: it's pegged the gauge. A high number here is age, full stop. It is never, by itself, a defect.
- Media and Data Integrity Errors: the defect line that matters most after spare. This counts times the drive returned (or detected) data that failed its own integrity check, the NVMe equivalent of the dreaded reported-uncorrectable on a hard drive. On app-01 it's 0, which is the whole ballgame: a worn drive with zero integrity errors hasn't actually corrupted anything. A non-zero value here is a real defect and pushes you straight to disk failing.
- Error Information Log Entries: a running count of entries in the drive's internal error log. A few over years of uptime is normal housekeeping; a number that climbs between two readings is worth a look. 0 here, on a drive with 45,818 hours on it, is genuinely impressive.
Those five lines hand you the verdict. The rest of the log is the drive's life story, and it's worth knowing too:
- Data Units Written: how much the drive has been written to over its life, the cause of the wear above. NVMe reports it in 512,000-byte units, which is why smartctl does you the kindness of the bracketed [738 TB]. That is a colossal number: app-01 has absorbed 738 terabytes of writes (its sibling, 736 TB), which on a 1 TB drive is roughly 738 full-drive overwrites. Compare it to the drive's rated TBW (its endurance spec) and Percentage Used stops being magic and becomes, to a first approximation, simple division: writes-so-far ÷ rated-TBW.
- Data Units Read: the same, for reads. Reads are nearly free in endurance terms (more on why at the end), so this number is mostly trivia, but a read total far exceeding writes is the fingerprint of a read-heavy workload, and the reverse (writes outpacing reads, as on app-01) is the fingerprint of a write-heavy one. Which is the whole reason these drives are worn: something has been writing to them, hard, for years.
- Host Read / Write Commands: the raw count of I/O operations the host issued. Divide Data Units (converted to bytes) by these and you get average I/O size, occasionally useful for spotting a workload doing zillions of tiny writes (murder on flash) vs. fewer big ones.
- Power On Hours: total time powered up. 45,818 hours is 5.2 years of continuous service. This drive has earned its retirement honestly.
- Power Cycles / Unsafe Shutdowns: how many times it's been powered up, and how many of those were not a clean shutdown (yanked power, crash, kernel panic). app-01 shows 37 cycles and a single unsafe one in 5+ years, a beautifully-behaved server. A high unsafe-shutdown count points at power or stability trouble upstream of the disk (see power issue).
- Temperature + Warning/Critical Comp. Temperature Time: current temperature, plus how long the drive has spent in its warning and critical thermal zones. 43 Celsius is comfortable, and 0/0 on the temperature-time counters means it has never been too hot. Heat is the great accelerant of wear (we'll come back to it), so cool and zero is exactly what you want to see.
- Controller Busy Time: minutes the controller spent actively processing I/O. Pure trivia for most people; a capacity-planning signal for a few.
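The arithmetic behind those life-story lines is short enough to check by hand. A small sketch using app-01's own counters from the log above; the helper names are illustrative only.

```python
UNIT_BYTES = 512_000  # one NVMe "data unit" is 1000 blocks of 512 bytes


def units_to_tb(data_units: int) -> float:
    """Convert NVMe data units to decimal terabytes."""
    return data_units * UNIT_BYTES / 1e12


def avg_io_bytes(data_units: int, commands: int) -> float:
    """Average size of one host command, in bytes."""
    return data_units * UNIT_BYTES / commands


def hours_to_years(hours: int) -> float:
    """Power-on hours expressed as years of continuous service."""
    return hours / (24 * 365)


# Counters copied from the app-01 log.
written_units = 1_442_713_734
read_units = 1_103_994_521
write_commands = 24_901_773_402
power_on_hours = 45_818

print(f"written: {units_to_tb(written_units):.1f} TB")
print(f"read:    {units_to_tb(read_units):.1f} TB")
print(f"average write size: {avg_io_bytes(written_units, write_commands) / 1000:.1f} kB")
print(f"service: {hours_to_years(power_on_hours):.1f} years")
```

The average write size is the figure to watch for the "zillions of tiny writes" pattern: the smaller it is, the more internal rewriting each host write tends to cost.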
The Critical-Warning Bitmask, Decoded
Because that one byte settles so much, here's the whole map. Critical Warning is eight bits; read it like a row of warning lights:
- bit 0 (0x01): available spare has fallen below threshold. A real defect; the media is failing.
- bit 1 (0x02): temperature has crossed a critical threshold. The drive is too hot right now.
- bit 2 (0x04): reliability degraded due to media errors or the rated endurance being exceeded. This is the wear bit (the one on app-01), and it's the only one that fires on a healthy-but-old drive.
- bit 3 (0x08): the media has been placed in read-only mode. The end of the line; the drive has given up on writes.
- bit 4 (0x10): the volatile-memory backup device (the capacitor that flushes the drive's cache on power loss) has failed.
- bit 5 (0x20): the persistent-memory region has become unreliable (enterprise feature; rare).
So 0x04 is bit 2 standing alone — the gentlest possible non-zero value, and the one this whole page is about. If you ever see 0x01, 0x08, or a value with several bits lit (0x05, 0x09…), that's no longer this page: that's a drive with a genuine defect, and you want NVMe spare exhausted or disk failing.
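Reading the byte as a row of warning lights translates directly into a few bitwise tests. A minimal sketch; the labels are shortened paraphrases of the list above, and the function names are my own.

```python
CRITICAL_WARNING_BITS = {
    0x01: "available spare below threshold",
    0x02: "temperature threshold crossed",
    0x04: "reliability degraded",
    0x08: "media placed in read-only mode",
    0x10: "volatile memory backup device failed",
    0x20: "persistent memory region unreliable",
}


def decode_critical_warning(value: int) -> list[str]:
    """Return the label of every bit that is set, lowest bit first."""
    return [label for mask, label in CRITICAL_WARNING_BITS.items() if value & mask]


def is_wear_only(value: int) -> bool:
    """True when bit 2 is set and nothing else is."""
    return value == 0x04


for raw in (0x00, 0x04, 0x05, 0x09):
    print(f"0x{raw:02x}", decode_critical_warning(raw), "wear only:", is_wear_only(raw))
```

Only 0x04 passes is_wear_only; 0x05 and 0x09 both carry the spare-low bit and belong on the defect pages.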
Note
smartctl's FAILED! verdict and its red "available spare has fallen below threshold" line are both computed from the critical-warning bitmask, not by re-reading the actual spare percentage. So in this report, with bit 2 (endurance) the only bit set, the spare-low text still appears as a generic explanation, which is why app-01 shows FAILED! and "spare below threshold" right above a spare reading of a pristine 100%. The numbers, not the headline, are the truth. Read the spare line yourself; if it's comfortably above its threshold, the spare warning is boilerplate.
Why You Ignore the Headline
Notice we reached a confident verdict — worn, not broken — by reading the numbers and walking straight past the big FAILED!. That's deliberate, and it's the most valuable habit on this page: the overall-health line is the least reliable thing in the report. It's a single bit, OR-ed together from the critical-warning byte, and the manufacturer sets the endurance threshold to suit their warranty math, not your uptime. A drive that has written exactly its rated TBW and not one byte more — flawless media, full spare pool, zero errors — trips FAILED! the instant the odometer clicks past 100%, purely as a "your warranty is up" signal. Meanwhile a hard drive quietly shedding sectors can still read PASSED because its raw counts haven't crossed the maker's thresholds yet (that mirror-image trap is covered in disk failing). The verdict screams at the healthy drive and whispers at the dying one. So the rule that should outlive everything else here: read the spare, the integrity errors, and the percentage — never the verdict.
Reading It by Example
Train the pattern-match. Readout on the left, what I'd actually conclude on the right:
- overall-health: PASSED, Percentage Used: 12%, Available Spare: 100%, errors 0 → Healthy and young. Nothing to do. By far the most common case: most of your SSDs look exactly like this for years.
- overall-health: PASSED, Percentage Used: 88%, Available Spare: 100%, errors 0 → Aging gracefully, nearing the end of its rated life but not there yet. No action; the spare pool is full and there are no defects. Put a replacement on the someday list.
- overall-health: FAILED!, Critical Warning: 0x04, Percentage Used: 100%+, Available Spare: 100%, Media and Data Integrity Errors: 0 → Worn out, not broken. The app-01 case exactly. Plan a calm replacement on your own schedule; it's living on borrowed warranty, not borrowed time. (Don't bother asking your hoster to swap it: it still works, so it won't meet their defect bar.)
- Percentage Used: 245%, sibling at 255%, both spare 100% → A worn but rock-solid mirrored pair, both well past warranty and both fine. The matched wear is the tell that they were installed together and carry the same workload, which also means they'll reach genuine end-of-life near each other, so when you swap, plan to swap both, and ideally stagger the replacements so a RAID rebuild never runs across two equally-tired drives at once.
- Critical Warning: 0x01, Available Spare: 8%, Available Spare Threshold: 10% → Not wear: defect. The spare pool is exhausted and the drive is about to drop to read-only. This is an emergency, not a someday. Go to NVMe spare exhausted: back up now, replace now.
- Media and Data Integrity Errors: 14 and rising → Also not wear: defect. The drive is returning bad data. Treat it as a disk failing: back up immediately, replace.
- Critical Warning: 0x08 (read-only) → The drive has stopped accepting writes entirely. Whether from wear or defect, you're out of runway: it's read-only, so your last job is to read everything off it and replace it today.
The fork is simple once you've internalised it: spare full + integrity errors zero = wear (relax); spare low OR integrity errors non-zero = defect (act). Percentage Used, however eye-watering, only ever sorts you into the wear column.
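That fork is mechanical enough to write down. A sketch under the rules stated on this page (spare at or below threshold, any integrity error, or any critical-warning bit other than 0x04 means defect); the function name and the three return labels are my own choices.

```python
WEAR_BIT = 0x04


def classify(spare: int, spare_threshold: int, media_errors: int,
             critical_warning: int = 0, percentage_used: int = 0) -> str:
    """Return 'defect' (act), 'worn' (relax, plan a swap) or 'healthy'."""
    other_bits = critical_warning & ~WEAR_BIT
    if spare <= spare_threshold or media_errors > 0 or other_bits:
        return "defect"
    if percentage_used >= 100 or critical_warning & WEAR_BIT:
        return "worn"
    return "healthy"


# The readouts from the examples above.
print(classify(100, 10, 0, 0x00, 12))   # young drive
print(classify(100, 10, 0, 0x04, 245))  # app-01
print(classify(8, 10, 0, 0x01, 60))     # spare exhausted
print(classify(100, 10, 14))            # integrity errors
```

Note that percentage_used never appears in the defect branch: however high it climbs, it can only move a drive from "healthy" to "worn".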
How to Fix It
The fix for a genuinely worn-out SSD is the most relaxed fix in this entire knowledge base: replace it, eventually, on your own schedule. There's no avalanche to outrun (that's the hard-drive failure mode), no spare pool draining toward a cliff, no corruption in flight. But "relaxed" is not "never," and the first step is still non-negotiable.
Danger
"Relaxed" assumes you've confirmed it's only wear. If Available Spare is at or below its threshold, or Media and Data Integrity Errors is non-zero, or any critical-warning bit other than 0x04 is set, this is no longer a worn drive: it's a failing one, and it can go read-only or start losing data at any moment. In that case stop, get your irreplaceable data off first with something gentle like rsync -a to another machine, and only then proceed. And whatever the verdict: the day a drive passes 100% endurance is the day to confirm your backup actually runs. No fix is more urgent than that.
Then, by situation:
- Worn, spare full, no errors: plan an unhurried swap. Order a replacement, fit it when convenient, retire the old one. If you want a deadline rather than a vague "someday," watch the spare line, not the percentage: the percentage will keep climbing harmlessly, but the day Available Spare starts ticking down from 100% is the day the drive begins spending its real safety margin. Bring the swap forward then.
- Worn drive in a RAID mirror or array: swap one at a time. This is the dream scenario. Fail and remove the worn member (mdadm --fail /dev/md0 /dev/nvme0n1, then --remove), fit a fresh drive, let the array rebuild onto it, and repeat for its equally-tired sibling. You have all the time in the world to do it cleanly; that's the entire reason the array exists.
- Worn drive on a rented or hosted box: usually on you, not the hoster. Wear is not a defect, so a worn drive generally won't meet a hoster's replacement bar. Paste your smartctl -a output into the ticket anyway: if the real defect lines (spare at/below threshold, integrity errors) are clean, expect them to (politely) decline, and plan to swap it as part of normal lifecycle. If those lines are dirty, it's a defect and they'll usually replace it. Knowing which conversation you're having is exactly the skill this page taught you.
- Don't try to "fix" the wear. There's no command that resets the endurance counter, no firmware trick that un-writes 738 terabytes. The counter belongs to the drive, and it's telling the truth. A secure erase or nvme format frees up flash and can briefly improve performance on a tired drive, but it does not roll back wear: the cells are as old as they were. The only real fix for endurance is new silicon.
Pro Tip
If a worn drive still has years of light duty left in it, retire it downward, not to the bin. A drive at 245% of its write endurance with a full spare pool and zero errors makes a perfectly good home for read-mostly data — an OS boot volume, a read replica, a scratch cache you can lose. Reads cost it almost nothing (the cells only wear on writes), so the very workload that wore it out is the only one to keep it away from. Move the heavy writes to a fresh drive; let the veteran serve reads until it genuinely won't.
How to Avoid It
You can't avoid wear any more than you can avoid mileage on a car — every write spends a little flash, forever. But you have far more control over the rate than most people realise, and the levers are worth knowing because the difference between them is years. In rough order of impact:
- Backup: first, always, and twice. A worn drive gives you the gentlest warning of any storage failure, but a warning is not a copy of your data. The one habit that turns any dead disk (worn, defective, dropped, stolen) into a chore instead of a catastrophe is a backup you've actually tested a restore from. Do it before you read rule 2. Yes, seriously.
- Buy for the write workload, not the price tag. SSD endurance is sold as TBW (total terabytes the drive is rated to write) or DWPD (whole-drive writes per day across the warranty). A write-heavy server (anything with a busy database) on a cheap, low-TBW consumer drive isn't buying storage, it's buying a countdown. A drive rated for ten times the writes lives, roughly, ten times as long. The app-01 Samsungs reached 245% precisely because their flash was durable enough to keep going long past where a budget drive would have hit the spare-exhaustion wall.
- Tame the writes you don't need. The biggest, most surprising write amplifier in most stacks is a database configured to be maximally safe, forcing every single transaction straight to flash the instant it happens, no batching. We once watched a drive's wear indicator climb from 0 to 80 in weeks from one over-cautious durability setting; relaxing it sanely sent the counter back to crawling. A database's fsync / O_DIRECT durability knob is wired directly to your SSD's lifespan. Excessive logging, chatty metrics, and atime updates (mount with noatime) are the same story in miniature.
- Keep it cool. Heat accelerates flash wear and makes cells leak the charge that holds your bits; the warning and critical temperature-time counters in the SMART log exist precisely because heat shortens endurance. app-01's 0/0 on those lines and a steady 43 °C are a big part of why those drives lasted. Airflow is cheap; flash is not.
- Leave some headroom: never fill an SSD to the brim. A flash drive needs free space to shuffle data around as it wear-levels and garbage-collects; starve it and write amplification spikes, meaning every byte you write costs the drive several bytes of internal rewriting, burning endurance faster. Keeping a drive comfortably under full (and using fstrim so it knows which blocks are actually free) is free endurance. This is the wear-side cousin of disk full: there it costs you space, here it costs you life.
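TBW and DWPD are two spellings of the same promise, and converting between them makes spec sheets comparable. A sketch of the usual conversion; the 600 TBW, 1 TB, 5-year drive in the example is a hypothetical spec sheet, not app-01's rating, while the daily write rate is derived from app-01's own counters.

```python
def dwpd_to_tbw(dwpd: float, capacity_tb: float, warranty_years: float) -> float:
    """Total terabytes written implied by a drive-writes-per-day rating."""
    return dwpd * capacity_tb * 365 * warranty_years


def tbw_to_dwpd(tbw: float, capacity_tb: float, warranty_years: float) -> float:
    """Drive writes per day implied by a TBW rating."""
    return tbw / (capacity_tb * 365 * warranty_years)


def years_until_rated(tbw: float, written_tb_per_day: float) -> float:
    """How long a steady workload takes to consume the rated endurance."""
    return tbw / written_tb_per_day / 365


# app-01's average write rate: 738.67 TB over 45,818 power-on hours.
app01_tb_per_day = 738.67 / (45_818 / 24)

# Hypothetical spec: 600 TBW on a 1 TB drive with a 5-year warranty.
print(f"{tbw_to_dwpd(600, 1.0, 5):.2f} DWPD")
print(f"{years_until_rated(600, app01_tb_per_day):.1f} years to rated endurance")
```

Run the last line against the rating of any drive you're considering: if the answer is shorter than the time you expect the server to live, the drive is undersized for the workload.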
Note
RAID is not a backup, and on a worn-out pair it can bite in a way unique to wear: identical drives, installed together, carrying identical writes, reach end-of-life together — so the rebuild after you replace the first one runs across a sibling that's just as tired, at the most strenuous moment of its life. That's why backup is rule 1, why you stagger replacements, and why "two SSDs from the same batch" is a phrase that should make you keep a spare on the shelf.
And the deepest version of all of this isn't a one-off command — it's watching the trend. A single smartctl run tells you today's percentage; it's the slope that tells you whether you have two years or two months, and whether the spare pool has just begun, quietly, to fall. You only see a slope if something reads the diary every day and remembers yesterday's number.
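Watching the slope needs nothing fancier than a least-squares line through dated readings. A minimal sketch, assuming you already store one reading per day somewhere; the sample readings below are invented for illustration, and a straight-line projection is only a rough guide because wear rarely progresses perfectly linearly.

```python
from datetime import date


def daily_slope(readings):
    """Least-squares slope, in units per day, of (date, value) readings."""
    days = [(d - readings[0][0]).days for d, _ in readings]
    values = [v for _, v in readings]
    n = len(readings)
    mean_x = sum(days) / n
    mean_y = sum(values) / n
    denom = sum((x - mean_x) ** 2 for x in days)
    if denom == 0:
        raise ValueError("need readings from at least two different days")
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(days, values)) / denom


def days_until(readings, limit):
    """Days from the last reading until the trend reaches `limit`, or None."""
    slope = daily_slope(readings)
    remaining = limit - readings[-1][1]
    if slope == 0 or remaining / slope < 0:
        return None  # flat, or moving away from the limit
    return remaining / slope


# Hypothetical Available Spare readings that have just begun to fall.
spare = [(date(2024, 1, 1), 100), (date(2024, 1, 11), 98), (date(2024, 1, 21), 96)]
print(days_until(spare, limit=10))
```

Feeding it Available Spare with the drive's threshold as the limit gives a replacement deadline; feeding it Percentage Used with a limit of 100 tells you when the warranty odometer rolls over.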
How Flash Actually Wears Out
Now the part you don't need in an emergency but that puts the whole picture together — and it's one of the better stories in computing. Once you can picture what's physically happening inside the chip, every number above stops being trivia to memorise and becomes something you can simply reason out.
A Bucket of Electrons Behind a Wall a Few Atoms Thick
An SSD or NVMe drive has no moving parts at all. It stores each bit as a tiny trapped electric charge in a microscopic cell — picture a bucket that holds a few electrons, with a switch that reads "on" if the bucket's full and "off" if it's empty. Billions of buckets; your data is just which ones are full.
Reading a bucket is gentle — you sense its charge without disturbing it, and you can do that essentially forever. This is why Data Units Read barely matters and why a worn drive makes a fine read-mostly volume: reads don't wear flash. Writing, though, is violent. To fill a bucket you must force electrons across a thin insulating wall (the tunnel oxide) and trap them on the far side; to erase it you drag them back out. Every single crossing batters that wall a little. It is not a metaphor for wear — it is literally wearing through an insulator a few atoms thick, one charge cycle at a time.
So a flash cell has a hard, countable lifespan: from a few hundred to a few thousand erase cycles for typical consumer flash, depending on how many bits each cell stores, after which the wall gets too leaky to reliably tell "full" from "empty." The cell doesn't explode or short out. It just gets unreliable (slower to read, more prone to error), and the drive, which has been expecting this since the day it left the fab, retires that cell and moves on. Wear-out is the sum of billions of these tiny, graceful retirements. That's the whole difference from a disk failing: a defect is damage the drive didn't see coming; wear is the drive spending exactly the budget it was sold with.
How the Drive Hides It — Until It Can't
Here's the first piece of real magic, the thing that makes SSDs work at all: individual cells fail constantly, from day one, and the entire drive is built around hiding that from you. Every SSD ships with more flash inside than it advertises — a private reserve of spare cells (over-provisioning) that is exactly the reserve your Available Spare line reports. It runs relentless internal bookkeeping that spreads your writes evenly so no cell wears out years before its neighbours (wear leveling — the SATA attribute literally named for it), wraps every block in error-correcting codes that rebuild data from a cell gone slightly wrong (ECC), and quietly relocates data off any block getting flaky. An SSD is, under the hood, a tiny civilization constantly moving residents out of crumbling buildings into fresh ones, demolishing the old blocks, and never once mentioning it to the city above. "No moving parts" is true — but it is furiously busy in there.
Now the two SMART lines you've been reading fall into place perfectly. Percentage Used is the drive's estimate of how much of its rated endurance budget the average cell has spent — a wear-leveling figure, which is precisely why it can sail past 100%: the rating is a conservative warranty promise, and good flash routinely outlives it (hello, 245%). Available Spare is how much of that private reserve of replacement cells remains. And here's the crucial relationship, the one that makes app-01 so reassuring: Percentage Used rising is normal aging; Available Spare falling is the drive actually spending its safety net. You can be at 245% used with 100% spare — old but with every reserve cell still in the bank — which is a drive that has aged without having failed. The day spare starts dropping is the day aging tips toward failing.
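Because wear leveling spreads erases evenly, a drive's endurance budget can be approximated from three numbers: capacity, erase cycles per cell, and write amplification. This is a common back-of-envelope rule of thumb, not how any particular vendor derives its rating, and every figure in the example is hypothetical.

```python
def estimated_endurance_tb(capacity_tb: float, pe_cycles: int,
                           write_amplification: float) -> float:
    """Host terabytes writable before the average cell hits its cycle limit."""
    return capacity_tb * pe_cycles / write_amplification


# Hypothetical 1 TB drive with cells good for 3000 erase cycles.
for waf in (1.0, 2.0, 4.0):
    tb = estimated_endurance_tb(1.0, 3000, waf)
    print(f"write amplification {waf}: about {tb:.0f} TB of host writes")
```

The write-amplification term is why the same drive can last several times longer under large sequential writes than under a flood of tiny random ones.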
Why a Percentage Goes to 245
It still feels wrong that a percentage can exceed 100, so let's make peace with it once and for all. Percentage Used isn't "percent of something full." It's an odometer expressed in percent: writes-so-far divided by rated-writes, times 100. A car rated for 150,000 miles doesn't stop at 100% of its mileage — it rolls on to 200%, 300%, as long as the engine turns. Flash is the same: the manufacturer picks a rated TBW conservative enough to honour under warranty, the drive divides your actual writes by it, and good cells keep working well past the promise. app-01 has written 738 TB against a rating it crossed long ago, and the gauge simply kept counting until — per the NVMe spec — it saturates at 255 and pins there (which is why the sibling reads exactly 255%, not 300%). The number isn't broken. It's just honest about having outlived its warranty, and a little proud of it.
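The odometer, including the point where the gauge pins, fits in one function. This is a sketch of the idealised model described above: a real drive's Percentage Used is the vendor's own estimate and need not equal host writes divided by rated TBW (app-01's sibling reads higher despite slightly fewer writes). The 300 TB rating below is a made-up round number, not the Samsung spec.

```python
SATURATION = 255  # the NVMe field cannot report anything higher


def percentage_used(written_tb: float, rated_tbw: float) -> int:
    """Idealised endurance odometer: writes over rating, capped at 255."""
    raw = int(written_tb * 100 / rated_tbw)
    return min(raw, SATURATION)


# Hypothetical drive rated for 300 TB of writes.
for written in (150, 300, 735, 900):
    print(f"{written} TB written -> {percentage_used(written, 300)}%")
```

Past the cap the number stops carrying information, which is why Data Units Written remains worth recording alongside it.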
The Density Tradeoff — Why Some Drives Wear Faster
This is also where price tags come from, and it directly predicts lifespan. The buckets can hold more than one bit if you measure the charge level precisely enough:
- SLC — one bit per cell. Bucket's full or empty: dead simple, fast, survives tens of thousands of writes. Expensive, so it lives mostly in enterprise gear and write caches.
- MLC / TLC / QLC — two, three, or four bits per cell. To fit four bits the drive must distinguish sixteen different charge levels in one tiny leaky bucket — far denser and cheaper, but slower and far less durable, because as the wall wears those sixteen levels blur into each other fast. Consumer drives are mostly TLC and QLC, rated for hundreds to a low thousand writes per cell.
So when you buy a cheap, high-capacity SSD, you're buying QLC — denser bits, lower endurance, a Percentage Used that climbs faster under the same workload. A pricey enterprise drive buys back that endurance with lower density, over-provisioning, and sometimes SLC cells. The line item is the lifespan. (There's a lovely sleight of hand that keeps cheap drives feeling fast: many QLC drives run a small slice of their flash in fast, durable SLC mode as a write cache, absorbing bursts at SLC speed and trickling them back to dense QLC later — which is why a budget SSD feels quick right up until you copy something huge and watch it fall off a cliff mid-transfer. You've overrun the cache and hit raw QLC underneath.)
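The density tradeoff is a power of two: each extra bit per cell doubles the number of charge levels the controller has to tell apart. A tiny illustration, with the cell-type table taken from the list above.

```python
BITS_PER_CELL = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4}


def charge_levels(bits_per_cell: int) -> int:
    """Distinct charge levels a cell must hold to store this many bits."""
    return 2 ** bits_per_cell


for name, bits in BITS_PER_CELL.items():
    print(f"{name}: {bits} bit(s) per cell, {charge_levels(bits)} charge levels")
```

Going from SLC to QLC quadruples capacity per cell but leaves the controller eight times as many levels to separate inside the same voltage range, which is where the endurance goes.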
Wear vs. Defect — The Whole Page in One Picture
Hold the machinery in your head and the SMART log reads itself. Wear is the average cell spending its rated charge cycles — measured by Percentage Used, harmless past 100%, covered by the over-provisioned spare pool the whole way. Defect is the spare pool running dry (Available Spare at threshold), or cells failing faster than ECC can patch them (Media and Data Integrity Errors), or the drive throwing in the towel and going read-only (critical-warning bit 3). One is the engine passing 150k miles; the other is a rod through the block. A flash drive almost always wears out gracefully and tells you exactly how much budget is left every step of the way — which is the genuinely cheerful thing about it, and the opposite of a hard drive, which fails like ice cracking: quiet, then all at once.
The drive keeps an honest diary through all of it. But a diary only ever buys you warning — and warning is worth nothing if no one reads it daily, and worth less than nothing without the backup you made before you needed it. Sysadmins have a blunter way of saying it, half joke and half creed, and on a drive past 100% endurance they're dead right: no backup, no pity.
See Also
- disk failing: the other side of the coin, a drive with a real, physical defect
- NVMe spare exhausted: when the spare pool actually drains, the emergency that wear is not
- disk full: the space emergency; on an SSD, also the thing that accelerates wear
- smartctl: the tool that reads the drive's diary
- smartmontools: the package, plus the smartd daemon that watches it for you
- SMART: what the self-monitoring data is and how to read it
- SSD: how flash actually stores a bit, and why writing wears it out
- NVMe: the protocol behind the modern health log on this page
- RAID: mirroring worn drives so a swap is a non-event, and why you stagger the replacements
- backup: the one habit that makes every dead disk a non-event
- rsync: getting your data off gently before you swap a drive