Disk Cable Errors (CRC): Symptoms, Diagnosis & Fixes

The disk is fine. The wire between it and the controller isn't — and the difference is a swapped cable, not a swapped drive.

What It Is

A CRC error is the rarest of gifts in storage troubleshooting: a scary-looking alarm that is almost always good news about your data. The bytes on the platters are intact. The flash cells are healthy. What's broken is the link — the thin SATA cable, or the backplane connector, that carries those bytes between the drive and the disk controller. Somewhere on that short journey a bit got flipped by electrical noise, a loose pin, or a kinked wire, and the drive's own error-checking caught it red-handed.

CRC stands for Cyclic Redundancy Check: a small checksum the sending side (drive or controller, depending on direction) computes over every block of data it ships across the cable. The receiving side recomputes the same checksum on arrival; if the two disagree, the block was corrupted in transit, so it is thrown away and sent again. The data is never silently accepted wrong: it's caught, retried, and delivered correctly. The drive keeps a tally of how many times this happened in a SMART counter called 199 UDMA_CRC_Error_Count, and a rising tally is the single most misread number in all of disk diagnostics.

So let's plant the flag that the whole rest of this page defends: a climbing UDMA_CRC_Error_Count means a bad or loose cable, not a dying disk. The platters are fine. The fix is to reseat or replace the cable — and the classic, expensive mistake is to RMA a perfectly healthy drive, reconnect the replacement through the same bad cable, and watch the errors march right back. By the end of this page you'll read the counter with confidence, confirm the diagnosis from the kernel log in two commands, fix the actual problem in five minutes with a screwdriver, and understand exactly why a corrupt wire looks so alarming in the logs while costing you precisely zero bytes. We'll do the help first — spot it, read it, fix it, prevent it — and save the "how does a checksum catch a single flipped bit" story for the end, because it's one of the cleanest ideas in computing.

How You Notice

Cable errors announce themselves in three places, and the trick is that all three describe a connection problem, never a media one. Here's each, with the command to see it on your own box right now:

The SMART CRC counter is non-zero — and, the part that matters, climbing. This is the headline symptom and the one CleverUptime watches for you. Ask the drive directly:
```
smartctl -a /dev/sda | grep -i crc
```
A line like 199 UDMA_CRC_Error_Count ... 701 is the whole story in one row. A single old non-zero value that never moves is a healed scar — a one-time blip from a cable seated years ago. A value that grows between two readings is a live, ongoing fault: the link is corrupting data right now and the controller is busy retrying. Trajectory is everything here, exactly as it is for disk failing — one reading is a snapshot, two readings are a diagnosis.
The kernel narrates ATA exceptions in dmesg. Every retry the controller performs leaves a paper trail. Look for it:
```
dmesg -T | grep -iE "ata[0-9]|SError|hard resetting|BadCRC|exception"
journalctl -k -p err | grep -i ata
```
Lines mentioning exception Emask, SError, BadCRC, ICRC, and hard resetting link are the link layer stumbling and recovering. We'll read a real block of these line by line in a moment — they look terrifying and mean "the cable hiccuped, I retried, all good."
Occasional stalls, but no read-only filesystem and no I/O errors. A retried command costs a few milliseconds, so a flaky cable can make the box feel intermittently sluggish — a brief blip in top's wa (I/O wait) column when the controller resets the link. But — and this is the tell that separates a cable from a failing disk — the filesystem does not flip read-only, and you do not see blk_update_request: I/O error against specific sectors. CRC errors are recovered; media errors are not. If the mount went read-only or dmesg names dead sectors, you're on the wrong page — go read disk failing. If it's all BadCRC and hard resetting link and the data keeps flowing, you're in the right place.

The unifying signature: lots of noise, zero data loss. That contradiction is the fingerprint of a cable problem, and once you've seen it once you'll never confuse it with a dying drive again.

How I Read It

Every modern drive keeps a running diary of its own condition called SMART — Self-Monitoring, Analysis and Reporting Technology — and the tool that reads it is smartctl, from the smartmontools package. The command I reach for asks the drive for everything it knows:

smartctl -a /dev/sda

(-a is "all".) Here's a real, healthy 10 TB drive from one of our own storage boxes — a Seagate ST10000NM0156, one of a stack of identical disks — trimmed to the rows that matter for this page:

=== START OF INFORMATION SECTION ===
Device Model:     ST10000NM0156-2AA111
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Rotation Rate:    7200 rpm

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  -           0
  9 Power_On_Hours          0x0032   075   075   000    Old_age   -           22192
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   -           0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   -           0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   -           0

The line we care about is the last one: 199 UDMA_CRC_Error_Count ... 0. Every drive in that box reads exactly the same — a calm, flat zero. This is what a healthy link looks like: not one corrupted block in 22192 hours of service (that's two and a half years of uptime). Hold that image of 0, because the contrast is the whole lesson.

Now the same attribute on a drive with a genuinely bad cable. This is a real readout from a disk whose link was corrupting data — note that everything else is pristine:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  -           0
  9 Power_On_Hours          0x0012   088   088   000    Old_age   -           14533
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   -           0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   -           0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   -           0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   -           701

Read it the way a storage engineer reads it, top to bottom, and the verdict assembles itself:

overall-health: PASSED — but we don't trust the headline, we read the table (the hard-won habit from disk failing).
5 Reallocated_Sector_Ct = 0 — no sectors have failed and been remapped. The surface is clean.
197 Current_Pending_Sector = 0 — no sectors are reading badly right now. Nothing is in the waiting room.
198 Offline_Uncorrectable = 0 — nothing the drive tried and failed to recover. No data is gone.
187 Reported_Uncorrect = 0 — the drive has never handed an unfixable error up to the operating system. This is the big one fleet operators lean on as a death predictor, and it's a flat zero.
199 UDMA_CRC_Error_Count = 701 — and there it is. Seven hundred and one blocks corrupted on the wire and re-sent.

Now line the two pictures up. Every attribute that signals a real, physical defect reads 0. The only number that's moved is 199. That's not a coincidence — it's the definition of a cable problem. A failing disk shows damage in 5, 197, 198, 187; a failing cable shows damage in 199 and 199 alone. When the defect attributes are clean and 199 is the lone outlier, you are not looking at a sick drive. You are looking at a sick wire.
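The comparison above is mechanical enough to script. Here is a minimal sketch in Python, assuming the attribute table is available as text (for example, captured from smartctl -A /dev/sda). The names parse_raw_values and single_snapshot_verdict are hypothetical, and the "defect" set is simply the four attribute IDs named on this page, not an official list:

```python
# Attribute IDs this page treats as signs of physical media damage.
DEFECT_IDS = (5, 187, 197, 198)
CRC_ID = 199

# The bad-cable readout from this page, trimmed to a few rows.
SAMPLE_BAD_CABLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  -           0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   -           0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   -           0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   -           0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   -           701
"""


def parse_raw_values(smart_text: str) -> dict:
    """Map attribute ID -> raw value from a smartctl attribute table."""
    values = {}
    for line in smart_text.splitlines():
        tokens = line.split()
        # Attribute rows start with a numeric ID and carry a 0x.... flag.
        if len(tokens) < 9 or not tokens[0].isdigit() or not tokens[2].startswith("0x"):
            continue
        rest = tokens[7:]  # everything after the TYPE column
        # Full smartctl output has an UPDATED column here; the trimmed
        # tables on this page omit it, so accept both layouts.
        if rest[0] in ("Always", "Offline"):
            rest = rest[1:]
        # rest[0] is WHEN_FAILED, rest[1] is the start of RAW_VALUE.
        if len(rest) >= 2 and rest[1].isdigit():
            values[int(tokens[0])] = int(rest[1])
    return values


def single_snapshot_verdict(values: dict) -> str:
    """Apply the 'defects clean, 199 alone' rule to one readout."""
    if any(values.get(attr_id, 0) > 0 for attr_id in DEFECT_IDS):
        return "media defects present: treat as a failing disk"
    if values.get(CRC_ID, 0) > 0:
        return "defect attributes clean, CRC count non-zero: suspect the cable"
    return "clean link, clean disk"


print(single_snapshot_verdict(parse_raw_values(SAMPLE_BAD_CABLE)))
```

A single snapshot cannot tell a healed scar from a live fault, so this only answers "cable or media?"; whether the count is still climbing needs a second reading.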

Why "Old_age" and Why the Counter Won't Reset

Two quirks of 199 trip people up, so let's clear them before the gallery.

First, the TYPE column says Old_age, not Pre-fail. That has nothing to do with the drive being old. SMART's two type labels describe what it means if the attribute's normalized value ever drops to its threshold: for a Pre-fail attribute, that the drive is predicted to fail soon; for an Old_age attribute, merely that the drive has reached the end of its designed wear life. CRC errors are filed under Old_age because the drive's makers don't consider a few corrupted-and-retried blocks to be a drive fault, which is exactly right, because it isn't one. Don't read Old_age as a verdict on the disk's age; read it as the drive politely declining to blame itself.

Second, you cannot zero this counter. The raw value lives in the drive's own non-volatile memory and the drive owns it, not you — there's no smartctl flag, no vendor tool, nothing short of trickery that resets it. So once you've fixed the cable, "fixed" does not mean "back to 0." It means "stopped climbing." Note today's number, fix the wire, and check again next week: if it's still 701, the cable was the problem and it's solved. If it's 730, the fix didn't take.

Note

The normalized VALUE/WORST/THRESH columns count down, not up: a healthy attribute starts at 100 (or 200) and falls toward THRESH as it degrades. So 199's VALUE of 200 looks reassuring even with a raw count of 701, because the normalized value barely moves as CRC errors accumulate, and with THRESH at 000 (as in both readouts above) the drive will effectively never "fail" attribute 199 on its own. That's why you must read the raw value and watch its trend: the headline normalized number will reassure you right up until the cable falls out.

How the Kernel Tells It

The SMART counter tells you how many link errors happened; the kernel log tells you as they happen, in real time, with enough detail to be certain. This is the confirmation step, and it's where a CRC problem stops being a suspicion and becomes a fact. Run:

dmesg -T | grep -iE "ata[0-9]|SError|hard resetting|EH complete"

Here is a real, verbatim sequence from a machine with a faulty SATA cable — one complete error-and-recovery cycle, exactly as the kernel logged it:

ata1: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
ata1: irq_stat 0x00000040, connection status changed
ata1: SError: { CommWake DevExch }
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete

Walk it line by line — this is the entire grammar of a link error, and once you can read it you can diagnose any of them:

ata1: exception Emask 0x10 ... frozen — the controller hit a problem on link ata1 and froze it to handle the error safely. Emask 0x10 is the "ATA bus error" mask — the controller's way of saying the trouble was on the wire, not in a command the drive refused. That 0x10 is your first strong hint: bus, not media.
SErr 0x4040000 / SError: { CommWake DevExch } — the SATA error register, decoded. CommWake and DevExch mean the physical link's communication faltered and the device had to be re-negotiated — the electrical equivalent of a plug wobbling in its socket. On a worse cable you'll see { UnrecovData 10B8B BadCRC } here instead: BadCRC is the checksum mismatch itself, and 10B8B is a low-level encoding error on the SATA wire — both are pure link corruption.
hard resetting link — the recovery move. The controller drops the link and re-establishes it from scratch, the way you'd unplug and replug a flaky connection. This single line, repeating in your log, is the loudest signal of a cable problem there is.
SATA link up 3.0 Gbps (SStatus 123 SControl 300) — it came back, this time negotiating 3.0 Gbps. Watch this number across resets: a healthy SATA III link runs at 6.0 Gbps, and if you see the kernel repeatedly step it down to 3.0 or even 1.5 Gbps, that's the controller's last-ditch survival tactic — a noisy cable can't carry the fast signal cleanly, so the link drops to a slower, more robust speed to keep working at all. A SATA III drive stuck at 1.5 Gbps is a cable shouting for help.
ata1.00: configured for UDMA/133 — the drive (.00 is the first device on this link) re-negotiated its transfer mode. All good.
EH complete — Error Handling complete. The kernel recovered, the data went through on the retry, and your application never knew. This is the line that proves no data was lost.
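The SErr decoding in the walkthrough can be reproduced by hand, which is handy when a log shows only the hex value. Below is a small Python sketch. The bit-to-name table is an assumption written from memory of how the Linux libata driver labels SError bits, so check it against your kernel source before relying on it. The function names decode_serror and serror_from_line are hypothetical:

```python
import re

# SError bit positions and the short names the kernel prints for them.
# ASSUMPTION: recalled from Linux libata, not copied from a specification.
SERROR_BITS = {
    0: "RecovData", 1: "RecovComm",
    8: "UnrecovData", 9: "Persist", 10: "Proto", 11: "HostInt",
    16: "PHYRdyChg", 17: "PHYInt", 18: "CommWake", 19: "10B8B",
    20: "Dispar", 21: "BadCRC", 22: "Handshk", 23: "LinkSeq",
    24: "TrStaTrns", 25: "UnrecFIS", 26: "DevExch",
}


def decode_serror(value: int) -> list:
    """Return the names of all bits set in an SErr register value."""
    return [name for bit, name in sorted(SERROR_BITS.items()) if value & (1 << bit)]


def serror_from_line(line: str) -> list:
    """Pull 'SErr 0x...' out of an ata exception line and decode it."""
    match = re.search(r"SErr (0x[0-9a-fA-F]+)", line)
    return decode_serror(int(match.group(1), 16)) if match else []


print(decode_serror(0x4040000))  # the value from the log above
print(decode_serror(0x280100))   # a value with the BadCRC bit set
```

Under that assumed table, 0x4040000 decodes to CommWake and DevExch, matching the kernel's own SError line in the example, and 0x280100 decodes to the UnrecovData, 10B8B, BadCRC trio mentioned for worse cables.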

That last point is the whole emotional arc of a CRC error: every one of these scary blocks ends in EH complete. The kernel caught the corruption, reset the link, retried, and delivered the bytes correctly. Compare that to a real failing disk, whose log ends not in EH complete but in blk_update_request: I/O error, dev sda, sector 1234567 — an error the kernel couldn't recover, handed up to your filesystem, often flipping the mount read-only. The presence of EH complete and the absence of I/O error is the difference between "tighten a cable" and "replace a drive."
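That "EH complete versus I/O error" test can be turned into a quick tally over a saved log. Here is a rough Python sketch, assuming the kernel messages look like the examples on this page (wording varies between kernel versions, so the patterns are assumptions, and summarize_ata_log and link_stepped_down are hypothetical names):

```python
import re
from collections import defaultdict

# Patterns modelled on the log lines quoted on this page.
RESET = re.compile(r"(ata\d+): hard resetting link")
LINK_UP = re.compile(r"(ata\d+): SATA link up ([\d.]+) Gbps")
EH_DONE = re.compile(r"(ata\d+): EH complete")
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")

SAMPLE_LOG = """\
ata1: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
ata1: irq_stat 0x00000040, connection status changed
ata1: SError: { CommWake DevExch }
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
"""


def summarize_ata_log(log_text: str):
    """Count resets and recoveries per port; collect unrecovered I/O errors."""
    ports = defaultdict(lambda: {"resets": 0, "recovered": 0, "speeds": []})
    io_errors = []
    for line in log_text.splitlines():
        if m := RESET.search(line):
            ports[m.group(1)]["resets"] += 1
        elif m := LINK_UP.search(line):
            ports[m.group(1)]["speeds"].append(float(m.group(2)))
        elif m := EH_DONE.search(line):
            ports[m.group(1)]["recovered"] += 1
        elif m := IO_ERROR.search(line):
            io_errors.append((m.group(1), int(m.group(2))))
    return dict(ports), io_errors


def link_stepped_down(speeds: list) -> bool:
    """True if any renegotiation came back slower than the one before it."""
    return any(later < earlier for earlier, later in zip(speeds, speeds[1:]))


ports, io_errors = summarize_ata_log(SAMPLE_LOG)
print(ports, io_errors)
```

Resets matched by recoveries with an empty io_errors list is the cable signature; any entry in io_errors means the kernel handed an unrecovered error upward and the disk itself deserves the attention.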

Pro Tip

The drive keeps a second, finer-grained hardware tally alongside attribute 199: the SATA Phy Event Counters. Run smartctl -l sataphy /dev/sda and look for R_ERR response for data FIS and CRC errors within host-to-device FIS. These count link-layer faults at the silicon level, by direction, which is more precise than anything dmesg shows. Unlike attribute 199, these counters are typically cleared when the drive is power-cycled, so read them before you shut down to work on the cable. The per-direction breakdown can hint at which end to reseat first, though any rising count implicates the same link.

Reading It by Example

Train the pattern-match. Readout on the left, the verdict I'd actually reach on the right:

199 UDMA_CRC_Error_Count: 0, all defect attrs 0 → Perfect link, healthy disk. The happy and overwhelmingly common case — every drive in a well-cabled box looks like this.
199: 12, flat across weeks, defects 0 → A healed scar. A handful of errors from one past event — a reboot, a chassis bump, a cable seated long ago — that stopped on its own. Note it and move on; this is not an emergency and not worth a maintenance window.
199: 701 and rising, every defect attr 0 → The textbook bad cable. The drive is fine; the wire is corrupting data and the controller is retrying. Reseat or replace the cable. Do not RMA the drive.
199 climbing and 5/197/198 also climbing → Two problems, or a backplane fault hurting both link and drive. Rare. Treat the media defects as primary (see disk failing), and fix the cable too — but don't let the CRC count distract you from the sectors.
dmesg full of hard resetting link and the link stepping 6.0 → 3.0 → 1.5 Gbps → A cable so noisy the controller is throttling to survive. Replace it; reseating a cable this far gone rarely holds.
199 non-zero on several drives sharing one backplane or one power rail → Suspect the shared part, not several coincidentally-bad cables. A flaky backplane, a marginal power supply sagging under load, or a SAS expander is corrupting the whole group. Fix the common cause.
199: 0 but dmesg shows I/O error and dead sectors → Not a cable problem at all. This is a failing disk; you're on the wrong page.
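The SMART-only rows of this gallery reduce to a small decision function over two readings. This is a sketch under stated assumptions: "defects" here means the summed raw values of attributes 5, 197 and 198, the wording of each verdict is paraphrased from the rows above, and the rows that depend on dmesg or on several drives at once are deliberately left out. The name gallery_verdict is hypothetical:

```python
def gallery_verdict(crc_then: int, crc_now: int,
                    defects_then: int, defects_now: int) -> str:
    """Classify a drive from two readings taken some days apart.

    crc_*     : raw value of attribute 199 at each reading
    defects_* : sum of raw values of attributes 5, 197 and 198 (assumption)
    """
    crc_rising = crc_now > crc_then
    defects_rising = defects_now > defects_then

    if crc_rising and defects_rising:
        return "media and link: treat media defects as primary, fix the cable too"
    if defects_now > 0:
        return "media defects: this is a failing disk, not a cable"
    if crc_rising:
        return "bad cable: reseat or replace it, do not RMA the drive"
    if crc_now > 0:
        return "healed scar: note it and move on"
    return "perfect link, healthy disk"


print(gallery_verdict(650, 701, 0, 0))
print(gallery_verdict(12, 12, 0, 0))
```

The order of the checks encodes the page's priority: media defects always outrank the CRC count, and a CRC value only matters once it moves.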

How to Fix It

The fix for a cable error is, gloriously, to fix the cable — one of the few storage problems with a screwdriver-and-five-minutes solution and no data at stake. But because reaching the cable means going inside the box, the order of operations matters.

Danger

Power the machine fully down before you touch any internal cable — shutdown -h now, then pull the plug. Reseating a live SATA or, especially, a power connector can short a drive, corrupt the block being written at that instant, or damage the controller. SATA data cables are nominally hot-pluggable; SATA power is not, and you don't always know which one is loose. The data on the platters is the one thing this whole problem hasn't put at risk — so don't be the one who risks it by working live. Shut down, then open the case.

Then, by situation:

Reseat both ends first — it's free and it's often enough. Power down, open the case, and firmly unplug and replug both the SATA data cable and the SATA power cable, at both ends (drive side and motherboard/controller side, and PSU side for power). Connectors work loose from vibration, thermal cycling, and the simple act of someone sliding a drive in its bay. Listen for the click. Boot, note that 199 has stopped climbing, and you may well be done — a huge fraction of CRC problems are nothing more than a connector that wasn't fully home.
Replace the data cable if reseating doesn't hold. SATA cables are cheap, and the cheap ones are exactly the problem — flimsy connectors, marginal shielding, latching clips that don't latch. Swap in a quality cable with metal retention clips, route it so it isn't sharply kinked or stretched taut, and keep it clear of fans and high-current power leads to cut electrical crosstalk. A good cable is a couple of euros; a wrongly-RMA'd drive plus the downtime is not.
Don't bundle cables tight. Crosstalk between tightly-zip-tied data cables is a real and underappreciated cause of CRC errors. Give signal cables a little room to breathe; this is the rare case where neater is worse.
Suspect the backplane or power when several drives share the symptom. If CRC counts rise on multiple disks that plug into the same backplane, the same SAS expander, or the same straining power rail, stop replacing individual cables and look at the shared component. A marginal power supply that sags under load is a classic culprit — the link errors appear only when the box is busy.
On a rented or hosted box, you don't hold the screwdriver — open a ticket. Paste the smartctl -a output (showing 199 climbing with all defect attributes at 0) and the dmesg hard resetting link lines, and ask the provider to check the cabling and reseat the drive. Be explicit that the SMART defect attributes are clean and you suspect a connection, not the disk — that framing gets a competent data-center hand to reseat or recable rather than reflexively swap a healthy drive and leave the bad cable in place.

Note

After any fix, the SMART counter will not drop back to zero — the drive owns that number and there's no resetting it. Success looks like a counter that has stopped moving. Record the value the day you fix it, and judge the repair by the next reading, not by hoping for a 0 that will never come.

How to Avoid Them

Unlike a failing disk — which physics guarantees and you cannot prevent — cable errors are genuinely avoidable, because they're a workmanship problem, not a wear problem. A few habits keep 199 flat for the life of the box:

Buy decent cables and seat them properly. The single biggest cause of CRC errors is a cheap cable or a connector that was never fully clicked in. Spend the extra euro on cables with metal locking clips, and push every connector home until it latches. This one habit prevents most of what's on this page.
Route for airflow and against strain, not for tidiness. Don't kink cables around tight corners, don't stretch them taut across the case, and don't cinch signal cables into dense bundles where they can crosstalk. A relaxed, well-routed cable outlives a beautifully-bundled one.
Reseat after any physical work. The most common time a good cable goes bad is right after a human was inside the case — adding RAM, swapping a drive, cleaning dust. Bumping a neighbouring connector loose is easy. After any hands-on maintenance, glance at dmesg for fresh ata exceptions before you close up.
Mind the power supply and the backplane. A PSU running near its limit, or an aging hot-swap backplane, corrupts links under load in ways no amount of cable-swapping fixes. In a dense disk shelf, vibration alone can loosen connectors over time — which is one more reason this is a fleet problem worth watching continuously rather than checking once.

And the deepest version of all four: the signal that matters isn't the counter's value, it's its trajectory. A 199 of 701 that's been frozen for a year is a non-event; a 199 that went from 3 to 40 since last Tuesday is a cable working loose right now. You only catch the difference if something reads the diary every day and compares, which is precisely the kind of trend that occasional manual smartctl runs, weeks apart, will catch late or miss entirely.

How a Checksum Catches a Flipped Bit

Now the part you don't need in an emergency, but that makes the whole page click. Why does a corrupt cable produce a clean error instead of silently-wrong data? Because of the CRC itself, and the math behind it is quietly brilliant once you see it.

The Problem: Bytes Cross a Hostile Gap

Picture the journey one block of your data makes. It leaves the drive's controller, races down a few centimetres of copper inside a SATA cable at 6 gigabits a second, and arrives at the host controller. That copper sits in a metal box full of spinning motors, switching power supplies, and fans: an electrically filthy environment. A bit is just a voltage; a stray pulse of interference, a marginal connector adding resistance, a cable acting as a tiny antenna for the noise around it, any of these can nudge a 0 into looking like a 1 for the fraction of a nanosecond it's on the wire (at 6 Gbps each bit lasts about 167 picoseconds). Over billions of bits a second, the wonder isn't that bits occasionally flip; it's that they almost never do.

So the link can't be trusted to deliver bytes perfectly. It has to be checked. And you can't check by sending the data twice and comparing — that halves your bandwidth, and both copies could be corrupted the same way. You need something cleverer: a small fingerprint, computed from the data, that changes if any bit of the data changes.

The Trick: Division With No Carrying

Here's the elegant bit. The drive treats your block of data — say 4096 bytes — not as a number to add up, but as one enormous binary number, and it divides it by a fixed, carefully-chosen constant called the generator polynomial. It doesn't keep the quotient; it keeps the remainder. That remainder, typically 32 bits, is the CRC — the checksum it appends to the block and sends across the wire. The receiver performs the same division on the data it received. If it computes the same remainder, the data is intact. If even one bit flipped in transit, the remainder comes out different — and the mismatch is the BadCRC you saw in the kernel log.
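To see the remainder idea at a size you can check by hand, here is a toy version in Python. It uses a deliberately tiny 4-bit generator (0b1011, giving a 3-bit CRC), which is an illustrative assumption and not the polynomial SATA uses. The function names are hypothetical:

```python
def mod2_remainder(value: int, generator: int) -> int:
    """Remainder of dividing value by generator, using XOR in place of subtraction."""
    gen_bits = generator.bit_length()
    while value.bit_length() >= gen_bits:
        # Line the generator up under the leading 1 bit and XOR it away.
        value ^= generator << (value.bit_length() - gen_bits)
    return value


def toy_crc(message: int, generator: int) -> int:
    """Append as many zero bits as the CRC is wide, divide, keep the remainder."""
    width = generator.bit_length() - 1
    return mod2_remainder(message << width, generator)


GENERATOR = 0b1011          # toy generator: x^3 + x + 1
MESSAGE = 0b11010011101100  # 14 bits of "data"

crc = toy_crc(MESSAGE, GENERATOR)
codeword = (MESSAGE << 3) | crc  # what actually goes on the wire

print(f"{crc:03b}")                                      # prints 100
print(mod2_remainder(codeword, GENERATOR))                # prints 0: arrived intact
print(mod2_remainder(codeword ^ (1 << 5), GENERATOR) != 0)  # prints True: one flipped bit is caught
```

The receiver's check is the second line: data plus CRC divides evenly when nothing changed in transit, and any single flipped bit leaves a non-zero remainder.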

But it's a strange kind of division. CRC arithmetic is done in a system where addition and subtraction are both just XOR — there is no carrying. Add 1 + 1 and you get 0, no carry, full stop. This sounds like a toy, but it's exactly what makes CRCs cheap enough to compute on a chip at 6 Gbps: no carries means no waiting for a carry to ripple up through 32 bits, so the whole thing reduces to a cascade of shifts and XORs that hardware does in a blink. The mathematics has a proper name — arithmetic over a finite field, the same deep structure that underpins error-correcting codes, secret-sharing, and a good chunk of modern cryptography — but the drive doesn't need to know the theory. It just shifts and XORs.
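The "just shifts and XORs" claim is easy to make concrete. The sketch below computes a CRC-32 one bit at a time in Python and compares it with the standard library's zlib.crc32. It uses the common Ethernet/zlib parameterisation (bit-reflected polynomial 0xEDB88320, all-ones start value, final inversion); treat the assumption that this mirrors the wire as loose, since a SATA link may use the same polynomial with different start value and bit-ordering conventions. The name crc32_bitwise is hypothetical:

```python
import zlib

POLY_REFLECTED = 0xEDB88320  # the CRC-32 polynomial, written least-significant-bit first


def crc32_bitwise(data: bytes) -> int:
    """CRC-32 computed with nothing but shifts and XORs, one bit per step."""
    crc = 0xFFFFFFFF                  # conventional all-ones starting value
    for byte in data:
        crc ^= byte                   # feed the next 8 bits into the register
        for _ in range(8):
            if crc & 1:               # a 1 falls off the end: "subtract" the polynomial
                crc = (crc >> 1) ^ POLY_REFLECTED
            else:                     # a 0 falls off the end: just shift
                crc >>= 1
    return crc ^ 0xFFFFFFFF           # conventional final inversion


block = b"123456789"
print(hex(crc32_bitwise(block)))               # prints 0xcbf43926
print(crc32_bitwise(block) == zlib.crc32(block))  # prints True
```

Hardware does the same thing with a 32-bit shift register and a handful of XOR gates, which is why the check costs essentially nothing at line rate.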

And here's the payoff that makes it work: because the generator polynomial is chosen with care, a CRC is mathematically guaranteed to catch certain whole classes of error. A good 32-bit CRC catches every single-bit error, every double-bit error within any realistically sized block, and (the one that matters most on a noisy cable) every burst of corrupted bits up to 32 long, the exact failure mode a flaky connector produces when it drops a cluster of bits at once. Not "probably catches." Guaranteed catches, by the structure of the math. Polynomials that include the factor x + 1 additionally catch every error that flips an odd number of bits; the Ethernet-style CRC-32 is not one of those, so it makes no such blanket promise. Corruption outside the guaranteed classes still slips past only about once in four billion random patterns. That's why a corrupt SATA link gives you an honest BadCRC and a retry, instead of quietly handing your filesystem a wrong byte. The checksum was engineered so that the failures a cable actually produces are precisely the failures it cannot miss.
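The burst guarantee can be poked at empirically. This Python sketch corrupts a block with random bursts of 1 to 32 consecutive bits and counts how many escape zlib.crc32, which uses the Ethernet-style CRC-32. It is a spot check under assumptions, not a proof: the block contents, trial count and helper names (flip_burst, undetected_bursts) are all made up for illustration:

```python
import random
import zlib


def flip_burst(data: bytes, start_bit: int, pattern: int) -> bytes:
    """XOR an error pattern into the block, starting at bit start_bit.

    The block is viewed as a little-endian integer so that consecutive
    integer bits are consecutive in the order zlib's reflected CRC
    consumes them (byte by byte, least significant bit first).
    """
    corrupted = int.from_bytes(data, "little") ^ (pattern << start_bit)
    return corrupted.to_bytes(len(data), "little")


def undetected_bursts(data: bytes, trials: int, seed: int = 0) -> int:
    """Count random bursts of up to 32 bits that leave the CRC unchanged."""
    rng = random.Random(seed)
    good_crc = zlib.crc32(data)
    total_bits = len(data) * 8
    missed = 0
    for _ in range(trials):
        length = rng.randint(1, 32)
        if length == 1:
            pattern = 1
        else:
            # First and last bit of the burst flipped, the middle random.
            pattern = (1 << (length - 1)) | rng.getrandbits(length - 1) | 1
        start = rng.randrange(0, total_bits - length + 1)
        if zlib.crc32(flip_burst(data, start, pattern)) == good_crc:
            missed += 1
    return missed


block = bytes(range(256)) * 16   # a 4096-byte stand-in for a data block
print(undetected_bursts(block, 2000))  # prints 0
```

Zero misses is what the math predicts for every burst of this size, not just the sampled ones; the sampling only makes the guarantee visible.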

And the choice of which polynomial is not arbitrary — it's a small engineering legend in itself. SATA uses the same 32-bit polynomial as Ethernet, the one usually called CRC-32, picked decades ago because of exactly which error patterns it provably catches. There's a whole sub-field of people who hunt for better polynomials — Philip Koopman at Carnegie Mellon famously catalogued thousands of them, ranking each by the longest message over which it still guarantees to catch every 2-bit, 3-bit, and 4-bit error, and showing that several long-deployed "standard" CRCs were quietly worse than alternatives nobody had standardised. The lesson that falls out of his work is humbling: a checksum is only as good as the constant baked into it, and choosing that constant well is a genuine piece of mathematics, not a coin flip. The drive on your desk is running the verdict of that math at 6 billion bits a second and never getting it wrong.

Why

The name "Cyclic Redundancy Check" is a tiny lesson in itself. Redundancy: the CRC adds extra bits that carry no new information — pure overhead — whose only job is to make corruption detectable. Cyclic: the underlying math has a rotational symmetry (shift the data and the checksum shifts predictably) that's exactly what lets it run as a simple feedback shift register in silicon. The whole field of data corruption detection — from this 32-bit SATA checksum up to the giant erasure codes that protect data across a RAID array — is the same idea scaled up: spend a few redundant bits so that the inevitable flipped bit announces itself instead of poisoning your data in silence.

Why It Looks So Scary and Costs So Little

Put it together and the emotional whiplash of a CRC error makes sense. The log looks like a five-alarm fire — exception, frozen, SError, hard resetting link, page after page of it — because the kernel narrates every single retry in full, and a loose cable can stumble dozens of times a minute. But each of those alarming blocks is the system working exactly as designed: corruption detected, link reset, data re-sent, EH complete. The noise is the sound of the safety net catching the fall, not the sound of the fall. The only real cost is performance — a retried command is a slow command, and a link throttled from 6.0 to 1.5 Gbps is a slow link — plus the small, real risk that a cable bad enough to corrupt data is also a cable about to disconnect entirely, which would take the drive offline and, if it's a lone disk, your data with it. That's the actual urgency: not that the bytes are wrong, but that the connection is unreliable and getting worse.

So the shape of the whole thing, held in one picture: your data is fine, the wire is not, the math caught it, the kernel recovered it, and the fix is a screwdriver. And once you see this one checksum, you start seeing it everywhere — because the same idea is layered up and down the entire stack. Every Ethernet frame that reaches your server carries a CRC-32 in its tail; the switch drops the frame and TCP re-sends it, the network's exact mirror of the SATA retry you just read. One layer up, ZFS distrusts the hardware checksum entirely and keeps its own checksum for every block, stored separately from the data, so that even corruption the drive's firmware blesses as fine gets caught on the next read — the famous "ZFS detected silent corruption" that an ext4 box would have served you wrong without a whisper. And scaled all the way out, the very same finite-field arithmetic that computes your disk's remainder is the ancestor of the Reed–Solomon codes that let a probe four billion miles past Pluto, whispering at the power of a fridge bulb, still get a clean picture back to Earth. One flipped bit on a SATA cable pulls a thread that runs from your desk to the edge of the solar system — but those are pages of their own. For now you have the whole shape of it: a flipped bit on a hostile wire, caught by a remainder, retried in a blink. Reseat the cable, watch the counter stop, and let the drive get back to the quiet, honest work of keeping your bytes safe — which, this whole time, it never actually stopped doing.