Memory Errors (ECC): Symptoms, Diagnosis & Fixes

A single cosmic ray flips one bit; good RAM catches it, tells you, and moves on. Listen when it does.

What It Is

A memory error is the moment a bit in your RAM is read back wrong: a 1 where a 0 was written, or the other way around. It sounds like science fiction, and the cause sometimes literally is: a stray cosmic ray, a trace of radioactive decay in the chip packaging, a cell that's leaking charge a little faster than it should because it's getting old. Most of the time you'd never know. The bit flips, your program reads garbage, and maybe nothing important was sitting there. Or maybe it was a pointer, and now a process segfaults. Or it was a filesystem metadata block, and now your data is quietly corrupt. On a machine with ordinary RAM, that flip happens silently: no log, no warning, no fingerprint. It just happens, and you spend a weekend chasing a "random" crash that was never random at all.

Servers don't accept that. They run ECC — Error-Correcting Code — memory, and ECC changes the whole story. ECC RAM carries extra bits alongside every word so the memory controller can check its own work on every single read. A single flipped bit, it catches and silently corrects — that's a correctable error, a CE. A double flip it can't fix, but it can still detect, and rather than hand your CPU known-bad data it raises a machine-check exception — an uncorrectable error, a UE — and the kernel decides whether to kill the affected process or halt the box. Either way the lie never reaches your software undetected. That's the entire reason ECC exists, and the entire reason it costs more.

The two outcomes the controller reports are worth pinning down before anything else, because every line below turns on which one you're reading:

Term	Meaning	Severity
CE (Correctable Error)	A single-bit flip ECC caught and fixed transparently; the read returned correct data	Warning — track the per-DIMM rate, not the lone event
UE (Uncorrectable Error)	A multi-bit flip ECC detected but could not fix; the bad word never reached your software as a lie	Error — data corruption, a killed process, or a panicked box

So here's the single most useful thing on this page, the frame everything else hangs on: a correctable ECC error is not a crisis — it's your RAM telling you, in advance, which stick is starting to fail. ECC's superpower isn't really the correcting; it's the reporting. Every CE the kernel logs is a free early-warning ticket, named down to the physical slot, weeks before that DIMM gets bad enough to throw an uncorrectable one and take a process — or the whole machine — down with it. By the end of this page you'll read those tickets, tell a one-off cosmic-ray fluke from a stick that's genuinely dying, know exactly which DIMM to pull, and understand the rather beautiful math that lets a strip of silicon notice its own mistakes. We'll start where it hurts — spot it, read it, fix it — and save the lovely how does RAM even catch its own errors story for the end, because once you're not panicking it's one of the prettier ideas in computing.

How You Notice

A memory error rarely knocks on the front door. ECC's whole job is to keep the server running through the error, which means by default the only place it leaves a fingerprint is the kernel log and the kernel's error-accounting subsystem, EDAC (Error Detection And Correction). Here's every place it surfaces, with the command to check your own box right now:

EDAC error lines in the kernel log. The kernel narrates every corrected and uncorrected error in plain text the instant the controller reports it. This is the rawest, most honest symptom there is:
```
dmesg -T | grep -iE "EDAC|mce|Hardware Error"
journalctl -k -p warning | grep -iE "EDAC|memory error"   # CE lines are logged at warning level; -p err would hide them
```
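A line like EDAC MC0: 1 CE memory read error on ... (csrow:0 channel:0 ...) names the memory controller, chip-select row, and channel that fumbled a read. An empty result here is genuinely good news. (dmesg reads the in-memory kernel ring buffer, which wraps and is lost at reboot; journalctl -k reads the journal's copy of the same messages, which survives a reboot if your distro keeps a persistent journal. Use whichever reaches further back.)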
Rising counts in the EDAC sysfs tree. The kernel keeps a running tally per controller and per DIMM under /sys/devices/system/edac. No tool needed — the counts are just files you can cat:
```
cat /sys/devices/system/edac/mc/mc0/ce_count
cat /sys/devices/system/edac/mc/mc0/ue_count
grep -r . /sys/devices/system/edac/mc/mc0/dimm*/dimm_ce_count
```
These are cumulative since boot. A single CE that never moves again is noise. A count that climbs between two readings is a stick in active decline — and that trend, not the snapshot, is the signal that matters.
edac-util summarises it for you. The edac-util helper (from the edac-utils package) walks that whole sysfs tree and prints a tidy total instead of you cat-ing twenty files:
```
edac-util -v
edac-util --report=ce      # just the corrected-error total
```
rasdaemon keeps the history after a reboot. EDAC's sysfs counts reset to zero every boot, which is exactly when you most want them. The modern fix is rasdaemon, a daemon that logs every error to a little SQLite database so the trend survives. Query it with ras-mc-ctl:
```
ras-mc-ctl --error-count
ras-mc-ctl --summary
```
An uncorrectable error you can't miss. When a UE hits, the gentle reporting is over. A process is killed (on Linux, typically with SIGBUS), or, if the bad word was kernel memory, the box panics outright and reboots. If you came to this page after an unexplained crash, the first thing to check is whether the kernel log from the previous boot (journalctl -k -b -1, which needs a persistent journal) ends on an mce or Hardware Error line. That's not a software bug. That's hardware, and most often a DIMM.

Any one of these means: stop guessing, and go read the error log the hardware has been keeping for you. There's exactly one place it lives, and it's about to read like an open book.
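
The log checks above can also be scripted. A minimal Python sketch, assuming the kernel log has already been captured as text (for example from dmesg -T or journalctl -k -b -1). The function names are hypothetical, and the excerpt is an illustrative stand-in rather than output from a real machine:

```python
import re

# Same alternation the grep commands above use, case-insensitive.
MEMORY_EVENT = re.compile(r"EDAC|mce|Hardware Error", re.IGNORECASE)
MACHINE_CHECK = re.compile(r"mce|Hardware Error", re.IGNORECASE)


def memory_event_lines(log_text):
    """Return every log line that mentions EDAC, mce, or Hardware Error."""
    return [line for line in log_text.splitlines() if MEMORY_EVENT.search(line)]


def ends_on_machine_check(log_text, tail=20):
    """True if any of the last `tail` lines looks like a machine-check report."""
    last_lines = log_text.splitlines()[-tail:]
    return any(MACHINE_CHECK.search(line) for line in last_lines)


# Illustrative excerpt (assumption: shaped like the lines quoted on this page).
sample_log = """\
[   12.001] eth0: link up
[   98.342] EDAC MC0: 1 CE memory read error on ... (csrow:0 channel:0 ...)
[  101.500] systemd[1]: Started Daily Cleanup.
[  140.020] mce: [Hardware Error]: Machine check events logged
"""

hits = memory_event_lines(sample_log)
print(len(hits), "memory-related lines")
print("log ends on a machine check:", ends_on_machine_check(sample_log))
```

The pattern is deliberately as loose as the grep it mirrors, so expect the occasional unrelated line that happens to contain "mce".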

How I Read It

The kernel's EDAC driver translates the controller's raw machine-check into a human-readable line and drops it into the log. This is the artifact I reach for first — one real, canonical correctable-error line from an Intel server, with the hostname anonymised:

[Tue Jun  2 03:14:09 2026] web-01 kernel: EDAC MC1: 1 CE memory read error on \
CPU_SrcID#0_MC#1_Chan#0_DIMM#0 (channel:0 slot:0 page:0x5c068 offset:0x900 grain:32 \
syndrome:0x0 - err_code:0101:0090 socket:0 imc:1 rank:0 bg:2 ba:0 row:0x1e05 col:0x40)

It looks like a barcode, but every field is telling you something, and only three of them matter at 3 a.m. Let's take the whole line apart left to right:

EDAC MC1 — the memory controller that reported it. A modern CPU has more than one integrated memory controller (here, MC1); each owns a set of DIMM slots. This is the same mc1 you'll find under /sys/devices/system/edac/mc/mc1.
1 CE — one Correctable Error. CE is the word that lets you exhale: the bit was caught and fixed, the read returned correct data, nothing was lost. If this said UE instead — Uncorrectable Error — you'd be reading about data that was already corrupt. The number is how many errors this single event represents.
memory read error — the operation that tripped it (a read; you'll also see scrubbing error when the controller's background patrol finds one on its own — more on that lovely feature later).
CPU_SrcID#0_MC#1_Chan#0_DIMM#0 — the human-friendly location string, and the one you actually act on: socket 0, memory controller 1, channel 0, DIMM 0. This is the physical stick. Hold onto it.
channel:0 slot:0 — the same location in the kernel's internal coordinates. A channel is one independent command-and-data bus out of the controller, each driving a group of DIMMs; slot is the position on that channel.
rank:0 — which rank of chips on the DIMM. A single DIMM is often two ranks (front and back banks of chips that the controller selects between); rank narrows the fault to half the stick.
page:0x5c068 offset:0x900 row:0x1e05 col:0x40 bg:2 ba:0 — the exact physical address, right down to the DRAM bank-group, bank, row and column on the chip. Forensic-grade detail. You almost never need it — but the fact that the kernel can hand you the literal row and column of the failing capacitor should tell you how seriously this subsystem takes its job.
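syndrome:0x0 is the ECC syndrome field, the math fingerprint that says which bit was wrong (we'll see at the end how this number is computed; it's the cleverest part). Several EDAC drivers do not populate this field and always print 0x0, which is the likely reason it reads zero here on a line that is plainly reporting an error. A zero in the log does not mean "no error."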

Strip away the forensics and the line says one plain sentence: controller 1, channel 0, DIMM 0 just had a single bit flip on a read, and I fixed it. That's it. The whole art of reading EDAC is pulling the location and the CE/UE out of the noise — and then asking the only question that matters: is the count for that one stick growing?
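
Pulling the location and the CE/UE out of the noise is easy to automate. A minimal Python sketch, assuming the line has the shape of the example above; the regular expression and the function name are illustrative, other EDAC drivers format the parenthesised part differently, and labels containing spaces would need a looser pattern:

```python
import re

EDAC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) (?P<operation>.+?) on "
    r"(?P<label>\S+) \((?P<details>.*)\)"
)


def parse_edac_line(line):
    """Split one EDAC log line into the fields discussed above, or return None."""
    match = EDAC_LINE.search(line)
    if match is None:
        return None
    # key:value pairs inside the parentheses; a value may itself contain ':'.
    fields = dict(re.findall(r"(\w+):(\S+)", match.group("details")))
    return {
        "mc": int(match.group("mc")),
        "count": int(match.group("count")),
        "kind": match.group("kind"),
        "operation": match.group("operation"),
        "label": match.group("label"),
        "fields": fields,
    }


# The example line from this page, joined onto a single line.
line = (
    "EDAC MC1: 1 CE memory read error on CPU_SrcID#0_MC#1_Chan#0_DIMM#0 "
    "(channel:0 slot:0 page:0x5c068 offset:0x900 grain:32 syndrome:0x0 - "
    "err_code:0101:0090 socket:0 imc:1 rank:0 bg:2 ba:0 row:0x1e05 col:0x40)"
)
event = parse_edac_line(line)
print(event["kind"], event["label"], "rank", event["fields"]["rank"])
```

Everything you act on ends up in three keys: kind, count, and label. The rest stays in fields for the rare day you want the row and column.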

Note

EDAC reports the logical channel and DIMM number the controller sees — which is not always the silkscreen label on the motherboard. Boards are free to wire it however they like, and some do so perversely: on one popular consumer board, the firmware maps the controller's Channel 0 to the physical slot labelled B, and Channel 1 to slot A — exactly backwards from what you'd guess. Before you pull a stick, confirm the mapping for your board with ras-mc-ctl --print-labels (below) or dmidecode -t memory, rather than trusting that Chan#0 means "the first slot." Pulling the wrong DIMM and watching the errors continue is the classic, infuriating misread here.

Counting It Up Without dmesg Spelunking

One line is a clue; the count is the diagnosis. The kernel keeps the tally for you, and you don't even need a tool — EDAC exposes it as plain files in /sys:

$ cat /sys/devices/system/edac/mc/mc1/ce_count
4216
$ cat /sys/devices/system/edac/mc/mc1/ue_count
0
$ cat /sys/devices/system/edac/mc/mc1/dimm0/dimm_ce_count
4216
$ cat /sys/devices/system/edac/mc/mc1/dimm0/dimm_label
CPU_SrcID#0_MC#1_Chan#0_DIMM#0

That's the whole story in four cats: controller mc1 has logged 4216 corrected errors and zero uncorrectable ones, and every one of those corrected errors belongs to a single DIMM, dimm0. One stick, thousands of catches, nothing yet uncorrected. Thousands of CEs on one stick since boot is far past cosmic-ray background, and if the number is higher the next time you look, this is a DIMM in active decline that ECC is still papering over. Replace it before the day it can't.
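
The same four reads can be done for every DIMM at once. A rough Python sketch, assuming the per-DIMM sysfs layout shown above (mcN/dimmM/dimm_ce_count and dimm_label); the function name is hypothetical, and the demo runs against a throwaway directory that mimics that layout so it works on machines without EDAC:

```python
import tempfile
from pathlib import Path


def read_dimm_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Map 'mcN/dimmM' to (label, corrected-error count) for every DIMM found."""
    counts = {}
    for ce_file in sorted(Path(edac_root).glob("mc*/dimm*/dimm_ce_count")):
        dimm_dir = ce_file.parent
        label_file = dimm_dir / "dimm_label"
        label = label_file.read_text().strip() if label_file.exists() else ""
        location = f"{dimm_dir.parent.name}/{dimm_dir.name}"
        counts[location] = (label, int(ce_file.read_text().strip()))
    return counts


# Demo tree built from the values in the transcript above.
with tempfile.TemporaryDirectory() as fake_root:
    dimm = Path(fake_root) / "mc1" / "dimm0"
    dimm.mkdir(parents=True)
    (dimm / "dimm_ce_count").write_text("4216\n")
    (dimm / "dimm_label").write_text("CPU_SrcID#0_MC#1_Chan#0_DIMM#0\n")
    demo = read_dimm_ce_counts(fake_root)

print(demo)
```

On a real ECC box, calling read_dimm_ce_counts() with no argument reads the live tree. Older kernels that only expose csrow directories would need a different glob.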

The tidy way to see the same thing is edac-util, which walks that tree and totals it:

$ edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: ch0: 0 Corrected Errors
mc0: csrow0: ch1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: ch0: 4216 Corrected Errors
mc1: csrow0: ch1: 0 Corrected Errors

Read it like a heatmap: everything is 0 except mc1: csrow0: ch0, which has 4216 corrected errors. The fault is localised to one channel of one controller, a single physical stick, and that pinpoint is the entire value of ECC. (The csrow/ch pair is EDAC's older chip-select-row naming for the same location the dmesg line gave as channel and slot.)
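
If you would rather have a script find the one non-zero row, here is a small Python sketch. It assumes the report looks exactly like the sample above; the regular expression is fitted to that sample, not to any documented edac-util output format, and the function name is made up for this example:

```python
import re

CE_ROW = re.compile(r"^(mc\d+): (csrow\d+): (ch\d+): (\d+) Corrected Errors$")


def corrected_hot_spots(report):
    """Return {(mc, csrow, channel): count} for every non-zero CE row."""
    hot = {}
    for line in report.splitlines():
        match = CE_ROW.match(line.strip())
        if match and int(match.group(4)) > 0:
            hot[match.groups()[:3]] = int(match.group(4))
    return hot


# The edac-util -v sample from this page.
report = """\
mc0: 0 Uncorrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: ch0: 0 Corrected Errors
mc0: csrow0: ch1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: ch0: 4216 Corrected Errors
mc1: csrow0: ch1: 0 Corrected Errors
"""
hot = corrected_hot_spots(report)
print(hot)
```

An empty dictionary is the healthy result; a single key is a single suspect.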

When the Counts Survive a Reboot: rasdaemon

There's one cruel gap in everything above: the EDAC counters reset to zero on every boot. A flaky DIMM that throws a UE crashes the box, the box reboots, and your evidence vanishes — the counters come up at 0 and you're left staring at a "spontaneous" reboot with nothing to blame. The fix the big operators all run is rasdaemon: a daemon that listens for every EDAC and machine-check event and writes it to a persistent SQLite database, so the history outlives the reboot. You query it with ras-mc-ctl:

$ ras-mc-ctl --error-count
Label                 CE      UE
mc#0csrow#2channel#0  0       0
mc#0csrow#2channel#1  0       0
mc#1csrow#0channel#0  4216    0
mc#1csrow#0channel#1  0       0

$ ras-mc-ctl --summary
Memory controller events summary:
  Corrected on DIMM Label(s): 'DIMM_B1' location: 1:0:0:-1 errors: 4216

No PCIe AER errors.
No Extlog errors.
No MCE errors.

Same 4216, same single location, but now it's labelled DIMM_B1, the actual silkscreen name, because the board's slot labels have been registered with rasdaemon (it identifies the motherboard from the DMI table and looks the labels up in its label database). That's the line you take to the data center (or paste into a ticket with your hoster): "DIMM_B1 on mc1, 4216 corrected errors, please replace." No ambiguity, no pulling the wrong stick.

Pro Tip

If your board's labels show up as unknown or missing, teach rasdaemon the mapping once with ras-mc-ctl --register-labels (it reads a small labels.db you populate from dmidecode -t memory, matching each Locator: like DIMM_B1 to its mc/csrow/channel). Do it on a healthy box, before you need it — pinpointing the failing stick the moment the first CE lands is worth ten minutes now, and it's the difference between a two-minute swap and an afternoon of pulling sticks one at a time.

Reading It by Example

Train the pattern-match. The readout on the left, what I'd actually conclude on the right:

ce_count: 1, never moves again across weeks → A cosmic-ray fluke. The universe flipped one bit; ECC caught it. Nothing to do. Genuinely. A handful of lifetime CEs on a busy server with a lot of RAM is expected background radiation, not a fault.
ce_count climbing steadily on one DIMM, ue_count: 0 → That stick is degrading. ECC is still saving you on every read, but the trend only goes one way. Plan a calm replacement of that specific DIMM — this is the textbook Memory:CorrectableECC warning, and the whole point of having ECC is that you get to act now, unhurried.
CEs spread roughly evenly across all DIMMs and channels → Suspect the common element, not the sticks: the memory controller, the CPU's memory interface, an unstable overclock/XMP profile, or heat. One bad stick lights up one location; a bad controller or bad voltage lights up everything.
CE count exploding into the thousands per hour → A stick on its last legs. It's a matter of time before one of those flips is a double flip ECC can't fix. Replace it now, not this weekend — the next error may be a UE.
A single UE line, then the process died (or the box rebooted) → The serious one. ECC detected corruption it could not fix; whatever read that word got an error instead of a lie. This is Memory:UncorrectableECC. Get the workload off this machine and replace the DIMM before you trust it with data again.
HardwareCorrupted: 8 kB in /proc/meminfo, but EDAC counts are zero → A different beast entirely. The kernel took a page offline after a machine-check, but with no ECC attribution. That finding belongs to the bad RAM article, not this one: same villain, different fingerprint. (See "Not the Same as Bad RAM" below.)
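
The gallery above boils down to a handful of rules, which can be sketched as a function. This is a rough Python illustration only: the thresholds (1000 CEs per hour for "exploding", a 3x ratio for "roughly even") and the verdict names are assumptions picked for the example, not vendor guidance or anything the kernel defines:

```python
def classify(ce_growth, hours, ue_count=0, burst_per_hour=1000, even_ratio=3):
    """Return (verdict, dimm) for CE growth per DIMM over `hours` hours.

    ce_growth maps a DIMM name to the number of new corrected errors seen
    in the interval. All thresholds are illustrative defaults.
    """
    if ue_count > 0:
        return ("incident", None)          # any UE outranks everything else
    active = {dimm: n for dimm, n in ce_growth.items() if n > 0}
    if not active:
        return ("quiet", None)             # nothing new since the last reading
    worst = max(active, key=active.get)
    if active[worst] / hours >= burst_per_hour:
        return ("urgent", worst)           # thousands per hour: replace now
    all_affected = len(ce_growth) > 1 and len(active) == len(ce_growth)
    if all_affected and active[worst] <= even_ratio * min(active.values()):
        return ("shared-cause", None)      # even spread: controller, heat, XMP
    return ("degrading", worst)            # one stick climbing: plan a swap


print(classify({"DIMM_A1": 0, "DIMM_B1": 120}, hours=24))
print(classify({"DIMM_A1": 40, "DIMM_B1": 35, "DIMM_C1": 50}, hours=24))
```

The order of the checks matters more than the numbers: a UE always wins, a burst beats a slow climb, and an even spread is only considered once no single stick stands out.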

How to Fix It

The right move depends on which row of that gallery you're in, but for anything past a one-off CE, the destination is the same: a healthy DIMM goes into that slot. RAM, unlike a failing disk, has no spare pool to remap onto (the kernel can retire a bad page, but that shrinks your memory rather than repairing it); a cell that's gone weak doesn't heal, and there's nothing to "scrub clean." You replace the stick.

Danger

An uncorrectable error (Memory:UncorrectableECC, a UE line, a machine-check panic) means data in RAM was lost, and a stick failing badly enough to throw multi-bit errors may also have produced flips ECC could not detect at all, so corruption may have been written somewhere before the machine noticed. Do not treat a UE-crashed box as trustworthy: take the workload off it, and before you put it back into service, treat any data it wrote around the crash as suspect. Restore from backup if the corrupt word could have reached the database or filesystem. A correctable error is a warning you get to act on calmly; an uncorrectable one is an incident. Don't confuse the two.

Then, by cause:

One DIMM with a growing CE count: replace that DIMM. This is the happy path of having ECC: it told you exactly which stick, early. Order a matching module (same speed, ranks, and ECC type), power down, and swap only the named stick. The reboot resets the EDAC counters to zero, and rasdaemon's history, which does not reset, lets you confirm that no new errors are logged against that slot after the swap date. On a rented or hosted box, you don't swap anything yourself: open a ticket and paste the ras-mc-ctl --summary line (DIMM_B1, mc1, 4216 corrected errors). A defective ECC DIMM is an unambiguous, warranty-grade fault; a decent hoster swaps it fast, because they can't argue with the controller's own log.
An uncorrectable error: replace the DIMM, then verify the rest. Pull the offending stick, and before you trust the box again run a full pass of memtest86+ on the remaining memory — a UE sometimes travels with neighbours. ECC tells you a stick is bad; memtest86+ tells you the others are good.
Errors smeared across every stick: don't replace RAM yet. Even-spread errors point at the shared parts. Reset any memory overclock / XMP / EXPO profile to JEDEC stock speeds (aggressive timings are a top cause of "bad RAM" that isn't), check the temperature — hot DRAM leaks charge faster and errors climb with the thermometer — reseat the sticks to rule out a dirty contact, and only then suspect the controller or CPU. Swapping good RAM because the board is unstable is wasted money and a wasted afternoon.
A genuine one-off CE: do nothing. Log it, move on, sleep fine. A lone corrected error that never recurs is the system working exactly as designed — it caught a fluke and told you. Replacing a DIMM over a single lifetime CE is the equivalent of binning a smoke alarm because it once chirped at burnt toast.

How to Avoid Them

You can't stop cosmic rays, and you can't make DRAM cells immortal — so the honest goal isn't zero errors, it's never being surprised by one. A short list, in order of leverage:

Use ECC RAM in the first place. This is the whole ballgame, and it's the one decision that's hard to undo later. Non-ECC memory has the exact same bit flips — it just suffers them in silence, corrupting your data with no log, no warning, and no way to tell a memory fault from a software bug. If a server holds anything you'd hate to lose, ECC isn't a luxury; it's the floor. (Most server and workstation platforms support it; many consumer ones quietly don't, which is its own bad RAM trap.)
Run rasdaemon. Without it, the counters die at every reboot and a flaky stick can crash you repeatedly while covering its own tracks. With it, you have a persistent, labelled history the moment you need it. It's a few hundred kilobytes and one command to switch on (systemctl enable --now rasdaemon on systemd distros). Turn it on now, on every ECC box, before the first error, not after.
Keep memory cool and at stock speed. Heat and over-aggressive timings are the two stressors you actually control. DRAM retention falls as temperature rises, so good airflow directly buys fewer errors; and a memory overclock that "passes" a quick boot can throw correctable errors for months under real load. Stock JEDEC speeds plus decent cooling is the boring, correct default.
Let the patrol scrubber run. Most server platforms can sweep all of RAM in the background — patrol scrubbing — reading every word periodically so ECC catches and corrects a latent single-bit error before it has a chance to collect a second flip and become uncorrectable. It's usually a BIOS toggle, on by default on real server boards, and it's the memory equivalent of a RAID scrub. Leave it on.

One thing to keep straight while you're at it: ECC is not a backup, and it is not RAID for memory. It corrects a single flipped bit and merely detects a double — it does nothing for a triple flip, a dead controller, or a stick that fails outright. It buys you honesty and early warning, which is enormous, but the data-safety net is still the same one it always was: a tested backup. ECC tells you the disaster is coming; the backup is what makes it a non-event.

And the deepest version of every rule above is the one a human can't do by hand: watch the trend, per DIMM, over time. Occasional snapshots taken weeks apart miss exactly the signal that matters: one stick's CE count quietly accelerating. The counters reset on reboot, scroll out of dmesg, and look identical at any single glance. The error that takes you down was often visible for weeks; it just needed something reading the diary every day and comparing.
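
"Reading the diary every day and comparing" is a small amount of code. A minimal Python sketch, assuming you store one reading of the cumulative per-DIMM CE counters per day; the function names and the daily numbers are hypothetical, invented to show the shape of the comparison:

```python
def growth_between(previous, current):
    """Per-DIMM increase between two readings of cumulative CE counters.

    A counter lower than last time means the box rebooted and the count
    restarted from zero, so the whole current value is treated as new.
    """
    growth = {}
    for dimm, now in current.items():
        before = previous.get(dimm, 0)
        growth[dimm] = now if now < before else now - before
    return growth


def is_accelerating(readings):
    """True if each interval added more errors than the one before it.

    Expects readings with counter resets already removed.
    """
    deltas = [later - earlier for earlier, later in zip(readings, readings[1:])]
    return len(deltas) >= 2 and all(b > a for a, b in zip(deltas, deltas[1:]))


yesterday = {"DIMM_A1": 0, "DIMM_B1": 100}
today = {"DIMM_A1": 0, "DIMM_B1": 160}
print(growth_between(yesterday, today))

daily_counts = [10, 14, 25, 61, 190]   # made-up daily readings for one DIMM
print(is_accelerating(daily_counts))
```

The reset rule in growth_between is the piece that makes raw sysfs counters usable across reboots; with rasdaemon's persistent history you would not need it.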

Not the Same as Bad RAM

It's worth drawing one sharp line, because two findings here look like twins and aren't. ECC errors (this page, Memory:CorrectableECC / Memory:UncorrectableECC) come from the EDAC subsystem and require ECC hardware — the controller checked its parity math and reported what it found, attributed to a specific DIMM. That attribution is the whole gift.

Bad RAM is the more primitive cousin: the kernel exposes a counter called HardwareCorrupted in /proc/meminfo, the number of bytes it has taken offline after a memory machine-check. That can fire on a machine with no ECC at all (triggered by a different mechanism), and it comes with no channel, no DIMM, no syndrome — just "this many bytes went bad, and I've stopped using them." Both mean a stick is failing and both end in a replacement, but they're separate findings with separate fingerprints: ECC tells you which DIMM and how fast; HardwareCorrupted only tells you that it happened. If you've got ECC, you'll usually see the EDAC errors first and act on them long before a page ever has to be retired. For the HardwareCorrupted story, see bad RAM; for the wider picture of components giving out, hardware failure.

How RAM Catches Its Own Mistakes

Now the part you don't need in an emergency, but that turns the fix into something you genuinely understand, and happens to be one of the prettier ideas in computing. How can a strip of silicon possibly notice that one of its own bits is wrong? It has nothing to compare against; the wrong value is the value it stored. The answer is a piece of math from 1950 so elegant that once you see it, the syndrome field stops being noise and becomes the most interesting number on the line.

The Trick: Spend a Few Bits to Watch the Others

Start with the cheap version everyone meets first: parity. Take eight data bits, count how many are 1, and add a ninth bit that makes the total even. Now if any one bit flips — data or the parity bit itself — the count goes odd, and you know something broke. Beautifully simple. But it has two fatal limits: it can only detect, never fix (it knows something flipped, not which something), and a double flip cancels out and sails through undetected. Parity is a smoke detector with no address and a blind spot. Good enough for a 1980s desktop; not good enough for a server.
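
Both the trick and its blind spot fit in a few lines. A minimal Python sketch of even parity over eight data bits; the helper names are made up for the example:

```python
def parity_bit(data_bits):
    """Ninth bit that makes the number of 1s in the whole word even."""
    return sum(data_bits) % 2


def parity_ok(word):
    """True if the word (data plus parity bit) still has an even number of 1s."""
    return sum(word) % 2 == 0


def flip(word, *positions):
    """Return a copy of the word with the bits at the given indexes inverted."""
    damaged = list(word)
    for index in positions:
        damaged[index] ^= 1
    return damaged


data = [1, 0, 1, 1, 0, 0, 1, 0]
word = data + [parity_bit(data)]

print(parity_ok(word))              # intact word
print(parity_ok(flip(word, 2)))     # one flip: the check fails
print(parity_ok(flip(word, 2, 5)))  # two flips: the check passes again
```

Note what parity_ok cannot return: a position. It answers yes or no, which is why parity can raise an alarm but never repair anything.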

ECC is parity grown up. The idea, from Richard Hamming at Bell Labs in 1950 (born, the story goes, of his pure fury at a weekend batch job that kept dying on a single card-reader error with no way to recover), is breathtaking: don't use one parity bit over everything; use several, each watching a cleverly overlapping subset of the data bits. Arrange the overlaps just so, and when a bit flips, the particular combination of parity checks that come up wrong spells out, in binary, the exact address of the broken bit. That combination is the syndrome, the quantity the syndrome field in the dmesg line is meant to carry. In the math, a syndrome of zero means every check passed: no error. (In the log, the field can still read 0x0 on a real CE, simply because some EDAC drivers do not fill it in.) Any other value is a number that literally points at the guilty bit, and once you know which bit is wrong, fixing it is trivial: you just flip it back. That's the leap from parity to ECC: parity says "something's wrong"; Hamming's code says "bit number 19 is wrong," and correcting it is a single operation.
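
The smallest version of Hamming's scheme protects four data bits with three parity bits, and it is short enough to run by hand. A Python sketch of that textbook Hamming(7,4) code follows; it is a toy for seeing the syndrome work, not the wider code a real memory controller uses, and the function names are illustrative:

```python
PARITY_POSITIONS = (1, 2, 4)    # positions are numbered 1..7
DATA_POSITIONS = (3, 5, 6, 7)


def encode(data_bits):
    """Place 4 data bits and compute the 3 parity bits of a Hamming(7,4) word."""
    code = [0] * 8                      # index 0 is unused padding
    for position, bit in zip(DATA_POSITIONS, data_bits):
        code[position] = bit
    for p in PARITY_POSITIONS:
        # Parity bit p watches every position whose number has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    return code[1:]


def syndrome(word):
    """Add up the failing checks; the total is the position of the flipped bit."""
    result = 0
    for p in PARITY_POSITIONS:
        if sum(word[i - 1] for i in range(1, 8) if i & p) % 2:
            result |= p
    return result


def correct(word):
    """Return (repaired word, syndrome). A zero syndrome leaves the word alone."""
    s = syndrome(word)
    repaired = list(word)
    if s:
        repaired[s - 1] ^= 1
    return repaired, s


codeword = encode([1, 0, 1, 1])
damaged = list(codeword)
damaged[4] ^= 1                         # flip position 5 (list index 4)
print(syndrome(codeword), syndrome(damaged), correct(damaged)[0] == codeword)
```

Flipping position 5 makes checks 1 and 4 fail, and 1 + 4 is 5: the failing checks spell out the address of the broken bit, exactly as described above.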

Real server memory classically uses a beefed-up descendant called SECDED: Single Error Correct, Double Error Detect. The math is tuned so the syndrome can pinpoint and correct any single-bit flip, and still reliably detect (though not fix) any double flip, which is exactly the CE-versus-UE distinction this whole page rests on. A correctable error is a syndrome that named one bit; an uncorrectable error is a syndrome that says "two bits are wrong and I can't tell you how to undo that." The cost of all this is extra memory chips on each module, classically 8 check bits for every 64 data bits (a ninth chip for every eight), which is precisely why an ECC DIMM has a 72-bit data path instead of 64, and is a touch pricier than the consumer kind. You're not buying more memory; you're buying the watchful margin that lets the controller check its own arithmetic on every read.
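
The CE-versus-UE split can be reproduced on a scaled-down word. The Python sketch below extends the textbook Hamming(7,4) code with one overall parity bit, giving an 8-bit SECDED toy; real modules apply the same idea to 64 data bits, and the names and the "ok"/"CE"/"UE" return values are choices made for this example:

```python
from itertools import combinations


def encode(data_bits):
    """Hamming(7,4) plus an eighth bit holding even parity over the other seven."""
    code = [0] * 8                                  # index 0 unused
    code[3], code[5], code[6], code[7] = data_bits
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    word = code[1:]
    return word + [sum(word) % 2]


def check(word):
    """Classify a received 8-bit word as 'ok', 'CE' (repaired), or 'UE'."""
    repaired = list(word)
    syndrome = 0
    for p in (1, 2, 4):
        if sum(word[i - 1] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    overall_odd = sum(word) % 2 == 1
    if syndrome == 0 and not overall_odd:
        return "ok", repaired
    if overall_odd:
        # Odd overall parity: assume exactly one flip. The syndrome names it,
        # or, if the syndrome is zero, the flip hit the overall parity bit.
        repaired[syndrome - 1 if syndrome else 7] ^= 1
        return "CE", repaired
    # Non-zero syndrome with even overall parity: two flips, position unknown.
    return "UE", repaired


word = encode([1, 0, 1, 1])
singles = [check([b ^ (i == k) for i, b in enumerate(word)]) for k in range(8)]
doubles = [
    check([b ^ (i in pair) for i, b in enumerate(word)])[0]
    for pair in combinations(range(8), 2)
]
print(check(word)[0], {kind for kind, _ in singles}, set(doubles))
```

Every single flip comes back as a repaired CE and every pair of flips as a UE. Three flips also leave the overall parity odd, so this scheme would "correct" the wrong bit and report a CE, which is the limit of what SECDED promises.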

Why

The same Hamming idea radiates everywhere once you spot it. The blocky black-and-white QR code on a package survives being scuffed or torn because it's wrapped in error-correcting codes — scan a QR sticker with a chunk ripped out and it still resolves, because the redundancy reconstructs the missing squares. Deep-space probes use ferocious ECC to claw a clean signal out of noise from billions of kilometres away; CDs shrug off scratches the same way; RAID parity rebuilds a whole dead disk from the survivors using a cousin of the same trick. The flavour differs, but the bargain is identical and ancient: spend a little redundancy up front, and you get to survive damage instead of merely noticing it. Your server's RAM is running a piece of that same deep idea, billions of times a second, just to make sure the byte you wrote is the byte you read.

Why the Bits Flip at All

So what actually knocks a bit loose? A DRAM cell stores each bit as a tiny charge in a microscopic capacitor (full is a 1, empty a 0), and that charge is forever leaking, which is why every cell has to be refreshed many times a second (typically every 64 milliseconds or so) just to remember itself. (That's the "dynamic" in Dynamic RAM, and it's why RAM forgets everything the instant the power dies.) Anything that nudges enough charge across the line between "full" and "empty" flips the bit. The classic culprits:

Cosmic rays and stray particles. High-energy particles from space, along with alpha particles from traces of natural radioactive decay in the chip's own packaging materials, slam into a cell and dump just enough charge to tip it. These are soft errors: the silicon isn't damaged, the next write to that cell is perfectly fine, the bit was simply knocked over once. This is the source of the lone, never-repeating CE, and it's genuinely, measurably more common at altitude, which is a real reason aerospace and high-altitude installations obsess over ECC.
A weakening cell. As a chip ages or runs hot, a particular capacitor can start leaking faster than its neighbours, drooping below the threshold before the next refresh catches it. These are hard errors — the same cell, the same bit, again and again — and they're the source of the climbing CE count on one DIMM. A soft error is a coin that came up tails once; a hard error is a coin that's started landing tails most of the time. ECC reports both identically as a CE; only the trend tells them apart, which is why the trend is everything.
Heat. Charge leaks faster the hotter the silicon, so a baking DIMM is a DIMM throwing more errors — the thread that ties this page back to temperature and airflow.
Disturbance from the neighbours. And here's the one that turned a quiet reliability footnote into a security story. Modern DRAM packs cells so densely that hammering one row of memory — reading or writing it over and over, millions of times a second — can leak just enough charge to flip a bit in the adjacent row you never touched. It's called Rowhammer, and it isn't a fault; it's physics weaponised. Researchers showed that by flipping bits in carefully chosen neighbouring rows, an unprivileged program could escalate to root — corrupting a page table from the outside without ever being allowed to write to it. The defenders' answers (target-row refresh, more frequent refresh, and yes, ECC raising the bar) and the attackers' next moves have been trading blows ever since. The takeaway for this page: bit flips aren't always the universe being random. Sometimes they're someone pushing. (The full story is a proper rabbit hole, and worth an evening — start from the original "Flipping Bits in Memory Without Accessing Them" paper and follow the thread.)

So: a server's memory is a vast field of leaking buckets, refreshed many times a second, peppered by particles from space and slowly aged by heat, with the occasional bit shoved over by a hostile neighbour. Over the top of all that chaos runs a 1950 piece of mathematics that catches each single mistake, names the exact bit, and quietly puts it right before your software ever sees it. The error log is just that math keeping you honest. Read it, and you're reading the one component in the whole machine that tells you the truth about itself, in advance, by name. All you have to do is listen.