Bad RAM: Symptoms, Diagnosis & Fixes

The kernel found a bad spot in your memory, walled it off, and kept running. The next bad spot might not be so polite.

What It Is

Bad RAM is exactly what it sounds like — a stick of physical memory that has started returning the wrong bits. A memory cell that held a 1 last second reads back a 0; a region that stored your process's stack comes back scrambled. The hardware is failing, slowly or suddenly, and the data passing through it can no longer be trusted. And here's the part that makes bad RAM uniquely nasty among server problems: memory is where everything lives. A failing disk corrupts files; a failing DIMM can corrupt anything — a running program's variables, a filesystem buffer about to be written to disk, the kernel's own bookkeeping. The symptom isn't "the memory broke." It's "three unrelated programs crashed, a database wrote garbage, and the box rebooted itself," and nobody thinks RAM until much later.

This page is specifically about one finding: the moment the kernel has caught a memory failure red-handed, taken the bad physical pages out of service, and started counting them in /proc/meminfo under HardwareCorrupted. That number is the kernel saying, in writing, "I found memory I cannot trust, and I have stopped using it." It is not a guess, not a threshold, not a prediction — it is a count of pages already poisoned and retired because something went wrong in the silicon. When it's above zero, we know: a DIMM in this machine is failing.

Two things to get straight up front, because they confuse almost everyone. First, this is distinct from ECC error counts. ECC is your RAM's self-correction — error-correcting memory that catches single-bit flips and fixes them on the fly, logging each one without losing data. Those corrected counts are an early warning; HardwareCorrupted is the alarm that already went off. Second — and this surprises people — this works even on RAM that isn't ECC at all. The kernel's hwpoison machinery rides on the CPU's Machine Check Architecture (MCA): when the processor hits memory so broken it returns an uncorrectable error, it raises a Machine Check Exception, and the kernel responds by poisoning the page. ECC tells you about errors it fixed; hwpoison tells you about errors nobody could fix. By the end of this page you'll read that count, confirm the bad stick with the one tool that can actually be trusted, and know precisely which DIMM to pull — and at the end, the remarkable bit: why a cosmic ray can flip a bit in your RAM, and what the engineers did about it.

How You Notice

Bad RAM is a master of disguise — it almost never announces itself as "memory." It announces itself as chaos, and the trick is recognising the pattern. Here's each signal, with the command to see it on your own box right now:

HardwareCorrupted is above zero in /proc/meminfo. This is the rawest, most honest symptom there is — the kernel publishing its own count of poisoned memory as a plain line in a plain file. Read it:
```
grep -i HardwareCorrupted /proc/meminfo
```
A healthy box reads HardwareCorrupted: 0 kB, always, forever. Any non-zero value means the kernel has found and retired bad physical memory — there is no benign reading above zero. This is the exact line CleverUptime watches, and an empty (zero) result here is genuinely good news.
Random segfaults in programs that have no business crashing. The signature of bad RAM is unrelated programs dying. A long-stable daemon segfaults; a gcc build fails with an "internal compiler error" one run and succeeds the next; apt reports a corrupted package that downloads fine on retry. When crashes scatter across software that shares nothing but the same physical machine, suspect the one thing they do share — the RAM underneath them. Watch the kernel log for the crashes:
```
dmesg -T | grep -iE "segfault|general protection|trap"
journalctl -k -p err
```
The kernel narrates a memory failure out loud. When the CPU catches an uncorrectable error, the kernel logs it in plain text — the most direct symptom short of the meminfo count itself. Look:
```
dmesg -T | grep -iE "memory failure|hardware memory|mce|hardware error|poison"
```
Lines naming a Memory failure at a physical address, or MCE / Hardware error, are the kernel telling you — through dmesg — the exact page that went bad and what it did about it. We'll read these line by line below.
Mysterious data corruption — files, databases, archives. A tar or gzip that fails its own integrity check, a database that flags a corrupt page, a file whose checksum changed without anyone touching it. Bad RAM corrupts data in flight: a byte goes into memory clean and comes out wrong on its way to disk, and now the corruption is permanent, baked into your storage by a fault that was never on the disk at all. (This is the cruel overlap with data corruption and filesystem corruption — same wreckage, a cause that hides one layer down.)
Spontaneous reboots, freezes, or kernel panics. When the poisoned page belongs to the kernel itself rather than a user process, there's no process to kill — the machine panics or locks up and reboots. A server that reboots itself at irregular intervals with nothing in the application logs is a classic bad-RAM tell.

Any one of these means: stop debugging the software and go test the hardware. These crashes are not bugs in your programs — they're your programs faithfully executing instructions that arrived corrupted. The fix is never in the code.

How I Read It

The diagnosis has two halves, and you must do both: read what the kernel already knows, then prove it with a test that doesn't trust the very hardware under suspicion. Start with what's free and instant — the /proc/meminfo count and the kernel log in dmesg.

The single line that anchors this whole page:

grep -i HardwareCorrupted /proc/meminfo

HardwareCorrupted:    2052 kB

Read it dead simply: this is the number of kibibytes of physical memory the kernel has caught failing and taken out of service. Here, 2052 kB works out to 513 pages of 4 KB each (2052 ÷ 4) that the kernel no longer trusts and will never hand to a process again. The value is cumulative since boot and only ever climbs; it resets to 0 on reboot. Poisoning lives in RAM, so a power cycle forgets it, which is exactly why the count is not the whole story, and why a reboot can make the symptom "vanish" while the bad stick sits there waiting to do it all again. Under the hood that number is a running total the kernel keeps as it handles each failure (pages it ignored, failed to isolate, delayed, and successfully recovered), all summed and converted to kibibytes. You don't need the breakdown; you need the headline: above zero means a DIMM is failing, full stop. There is no threshold to cross and no "acceptable" non-zero value. (Contrast that with memory usage, which is all about thresholds and percentages. This one is binary.)
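If you'd rather have a script do the reading, here is a minimal sketch of the same check. It assumes 4 KB base pages (typical on x86-64, but an assumption all the same), and the function names are made up for illustration.

```python
import re

PAGE_KB = 4  # assumption: 4 KB base pages, as on typical x86-64 systems

def hardware_corrupted_kb(meminfo_text):
    """Return the HardwareCorrupted value in kB, or None if the line is absent."""
    match = re.search(r"^HardwareCorrupted:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(match.group(1)) if match else None

def poisoned_pages(kb, page_kb=PAGE_KB):
    """Convert the kB figure into a count of retired pages."""
    return kb // page_kb

# Sample text in the shape of /proc/meminfo; the MemTotal value is invented.
sample = "MemTotal:       65808264 kB\nHardwareCorrupted:  2052 kB\n"
kb = hardware_corrupted_kb(sample)
print(f"{kb} kB poisoned = {poisoned_pages(kb)} pages")
```

On a live box you would pass in `open("/proc/meminfo").read()` instead of the sample string. A result of None means the line is missing entirely, which is a different situation from a reassuring 0.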

Now the kernel log, where the failure was narrated as it happened. This is real hwpoison output — the exact shape the kernel prints when the CPU reports an uncorrectable error in a page a user process was touching:

[34775.674296] mce: Uncorrected hardware memory error in user-access at 3710b3400
[34775.675413] Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered
[34775.690310] Memory failure: 0x3710b3: already hardware poisoned
[34775.696247] Memory failure: 0x3710b3: Sending SIGBUS to app-01:3614 due to hardware memory corruption

Walk it line by line, because every line is a step in a small, careful machine:

mce: Uncorrected hardware memory error in user-access at 3710b3400 — the Machine Check Exception. The CPU itself, not the kernel, detected memory it could not read correctly, at physical address 0x3710b3400. "Uncorrected" is the weight-bearing word: this wasn't a single-bit flip ECC quietly fixed — it was bad enough that nothing could repair it. "user-access" means a user-space program touched it (as opposed to the kernel — which would be far worse).
Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered — the kernel's hwpoison handler taking over. It identified the failing page (0x3710b3, the page-frame number — the address with its low 12 bits dropped), worked out what that page was being used for (a "dirty LRU page" — application data not yet written back), and reports the outcome: Recovered. "Recovered" here means the kernel contained the damage cleanly — it isolated the page — not that your data survived. The data in that page is gone.
Memory failure: 0x3710b3: already hardware poisoned — a follow-up touch of the same page finding it's already been marked. Once poisoned, a page is permanently off-limits; the kernel will never allocate it again until reboot. This is the out-of-service state, the thing HardwareCorrupted is counting.
Memory failure: 0x3710b3: Sending SIGBUS to app-01:3614 due to hardware memory corruption — the consequence for the process. The program (here app-01, PID 3614) that owned that poisoned page gets a SIGBUS signal. Most programs don't handle SIGBUS and simply die — which is exactly the "random segfault-like crash in an innocent program" you saw in the symptoms. The crash wasn't the program's fault. The kernel killed it to stop it from reading corrupted memory. (You may also see the blunter form, MCE: Killing app-01:3614 due to hardware memory corruption fault at ... — same event, the kernel taking the process down.)

That sequence is the whole story of one bad page: CPU catches it, kernel poisons it, the owning process is sacrificed to protect your data. The meminfo count ticks up by one page. And it will happen again, because the DIMM is failing — not just that one page.
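The walk-through above can be sketched as a small parser. The regular expressions are modelled only on the four sample lines shown here, so treat them as assumptions: kernel message wording varies between versions, and the 4 KB page size (PAGE_SHIFT of 12) is assumed too.

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KB pages, so PFN = physical address >> 12

MCE_RE = re.compile(r"mce: Uncorrected hardware memory error in (\S+) at ([0-9a-f]+)")
FAILURE_RE = re.compile(r"Memory failure: (0x[0-9a-f]+): (.+)")
SIGBUS_RE = re.compile(r"Sending SIGBUS to (.+):(\d+) due to")

def summarise(log_text):
    """Collect failing addresses, poisoned page frames, and signalled processes."""
    summary = {"addresses": [], "pfns": set(), "killed": []}
    for line in log_text.splitlines():
        mce = MCE_RE.search(line)
        if mce:
            summary["addresses"].append(int(mce.group(2), 16))
        failure = FAILURE_RE.search(line)
        if failure:
            summary["pfns"].add(int(failure.group(1), 16))
            sigbus = SIGBUS_RE.search(failure.group(2))
            if sigbus:
                summary["killed"].append((sigbus.group(1), int(sigbus.group(2))))
    return summary

sample_log = """\
[34775.674296] mce: Uncorrected hardware memory error in user-access at 3710b3400
[34775.675413] Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered
[34775.690310] Memory failure: 0x3710b3: already hardware poisoned
[34775.696247] Memory failure: 0x3710b3: Sending SIGBUS to app-01:3614 due to hardware memory corruption
"""
result = summarise(sample_log)
for addr in result["addresses"]:
    print(f"address {addr:#x} is in page frame {addr >> PAGE_SHIFT:#x}")
print("poisoned page frames:", sorted(hex(p) for p in result["pfns"]))
print("processes signalled:", result["killed"])
```

The useful cross-check is the shift: the address from the mce line, with its low 12 bits dropped, should land on a page frame that the Memory failure lines name.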

Proving It: The One Tool You Can Trust

Here's the deep problem with diagnosing memory, and it's almost philosophical: you cannot reliably test RAM using a program that is itself running in RAM. A memory tester loaded into the very memory it's checking can't test the pages holding its own code, can't move the operating system out of the way, and is sharing the bus with everything else the running system is doing. The kernel's HardwareCorrupted count and the log lines above are evidence — strong evidence — but they only catch the errors that happened to surface during normal work. To find bad RAM definitively, and to map it to a specific stick, you need to test memory while almost nothing else is using it.

That tool is memtest86+ — a tiny, self-contained operating system that boots instead of Linux, takes over the whole machine, and does nothing but hammer every byte of RAM with patterns designed to provoke failures. Because it owns the bare metal, it can test nearly all of physical memory, walk it with adversarial bit-patterns (all-ones, all-zeros, walking bits, address-in-its-own-cell), and do it for hours. It's the gold standard for one reason: it's the only test that isn't trusting the thing it's testing.

You boot it (from a USB stick, or your hoster's rescue/IPMI console), and let it run — ideally several full passes, overnight, because memory errors are often intermittent and a single pass can miss them. A clean stick produces a calm screen with a rising pass count and a big fat 0 in the error column. A bad stick produces something like this:

Tst  Pass   Failing Address          Good       Bad        Err-Bits  Count  CPU
---  ----   -----------------------  ---------  ---------   --------  -----  ---
  5   0     0000868d5880f4 - 33.6GB  00000000   00000002   00000002      1   0
  5   0     0000868d5880f4 - 33.6GB  00000000   00000004   00000004      3   0
  7   0     00009a3c1180a0 - 39.4GB  ffffffff   ffffffbf   00000040      1   2

Read it the way you'd read evidence at a crime scene:

Failing Address — the exact physical address that misbehaved, and its offset in your total RAM (33.6GB). A cluster of failures around the same address range is a single bad region; failures scattered everywhere can mean a worse fault (or a bad controller). Either way, the address is the thread you pull to find the stick.
Good vs Bad — what memtest wrote versus what it read back. Good 00000000, Bad 00000002 means it wrote all zeros and got a 2 back — a bit that refused to stay off. The mirror case, Good ffffffff, Bad ffffffbf, is a bit that refused to turn on.
Err-Bits: the XOR of Good and Bad, showing precisely which bits flipped. 00000002 means bit 1; 00000040 means bit 6 (counting from bit 0). A single bit flipping repeatedly is the most common and most diagnosable failure: one stuck cell. Many bits flipping at once points to address-line faults or a dying controller.
Count — how many times that address has failed this run. Climbing counts on a stable address is a hard, repeatable fault — the clearest possible verdict.
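To check an Err-Bits value by hand, XOR the Good and Bad columns and list the set bits. A small sketch, with bit 0 as the least significant bit; the helper names are illustrative:

```python
def flipped_bits(good, bad):
    """Return the positions (bit 0 = least significant) where good and bad differ."""
    err = good ^ bad
    return [bit for bit in range(err.bit_length()) if (err >> bit) & 1]

def describe(good, bad):
    """Say, for each flipped bit, what was written and what came back."""
    notes = []
    for bit in flipped_bits(good, bad):
        wrote = (good >> bit) & 1
        notes.append(f"bit {bit}: wrote {wrote}, read {1 - wrote}")
    return notes

# The Good/Bad pairs from the first and last rows of the sample output above.
for good, bad in [(0x00000000, 0x00000002), (0xFFFFFFFF, 0xFFFFFFBF)]:
    print(f"{good:08x} vs {bad:08x}: {describe(good, bad)}")
```

The first pair reports bit 1 reading back as 1 after a 0 was written; the second reports bit 6 reading back as 0 after a 1 was written.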

Put together, the shape of the failures tells you what kind of fault you're looking at:

Memtest86+ pattern	What it points to
Single bit flipping repeatedly at one address, reproducible across passes	One stuck cell — the classic, definitive bad stick
Failures clustered around the same address range	A single bad region on one DIMM at that location
Many bits flipping at once, errors scattered across the whole range	Address-line fault, a failing memory controller, or an unstable overclock
Errors only under XMP/EXPO, gone at JEDEC stock speed	Not the RAM — the profile speed is past spec; back off and retest

One failing address with one bit, reproducible across passes, is enough. Memtest86+ does not produce false positives the way a flaky cable does — if it reports an error, the memory is bad. (The famous exception isn't a false positive at all: overclocked RAM via XMP/EXPO profiles can "fail" memtest because it's being pushed past spec. Drop to stock JEDEC speeds and retest; if the errors vanish, the stick is fine and the settings were the problem — the one case where memtest's verdict has an asterisk.)

Reading It by Example

Train the pattern-match. The readout on the left, what I'd actually conclude on the right:

HardwareCorrupted: 0 kB, no MCE lines, no scattered crashes → Healthy. The happy, and by far most common, case — this is what every well-behaved server reads, every time.
HardwareCorrupted: 2052 kB and Memory failure lines in dmesg → Confirmed bad RAM. The kernel has caught it and is retiring pages. Don't wait for more — boot memtest86+ to localise the stick, and plan a replacement.
Repeated segfault / "internal compiler error" / corrupt-archive failures across unrelated programs, disk SMART data clean → Strongly suggestive of bad RAM even before the count moves. The shared factor is the memory. Test it.
Memtest86+: one address, one bit (Err-Bits 00000002), reproducible → A single failing cell. Classic, definitive bad stick. Identify and replace it.
Memtest86+: errors everywhere, many bits, across the whole range → A worse fault — bad address lines, a failing memory controller, or (sometimes) an unstable overclock. Retest at stock speed first to rule out settings; if it still fails broadly, it may be the board/CPU, not just a stick.
Errors only appear under XMP/EXPO, gone at JEDEC stock speed → Not bad RAM — bad settings. The memory can't run at the profile speed on this board; back off and it's fine.
HardwareCorrupted was non-zero, now reads 0 after a reboot → The count reset; the hardware did not heal. Poisoning lives in volatile memory, so a reboot wipes the evidence, not the fault. Test before you trust it.

How to Fix It

The fix for bad RAM is blunt and physical: identify the failing stick and replace it. There's no software repair, no firmware update, no setting that mends a dying DIMM. But the order matters, because data is at stake.

Danger

Bad RAM corrupts data silently and permanently — a byte that flips on its way from memory to disk is now wrong on disk, and no filesystem check will ever flag it because the disk faithfully stored exactly what it was given. The moment you have evidence of bad RAM, treat everything written since it started failing as suspect, and get your irreplaceable data backed up from a known-good source if you can. Do not run a heavy job that reads and rewrites large amounts of data (a database dump-and-restore, a big rsync, a RAID scrub) on a box with active memory faults — you risk laundering corruption into your backups. Stabilise the hardware first; trust the data second. No fix is more urgent than not making the corruption worse.

Then, by situation:

Confirm and localise with memtest86+. Before you pull anything, prove which stick. Boot memtest, note the failing addresses, then — the decisive move — test one stick at a time. Pull all but one DIMM, run a pass; rotate. The stick that throws errors alone is your culprit; the ones that pass alone are innocent. This is slower than guessing but it's the difference between fixing the problem and buying RAM you didn't need.
Reseat before you replace. Sometimes a DIMM has simply worked loose or has oxidised contacts — especially after shipping or rack work. Power down, remove the suspect stick, check the contacts, push it firmly back until both latches click, and retest. A surprising fraction of "bad RAM" is badly-seated RAM, and reseating costs nothing.
Replace the failing DIMM. If memtest confirms a stick fails in isolation, replace it. On your own hardware that's a power-down, swap, and a fresh memtest pass to confirm the errors are gone. Match the replacement to the survivors' spec (speed, capacity, ECC/non-ECC, and on many boards keep matched pairs in the right channels) — mixing mismatched DIMMs can itself cause instability.
On a rented or hosted box, open a ticket with the evidence. Paste the HardwareCorrupted value, the dmesg Memory failure lines, and the memtest86+ error output. That's an open-and-shut hardware case; a decent provider swaps the DIMM, often same-day, with no argument — because the kernel and an independent test both agree, in writing.
If the whole machine is suspect, move the workload. While you wait for a swap, the safest place for a production service is not a box actively corrupting memory. Failing over to another host removes the risk entirely until the hardware is fixed.

Pro Tip

If this is an ECC-memory server, rasdaemon often names the failing stick for you, by its motherboard label, before you ever reach for memtest. ras-mc-ctl --summary reports the memory-controller errors that rasdaemon has recorded:
Memory controller events summary:
Corrected on DIMM Label(s): 'DIMM_B1' location: 0:2:0:-1 errors: 5
No MCE errors.
DIMM_B1 is the physical slot — pull that stick, no guessing, no one-at-a-time dance. (It only works on ECC RAM with EDAC support, and the human-readable label needs a one-time mapping file under /etc/ras/dimm_labels.d/; without it you'll see an EDAC path like mc0/csrow2 instead — still a precise location, just less friendly.)
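If you want to pull the label and the count out of that summary in a script, a minimal sketch follows. It assumes the output has the same shape as the sample above; the regular expression and the function name dimm_errors are illustrative and not part of rasdaemon.

```python
import re

# Assumed line shape, taken from the sample output above:
#   Corrected on DIMM Label(s): 'DIMM_B1' location: 0:2:0:-1 errors: 5
LINE_RE = re.compile(
    r"(?P<kind>\w+) on DIMM Label\(s\): '(?P<label>[^']*)' "
    r"location: (?P<location>\S+) errors: (?P<count>\d+)"
)

def dimm_errors(summary_text):
    """Return a list of (kind, label, location, count) tuples found in the text."""
    found = []
    for line in summary_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            found.append((m["kind"], m["label"], m["location"], int(m["count"])))
    return found

sample = """Memory controller events summary:
Corrected on DIMM Label(s): 'DIMM_B1' location: 0:2:0:-1 errors: 5
No MCE errors.
"""
for kind, label, location, count in dimm_errors(sample):
    print(f"{label}: {count} {kind.lower()} error(s) at {location}")
```

Lines that don't match the assumed shape are skipped silently, so an empty result can mean either "no errors" or "different output format"; check the raw text before trusting an empty list.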

How to Avoid It

You cannot stop a memory cell from eventually wearing out or a cosmic ray from flipping a bit — that's physics, and the deep-dive below explains why. What you can do is make those events visible and survivable instead of silent and catastrophic. In order of payoff:

Buy ECC memory for anything that matters. This is the single biggest lever, and most people running their first servers don't know it exists. ECC RAM carries extra bits that let it detect and correct single-bit errors automatically and detect double-bit errors — turning a silent data corruption into a logged, corrected event (or a clean, loud failure instead of garbage). It costs a little more and needs a CPU and board that support it (most server platforms do; most consumer ones don't). For a machine holding data you can't afford to silently corrupt, ECC isn't a luxury — it's the baseline. The whole reason "real servers" use it is this page.
Watch the corrected-error trend. On ECC systems, a DIMM usually doesn't fail all at once — it starts throwing corrected single-bit errors (the memory errors the hardware fixes) for a while before it produces an uncorrectable one. Those corrected counts, climbing on one DIMM, are the early warning that lets you replace a stick before it ever corrupts anything. The whole game is replacing it during the corrected phase, not the uncorrectable one.
Burn-in new and reseated memory. New RAM, and any stick you've just reseated or swapped, deserves a full memtest86+ pass (or several) before it carries production data. Infant mortality is real — a fraction of DIMMs are bad out of the box — and finding it on a test bench beats finding it via mysterious crashes a month later.
Keep it cool and keep the power clean. Heat ages every component, memory included, and dirty or interrupted power stresses everything. Good airflow and a UPS won't make a bad stick good, but they slow the march toward the next one.

Note

Non-ECC RAM is not "fine," it's silent. A consumer machine with a flipping bit gives you exactly the symptoms on this page — random crashes, corrupt files — with no count, no log line, no warning, because there's no hardware checking the bits in the first place. (The hwpoison/HardwareCorrupted machinery can still catch the worst errors via MCE on non-ECC RAM, but it only sees failures bad enough to trip a Machine Check — the quiet single-bit flips sail right through uncaught.) If a non-ECC box is doing inexplicable things, bad RAM is higher on the suspect list, not lower — precisely because nothing else is watching.

The deepest version of all this isn't a one-off command — it's noticing the trend before the crash. A single corrected error is noise; a DIMM whose corrected count climbs day over day is a failure with a date on it, and you only catch that if something reads the count every day and compares. A manual grep of /proc/meminfo weeks apart misses exactly the signal that matters most.
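A rough sketch of what "read the count every day and compare" could look like. It assumes you already collect one cumulative corrected-error reading per DIMM per day (from rasdaemon or similar); the labels, the numbers, and the three-day window are all invented for illustration.

```python
def rising_dimms(history, min_days=3):
    """Return labels whose corrected count rose on each of the last min_days days.

    history maps a DIMM label to its daily cumulative corrected-error readings,
    oldest first. Counters typically restart after a reboot, so a drop in the
    series simply breaks the streak.
    """
    flagged = []
    for label, counts in history.items():
        recent = counts[-(min_days + 1):]
        if len(recent) < min_days + 1:
            continue  # not enough readings yet to call it a trend
        deltas = [later - earlier for earlier, later in zip(recent, recent[1:])]
        if all(delta > 0 for delta in deltas):
            flagged.append(label)
    return flagged

# Hypothetical readings: one DIMM took a single hit, the other keeps climbing.
history = {
    "DIMM_A1": [0, 0, 1, 1, 1],
    "DIMM_B1": [2, 5, 9, 14, 22],
}
print(rising_dimms(history))
```

The single corrected error on DIMM_A1 is ignored as noise, while DIMM_B1's day-over-day climb is flagged. A real check would want a tunable window and probably a minimum rate, but the shape of the comparison is the point.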

How a Bit Goes Bad

Now the part you don't need mid-crisis but that gives the whole page a why — and it happens to be one of the more wonderful stories in computing, because the enemy here is, sometimes, literally the sky. Once you can picture how a memory cell holds a bit and how that bit can be knocked loose, every line above stops being trivia and becomes something you can reason out.

How RAM Holds a Bit at All

The "RAM" in your server is DRAM — dynamic random-access memory — and the word dynamic is the whole secret. Each bit is stored as an electric charge in a microscopic capacitor, paired with a single transistor that acts as a gate: charged capacitor reads 1, empty reads 0. Billions of these tiny buckets of charge, etched onto silicon, and your entire running system — every variable, every open file, the kernel itself — is just which buckets are full right now.

But a capacitor this small can't hold its charge. It leaks, continuously, and would forget its bit within a fraction of a second if left alone. So DRAM does something faintly heroic: it refreshes every cell, reading each one and writing it back, typically once every 64 milliseconds or so, forever, just to keep your data from evaporating. The memory in your server is not a still pond holding bits; it's a fountain, every drop being caught and thrown back up before it can fall. (This is why DRAM loses everything almost as soon as the power dies, and, in a lovely twist, why cold RAM forgets more slowly, the basis of the "cold boot attack" that recovers encryption keys from a chip chilled and yanked from a running machine. A thread worth an evening, but back to the fountain.)

Now you can feel where bad RAM comes from. Anything that disturbs that delicate charge corrupts a bit: a capacitor that leaks too fast to survive between refreshes, a transistor gate that's worn or shorted, a manufacturing flaw that only shows under heat or at a certain address pattern. The cell that reads Bad 00000002 in memtest is one bucket that won't hold its level — and once you know it's a leaking bucket on a refresh treadmill, the failure isn't mysterious at all.

The Bit-Flip From Space

Here's the one that sounds like a tall tale and is completely, documented-by-IBM true: some memory errors are caused by radiation, including particles from outer space. A cosmic ray slams into the upper atmosphere, shatters into a shower of energetic neutrons, and one of those neutrons — having travelled from a supernova, more or less — strikes a memory cell in your rack and deposits just enough charge to flip a 0 to a 1. No hardware is damaged. The cell is fine. But the bit is now wrong, and the program reading it has no idea. These are called soft errors, and they are real enough that chip-makers measure them, aerospace designers obsess over them, and the rate measurably rises at altitude — a server in Denver eats more cosmic bit-flips than the identical server in Amsterdam, because there's less atmosphere overhead shielding it. Run enough RAM for long enough and you will get hit. It is the single best argument for ECC there is.

This is the deep "why" behind ECC, and it's an elegant piece of engineering. ECC adds extra bits computed from your data using an error-correcting code (typically a Hamming-style code, from the same family of maths that protects deep-space probe transmissions and CDs). The trick: the extra bits are arranged so that if any single bit in the protected word flips, whether from a cosmic ray, a leaky cell, or anything else, the hardware can not only detect it but compute exactly which bit went wrong and flip it back, on the fly, before your program ever sees it. A double-bit flip it can't fix, but it can still detect, and it refuses to hand you the bad data: failing loud instead of silent. That's the difference ECC buys: the cosmic ray still strikes, but instead of a corrupted database you get a logged, corrected event and a note that DIMM B1 took a hit. (The same error-correcting idea, by the way, is why a scratched CD still plays and a QR code still scans with a chunk missing: redundancy arranged so cleverly that the missing piece can be recomputed. Those formats use Reed-Solomon codes, close relatives of Hamming's. Claude Shannon and Richard Hamming laid the groundwork in the late 1940s at Bell Labs, and we've been quietly riding their maths ever since.)
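To see the "compute exactly which bit went wrong" trick in miniature, here is a toy Hamming(7,4) code: 4 data bits protected by 3 check bits. This is a teaching sketch, not what a DIMM actually runs; real ECC memory uses a much wider code (commonly 8 check bits per 64 data bits) with an extra parity bit so double-bit errors can be detected as well.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def correct(codeword):
    """Return (repaired codeword, position of the flipped bit or 0 if none)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome spells out the bad position
    if position:
        c[position - 1] ^= 1
    return c, position

def decode(codeword):
    """Extract the 4 data bits from a (corrected) codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]

data = [1, 0, 1, 1]
stored = encode(data)
damaged = list(stored)
damaged[4] ^= 1  # simulate a single bit flip at position 5
repaired, position = correct(damaged)
print("flipped position:", position)
print("data recovered:", decode(repaired) == data)
```

The three parity checks each cover a different overlapping subset of positions, so the pattern of which checks fail is the binary number of the position that flipped. That is the whole idea; wider codes just scale it up.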

Soft Errors vs Hard Errors — and Why memtest Matters

So there are really two kinds of memory error, and telling them apart is the whole diagnostic art:

Soft errors are transient — a cosmic ray, a one-off charge disturbance. The cell is fine; the bit was just knocked loose once. Rewrite it and it's correct again. ECC corrects these invisibly; on non-ECC RAM they're a one-in-a-blue-moon mystery crash you'll probably never explain. They do not mean your RAM is bad.
Hard errors are physical — a worn capacitor, a shorted transistor, a manufacturing defect. The cell fails repeatably, at the same address, every time you test it. This is what "bad RAM" means, and it's what memtest86+ is built to catch: the error that comes back, pass after pass, at the same address with the same bits.

That distinction is exactly why a single memtest pass isn't enough for a clean bill of health, but a single reproducible failure is enough to condemn a stick. One transient flip might be a soft error from the sky; an error that recurs at the same address across passes is a hard fault in the silicon — a bucket that will never hold its charge again. And it's why HardwareCorrupted > 0 is so unambiguous: by the time the kernel has caught an uncorrectable error and poisoned a page, you're almost always looking at a hard fault that's only going to spread.
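The recurrence rule can be sketched in a few lines. The input format here, a list of (pass number, address) pairs, is an assumption for illustration and not something memtest86+ exports directly; the addresses are invented.

```python
from collections import defaultdict

def classify(errors):
    """Label each failing address by whether it recurred across passes.

    errors is an iterable of (pass_number, address) pairs. An address seen in
    two or more distinct passes is treated as a hard fault; one seen in a
    single pass is left as unconfirmed and needs more passes.
    """
    passes_by_address = defaultdict(set)
    for pass_number, address in errors:
        passes_by_address[address].add(pass_number)
    return {
        address: "hard" if len(passes) >= 2 else "unconfirmed"
        for address, passes in passes_by_address.items()
    }

# Hypothetical observations: one address fails in three passes, another once.
observed = [(0, 0x1000), (1, 0x1000), (2, 0x1000), (1, 0x8F00)]
for address, verdict in classify(observed).items():
    print(f"{address:#x}: {verdict}")
```

Counting distinct passes rather than raw error lines matters: one pass can report the same address several times, and that alone doesn't show the fault comes back.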

So: billions of leaking buckets refreshed many times a second, occasionally hit by a neutron from a dead star, each one holding a piece of everything your server is doing. That it works at all is a small miracle; that it occasionally doesn't is just the physics catching up. The kernel keeps an honest count of the casualties, but a count only ever buys you warning. The thing that actually saves your data is catching the trend, replacing the stick, and, for anything that matters, buying the ECC memory that turns a corrupted byte into a logged footnote.