Swap Thrashing: Symptoms, Diagnosis & Fixes

The box is alive but useless: every process is queued behind a disk that is standing in for memory.

What It Is

Swap thrashing is the state where a server spends almost all its energy shuffling memory pages between RAM and disk, and almost none of it doing your actual work. The machine is technically up — it answers a ping, the kernel is running, top even refreshes if you wait long enough — but everything that matters has slowed to a crawl, because the one resource every process secretly depends on, fast access to its own memory, has quietly been replaced with the slowest disk in the building. A request that took two milliseconds yesterday takes thirty seconds today, and nobody can tell you why, because nothing crashed. That's the cruelty of thrashing: it doesn't fail, it congeals.

Here's the one-sentence version, and it's worth fixing in your head before anything else on this page: thrashing means your working set no longer fits in RAM. The "working set" is just the chunk of memory your programs are actively touching right now — the pages they need this second to make progress. When that set is bigger than the physical RAM you have, the kernel can't keep it all resident, so it's forced to constantly evict pages to swap to make room — and then, a heartbeat later, fetch those very same pages back because something needed them. Out to disk, back from disk, out again, back again, thousands of times a second. The machine isn't computing. It's paging, frantically, in a loop it can't escape on its own.

And it feels like a paradox the first time you meet it, so let's name it now: the CPU looks idle, but the box is on its knees. You'll run top, see user and system CPU time in the single digits, and think "plenty of headroom, so why is everything frozen?" The answer is one of the most important ideas in all of server performance, and once it clicks you'll never be fooled by an idle-looking CPU again: the processors aren't busy computing, they're busy waiting, stalled on a disk read that hasn't come back yet, doing nothing but tapping their feet. top files that time under wa (I/O wait), not under id. This is a memory problem wearing an I/O costume, and the two are joined at the hip; sustained, the whole picture is high I/O wait, just with swap as the disk that's drowning. By the end of this page you'll read the thrash off vmstat in one glance, understand the death-spiral that makes it self-sustaining, and know exactly which lever to pull: the right one, in the right order, under pressure.

How You Notice

Thrashing announces itself through slowness far more than through any error message — which is exactly why it's misdiagnosed so often. Here's each tell, with the command to see it on your own box right now, so you can separate the real thing from a box that's merely a bit busy.

Everything is slow, but the CPU is idle. The signature contradiction. The box feels frozen — SSH lags, the web app times out, a ls takes seconds — yet the CPU sits mostly idle. Look at the two together:
```
top
```
In top's CPU line you'll see a high wa (I/O wait) figure and a low us/sy, with load average climbing well past your core count even though us is near zero. Load high and CPU idle is the fingerprint: Linux counts processes in uninterruptible sleep toward the load average alongside runnable ones, so the number is inflated by processes that aren't running at all. They're blocked, waiting on swap to hand back a page.
vmstat shows si and so pinned high, sample after sample. This is the rawest, most honest symptom there is — and the single command that confirms thrashing beyond doubt. Run it with a one-second interval so you see rates, not since-boot totals:
```
vmstat 1
```
The si (swap-in) and so (swap-out) columns are normally a calm 0 0. When both are sitting at thousands and staying there across every one-second sample, that is thrashing, plainly. We'll read a real one line by line below — it's the heart of this page.
Processes pile up in D state. A process waiting on a page that has to come back from disk drops into D — uninterruptible sleep — the same state a dying disk puts its readers in, because to the scheduler "waiting on swap" and "waiting on a stuck disk" are the same thing: blocked on I/O. List them:
```
ps -eo state,pid,comm | awk '$1 ~ /D/'
```
A growing column of D-state processes during a slowdown, with an idle CPU, points straight at memory pressure rather than a busy box.
free shows swap filling and RAM with almost nothing free. The cause, in two lines:
```
free -h
```
When Mem shows next to nothing available and Swap used is climbing, you've run out of room and the kernel has started leaning on disk to fake the rest. (Full reference on free.) A box that's using swap isn't necessarily thrashing — idle pages parked in swap are harmless and normal. Thrashing is swap being read and written constantly, which is the vmstat test, not the free one.
An OOM kill that ends the freeze. Sometimes the story ends with the kernel's OOM killer shooting a big process to reclaim memory, and the box springs back to life the instant it does. Check after the fact:
```
dmesg -T | grep -iE "Out of memory|oom-killer|Killed process"
journalctl -k | grep -i oom
```
If you find one, thrashing was the prelude — the slow agony before the kernel finally made the brutal-but-correct call. (That whole mechanism is out of memory.)

Any one of these — and especially the vmstat one — means the same thing: stop treating this as a slow CPU or a slow disk, and start treating it as not enough RAM for the work you're asking of it. That reframe is the entire game.

How I Read It

The one command that settles it is vmstat, and the trick — the part people miss — is the 1. Run plain vmstat and the swap columns show averages since boot, which on a box that thrashed for ten minutes last Tuesday look reassuringly small. Run vmstat 1 and every line after the first is a fresh one-second sample: live rates, which is exactly what thrashing is — a rate, not a total.

```
vmstat 1
```

Here's a real capture from a box (db-prod) mid-thrash: a database server whose working set outgrew its RAM. vmstat's first row is always the since-boot average, so it has been left out here; every row shown is one second of reality:

```
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b    swpd   free   buff  cache    si    so    bi    bo   in    cs us sy id wa st
 4 18 8123904  61244   9012  73128 10480 14760 11320 15040 4821  9633  3  6  4 87  0
 5 21 8129772  58020   8904  71540 11960 16320 12800 16780 5103 10240  2  7  3 88  0
 3 19 8131008  60112   9100  72004  9840 12110  9960 13220 4477  8901  3  5  5 87  0
 6 23 8134560  57340   8820  70996 12740 17050 13600 17320 5290 10711  2  8  2 88  0
```

It looks like a wall of numbers, but only a handful matter, and they all tell the same story. Let me walk the columns that count, left to right, because once you see why each one moves you can read any vmstat thrash in five seconds:

r and b (procs). r is processes runnable (running or queued to run); b is processes blocked, waiting on I/O to complete. On a healthy box b hovers near zero. Here it's 18, 21, 19, 23 — eighteen-to-twenty-three processes frozen at any instant, every one of them stuck waiting for the disk to hand back a page of memory. That single column is the human cost of thrashing made visible.
swpd and free (memory). Both are reported in KiB. swpd is how much swap is in use: about 8.1 million KiB here (roughly 7.75 GiB) and climbing line to line. free is genuinely idle RAM: a desperate 56–60 MiB or so on what is not a small server. Almost no free memory, lots of swap used: the precondition for everything below.
si and so (swap): the verdict. si is "memory swapped in from disk per second"; so is "memory swapped out to disk per second", both in KiB. These are the two columns the whole diagnosis rests on. Healthy: 0 0, all day. Here: si around 10,000–13,000 and so around 12,000–17,000, every single second, sample after sample. That is the kernel reading roughly 11 MiB of pages back from disk and writing roughly 15 MiB out to disk, per second, continuously: not a burst, a treadmill. Sustained, non-zero si and so together is the definition of thrashing. Not "swap is used" (that's fine); swap being churned.
bi and bo (io). Blocks in/out from block devices, in KiB/s. Notice they march in lockstep with si/so — the swap traffic is disk traffic, so the disk is now saturated carrying memory pages back and forth instead of doing useful reads and writes. This is the bridge to high I/O wait: swap thrashing is one of the most common things hiding behind a disk that looks pinned.
cs (system). Context switches per second: roughly 8,900–10,700 here. Every time a process blocks on a page fault the scheduler switches to another (which then also blocks on its own missing page), so the switch rate balloons. The CPU is incredibly busy changing its mind about who to run, and almost never actually running anyone.
us, sy, id, wa, st (cpu) — the paradox, resolved. Read this line and the whole illusion dissolves. us (user code) is 2–3%. sy (kernel code) is 5–8%. id (idle) is 2–5%. And wa (waiting for I/O) is 87–88%. The CPU is spending the overwhelming majority of its time doing nothing but waiting for the disk. That's why top says "idle" and the box is dead: idle and I/O-wait are different things, and a beginner reads wa as idle. It isn't. wa is the CPU standing at the mailbox, refusing to do anything else until the disk delivers the page it's blocked on. (st, steal, is 0 — no noisy cloud neighbour stealing cycles here; that's a different story.)

So the honest one-line reading of vmstat 1 for thrashing is: si and so both high and staying high, wa dominating the CPU, b and cs through the roof, free near zero. Six columns, one story: there's not enough RAM, and the kernel is using the disk as a slow, agonising substitute.
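That one-line reading can be turned into a rough filter. This is only a sketch under stated assumptions: the function name thrash_watch is made up, the cut-offs (1000 KiB/s for si and so, 50% for wa, three samples in a row) are illustrative rather than anything the capture dictates, and the field positions assume the default procps column order shown above.

```shell
# Sketch: flag sustained swap churn in `vmstat 1` output read from stdin.
# Assumptions: thresholds are illustrative; columns follow the default
# procps layout (si is field 7, so is field 8, wa is field 16).
# Usage: thrash_watch [MIN_KIB_PER_S] [MIN_WA_PCT] [CONSECUTIVE_SAMPLES]
thrash_watch() {
  awk -v min_kib="${1:-1000}" -v min_wa="${2:-50}" -v need="${3:-3}" '
    $1 !~ /^[0-9]+$/ { next }   # skip the two header lines
    {
      # Count consecutive samples where si, so and wa are all high.
      if ($7 >= min_kib && $8 >= min_kib && $16 >= min_wa) { streak++ } else { streak = 0 }
      # Report once, at the moment the streak reaches the required length.
      if (streak == need) {
        print "THRASHING: si=" $7 " so=" $8 " wa=" $16 " for " need " samples"
      }
    }'
}
```

Something like `vmstat 1 30 | thrash_watch` would watch thirty seconds and print a single line if the churn holds for three samples in a row. A one-off blip resets the streak, which is the point: it is the sustained rate that matters, not a spike.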

Note

The most common misread on this whole page is treating wa as spare capacity. It is the opposite of spare capacity: it's the CPU blocked, unable to advance, waiting on storage. A box at 90% wa has no headroom at all — it's as stuck as a box at 90% us, just stuck waiting instead of stuck computing. "The CPU's idle, so we're fine" is the sentence that turns a ten-minute thrash into an hour-long outage.

Confirming It Was Swap, Not Just a Slow Disk

vmstat 1 shows you the rate right now. To see how much swapping the box has done over its whole life — useful when you arrive after the fire, or want to know whether this is chronic — read the kernel's cumulative counters straight from /proc/vmstat:

```
grep -E "pswpin|pswpout" /proc/vmstat
```

```
pswpin 10516342
pswpout 15089770
```

These are lifetime totals: pswpin is the number of pages ever swapped in from disk, pswpout the number swapped out, both counting up since boot. On this box, ~10.5 million pages in and ~15.1 million pages out; at the standard 4 KiB page that's roughly 40 GiB (43 GB) read back from swap and 58 GiB (62 GB) written to it over its uptime. A healthy server that's never been under memory pressure shows these in the thousands, often near zero; numbers in the millions are a box that has spent real time on the swap treadmill. Because they're cumulative counters, the live signal is the delta between two readings a few seconds apart, which is precisely what vmstat 1's si/so columns compute for you. (Same data, two faces: vmstat does the subtraction; /proc/vmstat hands you the raw odometer.)
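The subtraction is easy enough to do by hand. Here is a minimal sketch of both calculations, assuming 4 KiB pages (check yours with `getconf PAGESIZE`); the function names swap_rate and pages_to_gib are made up for illustration.

```shell
# swap_rate IN1 OUT1 IN2 OUT2 SECONDS [PAGE_KIB]
# Turns two pswpin/pswpout readings into KiB/s, the unit vmstat uses for si/so.
# Assumption: 4 KiB pages unless PAGE_KIB says otherwise.
swap_rate() {
  page_kib="${6:-4}"
  echo "si=$(( ($3 - $1) * page_kib / $5 )) so=$(( ($4 - $2) * page_kib / $5 ))"
}

# pages_to_gib PAGES [PAGE_KIB]
# Converts a lifetime page count into GiB, to one decimal place.
pages_to_gib() {
  awk -v p="$1" -v k="${2:-4}" 'BEGIN { printf "%.1f\n", p * k / 1024 / 1024 }'
}

# Possible live use (reads the real counters, five seconds apart):
#   a=$(awk '/^pswpin /{i=$2} /^pswpout /{o=$2} END{print i, o}' /proc/vmstat)
#   sleep 5
#   b=$(awk '/^pswpin /{i=$2} /^pswpout /{o=$2} END{print i, o}' /proc/vmstat)
#   swap_rate $a $b 5
```

With the totals from this box, `pages_to_gib 10516342` prints 40.1 and `pages_to_gib 15089770` prints 57.6.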

Reading It by Example

Train the pattern-match. The vmstat 1 readout on the left, what I'd actually conclude on the right:

si 0 so 0, wa 0, free healthy, swap used but quiet → Not thrashing. Idle pages parked in swap are fine — that's swap doing its job. Move on; the disk's clean and the CPU's free.
si 0 so 240, then back to 0, a one-off blip → A normal, healthy nudge: the kernel evicted some cold pages to reclaim RAM and went back to quiet. This is swap working as designed, not a problem. Don't chase it.
si ~11000 so ~15000, sustained every second, wa 87%, b in the teens → Textbook thrashing. The db-prod case above. The box is alive but useless; go to the fix ladder and start at the top — find and kill or limit the hog, then add RAM.
so high, si near zero, sustained → Pages flooding out to swap but not coming back — memory pressure building, the working set being squeezed onto disk, but not yet the full in-and-out treadmill. The early innings of thrashing, or a one-time large eviction. Find what's eating RAM now, before si climbs to meet so.
si/so high and an OOM kill in dmesg → Thrashing that crossed the line: the kernel thrashed for a while, then gave up and invoked the OOM killer. The kill is the cure the kernel chose; your job is to stop it recurring (less work, or more RAM).
si/so near zero but wa still high → Not swap — a genuinely slow or failing disk doing ordinary I/O. Same wa symptom, completely different cause; this page is the wrong one, go to high I/O wait.
si/so high and vm.swappiness reads 100 → Thrashing made worse by a kernel tuned to swap eagerly. The root cause is still "not enough RAM," but an over-eager swappiness is pouring petrol on it — relief below.

The fork at the bottom of all those is the one to internalise: si and so both sustained = a RAM shortage the kernel is papering over with disk. si/so quiet but wa high = a disk problem, not a memory one. That one distinction routes you to the right page every time.
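The same fork can be written down as a tiny router. Treat it as a sketch: classify_sample is a hypothetical name, and the cut-offs (1000 KiB/s counts as "high" swap traffic, 50% counts as "high" wa) are assumptions picked to match the examples above, not universal thresholds. Feed it a reading that has held for several seconds, not a single blip.

```shell
# classify_sample SI SO WA
# Routes one sustained vmstat reading to a verdict.
# Assumed cut-offs: si/so >= 1000 KiB/s is "high"; wa >= 50% is "high".
classify_sample() {
  si=$1; so=$2; wa=$3; hi=1000
  if [ "$si" -ge "$hi" ] && [ "$so" -ge "$hi" ]; then
    echo "thrashing"            # RAM shortage papered over with disk
  elif [ "$so" -ge "$hi" ]; then
    echo "pressure-building"    # pages leaving, not yet coming back
  elif [ "$si" -ge "$hi" ]; then
    echo "unclassified"         # swap-in only: not one of the cases above
  elif [ "$wa" -ge 50 ]; then
    echo "disk"                 # high wa with quiet swap: a storage problem
  else
    echo "quiet"                # parked pages or a one-off nudge
  fi
}
```

For the db-prod numbers, `classify_sample 11000 15000 87` prints thrashing, while `classify_sample 0 0 85` prints disk and sends you to the other page.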

How to Fix It

Thrashing is one of the few server emergencies where the immediate relief and the real fix are different things, and you usually want both, in order: stop the bleeding now, cure the cause after. But one warning comes first, because the fastest "fix" is also the one that loses data.

Danger

The tempting emergency move is swapoff -a to force everything back into RAM — and on a thrashing box that is out of memory, turning swap off can instantly trigger the OOM killer to shoot whatever's largest, which on a database server is the database. An ungraceful kill of a process mid-write can corrupt data or drop in-flight transactions. Never swapoff a thrashing production box as a reflex. If you must reduce swap pressure, first make room — kill or stop a known, expendable hog yourself, then consider swapoff on a box that now has the RAM to absorb it. Stopping the right process by hand is always safer than letting the kernel pick under duress.

Then, immediate relief first, root cause second:

Find the memory hog and deal with it (the first real move). Thrashing is almost always one process (or one runaway group of them) that ballooned past what the box can hold. Name it:
```
ps -eo pid,ppid,rss,comm --sort=-rss | head
top -o %MEM
```
ps sorted by rss (resident memory) puts the biggest consumer at the top. If it's a leak, a stuck batch job, or a misconfigured cache, stop or restart it — systemctl restart the offending service, or kill it if it's expendable — and the treadmill stops the instant its pages no longer need to be resident. A graceful restart beats a kill -9; a process given the chance to flush is a process that doesn't corrupt anything. This single step ends most thrashing in seconds, and it's why "what's eating the RAM?" is the first question, not "how do I tune swap?"
Add RAM — the only permanent cure for a too-small box. If the working set genuinely needs more memory than the machine has, no amount of tuning will help; you're rationing a resource that's simply too small. On a cloud instance that's a resize; on bare metal, more DIMMs. This is the honest answer most thrashing eventually points to, and the reframe matters: "we need a bigger server" is usually a reflex, but for thrashing specifically it's often literally correct — the work doesn't fit.
Cap the hog instead of buying RAM, where the app allows. Often the working set is bloated by configuration, not necessity: a database's buffer pool set larger than RAM, a JVM -Xmx bigger than the box, an app spawning more workers than memory can back. Right-sizing those (innodb_buffer_pool_size, -Xmx, worker counts) shrinks the working set to fit — frequently the better fix than more hardware, because a process configured to use more memory than exists was always going to thrash. Put memory-hungry services under a systemd MemoryMax= (a cgroup limit) and the kernel reclaims within that one service before it can drag the whole box down.
Reduce vm.swappiness for emergency relief. Swappiness is the knob (0–100 on older kernels, 0–200 since Linux 5.8) that tells the kernel how eagerly to swap out anonymous pages versus dropping file cache. Lower it and the kernel fights harder to keep program memory resident instead of paging it to disk:
```
# Runtime change only: needs root and is lost on reboot
sysctl vm.swappiness=10
```
This is relief, not a cure — it makes the kernel prefer RAM, which eases an over-eager-swapping box, but it cannot conjure memory that isn't there. On a box that's truly out of RAM, a low swappiness just routes the pressure toward OOM instead of swap. Use it to calm a box that's thrashing because it was tuned to swap too readily; don't expect it to save a box that's genuinely too small. (The setting and its trade-offs: swappiness too high.)
Temporarily swapoff/swapon to clear stale swap — after you have headroom. Once you've killed the hog and the box has real free RAM again, swapoff -a && swapon -a forces every swapped page back into memory and resets swap to empty, which clears out cold pages that were dragging on response times. Safe only once free -h shows ample available memory — never while still under pressure (see the Danger above).
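The "only once you have headroom" rule from the last step can be checked before you type anything dangerous. A minimal sketch, assuming the usual /proc/meminfo field names; the function name swapoff_is_safe and the 20% safety margin are my own choices, not a standard.

```shell
# swapoff_is_safe [MEMINFO_FILE] [MARGIN_PCT]
# Exit status 0 only if MemAvailable covers all used swap plus a margin.
# Assumptions: the 20% margin is arbitrary; the file defaults to /proc/meminfo.
swapoff_is_safe() {
  awk -v margin="${2:-20}" '
    $1 == "MemAvailable:" { avail = $2 }
    $1 == "SwapTotal:"    { total = $2 }
    $1 == "SwapFree:"     { free  = $2 }
    END {
      used = total - free                      # KiB currently parked in swap
      need = used * (100 + margin) / 100       # what RAM must absorb, plus slack
      printf "swap used %d KiB, available %d KiB, need %d KiB\n", used, avail, need
      exit (avail >= need ? 0 : 1)
    }' "${1:-/proc/meminfo}"
}

# Possible use, only after the hog is gone:
#   swapoff_is_safe && swapoff -a && swapon -a
```

If the check fails, the box still cannot hold what is sitting in swap, and turning swap off would hand the decision to the OOM killer.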

Pro Tip

The fastest accurate triage, in three commands and under ten seconds: vmstat 1 to confirm it's thrashing (sustained si/so, high wa), ps -eo pid,rss,comm --sort=-rss | head to name the hog, then a targeted systemctl restart of the culprit to stop it. Confirm, name, stop — in that order. Reaching for swappiness or swapoff before you've named the hog is treating the symptom while the cause keeps growing.

How to Avoid It

Thrashing is preventable, because unlike a failing disk it isn't physics catching up with you — it's a workload outgrowing its memory, and that's a thing you can size for and stay ahead of. In rough order of payoff:

Size RAM to the working set, with headroom. The root cause is always "active memory > physical RAM," so the cure is always "enough RAM for the active memory, plus slack for spikes." Know roughly how much your app, database, and OS cache actually touch under load, and provision above it. A box that never approaches its memory ceiling never thrashes — full stop.
Bound the memory-hungry services. Anything that grows without a ceiling — a database buffer pool, a JVM heap, a worker pool, a cache with no eviction — is a thrash waiting for a busy Tuesday. Configure explicit limits (innodb_buffer_pool_size, -Xmx, maxmemory for Redis) that sum to less than RAM, and put services under systemd MemoryMax= so one runaway is contained to itself instead of taking the host down with it.
Right-size swap, and know what it's for. Swap is a shock absorber for cold pages and a runway that buys you time before the OOM killer fires — not extra RAM. A little swap is healthy; a lot of swap on a memory-starved box just gives the kernel more rope to thrash on before it gives up. Size it modestly and treat heavy, sustained swap use as the warning it is, not as capacity.
Tune vm.swappiness for a server, not a laptop. The default of 60 (sometimes 100) is a general-purpose setting, not a server one: it lets the kernel swap program memory out fairly readily to protect file cache, which is sensible for a laptop and wrong for a server whose program memory is the work. A lower value (commonly 10, or even 1) keeps your processes resident and only swaps under genuine pressure. It won't save a too-small box, but it stops a right-sized one from swapping needlessly. (swappiness too high covers the trade-offs.)
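"Limits that sum to less than RAM" is arithmetic you can check before a deploy. A rough sketch follows; budget_check is a made-up name, the 20% headroom in the example is an assumption, and the figures stand in for whatever you have actually set for innodb_buffer_pool_size, -Xmx and maxmemory. Configured limits tend to understate a process's real footprint, so treat a pass as necessary, not sufficient.

```shell
# budget_check RAM_MIB HEADROOM_PCT LIMIT_MIB...
# Succeeds only if the configured limits fit inside RAM minus the headroom.
# Assumption: all values are whole MiB; the headroom figure is your call.
budget_check() {
  ram=$1; headroom=$2; shift 2
  sum=0
  for limit in "$@"; do sum=$((sum + limit)); done
  ceiling=$((ram * (100 - headroom) / 100))
  echo "limits=${sum}MiB ceiling=${ceiling}MiB"
  [ "$sum" -le "$ceiling" ]
}
```

On a hypothetical 16 GiB box, `budget_check 16384 20 8192 4096 512` passes with limits=12800MiB against ceiling=13107MiB, while `budget_check 16384 20 12288 4096` fails. A per-service ceiling could then be applied with something like `systemctl set-property example.service MemoryMax=6G`, where the unit name and value are placeholders and the host is assumed to be on cgroup v2.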

And the deepest prevention isn't a number at all, it's a trend. A box at 60% memory is meaningless on its own; a box that climbed from 60% to 95% over a week is a thrash with a date on it — and the day swap usage starts rising and getting churned is the day before the outage, not the day of it. The signal that matters is RAM pressure building over time, with swap traffic appearing where there was none — a trajectory a single live free -h can never show you, and the one thing a monitor catches days early.
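To put a date on a trend like that, a straight-line extrapolation is the crudest tool that works. This sketch assumes linear growth, which real workloads rarely honour, and days_to_full is a hypothetical helper, not part of any monitoring tool.

```shell
# days_to_full START_PCT END_PCT DAYS_ELAPSED
# Straight-line estimate of days left until memory use reaches 100%.
# Assumption: growth continues at the same average rate.
days_to_full() {
  awk -v a="$1" -v b="$2" -v d="$3" 'BEGIN {
    if (d <= 0 || b <= a) { print "no growth"; exit }
    rate = (b - a) / d                 # percentage points per day
    printf "%.1f\n", (100 - b) / rate
  }'
}
```

For the climb described above, `days_to_full 60 95 7` prints 1.0: five points a day, five points left, one day to go.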

Why Thrashing Spirals — The Death-Spiral, Explained

Now the part you don't need mid-emergency but that makes the whole thing fall into place — and it's one of the more beautiful-and-terrible mechanisms in an operating system. To understand why thrashing, once it starts, gets worse instead of settling, you have to understand the one idea the whole of virtual memory rests on: your programs don't address physical RAM. They address an illusion, and the kernel maintains that illusion with sleight of hand.

The Beautiful Lie of Virtual Memory

Every process runs believing it has a vast, private, contiguous span of memory all to itself. It doesn't. The kernel hands each process a map of virtual addresses and quietly translates those, page by page (a page is 4 KiB on most systems), to wherever the data actually lives. And here's the trick that makes everything possible: a page's data doesn't have to live in RAM. It can live on disk — in swap for a program's own anonymous memory, or in the original file for code and mapped files — with the kernel keeping only a note that says "this page exists, but it's parked on disk right now."

When a process touches an address whose page is parked on disk, the CPU hits a page fault: it traps into the kernel, which says "hold on, I need to fetch that," reads the page back from disk into a free frame of RAM, updates the map, and lets the process continue as if nothing happened. A minor fault (the page was already in memory, just unmapped) is nearly free. A major fault — the one that defines thrashing — means an actual disk read, and disk is five orders of magnitude slower than RAM: nanoseconds become milliseconds. One major fault is invisible. A million a minute is a catastrophe. This whole graceful machine is what lets you run programs that need more memory than you have — right up until they actually need it all at once.
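The arithmetic behind "one is invisible, a million is a catastrophe" is worth doing once. The sketch below computes the average cost of a memory access when some fraction of accesses major-fault. The defaults are assumed round numbers (100 ns for RAM, 10 ms for a spinning disk) chosen to match the five-orders-of-magnitude gap, not measurements, and effective_ns is a made-up name.

```shell
# effective_ns FAULT_RATE [RAM_NS] [DISK_NS]
# Average nanoseconds per memory access when a fraction FAULT_RATE of
# accesses has to wait for a disk read.
# Assumed defaults: 100 ns per RAM access, 10,000,000 ns (10 ms) per fault.
effective_ns() {
  awk -v p="$1" -v ram="${2:-100}" -v disk="${3:-10000000}" 'BEGIN {
    printf "%.1f\n", (1 - p) * ram + p * disk
  }'
}
```

`effective_ns 0` prints 100.0, while `effective_ns 0.001` prints 10099.9: with just one access in a thousand faulting to disk, the average access is already about a hundred times slower.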

Why

This is also why an idle CPU can sit at 88% wa. When a process page-faults to disk, it can't continue until the page arrives, so the scheduler blocks it and runs someone else — who promptly faults on their missing page and blocks too. The CPU isn't idle in the "nothing to do" sense; it's idle in the "everyone who wants to run is waiting for the disk" sense. There is work — it's all just stuck behind storage. wa is the kernel's honest accounting of that stall, and it's why "idle CPU, frozen box" is not a contradiction at all once you see the machinery.

Each Rescue Causes the Next Disaster

Here's the spiral, and it's vicious precisely because every step is the kernel doing exactly the right thing. Picture a box whose working set is just barely too big for RAM — every page is needed, but they don't all fit.

A process touches a page that's been swapped out. Major fault. The kernel must read it back from disk — but RAM is full, so first it has to evict some other page to make a frame free. It picks the page that's been used least recently and writes it out to swap. The fault resolves, the process runs… for a few microseconds, until it touches the page that just got evicted — because on a box this tight, that page was needed too. Another major fault. To bring it back, the kernel evicts another needed page. Which is needed again almost immediately. Another fault. Another eviction. Another fault.

That's the death-spiral: every page brought in forces a needed page out, and every page forced out is faulted back in moments later. The kernel isn't malfunctioning — at each step it's reclaiming the coldest page available and servicing the most urgent fault, flawlessly. The problem is that on a box whose working set exceeds RAM, there is no cold page — they're all hot, all needed soon — so every eviction is a future fault and every fault forces a future eviction. The system locks into a self-feeding loop where it does nothing but page, and the more it pages the slower it gets, and the slower it gets the longer each process holds memory it can't make progress with, which makes the pressure worse. It is a traffic jam made of memory: each car inching forward only by forcing the car ahead to brake.

And it does not break on its own. Left alone, a thrashing box stays thrashing — there's no natural relief valve, because nothing finishes, so nothing frees memory, so the pressure never drops. Three things can end it: the workload drops (a client gives up and disconnects, shedding its memory), you intervene (kill the hog, add RAM), or the kernel reaches for its weapon of last resort and invokes the OOM killer — shooting the biggest process to forcibly reclaim its memory in one stroke. That OOM kill, brutal as it is, is often the kindest thing that happens to a thrashing box all day: it's the loop finally being cut. The kernel paged and paged trying to satisfy everyone, realised it never could, and made the hard call a human should have made minutes earlier.

So: thrashing isn't a bug, it's virtual memory's graceful illusion meeting a workload that calls its bluff. The kernel promised every process more memory than exists, most processes never cash the cheque at once — and on the day they do, the disk gets handed a job five orders of magnitude too slow for it, and the whole machine grinds to the speed of spinning rust. Hold that picture and every column of vmstat reads itself: si/so are the cheque being cashed, wa is the CPU waiting for the bank, and b is the queue of everyone stuck in line.
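The spiral is easy to reproduce in miniature. The toy below is a sketch, not a model of the real kernel: it assumes strict least-recently-used eviction and a process that walks its pages in a fixed cycle, and lru_faults is a name invented for the example. Give it one frame fewer than the working set needs and watch the fault count.

```shell
# lru_faults FRAMES PAGES ACCESSES
# Counts faults for a cyclic walk over PAGES distinct pages, with FRAMES
# frames of "RAM" and strict LRU eviction (a simplification of real reclaim).
lru_faults() {
  awk -v frames="$1" -v pages="$2" -v n="$3" 'BEGIN {
    faults = 0; resident = 0
    for (t = 1; t <= n; t++) {
      p = (t - 1) % pages                 # next page in the cycle
      if (!(p in last)) {                 # not resident: a fault
        faults++
        if (resident == frames) {         # RAM full: evict the LRU page
          victim = ""; oldest = t
          for (q in last) {
            if (last[q] < oldest) { oldest = last[q]; victim = q }
          }
          delete last[victim]
          resident--
        }
        resident++
      }
      last[p] = t                         # record when the page was last used
    }
    print faults
  }'
}
```

`lru_faults 5 5 100` prints 5: the pages fault in once and stay. `lru_faults 4 5 100` prints 100: with one frame too few, the page evicted is always the next one needed, so every single access faults.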

What Tips a Healthy Box Over the Edge

If thrashing is a working set outgrowing RAM, the practical question is what makes the working set jump — because a box can sit at 80% memory for months and then thrash hard on a Tuesday afternoon. A few triggers do almost all of the damage, and recognising them turns "it just fell over" into "of course it did."

A cache or buffer pool that grows to fit the data, not the RAM. The classic is a database told to use a buffer pool sized to the dataset rather than to the machine — innodb_buffer_pool_size set to "most of the data" on a box that doesn't have most-of-the-data worth of RAM. It behaves for as long as the hot rows fit in what RAM is actually free; the morning a report touches the cold tail of the table, the pool tries to pull all of it resident at once and the working set steps off a cliff. The fix isn't more RAM, it's a pool sized to the box — which is why right-sizing sits so high on the fix ladder above.

Concurrency multiplying per-request memory. Many stacks hold memory per in-flight request — a few hundred MB of working set per worker is invisible at ten concurrent requests and fatal at two hundred. A traffic spike, a retry storm, or a slow downstream that makes every request linger longer all do the same thing: they raise the number of requests resident simultaneously, and the working set is per-request memory times concurrency. The box didn't get a bigger job; it got the same job many more times at once. This is why a thundering herd often shows up first as a thrash, not as high CPU.
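The multiplication is simple enough to keep as a pair of helpers. Everything here is illustrative: the function names are made up, and the 300 MiB per request, the 32 GiB box and the 4 GiB reserved for the OS and everything else are hypothetical figures.

```shell
# working_set_mib PER_REQUEST_MIB CONCURRENCY
# Working set contributed by in-flight requests, in MiB.
working_set_mib() {
  echo $(( $1 * $2 ))
}

# max_concurrency RAM_MIB RESERVED_MIB PER_REQUEST_MIB
# How many requests fit at once, after reserving memory for everything else.
max_concurrency() {
  echo $(( ($1 - $2) / $3 ))
}
```

`working_set_mib 300 10` prints 3000, `working_set_mib 300 200` prints 60000, and `max_concurrency 32768 4096 300` prints 95. Past that ninety-fifth concurrent request, the extra memory has nowhere to come from but swap, which is a reasonable argument for capping worker counts at a number derived this way.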

A leak finally cashing in. A slow memory leak is a thrash with a fuse on it. For days the leaked pages are cold — never touched again — so they drift out to swap and the box looks fine; free shows swap filling but vmstat's si/so stay quiet, because nobody's reading the leaked pages back. The trouble starts when the leak crowds the legitimate working set out of RAM: now the pages the app actually needs are the ones getting evicted, and they do get faulted back in. The leak itself never thrashes — it just shrinks the room the real work has to live in until the real work no longer fits.

A second tenant arriving. A box right-sized for one job thrashes the instant something else moves in next to it — a backup job that mmaps a huge file, a cron task that loads a dataset, a sidecar that balloons. Each is fine alone; together their working sets sum past RAM. This is the strongest argument for the systemd MemoryMax= discipline from the fix ladder: bound each tenant and a runaway is contained to its own cgroup instead of evicting its neighbours' hot pages and dragging the whole host onto the swap treadmill.

The thread through all four: thrashing is rarely "the box slowly used more memory." It's almost always a step change — a query that widened the working set, a spike that multiplied it, a leak that finally squeezed it, a neighbour that joined it — landing on a box that had less headroom than it looked. Which is the whole case for watching the trend: the headroom shrinking is the warning; the step change is just the day it ran out.