Page Cache: Explanation & Insights

Why your server reports 95% memory used while nothing is running — and why that is fine.

What It Is

The page cache is the kernel's automatic mechanism for keeping recently-read file data in RAM so that the next read of the same data is served from memory instead of from the disk. Every ordinary (buffered) read() system call that touches a file on a filesystem goes through the page cache; the only exceptions are files an application has deliberately opened with O_DIRECT. The first read fetches data from the physical disk (an HDD seek, an NVMe command, whatever the storage layer requires) and stores a copy in RAM, in units called pages (4 KB each on most Linux systems). The second read of the same data never reaches the disk at all. It comes straight from RAM, at memory speed, which is at least ten thousand times faster than a seek on a spinning HDD and still well over ten times faster than a fast NVMe SSD.

The kernel does this entirely on its own. No application has to ask for it, no administrator has to configure it, no daemon runs it. It's built into the virtual memory subsystem and has been since the earliest days of Linux. Every file you cat, every log line a daemon reads, every config file Apache loads at startup, every database page PostgreSQL fetches through the normal I/O path — all of it flows through the page cache, and all of it stays in RAM until something else needs that RAM more.

This is the single most important performance feature on a Linux server that nobody ever configures. A warmed-up server — one that has been running its normal workload for a few hours — typically serves 90% or more of its read I/O from cache. The disk is barely involved. Pull the power and reboot, and you'll feel the difference immediately: the first few minutes are sluggish as the cache fills back up. That warm-up period — the time it takes for the page cache to learn your working set — is the gap between "disk speed" and "RAM speed," and on a healthy server it closes fast.

Why It Matters

Three reasons, in order of how often they'll affect your life as a server admin.

First: performance. The page cache is the reason grep on a 2 GB log file takes 30 seconds the first time and half a second the second time. It's the reason a web server can serve the same static assets to ten thousand visitors without the disk breaking a sweat. It's the reason a build system that re-reads the same header files a thousand times doesn't grind to a halt. Without the page cache, every I/O operation would be a round trip to storage. With it, only the first one is.

Second: the great memory confusion. This is where the page cache causes the most trouble — not because it does anything wrong, but because it makes memory usage numbers look alarming to anyone who doesn't know the trick. Run free -h on a server that's been up for a week and you'll see something like this:

              total        used        free      buff/cache   available
Mem:           31Gi        4.2Gi       1.1Gi        26Gi        26Gi
Swap:         8.0Gi          0B        8.0Gi

At first glance this looks like the server is using about 30 GB out of 31 GB, which is nearly full. Time to panic, right? Add more RAM? Kill some processes? No. Look at the buff/cache column: 26 GB. That's the page cache. The kernel is using 26 GB of RAM to cache file data, not because applications need it, but because the RAM is there and it would be wasteful to leave it empty. The real question is the available column: 26 GB. That's how much RAM applications could actually claim right now, because the kernel will evict clean cache pages on demand, without hesitation, as soon as a process allocates memory and starts touching it.

The rule is simple and worth memorising: free RAM is wasted RAM. An idle page of RAM that holds nothing is a page that could be caching a file read and saving a disk I/O. The kernel fills every scrap of free RAM with cache because that's the optimal thing to do. The available number — not free, not total - used — is what tells you whether your server is actually running low on memory.

Third: swap interaction. When real application memory pressure builds and available drops toward zero, the kernel has two levers. It can evict clean cache pages (free, instant, painless — the data is still on disk and can be re-read later). Or it can swap out application pages (expensive, slow, painful — those pages have to be written to swap space and read back later). A healthy system under pressure evicts cache first and almost never swaps. If you see heavy swapping while buff/cache is still large, something is wrong with the balance — usually the vm.swappiness kernel parameter is set too high, or an application has mlock'd pages that the kernel can't reclaim.

How I Read It

free -h — The 10-Second Answer

The fastest way to understand your server's memory:

free -h

              total        used        free      buff/cache   available
Mem:           31Gi        4.2Gi       1.1Gi        26Gi        26Gi
Swap:         8.0Gi          0B        8.0Gi

The columns that matter:

buff/cache — RAM used by the page cache (file data) plus buffers (block device metadata). This is the cache.
available — RAM that applications can claim right now, including cache that would be evicted to make room. This is the number that answers "do I have enough memory?"
free — RAM that is literally unused, holding nothing. On a healthy server this number is low, and that's fine. Low free with high available means the kernel is doing its job.
used — RAM consumed by applications (not counting cache). This is real demand.

The old free command (before procps-ng 3.3.10, roughly 2014) didn't have an available column. It showed a -/+ buffers/cache row instead, which tried to do the same subtraction manually. If you're on a modern distro, you have available and you should use it. If someone sends you a free output without the available column, the server is running very old software and deserves attention for more reasons than memory.

/proc/meminfo — The Full Picture

/proc/meminfo is where free gets its numbers. It has far more detail:

grep -E '^(MemTotal|MemFree|MemAvailable|Buffers|Cached|SReclaimable|Dirty|Writeback):' /proc/meminfo

MemTotal:       32768000 kB
MemFree:         1126400 kB
MemAvailable:   27340800 kB
Buffers:          204800 kB
Cached:         25395200 kB
Dirty:             12800 kB
Writeback:             0 kB
SReclaimable:     512000 kB

The total page cache size is approximately Cached + Buffers + SReclaimable. That's what buff/cache in free reports. Breaking it down:

Cached — file data pages. The bulk of the page cache. This is what you think of as "cached files."
Buffers — metadata buffers for block devices (superblocks, directory entries, inode tables). Small but important.
SReclaimable — slab allocator memory that the kernel can free under pressure (dentries and inodes, mostly). "Reclaimable" means the kernel will give it up when needed, just like cache.
Dirty — cache pages that have been written to but not yet flushed to disk. We'll cover this in detail below, because dirty pages are where data loss hides.
Writeback — pages currently being written to disk right now. Usually near zero unless a flush is in progress.

The fields above, at a glance:

Field	Meaning
`MemTotal`	Total usable RAM the kernel knows about
`MemFree`	RAM holding nothing at all — low on a healthy server
`MemAvailable`	RAM applications can claim right now, including evictable cache — the number that matters
`Buffers`	Block-device metadata buffers (superblocks, directory entries, inode tables)
`Cached`	File data pages — the bulk of the page cache
`SReclaimable`	Slab memory the kernel can reclaim under pressure (mostly dentries and inodes)
`Dirty`	Cache pages written but not yet flushed to disk — where data loss hides
`Writeback`	Pages being flushed to disk right now — usually near zero
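
The breakdown above is easy to check for yourself. Here is a minimal Python sketch, assuming the usual "Name: value kB" layout of /proc/meminfo; the function names parse_meminfo and buff_cache_kb are made up for this example, and the sample text reuses the illustrative numbers shown earlier.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field name: value in kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def buff_cache_kb(fields):
    """Approximate free's buff/cache column: Cached + Buffers + SReclaimable."""
    return fields["Cached"] + fields["Buffers"] + fields["SReclaimable"]


# Illustrative sample, same values as the output shown above.
SAMPLE = """\
MemTotal:       32768000 kB
MemFree:         1126400 kB
MemAvailable:   27340800 kB
Buffers:          204800 kB
Cached:         25395200 kB
Dirty:             12800 kB
Writeback:             0 kB
SReclaimable:     512000 kB
"""

print(buff_cache_kb(parse_meminfo(SAMPLE)))  # 26112000
```

On a live Linux box you would feed it the real file instead of the sample, for example parse_meminfo(open("/proc/meminfo").read()).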

Pro Tip

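The MemAvailable field in /proc/meminfo is an estimate computed by the kernel itself, and it's not just MemFree + Cached. The kernel subtracts the low watermarks it keeps in reserve, assumes that some page cache has to stay resident for the system to function well, and counts only part of the reclaimable slab, because some of those objects are in use. Trust MemAvailable over any manual arithmetic you could do with the other fields.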

vmstat — Watching the Cache in Motion

vmstat shows you the page cache changing over time:

vmstat 1 5

The cache column in the memory section is the page cache size. The bi and bo columns (blocks in, blocks out) tell you how much I/O is actually reaching the disk: bi counts blocks read from block devices, bo counts blocks written to them. The first line of output shows averages since boot, so read from the second line onward. On a well-cached server, bi is low most of the time because reads are being served from cache. A sudden spike in bi means something is reading data that isn't cached: a new workload, a cold restart, or a dataset larger than RAM.
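
If you'd rather collect the same signal from a script, here is a rough sketch. It assumes that the pgpgin and pgpgout counters in /proc/vmstat are cumulative KiB read from and written to block devices, and that vmstat derives bi and bo from them; treat both as assumptions to verify on your kernel. The names parse_vmstat and io_rates are hypothetical.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style "name value" lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def io_rates(before, after, seconds):
    """Return (read_per_s, write_per_s) between two counter snapshots.

    Assumption: pgpgin/pgpgout are cumulative KiB counters, so the
    result is KiB per second actually hitting the block layer.
    """
    read_rate = (after["pgpgin"] - before["pgpgin"]) / seconds
    write_rate = (after["pgpgout"] - before["pgpgout"]) / seconds
    return read_rate, write_rate


# Two made-up snapshots taken 2 seconds apart.
first = parse_vmstat("pgpgin 1000\npgpgout 500\n")
second = parse_vmstat("pgpgin 3000\npgpgout 500\n")
print(io_rates(first, second, 2))  # (1000.0, 0.0)
```

In practice you would read /proc/vmstat twice with a sleep in between. A read rate that stays near zero under load is the page cache doing its job.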

Dirty Pages and the Writeback Danger Zone

So far we've talked about the page cache as a read optimisation. But it also handles writes, and that's where the story gets sharper.

When a process writes to a file, the kernel doesn't send the data to disk immediately. It writes to the page cache and marks the page as dirty — meaning it contains data that exists in RAM but not yet on disk. The kernel will flush those dirty pages to disk later, in the background, batched for efficiency. This is called writeback, and it makes writes dramatically faster: the application returns from write() in microseconds instead of waiting milliseconds for the disk to confirm.

The danger should be obvious. Between the write() call and the background flush, the data exists only in RAM. If the power goes out — or the kernel panics, or someone trips over the wrong cable — those dirty pages are gone. The application thinks the data is saved; the disk has never seen it. This is not a theoretical risk. It happens, and it's the reason databases care so deeply about fsync().

The Knobs: vm.dirty_ratio and vm.dirty_background_ratio

Two kernel parameters control how much dirty data can accumulate before the system starts flushing:

sysctl vm.dirty_background_ratio vm.dirty_ratio

vm.dirty_background_ratio = 10
vm.dirty_ratio = 20

vm.dirty_background_ratio (default: 10): when dirty pages exceed this percentage of available memory (free pages plus reclaimable pages, which on a mostly-cached server is close to total RAM), the kernel starts flushing in the background. Applications aren't slowed down; the kernel just begins writing dirty pages to disk behind the scenes.
vm.dirty_ratio (default: 20): when dirty pages exceed this percentage, the kernel throttles the writing process, making it wait until enough pages have been flushed. This is the hard ceiling. An application that hits this limit will stall on its write() call until the disk catches up.

The two knobs side by side. The "typical default" values below are the common case, but they vary by distribution and kernel version — always confirm with sysctl on the actual server rather than assuming:

Parameter	Meaning	Typical default
`vm.dirty_background_ratio`	Kernel starts flushing dirty pages in the background once they exceed this % of available memory	`10`
`vm.dirty_ratio`	Writing process is blocked/throttled until a flush completes once dirty pages exceed this % of available memory	`20`

On a server with 32 GB of RAM, the defaults mean up to roughly 6.4 GB of dirty data (20% of 32 GB) can accumulate before any process is forced to wait. The real ceiling is somewhat lower, because the kernel applies the percentage to available memory rather than to every byte of RAM. That's still several gigabytes of data that would be lost in a power failure. For a general-purpose server this is fine: the background flusher usually keeps the real dirty count well below the limit. For a database server, it's terrifying, which is why databases use fsync().
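
The arithmetic is worth having at your fingertips. Here is a small sketch that treats each ratio as a share of total RAM, which gives an upper bound since the kernel works from available memory. The function name dirty_thresholds is made up for this example, and the default ratios are the typical ones quoted above, not a guarantee for your system.

```python
def dirty_thresholds(mem_bytes, background_ratio=10, ratio=20):
    """Return (background_flush_bytes, hard_limit_bytes) as upper bounds.

    Simplification: percentages are applied to total RAM. The kernel
    applies them to available memory, so real limits are a bit lower.
    """
    background = mem_bytes * background_ratio // 100
    hard = mem_bytes * ratio // 100
    return background, hard


GB = 10**9
background, hard = dirty_thresholds(32 * GB)
print(background / GB, hard / GB)  # 3.2 6.4
```

Plug in the values from sysctl vm.dirty_background_ratio vm.dirty_ratio on the actual server. With 64 GB and a ratio of 80 the same function returns a hard limit of 51.2 GB, which is the crash window a tuning like that would create.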

Warning

Raising vm.dirty_ratio to extreme values (50%, 80%) is a common "performance tuning" suggestion that trades safety for throughput. It works — writes feel faster because more data is buffered — but the crash window grows proportionally. On a server with 64 GB of RAM and dirty_ratio=80, you could lose 51 GB of un-flushed data in a power failure. The writeback storm when the kernel finally flushes will also saturate the disk and freeze every I/O-bound process for seconds or minutes. Leave the defaults alone unless you have a specific, measured reason to change them.

fsync — The Database's Best Friend

fsync() is the system call that means "flush this file's dirty pages to disk right now, and don't return until the disk confirms." It's the application's escape hatch from writeback buffering. When PostgreSQL commits a transaction, it calls fsync() on the WAL (write-ahead log). When MySQL/InnoDB commits, it calls fsync() on the redo log. When Redis does an AOF fsync, same story. The call is expensive — it waits for the physical disk — but it's the only way to guarantee that committed data survives a crash.
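
To make the difference between "written" and "durable" concrete, here is a minimal Python sketch of an append that only returns after fsync() has completed. The function name durable_append is hypothetical, and this is a simplified illustration, not how any particular database implements its log.

```python
import os


def durable_append(path, data):
    """Append bytes to path and return only after fsync() has completed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            # write() only copies into the page cache; it may also be partial.
            written = os.write(fd, view)
            view = view[written:]
        # Block until the kernel reports the dirty pages have reached storage.
        os.fsync(fd)
    finally:
        os.close(fd)
```

Without the os.fsync() line the function would return just as happily, and the data would sit in dirty pages until the background flusher got to it. One caveat: when the file is newly created, a careful application also calls fsync() on the containing directory so that the directory entry itself survives a crash.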

Disabling fsync() (or mounting a filesystem with nobarrier) is playing with fire. The database thinks its data is durable; the disk hasn't seen it yet. A crash in that window corrupts your data. This is not hypothetical: every experienced DBA has seen a database destroyed by a crash on a system that promised "it never goes down." Use fsync(). Accept the latency. That's the price of data you can trust.

O_DIRECT — When Applications Skip the Cache

Some applications bypass the page cache entirely by opening files with the O_DIRECT flag. This means reads and writes go straight to the disk, not through the kernel's cache. The application takes full responsibility for its own buffering.

Why would anything do this? Because some applications know their access pattern better than the kernel does, and maintaining a second copy of the data in the page cache would waste RAM. MySQL/InnoDB is the canonical example: it maintains its own buffer pool, a carefully managed in-process cache of database pages, sized by the DBA, with its own eviction policy tuned for database access patterns (LRU lists, adaptive hash indexes, change buffering). If InnoDB let the kernel cache the same pages in the page cache too, every database page could exist twice in RAM: once in InnoDB's buffer pool and once in the page cache. In the worst case, half the memory spent on database pages would be holding duplicates, on a server whose most precious resource is RAM. So when it's configured with innodb_flush_method=O_DIRECT, InnoDB opens its data files with O_DIRECT and tells the kernel: "I'll handle caching. You handle the I/O." Whether that is the default depends on the MySQL version, so check the setting rather than assuming.

PostgreSQL, interestingly, takes a different approach by default: it uses the page cache and relies on the kernel to manage buffering. It has a smaller shared_buffers (typically 25% of RAM) and lets the page cache handle the rest. This works well on Linux because the page cache is good at what it does, but it means PostgreSQL servers show very high buff/cache numbers in free — which is expected and healthy, not a sign of trouble.

The practical takeaway: if you're running a database server and free shows that most of your RAM is in buff/cache, check whether the database uses O_DIRECT. If it does (MySQL/InnoDB with innodb_flush_method=O_DIRECT), that cache is other files — logs, temp tables, OS files. If it doesn't (PostgreSQL default), that cache is your database data and it's working as designed.
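
For a feel of what "taking responsibility for your own buffering" means, here is a rough Python sketch of a single O_DIRECT read. It assumes Linux, a filesystem that supports O_DIRECT (some, such as tmpfs on older kernels, reject it), and a 4096-byte alignment requirement; real code should query the device's actual block size. The name read_direct and the BLOCK constant are made up for this example.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; the real requirement depends on the device


def read_direct(path, nbytes=BLOCK):
    """Read nbytes from the start of path, bypassing the page cache."""
    if nbytes <= 0 or nbytes % BLOCK:
        raise ValueError("O_DIRECT reads must be a positive multiple of BLOCK")
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        # An anonymous mmap is page-aligned, which satisfies O_DIRECT's
        # requirement that the user buffer be aligned in memory.
        buf = mmap.mmap(-1, nbytes)
        try:
            got = os.readv(fd, [buf])
            return buf[:got]
        finally:
            buf.close()
    finally:
        os.close(fd)
```

The aligned buffer, the size restriction, and the lack of any read-ahead are exactly the chores the page cache normally does for you. That's why O_DIRECT only makes sense for software that brings its own cache.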

The drop_caches Myth

Every few months, a forum post or a "Linux tuning guide" recommends running this:

echo 3 > /proc/sys/vm/drop_caches

This tells the kernel to evict all clean cache pages, all dentries, and all inodes from memory. The page cache goes to nearly zero. free will show a huge jump in free RAM. And then, over the next few minutes, the disk will grind as every application re-reads everything it needs, the cache fills back up, and you're right where you started — except you just inflicted a cold-start penalty on a warm server for no reason.

drop_caches is not a fix for anything. It doesn't free memory that applications need (the kernel already evicts cache automatically when applications need RAM). It doesn't solve memory leaks (leaked memory isn't in the cache). It doesn't improve performance (it destroys the very cache that was providing performance). The only legitimate use is benchmarking — when you specifically need to measure cold-cache I/O performance and want to start from a known empty state. In production, running drop_caches is the equivalent of emptying a library's card catalogue and telling everyone to go find their books again.

If you find yourself reaching for drop_caches because you think the server is "out of memory," read the available column in free -h first. If available is healthy (more than 10-15% of total RAM as a rough guide), you're fine. If available is genuinely low, the problem is real application memory pressure — a memory leak, an under-provisioned server, or a workload that outgrew its box — and dropping the cache won't fix any of those.
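
That check is simple enough to automate. Here is a minimal sketch, assuming /proc/meminfo-style input and using the 10% rough guide from above as the default threshold. The function names available_fraction and is_low_on_memory are hypothetical, and the right threshold for your workload is a judgment call.

```python
def available_fraction(meminfo_text):
    """Return MemAvailable / MemTotal from /proc/meminfo-style text."""
    total = available = None
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            total = int(line.split()[1])
        elif line.startswith("MemAvailable:"):
            available = int(line.split()[1])
    if not total or available is None:
        raise ValueError("MemTotal or MemAvailable missing")
    return available / total


def is_low_on_memory(meminfo_text, threshold=0.10):
    """True when less than `threshold` of RAM is available to applications."""
    return available_fraction(meminfo_text) < threshold


healthy = "MemTotal: 32768000 kB\nMemFree: 1126400 kB\nMemAvailable: 27340800 kB\n"
print(is_low_on_memory(healthy))  # False
```

Note that MemFree plays no part in the decision. The healthy sample has barely 3% of its RAM free and is nowhere near trouble.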

Gotchas

Things that catch people, roughly in order of how much confusion they cause:

"95% memory used" is normal. On a server that's been up for more than a few hours, seeing 90-95% of RAM as "used" in casual monitoring tools is expected and healthy. The page cache is doing its job. The only number that matters is available. If your monitoring triggers memory full alerts based on total - free, it will cry wolf on every healthy server. Alert on available dropping below a threshold instead.
Reboots are cold. After a reboot, the page cache is empty and every read is a disk I/O. A database server that normally handles 50,000 queries per second might struggle with 5,000 right after a restart. This is not a bug and it's not a sign of hardware trouble. The cache just needs time to warm up. Plan restarts during low-traffic windows, and don't be alarmed by slow performance in the first few minutes.
Large sequential reads can flush the cache. Running a full tar of a multi-gigabyte directory, or a mysqldump of a large database, reads so much data sequentially that it can push your useful cached data out to make room for data that will never be read again. There are mitigations: the kernel keeps pages that have only been touched once on an inactive list and reclaims those first, and an application can call posix_fadvise() to tell the kernel it won't need the data again. Neither is perfect. If you notice performance dipping after a large backup, this is likely why.
NFS and network filesystems cache too. The page cache doesn't only apply to local disks. NFS and CIFS mounts also go through the page cache (unless mounted with direct options), which means your server can be caching file data from a remote storage server. This is usually helpful, but it means buff/cache on an NFS client includes data that lives on someone else's hardware.
Swap and cache are different eviction queues. The kernel prefers to evict cache before swapping application memory, but vm.swappiness controls the balance. If you see swapping while buff/cache is still high, lower vm.swappiness (the default of 60 is often too eager for a server workload — many admins set it to 10 or even 1).
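
For the large-sequential-read gotcha above, this is roughly what a cache-friendly bulk reader looks like. It's a sketch, assuming Linux and Python's os.posix_fadvise wrapper; the function name read_once is made up, and the advice calls are hints the kernel is free to ignore.

```python
import os


def read_once(path, chunk_size=1 << 20):
    """Stream a file sequentially, then hint that its cached pages can go."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Hint: we will read front to back (offset 0, length 0 = whole file).
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)  # a real backup would process the chunk here
        # Hint: we are done with these pages, so they can be evicted first.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```

POSIX_FADV_DONTNEED only lets go of clean pages, so it suits read-only passes like backups. For a very large file you may prefer to issue it periodically on the range already consumed rather than once at the end.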

History and Philosophy

The idea of caching disk data in RAM is as old as virtual memory itself: IBM mainframes in the 1960s did it, and Unix has had a buffer cache since its early versions in the 1970s. But for years many Unix-like systems, early Linux among them, had two separate caches: a buffer cache for block device I/O and a page cache for memory-mapped files. The buffer cache worked in fixed-size blocks (typically 512 bytes or 1 KB), while the page cache worked in pages (4 KB). They didn't share data, which meant the same file block could exist in both caches simultaneously, which wasted RAM and was a source of consistency bugs.

Linux unified them. Starting with the 2.4 kernel (January 2001), the page cache became the single, authoritative cache for all file data. The "buffers" that still appear in free and /proc/meminfo are a thin layer of metadata buffers, not a separate data cache. This unification was a significant engineering achievement: it eliminated the double-caching problem, simplified the writeback path, and made the whole system more memory-efficient. When you see "buff/cache" as a single column in free, that slash is a historical artifact of two things that used to be separate and are now, for all practical purposes, one.

The philosophy behind the page cache is one of the defining ideas of Linux's approach to memory management: unused RAM is wasted RAM. Rather than keeping free RAM in reserve "just in case" or requiring applications to explicitly request caching, Linux fills every byte of free RAM with cached data and evicts it the moment something more important needs it. Other modern operating systems cache file data too, but Linux's tools report the cache prominently, which means that a freshly-booted Linux server "looks" like it's running out of memory within hours of starting up. That has confused a generation of administrators, but the same strategy lets Linux servers deliver extraordinary I/O performance without any tuning at all. The page cache is the silent hero behind every fast grep, every snappy web server, every database query that returns in microseconds instead of milliseconds. You never configure it. You rarely think about it. And it's working for you every second your server is up.