Filesystem: Explanation & Insights

The invisible layer that turns a slab of raw disk into something you can actually name, organise, and trust.

What It Is

A filesystem is the data structure that lives on a block device — a disk partition, an LVM volume, a loop-mounted image — and gives it structure. Without a filesystem, a disk is just a numbered sequence of 512-byte (or 4096-byte) sectors with no notion of files, directories, permissions, or names. The filesystem imposes all of that: it carves the raw space into allocation units, tracks which units belong to which file, records who owns each file and when it was last touched, and — critically — makes sure the whole structure survives a power cut or a kernel panic without turning into garbage. Every time you type ls, cat, cp, or rm, you're talking to a filesystem. Every time you mount a partition, you're attaching a filesystem to the directory tree. It is, quite literally, the foundation everything else sits on.

If you're coming from application development, you've probably never thought about this layer — it just works. But the moment you run a server, the choice of filesystem starts to matter. Different filesystems have different strengths, and picking the right one is a decision you make once — when you format the partition — and live with for years. This page will give you enough depth to make that decision well.

A filesystem's job breaks down into four responsibilities: naming (mapping paths like /var/log/syslog to physical locations), metadata (ownership, permissions, timestamps — stored in inodes), allocation (deciding where data lives on disk and tracking free space), and integrity (surviving crashes without corruption, usually through journaling).

Why It Matters

The filesystem is the single most consequential choice you make when formatting a disk, because changing it later means backing up everything, reformatting, and restoring (or, more realistically, building a new server). Get it wrong and you'll live with the consequences for years: a filesystem that fragments under your workload, or one that cannot detect silent data corruption because it lacks checksums, or one that runs out of inodes before it runs out of blocks, so you hit "disk full" with gigabytes of space remaining.

The good news is that for the vast majority of Linux servers, the right answer is simple: ext4. It's the default on Debian, Ubuntu, and many other distributions for a reason. But understanding why it's so widely chosen, and when to deviate, requires knowing what the alternatives offer and what they cost. That's what the rest of this page is for.

How a Filesystem Works

Blocks and Allocation

The filesystem divides the block device into fixed-size blocks — typically 4 KB on Linux, matching the kernel's memory page size. Every file occupies one or more blocks. A 1-byte file still costs you one full 4 KB block (this is why millions of tiny files waste more space than you'd expect). An allocation bitmap or B-tree tracks which blocks are free and which are taken.
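
The rounding described above can be sketched as a few lines of shell arithmetic. The helper names blocks_needed and bytes_on_disk are hypothetical, and the 4096-byte default is an assumption taken from the paragraph above; some filesystems can store very small files inline in metadata, which this sketch ignores.

```shell
# Number of blocks a file of SIZE bytes occupies, rounding up to whole blocks.
# blocks_needed is a hypothetical helper, not a standard tool.
blocks_needed() {
    size=$1
    block_size=${2:-4096}
    echo $(( (size + block_size - 1) / block_size ))
}

# Bytes actually consumed on disk for the same file.
bytes_on_disk() {
    block_size=${2:-4096}
    echo $(( $(blocks_needed "$1" "$block_size") * block_size ))
}

bytes_on_disk 1        # a 1-byte file still costs 4096 bytes
bytes_on_disk 4097     # one byte past a block boundary costs 8192 bytes
```

Multiply the gap between the two numbers by a few million files and the "wasted" space becomes easy to see.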

Modern filesystems use extents rather than tracking individual blocks: an extent says "this file occupies blocks 1000 through 1999" in a single record, rather than listing all 1000 blocks individually. ext4, XFS, and btrfs all use extents. This dramatically reduces metadata overhead for large files and cuts fragmentation because the allocator tries to lay down data in contiguous runs.
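
To see why extents shrink the bookkeeping, here is a minimal sketch that collapses a sorted list of block numbers into "start length" records. The helper blocks_to_extents is hypothetical, and real extent records carry more fields than these two.

```shell
# Collapse a sorted list of block numbers (one per line) into extents.
# Each output line is "start length". Illustration only.
blocks_to_extents() {
    awk '
        NR == 1        { start = $1; len = 1; prev = $1; next }
        $1 == prev + 1 { len++; prev = $1; next }
                       { print start, len; start = $1; len = 1; prev = $1 }
        END            { if (NR > 0) print start, len }
    '
}

# 1000 contiguous blocks plus one stray block: two records instead of 1001.
{ seq 1000 1999; echo 5000; } | blocks_to_extents
# prints:
#   1000 1000
#   5000 1
```

On a real system, filefrag -v on a file shows the extents the filesystem actually assigned to it.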

Inodes — The File's Identity Card

Every file and directory has an inode — a small metadata record that stores the file's size, ownership (uid/gid), permissions, timestamps (access, modify, change), and pointers to the data blocks. The inode does not store the filename — that belongs to the directory entry, which is just a mapping from a name to an inode number. This is why hard links work: two names pointing at the same inode.
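
The name-versus-inode split is easy to observe from a shell. This sketch works in a throwaway temporary directory; the file names are arbitrary.

```shell
# Two directory entries, one inode.
dir=$(mktemp -d)
echo "hello" > "$dir/original"
ln "$dir/original" "$dir/alias"          # hard link: a second name for the same inode

# ls -i prints the inode number before the name.
inode_a=$(ls -i "$dir/original" | awk '{print $1}')
inode_b=$(ls -i "$dir/alias"    | awk '{print $1}')
echo "original: $inode_a  alias: $inode_b"

# Removing one name leaves the data reachable through the other.
rm "$dir/original"
content=$(cat "$dir/alias")
echo "$content"
rm -rf "$dir"
```

Both names report the same inode number, and the data survives until the last name is removed.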

On ext4, the total number of inodes is fixed at format time (one per 16 KB of space by default). Run out of inodes and you can't create new files even with plenty of free blocks: a classic "disk full" surprise. XFS allocates inodes dynamically as they are needed, so this problem rarely arises there. Check your inode usage with df -i.
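
A rough estimate of how many inodes a freshly formatted ext4 volume gets follows directly from that ratio. The helper estimate_inodes is hypothetical, and it ignores the rounding mke2fs applies per block group, so treat the result as a ballpark figure.

```shell
# Approximate inode count: one inode per BYTES_PER_INODE bytes of volume.
# 16384 is the default ratio described above.
estimate_inodes() {
    size_gib=$1
    bytes_per_inode=${2:-16384}
    echo $(( size_gib * 1024 * 1024 * 1024 / bytes_per_inode ))
}

estimate_inodes 100          # 100 GiB at the default ratio -> 6553600
estimate_inodes 100 4096     # same volume, tuned for small files -> 26214400
```

The ratio itself is chosen at format time with mkfs.ext4 -i <bytes-per-inode>; it cannot be changed afterwards without reformatting.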

Journaling — Crash Insurance

This is the feature that separates a modern filesystem from its ancestors, and it's worth understanding properly.

When you write a file, the filesystem has to update multiple structures: the data blocks, the inode metadata, the directory entry, the allocation bitmap. If power dies halfway through those updates, some are committed and some aren't — and the on-disk structure is now internally inconsistent. In the old days (ext2, FAT), recovering from this required a full scan of the entire disk: fsck would walk every inode, every block, every directory, checking for contradictions. On a large disk, this took hours. Your server sat there, not booting, running fsck while you watched.

Journaling fixes this with a simple idea borrowed from database transaction logs: before making any change to the main filesystem structures, write a description of the intended change to a small, dedicated area called the journal (or log). If the system crashes mid-write, the recovery code just replays the journal — applying any completed transactions and discarding any incomplete ones. Recovery takes seconds, regardless of disk size. This is why ext4 boots cleanly after a crash and ext2 doesn't.
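
The replay rule (apply completed transactions, discard incomplete ones) can be sketched with a toy journal. The BEGIN/SET/COMMIT line format and the replay_journal helper are invented for this illustration and bear no relation to any real on-disk journal layout.

```shell
# Toy journal replay: only transactions that reached COMMIT are applied.
replay_journal() {
    awk '
        $1 == "BEGIN"  { txn = $2; n = 0; next }
        $1 == "SET"    { key[n] = $2; val[n] = $3; n++; next }
        $1 == "COMMIT" && $2 == txn {
            # Transaction is complete: apply its buffered updates.
            for (i = 0; i < n; i++) state[key[i]] = val[i]
            n = 0
            next
        }
        END { for (k in state) print k "=" state[k] }
    ' "$1" | sort
}

journal=$(mktemp)
cat > "$journal" <<'EOF'
BEGIN 1
SET inode42.size 4096
SET bitmap.block1000 used
COMMIT 1
BEGIN 2
SET inode42.size 8192
EOF

# Transaction 2 never committed (the "crash"), so its update is discarded.
recovered=$(replay_journal "$journal")
echo "$recovered"
rm -f "$journal"
```

The recovered state contains only the two updates from transaction 1. Note that replay cost depends on the journal's length, not on the size of the disk.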

Journaled filesystems differ in how much they send through the journal. ext4 lets you choose with the data= mount option:

Metadata-only journaling (the default on ext4, as data=ordered): only the structural changes (inodes, directories, bitmaps) are journaled. File data is written directly, and in ordered mode it is flushed to disk before the metadata that refers to it is committed. A crash might lose the last few seconds of data writes, but the filesystem itself stays consistent.
Full data journaling (data=journal): both metadata and data go through the journal. Safer, but every byte is written twice (once to the journal, once to its final location), which can roughly halve write throughput. Rarely worth the cost on a server unless you absolutely cannot tolerate any data loss.

XFS journals metadata only. btrfs and ZFS take a different approach entirely: they use copy-on-write (CoW) semantics, which sidestep the problem. Data is always written to a new location, and the old version remains intact until the new write is fully committed. No traditional journal is needed because the old state is never overwritten in place.
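
The same "never overwrite the old state" idea has a familiar application-level analogue: write the new version to a temporary file, then rename it over the old one. This is an analogy rather than a description of how any CoW filesystem is implemented, and atomic_write is a hypothetical helper.

```shell
# Replace a file's contents without ever exposing a half-written version.
# Reads the new contents from stdin.
atomic_write() {
    target=$1
    tmp=$(mktemp "$target.XXXXXX") || return 1   # temp file in the same directory
    if cat > "$tmp"; then
        mv -f "$tmp" "$target"                   # rename: readers see old or new, never a mix
    else
        rm -f "$tmp"
        return 1
    fi
}

workdir=$(mktemp -d)
echo "version 1" > "$workdir/config"
echo "version 2" | atomic_write "$workdir/config"
result=$(cat "$workdir/config")
echo "$result"
rm -rf "$workdir"
```

A production version would also need to sync the data before the rename and restore the original file's ownership and permissions; both are left out here to keep the sketch short.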

The Linux Filesystem Zoo

There are dozens of filesystems that Linux can mount, but on a server you'll realistically encounter only a handful. Here they are, roughly in order of how often you should reach for them.

ext4 — The Safe Default

ext4 (fourth extended filesystem) is the workhorse of Linux. It evolved from ext3, which evolved from ext2, which dates back to 1993: over thirty years of battle-testing on every workload imaginable. It's journaled, it uses extents, it supports files up to 16 TB and volumes up to 1 EB (exabyte), and it's the default on Debian, Ubuntu, and many other distributions.

What makes ext4 the right default is not that it's the fastest or the most feature-rich — it's that it's the most predictable. Performance is steady, not spiky. Recovery after a crash is quick and reliable. The fsck tooling is the most mature in the Linux ecosystem. Every sysadmin knows it. Every tool supports it. Every edge case has been hit and fixed over three decades. If you have no specific reason to choose something else, choose ext4. You'll sleep better.

Format with mkfs.ext4, inspect with tune2fs -l, check free space with df.

XFS — The Large-File Workhorse

XFS was developed at Silicon Graphics starting in 1993 and first shipped with IRIX in 1994, designed from the ground up for large files and parallel I/O. It was ported to Linux in 2001 and became the default on RHEL 7+ and its rebuilds (CentOS, Rocky, AlmaLinux).

Where XFS shines: large sequential writes (backups, media, logs), high-concurrency workloads (multiple threads writing to different files simultaneously), and very large volumes and files (ext4 caps individual files at 16 TB). Its allocation-group architecture splits the block device into independent regions, each with its own free-space index, so multiple CPUs can allocate in parallel without contention.

Where XFS is less ideal: workloads dominated by small-file random I/O (it historically lags ext4 slightly here, though modern versions have closed the gap). And one practical gotcha: XFS cannot be shrunk — you can grow it online with xfs_growfs, but if you need to make it smaller, you're backing up, reformatting, and restoring. Plan your partition sizes accordingly.

Format with mkfs.xfs, inspect with xfs_info, repair with xfs_repair.

btrfs — Snapshots and Checksums, with Caveats

btrfs (B-tree filesystem, sometimes pronounced "butter-FS") is Linux's answer to ZFS — a next-generation copy-on-write filesystem with built-in volume management, snapshots, checksumming, compression, and send/receive replication. It's an impressive feature list, and it's the default on openSUSE and Fedora Workstation.

The copy-on-write (CoW) design is both btrfs's superpower and its Achilles' heel. CoW means that modifying a block doesn't overwrite it in place — instead, a new block is written and the metadata is updated to point to it. This makes snapshots nearly free (just keep the old pointers) and enables checksumming of every block (detect silent corruption). But it also means that workloads with heavy random writes — databases in particular — cause massive fragmentation and write amplification. Every small random write becomes a metadata update cascade. Performance degrades over time, and df output becomes confusing because CoW accounting doesn't work like traditional allocation.

Warning

Avoid running a database (MySQL, PostgreSQL, or any workload doing heavy random writes) on btrfs with default settings. The CoW overhead turns random writes into scattered allocations, badly fragments the data files, and drives up I/O latency. Use ext4 or XFS for databases, or at minimum disable CoW on the data directory with chattr +C before any data files are created there: the flag only takes effect for new or empty files, and files marked this way also lose btrfs checksumming.

btrfs is a fine choice for root filesystems, home directories, and workloads where snapshots are valuable (rolling back a bad upgrade, for instance). But treat it as a specialist tool, not a general-purpose default. Format with mkfs.btrfs, manage subvolumes with the btrfs command.
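
The chattr +C escape hatch from the warning above is order-sensitive: the attribute has to be on the directory before the data files exist. A minimal sketch, printing the commands by default rather than running them; the run and prepare_nocow_dir helpers and the example path are hypothetical.

```shell
# Prints each command by default; set DRY_RUN=0 to execute (as root, on btrfs).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Prepare an empty directory whose future files skip copy-on-write.
prepare_nocow_dir() {
    run mkdir -p "$1"
    run chattr +C "$1"      # files created inside afterwards inherit the attribute
    run lsattr -d "$1"      # a "C" in the flags column confirms it
}

prepare_nocow_dir /var/lib/mysql
```

Files that already contain data are not converted by setting the attribute, so an existing database would have to be copied into the prepared directory.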

ZFS — The Gold Standard, If You Can Afford It

ZFS does everything btrfs does and more: block-level checksumming with automatic repair (self-healing, if you have redundancy), snapshots that are trivially cheap, native RAID (called RAIDZ, in levels 1/2/3 matching RAID 5/6 and beyond), send/receive for incremental replication, built-in compression, deduplication, and a unified volume manager that eliminates the need for LVM entirely.

The catch is that ZFS is not part of the mainline Linux kernel. It ships as a separate module (OpenZFS) due to CDDL/GPL licensing incompatibility, so kernel upgrades occasionally break it until the module is rebuilt for the new kernel. It also wants RAM: the ARC (Adaptive Replacement Cache) defaults to claiming up to half your system memory, and with less than about 8 GB available for the ARC, ZFS starts to feel cramped.

Use ZFS when data integrity is non-negotiable (file servers, backup targets, NAS appliances), when you want snapshots and replication without bolting on extra tools, and when you have the RAM budget. Don't use it on a 2 GB VPS or a system where kernel upgrades must never cause surprises.

ext2 / ext3 — Legacy, Use ext4 Instead

ext2 is the journal-less ancestor of ext4; ext3 added journaling but still lacks extents. There is no reason to create either on a new server, because the ext4 driver mounts both. If you inherit a server running ext3, you can upgrade in place: take a backup, unmount the filesystem, run tune2fs -O extents,uninit_bg,dir_index /dev/sdX1 followed by e2fsck -fD on the same device, and you have ext4. Files that already exist keep their old block maps; only files written afterwards use extents. The one place you still see ext2: /boot partitions on older GRUB setups. Modern GRUB reads ext4 just fine.

FAT / vFAT — Not for Server Data

FAT (File Allocation Table) is the filesystem of USB sticks, SD cards, and EFI System Partitions. No permissions, no journaling, no hard links, 4 GB maximum file size (FAT32). You'll see it on your server exactly once: the 512 MB EFI partition. Leave it there and format everything else with a real filesystem.

At a Glance

Filesystem	Journaling	Max File	Max Volume	CoW	Checksums	Shrink	Best For
ext4	Yes	16 TB	1 EB	No	Metadata only (metadata_csum)	Yes (offline)	General purpose, databases
XFS	Yes	8 EB	8 EB	No	Metadata only (CRC, v5 format)	No	Large files, parallel I/O
btrfs	CoW (no journal needed)	16 EB	16 EB	Yes	Yes (data + metadata)	Yes (online)	Snapshots, root volumes
ZFS	CoW (no journal needed)	16 EB	256 ZB	Yes	Yes (data + metadata)	No	Data integrity, NAS, backups
ext2	No	2 TB	32 TB	No	No	Yes (offline)	Legacy only
FAT32	No	4 GB	2 TB	No	No	No	EFI partition, USB sticks

Choosing a Filesystem

There's no universal answer, but this checklist covers the large majority of real-world servers:

General-purpose server, database, web app: ext4. Don't overthink it.
Large storage volume (>16 TB), media server, log aggregator: XFS.
You want snapshots for rollback and can avoid database workloads: btrfs.
Data integrity is paramount, you have 16+ GB RAM, you're comfortable with out-of-tree modules: ZFS.
Boot disk on a legacy BIOS system: ext4.
EFI System Partition: FAT32 (you don't get a choice — the UEFI spec mandates it).

If you're unsure, ext4. You can always add ZFS or btrfs pools on separate partitions later without touching your root filesystem.
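
For provisioning scripts, the checklist above can be encoded as a simple lookup. The workload keywords below are invented for this sketch, and pick_fs is a hypothetical helper; adjust both to your own vocabulary.

```shell
# Map a workload keyword to the filesystem suggested by the checklist.
pick_fs() {
    case $1 in
        general|database|webapp)   echo ext4 ;;
        large-storage|media|logs)  echo xfs ;;
        snapshots)                 echo btrfs ;;
        integrity)                 echo zfs ;;
        legacy-boot)               echo ext4 ;;
        esp)                       echo vfat ;;
        *)                         echo ext4 ;;   # unsure: fall back to ext4
    esac
}

for workload in database media snapshots esp unknown; do
    printf '%-10s -> %s\n' "$workload" "$(pick_fs "$workload")"
done
```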

How to Inspect It

Four commands tell you everything about the filesystems on a running server. You'll use the first one daily and the others when something feels off.

df — Space at a Glance

df (disk free) shows every mounted filesystem, its total size, used space, available space, and mount point:

df -hT

Filesystem     Type   Size  Used Avail Use% Mounted on
/dev/sda2      ext4   234G   89G  133G  41% /
/dev/sdb1      xfs    1.8T  1.2T  617G  67% /data
tmpfs          tmpfs  7.8G     0  7.8G   0% /dev/shm
/dev/sda1      vfat   512M  5.3M  507M   2% /boot/efi

The -T flag adds the filesystem type — invaluable when you inherit a server and don't know what's what. The -h flag makes the numbers human-readable. And don't forget df -i to check inode usage — a full inode table looks exactly like a full disk from the application's perspective.
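
Because inode exhaustion is easy to miss, it can be worth scripting the check. The sketch below filters df -i style output for mount points above a threshold; inode_alert is a hypothetical helper, the sample figures are made up, and mount points containing spaces are not handled.

```shell
# Print mount points whose inode usage is at or above LIMIT percent.
# Reads df -i style output on stdin.
inode_alert() {
    awk -v limit="$1" '
        NR > 1 {
            use = $5
            sub(/%$/, "", use)            # "98%" -> "98"; "-" stays non-numeric
            if (use ~ /^[0-9]+$/ && use + 0 >= limit) print $6, use "%"
        }
    '
}

# Typical use on a live system:
#   df -Pi | inode_alert 90
# Demonstration with canned output:
sample='Filesystem      Inodes   IUsed    IFree IUse% Mounted on
/dev/sda2     15269888 1402311 13867577   10% /
/dev/sdb1      6553600 6422528   131072   98% /var/mail
tmpfs          2038404       1  2038403    1% /dev/shm'
printf '%s\n' "$sample" | inode_alert 90
# prints: /var/mail 98%
```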

mount / findmnt — What's Mounted Where

mount without arguments lists every mounted filesystem and its options. The output is dense; findmnt is the modern, structured alternative:

findmnt -t ext4,xfs,btrfs,zfs

TARGET   SOURCE    FSTYPE OPTIONS
/        /dev/sda2 ext4   rw,relatime,errors=remount-ro
/data    /dev/sdb1 xfs    rw,relatime,attr2,inode64,logbufs=8

This filters to real on-disk filesystems (skipping tmpfs, proc, sysfs, and the dozens of virtual mounts that clutter raw mount output). The OPTIONS column tells you whether the filesystem is read-write or read-only, and which mount flags are active.
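
One practical use of the options column: an ext4 filesystem mounted with errors=remount-ro switches itself to read-only after detecting an error, so an unexpected ro is worth catching early. A minimal sketch that scans /proc/mounts style input; ro_mounts is a hypothetical helper and the sample lines are made up.

```shell
# List mount points (and types) that are currently read-only.
# Input fields: device, mount point, type, comma-separated options, ...
ro_mounts() {
    awk '{
        n = split($4, opt, ",")
        for (i = 1; i <= n; i++)
            if (opt[i] == "ro") { print $2, $3; break }
    }'
}

# Typical use on a live system:
#   ro_mounts < /proc/mounts
# Demonstration with canned input:
sample='/dev/sda2 / ext4 ro,relatime,errors=remount-ro 0 0
/dev/sdb1 /data xfs rw,relatime,attr2,inode64 0 0'
printf '%s\n' "$sample" | ro_mounts
# prints: / ext4
```

Matching whole options after splitting on commas avoids false hits on options that merely contain the letters "ro", such as errors=remount-ro.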

lsblk — The Block Device Map

lsblk shows the relationship between disks, partitions, RAID arrays, LVM volumes, and their filesystems in a tree:

lsblk -f

NAME   FSTYPE FSVER LABEL UUID                                 MOUNTPOINTS
sda
├─sda1 vfat   FAT32       ABCD-1234                            /boot/efi
├─sda2 ext4   1.0         a1b2c3d4-e5f6-7890-abcd-ef1234567890 /
└─sda3 swap   1           12345678-abcd-ef01-2345-6789abcdef01 [SWAP]
sdb
└─sdb1 xfs                98765432-dcba-10fe-5432-1098fedcba76 /data

This is the command you reach for when you need to understand the physical layout: which disk holds which partition, what filesystem is on each, and where it's mounted.

blkid — Quick Type and UUID Lookup

blkid prints the filesystem type and UUID of every block device. Useful when editing /etc/fstab — always reference filesystems by UUID, never by /dev/sdX, because device names can reorder between reboots.
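
Turning a UUID into an fstab entry is mechanical enough to script. In this sketch fstab_line is a hypothetical helper and the UUID is a placeholder; it only prints the line, leaving the decision to append it to /etc/fstab with you.

```shell
# Build an fstab entry from a UUID.
# Fields: source, mount point, type, options, dump, fsck pass.
fstab_line() {
    uuid=$1 mountpoint=$2 fstype=$3 options=${4:-defaults}
    # Pass 1 is for the root filesystem, 2 for everything else.
    if [ "$mountpoint" = "/" ]; then pass=1; else pass=2; fi
    printf 'UUID=%s %s %s %s 0 %s\n' "$uuid" "$mountpoint" "$fstype" "$options" "$pass"
}

# On a live system the UUID would come from:
#   blkid -s UUID -o value /dev/sdb1
fstab_line 98765432-dcba-10fe-5432-1098fedcba76 /data ext4
# prints: UUID=98765432-dcba-10fe-5432-1098fedcba76 /data ext4 defaults 0 2
```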

Cheat Sheet

# --- Identify what you have ---
df -hT                                          # space usage with filesystem types
df -i                                           # inode usage (catches the "full but not full" surprise)
findmnt -t ext4,xfs,btrfs,zfs                  # real filesystems, clean output
lsblk -f                                       # block device → filesystem → mountpoint tree
blkid                                           # UUIDs and types for all devices

# --- Create filesystems ---
mkfs.ext4 /dev/sdb1                             # ext4 — the safe default
mkfs.xfs /dev/sdb1                              # XFS — large volumes, parallel I/O
mkfs.btrfs /dev/sdb1                            # btrfs — snapshots and checksums
mkfs.vfat -F 32 /dev/sda1                       # FAT32 — only for EFI partitions

# --- Mount and persist ---
mount /dev/sdb1 /data                           # mount now
echo 'UUID=<uuid> /data ext4 defaults 0 2' >> /etc/fstab   # persist across reboots
mount -a                                        # test fstab without rebooting

# --- Inspect and tune ---
tune2fs -l /dev/sda2                            # ext4 superblock details (inode count, block size, features)
xfs_info /dev/sdb1                              # XFS geometry and features
btrfs filesystem show                           # btrfs pool overview
dumpe2fs -h /dev/sda2 | grep -i "block count"   # superblock only (-h); block totals for ext4

# --- Check and repair (unmounted!) ---
fsck -f /dev/sda2                               # ext4 — forced full check
xfs_repair /dev/sdb1                            # XFS — repair (always unmount first)
btrfs check /dev/sdc1                           # btrfs — read-only check (--repair is risky)

# --- Resize ---
resize2fs /dev/sda2                             # ext4 — grow to fill partition (online)
resize2fs /dev/sda2 100G                        # ext4 — shrink to 100G (offline only!)
xfs_growfs /data                                # XFS — grow (online, takes mountpoint not device)
# XFS cannot shrink. Plan your partitions.

Pro Tip

Always use UUIDs in /etc/fstab, never /dev/sdX names. If a disk is added, removed, or the BIOS probe order changes, device names reshuffle, and your server boots with the wrong filesystem on the wrong mount point, or doesn't boot at all. UUIDs are stored in the filesystem superblock and stay the same unless you reformat or deliberately change them (for example with tune2fs -U on ext4).

Gotchas

Things that catch people, roughly in order of how puzzling they are the first time:

Inode exhaustion looks like a full disk. df says 40% used, but touch newfile returns "No space left on device." Run df -i — if the inode Use% is 100%, that's your culprit. This happens on mail servers (millions of tiny files) and build caches. Fix: delete files, or recreate the filesystem with more inodes (mkfs.ext4 -N <count>).
Ordinary users hit "disk full" while about 5% of the blocks are still free. ext4 reserves 5% of blocks for root by default, so non-root processes get "No space left on device" once roughly 95% of the blocks are in use. This is a safety margin so root can still log in and clean up. Adjust with tune2fs -m 1 /dev/sdX1 (1% is usually enough on a large data volume, but don't set it to 0, because you'll regret it the one time you need it).
/dev/sda is not the same disk after a reboot. Device names are assigned by probe order. Add a USB drive, change a BIOS setting, swap a SATA cable — and /dev/sda is suddenly your data disk while /dev/sdb is your boot disk. This is why /etc/fstab and every reference to a disk should use UUIDs.
btrfs df output is confusing. Because of CoW, btrfs allocates space in chunks and reports allocated-but-unused space as "used." The standard df output can show the disk at 80% when it's really at 60%, or vice versa. Use btrfs filesystem usage /mountpoint for the real numbers.
You can't shrink XFS. XFS grows online (xfs_growfs) but has no shrink operation at all. If you over-allocate a partition to XFS, your only option is backup, reformat smaller, restore. This is a design decision, not a bug — XFS prioritises large-scale performance over flexibility.
Auto-detection can pick the wrong filesystem type. If a disk was reformatted without wiping its old signatures, mount without -t may find a stale superblock and mount garbage without complaint. Always verify with blkid before mounting an unfamiliar device.
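
To put a number on the reserved-block gotcha above, the arithmetic is a one-liner. reserved_gib is a hypothetical helper using integer GiB arithmetic, so small volumes round down.

```shell
# Space held back for root by the ext4 reserved-block percentage.
reserved_gib() {
    size_gib=$1
    percent=${2:-5}
    echo $(( size_gib * percent / 100 ))
}

reserved_gib 2000       # default 5% of a 2000 GiB volume -> 100
reserved_gib 2000 1     # after tune2fs -m 1 -> 20
```

On a large data volume the default reservation can amount to a sizeable chunk of usable space, which is the usual argument for lowering it.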

History and Philosophy

The Unix filesystem, designed by Ken Thompson and Dennis Ritchie at Bell Labs in the early 1970s, introduced the hierarchical directory tree, the inode as the core abstraction, the separation of name from metadata, and the principle that "everything is a file." Every native Linux filesystem today is a descendant of that design.

The ext lineage is Linux's own story. ext was written by Remy Card in 1992 as the first filesystem designed specifically for Linux, replacing the borrowed MINIX filesystem that Linus Torvalds used to bootstrap the kernel. ext2 (1993) made Linux viable as a server OS. ext3 (2001, by Stephen Tweedie) added journaling — essential as disk sizes grew and nobody wanted to wait an hour for fsck. ext4 (2008) added extents, multi-block allocation, and delayed allocation — and has been the quiet default ever since, proving that boring reliability is its own kind of achievement.

ZFS, designed by Jeff Bonwick and Matt Ahrens at Sun Microsystems and first released in 2005, took the opposite approach: build a new filesystem and volume manager from scratch, with checksumming as a foundational property. Licensed under the CDDL, it could not be merged into the Linux kernel's GPL codebase, which is a large part of why btrfs was started in 2007 at Oracle to bring ZFS-like features to a GPL-compatible filesystem. Nearly twenty years later, ZFS remains the gold standard for data integrity, btrfs remains promising-but-uneven, and ext4 remains the thing that just works.

The philosophical split persists: some filesystems aim to be invisible — do one thing, do it well, get out of the way (ext4, XFS) — while others aim to be a platform that handles volume management, snapshots, and replication in one place (ZFS, btrfs). Neither is wrong. But on a server where your primary job is keeping things running, invisible has a lot going for it.