SSD: Explanation & Insights

Storage without moving parts — faster, quieter, and eventually mortal in a way that is entirely its own.

What It Is

SSD stands for Solid State Drive, and the name tells you what matters most about it: there are no spinning platters, no read/write heads hovering nanometers above a magnetic surface, no motors, no bearings, no moving parts at all. An HDD reads and writes by physically positioning a head over the right track of a rotating disk, an operation that takes milliseconds because metal has to move. An SSD reads and writes by pushing electrons into tiny floating-gate transistors etched into silicon. Nothing moves. Access time drops from milliseconds to microseconds. And the entire personality of the drive (what it's good at, how it fails, what you need to do to take care of it) flows from that single fact.

The silicon inside every SSD is NAND flash memory, the same technology that stores photos on your phone and firmware on your router. NAND is organized in a strict hierarchy: individual memory cells (each storing one or more bits), grouped into pages (typically 4–16 KB — the smallest unit you can read or write), grouped into blocks (typically 256–512 pages, so 1–4 MB — the smallest unit you can erase). That mismatch between the write unit and the erase unit is the single most important thing to understand about flash storage, because it drives everything else: write amplification, wear leveling, TRIM, garbage collection, and ultimately, how your drive wears out. We'll unpack each of these in order.

If you're running Linux servers — which you are, or you wouldn't be here — you almost certainly have SSDs in at least some of them. Cloud VMs are backed by flash. Bare-metal servers boot from flash. Database servers run on flash because the random-read latency is roughly 100x better than a spinning disk. Understanding how flash works is not academic knowledge. It's the difference between a drive that lasts five years and one you burn through in eighteen months because nobody set up TRIM.

Why It Matters

The pitch for SSDs is straightforward: they are faster at everything a server does. Random reads, random writes, sequential throughput, time-to-first-byte — flash wins on every axis, often by an order of magnitude or more. An HDD doing random 4K reads manages about 100 IOPS. A decent SATA SSD does 80,000. An NVMe drive does 500,000 or more. That's not a percentage improvement; it's a different universe. Databases that were I/O-bound on spinning rust become CPU-bound on flash. Build systems finish in minutes instead of hours. A server that used to take 90 seconds to boot does it in 8.

But flash is not a drop-in replacement for spinning disks with all the same rules. It has its own physics, its own failure mode, its own care and feeding. An HDD will happily accept writes to the same sector a trillion times — the magnetic coating doesn't wear. Flash cells wear out. An HDD degrades gradually — bad sectors accumulate over months, giving you plenty of warning. An SSD tends to either work perfectly or stop working entirely, often with much less lead time. An HDD doesn't care whether you TRIM; an SSD without TRIM slows down and wears faster. These differences are not minor footnotes — they change how you buy, deploy, monitor, and replace your storage.

The Erase-Before-Write Problem

This is the part that explains almost everything unusual about flash, and it's worth understanding completely. In an HDD, overwriting is trivial: the head settles over the old sector and writes new data on top of it. Done. In NAND flash, you cannot overwrite a page that already contains data. You have to erase it first, and here's the catch: while you can write at the page level (4–16 KB), you can only erase at the block level (1–4 MB). That's hundreds of pages wiped at once, even if you only needed to change one of them.

So what happens when the filesystem wants to update a single 4K page inside a block that's mostly full of valid data? The drive can't just erase the block — that would destroy the other valid pages. Instead, the flash translation layer (FTL) — a piece of firmware inside the drive's controller — performs a shuffle:

Read the entire block into an internal buffer.
Modify the page(s) that need updating.
Write the whole block to a new, already-erased block.
Mark the old block as ready for garbage collection.

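This shuffle is the conceptual model; real controllers usually defer it. They write the updated page to an already-erased page elsewhere, remap the logical address, mark the old copy stale, and consolidate the surviving valid pages later during garbage collection. Either way, the relocation work is happening constantly, invisibly, inside every SSD you've ever used. It's the controller's full-time job, and it introduces a concept that dominates SSD engineering: write amplification.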
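A rough sketch of what the naive whole-block rewrite in the steps above costs, in Python. The page size, the pages-per-block count, and the function name are illustrative assumptions picked from the ranges quoted earlier, not the geometry of any particular drive.

```python
# Toy model of the whole-block rewrite in the steps above.
# PAGE_KB and PAGES_PER_BLOCK are assumed values, not a real drive's geometry.
PAGE_KB = 16
PAGES_PER_BLOCK = 256


def shuffle_cost(pages_updated: int, valid_pages: int) -> dict:
    """Return host vs. flash kilobytes written when updating pages in one block.

    valid_pages counts every page in the block that still holds live data,
    including the pages being updated.
    """
    if not 0 < pages_updated <= valid_pages <= PAGES_PER_BLOCK:
        raise ValueError("need 0 < pages_updated <= valid_pages <= PAGES_PER_BLOCK")
    host_kb = pages_updated * PAGE_KB
    # The naive shuffle rewrites every valid page into the new block.
    flash_kb = valid_pages * PAGE_KB
    return {"host_kb": host_kb, "flash_kb": flash_kb, "amplification": flash_kb / host_kb}


result = shuffle_cost(pages_updated=1, valid_pages=256)
print(result)  # {'host_kb': 16, 'flash_kb': 4096, 'amplification': 256.0}
```

Updating one page in a completely full block is the worst case for this model; the fewer valid pages a block holds, the cheaper it is to relocate.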

Write Amplification

Write amplification (WA) is the ratio of the data the drive actually writes to the flash to the data the host (your server) asked it to write. If the OS writes 1 GB and the drive internally writes 3 GB to flash to accommodate it, the write amplification factor is 3×.

A WA of 1× is the theoretical minimum — every byte the OS writes hits the flash exactly once. In practice, you'll see WA factors between 1.1× and 10×, depending on the workload and how well the drive is maintained. Write-heavy workloads with small random writes (databases, logging) amplify more than large sequential writes (backups, video). A drive that hasn't been TRIMmed is worse: the controller doesn't know which blocks are free, so it has to do more shuffling to find clean erase blocks.

Why does this matter? Because every write the flash does counts against the drive's lifetime, whether you asked for it or not. High write amplification means your drive wears out faster than the raw host-write numbers suggest. When enterprise SSD vendors quote endurance in TBW (terabytes written), that's host writes — the actual flash writes are higher by the WA factor, and the controller is managing the difference.
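The arithmetic is simple enough to write down. This is a minimal sketch; the function names are made up for illustration, and the 600 TBW rating is just an example figure.

```python
def write_amplification(host_bytes: float, flash_bytes: float) -> float:
    """WA factor = bytes physically written to flash / bytes the host asked to write."""
    if host_bytes <= 0:
        raise ValueError("host_bytes must be positive")
    return flash_bytes / host_bytes


def flash_writes_at_rating(tbw_host: float, wa: float) -> float:
    """Flash-level terabytes written by the time the host-side TBW rating is reached."""
    return tbw_host * wa


GB = 10**9
wa = write_amplification(host_bytes=1 * GB, flash_bytes=3 * GB)
print(wa)                               # 3.0
print(flash_writes_at_rating(600, wa))  # 1800.0
```

At a sustained 3× factor, a drive rated for 600 TBW of host writes has physically written about 1,800 TB to its flash by the time the rating is reached.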

Wear Leveling

Every NAND flash cell can be erased and rewritten a finite number of times. The exact count depends on the cell type (we'll get to that in Cell Types), but the principle is universal: a cell wears out. If the controller wrote to the same physical blocks over and over — because they happen to map to frequently-updated files — those blocks would die early while the rest of the flash sat untouched. The drive would shrink, then fail, long before the total flash was exhausted.

Wear leveling is the controller's strategy for preventing this. It keeps a running count of how many times each block has been erased (the erase count) and deliberately moves data around so that all blocks wear at roughly the same rate. Hot data on heavily-erased blocks gets relocated to fresh blocks; cold data on barely-touched blocks gets pushed out to make room. The goal is a flat histogram: no block dramatically more worn than any other.

There are two flavors. Dynamic wear leveling only redistributes among blocks that are actively being written. It's simple but has a blind spot: blocks holding cold, static data (an OS image that never changes) never enter the rotation, so they stay fresh while everything else wears. Static wear leveling is smarter: it periodically moves even cold data onto the more-worn blocks, freeing the barely-used blocks for hot data and forcing every cell in the drive to share the load. Most modern SSDs do static wear leveling.
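To see why the erase-count histogram flattens, here is a deliberately tiny simulation. It assumes a greedy policy (always reuse the least-erased block), which is a simplification for illustration and not how any specific controller is implemented; real firmware also has to migrate cold data, as described above.

```python
def simulate_erases(num_blocks: int, erases: int, leveled: bool) -> list[int]:
    """Return per-block erase counts after `erases` erase operations."""
    counts = [0] * num_blocks
    for _ in range(erases):
        if leveled:
            # Greedy leveling: always reuse the least-erased block.
            target = counts.index(min(counts))
        else:
            # No leveling: the hot data keeps landing on block 0.
            target = 0
        counts[target] += 1
    return counts


unleveled = simulate_erases(num_blocks=8, erases=800, leveled=False)
leveled = simulate_erases(num_blocks=8, erases=800, leveled=True)
print(unleveled)  # [800, 0, 0, 0, 0, 0, 0, 0]
print(leveled)    # [100, 100, 100, 100, 100, 100, 100, 100]
```

The total work is identical in both runs; only the distribution changes, and the distribution decides when the first block dies.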

You can see the result in smartctl. On a SATA SSD, look for attribute 177 (Wear_Leveling_Count), whose normalized value counts down from 100 (new) to 0 (rated life exhausted); the attribute number and name vary by vendor. On an NVMe drive, the equivalent is Percentage Used in the health log, counting up from 0% toward 100%. Both are the controller's own estimate of how much of the flash's rated life has been consumed.

TRIM

Here is the most actionable thing on this page: if your Linux server has an SSD, make sure TRIM is working. Missing TRIM is one of the most common avoidable reasons an SSD slows down and wears faster than it should.

The problem: when you delete a file, the filesystem marks the blocks as free in its own metadata, but it doesn't tell the drive. The SSD's controller still thinks those blocks contain valid data and will copy them around during garbage collection. That is write amplification inflated by stale bookkeeping: write cycles spent preserving data nobody wants.

TRIM (the ATA command; the NVMe equivalent is called Deallocate) is the filesystem telling the drive: "these logical blocks are no longer in use; do what you want with them." The controller can then erase those physical blocks proactively, keeping a pool of clean blocks ready for future writes. The result is lower write amplification, more consistent performance, and longer drive life.
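A toy calculation makes the saving concrete. The function and the page counts are hypothetical; the only thing modeled is that, without TRIM, the controller cannot tell deleted pages from live ones when it empties a block.

```python
def gc_copy_pages(live_pages: int, deleted_pages: int, trimmed: bool) -> int:
    """Pages the controller must copy out of a block before it can erase it."""
    # Without TRIM, pages the filesystem has freed still look valid to the drive.
    return live_pages if trimmed else live_pages + deleted_pages


# A 256-page block where the filesystem has deleted three quarters of the data.
without_trim = gc_copy_pages(live_pages=64, deleted_pages=192, trimmed=False)
with_trim = gc_copy_pages(live_pages=64, deleted_pages=192, trimmed=True)
print(without_trim, with_trim)  # 256 64
```

In this example the untrimmed drive does four times the copying to reclaim the same block.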

On Linux, there are two ways to deliver TRIM:

Continuous TRIM — the discard mount option in /etc/fstab:

/dev/sda1  /  ext4  defaults,noatime,discard  0 1

Every delete immediately sends a TRIM command to the drive. Simple and effective, but each TRIM is a small I/O, and on write-heavy workloads the overhead adds up.

Periodic TRIM — the fstrim.timer systemd unit:

systemctl enable fstrim.timer
systemctl start fstrim.timer

This runs fstrim once a week (the stock unit uses OnCalendar=weekly, which systemd interprets as Monday at midnight), sending one big batch of TRIM commands at once. Lower overhead during normal operation, with the trade-off that recently-deleted blocks are stale until the next timer fires.

Best practice: use fstrim.timer. Several major distributions (Ubuntu and Fedora among them) enable it by default on new installs, it works with every filesystem that supports TRIM (ext4, XFS, btrfs), and it avoids the per-delete overhead. If your distribution doesn't enable it, your server predates the default, or it was installed from a minimal image, check:

systemctl is-enabled fstrim.timer

If the answer is anything other than enabled, fix it. It takes ten seconds and will extend the useful life of every SSD on the box.

Pro Tip

If you're running an SSD behind a RAID array or an LVM layer, verify that TRIM actually passes through. Check lsblk -D and look for non-zero values in the DISC-GRAN and DISC-MAX columns. A zero means the layer is eating your TRIM commands silently, and the drive is flying blind.
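If you have many machines to check, that inspection can be scripted. The sketch below parses the text output of lsblk -D and lists devices where both columns are zero. The sample text is typed by hand to show the shape of the output, not captured from a real machine, and the function name is made up for illustration.

```python
# Illustrative sample only: typed by hand to show the shape of `lsblk -D` output.
SAMPLE = """\
NAME      DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda              0      512B       2G         0
└─sda1           0      512B       2G         0
md0              0        0B       0B         0
"""

# Characters lsblk uses to draw the device tree (UTF-8 and ASCII variants).
TREE_CHARS = "├└│─|`- "


def devices_without_discard(lsblk_output: str) -> list[str]:
    """Return device names whose DISC-GRAN and DISC-MAX are both zero."""
    lines = lsblk_output.strip().splitlines()
    header = lines[0].split()
    value_cols = header[1:]  # every column except NAME
    gran_i = value_cols.index("DISC-GRAN")
    max_i = value_cols.index("DISC-MAX")
    blocked = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= len(value_cols):
            continue  # not a full row
        # Take values from the right so tree-drawing prefixes cannot shift columns.
        values = fields[-len(value_cols):]
        name = " ".join(fields[:-len(value_cols)]).lstrip(TREE_CHARS)
        if values[gran_i] in ("0", "0B") and values[max_i] in ("0", "0B"):
            blocked.append(name)
    return blocked


print(devices_without_discard(SAMPLE))  # ['md0']
```

On a real system you would feed it the output of the command itself, for example via subprocess.run(["lsblk", "-D"], capture_output=True, text=True).stdout.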

Cell Types

Not all NAND is created equal. The number of bits stored per cell determines the drive's capacity, speed, endurance, and cost — and the industry has been cramming more bits per cell for decades, trading longevity for density.

Type	Bits per Cell	Voltage Levels	Endurance (P/E cycles)	Typical Use
SLC	1	2	60,000–100,000	Enterprise write-intensive
MLC	2	4	3,000–10,000	Enterprise mixed
TLC	3	8	1,000–3,000	Consumer, enterprise read-heavy
QLC	4	16	100–1,000	Cold storage, read-mostly

The physics is straightforward: each additional bit doubles the number of distinct voltage levels the cell must distinguish. An SLC cell is binary — charged or not, 1 or 0, two states, easy to read, easy to write, tolerant of degradation. A QLC cell must distinguish sixteen different charge levels, with a shrinking margin between each. The more levels, the longer each write takes (the controller must position the charge more precisely), the less degradation the cell can tolerate before levels blur together, and the more error correction the controller needs to apply.
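The doubling is easy to tabulate. The margin figure below assumes the levels are spaced evenly across the cell's voltage window, which is an idealization for illustration; real cells do not space their levels evenly.

```python
CELL_TYPES = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4}


def voltage_levels(bits_per_cell: int) -> int:
    """Number of distinct charge levels a cell must distinguish."""
    return 2 ** bits_per_cell


def relative_margin(bits_per_cell: int) -> float:
    """Gap between adjacent levels as a fraction of the window (assumes even spacing)."""
    return 1 / (voltage_levels(bits_per_cell) - 1)


for name, bits in CELL_TYPES.items():
    print(f"{name}: {voltage_levels(bits)} levels, margin {relative_margin(bits):.3f} of the window")
```

Under that assumption a QLC cell has one fifteenth of the margin an SLC cell has, which is the intuition behind the endurance column in the table.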

For servers, the practical takeaway is:

Database servers, write-heavy logging, anything with heavy random writes: use enterprise TLC or MLC drives. They cost more per gigabyte, but the endurance and write consistency justify it.
General-purpose servers, boot disks, read-heavy workloads: TLC is the sweet spot. It's what most datacenter SSDs ship with today and it will last years under normal workloads.
QLC: fine for read archives and cold storage, but not for anything that writes regularly. The endurance is genuinely low — a 1 TB QLC drive rated at 200 TBW will hit its limit in months under a serious database workload.

Endurance: TBW and DWPD

Every SSD comes with an endurance rating, and understanding it is the difference between panicking over a SMART warning and shrugging at one.

TBW (Terabytes Written) is the total amount of host data the manufacturer guarantees the drive can accept before the flash is considered worn. A consumer 1 TB SSD might be rated at 600 TBW. An enterprise 1 TB drive might be rated at 6,000 TBW — ten times more, because enterprise drives use more durable cells, more aggressive wear leveling, and more spare area.

DWPD (Drive Writes Per Day) is TBW re-expressed as a rate: how many times you can overwrite the entire drive's capacity every day for the warranty period (usually 5 years). A 1 TB drive rated at 1 DWPD for 5 years means 1 TB × 365 × 5 = 1,825 TBW. Enterprise drives range from 1 to 10+ DWPD; consumer drives are typically 0.3–0.5 DWPD.

Here's the reassuring reality: most servers never come close to their endurance limits. A server writing 50 GB/day to a 1 TB drive rated at 600 TBW would take over 30 years to exhaust it. The drives that hit their limits are database servers doing sustained heavy random writes, logging servers ingesting terabytes daily, or swap-heavy systems that are really running out of memory. For everything else, endurance is a warranty number, not a practical concern.
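The conversions in this section fit in a few lines of Python. The helper names are made up for illustration, and the sketch assumes decimal units (1 TB = 1,000 GB) and a 5-year warranty unless told otherwise.

```python
def dwpd_to_tbw(capacity_tb: float, dwpd: float, warranty_years: float = 5) -> float:
    """Total host terabytes implied by a DWPD rating."""
    return capacity_tb * dwpd * 365 * warranty_years


def tbw_to_dwpd(capacity_tb: float, tbw: float, warranty_years: float = 5) -> float:
    """Drive writes per day implied by a TBW rating."""
    return tbw / (capacity_tb * 365 * warranty_years)


def years_to_exhaust(tbw: float, host_gb_per_day: float) -> float:
    """Years of writing at a steady daily rate before the TBW rating is reached."""
    return tbw * 1000 / host_gb_per_day / 365


print(dwpd_to_tbw(1, 1))                  # 1825
print(round(tbw_to_dwpd(1, 600), 2))      # 0.33
print(round(years_to_exhaust(600, 50), 1))  # 32.9
```

Plug in your own daily write volume (from the SMART counters below, sampled a few days apart) to see which side of the line a given server falls on.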

You can see where your drive stands right now:

# SATA SSD: look for Total_LBAs_Written (attribute 241 on most drives; some vendors, such as Crucial, use 246)
smartctl -A /dev/sda | grep -i written

# NVMe — the health log shows it directly
smartctl -a /dev/nvme0 | grep -E 'Data Units Written|Percentage Used'

Data Units Written:                 42,917,469 [21.9 TB]
Percentage Used:                    3%

That NVMe drive has written 21.9 TB and the controller estimates 3% of its rated life consumed. Unless those writes landed in a very short span, it has many years left. The Percentage Used counter is the one to watch: it's the drive's own vendor-specific estimate based on actual usage, not just raw bytes written. When it crosses 100%, the drive has exceeded its rated endurance but will usually keep working (the counter can keep climbing, up to 255%); treat anything past 100% as borrowed time and plan a replacement. See SSD worn out for exactly how to read that transition.
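The bracketed terabyte figure comes from the raw counter. NVMe reports Data Units Written in units of 1,000 512-byte blocks, so the conversion can be checked by hand; the function names here are made up for illustration.

```python
BYTES_PER_DATA_UNIT = 1000 * 512  # NVMe counts in thousands of 512-byte units


def parse_data_units(field: str) -> int:
    """Turn smartctl's comma-grouped counter, e.g. '42,917,469', into an int."""
    return int(field.replace(",", ""))


def data_units_to_bytes(data_units: int) -> int:
    return data_units * BYTES_PER_DATA_UNIT


units = parse_data_units("42,917,469")
written_tb = data_units_to_bytes(units) / 1e12
print(f"{written_tb:.2f} TB")  # 21.97 TB
```

That matches the 21.9 TB smartctl prints next to the raw value.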

How SSDs Fail

This is the part that catches HDD veterans off guard, because SSDs fail differently — and the difference is not always kinder.

An HDD announces its death gradually. Bad sectors accumulate one by one over months. SMART attributes like Reallocated_Sector_Ct tick upward slowly. You get plenty of warning. The drive degrades, gets worse, and eventually becomes unreliable — but at each step you can still read most of the data. The metaphor is a slow leak.

An SSD is more of a cliff. The controller manages the flash so well that everything looks perfect right up until it doesn't. Common failure patterns:

Sudden read-only: the drive detects its flash is too worn or too error-prone to guarantee safe writes, so it locks itself into a read-only mode to preserve your data. This is actually the good outcome — your data is still there, you just can't write to it. Copy everything off and replace the drive.
Sudden death: the controller firmware crashes or the power management circuit fails, and the drive simply vanishes from the bus. No warning, no gradual decline, just gone. This is rarer on enterprise drives (which have better power-loss protection) but it happens.
Slow write degradation: as the flash wears, the controller has to do more error correction and more garbage collection. Writes get slower, latency spikes appear, and tail latencies (the worst-case requests) get much worse. The drive still works, but its performance is degrading in a way that shows up in application response times before it shows up in SMART data.

The lesson: monitor SMART attributes on your SSDs, but don't rely on them alone. A healthy SMART report doesn't guarantee a healthy drive — it means the controller hasn't noticed a problem yet. Have a backup. Have RAID. Treat every SSD as a device that could disappear between one I/O and the next, and design your storage accordingly.

SATA SSD vs NVMe SSD

A common confusion: SATA SSDs and NVMe drives use the same flash, store bits the same way, and wear out the same way. The difference is the interface — how the drive talks to the CPU.

A SATA SSD speaks the AHCI protocol over a SATA cable, limited to a single command queue of depth 32. Maximum throughput: about 560 MB/s (the SATA III ceiling). The protocol was designed for spinning disks and never anticipated flash speeds. It's adequate — a SATA SSD is still enormously faster than an HDD — but the interface is the bottleneck, not the flash.

An NVMe drive speaks its own protocol directly over PCIe lanes, with up to 65,535 queues of 65,536 commands each. Maximum throughput: 3,500–7,000 MB/s for PCIe 3.0/4.0 drives, 14,000+ MB/s for PCIe 5.0. More importantly, the latency is lower: NVMe cuts out the AHCI translation layer and talks almost directly to the flash controller.
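Those throughput figures follow from the link speeds. The sketch below computes the raw ceiling after line encoding only (8b/10b for SATA III, 128b/130b for PCIe 3.0 and later) and assumes a four-lane link, which is the usual width for an NVMe SSD; protocol overhead pushes real drives somewhat below these numbers.

```python
def sata3_ceiling_mb_s() -> float:
    """SATA III: 6 Gb/s line rate, 8b/10b encoding."""
    return 6e9 * 8 / 10 / 8 / 1e6


def pcie_ceiling_mb_s(gt_per_s: float, lanes: int = 4) -> float:
    """PCIe 3.0 and later: 128b/130b encoding, per-lane rate in GT/s."""
    return gt_per_s * 1e9 * 128 / 130 * lanes / 8 / 1e6


print(round(sata3_ceiling_mb_s()))  # 600
for gen, rate in (("3.0", 8), ("4.0", 16), ("5.0", 32)):
    print(f"PCIe {gen} x4: {round(pcie_ceiling_mb_s(rate))} MB/s")
# PCIe 3.0 x4: 3938 MB/s
# PCIe 4.0 x4: 7877 MB/s
# PCIe 5.0 x4: 15754 MB/s
```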

For servers, the recommendation is simple: use NVMe for anything that needs performance (databases, boot disks, application storage). Use SATA SSDs where you need the capacity and the slower interface is acceptable (bulk storage, backups, read-mostly archives). Both need the same care — TRIM, SMART monitoring, backup — because the flash underneath is identical.

How I Inspect It

Two tools cover most of what you need to know about your SSDs: smartctl for the drive's health and endurance, and lsblk for the device topology.

Health and Endurance

# NVMe drive — the health log is the gold standard
smartctl -a /dev/nvme0

# SATA SSD — the attributes table
smartctl -A /dev/sda

The key fields to watch on an NVMe drive:

SMART/Health Information (NVMe Log 0x02):
Critical Warning:                   0x00
Temperature:                        37 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    3%
Data Units Written:                 42,917,469 [21.9 TB]
Media and Data Integrity Errors:    0

Percentage Used is your primary wear indicator: 0% is new, 100% is rated life, anything above 100% is borrowed time. Available Spare is the remaining pool of reserve blocks the controller can swap in when a block wears out; once it drops below the threshold, the drive is on its last legs. Media and Data Integrity Errors at zero is what you want; any non-zero value means the controller has detected at least one unrecovered data integrity error, which is a disk failing signal, not just wear.

On a SATA SSD, the equivalent attributes are:

Attribute	What It Means
5 — Reallocated_Sector_Ct	Bad cells replaced by spares. Rising = real defects.
177 — Wear_Leveling_Count	Counts from 100 (new) to 0 (end of rated life).
179 — Used_Rsvd_Blk_Cnt_Tot	Reserve blocks consumed. Like NVMe Available Spare, inverted.
231 — SSD_Life_Left	Percentage remaining. Some vendors use 233 instead.
241 — Total_LBAs_Written	Lifetime host writes (246 on some vendors, such as Crucial). Multiply by 512 for bytes on most drives.

Device Topology

lsblk -o NAME,SIZE,TYPE,ROTA,TRAN,MODEL

NAME     SIZE TYPE ROTA TRAN  MODEL
sda    465.8G disk    0 sata  Samsung SSD 870
nvme0n1  1.8T disk    0 nvme  Samsung SSD 990 PRO
├─nvme0n1p1  512M part
├─nvme0n1p2    1G part
└─nvme0n1p3  1.8T part

The ROTA column is the quick test: 0 means no rotational media — it's an SSD. 1 means spinning platters — it's an HDD. The TRAN column tells you the interface: SATA or NVMe.
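The ROTA value comes from the kernel's per-device rotational flag in sysfs, so the same check can be done without lsblk. This is a minimal sketch: the function name is made up, and "non-rotational" is not quite the same as "SSD", since loop devices and some virtual disks also report 0.

```python
from pathlib import Path


def classify_block_devices(sys_block: str = "/sys/block") -> dict[str, str]:
    """Map each block device to 'rotational' or 'non-rotational'."""
    root = Path(sys_block)
    if not root.is_dir():
        return {}  # not a Linux system, or sysfs is not mounted
    result = {}
    for dev in sorted(root.iterdir()):
        flag = dev / "queue" / "rotational"
        if not flag.is_file():
            continue
        is_rotational = flag.read_text().strip() == "1"
        result[dev.name] = "rotational" if is_rotational else "non-rotational"
    return result


if __name__ == "__main__":
    for name, kind in classify_block_devices().items():
        print(f"{name}: {kind}")
```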

Filesystem Recommendations

The filesystem you put on an SSD matters more than you'd think, because flash has different performance characteristics than spinning disks and some filesystems handle that better than others.

ext4: the safe default. Excellent TRIM support, low write amplification, journaling protects against corruption on power loss. Most Linux servers should use ext4 unless there's a specific reason not to.

XFS: excellent for large files and high-throughput sequential workloads. Great TRIM support. The default on RHEL for good reason.

btrfs: use with caution on SSDs. The copy-on-write (CoW) nature means every overwrite becomes a new write to a different location, which is exactly the pattern that inflates write amplification. For write-heavy workloads (databases, VMs), btrfs can burn through flash endurance noticeably faster than ext4. If you need btrfs features (snapshots, checksums), disable CoW on database directories with chattr +C (it applies to files created afterwards and also turns off checksumming for them; the nodatacow mount option does the same for a whole filesystem), or use a separate ext4 volume for heavy-write data.

Bottom line: ext4 or XFS for SSDs, unless you have a specific reason for something else. And whatever you choose, mount with noatime — it eliminates write-on-read metadata updates that add pointless wear.

Gotchas

Things that catch people with SSDs, in the order they tend to sting:

No TRIM configured. The most common SSD mistake on Linux servers. The drive fills up with stale block mappings, write amplification climbs, performance degrades, and the flash wears faster. Run systemctl is-enabled fstrim.timer on every server with an SSD. If it says anything other than enabled, fix it now.
"FAILED" doesn't mean dead. smartctl reports FAILED! on the overall health assessment when the drive has exceeded its rated endurance — even if the drive is still working perfectly. This is wear, not damage. Read SSD worn out before you panic.
Power loss is the real killer. SSDs have write caches backed by volatile RAM. A sudden power loss (no UPS, no graceful shutdown) can lose in-flight writes and, in the worst case, corrupt the FTL mapping table — bricking the drive. Enterprise SSDs have power-loss protection capacitors that flush the cache on outage; consumer drives often don't. On a server, use enterprise drives or disable the write cache (hdparm -W 0 /dev/sda).
Encryption is close to free. Most modern SSDs have hardware AES engines, and self-encrypting drives run all data through them whether or not you set a key, so full-disk encryption via the drive's own OPAL or eDrive interface adds no measurable performance overhead. Software encryption (LUKS) adds a small CPU cost, but with the AES instructions in modern processors it is usually negligible.
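Over-provisioning is not wasted space. Enterprise SSDs reserve 10–28% of their raw flash for wear leveling, garbage collection, and bad-block replacement. This isn't usable space; it's the drive's internal maintenance budget, and it is already excluded from the capacity the drive reports, so you can't partition into it by accident. What you can do is add to it: on write-heavy drives, leaving part of the reported capacity unpartitioned (and trimmed) gives the controller extra clean blocks, while a drive kept nearly full has less room to work and performance drops.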
Temperature matters. Flash cells retain data less reliably at high temperatures, and prolonged heat accelerates wear. Enterprise drives throttle writes above ~70°C. A drive consistently above 60°C needs better airflow — check smartctl for the temperature and your server's thermal monitoring for ambient conditions.

History and How We Got Here

The idea of using solid-state memory for storage is older than you'd think. The first semiconductor-based storage devices appeared in the 1970s and 1980s — expensive, tiny, and limited to military and aerospace applications where the vibration immunity of no-moving-parts storage justified the cost. These early devices used RAM backed by batteries (essentially a ramdisk with a UPS), not flash.

NAND flash itself was invented in 1987 by Fujio Masuoka at Toshiba, who had earlier invented NOR flash in 1980. The "NAND" name comes from the logic-gate topology of the cell array — cells wired in series, like a NAND gate chain. Masuoka's insight was that wiring cells in series (instead of NOR's parallel arrangement) dramatically reduced the silicon area per bit, making high-density storage economically feasible.

The first commercial SSDs aimed at PCs appeared around 2007–2008, and they were terrible: slow controllers, primitive wear leveling, and a tendency to lock up under sustained writes. The gap between the idea of flash storage and a reliable implementation took years to close, mostly through controller firmware improvements. By 2012, SATA SSDs were reliable enough for consumer laptops. By 2015, NVMe brought the interface up to speed with the media. By 2020, it was genuinely hard to buy a bad SSD from a major vendor — the controllers had gotten that good.

The direction since then has been relentless: more bits per cell (SLC → MLC → TLC → QLC, with PLC — five bits per cell — in labs), more layers of cells stacked vertically (3D NAND, now past 200 layers), and ever-more-sophisticated controllers to manage the increasingly fragile media. Each generation is cheaper and denser, but endurance per cell keeps dropping. The controller firmware compensates — better error correction, smarter wear leveling, more aggressive garbage collection — and in practice, the net result is drives that are cheaper, bigger, and still last years. But the trend is clear: the controller is doing more and more work to make less and less durable media look reliable. It's a race between physics and firmware, and so far firmware is winning.