Server Overheating: Symptoms, Diagnosis & Fixes

A hot server doesn't crash first. It quietly slows down, hopes you won't notice, and only falls over when the heat wins.

What It Is

Overheating is what happens when the heat a server makes outruns the heat it can shed. Every transistor that switches turns a sliver of electricity into warmth, billions of times a second, and a CPU under full load is a small electric heater you happen to compute on. That heat has one escape route: into the heatsink, into the air, out of the chassis, out of the room. Break any link in that chain — a stopped fan, a dust-choked heatsink, a server room whose air conditioning gave up over a long weekend — and the heat piles up, and the chip starts taking measures to save its own life.

Here's the part that catches almost everyone, and it's why this page exists: a modern server does not let itself burn out. It slows down instead. When the CPU crosses its thermal limit it deliberately drops its own clock speed — throttling — trading performance for survival, right down to a crawl, before it ever lets the temperature reach the point of damage. And it does this silently: no error in your app's log, no crash, no alert. Your service just gets mysteriously, persistently slow, and stays that way for as long as the box runs hot — which can be weeks, because nobody checks the temperature of a machine that's technically still up. The dramatic failures — the random 3 a.m. reboot, the box that won't power back on until it cools — are the rare end of overheating; the common end is a server running at 60% of the speed you pay for, indefinitely, while everyone blames the database. By the end of this page you'll spot both, read the temperature off the silicon, catch throttling in the act, find the one fan that died, and know which of these you can fix with a screwdriver and which was never yours to fix at all.

How You Notice

Heat trouble shows its face in a handful of ways, and the quiet ones are the ones worth learning. Each comes with the command to confirm it on your own box right now:

Everything slow, but nothing busy. The signature of throttling, and maddening because every usual suspect looks innocent: load is normal, memory's free, the disk is fine — and yet requests crawl. The CPU is there; it's just running at half its rated clock to keep cool. The tell is the clock speed:
```
grep MHz /proc/cpuinfo
```
If a chip with a 3.4 GHz base clock shows 1197.000 across its cores under load, it isn't idling — it's throttling, and the missing gigahertz is your missing performance.
A sudden performance cliff under load. Fine until the box gets busy, then it falls off a ledge as the chip heats up and clocks down. Light load runs cool and quick; sustained load cooks and crawls. That load-dependence is the fingerprint of heat, not of a slow disk or a high load average.
Random reboots and shutdowns. When throttling isn't enough — a dead fan, a blocked heatsink — the hardware hits its hard limit and cuts power instantly to save the chip: thermal shutdown. The box vanishes mid-request, often refusing to return until it's cooled. A server that reboots under heavy load and runs fine when idle is overheating until proven otherwise.
Fans screaming. The most audible symptom and the most honest: a chassis howling at full tilt is telling you the cooling system is maxed out and still losing. (A new whine after months of quiet can also be a bearing starting to go — the fan on its way to becoming the cause of the next overheat.)
Thermal events in the kernel log. The kernel narrates this in plain text the moment the hardware trips:
```
dmesg -T | grep -iE "thermal|throttl|temperature|critical temp"
```
Lines like CPU1: Core temperature above threshold, cpu clock throttled are the chip raising its hand. An empty result under load is reassuring, provided the kernel's ring buffer hasn't already rotated the older messages out; a wall of them is a confession. (dmesg is the rawest source there is for this, and on some systems reading it needs root.)
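To put a number on the first symptom in this list, here is a minimal sketch that condenses the per-core MHz lines into one summary. The function name mhz_summary and its optional file argument are my own, added for illustration, and it assumes the x86 layout of /proc/cpuinfo, where each core has a "cpu MHz" line; other architectures may not print one.

```shell
# mhz_summary [FILE]: summarise the "cpu MHz" lines of /proc/cpuinfo.
# The function name and the optional FILE argument are illustrative only.
mhz_summary() {
  awk -F: '
    /^cpu MHz/ {
      mhz = $2 + 0                          # " 1197.000" becomes 1197
      if (n == 0 || mhz < min) min = mhz
      if (n == 0 || mhz > max) max = mhz
      sum += mhz; n++
    }
    END {
      if (n == 0) { print "no cpu MHz lines found"; exit 1 }
      printf "cores=%d min=%.0f avg=%.0f max=%.0f MHz\n", n, min, sum / n, max
    }
  ' "${1:-/proc/cpuinfo}"
}
```

Run mhz_summary while the box is busy and set the min and avg figures against the base clock that lscpu reports. A busy machine whose average sits far below base is worth a temperature check.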

Any one of these points the same direction: stop debugging your application and go read the temperature. That's one command, and it's next.

How I Diagnose It

The whole truth about a server's heat is sitting in a file. Linux exposes the temperature sensors it has drivers for through /sys, the kernel's window onto the hardware, and you read them with nothing but cat: no tools to install, no daemon to configure.

Read the Temperature Straight From the Silicon

The kernel groups hardware-monitoring sensors under /sys/class/hwmon, each reporting temperatures in millidegrees Celsius. A quick tour:

```
for f in /sys/class/hwmon/hwmon*/temp*_input; do
  printf '%s = %d°C\n' "$f" "$(( $(cat "$f") / 1000 ))"
done
```

The values come in millidegrees — 58000 means 58 °C — which is why everything divides by a thousand. There's a second door to the same room, the kernel's thermal zones:

```
for z in /sys/class/thermal/thermal_zone*/temp; do
  printf '%s: %d°C\n' "$z" "$(( $(cat "$z") / 1000 ))"
done
```

Both read the same hardware. If you have the lm-sensors package, its sensors command is the friendly front end over the hwmon files: it formats the temperatures and shows the fan RPMs and voltages that sit alongside them. The files are there whenever the kernel has a driver for the sensor, package or not. That's the magic worth pausing on: there's no secret thermometer hidden inside a tool. The temperature of your CPU is a file, the kernel keeps it current, and reading it is cat. Everything fancier is decoration over that one fact.
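A bare path like hwmon2/temp3_input doesn't say which sensor it is. As a rough sketch of how the same files can label themselves: each hwmon directory carries a name file for the chip, and many drivers add a tempN_label file beside each tempN_input. The function name hwmon_temps and its optional root argument are hypothetical, added so the loop can be pointed at a copy of the tree.

```shell
# hwmon_temps [ROOT]: print "chip <tab> label <tab> degrees" for each hwmon
# temperature input. ROOT defaults to /sys/class/hwmon. The function name and
# the output format are illustrative, not a standard tool.
hwmon_temps() (
  root=${1:-/sys/class/hwmon}
  for input in "$root"/hwmon*/temp*_input; do
    [ -r "$input" ] || continue            # glob matched nothing, or unreadable
    dir=${input%/*}
    base=${input##*/}                      # e.g. temp1_input
    sensor=${base%_input}                  # e.g. temp1
    chip=$(cat "$dir/name" 2>/dev/null || echo unknown)
    # Fall back to the sensor's file name when the driver offers no label.
    label=$(cat "$dir/${sensor}_label" 2>/dev/null || echo "$sensor")
    milli=$(cat "$input" 2>/dev/null) || continue
    printf '%s\t%s\t%d°C\n' "$chip" "$label" "$((milli / 1000))"
  done
)
```

Calling hwmon_temps with no argument reads the live tree. Which labels appear depends entirely on the driver.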

Know What's Hot and What's Fine

A number means nothing without a scale. Modern CPUs are designed to run warm — far warmer than the cool-to-the-touch instinct expects — and the figure that governs everything is Tjmax, the junction temperature at which the chip starts protecting itself. For most server and desktop CPUs Tjmax is around 100 °C. That's not the melting point; it's the line at which the chip decides it's worked hard enough and begins throttling.

| Package / core temperature | What it means |
|---|---|
| 30–50 °C | Idle or light load. Healthy, plenty of headroom. |
| 50–70 °C | Normal under real load. Nothing to see here. |
| 70–85 °C | Working hard. Fine briefly; for sustained load, eye the cooling. |
| 85–95 °C | Hot. Approaching the limit; throttling may already have begun. |
| ~100 °C (Tjmax) | The wall. The chip is throttling now to avoid damage. |
| Past Tjmax, sustained | Thermal shutdown territory: the box cuts power to survive. |

The rule I carry: a server idling in the 30s and 40s and peaking into the 70s under load is perfectly healthy. A server sustained above 85–90 °C is in trouble — not destroyed, but already paying the throttle tax and shortening its own life. Brief spikes to 90+ during a burst are normal; living there is not.
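The bands in the table can be turned into a small classifier for raw sysfs readings. This is only a sketch: the function name temp_band is made up, the cut-offs assume a Tjmax near 100 °C as described above, and readings between 95 and 100 °C, which the table leaves open, are treated as "hot".

```shell
# temp_band MILLIDEGREES: map a raw sysfs reading onto the bands in the table.
# Thresholds assume a Tjmax near 100 °C; adjust them for your own chip.
temp_band() (
  c=$(( $1 / 1000 ))
  if   [ "$c" -lt 50 ];  then echo "idle or light load"
  elif [ "$c" -lt 70 ];  then echo "normal under load"
  elif [ "$c" -lt 85 ];  then echo "working hard"
  elif [ "$c" -lt 100 ]; then echo "hot"
  else                        echo "at Tjmax"
  fi
)
```

For example, temp_band 58000 prints "normal under load". Where the driver provides them, the tempN_max and tempN_crit files beside each tempN_input hold the chip's own thresholds, which are a better guide than fixed numbers.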

Catch the Throttle in the Act

A high temperature tells you the box is hot. To prove that heat is actually costing you performance, watch the clock speed move. Every core exposes its current frequency:

```
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
```

These come in kilohertz, so 1197000 is 1.197 GHz. Put that next to the chip's base clock (lscpu prints it) and the verdict is immediate: a core sitting below its base clock while the system is busy is being held down, and heat is the usual hand on the brake. The cleanest way to catch it is to watch the number while you push the box:

```
watch -n1 'grep MHz /proc/cpuinfo'
```

Run a load and watch the MHz climb, hover, then, if it's overheating, sag back down as the temperature wins. That sag is the throttle, live on your screen. (A load generator such as stress pegs the cores while you keep one terminal on the temperatures and one on the clocks: heat up, gigahertz down, the whole problem in two windows.) On x86 systems that expose it, typically Intel ones, the kernel even keeps a tally: core_throttle_count under /sys/devices/system/cpu/cpu0/thermal_throttle/ ticking upward is the hardware admitting, on the record, how many times heat has forced its hand. Each CPU has its own counter under its own cpuN directory.
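A single read of the counter only says throttling has happened at some point since boot. A sketch of how to tell whether it is happening now is to sample the counters twice and subtract. The names throttle_total and throttle_delta are my own, and the path assumes the thermal_throttle directory described above exists on your hardware; where it doesn't, both functions simply report 0.

```shell
# throttle_total [ROOT]: sum core_throttle_count across all CPUs.
# ROOT defaults to /sys/devices/system/cpu. Function names are illustrative.
throttle_total() (
  root=${1:-/sys/devices/system/cpu}
  total=0
  for f in "$root"/cpu[0-9]*/thermal_throttle/core_throttle_count; do
    [ -r "$f" ] || continue                # counter absent on this system
    total=$(( total + $(cat "$f") ))
  done
  echo "$total"
)

# throttle_delta SECONDS [ROOT]: throttle events counted during the window.
throttle_delta() (
  before=$(throttle_total "$2")
  sleep "$1"
  after=$(throttle_total "$2")
  echo $(( after - before ))
)
```

Running throttle_delta 60 under load and getting anything above zero means the chip throttled within that minute. Sibling threads may report the same events, so treat the sum as an indicator rather than an exact count.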

Reading It by Example

Readout on the left, what I'd conclude on the right:

45 °C idle, 72 °C under full load, clocks at or above base → Healthy. Exactly what well-cooled silicon looks like. Nothing to do.
Pinned at 99–100 °C under load, cores at 1.2 GHz against a 3.4 GHz base → Throttling, right now. The box is shedding more than half its speed to survive. Find out why the cooling can't keep up.
dmesg full of cpu clock throttled and core_throttle_count climbing hourly → Chronic overheating, the quiet tax, caught: it's been robbing you for a while.
Normal temps, normal clocks, app still slow → Not heat. Go look at high load or poor performance instead. Don't let a cool box send you chasing a thermal ghost.
Reboots under heavy load, runs for days when idle → Cooling fails only when heat output is high. A marginal cooler, a half-dead fan, or a heatsink barely making contact.
One CPU socket 40 °C cooler than the other → A cooling problem isolated to one socket — a dead fan over it, or a heatsink come loose. The asymmetry is the diagnosis.

How to Fix It

The cause decides the cure. Work them roughly in this order — cheapest and most common first.

Warning

Before you reach for a screwdriver, settle one question that changes everything: is this a physical machine you own, or a virtual machine on someone else's hardware? Inside a cloud guest you can read a temperature that looks alarming, but you cannot touch a single fan, heatsink, or air conditioner — the heat is the host's problem, on hardware you'll never see. Read the last fix in this list first if that's you.
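If you aren't sure which kind of machine you're logged into, one quick hint on x86 is the hypervisor CPU flag, which hypervisors conventionally set for their guests. The helper below is a sketch with a made-up name, is_guest; a missing flag is not proof of bare metal.

```shell
# is_guest [CPUINFO]: succeed if the CPU flags line contains "hypervisor".
# CPUINFO defaults to /proc/cpuinfo. The helper name is illustrative only.
is_guest() {
  grep -Eq '^flags[[:space:]]*:.* hypervisor( |$)' "${1:-/proc/cpuinfo}"
}
```

Typical use is is_guest && echo "virtual machine: the heat is the host's problem". On systems with systemd, systemd-detect-virt answers the same question and prints none on bare metal.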

Clean the dust and fix the airflow. The most common cause by a wide margin, and the easiest fix. Dust packs into a heatsink's fins like felt and turns a radiator into an insulator; cooling that was fine a year ago slowly suffocates. Power down, open the box, and clear the dust from fins, fan blades, and intake filters with gentle compressed air. While you're in there, make sure nothing chokes the airflow path — a cable across the intake, empty rack slots left open so cold air short-circuits past the hot gear instead of through it.
Find the fan that died. A single stopped fan turns a healthy server into an overheating one in minutes. Fan speeds live right next to the temperatures:
```
cat /sys/class/hwmon/hwmon*/fan*_input
```
These are RPM. A fan reading 0 while its neighbours spin at three or four thousand is your culprit — dead, seized, or unplugged. Fans are cheap, standardized, and the easiest part in the machine to swap; the temperature usually drops the moment the new one spins up. (If every fan reads zero, suspect a controller or BIOS setting before six simultaneous deaths.)
Reseat the heatsink and renew the thermal paste. The heatsink only works pressed flat against the chip through a thin film of thermal paste — the grease that bridges the microscopic gaps between two never-perfectly-flat metal surfaces. Over years that paste dries, cracks, and stops conducting; a heatsink can also work loose from a knock during maintenance. If temps are high with clean dust and healthy fans, power down, lift the heatsink, wipe off the crusted paste, apply a fresh thin layer, and clamp it back evenly. A bad paste job is a classic cause of one core running far hotter than its siblings.
Lower the ambient temperature. A server can only dump heat into air cooler than itself, so the room sets the floor. If the rack inlet air is already 35 °C, no fan does enough. Check the room's cooling, the hot-aisle/cold-aisle layout, and whether something nearby dumps exhaust into your intake. A failed air conditioner over a weekend has cooked more servers than any dead fan — and it takes out every box in the room at once, which is its own kind of tell.
Check the BIOS fan curve. Sometimes the fans are healthy but the firmware is told to keep them quiet. A curve set to "stay silent until 80 °C" leaves the cooling holding back while the chip bakes. In the BIOS/BMC, set the fan profile to ramp earlier and harder for a server meant to work. Noise is cheaper than throttling.
Reduce or cap the sustained load. If the cooling is genuinely at its limit and you can't improve it today, take heat off the chip by taking work off it: move a heavy batch job off-peak, cap the CPU frequency, or pin the worst offender to fewer cores. A tourniquet, not a cure — but it can keep a box out of thermal shutdown until you fix the real cooling.
If it's a VM or cloud guest, you can't fix the heat from inside. A throttling host slows your guest, and nothing inside it can cool silicon you don't own — likely shared with whatever noisy neighbour is cooking the same machine. What you can do is recognize it: if your instance's performance sags in a way that smells thermal, the cause is on the host, and the fix is to open a ticket with the provider or migrate to fresh hardware. Diagnosing it correctly still matters — it stops you wasting a day tuning a guest for a problem that was never yours to solve.
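Two of the fixes in this list lend themselves to a short sketch: spotting the stopped fan, and capping the clock as a stopgap. Both function names below are hypothetical. The fan check assumes your fans report through hwmon at all; on many servers they are visible only through the BMC. The cap writes to the cpufreq scaling_max_freq file, which needs root and takes a value in kilohertz.

```shell
# stopped_fans [ROOT]: list fan inputs reading 0 RPM, but only when at least
# one other fan is spinning. ROOT defaults to /sys/class/hwmon.
stopped_fans() (
  root=${1:-/sys/class/hwmon}
  spinning=0
  stopped=""
  for f in "$root"/hwmon*/fan*_input; do
    [ -r "$f" ] || continue
    rpm=$(cat "$f" 2>/dev/null) || continue
    if [ "$rpm" -gt 0 ]; then
      spinning=$(( spinning + 1 ))
    else
      stopped="$stopped$f
"
    fi
  done
  # All-zero readings point at a controller or firmware setting, not a fan.
  if [ "$spinning" -gt 0 ]; then printf '%s' "$stopped"; fi
)

# cap_max_khz KHZ [ROOT]: set an upper frequency limit on every CPU.
# ROOT defaults to /sys/devices/system/cpu. Needs root on a real system.
cap_max_khz() (
  root=${2:-/sys/devices/system/cpu}
  for f in "$root"/cpu[0-9]*/cpufreq/scaling_max_freq; do
    if [ -w "$f" ]; then echo "$1" > "$f"; fi
  done
)
```

For instance, cap_max_khz 2000000 holds every core at or below 2 GHz until the value is raised again or the machine reboots. The 2 GHz figure is only an example; pick a cap from what your own temperatures show.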

Pro Tip

When one core runs much hotter than the rest, the cause is almost always physical contact, not load — a heatsink loose on that socket or paste that dried and cracked. Even heat across all cores points at airflow or ambient; lopsided heat points at the mount. Reading per-core temps from hwmon turns "the server is hot" into "the cooler over socket 1 is loose" — a far more useful sentence to act on.
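One way to make lopsided heat visible is to print, for each hwmon chip, the coolest and hottest reading and the gap between them. This is a sketch under assumptions: the function name temp_spread is invented, and it relies on the driver grouping a socket's sensors under one hwmon directory, which is what the Intel coretemp driver typically does.

```shell
# temp_spread [ROOT]: per hwmon chip, print min, max and spread in °C.
# ROOT defaults to /sys/class/hwmon. The function name is illustrative.
temp_spread() (
  root=${1:-/sys/class/hwmon}
  for dir in "$root"/hwmon*; do
    [ -d "$dir" ] || continue
    min=""; max=""
    for f in "$dir"/temp*_input; do
      [ -r "$f" ] || continue
      t=$(cat "$f" 2>/dev/null) || continue
      t=$(( t / 1000 ))
      if [ -z "$min" ] || [ "$t" -lt "$min" ]; then min=$t; fi
      if [ -z "$max" ] || [ "$t" -gt "$max" ]; then max=$t; fi
    done
    [ -n "$min" ] || continue              # chip has no temperature inputs
    chip=$(cat "$dir/name" 2>/dev/null || echo "${dir##*/}")
    printf '%s min=%d max=%d spread=%d\n' "$chip" "$min" "$max" "$(( max - min ))"
  done
)
```

A small spread on every chip with high readings overall suggests airflow or ambient. One chip with a large spread, or one chip far hotter than its twin, suggests the mount.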

How to Avoid It

You can't repeal thermodynamics, but you can stop heat from ever becoming a surprise. In order of leverage:

Monitor the temperature continuously. This is the whole game, because overheating's worst form is the silent one. A server throttling at 95 °C looks, to every ordinary metric, like a healthy box that's just a bit slow — and it can sit like that for weeks, quietly handing back a third of the performance you pay for, until someone finally reads a sensor. The only defense is something watching the temperature and throttle counters all the time, so the slow tax gets flagged the day it starts. A single manual cat, run once when you happen to suspect trouble, misses exactly the gradual creep that matters most.
Keep the airflow clean. Dust is patient and cumulative; it doesn't cause an outage, it causes a slow decline that looks like the server "just getting old." A periodic clean of filters, heatsinks, and fan blades keeps the cooling you paid for actually working.
Don't run a box at its thermal limit. A server that idles at 80 °C has no headroom — the first warm afternoon or dust buildup tips it straight into throttling. Specify cooling for the peak sustained load plus a margin. A machine that peaks at 70 °C has somewhere to go when conditions worsen; one that peaks at 95 °C is already on the edge.
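As a starting point for the continuous monitoring described at the top of this list, here is a minimal check meant to be run on a schedule. It is a sketch, not a monitoring system: the function name check_heat and the choice of limit are assumptions, and a real setup would also track the throttle counters and feed an alerting tool.

```shell
# check_heat LIMIT_C [ROOT]: print a warning for every hwmon sensor at or
# above LIMIT_C and return non-zero if any were found, so a scheduler or
# monitoring agent can alert on the exit status.
check_heat() (
  limit=$1
  root=${2:-/sys/class/hwmon}
  status=0
  for f in "$root"/hwmon*/temp*_input; do
    [ -r "$f" ] || continue
    milli=$(cat "$f" 2>/dev/null) || continue
    c=$(( milli / 1000 ))
    if [ "$c" -ge "$limit" ]; then
      echo "WARN $f at ${c}°C (limit ${limit}°C)"
      status=1
    fi
  done
  exit "$status"                           # exits the subshell, not your shell
)
```

Run every minute from cron, something like check_heat 85 || logger -t heat "temperature over limit" leaves a line in the system log on the day the creep starts. The 85 here comes from the rule of thumb earlier on this page.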

Note

Heat is not only a performance problem; it's a lifespan problem. Fan bearings, capacitors, the disks, and the silicon itself all age faster hot: a rough engineering rule of thumb is that a component's failure rate doubles for every 10–15 °C of sustained temperature rise. A server you let run hot doesn't just go slow today; it dies younger, and takes its disks with it. Cooling is one of the cheapest forms of reliability insurance there is.

How a Chip Defends Itself

Now the part you don't need in an emergency, but that turns every number above from trivia into something you can reason out. Why does a CPU throttle at all, rather than run flat-out until it dies the way a light bulb burns until it pops?

A transistor steers current with an electric field across an impossibly thin insulating layer. Pack billions onto a fingernail of silicon, flick them billions of times a second, and each switch leaks a sliver of its energy as heat. The cruel twist: hotter silicon leaks more. As temperature climbs, the transistors leak more current even idle, which makes more heat, which raises the temperature further — left unchecked, a chip can run away with itself. So its designers etched a thermostat into the silicon, faster and more trustworthy than any operating system. Tiny diodes are sprinkled across the die, their electrical behaviour shifting predictably with temperature, so the chip can feel its own heat in real time. The hardware watches these against Tjmax, and when a reading nears the limit it climbs a graceful ladder in fractions of a second:

First, it slows the clock. The CPU's speed is set by a clock signal ticking billions of times a second; drop that rate and the chip does less work per second and so makes less heat. This is the throttling you measured in the cpufreq files — the chip choosing slow-but-alive over fast-but-dead, without asking you.
Then it drops the voltage too. The power a chip burns switching scales with the square of its voltage, so trimming voltage alongside frequency cools the chip far faster than slowing it alone. The two move together, hunting for the fastest speed it can hold at the temperature it's stuck with.
Finally, if even a crawl can't hold the line, it cuts power entirely. Thermal shutdown — the silicon's last resort, below the operating system, faster than software could react. It's why an overheating server vanishes mid-request instead of logging a polite goodbye: there was no time for polite. The chip chose to be a working server tomorrow over a dead one tonight.
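The first two rungs of that ladder can be put into rough numbers with the textbook approximation that switching power is proportional to voltage squared times frequency. The sketch below ignores leakage entirely, and the ratios in the example are made up for illustration rather than taken from any real chip.

```shell
# rel_power V_RATIO F_RATIO: switching power relative to full speed, using
# the approximation P ~ C * V^2 * f. Leakage is ignored; name is illustrative.
rel_power() {
  awk -v v="$1" -v f="$2" 'BEGIN { printf "%.2f\n", v * v * f }'
}
```

Halving the clock alone, rel_power 1 0.5, gives 0.50 of the original power. Halving the clock and also dropping the voltage to 80%, rel_power 0.8 0.5, gives 0.32, which is why the chip moves both together.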

Here's the thing to take away: the throttling that ruins your afternoon is the same mechanism that saves your hardware. The chip isn't malfunctioning when it slows down — it's succeeding, sacrificing the speed you wanted to preserve the silicon you need. Your job isn't to stop it throttling by force; it's to give it enough cooling that it never has to. Every degree of headroom in the airflow is a degree the chip never spends slowing itself down — which is exactly how a healthy server feels: fast, quiet, and never once thinking about the heat.