Post-mortem: CPU Spike to 100% in Drupal Production — Analysis and Resolution

It's 9:47 on a Tuesday morning. A New Relic notification comes in: CPU at 94% on the production servers. Thirty seconds later, a second alert: average response time at 8 seconds. Then a Slack message from the client: "The site is down, our teams can't work anymore."

This post-mortem is an illustrative scenario, built from patterns regularly encountered on Drupal platforms in production. The timeline, the commands, the decisions, and above all the mistakes are representative of this type of incident.

The goal is not to show that everything always goes smoothly. The goal is to document how such an incident unfolds, why, and what we put in place so it doesn't happen again.

Platform Context

Drupal 11 on containerized infrastructure
PHP 8.3 with PHP-FPM
Redis for sessions and cache
Varnish as an internal reverse proxy
Cloudflare as the edge CDN
New Relic for application observability
Blackfire available but not permanently active
Typical traffic: ~4,000 visitors/day, ~40 req/s at peak

Incident Timeline

09:47 — First New Relic Alert

CPU at 94% on the application container. The alert is configured on an 80% threshold sustained for more than 2 minutes, so by the time the notification arrives, CPU has already been above 80% for at least 2 minutes.

First action: check the New Relic dashboard.

Immediate finding: the number of in-flight web transactions had more than tripled compared to normal. Throughput had gone from 40 req/s to around 140 req/s within 15 minutes.

09:51 — Access Log Analysis

SSH into the production environment:

ssh deploy@prod-web-01

Reading the access logs in real time:

tail -f /var/log/nginx/access.log | grep -v "varnishd"

Immediate observation: an abnormal volume of requests on URLs with varied query strings. Example pattern:

185.220.x.x - - [31/May/2026:09:48:12 +0000] "GET /?page=1&sort=asc&filter=new HTTP/1.1" 200 48291
185.220.x.x - - [31/May/2026:09:48:12 +0000] "GET /?page=2&sort=desc&filter=old HTTP/1.1" 200 48301
185.220.x.x - - [31/May/2026:09:48:13 +0000] "GET /?page=1&sort=asc&filter=featured HTTP/1.1" 200 48288

Several hundred requests per minute from a range of IPs within the same /24 block. The query strings differ on every request, which prevents any cache hit.

09:54 — Confirmation: Bot Storm + Systematic Cache Miss

Extracting the most active IPs:

awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

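Result: the top 5 IPs accounted for 3,200 requests over the last 10 minutes, about 64 requests per minute each, sustained. Clearly non-human. Note that the command as written counts over the whole log file, so a 10-minute figure requires first narrowing the log to that window, for example by filtering on the timestamp.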
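As a complement to the awk one-liner, here is a minimal Python sketch that restricts the count to a time window and groups clients by /24 block rather than by single IP. It assumes IPv4 clients and the combined log format shown above; the function name top_blocks and the sample addresses (taken from documentation ranges) are illustrative, not part of the incident tooling.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone
from ipaddress import ip_network

# Client IP and timestamp at the start of a combined-format log line.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')


def top_blocks(lines, now, window_minutes=10, top=5):
    """Count requests per /24 block over the last `window_minutes`."""
    cutoff = now - timedelta(minutes=window_minutes)
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not in the expected format
        ip, stamp = match.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        if when < cutoff or when > now:
            continue  # outside the window
        try:
            block = ip_network(f"{ip}/24", strict=False)
        except ValueError:
            continue  # first field is not an IP address
        counts[str(block)] += 1
    return counts.most_common(top)


sample = [
    '203.0.113.7 - - [31/May/2026:09:48:12 +0000] "GET /?page=1 HTTP/1.1" 200 48291',
    '203.0.113.9 - - [31/May/2026:09:48:13 +0000] "GET /?page=2 HTTP/1.1" 200 48301',
    '198.51.100.4 - - [31/May/2026:09:30:00 +0000] "GET / HTTP/1.1" 200 48288',
]
now = datetime(2026, 5, 31, 9, 54, tzinfo=timezone.utc)
print(top_blocks(sample, now))  # [('203.0.113.0/24', 2)]
```

Grouping by block makes it easier to see when a bot rotates addresses inside one range, which a per-IP count tends to hide.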

Checking on the Cloudflare side: the requests do arrive through Cloudflare (CF-Ray headers present), but Cloudflare forwards them all. No WAF rule blocks this pattern.

Checking the Varnish cache:

varnishstat -1 -f MAIN.cache_hit -f MAIN.cache_miss

Abnormally high cache miss ratio: 78% misses instead of the usual 15%. Explanation: every distinct query string yields a distinct cache key, so almost every bot request is a miss. Varnish does not normalize the parameters: it treats ?page=1&sort=asc and ?page=1&sort=desc as two different URLs.

Every miss travels all the way to Drupal. Drupal generates the full page. The CPU blows up.

10:02 — First Emergency Measure: IP Blocking at the Cloudflare Level

Creating an emergency Cloudflare WAF rule:

(ip.src in {185.220.0.0/24}) => Block

Immediate effect: requests from those IPs are blocked. CPU drops back to 60%. Not normal yet, but the pressure eases.

Problem: the bots use multiple IP ranges. Within a few minutes, new IPs take over from other /24 blocks.

10:09 — Second Measure: Challenge on Suspicious User Agents

Rather than chasing IPs one by one, we switch to behavior-based targeting. Cloudflare rule added:

(http.user_agent contains "python-requests") or
(http.user_agent contains "curl") or
(http.user_agent eq "")
=> Managed Challenge

Result: request volume drops by 60%. Simple bots (Python scripts with no JS handling) no longer get through.

10:14 — Third Measure: Query String Normalization in Varnish

The real underlying problem remains untouched: Varnish does not normalize query strings. Even with the bots blocked, a misconfigured Google crawl or a partner who adds UTM parameters to every request can reproduce the same effect.

Change in the Varnish VCL configuration:

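sub vcl_recv {
    # Strip tracking and bot-varied parameters. The leading ? or & is kept
    # (\1) so that any remaining parameters stay attached to the URL.
    # Assumes sort/filter/page do not change the content of the affected pages.
    set req.url = regsuball(req.url, "([?&])utm_[^&]*", "\1");
    set req.url = regsuball(req.url, "([?&])(sort|filter|page)=[^&]*", "\1");

    # Collapse the separators left behind by the stripping
    set req.url = regsuball(req.url, "\?&+", "?");
    set req.url = regsuball(req.url, "&&+", "&");

    # Remove a trailing ? or & if the last params have been stripped
    set req.url = regsub(req.url, "[?&]$", "");
}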

Note: this change required a redeployment and was tested in staging before going to production. You don't modify Varnish live without a safety net.
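One cheap safety net for this kind of rule is to replay it against sample URLs before it ever reaches staging. The sketch below ports a separator-preserving stripping rule to Python's re module, which handles these particular patterns the same way as the regex engine used by VCL. The function name strip_params and the sample URLs are illustrative assumptions. The important detail is the separator: a rule that deletes the leading ? or & together with the parameter turns /news?utm_source=x&id=5 into /news&id=5, which is no longer the same resource.

```python
import re


def strip_params(url):
    """Illustrative Python port of a separator-preserving stripping rule."""
    # Drop utm_* and sort/filter/page, but keep the leading ? or &.
    url = re.sub(r"([?&])utm_[^&]*", r"\1", url)
    url = re.sub(r"([?&])(sort|filter|page)=[^&]*", r"\1", url)
    # Collapse leftover separators, then drop a dangling ? or &.
    url = re.sub(r"\?&+", "?", url)
    url = re.sub(r"&&+", "&", url)
    return re.sub(r"[?&]$", "", url)


cases = {
    "/?page=1&sort=asc&filter=new": "/",
    "/news?utm_source=x&id=5": "/news?id=5",
    "/news?id=5&utm_campaign=y": "/news?id=5",
    "/news?a=1&page=2&b=3": "/news?a=1&b=3",
}
for raw, expected in cases.items():
    assert strip_params(raw) == expected, (raw, strip_params(raw))
```

The table of cases doubles as documentation of the intended behavior and can grow with every new parameter the team decides to strip.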

10:31 — Back to Normal

CPU drops back to 22%. Average response time back under 400ms. The client is informed. The site is stable.

Total incident duration: 44 minutes.

Root Cause Analysis

The incident has three causes, not just one. That is usually the case in real production incidents.

Cause 1 — No Upstream Rate Limiting

Cloudflare was configured in standard proxy mode, with no rate limiting rule enabled. Any IP could send as many requests as it wanted with no friction.

Cloudflare rate limiting is available on all paid plans. It wasn't enabled because "it had never caused a problem." A classic mistake.

Cause 2 — Varnish Not Configured to Normalize Query Strings

By default, Varnish treats every unique combination of query strings as a distinct URL. Without normalization, a bot that varies its parameters on every request bypasses the cache 100% of the time, even if the rendered content is identical.

This configuration should have been in place from the start. It wasn't.

Cause 3 — No Application-Level Circuit Breaker

When Drupal receives more requests than it can handle, it keeps accepting them until the CPU saturates. There is no native mechanism in Drupal that says: "I'm saturated, I'll return a temporary 503 instead of drowning further."

This behavior creates a spiral: the more loaded the CPU is, the longer each request takes to process, the more the queue of pending requests grows, the higher the CPU climbs.
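To make the idea concrete, here is a minimal load-shedding sketch: once a fixed number of requests is in flight, new ones receive an immediate 503 instead of joining the queue. It is written in Python for brevity and is not Drupal or PHP-FPM code; the class name LoadShedder, the handle function, and the limit are all hypothetical. In this stack the equivalent logic would more plausibly live in the reverse proxy or in the PHP-FPM pool limits.

```python
import threading


class LoadShedder:
    """Reject work once `max_in_flight` requests are already being served."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Reserve a slot; return False when the limit is reached."""
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Free the slot taken by try_acquire()."""
        with self._lock:
            self._in_flight -= 1


def handle(shedder, render):
    """Return (status, body); shed with a 503 instead of queueing."""
    if not shedder.try_acquire():
        return 503, "Service temporarily unavailable"
    try:
        return 200, render()
    finally:
        shedder.release()
```

Rejecting early costs almost no CPU, so the requests that are accepted keep a bounded response time and the spiral described above does not get a chance to start.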

What We Should Have Had in Place

Cloudflare Rate Limiting

A simple rule to put in place on every project:

Path: /*
Threshold: 100 requests per IP per minute
Action: Managed Challenge
Duration: 10 minutes

For sensitive endpoints (/admin, /user, /api): a much lower threshold, 10 req/min, with a direct Block action.
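The three numbers in such a rule (threshold, counting period, mitigation duration) interact in a way that is easier to see in code. The sketch below is a simplified fixed-window limiter that only illustrates the semantics; it is not how Cloudflare implements rate limiting, and the class name and time handling are assumptions made for the example.

```python
class FixedWindowLimiter:
    """Per-IP counter: more than `threshold` hits per `period` seconds
    triggers mitigation for `penalty` seconds."""

    def __init__(self, threshold=100, period=60, penalty=600):
        self.threshold = threshold
        self.period = period
        self.penalty = penalty
        self._windows = {}  # ip -> (window_start, count)
        self._blocked = {}  # ip -> time until which mitigation applies

    def allow(self, ip, now):
        """Return True if the request passes, False if it is mitigated."""
        if self._blocked.get(ip, 0) > now:
            return False
        start, count = self._windows.get(ip, (now, 0))
        if now - start >= self.period:
            start, count = now, 0  # a new counting window begins
        count += 1
        self._windows[ip] = (start, count)
        if count > self.threshold:
            self._blocked[ip] = now + self.penalty
            return False
        return True


limiter = FixedWindowLimiter(threshold=3, period=60, penalty=600)
print([limiter.allow("203.0.113.7", t) for t in (0, 1, 2, 3)])
# [True, True, True, False]
```

With the incident's figures, an IP sending about 64 requests per minute would stay under a 100 req/min threshold, which is one reason to pair rate limiting with the other measures rather than rely on it alone.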

Query String Normalization in Varnish from the Start

The VCL configuration must be reviewed on every project to:

Strip parameters that have no impact on content (UTM, fbclid, gclid, etc.)
Normalize the order of parameters that are part of the cache key
Explicitly define the allowlist of accepted parameters
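The three points above can be sketched as a single normalization function: keep only allowlisted parameters, then sort them so that equivalent URLs share one cache key. This is a Python illustration of the logic, not VCL; the allowlist contents and the function name cache_key_url are hypothetical and would differ per project.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical allowlist: parameters that really change the rendered page.
ALLOWED_PARAMS = {"q", "lang"}


def cache_key_url(url, allowed=ALLOWED_PARAMS):
    """Drop non-allowlisted parameters and sort the remaining ones."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = sorted((key, value) for key, value in pairs if key in allowed)
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


print(cache_key_url("/search?utm_source=x&q=drupal&lang=fr&fbclid=abc"))
# /search?lang=fr&q=drupal
print(cache_key_url("/?sort=asc&filter=new"))
# /
```

On the Varnish side, the ordering step is typically covered by std.querysort from the std vmod, while the allowlist itself still has to be written in VCL for each project.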

Alerting on the Varnish Cache Miss Ratio

New Relic or CloudWatch can ingest the varnishstat counters. A cache miss ratio that exceeds 40% over a 5-minute window should trigger an alert on its own, rather than the team finding out once CPU is at 94%.
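One detail matters when building this alert: MAIN.cache_hit and MAIN.cache_miss are cumulative counters since Varnish started, so the ratio has to be computed from the difference between two samples, not from the raw values. A minimal sketch, assuming the two samples are taken 5 minutes apart and passed in as plain dictionaries (the function names and sample numbers are illustrative):

```python
def miss_ratio(previous, current):
    """Miss ratio over the interval between two counter samples."""
    hits = current["MAIN.cache_hit"] - previous["MAIN.cache_hit"]
    misses = current["MAIN.cache_miss"] - previous["MAIN.cache_miss"]
    lookups = hits + misses
    if lookups <= 0:
        return 0.0  # no traffic, or counters reset by a restart
    return misses / lookups


def should_alert(previous, current, threshold=0.40):
    """True when the windowed miss ratio exceeds the threshold."""
    return miss_ratio(previous, current) > threshold


before = {"MAIN.cache_hit": 1_000_000, "MAIN.cache_miss": 150_000}
after = {"MAIN.cache_hit": 1_002_200, "MAIN.cache_miss": 157_800}
print(round(miss_ratio(before, after), 2))  # 0.78
print(should_alert(before, after))          # True
```

In this example the lifetime ratio computed from the raw counters is still under 14%, which is exactly why an alert based on cumulative values would have stayed silent during the incident.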

A Watchlist of Bots

Maintain a list of user agents known for aggressive scraping and create a Cloudflare rule that systematically sends them to a Managed Challenge. This list builds up incident after incident — better to start early.
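The matching logic behind such a watchlist is small enough to prototype against real access logs before turning it into a Cloudflare rule. A minimal sketch, seeded with the two user agents from the incident rule; the function name and the decision to treat a missing user agent as suspicious are assumptions carried over from that rule:

```python
# Seeded from the incident; meant to grow over time.
WATCHLIST = ("python-requests", "curl")


def should_challenge(user_agent, watchlist=WATCHLIST):
    """True if the user agent is empty or contains a watchlisted token."""
    ua = (user_agent or "").strip().lower()
    if not ua:
        return True  # empty or missing user agent
    return any(token in ua for token in watchlist)


print(should_challenge("python-requests/2.31.0"))           # True
print(should_challenge("Mozilla/5.0 (X11; Linux x86_64)"))  # False
```

The sketch lowercases the user agent before matching. Cloudflare's contains operator is case-sensitive, so the equivalent rule needs the lower() function or explicit variants of each token.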

Measures Put in Place After the Incident

1. Rate Limiting Enabled on All Production Environments

A global rule at 100 req/min/IP, with specific rules at 10 req/min on:

/admin/*
/user/*
/api/*
/node/*/edit

2. Varnish VCL Revised with Query String Normalization

An allowlist of parameters permitted in the cache key. Any parameter not listed is stripped before the request reaches the cache.

3. New Relic Throughput Alert

Triggers if the number of transactions/minute exceeds 3× the average of the last 7 days for more than 3 minutes.
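The condition itself is simple enough to express in a few lines. The sketch below assumes one sample per minute and reads "for more than 3 minutes" as "the last 3 one-minute samples are all above the limit"; the function name, the interpretation, and the baseline of 2,400 transactions per minute are illustrative, and the real alert is configured in New Relic rather than in code.

```python
def throughput_alert(per_minute, baseline_avg, factor=3.0, sustained=3):
    """True if the last `sustained` samples all exceed factor x baseline."""
    limit = factor * baseline_avg
    if len(per_minute) < sustained:
        return False  # not enough data to call it sustained
    return all(count > limit for count in per_minute[-sustained:])


baseline = 2_400  # hypothetical 7-day average, transactions per minute
print(throughput_alert([2_500, 7_300, 8_400, 8_400], baseline))  # True
print(throughput_alert([2_500, 7_300, 8_400, 7_000], baseline))  # False
```

Requiring several consecutive samples filters out one-minute bursts, at the cost of delaying the alert by the same number of minutes.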

4. Incident Response Runbook

A document in Confluence that spells out exactly what to do in case of a CPU spike:

Open New Relic → Transaction overview → identify the most expensive URLs
Open the access logs → identify abnormal IPs
Block the IP ranges in Cloudflare → check the effect on CPU
If insufficient → temporarily enable Cloudflare "Under Attack" mode
If Drupal is unreachable → switch to maintenance mode + notify the client
Post-incident: analyze the logs, identify the root cause, document

5. Quarterly Load Testing

A k6 or Gatling test on the staging environment to verify the stack holds up at 3× nominal traffic. Provided the scenario varies query strings instead of replaying a fixed set of URLs, this test would have revealed the query string normalization weakness well before the incident.

Lessons Learned

Most production incidents are not caused by a single problem. Here, the bot was the trigger, but the real problem was the absence of rate limiting and the misconfigured cache. Without the bot, the platform was stable. But it was fragile.

Alerting on CPU alone is insufficient. At 94% CPU, it's too late. You need to alert upstream: cache miss ratio, abnormal throughput, rising P95 response time. CPU is a consequence, not a cause.

Varnish configuration is not "set and forget." It must be reviewed on every project, with particular attention to URL normalization. A base VCL template should be part of the starter kit for any Drupal project using Varnish.

Document during the incident, not just after. While resolving it, we took notes in a dedicated Slack thread. This post-mortem is built from those notes. Without them, the timeline would have been impossible to reconstruct precisely.

Conclusion

A 100% CPU spike in production is stressful. It always will be. But the difference between a team that handles the incident in 44 minutes with a solid prevention plan, and a team that spends 4 hours hunting for the cause in the dark, is preparation.

Rate limiting, query string normalization, multi-level alerting, and a documented runbook: these are four simple elements that cost little to put in place and can prevent a lot of damage — technical, commercial, and relational.

Enterprise Drupal platforms generally don't fall because of a bug in core. They fall because one element of their stack wasn't configured to withstand an unforeseen scenario. This post-mortem is a concrete example.
