It's 9:47 AM on a Tuesday morning. A New Relic notification comes in: CPU at 94% on the production servers. Thirty seconds later, a second alert: average response time at 8 seconds. Then a Slack message from the client: "The site is down, our teams can't work anymore."
This post-mortem is an illustrative scenario, built from patterns regularly encountered on Drupal platforms in production. The timeline, the commands, the decisions, and above all the mistakes are representative of this type of incident.
The goal is not to show that everything always goes smoothly. The goal is to document how such an incident unfolds, why, and what we put in place so it doesn't happen again.
CPU at 94% on the application container. The alert fires on an 80% threshold sustained for more than 2 minutes, so by the time the notification arrived the container had already been overloaded for at least that long.
First action: check the New Relic dashboard.
Immediate finding: the number of in-flight web transactions had more than tripled compared to normal. Throughput had gone from 40 req/s to around 140 req/s within 15 minutes.
SSH into the production environment:
ssh deploy@prod-web-01
Reading the access logs in real time:
tail -f /var/log/nginx/access.log | grep -v "varnishd"
Immediate observation: an abnormal volume of requests on URLs with varied query strings. Example pattern:
185.220.x.x - - [31/May/2026:09:48:12 +0000] "GET /?page=1&sort=asc&filter=new HTTP/1.1" 200 48291
185.220.x.x - - [31/May/2026:09:48:12 +0000] "GET /?page=2&sort=desc&filter=old HTTP/1.1" 200 48301
185.220.x.x - - [31/May/2026:09:48:13 +0000] "GET /?page=1&sort=asc&filter=featured HTTP/1.1" 200 48288
Several hundred requests per minute from a range of IPs within the same /24 block. The query strings differ on every request, which prevents any cache hit.
Extracting the most active IPs:
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20
Result: the top 5 IPs accounted for 3,200 requests over the last 10 minutes. Clearly non-human.
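Since the offending addresses sat in the same /24 block, counting per network rather than per address can make the pattern easier to see. The following is a minimal Python sketch under stated assumptions: the function name `top_blocks` is hypothetical, the sample lines use documentation IP ranges rather than real traffic, and the client IP is assumed to be the first field of each log line.

```python
import ipaddress
from collections import Counter

def top_blocks(log_lines, prefix_len=24, limit=5):
    """Count requests per client network, most active first."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue
        try:
            # strict=False accepts a host address and returns its enclosing network
            network = ipaddress.ip_network(f"{fields[0]}/{prefix_len}", strict=False)
        except ValueError:
            continue  # first field is not an IP address (malformed line)
        counts[str(network)] += 1
    return counts.most_common(limit)

# Hypothetical sample lines (documentation IP ranges, not incident data)
sample = [
    '203.0.113.10 - - [31/May/2026:09:48:12 +0000] "GET /?page=1 HTTP/1.1" 200 48291',
    '203.0.113.77 - - [31/May/2026:09:48:12 +0000] "GET /?page=2 HTTP/1.1" 200 48301',
    '198.51.100.4 - - [31/May/2026:09:48:13 +0000] "GET / HTTP/1.1" 200 48288',
]
print(top_blocks(sample))
```

A block-level count like this would also have shown sooner that blocking one /24 only moves the traffic to the next one.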
Checking on the Cloudflare side: the requests do arrive through Cloudflare (CF-Ray headers present), but Cloudflare forwards them all. No WAF rule blocks this pattern.
Checking the Varnish cache:
varnishstat -1 -f MAIN.cache_hit -f MAIN.cache_miss
Abnormally high cache miss ratio: 78% misses instead of the usual 15%. Explanation: the ever-changing query strings push every request out of the cache. Varnish does not normalize the parameters by default: it treats ?page=1&sort=asc and ?page=1&sort=desc as two different URLs.
Every miss travels all the way to Drupal. Drupal generates the full page. The CPU blows up.
Creating an emergency Cloudflare WAF rule:
(ip.src in {185.220.0.0/24}) => Block
Immediate effect: requests from those IPs are blocked. CPU drops back to 60%. Not normal yet, but the pressure eases.
Problem: the bots use multiple IP ranges. Within a few minutes, new IPs take over from other /24 blocks.
Rather than chasing IPs one by one, we switch to behavior-based targeting. Cloudflare rule added:
(http.user_agent contains "python-requests") or
(http.user_agent contains "curl") or
(http.user_agent eq "")
=> Managed Challenge (CAPTCHA)
Result: request volume drops by 60%. Simple bots (Python scripts with no JS handling) no longer get through.
The real underlying problem remains untouched: Varnish does not normalize query strings. Even with the bots blocked, a misconfigured Google crawl or a partner who adds UTM parameters to every request can reproduce the same effect.
Change in the Varnish VCL configuration:
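sub vcl_recv {
    # Strip tracking and listing parameters, keeping the separator that preceded them
    set req.url = regsuball(req.url, "([?&])(utm_[^&]*|(sort|filter|page)=[^&]*)", "\1");
    # Collapse the separators left behind ("?&" and "&&")
    set req.url = regsuball(req.url, "\?&+", "?");
    set req.url = regsuball(req.url, "&&+", "&");
    # Remove a trailing ? or & if the last params have been stripped
    set req.url = regsub(req.url, "[?&]+$", "");
}
Note: this change required a redeployment and was tested in staging before going to production. You don't modify Varnish live without a safety net. Stripping sort, filter, and page is only safe on paths where they do not change the response, as on the home page targeted here; listings that really paginate need those parameters kept in the cache key.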
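Stripping rules like these are easy to get subtly wrong: removing a parameter together with its leading `?`, for instance, can turn `/?utm_source=x&id=5` into `/&id=5`. One way to check them in staging is to replay the same substitutions on sample URLs. The Python sketch below is an illustration, not the production VCL; the function name `normalize` and the sample URLs are assumptions.

```python
import re

# Parameters to strip; the separator in front of each one is captured and kept
STRIP = re.compile(r"([?&])(?:utm_[^&]*|(?:sort|filter|page)=[^&]*)")

def normalize(url):
    """Strip tracking/listing parameters and tidy up leftover separators."""
    url = STRIP.sub(r"\1", url)         # drop the parameter, keep its separator
    url = re.sub(r"\?&+", "?", url)     # "?&id=5" -> "?id=5"
    url = re.sub(r"&&+", "&", url)      # collapse doubled separators
    return re.sub(r"[?&]+$", "", url)   # drop a dangling "?" or "&"

for candidate in [
    "/?page=1&sort=asc&filter=new",
    "/?utm_source=x&id=5",
    "/node/1?id=5&utm_campaign=y",
    "/?mypage=3",
]:
    print(candidate, "->", normalize(candidate))
```

The last sample is there to confirm that a parameter whose name merely ends in `page` is left alone.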
CPU drops back to 22%. Average response time back under 400ms. The client is informed. The site is stable.
Total incident duration: 44 minutes.
The incident has three causes, not just one. That is usually the case in real production incidents.
Cloudflare was configured in standard proxy mode, with no rate limiting rule enabled. Any IP could send as many requests as it wanted with no friction.
Cloudflare rate limiting rules are available on every plan, with the number of rules and the matching options depending on the plan. It wasn't enabled because "it had never caused a problem." A classic mistake.
By default, Varnish treats every unique combination of query strings as a distinct URL. Without normalization, a bot that varies its parameters on every request bypasses the cache 100% of the time, even if the rendered content is identical.
This configuration should have been in place from the start. It wasn't.
When Drupal receives more requests than it can handle, it keeps accepting them until the CPU saturates. There is no native mechanism in Drupal that says: "I'm saturated, I'll return a temporary 503 instead of drowning further."
This behavior creates a spiral: the more loaded the CPU is, the longer each request takes to process, the more the queue of pending requests grows, the higher the CPU climbs.
A simple rule to put in place on every project:
Path: /*
Threshold: 100 requests per IP per minute
Action: Managed Challenge
Duration: 10 minutes
For sensitive endpoints (/admin, /user, /api): a much lower threshold, 10 req/min, with a direct Block action.
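To make the threshold semantics concrete, here is a minimal Python sketch of per-IP counting over a sliding window. It only illustrates the idea of "N requests per IP per minute"; it is not how Cloudflare implements its rules, it does not model the 10-minute mitigation duration, and the class name `SlidingWindowLimiter` is hypothetical.

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per IP within any `window_s` seconds."""

    def __init__(self, limit=100, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)  # ip -> timestamps of recent allowed requests

    def allow(self, ip, now):
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the threshold: challenge or block
        hits.append(now)
        return True

# Small limit so the behavior is easy to follow
limiter = SlidingWindowLimiter(limit=3, window_s=60.0)
print([limiter.allow("203.0.113.10", t) for t in (0, 1, 2, 3)])
```

Passing the timestamp in explicitly, rather than reading the clock inside `allow`, keeps the sketch deterministic and easy to test.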
The VCL configuration must be reviewed on every project to:
New Relic or CloudWatch can monitor varnishstat. A cache miss ratio that exceeds 40% over a 5-minute window should trigger an alert, not wait until CPU is at 94%.
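Because the hit and miss counters reported by varnishstat are cumulative, a ratio over a 5-minute window has to be computed from the difference between two snapshots. A minimal Python sketch, assuming the snapshots have already been collected into dictionaries (the keys `cache_hit` and `cache_miss` and both function names are hypothetical):

```python
def window_miss_ratio(previous, current):
    """Miss ratio between two snapshots of cumulative hit/miss counters."""
    hits = current["cache_hit"] - previous["cache_hit"]
    misses = current["cache_miss"] - previous["cache_miss"]
    if hits < 0 or misses < 0 or hits + misses == 0:
        return 0.0  # counters were reset (restart) or there was no traffic
    return misses / (hits + misses)

def should_alert(previous, current, threshold=0.40):
    """True when the miss ratio over the window exceeds the threshold."""
    return window_miss_ratio(previous, current) > threshold

# Hypothetical snapshots taken 5 minutes apart
before = {"cache_hit": 1000, "cache_miss": 200}
after = {"cache_hit": 1220, "cache_miss": 980}
print(window_miss_ratio(before, after), should_alert(before, after))
```

With these made-up numbers the window contains 220 hits and 780 misses, which matches the 78% miss ratio observed during the incident.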
Maintain a list of user agents known for aggressive scraping and create a Cloudflare rule that systematically sends them to a Managed Challenge. This list builds up incident after incident — better to start early.
A global rule at 100 req/min/IP, with specific rules at 10 req/min on:
/admin/*
/user/*
/api/*
/node/*/edit
An allowlist of parameters permitted in the cache key. Any parameter not listed is stripped before the request reaches the cache.
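The allowlist approach can be sketched in a few lines of Python. This is an illustration of the logic only, not Varnish configuration; the allowlist contents (`q`, `lang`) and the function name `cache_key_url` are assumptions, and the right list depends on which parameters actually change the response on a given site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

ALLOWED_PARAMS = {"q", "lang"}  # hypothetical allowlist

def cache_key_url(url, allowed=ALLOWED_PARAMS):
    """Keep only allowlisted query parameters, in a stable order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in allowed
    ]
    kept.sort()  # the same parameters in a different order share one cache entry
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(cache_key_url("/search?utm_source=x&q=drupal&lang=fr"))
print(cache_key_url("/?sort=asc&filter=new"))
```

Compared with stripping known-bad parameters, an allowlist fails safe: a parameter nobody anticipated is dropped from the cache key instead of creating a new cache entry.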
Triggers if the number of transactions/minute exceeds 3× the average of the last 7 days for more than 3 minutes.
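The trigger condition can be expressed as a small check over per-minute samples. The Python sketch below is an assumption about how such a rule might be evaluated, not the monitoring tool's own logic: "for more than 3 minutes" is approximated as the three most recent one-minute samples all being above the limit, and the function name `throughput_alert` is hypothetical.

```python
def throughput_alert(recent_per_minute, baseline_avg, factor=3.0, sustained=3):
    """True if the last `sustained` samples all exceed factor x baseline."""
    if baseline_avg <= 0 or len(recent_per_minute) < sustained:
        return False  # not enough data to decide
    limit = factor * baseline_avg
    return all(sample > limit for sample in recent_per_minute[-sustained:])

# Assuming a 7-day average of 40 req/s (2,400 transactions/minute)
baseline = 40 * 60
print(throughput_alert([2500, 8400, 8400, 8400], baseline))  # sustained spike
print(throughput_alert([8400, 8400, 2500], baseline))        # spike already over
```

Requiring several consecutive samples avoids paging on a single burst, at the cost of a few minutes of detection delay.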
A document in Confluence that spells out exactly what to do in case of a CPU spike:
A k6 or Gatling test on the staging environment to verify the stack holds up at 3× nominal traffic. This test would have revealed the query string normalization weakness well before the incident.
Most production incidents are not caused by a single problem. Here, the bot was the trigger, but the real problem was the absence of rate limiting and the misconfigured cache. Without the bot, the platform was stable. But it was fragile.
Alerting on CPU alone is insufficient. At 94% CPU, it's too late. You need to alert upstream: cache miss ratio, abnormal throughput, rising P95 response time. CPU is a consequence, not a cause.
Varnish configuration is not "set and forget." It must be reviewed on every project, with particular attention to URL normalization. A base VCL template should be part of the starter kit for any Drupal project using Varnish.
Document during the incident, not just after. While resolving it, we took notes in a dedicated Slack thread. This post-mortem is built from those notes. Without them, the timeline would have been impossible to reconstruct precisely.
A 100% CPU spike in production is stressful. It always will be. But the difference between a team that handles the incident in 44 minutes with a solid prevention plan, and a team that spends 4 hours hunting for the cause in the dark, is preparation.
Rate limiting, query string normalization, multi-level alerting, and a documented runbook: these are four simple elements that cost little to put in place and can prevent a lot of damage — technical, commercial, and relational.
Enterprise Drupal platforms generally don't fall because of a bug in core. They fall because one element of their stack wasn't configured to withstand an unforeseen scenario. This post-mortem is a concrete example.