{"componentChunkName":"component---src-templates-post-jsx","path":"/en/post-mortem-cpu-spike-100-production-drupal","result":{"data":{"markdownRemark":{"html":"<p>It's 9:47 AM on a Tuesday morning. A New Relic notification comes in: CPU at 94% on the production servers. Thirty seconds later, a second alert: average response time at 8 seconds. Then a Slack message from the client: <em>\"The site is down, our teams can't work anymore.\"</em></p>\n<p>This post-mortem is an illustrative scenario, built from patterns regularly encountered on Drupal platforms in production. The timeline, the commands, the decisions, and above all the mistakes are representative of this type of incident.</p>\n<p>The goal is not to show that everything always goes smoothly. The goal is to document how such an incident unfolds, why, and what we put in place so it doesn't happen again.</p>\n<hr>\n<h2>Platform Context</h2>\n<ul>\n<li>Drupal 11 on containerized infrastructure</li>\n<li>PHP 8.3 with PHP-FPM</li>\n<li>Redis for sessions and cache</li>\n<li>Varnish as an internal reverse proxy</li>\n<li>Cloudflare as the edge CDN</li>\n<li>New Relic for application observability</li>\n<li>Blackfire available but not permanently active</li>\n<li>Typical traffic: ~4,000 visitors/day, ~40 req/s at peak</li>\n</ul>\n<hr>\n<h2>Incident Timeline</h2>\n<h3>09:47 — First New Relic Alert</h3>\n<p>CPU at 94% on the application container. The alert is configured on an 80% threshold sustained for more than 2 minutes. The trigger means the situation had been going on for a while even before the notification.</p>\n<p>First action: check the New Relic dashboard.</p>\n<p>Immediate finding: the number of in-flight web transactions had tripled compared to normal. 
Throughput had gone from 40 req/s to around 140 req/s within 15 minutes.</p>\n<h3>09:51 — Access Log Analysis</h3>\n<p>SSH into the production environment:</p>\n<div class=\"gatsby-highlight\" data-language=\"bash\"><pre class=\"language-bash\"><code class=\"language-bash\"><span class=\"token function\">ssh</span> deploy@prod-web-01</code></pre></div>\n<p>Reading the access logs in real time:</p>\n<div class=\"gatsby-highlight\" data-language=\"bash\"><pre class=\"language-bash\"><code class=\"language-bash\"><span class=\"token function\">tail</span> -f /var/log/nginx/access.log <span class=\"token operator\">|</span> <span class=\"token function\">grep</span> -v <span class=\"token string\">\"varnishd\"</span></code></pre></div>\n<p>Immediate observation: an abnormal volume of requests on URLs with varied query strings. Example pattern:</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">185.220.x.x - - [31/May/2026:09:48:12 +0000] &quot;GET /?page=1&amp;sort=asc&amp;filter=new HTTP/1.1&quot; 200 48291\n185.220.x.x - - [31/May/2026:09:48:12 +0000] &quot;GET /?page=2&amp;sort=desc&amp;filter=old HTTP/1.1&quot; 200 48301\n185.220.x.x - - [31/May/2026:09:48:13 +0000] &quot;GET /?page=1&amp;sort=asc&amp;filter=featured HTTP/1.1&quot; 200 48288</code></pre></div>\n<p>Several hundred requests per minute from a range of IPs within the same <code class=\"language-text\">/24</code> block. 
The query strings differ on every request, which prevents any cache hit.</p>\n<h3>09:54 — Confirmation: Bot Storm + Systematic Cache Miss</h3>\n<p>Extracting the most active IPs:</p>\n<div class=\"gatsby-highlight\" data-language=\"bash\"><pre class=\"language-bash\"><code class=\"language-bash\"><span class=\"token function\">awk</span> <span class=\"token string\">'{print <span class=\"token variable\">$1</span>}'</span> /var/log/nginx/access.log <span class=\"token operator\">|</span> <span class=\"token function\">sort</span> <span class=\"token operator\">|</span> <span class=\"token function\">uniq</span> -c <span class=\"token operator\">|</span> <span class=\"token function\">sort</span> -rn <span class=\"token operator\">|</span> <span class=\"token function\">head</span> -20</code></pre></div>\n<p>Result: the top 5 IPs accounted for 3,200 requests over the last 10 minutes. Clearly non-human.</p>\n<p>Checking on the Cloudflare side: the requests do arrive through Cloudflare (<code class=\"language-text\">CF-Ray</code> headers present), but Cloudflare forwards them all. No WAF rule blocks this pattern.</p>\n<p>Checking the Varnish cache:</p>\n<div class=\"gatsby-highlight\" data-language=\"bash\"><pre class=\"language-bash\"><code class=\"language-bash\">varnishstat -1 -f MAIN.cache_hit -f MAIN.cache_miss</code></pre></div>\n<p>Abnormally high cache miss ratio: 78% misses instead of the usual 15%. Explanation: the constantly changing query strings give every request its own cache key, so these requests never hit the cache. Varnish does not normalize the parameters: it treats <code class=\"language-text\">?page=1&amp;sort=asc</code> and <code class=\"language-text\">?page=1&amp;sort=desc</code> as two different URLs.</p>\n<p>Every miss travels all the way to Drupal. Drupal generates the full page. 
The CPU blows up.</p>\n<h3>10:02 — First Emergency Measure: IP Blocking at the Cloudflare Level</h3>\n<p>Creating an emergency Cloudflare WAF rule:</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">(ip.src in {185.220.0.0/24}) =&gt; Block</code></pre></div>\n<p>Immediate effect: requests from those IPs are blocked. CPU drops back to 60%. Not normal yet, but the pressure eases.</p>\n<p>Problem: the bots use multiple IP ranges. Within a few minutes, new IPs take over from other /24 blocks.</p>\n<h3>10:09 — Second Measure: Challenge on Suspicious User Agents</h3>\n<p>Rather than chasing IPs one by one, we switch to targeting by user agent. Cloudflare rule added:</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">(http.user_agent contains &quot;python-requests&quot;) or \n(http.user_agent contains &quot;curl&quot;) or \n(http.user_agent eq &quot;&quot;) \n=&gt; Managed Challenge (CAPTCHA)</code></pre></div>\n<p>Result: request volume drops by 60%. Simple bots (Python scripts with no JS handling) no longer get through.</p>\n<h3>10:14 — Third Measure: Query String Normalization in Varnish</h3>\n<p>The real underlying problem remains untouched: Varnish does not normalize query strings. Even with the bots blocked, a misconfigured Google crawl or a partner who adds UTM parameters to every request can reproduce the same effect.</p>\n<p>Change in the Varnish VCL configuration:</p>\n<div class=\"gatsby-highlight\" data-language=\"vcl\"><pre class=\"language-vcl\"><code class=\"language-vcl\">sub vcl_recv {\n    # Normalize the query strings: drop the listed parameters wherever they appear.\n    # sort, filter and page are assumed not to change the rendered output on this site.\n    set req.url = regsuball(req.url, &quot;&amp;(utm_[^=&amp;]*|sort|filter|page)=[^&amp;]*&quot;, &quot;&quot;);\n    set req.url = regsuball(req.url, &quot;\\?(utm_[^=&amp;]*|sort|filter|page)=[^&amp;]*&quot;, &quot;?&quot;);\n    \n    # Repair the separator when the first parameter was stripped\n    set req.url = regsub(req.url, &quot;\\?&amp;&quot;, &quot;?&quot;);\n    # Remove the trailing ? 
if all params have been stripped\n    set req.url = regsub(req.url, &quot;\\?$&quot;, &quot;&quot;);\n}</code></pre></div>\n<blockquote>\n<p><strong>Note:</strong> this change required a redeployment and was tested in staging before going to production. You don't modify Varnish live without a safety net.</p>\n</blockquote>\n<h3>10:31 — Back to Normal</h3>\n<p>CPU drops back to 22%. Average response time back under 400 ms. The client is informed. The site is stable.</p>\n<p>Total incident duration: <strong>44 minutes</strong>.</p>\n<hr>\n<h2>Root Cause Analysis</h2>\n<p>The incident has three causes, not just one. That is usually the case in real production incidents.</p>\n<h3>Cause 1 — No Upstream Rate Limiting</h3>\n<p>Cloudflare was configured in standard proxy mode, with no rate limiting rule enabled. Any IP could send as many requests as it wanted with no friction.</p>\n<p>Cloudflare rate limiting is available on all paid plans. It wasn't enabled because \"it had never caused a problem.\" A classic mistake.</p>\n<h3>Cause 2 — Varnish Not Configured to Normalize Query Strings</h3>\n<p>By default, Varnish treats every unique combination of query strings as a distinct URL. Without normalization, a bot that varies its parameters on every request bypasses the cache 100% of the time, even if the rendered content is identical.</p>\n<p>This configuration should have been in place from the start. It wasn't.</p>\n<h3>Cause 3 — No Application-Level Circuit Breaker</h3>\n<p>When Drupal receives more requests than it can handle, it keeps accepting them until the CPU saturates. 
There is no native mechanism in Drupal that says: \"I'm saturated, I'll return a temporary 503 instead of drowning further.\"</p>\n<p>This behavior creates a spiral: the more loaded the CPU is, the longer each request takes to process, the longer the queue of pending requests grows, and the higher the CPU climbs.</p>\n<hr>\n<h2>What We Should Have Had in Place</h2>\n<h3>Cloudflare Rate Limiting</h3>\n<p>A simple rule to put in place on every project:</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">Path: /*\nThreshold: 100 requests per IP per minute\nAction: Managed Challenge\nDuration: 10 minutes</code></pre></div>\n<p>For sensitive endpoints (<code class=\"language-text\">/admin</code>, <code class=\"language-text\">/user</code>, <code class=\"language-text\">/api</code>): a much lower threshold, 10 req/min, with a direct Block action.</p>\n<h3>Query String Normalization in Varnish from the Start</h3>\n<p>The VCL configuration must be reviewed on every project to:</p>\n<ul>\n<li>Strip parameters that have no impact on content (UTM, fbclid, gclid, etc.)</li>\n<li>Normalize the order of parameters that are part of the cache key</li>\n<li>Explicitly define the allowlist of accepted parameters</li>\n</ul>\n<h3>Alerting on the Varnish Cache Miss Ratio</h3>\n<p>New Relic or CloudWatch can ingest the hit and miss counters reported by <code class=\"language-text\">varnishstat</code>. A cache miss ratio above 40% over a 5-minute window should trigger an alert on its own, long before CPU reaches 94%.</p>\n<h3>A Watchlist of Bots</h3>\n<p>Maintain a list of user agents known for aggressive scraping and create a Cloudflare rule that systematically sends them to a Managed Challenge. This list builds up incident after incident, so it is better to start early.</p>\n<hr>\n<h2>Measures Put in Place After the Incident</h2>\n<h3>1. 
Rate Limiting Enabled on All Production Environments</h3>\n<p>A global rule at 100 req/min/IP, with specific rules at 10 req/min on:</p>\n<ul>\n<li><code class=\"language-text\">/admin/*</code></li>\n<li><code class=\"language-text\">/user/*</code></li>\n<li><code class=\"language-text\">/api/*</code></li>\n<li><code class=\"language-text\">/node/*/edit</code></li>\n</ul>\n<h3>2. Varnish VCL Revised with Query String Normalization</h3>\n<p>An allowlist of parameters permitted in the cache key. Any parameter not listed is stripped before the request reaches the cache.</p>\n<h3>3. New Relic Throughput Alert</h3>\n<p>Triggers if the number of transactions/minute exceeds 3× the average of the last 7 days for more than 3 minutes.</p>\n<h3>4. Incident Response Runbook</h3>\n<p>A document in Confluence that spells out exactly what to do in case of a CPU spike:</p>\n<ol>\n<li>Open New Relic → Transaction overview → identify the most expensive URLs</li>\n<li>Open the access logs → identify abnormal IPs</li>\n<li>Block the IP ranges in Cloudflare → check the effect on CPU</li>\n<li>If insufficient → temporarily enable Cloudflare \"Under Attack\" mode</li>\n<li>If Drupal is unreachable → switch to maintenance mode + notify the client</li>\n<li>Post-incident: analyze the logs, identify the root cause, document</li>\n</ol>\n<h3>5. Quarterly Load Testing</h3>\n<p>A k6 or Gatling test on the staging environment to verify the stack holds up at 3× nominal traffic. This test would have revealed the query string normalization weakness well before the incident.</p>\n<hr>\n<h2>Lessons Learned</h2>\n<p><strong>Most production incidents are not caused by a single problem.</strong> Here, the bot was the trigger, but the real problem was the absence of rate limiting and the misconfigured cache. Without the bot, the platform was stable. But it was fragile.</p>\n<p><strong>Alerting on CPU alone is insufficient.</strong> At 94% CPU, it's too late. 
You need to alert upstream: cache miss ratio, abnormal throughput, rising P95 response time. CPU is a consequence, not a cause.</p>\n<p><strong>Varnish configuration is not \"set and forget.\"</strong> It must be reviewed on every project, with particular attention to URL normalization. A base VCL template should be part of the starter kit for any Drupal project using Varnish.</p>\n<p><strong>Document during the incident, not just after.</strong> While resolving it, we took notes in a dedicated Slack thread. This post-mortem is built from those notes. Without them, the timeline would have been impossible to reconstruct precisely.</p>\n<h2>Conclusion</h2>\n<p>A 100% CPU spike in production is stressful. It always will be. But the difference between a team that handles the incident in 44 minutes with a solid prevention plan, and a team that spends 4 hours hunting for the cause in the dark, is preparation.</p>\n<p>Rate limiting, query string normalization, multi-level alerting, and a documented runbook: these are four simple elements that cost little to put in place and can prevent a lot of damage — technical, commercial, and relational.</p>\n<p>Enterprise Drupal platforms generally don't fall because of a bug in core. They fall because one element of their stack wasn't configured to withstand an unforeseen scenario. This post-mortem is a concrete example.</p>","excerpt":"It's 9:47 AM on a Tuesday morning. A New Relic notification comes in: CPU at 94% on the production servers. 
Thirty seconds later, a second alert: average…","frontmatter":{"date":"2026-05-31","metaDate":"2026-05-31","title":"Post-mortem: CPU Spike to 100% in Drupal Production — Analysis and Resolution","tags":["Drupal","Production","Performance","Post-mortem","New Relic","Cloudflare","Bot Traffic","Incident","Drupal 11"],"path":"/post-mortem-cpu-spike-100-production-drupal","cover":{"childImageSharp":{"fluid":{"base64":"data:image/jpeg;base64,/9j/2wBDABALDA4MChAODQ4SERATGCgaGBYWGDEjJR0oOjM9PDkzODdASFxOQERXRTc4UG1RV19iZ2hnPk1xeXBkeFxlZ2P/2wBDARESEhgVGC8aGi9jQjhCY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2P/wgARCAANABQDASIAAhEBAxEB/8QAFgABAQEAAAAAAAAAAAAAAAAAAQAF/8QAFgEBAQEAAAAAAAAAAAAAAAAAAAIE/9oADAMBAAIQAxAAAAHIR1yVH//EABQQAQAAAAAAAAAAAAAAAAAAACD/2gAIAQEAAQUCX//EABQRAQAAAAAAAAAAAAAAAAAAABD/2gAIAQMBAT8BP//EABQRAQAAAAAAAAAAAAAAAAAAABD/2gAIAQIBAT8BP//EABQQAQAAAAAAAAAAAAAAAAAAACD/2gAIAQEABj8CX//EABkQAAMAAwAAAAAAAAAAAAAAAAABEBEhMf/aAAgBAQABPyFR90KZP//aAAwDAQACAAMAAAAQwA//xAAUEQEAAAAAAAAAAAAAAAAAAAAQ/9oACAEDAQE/ED//xAAUEQEAAAAAAAAAAAAAAAAAAAAQ/9oACAECAQE/ED//xAAcEAADAAEFAAAAAAAAAAAAAAAAAREhMUFhcfD/2gAIAQEAAT8Qjkx5CpNqXUEGoYbU/9k=","aspectRatio":1.5001937233630376,"src":"/static/e94c64bca92ef8d77addbf763315523d/88110/cover.jpg","srcSet":"/static/e94c64bca92ef8d77addbf763315523d/0b320/cover.jpg 480w,\n/static/e94c64bca92ef8d77addbf763315523d/60b32/cover.jpg 960w,\n/static/e94c64bca92ef8d77addbf763315523d/88110/cover.jpg 1920w,\n/static/e94c64bca92ef8d77addbf763315523d/40175/cover.jpg 2880w,\n/static/e94c64bca92ef8d77addbf763315523d/e58c2/cover.jpg 3840w,\n/static/e94c64bca92ef8d77addbf763315523d/e5b5f/cover.jpg 3872w","srcWebp":"/static/e94c64bca92ef8d77addbf763315523d/d1a9d/cover.webp","srcSetWebp":"/static/e94c64bca92ef8d77addbf763315523d/bc3bf/cover.webp 480w,\n/static/e94c64bca92ef8d77addbf763315523d/39337/cover.webp 960w,\n/static/e94c64bca92ef8d77addbf763315523d/d1a9d/cover.webp 1920w,\n/static/e94c64bca92ef8d77addbf763315523d/fcbe1/cover.webp 
2880w,\n/static/e94c64bca92ef8d77addbf763315523d/c136d/cover.webp 3840w,\n/static/e94c64bca92ef8d77addbf763315523d/228a9/cover.webp 3872w","sizes":"(max-width: 1920px) 100vw, 1920px"},"resize":{"src":"/static/e94c64bca92ef8d77addbf763315523d/c4f3a/cover.jpg"}}}}}},"pageContext":{"isCreatedByStatefulCreatePages":false,"pathSlug":"/post-mortem-cpu-spike-100-production-drupal","locale":"en","prev":{"fields":{"locale":"en"},"frontmatter":{"path":"/drupal-ai-content-transformation","title":"Why AI Will Transform the Editorial Experience in Drupal","tags":["Drupal 11","IA","Contribution","AI"]}},"next":null}}}