Common Causes of Production System Failures

6/29/2025 · 5 min read

Unplanned System Outages and Prevention

The probability of a failure or outage is highest when changes are deployed. Deployment failures generally trace back to a few narrow root causes: defects in the code that testing didn't find, a missing deployment requirement, or a misconfiguration for the target environment.

1. Failure after Deployment

Isolate Root Cause:

Do test or staging systems fail in the same way? Are they running the same code? Are they on the same dependency version stack? Did the failure first occur in the test or staging environment?

What Happens:
When test and staging don't fail, this suggests the deployment missed something. Perhaps someone made a manual change to keep the other environments working that was never captured in the deploy scripts. A new build and deploy may be required to install a new dependency. Or perhaps a service that doesn't run in other environments was disabled but is necessary in production.

Real-World Example:
During a cloud migration, one team disabled a health check endpoint by accident. The load balancer began terminating healthy instances, interpreting them as failed. Traffic dropped to zero.
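
As a minimal illustration of why that endpoint mattered, here is the kind of health check a load balancer typically probes. This is a hypothetical sketch; the Flask framework and the /healthz route name are assumptions, not the team's actual code:

```python
# Hypothetical sketch of a load balancer health check endpoint (Flask assumed).
# If this route is disabled or removed, the load balancer marks the instance
# unhealthy and recycles it, even though the application itself is fine.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/healthz")
def healthz():
    # Keep this cheap: it only needs to confirm the process can serve traffic.
    return jsonify(status="ok"), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```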

Prevention:

  • Use infrastructure-as-code (IaC) and keep configurations version-controlled.

  • Integrate pre-deploy validations (like terraform plan diffs and linter checks).

  • Test config changes in a sandbox or staging environment identical to prod.
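
One lightweight pre-deploy validation, sketched below under the assumption that each environment's configuration lives in a version-controlled JSON file (the paths and keys are hypothetical), is to diff the config keys and block the deploy when production is missing something staging has:

```python
# Sketch of a pre-deploy validation step: compare environment config files and
# fail the pipeline if production is missing keys that staging has.
# Paths and structure are hypothetical.
import json
import sys

def load_keys(path):
    with open(path) as f:
        return set(json.load(f).keys())

staging = load_keys("config/staging.json")
production = load_keys("config/production.json")

missing = staging - production
if missing:
    print(f"Production config is missing keys: {sorted(missing)}")
    sys.exit(1)  # block the deploy

print("Config keys match; proceeding with deploy.")
```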

2. Valid Load or Traffic Spikes

Symptoms:
The system becomes unresponsive when load stays elevated for an extended period.

Isolate Root Cause:
Look for particularly heavy service calls. Under higher overall volume, a few of these can push the system over a threshold and begin starving other services of resources. If the heavy calls aren't required for the system to function (a history of past purchases, for example), they can be temporarily disabled behind a static message like "History currently unavailable. Check again later." This is often enough to restore stability.
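
A rough sketch of that kind of kill switch, assuming an environment-variable flag and a purchase-history feature (both illustrative, not taken from any specific system):

```python
# Sketch of a temporary kill switch for a heavy, non-critical feature.
# The flag name and feature are illustrative.
import os

PURCHASE_HISTORY_ENABLED = os.getenv("PURCHASE_HISTORY_ENABLED", "true") == "true"

def get_purchase_history(user_id):
    if not PURCHASE_HISTORY_ENABLED:
        # Shed the expensive query and return a static, user-friendly message.
        return {"items": [], "message": "History currently unavailable. Check again later."}
    return expensive_history_query(user_id)  # the call that starves other services under load

def expensive_history_query(user_id):
    ...  # placeholder for the real database or service call
```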

For additional root cause analysis, inspect queue depths, cache hit rates, cache out-of-memory errors, system logs for timeouts and long run times, Sentry for crashes, and database health (slow queries, locks, missing indexes). Increasing pod sizes or instance classes, or resolving out-of-memory errors by moving to a larger instance with more memory, may do the trick.

Getting a production system stable while under load may involve prioritizing which users can access the system and temporarily disabling some logins.

Deep queues may be stuck processing work that no longer matters to critical, time-sensitive features. Workers may be churning through low-priority clean-up jobs instead of jobs with financial or even SLA implications.
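
One way to keep cleanup work from starving time-sensitive jobs is to give them separate queues and always drain the critical one first. A minimal sketch (queue names and job types are hypothetical):

```python
# Sketch of separating time-sensitive jobs from low-priority cleanup so a
# backlog of housekeeping can never starve SLA-bound work.
import queue

critical_jobs = queue.Queue()  # e.g. payment capture, order fulfillment
cleanup_jobs = queue.Queue()   # e.g. log pruning, stale-record deletion

def next_job():
    """Always drain critical work before touching cleanup."""
    try:
        return critical_jobs.get_nowait()
    except queue.Empty:
        try:
            return cleanup_jobs.get_nowait()
        except queue.Empty:
            return None
```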

Prevention:

  • Perform realistic load and stress testing that simulates usage peaks and a realistic ramp up to saturation.

  • Even when the only requirement is to hit specific performance targets, make the case to push beyond them to find the soft and hard failure limits. Provided the target reliably passes, push to 3x or 10x. The target is presumably the minimum threshold, and a safety margin above it is prudent: systems tend to become less efficient over time, and a healthy margin guards against drifting below that minimum.

  • Implement auto-scaling with safe upper limits.

  • Use rate-limiting and circuit breakers to degrade gracefully under duress.
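
For the last point, a minimal circuit breaker might look like the sketch below: after a run of failures it stops calling the struggling dependency for a cooldown period and fails fast instead (the thresholds are illustrative):

```python
# Minimal circuit breaker sketch: after max_failures consecutive errors, stop
# calling the dependency for reset_after seconds and fail fast instead.
import time

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # set when the breaker trips open

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open; failing fast")
            self.opened_at = None  # cooldown elapsed, allow a trial call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```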

3. Upstream or Third-Party Service Failures

Possible Symptoms:

  • The app no longer accepts payments (payment service outage or account problem)

  • The app stops creating delivery routes (Geolocation API outage?)

  • Some users can no longer log in with Google, Facebook, or Apple. The auth provider may be experiencing an outage, or the app's permission may have been revoked.

  • The system is very slow, and logs show long wait times or even timeouts for third-party API requests.

Isolate Root Cause:
What status code is the service returning, and how quickly is it returning? Or is it timing out?

  • 200 - OK

  • 401 / 403 - Bad or expired credentials

  • 429 - Too Many Requests: calls are exceeding the provider's rate limit. Make fewer calls with more time between them, or consider upgrading to a tier with a higher threshold.

  • 5xx - Server error

If the API requests were previously working, check the third party's status page to determine whether their services are healthy.

If the fault is on their end, you may need to wait for it to resolve or submit a support request. If production is impacted and the service isn't required, disable it. If the service is required, the only viable options may be to fall back to a suitable backup or replacement, or simply to retry later.
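
When the third party is degraded rather than down, a hard timeout plus exponential backoff on 429s and 5xx responses keeps the problem from spreading. A hedged sketch using the requests library (the URL and retry counts are placeholders):

```python
# Sketch of calling a third-party API with a hard timeout and exponential
# backoff on retryable failures. The URL and limits are placeholders.
import time
import requests

def call_with_backoff(url, attempts=4, timeout=5.0):
    delay = 1.0
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=timeout)
            if resp.status_code < 400:
                return resp
            if resp.status_code in (401, 403):
                raise RuntimeError("credentials rejected; retrying will not help")
        except requests.Timeout:
            pass  # treat a timeout like a retryable failure
        if attempt < attempts - 1:
            time.sleep(delay)  # 429 and 5xx are often transient; back off and retry
            delay *= 2
    raise RuntimeError(f"{url} still failing after {attempts} attempts")
```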

Prevention:

  • Design services with fallbacks and timeouts.

  • Use caching or graceful degradation strategies for non-critical dependencies.

  • Monitor upstream SLAs and alert on response delays—not just outright failures.

4. Failed or Partial Deployments

What Happens:
Only half the containers receive the new version. A feature flag flips before the code behind it is deployed. A rollback only undoes the backend, leaving the frontend broken.

Prevention:

  • Use blue/green or canary deployments.

  • Implement post-deploy smoke tests to validate core functionality.

  • Automate rollback detection and auto-recovery when deploys fail health checks.
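
A post-deploy smoke test that exits non-zero is the simplest signal for that kind of automated rollback. A sketch, assuming a requests-based check against a few hypothetical critical routes:

```python
# Sketch of a post-deploy smoke test: hit a handful of critical routes and
# fail the pipeline (triggering rollback) if any of them misbehave.
# The base URL and routes are hypothetical.
import sys
import requests

BASE_URL = "https://app.example.com"
CRITICAL_ROUTES = ["/healthz", "/login", "/api/orders"]

failures = []
for route in CRITICAL_ROUTES:
    try:
        resp = requests.get(BASE_URL + route, timeout=5)
        if resp.status_code >= 500:
            failures.append(f"{route} -> {resp.status_code}")
    except requests.RequestException as exc:
        failures.append(f"{route} -> {exc}")

if failures:
    print("Smoke test failed:", failures)
    sys.exit(1)  # non-zero exit tells the pipeline to roll back

print("Smoke test passed.")
```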

5. Database Issues: Locks, Migrations, or Resource Limits

What Happens:
A long-running query locks a table. A schema migration forgets an index. Connection pools max out. Even a 1-second DB slowdown can ripple across services.

Prevention:

  • Test database migrations under load in pre-prod environments.

  • Add timeouts, retries, and connection pool monitoring (see the sketch after this list).

  • Index and tune queries with real-world data volume in mind.
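
Several of these guards can be set at the driver level. A sketch assuming SQLAlchemy and PostgreSQL (the connection URL and limits are placeholders, not a recommendation for every workload):

```python
# Sketch of bounding the database connection pool and capping query run time
# (SQLAlchemy + PostgreSQL assumed; URL and numbers are placeholders).
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://app:secret@db.internal/app",
    pool_size=10,        # steady-state connections
    max_overflow=5,      # temporary burst headroom
    pool_timeout=30,     # seconds to wait for a free connection before erroring
    pool_pre_ping=True,  # detect dead connections before handing them out
    connect_args={"options": "-c statement_timeout=5000"},  # kill queries over 5s
)
```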

6. Code Defects that Escape Testing

What Happens:
A conditional branch no one thought to test. A race condition in a rarely used API. Bugs that pass local and unit tests but surface under concurrency or scale.

Prevention:

  • Expand test coverage to include integration, contract, and performance tests.

  • Use chaos engineering and fault injection to surface edge case behavior (a simple fault-injection sketch follows this list).

  • Treat QA as a lifecycle-wide discipline, not just a pre-release gate.
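
Fault injection doesn't have to start with a full chaos platform. A toy sketch of the idea, wrapping a dependency so a configurable fraction of calls fail during tests (the names are illustrative):

```python
# Sketch of simple fault injection for tests: wrap a dependency call so a
# configurable fraction of calls raise, then assert the caller degrades
# gracefully instead of crashing.
import random

def flaky(fn, failure_rate=0.2):
    """Return a wrapper that randomly raises to simulate an unreliable dependency."""
    def wrapper(*args, **kwargs):
        if random.random() < failure_rate:
            raise ConnectionError("injected fault")
        return fn(*args, **kwargs)
    return wrapper

# In a test, patch the real client with flaky(real_call) and verify the system
# retries, falls back, or returns a friendly error rather than wedging.
```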

7. Fragile Dependencies

Symptoms:
A service suddenly stops responding to requests. No errors, no timeouts, no new server logs. The process is probably locked up, and a restart may fix it.

When a restart doesn't resolve the problem but a new error appears, the service may have a dependency that is also in a bad state. Sometimes you can determine this from the startup logging. A broader system reset is often the solution.

What Happens:
Distributed systems are inherently complex: microservices may depend on each other's startup sequences, carry untracked dependencies, or trigger cascading failures no one sees coming.

In some cases a log message may indicate a connection was lost, refused, timed out, or reset. Services not designed to handle these exceptions fail in unpredictable ways. Such failures are often infrequent, occurring once a month, once a quarter, or less, yet the impact can be severe and go unnoticed for hours. The system may appear to work properly in every way obvious to users, and no failure is displayed because some work naturally takes time to complete: after an order is placed, it still has to be fulfilled and shipped. Those final steps can quietly stop happening and go unnoticed precisely because they normally take a while.
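
Handling those dropped connections explicitly is usually straightforward. A hedged sketch, where connect() and publish() stand in for whatever client library is actually in use:

```python
# Sketch of explicit reconnect handling: detect the dropped connection, rebuild
# it, and retry instead of leaving the worker silently wedged.
# connect() and publish() are stand-ins for the real client library.
import time

def send_with_reconnect(connect, publish, message, retries=3):
    conn = connect()
    for attempt in range(retries):
        try:
            publish(conn, message)
            return
        except (ConnectionError, TimeoutError):
            time.sleep(2 ** attempt)  # brief backoff before rebuilding the connection
            conn = connect()          # reconnect rather than failing silently
    raise RuntimeError("could not publish after reconnect attempts")
```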

Prevention:

  • Use proven connection pooling.

  • Add exception handling that reconnects when connections are lost.

  • Maintain a clear service dependency graph.

  • Use health checks and dependency injection to manage startup order and readiness.

  • Keep systems observable, not just monitored—so you can trace root causes.

8. Human Error

What Happens:
Manual restarts. Fat-fingered CLI commands. Accidentally deleting the wrong S3 bucket. Some of the most devastating outages come from simple human mistakes.

Prevention:

  • Limit production access and require peer-reviewed infrastructure changes.

  • Provide "safe mode" tooling with guardrails (e.g., confirmation prompts, dry runs), as sketched after this list.

  • Embrace blameless postmortems and turn mistakes into systemic improvements.
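
As an example of the guardrail tooling mentioned above, a destructive script can default to a dry run and demand explicit confirmation before doing anything irreversible (the delete operation here is a hypothetical placeholder):

```python
# Sketch of a "safe mode" guard for destructive tooling: dry run by default,
# an explicit --force flag, and a typed confirmation before deleting anything.
import argparse

def delete_bucket(name):
    print(f"Deleting bucket {name}...")  # placeholder for the real destructive call

parser = argparse.ArgumentParser()
parser.add_argument("bucket")
parser.add_argument("--force", action="store_true", help="actually perform the delete")
args = parser.parse_args()

if not args.force:
    print(f"[dry run] Would delete bucket {args.bucket}. Re-run with --force to proceed.")
else:
    confirm = input(f"Type the bucket name to confirm deletion of {args.bucket}: ")
    if confirm == args.bucket:
        delete_bucket(args.bucket)
    else:
        print("Confirmation did not match; aborting.")
```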

Final Thoughts: Design for Failure

You can’t prevent every outage, but you can limit the impact. Building resilient systems means accepting failure as inevitable—and preparing for it with:

  • Observability-first design

  • Fail-safes and graceful degradation

  • Clear runbooks and practiced incident response

Most importantly, treat quality as a cross-functional goal, not a QA silo. Outage prevention is a team sport involving developers, testers, SREs, and product owners alike.

About the Author:
Jeff Persson is a QA and performance testing strategist with over two decades of experience building high-reliability systems. He specializes in cloud migrations, infrastructure testing, and continuous quality practices across DevOps pipelines.