Config Drift Detection

Configuration drift is one of the most common causes of "works on my machine" and "works in staging but not in prod" issues. This guide explains what it is, why it happens, and how to detect it before it causes an incident.

What Is Config Drift?

Configuration drift occurs when configuration files that should be consistent across environments gradually diverge over time. It's the silent accumulation of small differences — a timeout value changed in production but not staging, a feature flag enabled in dev but forgotten elsewhere, a database connection string updated in one environment but not the others.

These differences are rarely introduced intentionally. They build up through hotfixes applied directly to production, environment-specific overrides that were meant to be temporary, manual changes that bypassed version control, or merge conflicts that were resolved inconsistently.

Why Config Drift Is Dangerous

Config drift is particularly insidious because it's invisible until something breaks. Here are typical scenarios where drift causes incidents:

  • Connection pool exhaustion: Production has a max pool size of 10 while staging has 100. Load testing in staging passes, but production crashes under the same load.
  • Feature flag inconsistency: A feature is enabled in dev and staging for testing, but the flag never gets set in production. The deployment goes out but the feature doesn't work, leading to confusion and rollback.
  • Timeout mismatches: API gateway timeouts differ between environments. A service that responds within limits in staging times out in production due to a stricter threshold nobody knew about.
  • Security policy gaps: CORS settings, rate limits, or authentication requirements differ between environments, creating security vulnerabilities in production that don't exist in development.

The common thread is that testing passes in one environment but fails in another. The root cause is almost always a configuration difference that nobody was tracking.

Types of Configuration That Drift

Config drift can affect any type of configuration, but these are the most common culprits:

  • Application config — JSON/YAML files with feature flags, API URLs, timeouts, retry policies
  • Infrastructure config — Terraform state, Kubernetes manifests, CloudFormation templates
  • Database config — Connection strings, pool sizes, migration state
  • Service mesh config — Routing rules, circuit breakers, load balancer settings
  • CI/CD config — Build variables, deployment flags, environment-specific overrides
  • Security config — CORS policies, rate limits, authentication settings, TLS configuration

Detection Strategy: Manual Comparison

The most straightforward approach is periodic manual comparison of configuration files across environments. This works best for small teams whose configs change infrequently.

The process:

  1. Export configuration from each environment (dev, staging, production)
  2. Load all configs into a comparison tool
  3. Review differences and classify them as intentional or drift
  4. Fix any unintentional drift by syncing the correct values
  5. Document intentional differences (e.g., environment-specific URLs)

For JSON configuration files, PolyJSON is ideal for this workflow. You can load configs from all your environments and compare them simultaneously, rather than running separate two-file diffs and trying to correlate the results.
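For readers who want to see what a simultaneous comparison involves, here is a minimal Python sketch of the idea. It is not how PolyJSON works internally; the function names, the `<missing>` marker, and the sample configs are all assumptions made for illustration. It flattens each nested config into dotted keys and reports every key whose value is not identical in all environments.

```python
import json

MISSING = "<missing>"  # marker shown in the report when an environment lacks a key


def flatten(node, prefix=""):
    """Flatten nested dicts into dotted keys. Lists and scalars are kept as leaf
    values; empty nested dicts produce no keys."""
    if not isinstance(node, dict):
        return {prefix: node}
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def fingerprint(flat_config, key):
    """Comparable form of a value: None when the key is absent, otherwise its
    JSON text (so 1, 1.0 and true all count as different values)."""
    if key not in flat_config:
        return None
    return json.dumps(flat_config[key], sort_keys=True)


def find_differences(configs):
    """Return {dotted_key: {env: value}} for keys that differ in any environment."""
    flat = {env: flatten(cfg) for env, cfg in configs.items()}
    all_keys = sorted(set().union(*(f.keys() for f in flat.values())))
    differences = {}
    for key in all_keys:
        if len({fingerprint(f, key) for f in flat.values()}) > 1:
            differences[key] = {env: f.get(key, MISSING) for env, f in flat.items()}
    return differences


if __name__ == "__main__":
    # Hypothetical configs; in practice these would come from json.load() on exported files.
    configs = {
        "dev": {"database": {"pool_size": 100}, "gateway": {"timeout_seconds": 30}},
        "staging": {"database": {"pool_size": 100}, "gateway": {"timeout_seconds": 30}},
        "production": {"database": {"pool_size": 10}, "gateway": {"timeout_seconds": 30}},
    }
    for key, values in find_differences(configs).items():
        print(key, values)
```

Comparing all environments in one pass means each differing key appears once with every environment's value next to it, which is what makes step 3 (classifying a difference as intentional or drift) practical.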

Detection Strategy: Automated CI/CD Checks

For teams that need continuous monitoring, integrating drift detection into your CI/CD pipeline catches drift as soon as it's introduced rather than after it causes an incident.

Implementation approaches:

  • Schema validation: Define a schema for your config files and validate that all environments conform. JSON Schema is a common choice; a validator can enforce required keys, value types, and allowed values.
  • Diff scripting: Write a script that pulls configs from all environments and compares them programmatically. Flag unexpected differences and send alerts.
  • Config-as-code: Generate environment-specific configs from a single source of truth using templates. This limits drift by design, since all environment configs are derived from the same base, provided nobody edits the generated files by hand.
  • Periodic audits: Schedule a weekly or monthly job that compares configs across environments and generates a report. Review the report to catch drift early.
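As an illustration of the schema-validation approach, the sketch below checks that a config contains a set of required keys with the expected types. It is a hand-rolled, standard-library stand-in and not a JSON Schema validator; the schema contents, key names, and function names are hypothetical. A real pipeline would more likely use a JSON Schema library, but the shape of the check is the same: run it against every environment and fail the build if any problems are returned.

```python
# Hypothetical schema: dotted key -> expected Python type (or tuple of types).
SCHEMA = {
    "database.pool_size": int,
    "gateway.timeout_seconds": (int, float),
    "features.new_checkout": bool,
}


def lookup(config, dotted_key):
    """Walk a nested dict along a dotted path; raise KeyError if a segment is missing."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(dotted_key)
        node = node[part]
    return node


def type_names(expected):
    """Readable name for a type or tuple of types, e.g. 'int or float'."""
    types = expected if isinstance(expected, tuple) else (expected,)
    return " or ".join(t.__name__ for t in types)


def validate(config, schema=SCHEMA):
    """Return a list of problems; an empty list means the config conforms."""
    problems = []
    for key, expected in schema.items():
        try:
            value = lookup(config, key)
        except KeyError:
            problems.append(f"{key}: missing")
            continue
        # bool is a subclass of int in Python, so reject it explicitly for non-bool fields.
        wrong_bool = isinstance(value, bool) and expected is not bool
        if wrong_bool or not isinstance(value, expected):
            problems.append(
                f"{key}: expected {type_names(expected)}, got {type(value).__name__}"
            )
    return problems


def check_all(configs, schema=SCHEMA):
    """Validate every environment; return {env: problems} for the ones that fail."""
    report = {env: validate(cfg, schema) for env, cfg in configs.items()}
    return {env: problems for env, problems in report.items() if problems}
```

In a CI job, a non-empty result from `check_all` would typically be printed and turned into a non-zero exit code so the pipeline stops before the config is deployed.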

Intentional vs. Unintentional Differences

Not every difference across environments is drift. Some differences are intentional and expected:

  • Environment-specific URLs — Database hosts, API endpoints, CDN URLs
  • Credentials — Different secrets per environment (these should be in a secrets manager, not config files)
  • Performance tuning — Higher resource limits in production than in dev
  • Debug settings — Verbose logging in dev, minimal logging in production

The key practice is to document intentional differences. When you know which keys are expected to vary, you can focus your drift detection on the keys that should be identical. Some teams maintain an "allowlist" of keys that are permitted to differ, and flag everything else as potential drift.
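The allowlist idea can be sketched in a few lines of Python. This is one possible shape, not a prescribed format: the allowlist entries, the key names, and the assumption that configs have already been flattened to dotted keys are all illustrative. Keys that differ and match an allowlist pattern are reported as intentional; everything else that differs is reported as potential drift.

```python
from fnmatch import fnmatchcase

# Hypothetical allowlist: keys (or shell-style patterns) that are expected to differ.
ALLOWED_TO_DIFFER = ["database.host", "*.url", "logging.level"]


def is_allowed(key, allowlist):
    """True if the key matches any allowlist entry (case-sensitive glob match)."""
    return any(fnmatchcase(key, pattern) for pattern in allowlist)


def classify(baseline, other, allowlist=ALLOWED_TO_DIFFER):
    """Compare two flat configs (dotted key -> value).

    Returns (intentional, potential_drift): two sorted lists of keys whose values
    differ or that exist in only one of the configs.
    """
    intentional, potential_drift = [], []
    for key in sorted(baseline.keys() | other.keys()):
        if key in baseline and key in other and baseline[key] == other[key]:
            continue  # identical in both environments
        if is_allowed(key, allowlist):
            intentional.append(key)
        else:
            potential_drift.append(key)
    return intentional, potential_drift


if __name__ == "__main__":
    production = {
        "database.host": "db.prod.internal",
        "database.pool_size": 10,
        "api.url": "https://api.example.com",
        "logging.level": "warn",
    }
    staging = {
        "database.host": "db.staging.internal",
        "database.pool_size": 100,
        "api.url": "https://api.staging.example.com",
        "logging.level": "debug",
    }
    intentional, drift = classify(production, staging)
    print("intentional:", intentional)
    print("potential drift:", drift)
```

Keeping the allowlist in version control alongside the configs doubles as the documentation of intentional differences: adding a key to it is a reviewed, explained change.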

Prevention Best Practices

  1. Single source of truth. Use a config management system or templating approach where environment-specific values are parameterized, not hardcoded. All changes flow through the same pipeline.
  2. Version everything. Keep all configuration in version control. Every change should be a commit with a clear message explaining why the change was made.
  3. No manual changes to production. Require all config changes to go through your deployment pipeline. If a hotfix is needed, make it through the pipeline with an expedited review, not by editing files directly.
  4. Separate secrets from config. Use a secrets manager (like HashiCorp Vault or AWS Secrets Manager) for credentials, or at minimum inject them through environment variables rather than storing them in config files. This keeps your config files clean and comparable.
  5. Regular audits. Schedule periodic reviews of environment configs. Even with automation, human review catches logical issues that scripts miss.
  6. Immutable infrastructure. When possible, deploy new infrastructure instead of modifying existing instances. This removes most of the opportunity for configuration to drift over time, because running instances are replaced rather than edited.
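To make the single-source-of-truth practice (item 1) concrete, here is a minimal Python sketch that derives each environment's config from one shared base plus a small set of per-environment overrides. The base values, override values, and function names are hypothetical; a real setup might use a templating tool or config management system instead, but the principle is the same.

```python
import copy
import json

# Hypothetical shared base: values every environment inherits unless overridden.
BASE = {
    "database": {"host": "localhost", "pool_size": 20},
    "gateway": {"timeout_seconds": 30},
    "logging": {"level": "info"},
}

# Hypothetical per-environment overrides: only the values that are meant to differ.
OVERRIDES = {
    "dev": {"logging": {"level": "debug"}},
    "staging": {"database": {"host": "db.staging.internal"}},
    "production": {
        "database": {"host": "db.prod.internal"},
        "logging": {"level": "warn"},
    },
}


def deep_merge(base, override):
    """Return a new dict where override values win and nested dicts merge recursively.
    Neither input is modified."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def render(env):
    """Build the full config for one environment from BASE plus its overrides."""
    return deep_merge(BASE, OVERRIDES[env])


if __name__ == "__main__":
    for env in OVERRIDES:
        print(env, json.dumps(render(env), sort_keys=True))
```

With this layout, the override sections are the complete list of intentional differences, and any value missing from them is identical across environments by construction.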

Using PolyJSON for Drift Detection

PolyJSON is designed specifically for the kind of multi-file comparison that drift detection requires. Here's a practical workflow:

  1. Export your JSON config from each environment (dev, staging, production)
  2. Open PolyJSON and add each config as a separate file, named by environment
  3. Select your production config as the base (since it reflects what's actually running)
  4. Add dev and staging as comparison targets
  5. Review the highlighted differences — additions, removals, and changes are color-coded
  6. Use the "hide unchanged" option to focus on just the differences

Since PolyJSON processes everything in your browser, you can safely compare configs containing internal service names, API endpoints, and other sensitive-but-not-secret information without worrying about data leaving your machine.

Conclusion

Config drift is inevitable in any system with multiple environments, but catching it early prevents costly incidents. Combine automated detection in your CI/CD pipeline with periodic manual reviews using a multi-file comparison tool. Document intentional differences, investigate unexpected ones, and aim for a single source of truth wherever possible.

Ready to check your configs for drift? Open PolyJSON and compare your environment configurations side by side.