Introduction

In complex architectures that chain multiple proxies, certain configurations can expose subtle race conditions, most visibly as HTTP 503 errors when TCP keep-alive expectations differ between Envoy and its upstream services.

Scenario

Consider a chain of proxies:

Client → Envoy (Proxy A) → Envoy (Proxy B) → Upstream Service

When the two proxies' idle-timeout settings are mismatched, a subtle race condition can occur:

  1. Proxy A pulls a connection from its pool, believing it is active.
  2. Proxy B has closed this connection due to an idle timeout.
  3. Proxy A sends a request; the TCP connection resets.
  4. Envoy reports a 503 to the client—even though the upstream might have been healthy if a new connection had been established immediately.
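To make the mismatch concrete, here is a minimal sketch using the Envoy v3 API; the field names are real, but the timeout values and the split across the two proxies are illustrative assumptions:

# Proxy A, cluster pointing at Proxy B: pooled connections may sit idle for up to 60s
typed_extension_protocol_options:
  envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
    "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
    common_http_protocol_options:
      idle_timeout: 60s

# Proxy B, HTTP connection manager: closes idle downstream connections after 10s
common_http_protocol_options:
  idle_timeout: 10s

Any connection that sits idle for between 10 and 60 seconds is already closed on Proxy B's side while still looking reusable in Proxy A's pool.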

Mitigation Strategies

To reduce the risk of this 503 race condition:

  1. Align Keep-Alive Timeouts: Ensure all chained proxies have compatible idle-timeout and max-requests-per-connection settings. Envoy's upstream HTTP idle_timeout should be shorter than the upstream proxy's (first sketch below).
  2. Use TCP Health Checks: Enable active health checks so Envoy can detect dead upstreams before attempting connection reuse (second sketch below).
  3. Disable Aggressive Connection Reuse: Where upstream stability is uncertain, reduce connection reuse (e.g., via max_requests_per_connection, shown in the first sketch) or lower idle timeouts.
  4. Enable Retry Policies: Configure Envoy retries for idempotent requests to transparently recover from transient 503s caused by stale connections (third sketch below).
  5. Monitor TCP Resets: Observability into connection resets between proxies helps confirm when mismatched keep-alives are the root cause (see the stats query below).
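For point 1 (and the reuse cap from point 3), here is a sketch of an Envoy v3 cluster definition on Proxy A; the cluster name, address, and timeout values are hypothetical:

clusters:
- name: proxy_b
  connect_timeout: 1s
  type: STRICT_DNS
  load_assignment:
    cluster_name: proxy_b
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: proxy-b.internal
              port_value: 8080
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      common_http_protocol_options:
        idle_timeout: 30s                 # keep below Proxy B's idle timeout
        max_requests_per_connection: 100  # cap connection reuse (point 3)
      explicit_http_config:
        http_protocol_options: {}         # plain HTTP/1.1 upstream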
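For point 2, a connect-only TCP health check can be added to the same cluster; with no send/receive payload configured, Envoy only verifies that a TCP connection can be established:

  health_checks:
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 2
    healthy_threshold: 2
    tcp_health_check: {}

Note that this checks host health, not individual pooled connections, so it complements rather than replaces the timeout alignment above.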
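For point 4, a route-level retry policy; "reset" and "connect-failure" are the retry_on conditions that match a TCP reset on a stale connection (the route match and cluster name are hypothetical):

routes:
- match: { prefix: "/" }
  route:
    cluster: proxy_b
    retry_policy:
      retry_on: "reset,connect-failure"
      num_retries: 2
      per_try_timeout: 2s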
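For point 5, Envoy's admin endpoint exposes counters that spike when upstreams reset in-flight requests; the admin port (9901 here) depends on your bootstrap config:

curl -s http://localhost:9901/stats | grep -E 'upstream_cx_destroy_remote_with_active_rq|upstream_rq_pending_failure_eject'

A steadily climbing upstream_cx_destroy_remote_with_active_rq on the cluster pointing at Proxy B is a strong hint that this race is occurring.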

Checking the Effective HTTP Keep-Alive Timeout

Run tcpdump:

sudo tcpdump -i any -nn host 51.75.162.65 and port 443

Then look for the RST packet sent by the upstream proxy; the time elapsed between the last ACK and that RST is the effective idle timeout.
