Introduction
In complex architectures involving multiple chained proxies, certain configurations can expose subtle race conditions. One common symptom is HTTP 503 errors caused by a mismatch in TCP keep-alive expectations between Envoy and its upstream services.
Scenario
Consider a chain of proxies:
Client → Envoy (Proxy A) → Envoy (Proxy B) → Upstream Service
If Proxy A's upstream idle timeout is longer than the idle timeout Proxy B applies to the connections it accepts, a subtle race condition can occur:
- Proxy A pulls a connection from its pool, believing it is active.
- Proxy B has closed this connection due to an idle timeout.
- Proxy A sends a request; the TCP connection resets.
- Proxy A reports a 503 to the client, even though the upstream might have been healthy if a new connection had been established immediately.
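A minimal sketch of such a mismatch, using hypothetical names and values: Proxy B closes idle downstream connections after 60 seconds via its HTTP connection manager, while Proxy A never sets an upstream idle timeout, so its pooled connections fall back to Envoy's one-hour default and routinely outlive their peer:

# Proxy B: fragment of its HttpConnectionManager configuration.
# Downstream connections idle for 60s are closed by Proxy B.
common_http_protocol_options:
  idle_timeout: 60s

# Proxy A: fragment of the cluster that points at Proxy B. No upstream
# idle_timeout is set, so the 1-hour default applies and a pooled
# connection can sit in Proxy A's pool long after Proxy B has closed it.
clusters:
- name: proxy_b_cluster                  # hypothetical cluster name
  connect_timeout: 1s
  type: STRICT_DNS
  load_assignment:
    cluster_name: proxy_b_cluster
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: proxy-b.internal, port_value: 443 }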
Mitigation Strategies
To reduce the risk of this 503 race condition:
- Align Keep-Alive Timeouts: Ensure all chained proxies have compatible idle timeout and max-requests-per-connection settings. In particular, Envoy's upstream HTTP idle_timeout should be shorter than the idle timeout of the proxy it connects to, so Proxy A abandons a connection before Proxy B can close it (see the first sketch after this list).
- Use TCP Health Checks: Enable active health checks so Envoy detects unreachable upstream hosts before routing new requests to them (second sketch).
- Disable Aggressive Connection Reuse: In cases where upstream stability is uncertain, reducing connection reuse or lowering idle timeouts can help.
- Enable Retry Policies: Configure Envoy retries for idempotent requests to transparently recover from transient 503s caused by stale connections (third sketch).
- Monitor TCP Resets: Observability into connection resets between proxies can help identify when mismatched keep-alives are the root cause. In Envoy, the per-cluster counter upstream_cx_destroy_remote_with_active_rq (connections closed by the remote peer while a request was in flight) is a useful signal.
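A sketch of the timeout alignment on Proxy A, using the typed_extension_protocol_options form of cluster HTTP options; the 50s and 1000 values are hypothetical and simply need to stay below whatever limits the next hop enforces:

clusters:
- name: proxy_b_cluster
  connect_timeout: 1s
  type: STRICT_DNS
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      explicit_http_config:
        http_protocol_options: {}          # plain HTTP/1.1 to the next hop
      common_http_protocol_options:
        idle_timeout: 50s                  # below Proxy B's 60s downstream timeout
        max_requests_per_connection: 1000  # recycle connections periodically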
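For active health checking, a connect-only TCP health check on the same cluster might look like the following; the intervals and thresholds are hypothetical:

  health_checks:
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 2
    healthy_threshold: 2
    tcp_health_check: {}                   # empty payload: success = TCP connect

Note that this guards against hosts that have gone away entirely; it does not validate individual pooled connections, so it complements rather than replaces timeout alignment.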
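And a sketch of a route-level retry policy; retry_on: reset covers the case where a pooled connection turns out to be dead, and the counts and timeouts are hypothetical. Keep this limited to idempotent traffic if any endpoints have side effects:

  routes:
  - match: { prefix: "/" }
    route:
      cluster: proxy_b_cluster
      retry_policy:
        retry_on: reset,connect-failure
        num_retries: 2
        per_try_timeout: 2s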
Checking the Real HTTP Keep-Alive Timeout
Run tcpdump against the upstream proxy's address (51.75.162.65 in this example):
sudo tcpdump -i any -nn host 51.75.162.65 and port 443
Then look for the RST packet sent by the upstream proxy and subtract the timestamp of the last ACK from the timestamp of the RST; the difference is the upstream's real idle timeout. For example, if the RST consistently arrives 60 seconds after the last ACK, the upstream closes idle connections after 60 seconds, and your own upstream idle_timeout should be set below that.