A tale of a costly ride with the Application Load Balancer

TL;DR: Due to questionable behavior in the Application Load Balancer (ALB), we were charged an extra $25K. The scenario is easily reproducible, and you could get a (very) large bill too. As of this writing there is no fix available in ALB (short of not using ALB at all) if your environment is vulnerable to this issue: the ALB receives, and charges you for, incoming data even after you close the TCP connection. 

If you are doing any of the below, there is a high chance that you will be billed for more than you use! 

  1. Your app responds to an HTTP request and closes the TCP connection. You expect the ALB to stop accepting data on that closed TCP connection on your behalf, and you expect no further charges for incoming data on it. 
  2. Your app looks at incoming HTTP request headers and rejects requests early, for example by inspecting the Host or X-Forwarded-For header and responding immediately with a 4xx. 
  3. Your app looks at the incoming HTTP request's Content-Length header (or tracks the body size dynamically for chunked transfer encoding) and responds with 413 when it crosses the allowed payload threshold (a minimal sketch of this early-rejection pattern follows this list). 
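
To make items 2 and 3 concrete, here is a minimal sketch (not our production setup) of a backend that reads only the headers, rejects an oversized request with 413 and Connection: close, and then closes the connection. The port and the 1 MB threshold are illustrative assumptions; the point of this article is that an ALB in front of such a backend keeps accepting the client's bytes anyway.

```python
import socket

MAX_BODY = 1_000_000  # allowed payload threshold (illustrative assumption)

RESPONSE_413 = (
    b"HTTP/1.1 413 Payload Too Large\r\n"
    b"Connection: close\r\n"
    b"Content-Length: 0\r\n\r\n"
)

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", 8080))
    srv.listen()
    while True:
        conn, _ = srv.accept()
        with conn:
            data = b""
            while b"\r\n\r\n" not in data:  # read up to the end of the headers only
                chunk = conn.recv(4096)
                if not chunk:
                    break
                data += chunk
            head = data.split(b"\r\n\r\n", 1)[0].decode("latin-1", "replace")
            length = 0
            for line in head.split("\r\n")[1:]:  # skip the request line
                name, _, value = line.partition(":")
                if name.strip().lower() == "content-length" and value.strip().isdigit():
                    length = int(value.strip())
            if length > MAX_BODY:
                conn.sendall(RESPONSE_413)  # reject before reading the body
            # Leaving the `with` block closes the connection either way;
            # handling of accepted requests is omitted from this sketch.
```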

Introduction 

It all started with us getting an additional $25K ALB bill out of nowhere. We use hundreds of ALBs at Freshworks, and one of them ended up with this unusually large bill. 

So we started digging into what contributed to this bill. 

Our setup

We started by looking into the various components of the (complex) ALB LCU calculation and arrived at the one that looked out of place with respect to previous bills: ProcessedBytes

Typically we were dealing with ~1TB per day but it shot up to ~118TB! 
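
If you want to check this for your own load balancers, here is a minimal boto3 sketch (assuming credentials are configured; the LoadBalancer dimension value is a placeholder) that pulls daily ProcessedBytes totals from CloudWatch:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
start = end - timedelta(days=14)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="ProcessedBytes",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"}],
    StartTime=start,
    EndTime=end,
    Period=86400,          # one data point per day
    Statistics=["Sum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    # Decimal terabytes, just for a quick day-by-day comparison
    print(point["Timestamp"].date(), f'{point["Sum"] / 1e12:.2f} TB')
```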

We were surprised by this since we have internal metrics and alerts configured for this at the HAProxy level, so it should have been caught early. To see if we had missed something, we checked our internal metrics at HAProxy, and they did not agree with the ALB's ProcessedBytes metric. 

To narrow it down further, we looked at the ALB access logs, specifically the received_bytes field, to re-confirm. To our surprise, it did not match the ALB's ProcessedBytes metric but did match our internal metric. 
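
As a rough illustration of that re-confirmation, here is a small Python sketch that tallies received_bytes per ELB status code from a downloaded, already-decompressed access log file ("alb-access.log" is a placeholder). It assumes the standard ALB access log layout, where elb_status_code and received_bytes are the 9th and 11th space-separated fields, both of which come before any quoted fields.

```python
from collections import defaultdict

totals = defaultdict(int)

with open("alb-access.log") as log:
    for line in log:
        fields = line.split()
        if len(fields) < 12 or not fields[10].isdigit():
            continue
        status = fields[8]            # elb_status_code
        totals[status] += int(fields[10])  # received_bytes

for status, received in sorted(totals.items()):
    print(f"{status}: {received / 1e9:.2f} GB received")
```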

Now we knew that the ALB processes more bytes than it forwards to the backend, which is what led to the increased bill. What we didn't know yet was how this was happening and whether there was something we could do to avoid it. 

Unfortunately, the ALB's ProcessedBytes CloudWatch metric cannot be correlated with the ALB access log entries directly, so we had to think of another way to find out. 

We knew that a request has to be accepted by the backend for it to be processed further, so we started looking at the requests we were rejecting with 4xx statuses, filtered by the time at which the increase in the ProcessedBytes metric began. We came upon a pattern: for a specific customer, we were rejecting certain kinds of requests at regular intervals, starting exactly in the problematic date/time window. 

Looking at our routing/rate limiting rules at HAProxy: 

  • The customer was being throttled for one specific API call, but we do that as soon as we receive the request headers (i.e., as soon as we can look at the Host and path values, we respond with 429) 
  • The bytes_received and bytes_sent values in HAProxy confirmed this: we had received just a few hundred bytes before responding with 429. This was also confirmed by the received_bytes field in the ALB access logs for this customer's problematic requests. 
  • We also send a Connection: close header as part of the 429 response, so the underlying TCP connection was completely closed. 

So we wondered: could it be that the ALB continued to receive data even after we responded and closed the connection? To find out, we ran a simple experiment in a controlled environment: 

  • Set up an ALB to forward requests to HAProxy, which will immediately respond with 429 (and close the connection). 
  • Write a client that continues to send data regardless of the response it gets, and see if the ALB allows it. 

Client: Assuming that the ALB's public IP is 1.2.3.4, the client sends data to it until it has written 60,000,000 bytes or its TCP connection gets closed. 
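
A minimal Python sketch of a client with this behavior; the address 1.2.3.4:80, the Host header, the path, and the chunk size are placeholders:

```python
import socket

TARGET = ("1.2.3.4", 80)   # placeholder ALB address
LIMIT = 60_000_000
CHUNK = b"x" * 16_384

request_head = (
    b"POST /some/api HTTP/1.1\r\n"
    b"Host: example.test\r\n"
    b"Content-Length: 60000000\r\n"
    b"Content-Type: application/octet-stream\r\n\r\n"
)

sock = socket.create_connection(TARGET)
sock.sendall(request_head)

sent = 0
try:
    while sent < LIMIT:
        # The ALB keeps ACKing these bytes even after the backend has responded
        # and closed its side, so sendall() keeps succeeding for a while.
        sock.sendall(CHUNK)
        sent += len(CHUNK)
except (BrokenPipeError, ConnectionResetError):
    pass  # the ALB finally tore the connection down (after ~30 seconds in our test)
finally:
    print(f"bytes sent before the connection was closed: {sent}")
    sock.close()
```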

HAProxy server side packet capture 

  • We received the request in frame 1 and responded in frame 11 (total time ~1ms) 
  • We properly closed the connection (i.e., sent Connection: close, followed by a half-close, followed by a full close) 
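
For comparison with the capture, the same close choreography can be sketched on a plain Python socket that has already accepted the request. This is illustrative only, not the HAProxy configuration we actually run:

```python
import socket

RESPONSE_429 = (
    b"HTTP/1.1 429 Too Many Requests\r\n"
    b"Connection: close\r\n"
    b"Content-Length: 0\r\n\r\n"
)

def reject_and_close(conn: socket.socket) -> None:
    """Respond with 429, then tear the connection down: close header, half-close, full close."""
    conn.sendall(RESPONSE_429)      # response carries Connection: close
    conn.shutdown(socket.SHUT_WR)   # half-close: send FIN, no more writes from our side
    conn.close()                    # full close of our end of the connection
```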

Client side packet capture 

Client capture 1: Request sent in frame 4 

Client capture 2: Response received in frame 287 

Client capture 3: For the next ~30 seconds, data transmission was allowed 

  • As seen in frame 287, as soon as the client sent the request, a response was received 
  • But the client was allowed to keep sending data (since the ALB continued to ACK it) 
  • As seen in frame 4661, after ~30 seconds the ALB closed the connection and the client got a TCP RST 
  • In that ~30-second window, the client was allowed to send as much data as it could, which was received and discarded by the ALB, but we were charged for it!   

Summary 

  1. The customer connects to the ALB over HTTP/1.1 
  2. Starts sending a request 
  3. The ALB receives it, chooses a backend instance foo, and starts forwarding 
  4. Instance foo starts receiving the request 
    1. As soon as it receives the headers (and possibly some payload), it decides (by looking at the Host header, etc.) to reject the request 
    2. It sends a 429 response with a Connection: close header 
    3. Then it half-closes the connection (i.e., shuts down the write end of its side) 
    4. Then it does a full close 
    5. At this point, the TCP connection between the ALB and instance foo is completely closed. 
  5. The ALB forwards the response as-is to the customer 
  6. The ALB half-closes the connection to the customer (for which it receives an ACK) 
    1. At this point, this TCP connection is practically gone; it cannot be used for further requests 
  7. The customer continues to send data to the ALB 
    1. The ALB continues to receive and throws away all the received data (but we get charged for it!) 
    2. This continues for ~30 seconds 
  8. As a result, we were charged $25K 

Resolution 

There is no solution available from AWS/ALB for this problem. So if you are using an ALB, you could get hit by this at any time; beware of it. 

What other possible interim options are there? 

  • Configure a CloudWatch alarm for ALB LCU surges to identify problematic clients/requests (a minimal sketch follows this list).
  • If the problematic client(s) come from a few IPs, security group rules can be created to block those IPs. But beware that security groups are stateful, and having too many rules impacts performance.
    • We couldn't do this, as these problematic requests came from a wide range of IPs.
  • Another option is to blackhole the customer at the DNS/Route 53 level if there is a dedicated subdomain for the customer (as we have) 
    • The downside is that all traffic for this customer gets blocked. In our case, the customer had shipped a faulty client that was doing this only for one specific API request/path, but we had no option other than blocking all of the customer's traffic.
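
As a rough illustration of the first option, here is a minimal boto3 sketch that alarms on a sustained surge in consumed LCUs for one ALB. The alarm name, threshold, SNS topic ARN, and LoadBalancer dimension are placeholders you would tune to your own baseline:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="alb-lcu-surge",
    Namespace="AWS/ApplicationELB",
    MetricName="ConsumedLCUs",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"}],
    Statistic="Sum",
    Period=300,                      # 5-minute windows
    EvaluationPeriods=3,             # require a sustained surge, not a single spike
    Threshold=1000.0,                # placeholder: pick from your own baseline
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alb-alerts"],  # placeholder topic
)
```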

Another suggestion we received from the AWS ALB team was to use an NLB in TLS mode, which could have solved this problem, but unfortunately the NLB does not support WAF or any other layer 7 capabilities. We are in a catch-22 situation and are waiting for the AWS ALB team to fix this behavior.