In this post I will detail the investigation process I followed while responding to a request backlog alert.
High level overview of the alerting service
flowchart TD
HAProxy --> App --> Storage
I’ve redacted any identifiable information; however, the methodology remains the same.
Frontend (HAProxy)
The halog tool can generate lots of useful reports. Adjust the time slice so you have a decent buffer before and after the time of the incident.
The snippet below will generate a series of reports:
- URL averages
- URL total times OK averages
- Connect/response histogram 1
- Per-server statistics
The -e flag filters by errors.
sudo cat /var/log/haproxy.log|halog -time "$(date '+%s' -d '6 hour ago')" -H -ua|awk 'NR==1;NR>1{print $0 | "sort -rnk4"}'|head -n25|column -t
sudo cat /var/log/haproxy.log|halog -e -time "$(date '+%s' -d '6 hour ago')" -H -uto|awk 'NR==1;NR>1{print $0 | "sort -rnk4"}'|head -n25|column -t
sudo cat /var/log/haproxy.log|halog -time "$(date '+%s' -d '6 hour ago')" -pct -q
sudo cat /var/log/haproxy.log|halog -time "$(date '+%s' -d '6 hour ago')" -H -srv|awk 'NR==1;NR>1{print $0 | "sort -rnk12"}'|column -t
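If you already know the incident window you can bound the -time filter on both sides instead of using an open-ended start. A minimal sketch, assuming -time accepts an epoch min:max range; the timestamps are placeholders:
# Restrict halog to a fixed window around the incident (placeholder times)
start=$(date '+%s' -d '2022-03-22 09:00')
end=$(date '+%s' -d '2022-03-22 11:00')
sudo cat /var/log/haproxy.log|halog -time "${start}:${end}" -H -srv|column -t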
Application server
Capture 5 minutes’ worth of TCP header data. Limit the pcap file size to 500M. Absolute sequence numbers 2.
# Set to the interface to capture on, e.g. intf=eth0
intf=
timeout 300s tcpdump -i "${intf?}" -vvvSs 1500 -C500 -w cap_5m.pcap 'tcp && net 10.0.0.0/16'
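Before diving into analysis it’s worth sanity-checking the capture; capinfos ships with Wireshark alongside tshark:
# Print a summary of the capture file: packet count, duration and average rates
capinfos cap_5m.pcap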
Analyse the data with tshark
> cap.analysis \
tshark -r cap_5m.pcap -t ud \
  -z endpoints,tcp \
  -z expert,comment,tcp \
  -T fields \
  -e _ws.col.Time \
  -e ip.src \
  -e ip.dst \
  -e tcp.dstport \
  -e _ws.expert.message \
  -e _ws.expert.severity \
  -e _ws.expert.group \
  -e _ws.col.Info
Identify any problematic flows from the cap.analysis file tshark generated. In my case, I observed a high rate of TCP Window Full events on the flows towards the storage server, indicating that the application was pushing data faster than the server could consume it, or that the server was suffering from systemic overload.
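A quick way to see which expert messages dominate is to tally the message field. A small helper, assuming the default tab separator tshark uses for -T fields output (field 5 is _ws.expert.message):
# Count occurrences of each expert message and show the most frequent ones
awk -F'\t' '$5!=""{count[$5]++} END{for(m in count) print count[m], m}' cap.analysis|sort -rn|head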
You can confirm your theory by monitoring TCP statistics over time.
System-level TCP stats:
while sleep 300; do stdbuf -o0 nstat -p; done | tee nstat.sout
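In the nstat samples the counters I watch for this class of problem are the retransmission and timeout related ones; exact names vary slightly between kernel versions:
# Pull retransmission/timeout counters out of the periodic samples
grep -E 'TcpRetransSegs|TcpExtTCPTimeouts|TcpExtTCPLossProbes' nstat.sout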
Endpoint-level statistics:
ip -s tcpmetrics provides similar data, but I prefer the socket stats tool ss.
for i in {0..300}; do
  echo ===; ss -in state synchronized 'dst 10.0.0.10'; echo ===
  sleep 1
done | tee sockets2.sout
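The gawk one-liner below picks out the established sockets towards 10.0.0.x and reshapes each socket’s metrics line into aligned key/value pairs: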
awk '/ESTAB/&&/10.0.0/&&and($4,$5){print "";print;print "";getline;l=patsplit($0,f,/[^:[:blank:]]*/);for(i=1;i<=l;i+=2) printf "%24s %24s\n", f[i],f[i+1]}' sockets2.sout
From the output I observed the following: the application server is missing acknowledgements and backing off from sending data (note the backoff 1 counter in the second socket sample below). The sockets also spend over 97% of their busy time limited by the receiver’s window (rwnd_limited), which lines up with the TCP Window Full events seen in the capture.
rto 208
rtt 4.714/0.643
ato 40
mss 1374
pmtu 1426
rcvmss 536
advmss 1374
cwnd 143
send 333.4Mbps
pacing_rate 400.1Mbps
delivery_rate 257.2Mbps
busy 11672ms
rwnd_limited 11388ms(97.6%)
unacked 126
retrans 0/12
reordering 74
notsent 1051110
minrtt 0.993

rto 216
backoff 1
rtt 14.491/16.568
ato 40
mss 1374
pmtu 1426
rcvmss 536
advmss 1374
cwnd 143
send 108.5Mbps
pacing_rate 130.2Mbps
delivery_rate 3.3Mbps
busy 12680ms
rwnd_limited 12360ms(97.5%)
retrans 0/16
reordering 74
notsent 1023662
minrtt 0.993

rto 208
rtt 5.519/0.312
ato 40
mss 1374
pmtu 1426
rcvmss 536
advmss 1374
cwnd 217
send 432.2Mbps
pacing_rate 518.6Mbps
delivery_rate 254.7Mbps
busy 15700ms
rwnd_limited 15324ms(97.6%)
unacked 133
retrans 0/17
reordering 74
notsent 1719216
minrtt 0.993
We can get richer statistics by analysing the packet capture with tshark. The output is easily grep’d. My favourite statistics report is the io,stat table: you pass it packet fields to aggregate over and it produces a configurable histogram. In the example below I created a per-second packet rate and window-size histogram.
tshark -r cap_5m.pcap -q -z http_srv,tree
tshark -r cap_5m.pcap -q -z http_srv,tree -z conv,tcp
grep -P 'Window Full|PUT' cap.analysis|head;
tshark -n -q -r *.pcap -z io,stat,1,"COUNT(tcp.window_size_value)tcp.window_size_value and ip.addr==10.0.0.10 && tcp.dstport == 8080","MIN(tcp.window_size_value)tcp.window_size_value and ip.addr==10.0.0.10 && tcp.dstport == 8080","MAX(tcp.window_size_value)tcp.window_size_value and ip.addr==10.0.0.10 && tcp.dstport == 8080","AVG(tcp.window_size_value)tcp.window_size_value and ip.addr==10.0.0.10 && tcp.dstport == 8080" |grep -v ' 0 '
2022-03-22 09:45:08.845938 10.0.1.100 10.0.0.10 8080 PUT /api/resource HTTP/1.1\r\n 2097152 33554432 PUT /api/resource HTTP/1.1
2022-03-22 09:45:09.116693 10.0.1.100 10.0.0.10 8080 PUT /api/resource HTTP/1.1 [TCP segment of a reassembled PDU]
2022-03-22 09:45:09.245875 10.0.1.100 10.0.0.10 8080 TCP window specified by the receiver is now completely full 6291456 33554432 [TCP Window Full] Continuation[Packet size limited during capture]
2022-03-22 09:45:09.250035 10.0.1.100 10.0.0.10 8080 TCP window specified by the receiver is now completely full 6291456 33554432 [TCP Window Full] Continuation[Packet size limited during capture]
2022-03-22 09:45:09.250133 10.0.1.100 10.0.0.10 8080 TCP window specified by the receiver is now completely full 6291456 33554432 [TCP Window Full] Continuation[Packet size limited during capture]
2022-03-22 09:45:09.253985 10.0.1.100 10.0.0.10 8080 TCP window specified by the receiver is now completely full 6291456 33554432 [TCP Window Full] Continuation[Packet size limited during capture]
2022-03-22 09:45:09.255907 10.0.1.100 10.0.0.10 8080 TCP window specified by the receiver is now completely full 6291456 33554432 [TCP Window Full] Continuation[Packet size limited during capture]
2022-03-22 09:45:09.257952 10.0.1.100 10.0.0.10 8080 TCP window specified by the receiver is now completely full 6291456 33554432 [TCP Window Full] Continuation[Packet size limited during capture]
2022-03-22 09:45:09.262310 10.0.1.100 10.0.0.10 8080 TCP window specified by the receiver is now completely full 6291456 33554432 [TCP Window Full] Continuation[Packet size limited during capture]
2022-03-22 09:45:09.268328 10.0.1.100 10.0.0.10 8080 TCP window specified by the receiver is now completely full 6291456 33554432 [TCP Window Full] Continuation[Packet size limited during capture]
======================================================================================================
| IO Statistics |
| |
| Duration: 270.938461 secs |
| Interval: 1 secs |
| |
| Col 1: COUNT(tcp.window_size_value)tcp.window_size_value and ip.addr==10.0.0.10 && tcp.dstport |
| == 8080 |
| 2: MIN(tcp.window_size_value)tcp.window_size_value and ip.addr==10.0.0.10 && tcp.dstport == |
| 8080 |
| 3: MAX(tcp.window_size_value)tcp.window_size_value and ip.addr==10.0.0.10 && tcp.dstport == |
| 8080 |
| 4: AVG(tcp.window_size_value)tcp.window_size_value and ip.addr==10.0.0.10 && tcp.dstport == |
| 8080 |
|----------------------------------------------------------------------------------------------------|
| |1 |2 |3 |4 | |
| Interval | COUNT | MIN | MAX | AVG | |
|--------------------------------------------| |
| 9 <> 10 | 898 | 508 | 65142 | 651 | |
| 10 <> 11 | 849 | 508 | 508 | 508 | |
| 11 <> 12 | 5 | 507 | 508 | 507 | |
| 24 <> 25 | 6 | 508 | 65142 | 11280 | |
| 25 <> 26 | 51 | 507 | 65142 | 3042 | |
| 32 <> 33 | 6 | 486 | 62314 | 10791 | |
| 33 <> 34 | 91 | 486 | 62314 | 1844 | |
| 34 <> 35 | 5 | 485 | 486 | 485 | |
| 38 <> 39 | 6 | 508 | 65142 | 11280 | |
| 47 <> 48 | 121 | 508 | 65142 | 2110 | |
| 48 <> 49 | 80 | 507 | 508 | 507 | |
| 56 <> 57 | 4 | 508 | 65142 | 16667 | |
| 57 <> 58 | 22 | 507 | 65142 | 6383 | |
| 82 <> 83 | 6 | 508 | 65142 | 11280 | |
| 83 <> 84 | 43 | 507 | 65142 | 3514 | |
| 98 <> 99 | 6 | 508 | 65142 | 11280 | |
| 99 <> 100 | 42 | 507 | 65142 | 3585 | |
| 100 <> 101 | 6 | 508 | 65142 | 11280 | |
| 102 <> 103 | 7 | 508 | 65142 | 18975 | |
| 103 <> 104 | 164 | 507 | 508 | 507 | |
| 105 <> 106 | 13 | 508 | 65142 | 15424 | |
| 106 <> 107 | 34 | 507 | 508 | 507 | |
| 115 <> 116 | 6 | 508 | 65142 | 11280 | |
| 117 <> 118 | 556 | 508 | 65142 | 740 | |
| 118 <> 119 | 5 | 507 | 508 | 507 | |
| 127 <> 128 | 6 | 508 | 65142 | 11280 | |
| 128 <> 129 | 4 | 508 | 65142 | 16667 | |
| 129 <> 130 | 501 | 508 | 65142 | 637 | |
| 131 <> 132 | 5 | 507 | 508 | 507 | |
| 159 <> 160 | 6 | 508 | 65142 | 11280 | |
| 161 <> 162 | 4 | 508 | 65142 | 16667 | |
| 162 <> 163 | 149 | 507 | 65142 | 941 | |
| 187 <> 188 | 4 | 508 | 65142 | 16667 | |
| 188 <> 189 | 2 | 508 | 508 | 508 | |
| 225 <> 226 | 10 | 508 | 65142 | 13435 | |
| 226 <> 227 | 76 | 507 | 65142 | 1358 | |
| 240 <> 241 | 6 | 508 | 65142 | 11280 | |
| 242 <> 243 | 444 | 508 | 65142 | 799 | |
| 243 <> 244 | 7 | 508 | 65142 | 18975 | |
| 244 <> 245 | 898 | 507 | 508 | 507 | |
| 245 <> 246 | 496 | 508 | 508 | 508 | |
| 246 <> 247 | 516 | 508 | 508 | 508 | |
| 247 <> 248 | 496 | 508 | 65142 | 638 | |
| 248 <> 249 | 482 | 508 | 65142 | 776 | |
| 249 <> 250 | 485 | 507 | 508 | 507 | |
| 250 <> 251 | 602 | 508 | 508 | 508 | |
| 251 <> 252 | 629 | 508 | 508 | 508 | |
| 252 <> 253 | 664 | 508 | 508 | 508 | |
| 253 <> 254 | 519 | 508 | 508 | 508 | |
| 254 <> 255 | 719 | 508 | 508 | 508 | |
| 255 <> 256 | 646 | 508 | 508 | 508 | |
| 256 <> 257 | 764 | 508 | 508 | 508 | |
| 257 <> 258 | 376 | 508 | 508 | 508 | |
| 260 <> 261 | 11 | 507 | 65142 | 6383 | |
| 263 <> 264 | 51 | 508 | 65142 | 4310 | |
| 264 <> 265 | 19 | 507 | 508 | 507 | |
| 267 <> 268 | 3 | 509 | 65142 | 22053 | |
| 268 <> 269 | 7 | 508 | 65142 | 9741 | |
| 269 <> 270 | 80 | 507 | 65142 | 2123 | |
| 270 <> Dur | 17 | 507 | 65142 | 8112 | |
======================================================================================================
Next steps
At this point it’s best to raise a bug with the developers. In the meantime, spawning additional application workers to spread the queue around may help grease the wheels a little.
Long term, we should confirm the retransmissions aren’t caused by poor network conditions. The application could also be patched to hold a long-lived connection open while pushing data into object storage; it currently creates a new session per push. Longer connection lifetimes would give the TCP congestion-control heuristics time to stabilise.
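As a rough illustration of the difference (not the application’s actual code), curl will reuse a single TCP connection when several uploads are issued from one invocation, whereas running it once per object opens a fresh connection each time; the object names below are made up:
# One connection per object, similar to the current session-per-push behaviour
curl -T obj1.bin http://10.0.0.10:8080/api/resource/1
curl -T obj2.bin http://10.0.0.10:8080/api/resource/2
# One long-lived connection reused for both uploads
curl -T obj1.bin http://10.0.0.10:8080/api/resource/1 \
     -T obj2.bin http://10.0.0.10:8080/api/resource/2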
1: Columns of the percentile report: <percentile> <request count> <Request Time*> <Connect Time**> <Response Time***> <Data Time****>