2022-03-22 Application Performance Analysis

In this post I will detail the investigation process I followed while responding to a request backlog alert.

High level overview of the alerting service

flowchart TD
    HAProxy --> App --> Storage

I’ve redacted any identifiable information, however the methodolgy remains the same.

Frontend (HAProxy)

The halog tool can generate lots of useful reports. Adjust the time slice so you have a decent buffer before and after the time of the incident.

The snippet below will generate a series of reports:

  • URL averages
  • URL total times OK averages
  • Connect/response histogram 1

The -e flag filters by errors.

sudo cat /var/log/haproxy.log|halog -time "$(date '+%s' -d '6 hour ago')" -H -ua|awk 'NR==1;NR>1{print $0 | "sort -rnk4"}'|head -n25|column -t
sudo cat /var/log/haproxy.log|halog -e -time "$(date '+%s' -d '6 hour ago')" -H -uto|awk 'NR==1;NR>1{print $0 | "sort -rnk4"}'|head -n25|column -t
sudo cat /var/log/haproxy.log|halog -time "$(date '+%s' -d '6 hour ago')" -pct -q
sudo cat /var/log/haproxy.log|halog  -time "$(date '+%s' -d '6 hour ago')" -H -srv|awk 'NR==1;NR>1{print $0 | "sort -rnk12"}'|column -t

Application server

Capture 5 minutes worth of TCP header data. Limit pcap file size to 500M. Absolute sequence numbers 2 .

intf=
timeout 300s tcpdump -i "${intf?}" -vvvSs 1500 -C500 -w cap_5m.pcap 'tcp && net 10.0.0.0/16'

Analyse the data with tshark

> cap.analysis \
  tshark -r cap_5m.pcap -t ud \
    -z \
    endpoints,tcp \
    -z \
    expert,comment,tcp \
    -T \
    fields \
    -e \
    _ws.col.Time \
    -e \
    ip.src \
    -e \
    ip.dst \
    -e \
    tcp.dstport \
    -e \
    _ws.expert.message \
    -e \
    _ws.expert.severity \
    -e \
    _ws.expert.group \
    -e \
    _ws.col.Info

Identify any problematic flows from the cap.analysis file tshark generated. In my case, I observed a high rate of TCP Window Full responses from the server, indicating that the application was pushing too much data to it, or the server may be having systemic overload issues.

You can confirm your theory by monitoring tcp statistics over time.

System level tcp stats.

while sleep 300; stdbuf -o0 nstat -p; done tee nstat.sout

Endpoint level statistics

ip -s tcpmetrics provides similar data, I prefer the socket stats tool ss.

for i in {0..300}; do
  echo === ; ss -in state synchronized 'dst 10.0.0.10 '; echo ===
  sleep 1
done | tee sockets2.sout
awk '/ESTAB/&&/10.0.0/&&and($4,$5){print "";print;print "";getline;l=patsplit($0,f,/[^:[:blank:]]*/);for(i=1;i<=l;i+=2) printf "%24s %24s\n", f[i],f[i+1]}' sockets2.sout

From the output I observed the following:

The application server misses too many acknowledgements and backs off from sending data (backoff 1).

                     rto                      208
                     rtt              4.714/0.643
                     ato                       40
                     mss                     1374
                    pmtu                     1426
                  rcvmss                      536
                  advmss                     1374
                    cwnd                      143
                    send                333.4Mbps
             pacing_rate                400.1Mbps
           delivery_rate                257.2Mbps
                    busy                  11672ms
            rwnd_limited           11388ms(97.6%)
                 unacked                      126
                 retrans                     0/12
              reordering                       74
                 notsent                  1051110
                  minrtt                    0.993

                     rto                      216
                 backoff                        1
                     rtt            14.491/16.568
                     ato                       40
                     mss                     1374
                    pmtu                     1426
                  rcvmss                      536
                  advmss                     1374
                    cwnd                      143
                    send                108.5Mbps
             pacing_rate                130.2Mbps
           delivery_rate                  3.3Mbps
                    busy                  12680ms
            rwnd_limited           12360ms(97.5%)
                 retrans                     0/16
              reordering                       74
                 notsent                  1023662
                  minrtt                    0.993

                     rto                      208
                     rtt              5.519/0.312
                     ato                       40
                     mss                     1374
                    pmtu                     1426
                  rcvmss                      536
                  advmss                     1374
                    cwnd                      217
                    send                432.2Mbps
             pacing_rate                518.6Mbps
           delivery_rate                254.7Mbps
                    busy                  15700ms
            rwnd_limited           15324ms(97.6%)
                 unacked                      133
                 retrans                     0/17
              reordering                       74
                 notsent                  1719216
                  minrtt                    0.993

We can get richer statistics by analysing the packet capture with tshark. The output is easily grep’d. My favourite statistic report is the io,stat table, you pass it packet fields to aggregate over and produce a configurable histogram. In my example below I created a per second packet rate and window size histogram.

tshark  -r cap_5m.pcap -q -z http_srv,tree
tshark  -r cap_5m.pcap -q -z http_srv,tree -z conv,tcp
grep -P 'Window Full|PUT' cap.analysis|head;
tshark -n -q -r *.pcap -z io,stat,1,"COUNT(tcp.window_size_value)tcp.window_size_value and ip.addr==10.0.0.10 && tcp.dstport == 8080","MIN(tcp.window_size_value)tcp.window_size_value and ip.addr==10.0.0.10 && tcp.dstport == 8080","MAX(tcp.window_size_value)tcp.window_size_value and ip.addr==10.0.0.10 && tcp.dstport == 8080","AVG(tcp.window_size_value)tcp.window_size_value and ip.addr==10.0.0.10 && tcp.dstport == 8080" |grep -v ' 0 '
2022-03-22 09:45:08.845938      10.0.1.100      10.0.0.10       8080    PUT /api/resource HTTP/1.1\r\n   2097152 33554432        PUT /api/resource HTTP/1.1
2022-03-22 09:45:09.116693      10.0.1.100      10.0.0.10       8080    PUT /api/resource HTTP/1.1  [TCP segment of a reassembled PDU]
2022-03-22 09:45:09.245875      10.0.1.100      10.0.0.10       8080    TCP window specified by the receiver is now completely full     6291456 33554432        [TCP Window Full] Continuation[Packet size limited during capture]
2022-03-22 09:45:09.250035      10.0.1.100      10.0.0.10       8080    TCP window specified by the receiver is now completely full     6291456 33554432        [TCP Window Full] Continuation[Packet size limited during capture]
2022-03-22 09:45:09.250133      10.0.1.100      10.0.0.10       8080    TCP window specified by the receiver is now completely full     6291456 33554432        [TCP Window Full] Continuation[Packet size limited during capture]
2022-03-22 09:45:09.253985      10.0.1.100      10.0.0.10       8080    TCP window specified by the receiver is now completely full     6291456 33554432        [TCP Window Full] Continuation[Packet size limited during capture]
2022-03-22 09:45:09.255907      10.0.1.100      10.0.0.10       8080    TCP window specified by the receiver is now completely full     6291456 33554432        [TCP Window Full] Continuation[Packet size limited during capture]
2022-03-22 09:45:09.257952      10.0.1.100      10.0.0.10       8080    TCP window specified by the receiver is now completely full     6291456 33554432        [TCP Window Full] Continuation[Packet size limited during capture]
2022-03-22 09:45:09.262310      10.0.1.100      10.0.0.10       8080    TCP window specified by the receiver is now completely full     6291456 33554432        [TCP Window Full] Continuation[Packet size limited during capture]
2022-03-22 09:45:09.268328      10.0.1.100      10.0.0.10       8080    TCP window specified by the receiver is now completely full     6291456 33554432        [TCP Window Full] Continuation[Packet size limited during capture]
======================================================================================================
| IO Statistics                                                                                      |
|                                                                                                    |
| Duration: 270.938461 secs                                                                          |
| Interval:   1 secs                                                                                 |
|                                                                                                    |
| Col 1: COUNT(tcp.window_size_value)tcp.window_size_value and ip.addr==10.0.0.10 && tcp.dstport     |
|        == 8080                                                                                     |
|     2: MIN(tcp.window_size_value)tcp.window_size_value and ip.addr==10.0.0.10 && tcp.dstport ==    |
|        8080                                                                                        |
|     3: MAX(tcp.window_size_value)tcp.window_size_value and ip.addr==10.0.0.10 && tcp.dstport ==    |
|        8080                                                                                        |
|     4: AVG(tcp.window_size_value)tcp.window_size_value and ip.addr==10.0.0.10 && tcp.dstport ==    |
|        8080                                                                                        |
|----------------------------------------------------------------------------------------------------|
|            |1      |2      |3      |4      |                                                       |
| Interval   | COUNT |  MIN  |  MAX  |  AVG  |                                                       |
|--------------------------------------------|                                                       |
|   9 <>  10 |   898 |   508 | 65142 |   651 |                                                       |
|  10 <>  11 |   849 |   508 |   508 |   508 |                                                       |
|  11 <>  12 |     5 |   507 |   508 |   507 |                                                       |
|  24 <>  25 |     6 |   508 | 65142 | 11280 |                                                       |
|  25 <>  26 |    51 |   507 | 65142 |  3042 |                                                       |
|  32 <>  33 |     6 |   486 | 62314 | 10791 |                                                       |
|  33 <>  34 |    91 |   486 | 62314 |  1844 |                                                       |
|  34 <>  35 |     5 |   485 |   486 |   485 |                                                       |
|  38 <>  39 |     6 |   508 | 65142 | 11280 |                                                       |
|  47 <>  48 |   121 |   508 | 65142 |  2110 |                                                       |
|  48 <>  49 |    80 |   507 |   508 |   507 |                                                       |
|  56 <>  57 |     4 |   508 | 65142 | 16667 |                                                       |
|  57 <>  58 |    22 |   507 | 65142 |  6383 |                                                       |
|  82 <>  83 |     6 |   508 | 65142 | 11280 |                                                       |
|  83 <>  84 |    43 |   507 | 65142 |  3514 |                                                       |
|  98 <>  99 |     6 |   508 | 65142 | 11280 |                                                       |
|  99 <> 100 |    42 |   507 | 65142 |  3585 |                                                       |
| 100 <> 101 |     6 |   508 | 65142 | 11280 |                                                       |
| 102 <> 103 |     7 |   508 | 65142 | 18975 |                                                       |
| 103 <> 104 |   164 |   507 |   508 |   507 |                                                       |
| 105 <> 106 |    13 |   508 | 65142 | 15424 |                                                       |
| 106 <> 107 |    34 |   507 |   508 |   507 |                                                       |
| 115 <> 116 |     6 |   508 | 65142 | 11280 |                                                       |
| 117 <> 118 |   556 |   508 | 65142 |   740 |                                                       |
| 118 <> 119 |     5 |   507 |   508 |   507 |                                                       |
| 127 <> 128 |     6 |   508 | 65142 | 11280 |                                                       |
| 128 <> 129 |     4 |   508 | 65142 | 16667 |                                                       |
| 129 <> 130 |   501 |   508 | 65142 |   637 |                                                       |
| 131 <> 132 |     5 |   507 |   508 |   507 |                                                       |
| 159 <> 160 |     6 |   508 | 65142 | 11280 |                                                       |
| 161 <> 162 |     4 |   508 | 65142 | 16667 |                                                       |
| 162 <> 163 |   149 |   507 | 65142 |   941 |                                                       |
| 187 <> 188 |     4 |   508 | 65142 | 16667 |                                                       |
| 188 <> 189 |     2 |   508 |   508 |   508 |                                                       |
| 225 <> 226 |    10 |   508 | 65142 | 13435 |                                                       |
| 226 <> 227 |    76 |   507 | 65142 |  1358 |                                                       |
| 240 <> 241 |     6 |   508 | 65142 | 11280 |                                                       |
| 242 <> 243 |   444 |   508 | 65142 |   799 |                                                       |
| 243 <> 244 |     7 |   508 | 65142 | 18975 |                                                       |
| 244 <> 245 |   898 |   507 |   508 |   507 |                                                       |
| 245 <> 246 |   496 |   508 |   508 |   508 |                                                       |
| 246 <> 247 |   516 |   508 |   508 |   508 |                                                       |
| 247 <> 248 |   496 |   508 | 65142 |   638 |                                                       |
| 248 <> 249 |   482 |   508 | 65142 |   776 |                                                       |
| 249 <> 250 |   485 |   507 |   508 |   507 |                                                       |
| 250 <> 251 |   602 |   508 |   508 |   508 |                                                       |
| 251 <> 252 |   629 |   508 |   508 |   508 |                                                       |
| 252 <> 253 |   664 |   508 |   508 |   508 |                                                       |
| 253 <> 254 |   519 |   508 |   508 |   508 |                                                       |
| 254 <> 255 |   719 |   508 |   508 |   508 |                                                       |
| 255 <> 256 |   646 |   508 |   508 |   508 |                                                       |
| 256 <> 257 |   764 |   508 |   508 |   508 |                                                       |
| 257 <> 258 |   376 |   508 |   508 |   508 |                                                       |
| 260 <> 261 |    11 |   507 | 65142 |  6383 |                                                       |
| 263 <> 264 |    51 |   508 | 65142 |  4310 |                                                       |
| 264 <> 265 |    19 |   507 |   508 |   507 |                                                       |
| 267 <> 268 |     3 |   509 | 65142 | 22053 |                                                       |
| 268 <> 269 |     7 |   508 | 65142 |  9741 |                                                       |
| 269 <> 270 |    80 |   507 | 65142 |  2123 |                                                       |
| 270 <> Dur |    17 |   507 | 65142 |  8112 |                                                       |
======================================================================================================

Next steps

At this point it’s best to raise a bug with the developers. In the mean time, spawning additional application workers to spread the queue around may help ungrease the wheel a bit.

Longterm we should confirm the retransmissions aren’t caused by poor network conditions. The application could be patched to establish a long lived connection while pushing data into object storage. It currently creates a new session per push. Longer connection life-times would ensure the TCP congestion algorithm heuristics have time to stabilise.

Footnotes
1.
<percentile> <request count> <Request Time*> <Connect Time**> <Response Time***> <Data Time****>
2.
TCP sequence numbers are generally unhelpful when capturing packets from a server. This is due to segment offloading to the NIC.
#tech #journal #networking #observability