[Libwebsockets] Possible bug, ping/pong with polling on Windows seems to send pings too late

Casey Banner kcbanner at gmail.com
Thu Oct 8 05:18:27 CEST 2020


Hello,

Recently having upgraded to v4.1.0 from v3.1.0, I'm having an issue
with the new retry system. What seems to be happening is the ping
doesn't get sent until right before hangup check, and the pong comes
back too late.

In our use case, we want to maintain the connection (with ping/pongs)
even when there is no data being sent. The reason being the user may
take an action and we don't want the latency of establishing a new
connection. We need the ping/pongs to stop the AWS load balancer from
killing our connection due to no data transfer.

This was working on v3.1.0 with the now deprecated ping/pong interval.

I've configured the retry as follows:

static const lws_retry_bo_t RetryPolicy = {
    nullptr,
    0,
    0,
    3,
    10,
    20
};

We're not using lws_validity_confirmed, as in this case it wouldn't
help (no one is sending anything meaningful).

The platform is Windows x64.

Logs.

[2020.10.08-02.36.35:239][163]LogLwsWebSockets: Lws(Notice): LWS:
4.1.0-v4.1.2-6-g616d8e97, loglevel 4063
[2020.10.08-02.36.35:241][163]LogLwsWebSockets: Lws(Notice): NET CLI
H1 H2 WS IPv6-absent
[2020.10.08-02.36.35:245][163]LogLwsWebSockets: Lws(Info): Event loop: poll

... snip ...

[2020.10.08-02.12.04:716][905]LogLwsWebSockets: Lws(Info):
lws_validity_cb: wsi 000001523218EC70: scheduling validity check
[2020.10.08-02.12.04:717][905]LogLwsWebSockets: Verbose: Lws(Debug):
_lws_change_pollfd: wsi 000001523218EC70: fd 5952 events 41 -> 43
[2020.10.08-02.12.04:717][905]LogLwsWebSockets: Verbose: Lws(Debug):
rops_handle_POLLOUT_ws: client:0.38.10.x: wsi->ws->tx_draining_ext 0
[2020.10.08-02.12.04:718][905]LogLwsWebSockets: Lws(Info):
rops_handle_POLLOUT_ws: issuing ping on wsi 000001523218EC70: ws
client:0.38.10.x h2: 0
[2020.10.08-02.12.04:718][905]LogLwsWebSockets: Lws(Info):
lws_issue_raw: ssl_capable_write (6) says 6
[2020.10.08-02.12.04:719][905]LogLwsWebSockets: Verbose: Lws(Debug):
rops_handle_POLLOUT_ws: client:0.38.10.x: wsi->ws->tx_draining_ext 0
[2020.10.08-02.12.04:720][905]LogLwsWebSocketsExt: Verbose: Lws(Ext):
rops_handle_POLLOUT_ws: !wsi->ws->extension_data_pending
[2020.10.08-02.12.04:722][905]LogLwsWebSockets: Verbose: Lws(Debug):
_lws_change_pollfd: wsi 000001523218EC70: fd 5952 events 43 -> 41
[2020.10.08-02.12.04:722][905]LogLwsWebSockets: Verbose: Lws(Debug):
lws_handle_POLLOUT_event: 000001523218EC70: non mux: wsistate
0x10000119, ops ws
[2020.10.08-02.12.04:724][905]LogLwsWebSockets: Verbose: FLwsWebSocket
[2]: 000001523218EC70 callback 10
[2020.10.08-02.12.04:724][905]LogLwsWebSockets: Verbose: Lws(Debug):
lws_read_h1: h1 path: wsi state 0x119
[2020.10.08-02.12.04:725][905]LogLwsWebSockets: Verbose: Lws(Debug):
lws_ws_handshake_client: hs client feels it has 2 in
[2020.10.08-02.12.04:726][905]LogLwsWebSocketsExt: Verbose: Lws(Ext):
lws_ws_client_rx_sm:    This RX frame Final 1
[2020.10.08-02.12.04:726][905]LogLwsWebSockets: Lws(Info):
lws_ws_client_rx_sm: client 000001523218EC70 received pong
[2020.10.08-02.12.04:727][905]LogLwsWebSockets: Verbose: Lws(Debug):
(hexdump: zero length)
[2020.10.08-02.12.04:727][905]LogLwsWebSockets: Lws(Info):
_lws_validity_confirmed_role: wsi 000001523218EC70: setting validity
timer 3s (hup 0)
[2020.10.08-02.12.04:728][905]LogLwsWebSockets: Verbose: Lws(Debug):
lws_ws_client_rx_sm: starting disbursal of 0 deframed rx
[2020.10.08-02.12.04:728][905]LogLwsWebSocketsExt: Verbose: Lws(Ext):
lws_ws_client_rx_sm: +++ passing 0 0000015263BD7020 to ext
[2020.10.08-02.12.04:729][905]LogLwsWebSocketsExt: Verbose: Lws(Ext):
lws_extension_callback_pm_deflate: LWS_EXT_CB_PAYLOAD_RX: in 0,
existing in 0
[2020.10.08-02.12.04:729][905]LogLwsWebSocketsExt: Verbose: Lws(Ext):
Ext RX returned 1
[2020.10.08-02.12.04:729][905]LogLwsWebSockets: Verbose: Lws(Debug):
rops_handle_POLLOUT_ws: client:0.38.10.x: wsi->ws->tx_draining_ext 0
[2020.10.08-02.12.04:730][905]LogLwsWebSocketsExt: Verbose: Lws(Ext):
rops_handle_POLLOUT_ws: !wsi->ws->extension_data_pending
[2020.10.08-02.12.04:730][905]LogLwsWebSockets: Verbose: Lws(Debug):
_lws_change_pollfd: wsi 000001523218EC70: fd 5952 events 41 -> 41
[2020.10.08-02.12.04:731][905]LogLwsWebSockets: Verbose: Lws(Debug):
lws_handle_POLLOUT_event: 000001523218EC70: non mux: wsistate
0x10000119, ops ws
[2020.10.08-02.12.04:731][905]LogLwsWebSockets: Verbose: FLwsWebSocket
[2]: 000001523218EC70 callback 10
[2020.10.08-02.12.07:717][ 85]LogLwsWebSockets: Lws(Info):
lws_validity_cb: wsi 000001523218EC70: scheduling validity check
[2020.10.08-02.12.07:718][ 85]LogLwsWebSockets: Verbose: Lws(Debug):
_lws_change_pollfd: wsi 000001523218EC70: fd 5952 events 41 -> 43
[2020.10.08-02.12.14:720][505]LogLwsWebSockets: Verbose: Lws(Debug):
rops_handle_POLLOUT_ws: client:0.38.10.x: wsi->ws->tx_draining_ext 0
[2020.10.08-02.12.14:722][505]LogLwsWebSockets: Lws(Info):
rops_handle_POLLOUT_ws: issuing ping on wsi 000001523218EC70: ws
client:0.38.10.x h2: 0
[2020.10.08-02.12.14:722][505]LogLwsWebSockets: Lws(Info):
lws_issue_raw: ssl_capable_write (6) says 6
[2020.10.08-02.12.14:723][505]LogLwsWebSockets: Lws(Info):
lws_validity_cb: wsi 000001523218EC70: validity too old
[2020.10.08-02.12.14:723][505]LogLwsWebSockets: Lws(Info):
__lws_close_free_wsi: 000001523218EC70: caller: validity timeout
[2020.10.08-02.12.14:724][505]LogLwsWebSockets: Verbose: Lws(Debug):
__lws_close_free_wsi: real just_kill_connection: 000001523218EC70
(sockfd 5952)
[2020.10.08-02.12.14:724][505]LogLwsWebSockets: Verbose: Lws(Debug):
lwsi_set_state(000001523218EC70, 0x10000020)
[2020.10.08-02.12.14:724][505]LogLwsWebSocketsExt: Verbose: Lws(Ext):
lws_extension_callback_pm_deflate: LWS_EXT_CB_DESTROY
[2020.10.08-02.12.14:725][505]LogLwsWebSockets: Verbose: FLwsWebSocket
[2]: 000001523218EC70 callback 75

Stepping through, it seems like the lws_validity_cb call at 12.07
should have triggered a ping to be sent (it seems to have done so
correctly at 12.04). Stepping through in a debugger, I can see the
call to rops_issue_keepalive_ws take the branch where it sets
send_check_ping, but then the rops_handle_POLLIN_ws call for that
socket comes in too late (7 seconds later instead of 3) and by then
the socket gets hung up by the 10 second timeout logic.

I tested with minimal-ws-client. The only modifications I made were to
disable SSL and have it connect to a different URL (a local server
that sends nothing).

Wireshark logs as follows:

1    0.000000    127.0.0.1    127.0.0.1    TCP    56    63905 → 9998
[SYN] Seq=0 Win=65535 Len=0 MSS=65495 WS=256 SACK_PERM=1    63905
9998    0
2    0.000030    127.0.0.1    127.0.0.1    TCP    56    9998 → 63905
[SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=65495 WS=256 SACK_PERM=1
9998    63905    0
3    0.000048    127.0.0.1    127.0.0.1    TCP    44    63905 → 9998
[ACK] Seq=1 Ack=1 Win=2619648 Len=0    63905    9998    0
4    0.000096    127.0.0.1    127.0.0.1    HTTP    336    GET
/socketExternal HTTP/1.1     63905    9998    0
5    0.000105    127.0.0.1    127.0.0.1    TCP    44    9998 → 63905
[ACK] Seq=1 Ack=293 Win=2619648 Len=0    9998    63905    0
6    0.000617    127.0.0.1    127.0.0.1    HTTP    352    HTTP/1.1 101
Switching Protocols     9998    63905    0
7    0.000626    127.0.0.1    127.0.0.1    TCP    44    63905 → 9998
[ACK] Seq=293 Ack=309 Win=2619392 Len=0    63905    9998    0
8    3.000857    127.0.0.1    127.0.0.1    WebSocket    50
WebSocket Ping [FIN] [MASKED]    63905    9998    0    Ping
9    3.000873    127.0.0.1    127.0.0.1    TCP    44    9998 → 63905
[ACK] Seq=309 Ack=299 Win=2619648 Len=0    9998    63905    0
10    3.000987    127.0.0.1    127.0.0.1    WebSocket    46
WebSocket Pong [FIN]     9998    63905    0    Pong
11    3.000996    127.0.0.1    127.0.0.1    TCP    44    63905 → 9998
[ACK] Seq=299 Ack=311 Win=2619392 Len=0    63905    9998    0
12    13.002098    127.0.0.1    127.0.0.1    WebSocket    50
WebSocket Ping [FIN] [MASKED]    63905    9998    0    Ping
13    13.002131    127.0.0.1    127.0.0.1    TCP    44    9998 → 63905
[ACK] Seq=311 Ack=305 Win=2619648 Len=0    9998    63905    0
14    13.002176    127.0.0.1    127.0.0.1    TCP    44    63905 → 9998
[FIN, ACK] Seq=305 Ack=311 Win=2619392 Len=0    63905    9998    0
15    13.002183    127.0.0.1    127.0.0.1    TCP    44    9998 → 63905
[ACK] Seq=311 Ack=306 Win=2619648 Len=0    9998    63905    0
16    13.002301    127.0.0.1    127.0.0.1    WebSocket    46
WebSocket Pong [FIN]     9998    63905    0    Pong
17    13.002330    127.0.0.1    127.0.0.1    TCP    44    63905 → 9998
[RST, ACK] Seq=306 Ack=313 Win=0 Len=0    63905    9998    0

The same behaviour I saw in my application also shows here in the
example. It seems that it waits in the service call for too long
before sending the ping?

Perhaps this is a problem specifically on windows with this polling model?

I could also just use the scheduler to wake up and send pings (without
using the retry functionality at all). Also, if I've misunderstood how
the system is supposed to work, let me know.

Cheers and thanks for maintaining libwebsockets!

- Casey


More information about the Libwebsockets mailing list