[Libwebsockets] LWS assertion probably due to start/stop of a pppd daemon

thomas.spitz at hestia-france.com thomas.spitz at hestia-france.com
Wed Sep 23 10:05:23 CEST 2020


Hello Andy, 

Thanks a lot for your quick answer 

Unfortunately I have to continue looking into my code "by hand" because
I don't see yet how to reproduce my program issues through skeletal
changes. I also cannot use Valgrind because it is not available for
armv5 and my program is very tight to hardware (hard to make it cross
platform)... 

In order to help me spot the bug, sometimes, at least with libwebsockets
v4.0.99 unknown-build-hash, the issue showed up in a different manner.
The WS server just stopped accepting new connection without trigerring
any assertion. 

Connection OK: 

> 2020-08-13 14:52:46,1,6,V4:, SSL_ERROR_WANT_READ: m 2
> 2020-08-13 14:52:46,1,6,V4:, SSL_accept says -2
> 2020-08-13 14:52:46,1,6,V4:, lws_header_table_attach: wsi 0x12a0e0: ah (nil) (tsi 0, count = 0) in
> 2020-08-13 14:52:46,1,6,V4:, _lws_create_ah: created ah 0x13bd18 (size 4096): pool length 1
> 2020-08-13 14:52:46,1,6,V4:, lws_header_table_attach: did attach wsi 0x12a0e0: ah 0x13bd18: count 1 (on exit)
> 2020-08-13 14:52:46,1,6,V4:, SNI: Unknown ServerName
> 2020-08-13 14:52:46,1,6,V4:, 
> 2020-08-13 14:52:46,1,6,V4:, SSL_ERROR_WANT_READ: m 2
> 2020-08-13 14:52:46,1,6,V4:, SSL_accept says -2
> 2020-08-13 14:52:46,1,6,V4:, lws_tls_server_accept: no client cert CN
> 2020-08-13 14:52:46,1,6,V4:, lws_openssl_describe_cipher: wsi 0x12a0e0: ECDHE-RSA-AES256-GCM-SHA384, ECDHE-RSA-AES256-GCM-SHA384, 256 bits, TLSv1/SSLv3
> 2020-08-13 14:52:46,1,6,V4:, SSL_accept says 0
> 2020-08-13 14:52:46,1,6,V4:, setting wsi to vh default
> 2020-08-13 14:52:46,1,6,V4:, no ALPN upgrade
> 2020-08-13 14:52:46,1,6,V4:, lws_buflist_aware_read: wsi 0x12a0e0: lws_h1_server_socket_service: ssl_capable_read 217
> 2020-08-13 14:52:46,1,6,V4:, lws_handshake_server: parsed count 217
> 2020-08-13 14:52:46,1,6,V4:, lws_select_vhost: vhost match to default based on port 443
> 2020-08-13 14:52:46,1,6,V4:, Upgrade to ws
> 2020-08-13 14:52:46,1,6,V4:, lws_issue_raw: ssl_capable_write (173) says 173this is the same issue as the assertion or
> 2020-08-13 14:52:46,1,0,V4:, ../trunk/serveurWS/callback_WSShestia.c(12185)callback_hestia_v4()L69: LWS_CALLBACK_ESTABLISHED
> 2020-08-13 14:52:46,1,6,V4:, _lws_validity_confirmed_role: wsi 0x12a0e0: setting validity timer 300s (hup 0)
> 2020-08-13 14:52:46,1,6,V4:, lws_process_ws_upgrade2: 0x12a0e0: dropping ah on ws upgrade
> 2020-08-13 14:52:46,1,6,V4:, __lws_header_table_detach: wsi 0x12a0e0: ah 0x13bd18 (tsi=0, count = 1)
> 2020-08-13 14:52:46,1,6,V4:, __lws_header_table_detach: nobody usable waiting
> 2020-08-13 14:52:46,1,6,V4:, _lws_destroy_ah: freed ah 0x13bd18 : pool length 0
> 2020-08-13 14:52:46,1,6,V4:, __lws_header_table_detach: wsi 0x12a0e0: ah 0x13bd18 (tsi=0, count = 0)
> 2020-08-13 14:52:46,1,6,V4:, lws_ws_frame_rest_is_payload: using 38 of raw input (total 38 on offer)

Connection not working anymore: 

> 2020-08-13 14:56:27,1,6,V4:, SSL_ERROR_WANT_READ: m 2
> 2020-08-13 14:56:27,1,6,V4:, SSL_accept says -2
> 2020-08-13 14:56:27,1,6,V4:, lws_header_table_attach: wsi 0x1297d0: ah (nil) (tsi 0, count = 0) in
> 2020-08-13 14:56:27,1,6,V4:, _lws_create_ah: created ah 0x13b0c8 (size 4096): pool length 1
> 2020-08-13 14:56:27,1,6,V4:, lws_header_table_attach: did attach wsi 0x1297d0: ah 0x13b0c8: count 1 (on exit)
> 2020-08-13 14:56:27,1,6,V4:, SNI: Unknown ServerName
> 2020-08-13 14:56:27,1,6,V4:, 
> 2020-08-13 14:56:27,1,6,V4:, SSL_accept says -1
> 2020-08-13 14:56:27,1,6,V4:, SSL_accept failed socket 2: -1
> 2020-08-13 14:56:27,1,6,V4:, __lws_close_free_wsi: 0x1297d0: caller: close_and_handled
> 2020-08-13 14:56:27,1,6,V4:, rops_destroy_role_h1: ah det due to close
> 2020-08-13 14:56:27,1,6,V4:, __lws_header_table_detach: wsi 0x1297d0: ah 0x13b0c8 (tsi=0, count = 1)
> 2020-08-13 14:56:27,1,6,V4:, __lws_header_table_detach: nobody usable waiting
> 2020-08-13 14:56:27,1,6,V4:, _lws_destroy_ah: freed ah 0x13b0c8 : pool length 0
> 2020-08-13 14:56:27,1,6,V4:, __lws_headethis is the same issue as the assertion orr_table_detach: wsi 0x1297d0: ah 0x13b0c8 (tsi=0, count = 0)

Do you think this could also be due to a client connection fd my program
close accidentaly? 

Best regards, 
Thomas 

On 2020-09-22 18:36, andy at warmcat.com wrote: 

> On September 22, 2020 4:10:56 PM UTC, Thomas Spitz
> <thomas.spitz at hestia-france.com> wrote: Hello everyone,
> 
> I've been using LWS in my main program for years without stability
> issues.
> 
> My main program is launching a LWS server through a thread in the same
> manner as minimal-ws-server.c (creating lws context, creating vhost
> with my
> ws-protocol and looping on lws_service) :
> 
> pthread_attr_init(&attr); 
> pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);pthread_create(&ThreadServeurWEBSOCKETS, &attr, serveurwebsockets, NULL) 
> This thread is managing a wsi list, receiving and sending WS messages
> 
> Now, my program also launches a pppd daemon through a thread using fork
> +
> exec. This pppd daemon is used to set a mobile backup internet
> connection
> but this internet connection is never used by WS clients at the
> present.
> 
> Testing with only one WS client, switching ON/OFF the mobile backup
> internet connections many times, this WS client is generaly able to
> stay
> connected or connects / disconnects without issue. However, sometimes
> (1
> time among 40 in average), the WS server creates an assertion.
> 
> Here a log example with LWS: 4.1.0-22d043a, loglevel 31.:
 ... 

> I think the cause of the above assertion is not due to lws itself but I
> don't know how to proceed in order to find the issue.
> 
>> assert(context->lws_lookup[wsi->desc.sockfd -
>> lws_plat_socket_offset()] == 0);
> 
> Do you think this could be due to a fd that would be copied during the
> pppd
> fork+exec?

Iiui it's telling you that lws has a particular fd in use, that has
not indicated that it has closed... if an fd is in use, we'll normally
hear about its disconnection by POLLERR or POLLIN, when we go to
read() from it we get -1 back.  Then we understand he's dead, we do
the wsi close flow and logically close the fd as part of that,
removing it from the fd tables and destroying / freeing the wsi too
until no trace of it.

Later the same fd in the process can be reused, but it comes again
fresh to the whole thing from scratch.

The kind of error you have, it's like, eg, you have manually closed an
fd that lws has in use, outside of lws and without lws-side processing
of the close.  The now meaningless wsi is sitting there doing nothing
but still registers ownership of the fd.  But then a new connection
comes recycling that fd as if it's something new.

If it feels a problem on lws side, best thing is try to reproduce it
with skeletal changes to a minimal example.  And run your own stuff
under valgrind, it can report then if you, eg, have a bug where your
code closes the wrong fd.

-Andy

> Thanks in advance for your help
> Best regards,
> Thomas
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://libwebsockets.org/pipermail/libwebsockets/attachments/20200923/8d89bd18/attachment-0001.htm>


More information about the Libwebsockets mailing list