[Libwebsockets] lws_write speed

Andy Green andy at warmcat.com
Thu May 19 06:47:29 CEST 2016



On 05/19/2016 10:05 AM, Andy Green wrote:
>
>
> On 05/18/2016 05:35 PM, Roger Light wrote:
>> On Wed, May 18, 2016 at 5:10 AM, Andy Green <andy at warmcat.com> wrote:
>>
>>> No it works fine, because after copying to the internal buffer
>>> lws_write()
>>> lies and returns the whole amount as "sent".  It has to do that
>>> because the
>>> buffer it was given is usually on the stack and will immediately be
>>> lost.
>>
>> Ok, I don't believe that this will be done on the stack most of the
>> time but I understand the reasoning.
>
> It depends on how the user code generates the data... buffering it on
> the stack is often desirable, unless wherever it's stored has already
> made arrangements for LWS_PRE headroom and for the data being XOR'd;
> the raw data sometimes needs transforming when it's sent as part of
> the ws protocol.
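For reference, the usual user-side pattern is a stack buffer with LWS_PRE bytes of headroom in front of the payload. A self-contained sketch, using a stand-in for lws_write() so it runs without the library (the real constant and call come from libwebsockets.h):

```c
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for LWS_PRE: lws needs this much headroom
 * before the payload so it can prepend ws framing in place.  The real
 * value comes from <libwebsockets.h>. */
#define PRE 16

/* Hypothetical stand-in for lws_write(): the library writes its frame
 * header into the headroom, then sends header + payload as one block. */
static size_t fake_lws_write(unsigned char *buf_at_payload, size_t len)
{
	unsigned char *frame = buf_at_payload - PRE;	/* header goes here */

	memset(frame, 0, PRE);				/* pretend framing */
	return len;					/* reports payload len */
}

/* The usual user-code pattern: payload on the stack, PRE bytes of
 * headroom reserved in front of it. */
static size_t send_message(const char *msg)
{
	unsigned char buf[PRE + 128];
	size_t len = strlen(msg);

	memcpy(&buf[PRE], msg, len);		/* payload after the headroom */
	return fake_lws_write(&buf[PRE], len);
}
```

The point is that the buffer, headroom included, lives on the stack of the WRITABLE callback, which is why lws must consume or copy it before returning.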
>
>>> Seeing what has happened, lws then disables any further WRITABLE
>>> callbacks
>>> to the user code and requests and services them automatically from
>>> the temp
>>> buffer.  When the malloc'd temp buffer is drained, it is kept around
>>> (on the
>>> basis if you wrote that much once on this wsi, your code is probably
>>> planning to do so again) and only realloc'd if the next one is bigger.
>>> WRITABLE callbacks are reenabled when the temp buffer is drained.
>>> The temp
>>> buffer is freed when the wsi closes.
>>
>> I see, that makes sense in the context of the previous
>>
>>> PS: also I learned, to my surprise, that the pattern of giving write()
>>> a big length and letting it nibble what it wants is a really bad idea
>>> for performance.  The problem is the kernel processes all of the pages
>>> every time before passing the request to the network stack; if len is
>>> counted in MB, that is a huge amount of time and CPU lost on each
>>> call, which will only accept a fraction of the processed pages.
>>
>> Ah, that's very interesting and not something I'd thought about,
>> thanks for the tip.
>>
>> I wanted to test it out of course though, so tried sending a ~60MB
>> file (not using websockets) either using write(full_remaining_length)
>> or write(4096) and using callgrind to look at both cases  (yes, this
>> is only looking at the user space). This is with an application
>> operating as a client with only 2 socket connections open. What I saw
>> was that passing the full length meant write() was called 1665 times,
>> but limited to 4096 bytes it was called 15725 times.
>>
>> Doing further investigation to look at what was actually being
>> returned from write() gave me a smallest write of 1428 (this only
>> happened twice), a mean of 40559, median of 19992 and maximum of
>> 1098132.
>>
>> It's clear that those numbers are much smaller than the 60MB total
>> size and so trying to pass that every single time would result in a
>> loss in performance from what you said. On the other hand, limiting to
>> 4096 bytes at once seems like it would reduce performance as well from
>> what I've seen.
>>
>> FWIW, this is Ubuntu 14.04 running on an Intel Atom N2800 with 2GB RAM
>> - connecting to a remote host in a different country.
>
> The surprising thing is the kernel ever took 1MB for a non-127.x.x.x
> address.  I guess it did it at the very start.
>
> There's a tradeoff between the requested length and the chance of
> having to buffer some of it.  I think the median is most indicative:
> under the test conditions the kernel would mainly take ~20K.  But of
> course if the kernel came under memory pressure, or there were many
> active connections, or connections slow to ACK, that figure is highly
> dynamic.
>
> Maybe what we should do is keep a small ringbuffer of per-connection
> stats like that and let the user code predict an optimal size based on
> what went through the last few sends.  Occasionally probing whether it
> should go bigger isn't so bad: if it only goes over by a little, the
> malloc'd buffer for the leftover is small, although sending that small
> leftover next time might hit throughput a bit.
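A minimal sketch of that per-connection ringbuffer idea (plain C, all names hypothetical, not lws API):

```c
#include <stddef.h>

#define TX_HIST 8	/* how many recent sends to remember */

/* Per-connection ring of the amounts the kernel actually accepted. */
struct tx_stats {
	size_t accepted[TX_HIST];
	int head, count;
};

/* Record what a write() actually took on this connection. */
static void tx_stats_note(struct tx_stats *s, size_t n)
{
	s->accepted[s->head] = n;
	s->head = (s->head + 1) % TX_HIST;
	if (s->count < TX_HIST)
		s->count++;
}

/* Suggest the next request size: the mean of recent sends, plus a
 * 12.5% margin so we occasionally probe whether the kernel will take
 * more.  With no history yet, fall back to a configured default. */
static size_t tx_stats_suggest(const struct tx_stats *s, size_t fallback)
{
	size_t sum = 0, mean;
	int i;

	if (!s->count)
		return fallback;
	for (i = 0; i < s->count; i++)
		sum += s->accepted[i];
	mean = sum / (size_t)s->count;
	return mean + mean / 8;
}
```

If a probe overshoots, only the small leftover needs the malloc'd buffer, matching the tradeoff described above.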

I added some basic stats and read leaf.jpg a few different ways.  This
is on a very fast Linux machine with 32GB RAM, FWIW.

1) Current situation, 4KiB LWS_MAX_SOCKET_IO_BUF, wget from same machine 
on lo

lwsws[23437]:  TXSTATS: wsi 0x01903f30, sent:        2477660 bytes, write() calls:             606
lwsws[23437]:           clipped data:                      0 (  0%), clipped writes:              0 (  0%)
lwsws[23437]:           mean size passed:               4088 bytes, mean size clipped:           0 bytes
lwsws[23437]:           smallest size clipped:             0 bytes, largest size clipped:        0 bytes
lwsws[23437]:           smallest size asked:             142 bytes, largest size asked:       4096 bytes
lwsws[23437]:           mean size asked:                4088 bytes

For all the other tests, LWS_MAX_SOCKET_IO_BUF was increased to 128KiB

2) wget from same machine on lo

lwsws[24342]:  TXSTATS: wsi 0x01be6ee0, sent:        2477660 bytes, write() calls:              20
lwsws[24342]:           clipped data:                      0 (  0%), clipped writes:              0 (  0%)
lwsws[24342]:           mean size passed:             123883 bytes, mean size clipped:           0 bytes
lwsws[24342]:           smallest size clipped:             0 bytes, largest size clipped:        0 bytes
lwsws[24342]:           smallest size asked:             142 bytes, largest size asked:     131072 bytes
lwsws[24342]:           mean size asked:              123883 bytes

3) wget from ARM machine on same ethernet

lwsws[24342]:  TXSTATS: wsi 0x01be6c20, sent:        2477660 bytes, write() calls:              26
lwsws[24342]:           clipped data:                 209824 (  8%), clipped writes:              6 ( 23%)
lwsws[24342]:           mean size passed:             113391 bytes, mean size clipped:       34970 bytes
lwsws[24342]:           smallest size clipped:         85432 bytes, largest size clipped:   101304 bytes
lwsws[24342]:           smallest size asked:             142 bytes, largest size asked:     131072 bytes
lwsws[24342]:           mean size asked:              103364 bytes

4) Android chrome access over internet from 4G network

lwsws[22271]: lws_header_table_detach: wsi 0xb53ee0: ah held 5s, ah.rxpos 0, ah.rxlen 0, mode/state 0 4,wsi->more_rx_waiting 0
lwsws[22271]:  TXSTATS: wsi 0x00b54230, sent:        2477660 bytes, write() calls:              22
lwsws[22271]:           clipped data:                  63456 (  2%), clipped writes:              2 (  9%)
lwsws[22271]:           mean size passed:             120710 bytes, mean size clipped:       31728 bytes
lwsws[22271]:           smallest size clipped:         95040 bytes, largest size clipped:   103648 bytes
lwsws[22271]:           smallest size asked:             142 bytes, largest size asked:     131072 bytes
lwsws[22271]:           mean size asked:              115505 bytes


Ignoring adaptive optimization for the moment, there's clearly a lot of
scope for improvement here if burning more memory on the
per-service-thread buffer (used by lws_serve_http_file_fragment(), which
is doing the work for leaf.jpg) is an option.

Until now the size of that buffer has been fixed at 4KiB in
private-libwebsockets.h; I just pushed a patch on master that lets it be
set from the context creation info struct.
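As a sketch, requesting the larger buffer at context creation would look something like this (field name as on master after that patch; treat it as illustrative and check lws_context_creation_info in libwebsockets.h):

```c
#include <libwebsockets.h>
#include <string.h>

static struct lws_context *
create_ctx(const struct lws_protocols *protocols)
{
	struct lws_context_creation_info info;

	memset(&info, 0, sizeof(info));
	info.port = 7681;
	info.protocols = protocols;
	/* per-service-thread buffer: 128KiB instead of the 4KiB default */
	info.pt_serv_buf_size = 128 * 1024;

	return lws_create_context(&info);
}
```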

-Andy


> -Andy
>
>> Cheers,
>>
>> Roger
>>
> _______________________________________________
> Libwebsockets mailing list
> Libwebsockets at ml.libwebsockets.org
> http://libwebsockets.org/mailman/listinfo/libwebsockets


