[Libwebsockets] performance client vs server
justinbeech at gmail.com
Mon Nov 7 05:16:56 CET 2016
No worries, I did the hack to see where the performance could be picked up between joining the list to send the query and reading your reply. I thought it might be a byproduct of the protocol that made reading more work than writing.
Doing the payload in chunks in a similar way, to match the server performance, would seem the right thing to do, yes. The first gain was simply to avoid the function call for every payload byte; that doubled the speed by itself.
The rest is in moving bytes in clumps, by one method or another. The weird masking of course makes this more of a pain. (I don't know why the spec dabbled with that when SSL is the correct answer!).
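For what it's worth, the clumped unmasking I mean looks roughly like this. This is just a sketch of the idea, not lws's actual code, and unmask_bulk is a made-up name:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sketch: XOR the 4-byte RFC 6455 mask over a payload run a 32-bit
 * word at a time instead of per byte.  "ofs" is how far into the
 * payload we already are, since the mask key rotates with payload
 * position. */
static void
unmask_bulk(uint8_t *buf, size_t len, const uint8_t mask[4], size_t ofs)
{
    size_t n = 0;
    uint32_t m;

    /* consume bytes one at a time until aligned to the mask cycle */
    while (n < len && (ofs + n) % 4) {
        buf[n] ^= mask[(ofs + n) % 4];
        n++;
    }

    /* now (ofs + n) % 4 == 0, so the mask word applies unrotated */
    memcpy(&m, mask, 4);
    for (; len - n >= 4; n += 4) {
        uint32_t w;

        memcpy(&w, buf + n, 4);   /* memcpy sidesteps alignment traps */
        w ^= m;
        memcpy(buf + n, &w, 4);
    }

    /* leftover tail bytes */
    for (; n < len; n++)
        buf[n] ^= mask[(ofs + n) % 4];
}
```

Same result as the byte loop, but the inner loop moves four bytes per iteration; with wider types or vectorisation it could go further.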
> On Nov 7, 2016, at 2:58 PM, Andy Green <andy at warmcat.com> wrote:
>> On Mon, 2016-11-07 at 14:40 +1100, jb wrote:
>> Hi Andy,
>> I managed to get it about 6x faster by inserting this
>> chunk of code into lws_handshake_client.
>> Sorry I'm not too good with diffs or pull requests yet ..
> Assuming lws came to you via git clone, you can just do
> $ git diff
> to get a diff of your changes. Stick them in a file like
> $ git diff > mydiff
> and send the file is enough, assuming it's not overwhelmed with other
> unrelated changes.
> Note lws is LGPL... easiest way to deal with that is contribute your
> changes back.
>> However you can see from the screen shot that if one
>> gulps payload when the parser is in its most common state
>> then there is massive performance improvement for
>> extended runs of large payloads that are not masked.
> Yes this isn't news... the code I pointed to did the same thing to the
> very similar code on the server side.
>> Before: Server 10% client 100% - throughput maybe 1.5 gigabit
>> After: Server 50% client 100% - throughput about 6-7 gigabit.
>> The new code is in braces. In lws_handshake_client.
>> this stops most of the character by character calling
>> of lws_client_rx_sm
>> which has local variables and a lot of popping and pushing
> Local vars don't cost anything per se in C. It just adjusts the stack
> frame one time on entry and one time on exit. However we know from the
> server change that avoiding calling this bytewise is much more
> efficient.
>> and if()s. Similar performance if you just gulp 128-byte
>> chunks rather than "len", and then let lws_client_rx_sm()
>> clean up the remainder.
>> I've no idea what this breaks! but yeah the gain is available.
> That's why following the code I pointed to is maybe a good idea... it's
> been in for a while.
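The shape of the gulp optimisation being discussed is something like the following. This is a hypothetical sketch with made-up struct and names, not the real lws internals, and the toy header handling (opcode byte plus a one-byte length, no mask) stands in for the real frame parsing:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

enum { PARSING_HEADER, PARSING_PAYLOAD };

/* made-up stand-in for the rx state machine context */
struct rx_sm {
    int      state;        /* PARSING_HEADER or PARSING_PAYLOAD */
    int      hdr_byte;     /* which toy header byte comes next  */
    size_t   payload_left; /* payload bytes still expected      */
    uint8_t *out;          /* decoded payload accumulates here  */
    size_t   out_len;
};

/* toy per-byte slow path: an opcode byte, then a one-byte length */
static void
rx_sm_byte(struct rx_sm *sm, uint8_t c)
{
    if (sm->hdr_byte == 0) {      /* opcode: ignored in this toy */
        sm->hdr_byte = 1;
        return;
    }
    sm->payload_left = c;         /* length byte */
    sm->hdr_byte = 0;
    sm->state = c ? PARSING_PAYLOAD : PARSING_HEADER;
}

static void
rx_parse(struct rx_sm *sm, const uint8_t *in, size_t len)
{
    while (len) {
        if (sm->state == PARSING_PAYLOAD && sm->payload_left) {
            /* fast path: gulp as much payload as is available
             * in one memcpy, instead of a call per byte */
            size_t chunk = len < sm->payload_left ?
                           len : sm->payload_left;

            memcpy(sm->out + sm->out_len, in, chunk);
            sm->out_len += chunk;
            sm->payload_left -= chunk;
            in += chunk;
            len -= chunk;
            if (!sm->payload_left)
                sm->state = PARSING_HEADER;
            continue;
        }
        /* slow path: header bytes still go one at a time */
        rx_sm_byte(sm, *in++);
        len--;
    }
}
```

In the real parser the fast path would also have to be skipped for masked frames and anything needing special handling (RSV bits, control frames), leaving those on the byte-at-a-time slow path.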
>>> On Mon, Nov 7, 2016 at 1:40 PM, Andy Green <andy at warmcat.com> wrote:
>>>> On Mon, 2016-11-07 at 13:12 +1100, jb wrote:
>>>> Using a modified fraggle.c, removing deflate, increasing the
>>>> size to batches of 32k, and removing the generation of random data
>>>> and the checksums, I see that when the client runs at 100% cpu the
>>>> server is only running at 10% cpu. (fraggle.c is arranged so when a
>>>> client connects the server sends a bunch of messages.)
>>>> Doing a quick profile it looks like all the client cpu time is
>>>> taken up by lws_client_rx_sm, which appears to be a character by
>>>> character state machine for receiving bytes.
>>>> It isn't totally clear to me why the server is 10x faster at
>>>> sending data than the client is at reading it. If the
>>>> server sends a 32k block of zeros as a binary message, at some
>>>> point isn't there a payload length and a payload of 32k? Does
>>>> each byte need to be processed individually on one side but not
>>>> the other?
>>> Take a look at this
>>> On the server side, the equivalent parser got a patch optimizing
>>> bulk data flow.
>>> If you'd like to port that to the client side, patches are welcome.
>>>> Libwebsockets mailing list
>>>> Libwebsockets at ml.libwebsockets.org
More information about the Libwebsockets mailing list