[Libwebsockets] UTF-8 correctness for WS text-message fragments.

Alexander Bruines alexander.bruines at gmail.com
Sun Jul 1 21:42:54 CEST 2018


Hello Andy,

I have two lws-related questions.

1) Would it be possible for lws to check each WS text-message fragment for UTF-8 correctness when LWS_SERVER_OPTION_VALIDATE_UTF8 is set for a given vhost?
Right now I believe this option validates the WS text message as a whole, while the individual fragments are definitely split at an arbitrary byte (determined by the buffer size for that protocol).

Since LWS_CALLBACK_RECEIVE expects UTF-8 in this case, wouldn't it be useful to make sure that each fragment ends on a complete code point, so that the user can process the fragment without worries?

My user-code (C++11) now uses the following to achieve this:

> ...
>   case LWS_CALLBACK_RECEIVE: {
>
>     /* A buffer to store any incomplete UTF8 codepoint (at most three bytes) */
>     static std::string bytes_left_over;
>
>     /* Start with whatever bytes are left over from the previous fragment. */
>     std::string bytes_in(std::move(bytes_left_over));
>
>     /* Append the new fragment */
>     bytes_in.append((const char*)in, len);
>
>     /* Test the last codepoint in the fragment for UTF8 correctness.
>      * We can do this using a (reverse) loop over the data and determine if the
>      * last byte is part of a multibyte codepoint (bit 7 will be set).
>      * If bit 6 is also set we know that this byte is the start of a codepoint,
>      * but if bit 6 is unset then it is the N-th byte of a codepoint and we need
>      * to keep scanning. */
>
>     if (!lws_is_final_fragment(wsi)) {
>       using CharBits = std::bitset<std::numeric_limits<unsigned char>::digits>;
>       auto cr_iter = bytes_in.crbegin();
>       for ( ;cr_iter != bytes_in.crend(); ++cr_iter) {
>         CharBits current_byte(*cr_iter);
>         if (current_byte.test(7)) {
>           if (current_byte.test(6)) break;
>           continue;
>         }
>         break;
>       }
>       if (cr_iter != bytes_in.crend() && CharBits(*cr_iter).test(7)) {
>
>         /* The scan stopped on a lead byte (which may be the very last byte).
>          * Verify that the codepoint is incomplete then move it to the buffer.
>          * (We could just move it without testing, this would always
>          * save a last multibyte character for the next/final fragment.) */
>
>         auto pos = std::distance(cr_iter, bytes_in.crend()) - 1;
>         auto iter = bytes_in.begin() + pos;
>         unsigned bytes_needed = 1;
>         CharBits first_byte(*iter);
>         if (first_byte.test(6)) bytes_needed++;
>         if (first_byte.test(5)) bytes_needed++;
>         if (first_byte.test(4)) bytes_needed++;
>         if (bytes_needed != bytes_in.length() - pos) {
>           bytes_left_over.assign(iter, bytes_in.end());
>           bytes_in.erase(pos);
>         }
>       }
>     }
>
>     /* The parser expects UCS so convert the UTF-8 data in the fragment, then
>      * process it. This should not throw any exception, but you never know... */
>
>     bool parser_retv = false;
>     try {
>       using convert_type = std::codecvt_utf8<wchar_t>;
>       std::wstring_convert<convert_type, wchar_t> converter;
>       parser_retv = parser.read(converter.from_bytes(bytes_in));
>     }
>     catch (fancy_runtime_error*) {
>       /* Catch and rethrow non-STL exceptions (from our parser) to be caught by
>        * the exception handler in the 'runable' method of this Thread. */
>       throw;
>     }
>     catch (std::runtime_error& ex) {
>       /* Just ignore exceptions from the STL and let the parser fail instead. */
>       dbg_wprintf(L"Caught STL exception: %s\n", ex.what());
>     }
>
>     if (lws_is_final_fragment(wsi)) {
>       ... do something with the processed WS textmessage ...
>     }
>
>     break;
>   }
> ...
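
The boundary handling in the snippet above can also be written as a small standalone helper, which makes it easier to test in isolation. What follows is only a sketch under a few assumptions: the names incomplete_utf8_tail and take_complete_utf8 are made up for illustration, the input is assumed to be otherwise valid UTF-8 (no validation is attempted), and the carry-over string is passed in by the caller so that it can live in per-connection storage rather than in a function-local static.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Returns how many trailing bytes of `s` belong to an incomplete UTF-8
// sequence, or 0 if `s` ends on a codepoint boundary.
// Assumption: `s` is otherwise valid UTF-8; nothing is validated here.
std::size_t incomplete_utf8_tail(const std::string& s)
{
    const std::size_t n = s.size();
    // An incomplete sequence is at most 3 bytes long, so look back 3 bytes.
    for (std::size_t back = 1; back <= 3 && back <= n; ++back) {
        const unsigned char c = static_cast<unsigned char>(s[n - back]);
        if ((c & 0x80) == 0x00) return 0;   // ASCII byte: clean boundary
        if ((c & 0xC0) == 0x80) continue;   // continuation byte: keep scanning
        // Lead byte: derive the sequence length from its high bits.
        const std::size_t need = (c & 0xE0) == 0xC0 ? 2
                               : (c & 0xF0) == 0xE0 ? 3 : 4;
        return back < need ? back : 0;
    }
    return 0;
}

// Prepends the bytes carried over from the previous fragment and, unless
// this is the final fragment, moves an incomplete trailing sequence back
// into `carry`. Returns the bytes that are safe to decode now.
std::string take_complete_utf8(std::string& carry, const char* in,
                               std::size_t len, bool final_fragment)
{
    std::string data(std::move(carry));
    carry.clear();
    data.append(in, len);
    if (!final_fragment) {
        const std::size_t tail = incomplete_utf8_tail(data);
        assert(tail <= data.size());
        carry.assign(data, data.size() - tail, tail);
        data.erase(data.size() - tail);
    }
    return data;
}
```

In a receive callback this could be called as take_complete_utf8(carry, (const char*)in, len, lws_is_final_fragment(wsi)), with carry kept per connection.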

2) My second question is about throwing an exception from within LWS_CALLBACK_RECEIVE.

In the code snippet above the fragment parser may throw an exception. If this happens, my exception handler calls lws_cancel_service() and lws_context_destroy(), but two memory blocks (allocated by lws) are lost.

The first is the struct lws_context allocated at line 1032 of context.c (lws 3.0):
    context = lws_zalloc(sizeof(struct lws_context), "context");

The second block is the WS text-message fragment that has not been freed due to the exception (this is also the cause of the first memory error...).

I understand that lws is a C library and is not at fault, but do you have any suggestion for how I might free the fragment (from my exception handler, or before throwing the exception in the first place)?

> ==11558==
> ==11558== HEAP SUMMARY:
> ==11558==     in use at exit: 4,944 bytes in 2 blocks
> ==11558==   total heap usage: 27,960 allocs, 27,959 frees, 2,364,205 bytes allocated
> ==11558==
> ==11558== Thread 1:
> ==11558== 4,944 (848 direct, 4,096 indirect) bytes in 1 blocks are definitely lost in loss record 2 of 2
> ==11558==    at 0x4C2CABF: malloc (vg_replace_malloc.c:298)
> ==11558==    by 0x4C2EE04: realloc (vg_replace_malloc.c:785)
> ==11558==    by 0x174781: lws_zalloc (alloc.c:82)
> ==11558==    by 0x173FC1: lws_create_context (context.c:1032)
> ==11558==    by 0x15E82E: Lws::Server::Runable::Main() (LwsServer.cpp:154)
> ==11558==    by 0x1446B9: Thread::ThreadMain() (Thread.cpp:70)
> ==11558==    by 0x5C1897E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25)
> ==11558==    by 0x64905A9: start_thread (pthread_create.c:463)
> ==11558==    by 0x679DCBE: clone (clone.S:95)
> ==11558==
> ==11558== LEAK SUMMARY:
> ==11558==    definitely lost: 848 bytes in 1 blocks
> ==11558==    indirectly lost: 4,096 bytes in 1 blocks
> ==11558==      possibly lost: 0 bytes in 0 blocks
> ==11558==    still reachable: 0 bytes in 0 blocks
> ==11558==         suppressed: 0 bytes in 0 blocks
> ==11558==
> ==11558== For counts of detected and suppressed errors, rerun with: -v
> ==11558== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
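
One possible way around the leak would be to keep the exception from ever unwinding through the C frames of the library: catch it inside the callback, remember it, return an error code, and rethrow only after the service call has returned to C++ code. The sketch below illustrates just that pattern and does not use lws itself. SessionState, parse_fragment, on_receive and rethrow_pending are hypothetical names, and it is an assumption (to be checked against the lws documentation) that a nonzero return from the real callback makes lws close the connection and release the fragment on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical per-connection state; with lws this might live in (or be
// reachable from) the per-session user memory.
struct SessionState {
    std::exception_ptr pending;  // non-null once the handler has thrown
};

// Stand-in for the fragment parser from the post; throws on bad input.
bool parse_fragment(const std::string& text)
{
    if (text.empty()) throw std::runtime_error("empty fragment");
    return true;
}

// Callback body: no exception is allowed to leave this function, so the
// C caller is never unwound through. A nonzero return signals failure.
int on_receive(SessionState& st, const char* in, std::size_t len) noexcept
{
    try {
        parse_fragment(std::string(in, len));
        return 0;
    } catch (...) {
        st.pending = std::current_exception();  // remember it for later
        return -1;
    }
}

// To be called from C++ code once the C service call has returned.
void rethrow_pending(SessionState& st)
{
    if (st.pending != nullptr) {
        std::exception_ptr e = st.pending;
        st.pending = nullptr;
        std::rethrow_exception(e);
    }
}
```

With this arrangement the thread's existing exception handler would still see the original exception, only one step later, after the library has had the chance to clean up.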

Kind regards,
Alexander Bruines


