[Libwebsockets] LWS Full-text search

Andy Green andy at warmcat.com
Fri Oct 19 01:29:54 CEST 2018


Hi -

Master has grown a generic, scalable, lightweight full-text search API 
that has been migrated from gitohashi; gitohashi now uses the lws 
implementation.  It was originally designed for cheap full-text searches 
of potentially huge git trees like the Linux kernel.

It can very rapidly index one or more UTF-8 text "files" (up to hundreds 
of thousands of them) into a single index file... the input files may be 
virtual / in-memory only as in the gitohashi case.

The index file can be queried to provide:

  - smart autocomplete results (these are optimized with the paths 
leading to the highest number of hits first)

  - lists of files that have matches

  - line numbers, and the file offsets of those lines, for "hits"

  - optionally quote the actual text on hit lines, if the original files 
are still available.
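
To make the "line number and line file offset" idea concrete, here is a minimal sketch in plain C.  It is not the lws implementation (lws answers this from the index file rather than by scanning the text); struct hit and find_hits() are hypothetical names made up for illustration, and the needle is assumed not to contain a newline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result record, NOT the lws result struct */
struct hit {
	int line;        /* 1-based line number containing the hit */
	size_t line_ofs; /* byte offset of the start of that line */
};

/*
 * Brute-force scan of buf for needle, recording at most one hit per
 * line.  Returns the number of hits written to out (at most max).
 */
size_t
find_hits(const char *buf, size_t len, const char *needle,
	  struct hit *out, size_t max)
{
	size_t nlen, n = 0, line_ofs = 0, i = 0;
	int line = 1;

	assert(buf && needle && out);

	nlen = strlen(needle);
	if (!nlen)
		return 0;

	while (i < len && n < max) {
		if (buf[i] == '\n') {
			/* the next line starts just after the newline */
			line++;
			line_ofs = ++i;
			continue;
		}
		if (i + nlen <= len && !memcmp(buf + i, needle, nlen)) {
			out[n].line = line;
			out[n].line_ofs = line_ofs;
			n++;
			/* skip the rest of this line: one hit per line */
			while (i < len && buf[i] != '\n')
				i++;
			continue;
		}
		i++;
	}

	return n;
}
```

With the line offset in hand, quoting the hit line is just a matter of seeking to line_ofs in the original file and reading up to the next newline, which is why that part only works if the original files are still around.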

The results come as linked-lists of structs inside a struct lwsac.

There's a demo here

https://libwebsockets.org/ftsdemo/

which is an indexed text of "The Picture of Dorian Gray" in searchable 
form with autocomplete.  The demo converts the results lwsac to JSON for 
transport on XHR.
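
As a rough illustration of that last step, the sketch below walks a linked-list of results and writes it out as a JSON array.  The struct result layout, the JSON field names and results_to_json() are all assumptions for the example; the real lws result structs and the demo's actual JSON differ, and paths are assumed not to need JSON escaping.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result node; the real lws result structs differ */
struct result {
	struct result *next;
	const char *path; /* assumed to need no JSON escaping */
	int matches;
};

/*
 * Serialize the list as a JSON array into out.
 * Returns the length written, or -1 if out is too small.
 */
int
results_to_json(const struct result *r, char *out, size_t size)
{
	size_t pos = 0;
	int n;

	assert(out && size);

	n = snprintf(out, size, "[");
	if (n < 0 || (size_t)n >= size)
		return -1;
	pos = (size_t)n;

	for (; r; r = r->next) {
		/* comma between elements, none after the last one */
		n = snprintf(out + pos, size - pos,
			     "{\"path\":\"%s\",\"matches\":%d}%s",
			     r->path, r->matches, r->next ? "," : "");
		if (n < 0 || (size_t)n >= size - pos)
			return -1;
		pos += (size_t)n;
	}

	n = snprintf(out + pos, size - pos, "]");
	if (n < 0 || (size_t)n >= size - pos)
		return -1;
	pos += (size_t)n;

	return (int)pos;
}
```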

The minimal example behind the demo is here (the libwebsockets.org 
version is the same protocol plugin running in lwsws)

https://libwebsockets.org/git/libwebsockets/tree/minimal-examples/http-server/minimal-http-server-fulltext-search

General overview and some info is here:

https://libwebsockets.org/git/libwebsockets/tree/lib/misc/fts

Public API:

https://libwebsockets.org/git/libwebsockets/tree/include/libwebsockets/lws-fts.h

CI test is here

https://libwebsockets.org/git/libwebsockets/tree/minimal-examples/api-tests/api-test-fts

The api-test-fts minimal example includes a cli app that allows you to 
create index files from other files given as an argument list, but it's 
also simple to do programmatically.

The actual querying only needs enough memory to hold the results in the 
lwsac; otherwise it costs almost nothing, since it just walks various 
structures directly in the index file.
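
To show why walking the file directly keeps query memory so low, here is a sketch of the general idea using a deliberately simpler technique than lws uses: a binary search over sorted, fixed-size records, reading one record at a time with fseek() / fread().  The struct rec layout and lookup() are hypothetical and have nothing to do with the real lws index format.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical fixed-size on-disk record; NOT the lws index format */
struct rec {
	char word[16];  /* NUL-padded key; records are sorted by key */
	uint32_t hits;
};

/*
 * Binary-search n sorted records in f, holding only one record in
 * memory at a time.  Returns 0 and fills *hits if found, 1 if the
 * word is not present, -1 on IO error.
 */
int
lookup(FILE *f, long n, const char *word, uint32_t *hits)
{
	long lo = 0, hi = n - 1;
	struct rec r;

	assert(f && word && hits);

	while (lo <= hi) {
		long mid = lo + (hi - lo) / 2;
		int c;

		if (fseek(f, mid * (long)sizeof(r), SEEK_SET) ||
		    fread(&r, sizeof(r), 1, f) != 1)
			return -1;

		c = strncmp(word, r.word, sizeof(r.word));
		if (!c) {
			*hits = r.hits;
			return 0;
		}
		if (c < 0)
			hi = mid - 1;
		else
			lo = mid + 1;
	}

	return 1;
}
```

The point is only that the working set is one small record regardless of how big the file is; the same property holds for any on-disk structure you can navigate by offset.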

Creating the indexes is more expensive, but for example to index all the 
.c and .h files in lws master (about 3MB of source) takes 124ms with a 
peak allocation of 3MB on my box, producing a 1.4MB index file.

Doing the same on the Linux kernel sources at 4.14 (53K files, 695MB 
source) takes 50s with a peak RAM allocation under 80MB and an index 
file of 350MB.  Again the queries are very low cost even on weak hardware.
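
For comparing the two data points above, a couple of trivial helpers (names are just for illustration) turn the figures into an indexing rate and an index-to-source size ratio.

```c
#include <assert.h>

/* Source megabytes indexed per second */
double
index_rate(double source_mb, double seconds)
{
	assert(seconds > 0.0);

	return source_mb / seconds;
}

/* Index file size as a fraction of the source size */
double
index_ratio(double index_mb, double source_mb)
{
	assert(source_mb > 0.0);

	return index_mb / source_mb;
}
```

Plugging in the numbers quoted above: lws master works out to roughly 24MB/s (3 / 0.124) with an index about 0.47x the source size, and the 4.14 kernel to 13.9MB/s (695 / 50) with an index about 0.50x the source size.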

-Andy

