Project homepage Mailing List  API Docs  Github Mirror 
{"schema":"libjg2-1", "vpath":"/git/", "avatar":"/git/avatar/", "alang":"en-US,en;q\u003d0.5", "gen_ut":1634826688, "reponame":"libwebsockets", "desc":"libwebsockets lightweight C networking library", "owner": { "name": "Andy Green", "email": "", "md5": "c50933ca2aa61e0fe2c43d46bb6b59cb" },"url":"", "f":3, "items": [ { "schema":"libjg2-1", "oid":{ "oid": "7fa96facbb189706cabe121a81c0dd2cdcb02f2b", "alias": [ "refs/heads/main"]},"tree": [ { "name": "","mode": "33188", "size":11280}, { "name": "private-lib-misc-fts.h","mode": "33188", "size":404}, { "name": "trie-fd.c","mode": "33188", "size":21890}, { "name": "trie.c","mode": "33188", "size":32372}],"s":{"c":1634826688,"u": 1332}} ,{"schema":"libjg2-1", "cid":"83a2ddb8f8a9cdca96d7d063901a1d09", "oid":{ "oid": "7fa96facbb189706cabe121a81c0dd2cdcb02f2b", "alias": [ "refs/heads/main"]},"blobname": "lib/misc/fts/", "blob": "# LWS Full Text Search\n\n## Introduction\n\n![lwsac flow](/doc-assets/lws-fts.svg)\n\nThe general approach is to scan one or more UTF-8 input text \u0022files\u0022 (they may\nonly exist in memory) and create an in-memory optimized trie for every token in\nthe file.\n\nThis can then be serialized out to disk in the form of a single index file (no\nmatter how many input files were involved or how large they were).\n\nThe implementation is designed to be modest on memory and cpu for both index\ncreation and querying, and suitable for weak machines with some kind of random\naccess storage. For searching only memory to hold results is required, the\nactual searches and autocomplete suggestions are done very rapidly by seeking\naround structures in the on-disk index file.\n\nFunction|Related Link\n---|---\nPublic API|[include/libwebsockets/lws-fts.h](\nCI test app|[minimal-examples/api-tests/api-test-fts](\nDemo minimal example|[minimal-examples/http-server/minimal-http-server-fulltext-search](\nLive Demo|[](\n\n## Query API overview\n\nSearching returns a potentially very large lwsac allocated object, with contents\nand max size controlled by the members of a struct lws_fts_search_params passed\nto the search function. Three kinds of result are possible:\n\n### Autocomplete suggestions\n\nThese are useful to provide lists of extant results in\nrealtime as the user types characters that constrain the search. So if the\nuser has typed 'len', any hits for 'len' itself are reported along with\n'length', and whatever else is in the index beginning 'len'.. The results are\nselected using and are accompanied by an aggregated count of results down that\npath, and the results so the \u0022most likely\u0022 results already measured by potential\nhits appear first.\n \nThese results are in a linked-list headed by `result.autocomplete_head` and\neach is in a `struct lws_fts_result_autocomplete`.\n \nThey're enabled in the search results by giving the flag\n `LWSFTS_F_QUERY_AUTOCOMPLETE` in the search parameter flags.\n \n### Filepath results \n\nSimply a list of input files containing the search term with some statistics,\none file is mentioned in a `struct lws_fts_result_filepath` result struct.\n\nThis would be useful for creating a selection UI to \u0022drill down\u0022 to individual\nfiles when there are many with matches.\n\nThis is enabled by the `LWSFTS_F_QUERY_FILES` search flag.\n\n### Filepath and line results\n \nSame as the file path list, but for each filepath, information on the line\nnumbers and input file offset where the line starts are provided.\n\nThis is enabled by `LWSFTS_F_QUERY_FILE_LINES`... if you additionally give\n`LWSFTS_F_QUERY_QUOTE_LINE` flag then the contents of each hit line from the\ninput file are also provided.\n \n## Result format inside the lwsac\n\nA `struct lws_fts_result` at the start of the lwsac contains heads for linked-\nlists of autocomplete and filepath results inside the lwsac.\n\nFor autocomplete suggestions, the string itself is immediately after the\n`struct lws_fts_result_autocomplete` in memory. For filepath results, after\neach `struct lws_fts_result_filepath` is\n\n - match information depending on the flags given to the search\n - the filepath string\n \nYou can always skip the line number table to get the filepath string by adding\n.matches_length to the address of the byte after the struct.\n\nThe matches information is either\n\n - 0 bytes per match\n \n - 2x int32_t per match (8 bytes) if `LWSFTS_F_QUERY_FILE_LINES` given... the\n first is the native-endian line number of the match, the second is the\n byte offset in the original file where that line starts\n\n - 2 x int32_t as above plus a const char * if `LWSFTS_F_QUERY_QUOTE_LINE` is\n also given... this points to a NUL terminated string also stored in the\n results lwsac that contains up to 255 chars of the line from the original\n file. In some cases, the original file was either virtual (you are indexing\n a git revision) or is not stored with the index, in that case you can't\n usefully use `LWSFTS_F_QUERY_QUOTE_LINE`.\n\nTo facilitate interpreting what is stored per match, the original search flags\nthat created the result are stored in the `struct lws_fts_result`.\n\n## Indexing In-memory and serialized to file\n\nWhen creating the trie, in-memory structs are used with various optimization\nschemes trading off memory usage for speed. While in-memory, it's possible to\nadd more indexed filepaths to the single index. Once the trie is complete in\nterms of having indexed everything, it is serialized to disk.\n\nThese contain many additional housekeeping pointers and trie entries which can\nbe optimized out. Most in-memory values must be held literally in large types,\nwhereas most of the values in the serialized file use smaller VLI which use\nmore or less bytes according to the value. So the peak memory requirements for\nlarge tries are much bigger than the size of the serialized trie file that is\noutput.\n\nFor the linux kernel at 4.14 and default indexing list on a 2.8GHz AMD\nthreadripper (using one thread), the stats are:\n\nName|Value\n---|---\nFiles indexed|52932\nInput corpus size|694MiB\nIndexing cpu time|50.1s (\u003e1000 files / sec; 13.8MBytes/sec)\nPeak alloc|78MiB\nSerialization time|202ms\nTrie File size|347MiB\n\nTo index libwebsockets main branch under the same conditions:\n\nName|Value\n---|---\nFiles indexed|489\nInput corpus size|3MiB\nIndexing time|123ms\nPeak alloc|3MiB\nSerialization time|1ms\nTrie File size|1.4MiB\n\n\nOnce it's generated, querying the trie file is very inexpensive, even when there\nare lots of results.\n\n - trie entry child lists are kept sorted by the character they map to. This\n allows discovering there is no match as soon as a character later in the\n order than the one being matched is seen\n \n - for the root trie, in addition to the linked-list child + sibling entries,\n a 256-entry pointer table is associated with the root trie, allowing one-\n step lookup. But as the table is 2KiB, it's too expensive to use on all\n trie entries\n\n## Structure on disk\n\nAll explicit multibyte numbers are stored in Network (MSB-first) byte order.\n\n - file header\n - filepath line number tables\n - filepath information\n - filepath map table\n - tries, trie instances (hits), trie child tables\n\n### VLI coding\n\nVLI (Variable Length Integer) coding works like this\n\n[b7 EON] [b6 .. b0 DATA]\n\nIf EON \u003d 0, then DATA represents the Least-significant 7 bits of the number.\nif EON \u003d 1, DATA represents More-significant 7-bits that should be shifted\nleft until the byte with EON \u003d 0 is found to terminate the number.\n\nThe VLI used is predicated around 32-bit unsigned integers\n\nExamples:\n\n - 0x30 \u003d 48\n - 0x81 30 \u003d 176\n - 0x81 0x80 0x00 \u003d 16384\n\nBytes | Range\n---|---\n1|\u003c\u003d 127\n2|\u003c\u003d 16K - 1\n3|\u003c\u003d 2M -1\n4|\u003c\u003d 256M - 1\n5|\u003c\u003d 4G - 1\n\nThe coding is very efficient if there's a high probabilty the number being\nstored is not large. So it's great for line numbers for example, where most\nfiles have less that 16K lines and the VLI for the line number fits in 2 bytes,\nbut if you meet a huge file, the VLI coding can also handle it.\n\nAll numbers except a few in the headers that are actually written after the\nfollowing data are stored using VLI for space- efficiency without limiting\ncapability. The numbers that are fixed up after the fact have to have a fixed\nsize and can't use VLI.\n\n### File header\n\nThe first byte of the file header where the magic is, is \u0022fileoffset\u0022 0. All\nthe stored \u0022fileoffset\u0022s are relative to that.\n\nThe header has a fixed size of 16 bytes.\n\nsize|function\n---|---\n32-bits|Magic 0xCA7A5F75\n32-bits|Fileoffset to root trie entry\n32-bits|Size of the trie file when it was created (to detect truncation)\n32-bits|Fileoffset to the filepath map\n32-bits|Number of filepaths\n\n### Filepath line tables\n\nImmediately after the file header are the line length tables.\n\nAs the input files are parsed, line length tables are written for each file...\nat that time the rest of the parser data is held in memory so nothing else is\nin the file yet. These allow you to map logical line numbers in the file to\nfile offsets space- and time- efficiently without having to walk through the\nfile contents.\n\nThe line information is cut into blocks, allowing quick skipping over the VLI\ndata that doesn't contain the line you want just by following the 8-byte header\npart.\n\nOnce you find the block with your line, you have to iteratively add the VLIs\nuntil you hit the one you want.\n\nFor normal text files with average line length below 128, the VLIs will\ntypically be a single byte. So a block of 200 line lengths is typically\n208 bytes long.\n\nThere is a final linetable chunk consisting of all zeros to indicate the end\nof the filepath line chunk series for a filepath.\n\nsize|function\n---|---\n16-bit|length of this chunk itself in bytes\n16-bit|count of lines covered in this chunk\n32-bit|count of bytes in the input file this chunk covers \nVLI...|for each line in the chunk, the number of bytes in the line\n\n\n### Filepaths\n\nThe single trie in the file may contain information from multiple files, for\nexample one trie may cover all files in a directory. The \u0022Filepaths\u0022 are\nlisted after the line tables, and referred to by index thereafter.\n\nFor each filepath, one after the other:\n\nsize|function\n---|---\nVLI|fileoffset of the start of this filepath's line table\nVLI|count of lines in the file\nVLI|length of filepath in bytes\n...|the filepath (with no NUL)\n\n### Filepath map\n\nTo facilitate rapid filepath lookup, there's a filepath map table with a 32-bit\nfileoffset per filepath. This is the way to convert filepath indexes to\ninformation on the filepath like its name, etc\n\nsize|function\n---|---\n32-bit...|fileoffset to filepath table for each filepath\n\n### Trie entries\n\nImmediately after that, the trie entries are dumped, for each one a header:\n\n#### Trie entry header\n\nsize|function\n---|---\nVLI|Fileoffset of first file table in this trie entry instance list\nVLI|number of child trie entries this trie entry has\nVLI|number of instances this trie entry has\n\nThe child list follows immediately after this header\n\n#### Trie entry instance file\n\nFor each file that has instances of this symbol:\n\nsize|function\n---|---\nVLI|Fileoffset of next file table in this trie entry instance list\nVLI|filepath index\nVLI|count of line number instances following\n\n#### Trie entry file line number table\n\nThen for the file mentioned above, a list of all line numbers in the file with\nthe symbol in them, in ascending order. As a VLI, the median size per entry\nwill typically be ~15.9 bits due to the probability of line numbers below 16K.\n\nsize|function\n---|---\nVLI|line number\n...\n\n#### Trie entry child table\n\nFor each child node\n\nsize|function\n---|---\nVLI|file offset of child\nVLI|instance count belonging directly to this child\nVLI|aggregated number of instances down all descendent paths of child\nVLI|aggregated number of children down all descendent paths of child\nVLI|match string length\n...|the match string\n","s":{"c":1634826688,"u": 376}} ],"g": 5072,"chitpc": 0,"ehitpc": 0,"indexed":0 , "ab": 1, "si": 0, "db":0, "di":1, "sat":0, "lfc": "0000"}