|Subject:||Re: Sort cache file format|
|From:||Michael McCandless (luc...@mikemccandless.com)|
|Date:||Apr 1, 2009 4:51:48 am|
On Tue, Mar 31, 2009 at 1:21 AM, Marvin Humphrey <mar...@rectangular.com> wrote:
It's time to think about how to implement segment-centric sorted search in Lucy, as discussed in the JIRA thread at
... and implemented in Lucene via...
[Jeez, JIRA permalinks are even longer than mail.apache.org permalinks.]
Lucy's implementation will differ markedly from Lucene's, because we will write out sort cache files at index time to be mmap'd at search time, rather than lazily building sort caches in process memory within IndexReader.
Our segment sort cache files will need two logical parts:
* Doc-num-to-ordinal mapping.
* Field cache values.
At present, KinoSearch divides up lexicon and postings files by field number within each segment: three files for each field's lexicon and one for each field's postings. Dividing up the content so aggressively allows the files to have very simple formats, at the cost of file proliferation. However, since KS uses a compound file format for all segment files (exclusively), it only looks like there are a lot of files from KinoSearch's perspective; from the operating system's perspective, only one file descriptor is held open per segment.
In the future, it will make sense for us to divide up lexicon, postings, and sort cache files per-unique-FieldSpec rather than per-unique-field-name; we will then merely add offsets for each field, which we might store as metadata within segmeta.json. However, that's a revision I plan to take up in a month or two.
Can the same field name have more than one FieldSpec? What's otherwise the difference?
My immediate goal is to make a dev release of KinoSearch that includes real-time indexing support. My goal for the release after that is to overhaul the postings, lexicon, and sort cache formats to support "flexible indexing" and PFOR.
While those revisions are underway, it will make sense to make the per-FieldSpec revamp.
At the end of that process, I hope to have completed a draft Lucy File Format Spec, version 0.1, to which KinoSearch will adhere. In the meantime, a per-field format for sort caches will be adequate.
The obvious format for the doc-num-to-ord mapping is a stack of 32-bit integers. The doc num would be implied by the array tick, and the ordinal value by the integer value at that tick.
i32_t sort_ord = self->sort_ords[doc_num];
We'd like to be able to use the field's lexicon to obtain the field values. However, since lexicons use prefix compression, we can't jump to a random entry. Therefore, we either need to change the lexicon format for sortable fields so that it doesn't use prefix compression, or write redundant data.
The requirement is that each doc has a single value, right?
For numeric fields, can't you simply store the values instead of separate ord + values? EG sorting float/long/doubles is very easy. Also for smaller types (byte, short) you can be much more compact with only the values.
Does/will Lucy have column-stride field storage? In which case, that file can already be your sort cache. (Lucene does not, yet, and so we must un-invert an inverted single-token-per-doc field, so even numeric fields are costly for Lucene to warm).
In the interest of simplicity, let's explore the redundant-data route for now. However, let's plan to revisit that at a later date and see if we can make the lexicon do double-duty as a sort cache, because the smaller the index, the more of it that fits into the OS cache.
For pure Unicode character data, a two-file format would work well.
* A stack of 64-bit file offsets into the character data file.
* Pure character data, with string lengths implied by the offset file.
Under this scheme, the value at a given position would be obtained like so:
i64_t offset = self->offsets[sort_ord];
i64_t size   = self->offsets[sort_ord + 1] - offset;
ViewCB_Assign_Str(value, self->char_data + offset, size);
In the interest of flexibility, we might allow arbitrary codecs, in which case we would require the UTF-8 byte count to be prepended to the character data as a C32:
i64_t offset = self->offsets[sort_ord];
char *data   = self->data + offset;
i32_t size   = MATH_DECODE_C32(data); /* advances data pointer */
ViewCB_Assign_Str(value, data, size);
However, I think it makes sense to set up the SortCacheReader to know what type it is supposed to be retrieving.
Now it turns out that our offsets have a funny property: they sort in the same order as the string data that they point to. Because of that, we can theoretically use them as our sort-ords, and do away with the stack of 32-bit sort ords.
Except, they take 2X the storage as int ords.
So, perhaps we only need two files for mmap'd sort-caches:
* A stack of 64-bit file pointers, mapping doc num to character data.
* One file of character data, with each string prepended by a C32 byte count.
The two files would be named field_cache-XXX.ix and field_cache-XXX.dat, with XXX representing the field number.