|Marvin Humphrey||Mar 30, 2009 10:21 pm|
|Michael McCandless||Apr 1, 2009 4:51 am|
|Marvin Humphrey||Apr 8, 2009 10:42 am|
|Michael McCandless||Apr 9, 2009 3:51 am|
|Marvin Humphrey||Apr 9, 2009 10:38 am|
|Michael McCandless||Apr 10, 2009 6:51 am|
|Marvin Humphrey||Apr 10, 2009 2:38 pm|
|Marvin Humphrey||Apr 10, 2009 6:50 pm|
|Michael McCandless||Apr 11, 2009 7:58 am|
|Michael McCandless||Apr 11, 2009 8:16 am|
|Marvin Humphrey||Apr 12, 2009 6:08 am|
|Michael McCandless||Apr 12, 2009 12:15 pm|
|Marvin Humphrey||Apr 12, 2009 2:04 pm|
|Michael McCandless||Apr 13, 2009 6:42 am|
|Marvin Humphrey||Apr 14, 2009 3:38 am|
|Michael McCandless||Apr 14, 2009 5:39 am|
|Marvin Humphrey||Apr 30, 2009 4:17 pm|
|Michael McCandless||May 1, 2009 4:01 am|
|Subject:||Re: Sort cache file format|
|From:||Marvin Humphrey (mar...@rectangular.com)|
|Date:||Apr 8, 2009 10:42:41 am|
On Wed, Apr 01, 2009 at 07:52:11AM -0400, Michael McCandless wrote:
Can the same field name have more than one FieldSpec?
Absolutely not. I have always regarded that feature of Lucene as insane. It's like an SQL engine where any INSERT can silently ALTER TABLE.
Unsurprisingly, the back-end code to support that insane feature is ginormous and convoluted and has been the source of many bugs over the years. LUCENE-1590 is only the latest.
What's otherwise the difference?
[regarding per-field files vs. per-FieldSpec files]
Lexicon objects in KinoSearch only iterate over one field's terms. PostingLists are single-field, too -- you can't seek to a term in another field.
In the Lucene file format, each term in the term dictionary has to list a field number. That wastes space compared to the KinoSearch format, since in KS there is only one field number per file and it is encoded within the file's name.
Furthermore, per-field lexicon and postings files are easier to troubleshoot. You just look at the directory listings and you get a pretty good idea of what's going on: files that are missing, that have zero length, etc, are big red flags.
If you're using the compound file format, you don't have to worry about running out of file descriptors, no matter how many virtual files you use -- because a single descriptor is shared by all of them. Thus, the only consequence of dividing up files per-FieldSpec rather than per-field is... to make the index more opaque and harder to debug. :( It's a step backwards.
However, I think Lucy will have to accept that loss of clarity in order to support the non-compound format. :(
We'd like to be able to use the field's lexicon to obtain the field values. However, since lexicons use prefix compression, we can't jump to a random entry. Therefore, we either need to change the lexicon format for sortable fields so that it doesn't use prefix compression, or write redundant data.
The requirement is that each doc has a single value, right?
For numeric fields, can't you simply store the values instead of separate ord + values? EG sorting float/long/doubles is very easy. Also for smaller types (byte, short) you can be much more compact with only the values.
Right now there are three core FieldSpec classes in KS:
FullTextField StringField BlobField
Only "fulltext" fields are associated with Analyzers (and they are *always* analyzed). Only "string" fields -- which are always single-value since they aren't analyzed -- allow sorting.
The plan in the near future is to add additional field types such as these:
Int8Field Int16Field Int32Field Int64Field Float32Field TimeStamp32Field TimeStamp64Field
I think the sort cache format for a given field would be dependent on the FieldSpec class. So would the encode/decode for the full document storage, for that matter.
Does/will Lucy have column-stride field storage?
I think these sort caches should be either column-stride or variable width. Which one would be most appropriate would be data dependent. I don't know how important it is to expose an API to support multiple sort cache types. I suppose that someone will need it eventually for maximum efficiency.
As I mentioned in this JIRA comment for Lucene column-stride fields this morning...
... I think the only purpose for column-stride fields should be to aid search.
For pure Unicode character data, a two-file format would work well.
* A stack of 64-bit file offsets into the character data file. * Pure character data, with string lengths implied by the offset file.
Now it turns out that our offsets have a funny property: they sort in the same order as the string data that they point to. Because of that, we can theoretically use them as our sort-ords, and do away with the stack of 32-bit sort ords.
Except, they take 2X the storage as int ords.
True. For a field with a small number of unique values, the 32-bit ords are a
// Less space with a small number of unique values: 32-bit ord => 64-bit file pointer => character data
// Less space with many unique values: 64-bit file pointer => character data
However, I think that we should accomodate fields with few values using dedicated enum field types, e.g. a "StringEnumField" that requires you to declare all possible values in advance.
There's still an unoptimized case where a field has few values, but they aren't all known in advance. It would be nice if we could come up with a field type for that.
Nevertheless, I think we cover a lot of cases if we support both an enum type and a one-value-per-doc type.