atom feed18 messages in org.apache.lucene.lucy-devRe: Sort cache file format
FromSent OnAttachments
Marvin HumphreyMar 30, 2009 10:21 pm 
Michael McCandlessApr 1, 2009 4:51 am 
Marvin HumphreyApr 8, 2009 10:42 am 
Michael McCandlessApr 9, 2009 3:51 am 
Marvin HumphreyApr 9, 2009 10:38 am 
Michael McCandlessApr 10, 2009 6:51 am 
Marvin HumphreyApr 10, 2009 2:38 pm 
Marvin HumphreyApr 10, 2009 6:50 pm 
Michael McCandlessApr 11, 2009 7:58 am 
Michael McCandlessApr 11, 2009 8:16 am 
Marvin HumphreyApr 12, 2009 6:08 am 
Michael McCandlessApr 12, 2009 12:15 pm 
Marvin HumphreyApr 12, 2009 2:04 pm 
Michael McCandlessApr 13, 2009 6:42 am 
Marvin HumphreyApr 14, 2009 3:38 am 
Michael McCandlessApr 14, 2009 5:39 am 
Marvin HumphreyApr 30, 2009 4:17 pm 
Michael McCandlessMay 1, 2009 4:01 am 
Subject:Re: Sort cache file format
From:Michael McCandless (luc@mikemccandless.com)
Date:Apr 10, 2009 6:51:38 am
List:org.apache.lucene.lucy-dev

On Thu, Apr 9, 2009 at 1:38 PM, Marvin Humphrey <mar@rectangular.com> wrote:

Ahhhhh... OK this finally clarifies my [silly] confusion: a FieldSpec corresponds to more than one field. It's like FieldInfos in Lucene?

I wouldn't say so. It's more like a single FieldInfo on steroids.

Here's the the Schema API from KS svn trunk, translated to Java:

Schema schema = new Schema(); PolyAnalyzer analyzer = new PolyAnalyzer("en"); FullTextField fulltext = new FullTextField(analyzer); StringField notIndexed = new StringField(); notIndexed.setIndexed(false); schema.specField("title", fulltext); schema.specField("content", fulltext); schema.specField("url", notIndexed); schema.specField("category", new StringField());

So, a single FieldSpec can be used to spec multiple fields -- as in this case with "title" and "content".

So in this code, FullTextField is an instance (subclass?) of FieldSpec?

It seems like a FieldSpec represents the "extended type" of a field (ie, includes all juicy details about how it should be indexed (including an analyzer instance), stored, etc), and then you're free to have more than one field in your doc share that "extended type".

So things like "I intend to do range searching" and "I intend to sort" this field, and "I want to store term vectors, with positions but not offsets", etc., belong in FieldSpec?

Hmmm... actually I sort of talked about the beginnings of this, for Lucene, in the last paragraph here:

https://issues.apache.org/jira/browse/LUCENE-1590?focusedCommentId=12696994&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12696994

Ie, maybe we should instantiate Field.Index and tweak its options (norms, omitTFAP, etc.) and that instance becomes the type of your field (at least wrt indexing).

This is also sort of like the crazy static types one can create with generics, ie, a "type" used to be something nice simple (int, float, your own class, etc.) but now can be a rich object (instance) in itself. [A class and an instance really should not be different, anyway (prototype languages like Self don't differentiate).]

Note that unlike Lucene, there is no default Analyzer. There is no default FieldSpec, either. It's up to you to assign fields to FieldSpecs.

OK.

In some sense, it would actually be cleaner to require that each field name be associated with a unique FieldSpec object. That would make the process of defining a Schema closely analogous to SQL's CREATE TABLE idiom.

CREATE SCHEMA my_schema ( FULLTEXT title ANALYZER POLYANALYZER("en"), FULLTEXT content ANALYZER POLYANALYZER("en"), STRING url NOT INDEXED, STRING category );

However, that would require a unique Analyzer instance for each field, which would be very expensive if you had a lot of small fields.

Now that I understand FieldSpec (I think!), I think allowing sharing is fine. Different variables in my source code can share the same type...

BTW, in KS svn trunk, Schemas are now fully serialized and written to the index as "schema_NNN.json". Including Analyzers. :)

How do you serialize Analyzers again? Maybe it's because the only allowed "custom" analyzer is via specifying your own regexp (and not custom host or core source code)?

So for Lucy you'll move away from KS's required "compound file format", and allow either compound or not.

I think so. It would be nice not to foreclose on the option of a discrete file format. In some sense discrete files are actually easier to troubleshoot because you can inspect the individual binary files directly. And of course, that extra file copy at index time goes away.

I'm still torn on how crucial "live" transparency is, vs running a tool (CheckIndex, Luke) to see things, but yes it's a plus.

But: I've often wondered whether that extra copy gains performance. Lucene now sets the file size of the CFS before writing, which in theory at least means an FS can optimize placement of the file to reduce fragmentation. So that copy may be good. I suppose we could do a similar trick to the discrete files too.

But you didn't answer the original question: will you simply use the numeric value as the ord?

Yes, that was what I was thinking. For example, if the FieldSpec for a given field is an Int32Field, then we'd build a stack of the numeric field values.

OK

Also, for enumerated fields (small universe of possible values, eg "yes" or "no" in the binary case) will you bit-pack the ords?

I imagine so.

OK

I don't really understand what "to aid search" means -- that's very generic.

Meaning, we'd use the column-stride fields during matching/scoring, but IndexReader.fetchDoc() would never use them.

I'm still unsure. I think how CSFs get used will be app dependent; I guess we throw them out there and see how they're used.

EG maybe the APP stores price & manuf only in CSF and not in the "normal" stored fields... the additional cost to retrieve stored field / excerpts for each of 10 hits on the page can be quite high.

Of course it's really only a win if you never need the stored fields (ie, if you're gonna pay a seek cost, the added cost to retrieve 1 vs 10 fields is negligible), which if you're doing excerpts won't be the case.

Except, they take 2X the storage as int ords.

True. For a field with a small number of unique values, the 32-bit ords are a win.

Well, you can do better by bit packing. For enumerated value fields you usually can use far fewer than 32 bits per value.

Hmm... I suppose you could think of any field as enumerated. :) Once you invert all the docs in a segment, you know how many terms there are.

Can we take advantage of that to simplify our implementation and only provide simple variants on the bit-packed-array-of-ords data structure?

Say that after we finish indexing a segment, we discover that the "category" field, which is marked as sortable, has only 237 unique values. We can then build our ords cache as a stack of bytes. Similarly, we discover that the "enabled" field has only two values, "y" and "n". We can build our ords cache as a bit map.

In both cases, we avoid requiring the user to specify values in advance -- but because we proceed at a segment-at-a-time, we can pretend that they did.

Well if the terms are "largely" unique (eg a "title" field), you lose by treating them as enumerated (as Lucene does today) since the extra deref only adds waste.

But then, if you are going to sort by that field, you need the ord [deref] anyway, so... yeah maybe we only implement "enumerated" values, at least for the sort cache.

If I just want to retrieve values, and the values are mostly unique, it's more efficient to inline the values.

And yes at indexing time we know precisely all stats on the field so we could pick and choose betweeen the "hey it's obviously enumerated" vs the "hey most values are unique" cases, if we want to do both.

That doesn't solve all our problems, because we still need the actual values for resolving inter-segment comparisons... For fixed-width fields, we can just build a raw stack of values. For variable-width fields, we can either do something block-based, or we can use the two-file model with one pointer file and one data file.

I prefer the two-file model over the block model because comparison of long values which cross block boundaries gets messy unless you build a contiguous value by copying. With the two-file model, you just pass around pointers into the mmap'd data. Locality of reference isn't a big deal if those two files are jammed right next to each other in the compound file.

In the block-based approach, you'd still need to store, somewhere, the pointers (block + offset) for each value, somewhere?