| From | Sent On | Attachments |
|---|---|---|
| Marvin Humphrey | Mar 30, 2009 10:21 pm | |
| Michael McCandless | Apr 1, 2009 4:51 am | |
| Marvin Humphrey | Apr 8, 2009 10:42 am | |
| Michael McCandless | Apr 9, 2009 3:51 am | |
| Marvin Humphrey | Apr 9, 2009 10:38 am | |
| Michael McCandless | Apr 10, 2009 6:51 am | |
| Marvin Humphrey | Apr 10, 2009 2:38 pm | |
| Marvin Humphrey | Apr 10, 2009 6:50 pm | |
| Michael McCandless | Apr 11, 2009 7:58 am | |
| Michael McCandless | Apr 11, 2009 8:16 am | |
| Marvin Humphrey | Apr 12, 2009 6:08 am | |
| Michael McCandless | Apr 12, 2009 12:15 pm | |
| Marvin Humphrey | Apr 12, 2009 2:04 pm | |
| Michael McCandless | Apr 13, 2009 6:42 am | |
| Marvin Humphrey | Apr 14, 2009 3:38 am | |
| Michael McCandless | Apr 14, 2009 5:39 am | |
| Marvin Humphrey | Apr 30, 2009 4:17 pm | |
| Michael McCandless | May 1, 2009 4:01 am | |
| Subject: | Re: Types and Schemas (was "Sort cache file format") | |
|---|---|---|
| From: | Michael McCandless (luc...@mikemccandless.com) | |
| Date: | Apr 13, 2009 6:42:43 am | |
| List: | org.apache.lucene.lucy-dev | |
On Sun, Apr 12, 2009 at 5:04 PM, Marvin Humphrey <mar...@rectangular.com> wrote:
I think Lucene could continue to merge yet isolate information (subdivision, subclassing). At least I sure hope so :)
I see why subdividing options might be useful in Lucene, but I'm not sure it's necessary for Lucy.
It's all still hazy to me :) Hopefully once we talk about it enough I'll get some clarity...
Actually, what we probably need are Python bindings so that you can start playing around. :)
That'd be nice but I'm quite hurting for time these days ;) Sudden bursts of innovation all over the place...
I've been trying to clean up Boilerplater enough so that porting Boilerplater::Binding::Perl to Boilerplater::Binding::Python would be a reasonable undertaking. Perl's C API and object model are so complicated that other languages will probably be a lot easier -- but right now, it's not apparent from Boilerplater's API how you would get started.
OK. It would also be good to have > 1 host language driving the design... to keep things generic/portable.
it is sort of scary that we're inventing a type system.
What's scary is that Java Lucene *has* a type system but won't admit it.
Yah. In fact Lucene is "weakly typed", like Tcl. We gleefully, secretly "merge" one type with another. I'd be happy to get to strong but dynamic typing (ie the write once schema).
EG there are many things the FieldType should somehow tell us:
* How does FieldSpec model "multi-valued" fields? Is there a boolean in the base class?
Because Lucy's Doc objects will be hash based, there will *never* be a case where the same field has two "values" per se within the same doc.
However, it's fine if we support compound types via specific FieldType subclasses, e.g. Float32ArrayType, or StringArrayType.
I see -- does KS support multi-valued (compound) types today? For which "types"? And I imagine for such types, "sortable" is not allowed (yet "sortable" is set at the top FieldSpec, right?)?
It's also important to distinguish between "multi-valued" and the "multi-token" FullTextType. FullTextType fields are tokenized within the index, but in the context of the doc reader, they only have one string "value". Note, however, that you cannot sort on a FullTextType field in KS.
So if I want to index & sort by "title" field, I make 2 separate fields?
* "Has only one token" -- I guess this is implied by the class (ie only FullTextType may have > 1 token)
For the near-to-middle-term future, yes -- FullTextType is the only multi-token, single-valued type.
Looking down the road, I suppose other types like Int32ArrayType could have more than one "token", but it wouldn't be an ordinary string "token".
OK
* Open vs closed (known set of values) enums
It would be nice to add this later. I don't think it's a high priority, since it's an optimization.
You mean you'd start with "open" enums?
* Sortable
I think this belongs in the base class -- that's where KS has it now. That way, we can perform the following test, regardless of what the type is.
if (FieldType_Sortable(field_type)) { /* Build sort cache. */ ... }
Yeah... except multi-valued (compound) types would disable this, I guess. Though Lucene users seem to hit this limitation enough to make it relaxable... and customize how SortCache gets created.
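To make the base-class idea concrete, here is a minimal C sketch of a sortable flag that lives in the base type and is tested the same way for every subclass. The struct layouts, the Int32Type definition, and the count_sort_caches() helper are hypothetical illustrations, not the actual KS or Lucy code; only the FieldType_Sortable name comes from the snippet above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical base "class": every field type carries the sortable flag. */
typedef struct FieldType {
    bool sortable;
} FieldType;

/* Hypothetical subclass: the base is embedded as the first member, so the
 * base-class accessor can be applied to it via &obj.base. */
typedef struct Int32Type {
    FieldType base;
} Int32Type;

/* Accessor that works for any type derived from FieldType. */
static bool
FieldType_Sortable(const FieldType *self) {
    assert(self != NULL);
    return self->sortable;
}

/* Counts how many of the given field types would need a sort cache. */
static size_t
count_sort_caches(const FieldType *const *types, size_t num_types) {
    size_t count = 0;
    for (size_t i = 0; i < num_types; i++) {
        if (FieldType_Sortable(types[i])) {
            count++; /* A real indexer would build the sort cache here. */
        }
    }
    return count;
}
```

A compound type that cannot be sorted would simply leave the flag false, so the caller never needs to know which subclass it is looking at.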
* nulls sort on top or bottom
This would be individual to each sort comparator. Note that we might want to use a different sort comparator for NOT NULL fields for efficiency's sake, which complicates making the comparator a method on FieldSpec.
Yes, we're iterating on this now in LUCENE-831. Though I wonder if this ought to be the realm of source code specialization... multiplying out all the combinations of "single comparator or not", "scoring or not", "track max score or not", "string index may have nulls or not", in Lucene's "true" sources (vs generated sources) starts to get crazy. Soon we'll also multiply in "docIDs guaranteed to arrive in order to the collector, or not" as well.
My general inclination is to have NULLs sort towards the end of the array.
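As an illustration of "NULLs sort towards the end", here is a small C sketch using the standard qsort() with a comparator over string values. The function names are made up for this example, and a real sort cache would compare ords or typed values rather than raw C strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator over an array of `const char *`. NULL entries compare
 * greater than any string, so they collect at the end of the array. */
static int
compare_nulls_last(const void *va, const void *vb) {
    const char *a = *(const char *const *)va;
    const char *b = *(const char *const *)vb;
    if (a == NULL && b == NULL) { return 0; }
    if (a == NULL) { return 1; }
    if (b == NULL) { return -1; }
    return strcmp(a, b);
}

/* Sorts the values in place, leaving any NULL entries at the end. */
static void
sort_nulls_last(const char **values, size_t num_values) {
    assert(values != NULL || num_values == 0);
    if (num_values > 1) {
        qsort(values, num_values, sizeof(values[0]), compare_nulls_last);
    }
}
```

A comparator specialized for NOT NULL fields could drop the three NULL branches entirely, which is the efficiency point raised above.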
* Omit norms, omit TFAP
I'm putting this off for now. It will be addressed when we refactor for flexible indexing.
OK. These would seem to live nicely under FullTextType... oh actually maybe not, because presumably I can index single-valued fields (the equivalent of NOT_ANALYZED in Lucene). EG an Int32Type may in fact be indexed, and I would at that point want to put omit norms/TFAP there. Hmmm, cross cutting concerns. Maybe sub-typing is needed...
* Binary or not (I guess BlobType <-> binary)
BlobType is one binary type, but I propose adding others, e.g. Int32Type.
Binary() should be an abstract method on the base class. It shouldn't be a boolean flag member, because it's not something that can be switched up within a class.
OK.
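One way to picture "abstract method rather than boolean member" in C is a per-class method table. The sketch below is only an assumption about how that could look: the vtable layout and the StringType class are hypothetical, and Boilerplater's generated code may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct FieldType FieldType;

/* Hypothetical per-class method table, shared by all instances of a class. */
typedef struct FieldTypeVTable {
    bool (*binary)(const FieldType *self); /* left NULL by the abstract base */
} FieldTypeVTable;

struct FieldType {
    const FieldTypeVTable *vtable;
};

/* Each concrete class hard-codes its answer; no instance can change it. */
static bool BlobType_binary(const FieldType *self)   { (void)self; return true; }
static bool StringType_binary(const FieldType *self) { (void)self; return false; }

static const FieldTypeVTable BLOBTYPE_VTABLE   = { BlobType_binary };
static const FieldTypeVTable STRINGTYPE_VTABLE = { StringType_binary };

/* Dispatches through the vtable, so the result depends only on the class. */
static bool
FieldType_Binary(const FieldType *self) {
    assert(self != NULL && self->vtable != NULL);
    assert(self->vtable->binary != NULL); /* abstract method not overridden */
    return self->vtable->binary(self);
}
```

Because the answer lives in the class's table and not in the object, there is no flag that could be flipped per instance.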
* Term vectors or not, positions, offsets
Term vectors are unique to FullTextType, since it is the only multi-token field. Right now in KS, it's a boolean member var in FullTextType.
Single-token indexed fields might want term vectors too?
* Stored or not -- toplevel?
Yes. As a boolean member.
Makes sense.
* CSF'd or not
Right now, I'd say keep this out of core.
OK, and, merge with sort cache somehow. For most types they are one and the same.
* ValueSource is XYZ for this field
I'd like to avoid ValueSource if we can. I think it's better to add real binary types like Int32Type, DateStamp32, and so on -- instead of faking them with strings.
Well, that's UninversionValueSource you're thinking of (faking w/ strings).
But, yes, it's not good that ValueSource has type switching internal to itself... vs, you look up the FieldType for the field and use it to "switch".
* I will use RangeFilter on this field
The "sortable" boolean member var fills this need, no?
They are different? Eg you'll add aggregates (Trie*) to your index for fast range constraints, but for sorting you just need a sort cache computed.
* Analyzer to use (exposed only FullTextType)
Analyzer should be a required constructor arg to FullTextType.
OK
* Extensibility -- so app can enroll new attrs / make new type subclasses
So long as the core performs inheritance checks rather than absolute class membership checks, subclasses will work fine.
OK.
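The difference between an inheritance check and an absolute class membership check can be sketched in C as a walk up a parent chain. Everything here (ClassInfo, ClassInfo_is_a, and the MyTextType app subclass) is hypothetical and only meant to show the idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical class descriptor: one per class, linked to its parent. */
typedef struct ClassInfo {
    const char *name;
    const struct ClassInfo *parent; /* NULL at the root of the hierarchy */
} ClassInfo;

static const ClassInfo FIELDTYPE    = { "FieldType", NULL };
static const ClassInfo FULLTEXTTYPE = { "FullTextType", &FIELDTYPE };
/* A subclass defined by an application, unknown to core. */
static const ClassInfo MYTEXTTYPE   = { "MyTextType", &FULLTEXTTYPE };

/* Inheritance check: true if klass is ancestor or descends from it. */
static bool
ClassInfo_is_a(const ClassInfo *klass, const ClassInfo *ancestor) {
    assert(ancestor != NULL);
    for (; klass != NULL; klass = klass->parent) {
        if (klass == ancestor) { return true; }
    }
    return false;
}
```

A core routine that tested `klass == &FULLTEXTTYPE` would reject MyTextType, while one that calls ClassInfo_is_a() accepts it.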
Remind me again: do custom subclasses get enrolled into the global hash in Lucy's core? I know you had said it's a thread risk, ie, not read only...
Yes.
I'm still confused. Say StandardAnalyzer is implemented in C; maybe you'd name it Lucy_Analysis_StandardAnalyzer (since C doesn't support namespaces you put prefixes in front).
FWIW, the current implementation of Boilerplater only supports two level namespacing (with nicknames). Outside of core, fully qualified code would look like this:
lucy_StandardAnalyzer *analyzer = lucy_StdAnalyzer_new();
lucy_Inversion *inversion = Lucy_StdAnalyzer_Transform_Text(charbuf);
What are the two levels here? Level 1 is "StdAnalyzer", and Level 2 is "new" and "Transform_Text"?
One of the constraints the two-level limitation imposes is that the last part of every core class name must be unique. However, it makes for fully qualified C names that are just cumbersome rather than unworkably long.
OK
Any time something in core wants to use that class, it refers to it by name (and the C compiler/linker maps it), not via the global hash?
For the most part. A quick once-over of the KS code seems to indicate that the exceptions to that rule are all related to Deserialize() and Load().
OK
But for deserializing a core object, when the deserializer is implemented in C, I agree you'd need a global lookup; basically because you can't consult the OBJ's symbol table dynamically. (If you have a hosty deserializer, then it would "import lucy; lucy.XXX" to find its classes).
(But it seems like that global hash should be readonly-able).
If we readonly that Hash, we can't add subclasses to it -- and therefore we won't be able to retrieve their deserializers.
I guess it's only subclasses implemented in C where this is important?
Because a hosty subclass's deserializer is using/relying on the host's namespace to find classes by name.
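To make the read-only tradeoff concrete, here is a rough C sketch of a name-to-deserializer registry. It is an assumption-laden stand-in: a small array with linear lookup instead of a real hash, no locking, and hypothetical names throughout (Registry, Registry_add, Registry_fetch, stub_deserialize).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define REGISTRY_CAPACITY 64

/* Hypothetical deserializer signature: rebuilds an object from its dump. */
typedef void *(*Deserializer)(const char *serialized);

typedef struct Registry {
    const char  *names[REGISTRY_CAPACITY];
    Deserializer funcs[REGISTRY_CAPACITY];
    size_t       size;
    bool         read_only;
} Registry;

/* Placeholder deserializer used only to have something to register. */
static void *
stub_deserialize(const char *serialized) {
    (void)serialized;
    return NULL;
}

/* Looks up a deserializer by class name; returns NULL if not registered. */
static Deserializer
Registry_fetch(const Registry *self, const char *class_name) {
    assert(self != NULL && class_name != NULL);
    for (size_t i = 0; i < self->size; i++) {
        if (strcmp(self->names[i], class_name) == 0) { return self->funcs[i]; }
    }
    return NULL;
}

/* Registers a class; fails if the registry is read-only, full, or the
 * name is already taken. */
static bool
Registry_add(Registry *self, const char *class_name, Deserializer func) {
    assert(self != NULL && class_name != NULL && func != NULL);
    if (self->read_only || self->size == REGISTRY_CAPACITY) { return false; }
    if (Registry_fetch(self, class_name) != NULL) { return false; }
    self->names[self->size] = class_name;
    self->funcs[self->size] = func;
    self->size++;
    return true;
}
```

Once read_only is set, lookups stay safe to share across threads, but a C subclass registered afterwards can never be found by name, which is the problem described above.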
Mike





