Subject: Types and Schemas (was "Sort cache file format")
From: Marvin Humphrey (mar...@rectangular.com)
Date: Apr 12, 2009 6:08:27 am
On Sat, Apr 11, 2009 at 10:58:44AM -0400, Michael McCandless wrote:
Does FieldSpec subdivide the options? E.g., options about indexing could live in their own class, with commonly used constants like "NO".
This was the motivation for that comment in Lucene (the fact that we don't subdivide means that stored-only fields suddenly have to figure out what to do with the omitNorms and omitTFAP booleans; if we had Field.Index.NO, that'd be better).
Right now, FieldSpec doesn't subdivide, but it's not a least common denominator, either. To illustrate: FieldSpec has boolean members for "indexed", "stored", and "sortable", but knows nothing about Analyzers. Analyzers are the exclusive province of the FullTextField subclass.
If you don't permit automatic merging of field types, then there isn't a need for FieldSpec to know everything about all its subclasses. I see why subdividing options might be useful in Lucene, but I'm not sure it's necessary for Lucy.
I think it's better OO design for the parent class to be simple rather than comprehensive.
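To make the simple-parent idea concrete, here is a minimal Python sketch. It is not Lucy's or KinoSearch's actual API: the constructor signatures, the default values, and the use of `str.split` as a stand-in analyzer are all assumptions made for illustration. Only the member names "indexed", "stored", and "sortable", and the fact that analyzers belong to FullTextField alone, come from the discussion above.

```python
class FieldSpec:
    """Hypothetical parent class: holds only the options every field shares."""

    def __init__(self, indexed=True, stored=True, sortable=False):
        # Defaults are assumptions, chosen only for the example.
        self.indexed = indexed
        self.stored = stored
        self.sortable = sortable


class FullTextField(FieldSpec):
    """Hypothetical subclass: the only place an analyzer appears."""

    def __init__(self, analyzer, **options):
        super().__init__(**options)
        self.analyzer = analyzer


# A stored-only field never has to decide what to do with an analyzer.
blob = FieldSpec(indexed=False, stored=True)

# str.split stands in for a real analyzer here.
body = FullTextField(analyzer=str.split)
```

Because the parent knows nothing about analyzers, a stored-only field carries no analysis options it would have to ignore.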
Well, in Lucene we could better decouple a Field's value from its "extended type". The type would still be attached to the Field's value (not to the global schema as in KS), but strongly decoupled & shared across Field instances.
That makes sense. The "extended type" class could look almost identical, but in Lucene the user would make the connection directly, while in Lucy it would be made indirectly via the field name.
Haha, awesome. :)
Lucene in fact implicitly has a global schema, in that when segments are merged, or when docs are added into a single segment, the schemas for each document or segment are "merged" according to certain rules. Once your index is optimized, you have your global schema.
That's a good way of putting it.
Dump them to a JSON-izable data structure. Include the class name so that you can pick a deserialization routine at load time.
You rely on the same namespace -> obj mapping being present at deserialize time? I.e., it's the caller's responsibility to import the same modules, ensure the names "map" to the same objs (or at least compatible ones) as were used during serialization, etc.
If the user has implemented custom subclasses, then yes, the subclasses must be loaded or you'll get a "class not found" error.
Though, for core objects, you would use the global name -> vtable mapping that Lucy core maintains?
Yes. Any core class would already be loaded.
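The dump-and-load scheme discussed above can be sketched in Python. This is an illustration under stated assumptions, not Lucy's implementation: the `REGISTRY` dict stands in for the global name -> vtable mapping, and `register`, `load_obj`, the `"_class"` key, and the `StringType` constructor are all hypothetical names.

```python
import json

# Hypothetical stand-in for the global class-name -> class mapping.
REGISTRY = {}


def register(cls):
    """Record a class under its name so it can be found at load time."""
    REGISTRY[cls.__name__] = cls
    return cls


@register
class StringType:
    def __init__(self, stored=True):
        self.stored = stored

    def dump(self):
        # Include the class name so a loader can pick the right routine.
        return {"_class": type(self).__name__, "stored": self.stored}

    @classmethod
    def load(cls, data):
        return cls(stored=data["stored"])


def load_obj(data):
    """Look up the class by name and delegate to its load routine."""
    name = data["_class"]
    try:
        cls = REGISTRY[name]
    except KeyError:
        # A custom subclass that was never loaded ends up here.
        raise LookupError(f"class not found: {name}") from None
    return cls.load(data)


# Round trip through JSON text.
text = json.dumps(StringType(stored=False).dump())
restored = load_obj(json.loads(text))
```

A class that was never registered, such as a user subclass whose module was not imported, fails the lookup, which corresponds to the "class not found" case.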
(I still don't fully understand why Lucy needs that global hash -- this is what namespaces are for).
If we didn't implement it internally, we'd need to implement it in the bindings for e.g. looking up deserialization routines. Furthermore, we need some mechanism for C-level subclassing, since that's not part of the C language. No namespaces there. :)
OK, so if I've made a custom Tokenizer doing some funky Python code instead of a regexp, I could simply implement dump/load to do the right thing.
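As a rough sketch of what such a custom tokenizer might look like in Python: the class name `FunkyTokenizer`, its `lowercase` option, and the `"_class"` key are invented for this example, and nothing here reflects the real Lucy binding API. The point is only that `dump` records configuration, while the tokenizing logic itself lives in the class that must be loaded again at deserialization time.

```python
import json


class FunkyTokenizer:
    """Hypothetical custom tokenizer: plain Python code instead of a regexp."""

    def __init__(self, lowercase=True):
        self.lowercase = lowercase

    def tokenize(self, text):
        # Split on whitespace, then on hyphens, dropping empty pieces.
        tokens = []
        for chunk in text.split():
            tokens.extend(part for part in chunk.split("-") if part)
        return [t.lower() for t in tokens] if self.lowercase else tokens

    def dump(self):
        # Only configuration is serialized, never the code.
        return {"_class": type(self).__name__, "lowercase": self.lowercase}

    @classmethod
    def load(cls, data):
        return cls(lowercase=data["lowercase"])


original = FunkyTokenizer(lowercase=True)
restored = FunkyTokenizer.load(json.loads(json.dumps(original.dump())))
```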
BTW, I saw that Earwin Burrfoot calls his type class "FieldType".
"FieldType" is probably a better name than "FieldSpec", as it implies subclasses with "Type" as a suffix: FullTextType, StringType, BlobType, Int32Type, etc.