11 messages in org.apache.lucene.java-user

From, Sent On
Rob Audenaerde, Apr 23, 2014 3:56 am
Michael Sokolov, Apr 23, 2014 7:11 am
Rob Audenaerde, Apr 23, 2014 7:30 am
Shai Erera, Apr 23, 2014 7:38 am
Rob Audenaerde, Apr 23, 2014 7:49 am
Shai Erera, Apr 23, 2014 8:13 am
Rob Audenaerde, Apr 23, 2014 8:49 am
Shai Erera, Apr 24, 2014 3:20 am
Shai Erera, Apr 27, 2014 12:27 pm
Rob Audenaerde, Apr 29, 2014 12:04 am
Shai Erera, Apr 29, 2014 12:43 am

Subject:Re: Getting multi-values to use in filter?
From:Shai Erera (ser@gmail.com)
Date:Apr 27, 2014 12:27:18 pm
List:org.apache.lucene.java-user

Hi Rob,

Your question got me interested, so I wrote a quick prototype of what I think solves your problem (and if not, I hope it solves someone else's! :)). The idea is to write a special ValueSource, e.g. MaxValueSource, which reads a BinaryDocValues field, decodes the values and returns the maximum one. It can then be embedded in an expression quite easily.

I published a post on Lucene expressions and included some prototype code which demonstrates how to do it. Hope it's still helpful to you: http://shaierera.blogspot.com/2014/04/expressions-with-lucene.html.

Shai

I don't think that you should use the facet module. If all you want is to encode a bunch of numbers under a 'foo' field, you can encode them into a byte[] and index them as a BDV. Then at search time you get the BDV and decode the numbers back. The facet module adds complexity here: yes, you get the encoding/decoding for free, but at the cost of adding mock categories to the taxonomy, or using associations, for no good reason IMO.
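
A minimal sketch of the encode/decode step, assuming fixed-width 8-byte values. The class name MultiValueCodec and the fixed-width layout are illustrative assumptions, not what the facet module does internally, and the Lucene indexing and lookup calls are left out entirely:

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: plain Java, no Lucene classes involved. */
final class MultiValueCodec {
    private MultiValueCodec() {}

    /** Packs each value as 8 bytes in ByteBuffer's default (big-endian) order. */
    static byte[] encode(long[] values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Long.BYTES);
        for (long value : values) {
            buffer.putLong(value);
        }
        return buffer.array();
    }

    /** Reads the values back in the order they were written. */
    static long[] decode(byte[] bytes) {
        if (bytes.length % Long.BYTES != 0) {
            throw new IllegalArgumentException(
                "length must be a multiple of " + Long.BYTES);
        }
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        long[] values = new long[bytes.length / Long.BYTES];
        for (int i = 0; i < values.length; i++) {
            values[i] = buffer.getLong();
        }
        return values;
    }
}
```

The byte[] returned by encode is what would be stored per document under the 'foo' field; at search time the bytes read back for a document go through decode. A variable-length encoding would save space for small numbers, at the cost of a slightly more involved decoder.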

Once you do that, you need to figure out how to extend the expressions module to support a function like maxValues(fieldName) (cannot use 'max' since it's reserved). I read about it some, and still haven't figured out exactly how to do it. The JavascriptCompiler can take custom functions to compile expressions, but the methods should take only double values. So I think it should be some sort of binding, but I'm not sure yet how to do it. Perhaps it should be a name like max_fieldName, which you add a custom Expression to as a binding ... I will try to look into it later.
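
To make the naming idea concrete, here is a rough sketch in plain Java of how a variable such as max_fieldName could be split into a function and a field and then evaluated over the decoded values of one document. Everything in it (the AggregateVariables class, the resolve method, the Map standing in for per-document values) is hypothetical and is not part of the expressions module; it only illustrates the convention, and it returns a double because that is what expressions work with:

```java
import java.util.Arrays;
import java.util.Map;

/** Hypothetical resolver for variables named like "max_fieldName". */
final class AggregateVariables {
    private AggregateVariables() {}

    /**
     * Splits the variable at the first underscore into a function name and
     * a field name, then aggregates that field's values for one document.
     */
    static double resolve(String variable, Map<String, long[]> valuesByField) {
        int separator = variable.indexOf('_');
        if (separator <= 0 || separator == variable.length() - 1) {
            throw new IllegalArgumentException("expected function_field: " + variable);
        }
        String function = variable.substring(0, separator);
        String field = variable.substring(separator + 1);

        long[] values = valuesByField.get(field);
        if (values == null || values.length == 0) {
            throw new IllegalArgumentException("no values for field: " + field);
        }
        switch (function) {
            case "max":
                return Arrays.stream(values).max().getAsLong();
            case "min":
                return Arrays.stream(values).min().getAsLong();
            case "sum":
                return Arrays.stream(values).sum();
            default:
                throw new IllegalArgumentException("unknown function: " + function);
        }
    }
}
```

Because the split happens at the first underscore, a field name that itself contains underscores (max_field_a) still resolves to the field field_a.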

On Wed, Apr 23, 2014 at 6:49 PM, Rob Audenaerde <rob.@gmail.com> wrote:

Thanks for all the questions, gives me an opportunity to clarify it :)

I want the user to be able to supply a (simple) formula (so I don't know it beforehand) and use that formula in the search. The JavaScript expressions are really powerful for this use case, but have the single-value limitation. Ideally, I would like to make it really flexible by, for example, allowing (in-document aggregating) expressions like: max(fieldA) - fieldB > fieldC.

Currently, using single values, I can handle expressions of the form "fieldA - fieldB - fieldC > 0" and evaluate the long value that I receive from the FunctionValues and the ValueSource. I also optimize the query by ensuring that the field exists and has a value, etc., to keep the search fast enough. This works well, but for single values only.

I also looked into the facets Association Fields, as they look somewhat like the thing that I want. However, in the faceting module all ordinals and values are stored in one field, so there is no easy way to extract the fields that are used in the expression.

I like the solution you suggested: encode the numeric values into a byte[] like the facets do, but on a per-field basis, so that each numeric field has a BDV field that contains all of that field's values for that document.

Now that I am typing this, I think there is another way. I could use the faceting module and add a different facet field ($facetFIELDA, $facetFIELDB) in the FacetsConfig for each field. That way it would be relatively straightforward to get all the values for a field, as they are exactly the values in the BDV of that document's facet field. Only aggregating all facets will be harder, as the TaxonomyFacetSum*Associations would need to do this for all fields that I need facet counts/sums for.

What do you think?

-Rob

On Wed, Apr 23, 2014 at 5:13 PM, Shai Erera <ser@gmail.com> wrote:

A NumericDocValues field can only hold one value. Have you thought about encoding the values in a BinaryDocValues field? Or are you talking about multiple fields (different names), each with its own single value, and at search time you sum the values from a different set of fields?

If it's one field, multiple values, then why do you need to separate the values? Is it because you sometimes sum and sometimes e.g. avg? Do you always include all values of a document in the formula, but the formula changes between searches, or do you sometimes use only a subset of the values?

If you always use all values, but change the formula between queries, then perhaps you can just encode the pre-computed values under different NDV fields? If you only use a handful of functions (and they are known in advance), it may not be too heavy on the index, and would definitely perform better during search.
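
As a rough illustration of the pre-computation idea, the sketch below derives one single value per function at index time. The PrecomputedAggregates class and the field-name suffixes (_sum, _max, _min) are assumptions made for the example; each entry of the returned map would correspond to one single-valued NDV field on the document, and the actual Lucene indexing calls are not shown:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical index-time helper: one derived single value per function. */
final class PrecomputedAggregates {
    private PrecomputedAggregates() {}

    /** Maps derived field names (e.g. "field_sum") to their pre-computed values. */
    static Map<String, Long> forField(String field, long[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values for field: " + field);
        }
        Map<String, Long> derived = new LinkedHashMap<>();
        derived.put(field + "_sum", Arrays.stream(values).sum());
        derived.put(field + "_max", Arrays.stream(values).max().getAsLong());
        derived.put(field + "_min", Arrays.stream(values).min().getAsLong());
        return derived;
    }
}
```

At search time a formula would then refer to field_sum or field_max directly, so it stays within what single-valued fields already support.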

Otherwise, I believe I'd consider indexing them as a BDV field. For facets, we basically need the same multi-valued numeric field, and given that NDV is single valued, we went w/ BDV.

If I misunderstood the scenario, I'd appreciate it if you could clarify it :)

Shai

On Wed, Apr 23, 2014 at 5:49 PM, Rob Audenaerde <rob.@gmail.com> wrote:

Hi Shai, all,

I am trying to write that Filter :). But I'm a bit at a loss as to how to grab the multi-values efficiently. I can call context.reader().document(), which reads the stored fields, but that seems slow.

For single-value fields I use a compiled JavaScript Expression with SimpleBindings as ValueSource, which seems to work quite well. The downside is that I cannot find a way to implement multi-value through that solution.

These create, for example, a LongFieldSource, which uses the FieldCache.LongParser. These parsers only seem to parse one field.

Is there an efficient way to get -all- of the (numeric) values for a field in a document?

On Wed, Apr 23, 2014 at 4:38 PM, Shai Erera <ser@gmail.com> wrote:

You can do that by writing a Filter which returns matching documents based on a sum of the field's value. However I suspect that is going to be slow, unless you know that you will need several such filters and can cache them.

Another approach would be to write a Collector which serves as a Filter, but computes the sum only for documents that match the query. Hopefully that would mean you compute the sum for fewer documents than you would have w/ the Filter approach.

Shai

On Wed, Apr 23, 2014 at 5:11 PM, Michael Sokolov < msok@safaribooksonline.com> wrote:

This isn't really a good use case for an index like Lucene. The most essential property of an index is that it lets you look up documents very quickly based on *precomputed* values.

-Mike

On 04/23/2014 06:56 AM, Rob Audenaerde wrote:

Hi all,

I'm looking for a way to use multi-values in a filter.

I want to be able to search on sum(field)=100, where field has these values in one document:

field=60 field=40

In this case 'field' is a LongField. I examined the code in the FieldCache, but that seems to focus on single-valued fields only, or

Is this something that can be done in Lucene? And what would be a good approach?
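
To pin down the intended semantics of sum(field)=100, here is a small plain-Java sketch. The SumFilter class and the in-memory Map from document number to values are stand-ins invented for the example, not Lucene APIs; it only shows which documents should match:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the filter: keeps documents whose values sum to a target. */
final class SumFilter {
    private SumFilter() {}

    /** Returns the document numbers, in ascending order, whose values add up to target. */
    static List<Integer> matching(Map<Integer, long[]> valuesByDoc, long target) {
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, long[]> entry : valuesByDoc.entrySet()) {
            long sum = 0;
            for (long value : entry.getValue()) {
                sum += value;
            }
            if (sum == target) {
                hits.add(entry.getKey());
            }
        }
        Collections.sort(hits);
        return hits;
    }
}
```

A document holding field=60 and field=40 matches a target of 100, while a document holding only field=60 does not.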

Thanks in advance,
