|Subject:||Explanation and solutions of some Jackrabbit queries regarding performance|
|From:||Ard Schrijvers (a.sc...@hippo.nl)|
|Date:||Jan 22, 2008 1:17:04 pm|
Hello Martin Zdila regarding JCR-1196 et al,
From time to time I see mails about slow queries and about calls like queryResult.getNodes().hasNext() taking a long time. Some queries can be slow, some data modelling structures can be slow, and seemingly trivial calls like queryResult.getNodes().hasNext() can be slow. I write 'can' all the time, because everything can and must be blisteringly fast with millions of documents, and most of the time the solutions to achieve this are extremely simple. We just have to document the pitfalls of some easily made mistakes. I'll try to find some time in the near future to document the parts I am aware of in the form of a FAQ, like the rest of this mail. For now, just some frequently made mistakes off the top of my head:
@Martin Zdila : if you are not interested in reading the rest of this mail, just add <param name="respectDocumentOrder" value="false"/> to the <SearchIndex> element of your workspace.xml (and repository.xml). Also try to avoid 4000 child nodes (certainly same-name siblings) under one node; try to create a larger tree in which nodes do not contain many child nodes. Just as on your filesystem, a single huge folder is not fast.
Question 1: why is search for xpath '/jcr:root/a/b/c' slower than '//c' or '//*[@someprop]' ?
Answer 1: When a query with a path like '/jcr:root/a/b/c' or '/jcr:root/a//*/c' is executed, the hierarchy manager has to check for every node found whether its parents are correct. Since Jackrabbit does not store hierarchical data in the index (if it did, it could no longer move a node efficiently, at least in the current architecture), hierarchies need to be checked by iterating through the lucene indexes to find the parent nodes of a result. This is CPU intensive. Although the hierarchy is cached properly since Jackrabbit 1.4, returning many results is still an expensive operation, and the first execution of a query might be slow because the hierarchy cache needs to be built up. Queries like '//c' or '//*[@someprop]' do not need to check hierarchies, because no result has to be validated against its parent nodes.
Conclusion 1: When the result set of the search is expected to be large, try to avoid path info in the xpath. Try to distinguish on something else, for example the nodetype or some property.
Question 2: My xpath was '//c' and the result size is 10,000 nodes. When I call queryResult.getNodes().hasNext() it can take minutes to complete this call.
Answer 2: For Jackrabbit version < 1.5, the default setting in the <SearchIndex> configuration in repository.xml is <param name="respectDocumentOrder" value="true"/>. This means that when a query does *not* have an 'order by' clause, result nodes will be returned in document order. Returning nodes in document order becomes increasingly slow for many results (> 1000). You can fix this by either setting respectDocumentOrder to false in your repository.xml (and in workspace.xml if you already have an existing workspace) *or* by adding an 'order by' clause to your query. A delay of minutes will then drop to 0-15 ms.
Conclusion 2: When you have a lot of results, either include an 'order by' clause or set respectDocumentOrder to false. Modelling your content with many child nodes below one single node makes the problem even worse when you have respectDocumentOrder = true and do not define an 'order by' clause.
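If you cannot change the configuration, the 'order by' route can be applied where the query string is built. Below is a minimal sketch of that idea in plain Java. The class XPathOrdering, the method withOrderBy and the property name 'title' are made up for illustration and are not part of the JCR or Jackrabbit API; pick a property that makes sense for your own content.

```java
// Hypothetical helper, not part of the JCR or Jackrabbit API.
final class XPathOrdering {

    // Appends an 'order by' clause to an XPath statement unless it already has one,
    // so that the query no longer falls back to document order.
    static String withOrderBy(String xpath, String property, boolean descending) {
        if (xpath.contains(" order by ")) {
            return xpath; // the caller already chose an ordering
        }
        return xpath + " order by @" + property + (descending ? " descending" : " ascending");
    }

    public static void main(String[] args) {
        // 'title' is an assumed property name
        System.out.println(withOrderBy("//c", "title", false));
        // prints: //c order by @title ascending
    }
}
```

The resulting string would then be passed to QueryManager.createQuery(statement, Query.XPATH) as usual.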
Question 3: My xpath is '//*[jcr:like(@propertyName, '%somevalue%')]' and it takes minutes to complete.
Answer 3: A jcr:like with % will be translated to a lucene WildcardQuery. In order to prevent extremely slow WildcardQueries, a wildcard term should not start with one of the wildcards * or ?. So this is not a Jackrabbit implementation detail, but a general Lucene issue (and, I think, an issue of inverted indexes in general).
Conclusion 3: Avoid % prefixes in jcr:like. Use jcr:contains when searching for a specific word. If jcr:contains is not suitable, you can work around the problem by creating a custom lucene analyzer for the specific property (see IndexingConfiguration, section Index Analyzers).
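To see why a leading wildcard hurts, here is a rough sketch in plain Java. It is a simplified model and not Lucene code: a sorted TreeSet stands in for the sorted list of terms of an inverted index, and the class and method names are made up. A prefix pattern ('somevalue%') can seek to the first candidate term and stop early, while a pattern with a leading wildcard ('%somevalue%') has to inspect every term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Simplified model of a sorted term dictionary; this is not Lucene's implementation.
final class TermDictionarySketch {
    private final NavigableSet<String> terms = new TreeSet<>();
    int termsVisited; // number of terms inspected by the last lookup

    void add(String term) {
        terms.add(term);
    }

    // Pattern 'prefix%': seek to the prefix, stop at the first term that no longer matches.
    List<String> prefixMatch(String prefix) {
        termsVisited = 0;
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            termsVisited++;
            if (!term.startsWith(prefix)) {
                break;
            }
            hits.add(term);
        }
        return hits;
    }

    // Pattern '%fragment%': no seek is possible, so every term is inspected.
    List<String> containsMatch(String fragment) {
        termsVisited = 0;
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            termsVisited++;
            if (term.contains(fragment)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TermDictionarySketch dict = new TermDictionarySketch();
        for (int i = 0; i < 10000; i++) {
            dict.add(String.format("term%05d", i));
        }
        dict.add("somevalue");
        dict.add("xsomevaluey");

        dict.prefixMatch("somevalue");
        System.out.println("prefix lookup visited " + dict.termsVisited + " terms");   // 2
        dict.containsMatch("somevalue");
        System.out.println("contains lookup visited " + dict.termsVisited + " terms"); // 10002
    }
}
```

With millions of distinct property values the second kind of lookup grows with the total number of terms, which is where the minutes come from.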
Question 4: I am not searching through nodes, but traversing, and this is slow
Answer 4: Model your repository so that no node has very many child nodes directly below it. Try to avoid extremely 'large folders' in your repository structure, just as a folder with a huge number of files makes your filesystem slow.
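One possible way to avoid such large folders is to spread nodes over a few levels of intermediate 'bucket' nodes derived from the node name. The sketch below is only an illustration of that idea in plain Java: the class BucketedPath, the root '/content' and the two-level hash layout are assumptions of mine, not something Jackrabbit prescribes. A date-based or otherwise meaningful hierarchy is often nicer if your content has one.

```java
// Hypothetical helper, not part of the JCR or Jackrabbit API.
final class BucketedPath {

    // Derives two intermediate levels from the hash of the node name, so that
    // nodes are spread over 256 * 256 buckets instead of one flat parent node.
    static String bucketedPath(String root, String name) {
        // %08x renders the int hash as exactly 8 lowercase hex digits
        String hex = String.format("%08x", name.hashCode());
        return root + "/" + hex.substring(0, 2) + "/" + hex.substring(2, 4) + "/" + name;
    }

    public static void main(String[] args) {
        // '/content' and 'doc-1' are assumed example values
        System.out.println(bucketedPath("/content", "doc-1"));
    }
}
```

The same name always maps to the same path, so a node can be found again without a query.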
This mail is getting too long :-) I'll come up with some extra FAQs from time to time, and if people are interested I will make a (wiki?) document for it. I might need some help though, because in some areas my knowledge might be insufficient.
To be continued,
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/WildcardQuery.html
http://wiki.apache.org/jackrabbit/IndexingConfiguration