

2 messages in org.apache.jackrabbit.users: Re : Re : Binary Content Search Probl...

| From | Sent On | Attachments |
|---|---|---|
| Patrick Wider | Oct 23, 2007 3:18 am | |
| Jukka Zitting | Oct 23, 2007 3:30 am | |

| Subject: | Re : Re : Binary Content Search Problem... | |
|---|---|---|
| From: | Patrick Wider (pat_...@yahoo.fr) | |
| Date: | Oct 23, 2007 3:18:34 am | |
| List: | org.apache.jackrabbit.users | |
Hi,
I really don't think file 3 replaces the previous ones. I create one "top"
node (called "Homepage"), to which I attach 3 different nodes using
Homepage.addNode(...) (typed as: wider:file > 'nt:file', 'mix:referenceable';
maybe there is something missing in my NodeType definition?). I also
attach 3 different nt:resource nodes. It goes like this:
    File fileTXT = new File("C:/JackRabbit/testresources/JackRabbittest.txt");
    File fileDOC = new File("C:/JackRabbit/testresources/JackRabbittest.doc");

    // file 1: jcr:data is set from a String
    Node file1 = homepage.addNode("MyStringName", "wider:file");
    Node res1 = file1.addNode("jcr:content", "nt:resource");
    res1.setProperty("jcr:mimeType", mimetype);
    res1.setProperty("jcr:encoding", "");
    res1.setProperty("jcr:lastModified", cal);
    res1.setProperty("jcr:data", "My String with MyKeyWord Content toto");
    session.save();

    // file 2: jcr:data is set from a stream on the plain text file
    Node file2 = homepage.addNode(fileTXT.getName(), "wider:file");
    Node res2 = file2.addNode("jcr:content", "nt:resource");
    res2.setProperty("jcr:mimeType", mimetype);
    res2.setProperty("jcr:encoding", "");
    res2.setProperty("jcr:lastModified", cal);
    InputStream inputTXT = new FileInputStream(fileTXT);
    res2.setProperty("jcr:data", inputTXT);
    session.save();

    // file 3: jcr:data is set from a stream on the Word document
    Node file3 = homepage.addNode(fileDOC.getName(), "wider:file");
    Node res3 = file3.addNode("jcr:content", "nt:resource");
    res3.setProperty("jcr:mimeType", mimetype);
    res3.setProperty("jcr:encoding", "");
    res3.setProperty("jcr:lastModified", cal);
    InputStream inputDOC = new FileInputStream(fileDOC);
    res3.setProperty("jcr:data", inputDOC);
    session.save();
Yes, my query returns one hit: the doc file, even though MyKeyWord appears in
all 3 contents.
I got no results before because of the missing jars. Now this problem is
resolved and the Word document is indexed!
But the simple text file is not... weird, isn't it?
BR, Patrick
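
One thing that may be worth checking here, offered as an unconfirmed guess rather than a diagnosis from this thread: the code above sets jcr:encoding to an empty string, the indexing logs quoted further down show an empty third argument in extractText(stream, text/plain, ), and "" is not a charset name that Java accepts. If a text extractor passes that value straight to a reader (an assumption about the extractor's internals), decoding text/plain content could fail while Word content, which does not rely on the encoding, is still indexed. The stdlib-only sketch below uses a hypothetical helper, resolveCharset, to show how an empty or unknown name can be detected and replaced with a fallback before it is stored.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper for illustration; not part of Jackrabbit or the JCR API.
class EncodingCheck {

    /**
     * Returns the charset with the given name, or the fallback when the
     * name is null, blank, syntactically illegal, or not supported by this JVM.
     */
    static Charset resolveCharset(String name, Charset fallback) {
        if (name == null || name.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // An empty name falls back; a valid name is resolved as given.
        System.out.println(resolveCharset("", StandardCharsets.UTF_8));
        System.out.println(resolveCharset("ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```

With such a helper, the value stored in jcr:encoding could be resolveCharset(enc, StandardCharsets.UTF_8).name(). Alternatively, if the node type permits it, the property might simply be left unset when the encoding is unknown.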
----- Original Message -----
From: Ard Schrijvers <a.sc...@hippo.nl>
To: use...@jackrabbit.apache.org; Patrick Wider <pat_...@yahoo.fr>
Sent: Tuesday, 23 October 2007, 11:55:29
Subject: RE: Re : Binary Content Search Problem...
Hello Patrick,
Did file 3 perhaps replace file 2 and file 1? Did you do a session.save() after
each different file?
Do I understand correctly that you now at least get a hit for
/jcr:root//element(*, nt:resource)[(jcr:contains(., 'MyKeyWord'))]
where you did not have this one before?
Ard
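
For readers who build this kind of query in code, here is a minimal sketch of assembling the XPath statement as a string. The class QueryText and the method containsQuery are hypothetical names, and the quote handling assumes that a single quote inside an XPath string literal is written as two single quotes. The search term may need further escaping for the full-text search syntax, which this sketch does not attempt.

```java
// Hypothetical helper for illustration; not part of Jackrabbit or the JCR API.
class QueryText {

    /**
     * Builds a full-text XPath query over all nodes of the given type,
     * in the same shape as the query quoted in this thread.
     */
    static String containsQuery(String nodeType, String term) {
        // Assumption: a single quote in an XPath string literal is doubled.
        String literal = term.replace("'", "''");
        return "/jcr:root//element(*, " + nodeType + ")"
                + "[(jcr:contains(., '" + literal + "'))]";
    }

    public static void main(String[] args) {
        System.out.println(containsQuery("nt:resource", "MyKeyWord"));
    }
}
```

The resulting string would then be handed to the repository's query facility, for example QueryManager.createQuery(statement, Query.XPATH) in the JCR API.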
Hi Ard,
Thanks for your answer, especially the part concerning the logs: it made me realize that they were disabled. Shame on me! ;-) Anyway, the logs showed me that some jars were missing from the classpath. After fixing that, I re-created my repository with one node to which I attached 3 files (that is, I created an nt:file node with an nt:resource node for each attached file). My files are:
1. jcr:data is set from a String, as you asked me to do. I put text/plain as the mime type (since the field is mandatory).
2. jcr:data is set from a stream on a simple text file (mime type: text/plain).
3. jcr:data is set from a stream on a Word document file (mime type: application/msword).
I created these nodes, and here are extracts of the logs that I got related to indexing (note that there is no error log in the whole log file, only debug).
file 1:
DEBUG - persisting change log {#addedStates=15, #modifiedStates=1, #deletedStates=0, #modifiedRefs=0} took 172ms
DEBUG - notifying 3 synchronous listeners.
DEBUG - onEvent: indexing started
DEBUG - extractText(stream, text/plain, )
DEBUG - onEvent: indexing finished in 31 ms.
file 2:
DEBUG - persisting change log {#addedStates=11, #modifiedStates=1, #deletedStates=0, #modifiedRefs=0} took 79ms
DEBUG - notifying 3 synchronous listeners.
DEBUG - onEvent: indexing started
DEBUG - extractText(stream, text/plain, )
DEBUG - onEvent: indexing finished in 0 ms.
DEBUG - got EventStateCollection
file 3:
DEBUG - persisting change log {#addedStates=11, #modifiedStates=1, #deletedStates=0, #modifiedRefs=0} took 125ms
DEBUG - notifying 3 synchronous listeners.
DEBUG - onEvent: indexing started
DEBUG - extractText(stream, application/msword, )
DEBUG - onEvent: indexing finished in 78 ms.
DEBUG - got EventStateCollection
Checking the state of the index with Luke, I could see that file 3 (Word) was tokenized, but the content of files 1 and 2 doesn't appear anywhere, even though the respective properties and nodes do appear! Consequently, when I run the following XPath query:
/jcr:root//element(*, nt:resource)[(jcr:contains(., 'MyKeyWord'))]
the only result is the Word document...
What happened to the 2 other files? Maybe the mime type (text/plain) is wrong? Or did I forget to define something? Maybe I did something wrong in my filter definition, which is:
<param name="textFilterClasses"
       value="org.apache.jackrabbit.extractor.PlainTextExtractor,
              org.apache.jackrabbit.extractor.MsWordTextExtractor,
              org.apache.jackrabbit.extractor.MsExcelTextExtractor,
              org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
              org.apache.jackrabbit.extractor.PdfTextExtractor,
              org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
              org.apache.jackrabbit.extractor.RTFTextExtractor,
              org.apache.jackrabbit.extractor.HTMLTextExtractor,
              org.apache.jackrabbit.extractor.XMLTextExtractor"/>
I thought that org.apache.jackrabbit.extractor.PlainTextExtractor could handle simple text files. As you can see, it is getting better, but I still need a little help ;-) so if you have any idea, don't hesitate.
Thank you in advance, BR Patrick
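
Since missing jars were part of the problem described above, a quick check that every class named in textFilterClasses can be loaded may help. The sketch below is plain Java with hypothetical names (ExtractorClasspathCheck, findMissing). It only verifies that each class itself is visible on the classpath, so a class could still fail later if one of its own dependencies is absent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic for illustration; not part of Jackrabbit.
class ExtractorClasspathCheck {

    /**
     * Splits a comma-separated list of class names and returns those
     * that cannot be loaded by this class's class loader.
     */
    static List<String> findMissing(String commaSeparatedClassNames) {
        List<String> missing = new ArrayList<>();
        ClassLoader loader = ExtractorClasspathCheck.class.getClassLoader();
        for (String raw : commaSeparatedClassNames.split(",")) {
            String name = raw.trim();
            if (name.isEmpty()) {
                continue;
            }
            try {
                // initialize=false: locate the class without running static initializers
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Pass the value of the textFilterClasses parameter as the first argument.
        String value = args.length > 0
                ? args[0]
                : "java.lang.String, org.apache.jackrabbit.extractor.PlainTextExtractor";
        System.out.println("Not loadable: " + findMissing(value));
    }
}
```

Running it with the same classpath as the repository and the configured value as its argument would list any extractor classes that are not loadable.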
----- Original Message -----
From: Ard Schrijvers <a.sc...@hippo.nl>
To: use...@jackrabbit.apache.org; Patrick Wider <pat_...@yahoo.fr>
Sent: Monday, 22 October 2007, 14:59:53
Subject: RE: Binary Content Search Problem...
Hello Patrick,
Patrick Wider wrote:
Of course the files contain 'myKeyWord' somehow. The text file contains it for sure, but in the document 'myKeyWord' is wrapped in bold and italic styles. I don't think the styles cause any problems, but on the other hand I have no idea how the extractors work ;-) It's just a guess.
Just to pinpoint the problem, what happens if:
1) you search for a word that is not in bold or italic style?
2) you replace inputstr with "a string to test myKeyWord", and then do the search again?
You might want to turn on the logging for the indexing and extractors, perhaps they reveal some problems. Furthermore you might want to take a look at the latest created index folder after adding a binary doc with luke [1] and see if the binary data is present as tokens in the index
Regards Ard
[1] http://www.getopt.org/luke/







