| From | Sent On | Attachments |
|---|---|---|
| James Brady | May 4, 2008 3:35 pm | |
| Dean Thompson | Dec 5, 2008 10:43 am | |
| Grant Ingersoll | Dec 6, 2008 4:21 am | |

| Subject: | IOException: Mark invalid while analyzing HTML | |
|---|---|---|
| From: | James Brady (jame...@gmail.com) | |
| Date: | May 4, 2008 3:35:18 pm | |
| List: | org.apache.lucene.solr-user | |
Hi,
I'm seeing the problem described in SOLR-42, "Highlighting problems with HTMLStripWhitespaceTokenizerFactory": https://issues.apache.org/jira/browse/SOLR-42
I'm indexing HTML documents, and am getting reams of "Mark invalid" IOExceptions:

SEVERE: java.io.IOException: Mark invalid
    at java.io.BufferedReader.reset(Unknown Source)
    at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171)
    at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:728)
    at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:742)
    at java.io.Reader.read(Unknown Source)
    at org.apache.lucene.analysis.CharTokenizer.next(CharTokenizer.java:56)
    at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:118)
    at org.apache.solr.analysis.WordDelimiterFilter.next(WordDelimiterFilter.java:249)
    at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:33)
    at org.apache.solr.analysis.EnglishPorterFilter.next(EnglishPorterFilterFactory.java:92)
    at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:45)
    at org.apache.solr.analysis.BufferedTokenStream.read(BufferedTokenStream.java:94)
    at org.apache.solr.analysis.RemoveDuplicatesTokenFilter.process(RemoveDuplicatesTokenFilter.java:33)
    at org.apache.solr.analysis.BufferedTokenStream.next(BufferedTokenStream.java:82)
    at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:79)
    at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1518)
    at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1407)
    at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:1116)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2440)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2422)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1445)
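
For context on the top frame of that trace: `java.io.BufferedReader.reset()` throws this exception when more characters have been read past a `mark(readAheadLimit)` call than the limit allows. The sketch below reproduces the error with the plain JDK only. It is an illustration of the general mechanism, not a diagnosis of what `HTMLStripReader` does internally; the class name, the helper method, the input string, and the tiny buffer size are all assumptions chosen to make the effect easy to see.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical demo class; not part of Solr or Lucene.
public class MarkInvalidDemo {

    /**
     * Marks the stream, reads charsToRead characters, then calls reset().
     * Returns the exception message if reset() fails, or null if it succeeds.
     */
    static String resetAfterReading(int readAheadLimit, int charsToRead) throws IOException {
        // A 4-character internal buffer (an assumption for the demo) forces
        // refills early, so the mark is invalidated with very short input.
        BufferedReader reader =
                new BufferedReader(new StringReader("<p>some longer html text</p>"), 4);
        reader.mark(readAheadLimit);
        for (int i = 0; i < charsToRead; i++) {
            reader.read();
        }
        try {
            reader.reset();
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        // Reading 8 characters with a limit of 16 stays inside the limit.
        System.out.println("limit 16, read 8: " + resetAfterReading(16, 8));
        // Reading 8 characters with a limit of 2 overruns the limit.
        System.out.println("limit 2, read 8: " + resetAfterReading(2, 8));
    }
}
```

If the same mechanism is at work here, it would suggest that the reader looks further ahead in some documents than the read-ahead limit it requested, but that is a guess based on the stack trace alone.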
This is using a ~1 week old version of Solr 1.3 from SVN.
One workaround mentioned in that Jira issue was to move HTML stripping outside of Solr; can anyone suggest a better approach than that?
Thanks,
James
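
For readers who do want to try the workaround from the Jira issue (stripping HTML before the document reaches Solr), here is a minimal sketch. It assumes the JDK's built-in Swing HTML parser (`javax.swing.text.html.parser.ParserDelegator`) is acceptable for the job; the thread does not recommend any particular parser, and the class and method names below are hypothetical. The resulting plain text would then be sent to a Solr field whose analyzer does not use an HTML-stripping tokenizer.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper; not part of Solr.
public class HtmlPreStripper {

    /** Returns the text content of an HTML string with tags removed. */
    static String stripHtml(String html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Collect each run of text; the extra space keeps words from
                // adjacent elements from being glued together.
                text.append(data).append(' ');
            }
        };
        // The third argument tells the parser to ignore charset declarations
        // in the markup, since the input is already a decoded String.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        // Collapse the runs of whitespace introduced above.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(stripHtml(html));
    }
}
```

The trade-off is that offsets in the stripped text no longer match the original HTML, so highlighting would apply to the plain-text copy only.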





