

20 messages in org.xml.lists.xml-dev: Re: [xml-dev] MarkMail: now archiving...

| From | Sent On | Attachments |
|---|---|---|
| Jason Hunter | Nov 26, 2007 11:55 am | |
| Costello, Roger L. | Nov 26, 2007 1:32 pm | |
| Len Bullard | Nov 26, 2007 5:07 pm | |
| bryan rasmussen | Nov 27, 2007 12:59 am | |
| Elliotte Harold | Nov 27, 2007 4:51 am | |
| Elliotte Rusty Harold | Nov 27, 2007 5:00 am | |
| Len Bullard | Nov 27, 2007 5:56 am | |
| Jason Hunter | Nov 27, 2007 11:05 am | |
| Jason Hunter | Nov 27, 2007 12:46 pm | |
| Elliotte Rusty Harold | Nov 27, 2007 6:52 pm | |
| Edward C. Zimmermann | Nov 27, 2007 11:41 pm | |
| Jason Hunter | Nov 28, 2007 12:48 am | |
| Andrew Welch | Nov 28, 2007 2:21 am | |
| Edward C. Zimmermann | Nov 28, 2007 3:45 am | |
| John Snelson | Nov 28, 2007 4:51 am | |
| Jason Hunter | Nov 28, 2007 11:34 am | |
| Edward C. Zimmermann | Nov 28, 2007 1:12 pm | |
| Jason Hunter | Nov 28, 2007 3:09 pm | |
| Elliotte Rusty Harold | Dec 7, 2007 4:39 am | |
| Jason Hunter | Dec 7, 2007 9:38 am | |

| Subject: | Re: [xml-dev] MarkMail: now archiving xml-dev |
|---|---|
| From: | Jason Hunter (jhun...@acm.org) |
| Date: | Nov 28, 2007 11:34:44 am |
| List: | org.xml.lists.xml-dev |

Edward C. Zimmermann wrote:
Quoting Jason Hunter <jhun...@acm.org>:
Edward C. Zimmermann wrote:
Quoting Elliotte Rusty Harold <elh...@metalab.unc.edu>:
Jason Hunter wrote:
What if they start consuming disk or thrashing the disk IO? When you query against hundreds of gigs of content, you don't have to be malicious to mess things up.
It's not 100s of GB. Mailing lists are not that large.
Apache's messages in raw mbox format weigh in just shy of 60 Gigs.
If you say so, although I'm really quite amused that there could be 60 GB of text in their lists.
If you divide 60 Gigs by 4,000,000 emails that's 15k per email. That's bigger than I would have guessed an average email to be, but you have to take into account the full headers and the influence of the (relatively few) binary attachments.
Converting mbox emails to enriched XML involves an expansion.
When I index mail I don't bother.
Well, we probably have different goals and infrastructure technologies. I want to have access to the hierarchical internal structure of each email body, and to help me accomplish that I have a tool that thinks in XML so it's a natural representation.
Of course with MarkLogic you don't store XML files on disk, any more than Oracle stores CSV files on disk. XML is just the representation data model.
Why parse and tag mail, only to then parse it as XML, when one can parse it directly into the "internal" structures that one is using anyway? Parsing directly also makes a lot of sense given the observation that mail contains overlapping context structures such as lines and sentences, and given that one wants to see the mail as given, noting the use of physical position to convey meaning, as if written by e.e. cummings.
If you only fetched mail by id, then I could parse it on the fly for rendering. But if I'm to use the structure in the query, it needs to exist in the database in its enriched format.
So, in fact, it's 100+ Gigs of XML content.
Do you index it in one big lump or is it segmented?
It operates in many ways like a database. Every new email that arrives is incorporated into the index immediately. The system is able to do that while also keeping performance up, using an index merging model.
-jh-
_______________________________________________________________________
XML-DEV is a publicly archived, unmoderated list hosted by OASIS to support XML implementation and development. To minimize spam in the archives, you must subscribe before posting.
[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
Or unsubscribe: xml-...@lists.xml.org
subscribe: xml-...@lists.xml.org
List archive: http://lists.xml.org/archives/xml-dev/
List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
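
The message above says each new email is incorporated into the index immediately, with an index merging model keeping performance up, but it gives no implementation details. As a rough illustration only, and not a description of how MarkLogic actually works, here is a minimal Python sketch of one common way to do this: every new document becomes its own small index segment, and segments are merged once there are too many. The class name `MergingIndex`, the `max_segments` threshold, and the whitespace tokenizer are all assumptions made for the example.

```python
from collections import defaultdict


class MergingIndex:
    """Toy inverted index (hypothetical, for illustration only).

    Each added document becomes a small segment that is searchable at once.
    When the number of segments exceeds max_segments, all segments are
    merged into one so that searches do not have to visit many segments.
    """

    def __init__(self, max_segments=4):
        self.max_segments = max_segments  # assumed merge threshold
        self.segments = []                # each segment: term -> list of doc ids
        self.next_id = 0

    def add(self, text):
        """Index one document and return its id."""
        doc_id = self.next_id
        self.next_id += 1
        # Naive tokenizer: lowercase and split on whitespace.
        segment = {term: [doc_id] for term in set(text.lower().split())}
        self.segments.append(segment)
        if len(self.segments) > self.max_segments:
            self._merge()
        return doc_id

    def _merge(self):
        """Combine all segments into a single segment."""
        merged = defaultdict(list)
        for segment in self.segments:
            for term, ids in segment.items():
                merged[term].extend(ids)
        self.segments = [{term: sorted(ids) for term, ids in merged.items()}]

    def search(self, term):
        """Return the sorted ids of documents containing term."""
        term = term.lower()
        hits = []
        for segment in self.segments:
            hits.extend(segment.get(term, []))
        return sorted(hits)
```

In this sketch a document is visible to `search` as soon as `add` returns, and the merge step bounds how many segments a query has to consult. A production system would typically merge in the background and on disk rather than synchronously in memory.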







