20 messages in org.xml.lists.xml-dev: Re: [xml-dev] MarkMail: now archiving...
Subject: Re: [xml-dev] MarkMail: now archiving xml-dev
From: Edward C. Zimmermann (ed@bsn.com)
Date: Nov 28, 2007 3:45:57 am
List: org.xml.lists.xml-dev

Quoting Jason Hunter <jhun@acm.org>:

Edward C. Zimmermann wrote:

Quoting Elliotte Rusty Harold <elh@metalab.unc.edu>:

What if they start consuming disk or thrashing the disk IO? When you query against hundreds of gigs of content, you don't have to be malicious to mess things up.

It's not 100s of GB. Mailing lists are not that large.

Apache's messages in raw mbox format weigh in just shy of 60 Gigs.

If you say so, although I'm really quite amused that there could be 60 GB of text in their lists.

Converting mbox emails to enriched XML involves an expansion.

When I index mail I don't bother. Why parse and tag mail only to parse it again as XML, when one can parse it directly into the "internal" structures that one is using anyway? Parsing directly also makes a lot of sense given the observation that mail contains overlapping context structures such as lines and sentences. It matters all the more because one wants to see the mail as given: mail uses physical position to convey meaning, as if it were e.e. cummings.
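As a rough illustration of parsing mail directly while leaving the text untouched, the sketch below records lines and sentences as character-offset spans over the unmodified body, so the two structures can overlap freely. It is only a minimal sketch: the sample message, the function name `index_message`, and the naive sentence pattern are assumptions made for this example, not anything described in the thread.

```python
import re
from email import message_from_string

# Hypothetical sample message, invented for this example.
RAW = """\
From: a@example.org
Subject: layout test
Message-ID: <1@example.org>

First sentence runs
over two lines. Second one.
"""


def index_message(raw):
    """Return the subject plus line and sentence spans over the unmodified body."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # plain string for a non-multipart message

    # Line spans: (start, end) character offsets, newline excluded.
    lines = [(m.start(), m.end()) for m in re.finditer(r"[^\n]+", body)]

    # Naive sentence spans (assumption: a sentence ends at '.', '!' or '?').
    # A sentence may cross a line break, so these spans overlap the line spans.
    sentences = [
        (m.start(), m.end())
        for m in re.finditer(r"[^\s.!?][^.!?]*[.!?]", body)
    ]

    return {
        "subject": msg["Subject"],
        "body": body,
        "lines": lines,
        "sentences": sentences,
    }


if __name__ == "__main__":
    info = index_message(RAW)
    for start, end in info["sentences"]:
        # Slicing the original body keeps the physical layout, line breaks included.
        print(repr(info["body"][start:end]))
```

Because both structures are stored as offsets into the same string, the first sentence can span two lines without forcing either structure to nest inside the other, which a single XML element tree would require.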

There is, of course, the context of one message within the larger context, but that too is more complex. One thread may be part of another thread, and bits split off, going partially or completely off-topic before again becoming part of a topic shared with some other grand-siblings. Part of IR should be to distinguish between declared threads (announced with Message-ID and References headers, or even subject content) and information threads. Even declared threads overlap.
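The declared side of that distinction can be sketched from the headers alone. The example below links each message to the parent it announces through References; the fallback to In-Reply-To, the sample messages, and the function names are assumptions added for illustration. It says nothing about information threads, which would need content analysis.

```python
from collections import defaultdict
from email import message_from_string

# Hypothetical sample messages, invented for this example.
SAMPLE = [
    "Message-ID: <a@example.org>\nSubject: topic\n\nroot\n",
    "Message-ID: <b@example.org>\nReferences: <a@example.org>\n\nreply\n",
    "Message-ID: <c@example.org>\n"
    "References: <a@example.org> <b@example.org>\n\nreply to reply\n",
    "Message-ID: <d@example.org>\nIn-Reply-To: <a@example.org>\n\nside reply\n",
]


def declared_parent(msg):
    """Return the parent Message-ID declared in the headers, or None."""
    refs = (msg.get("References") or "").split()
    if refs:
        # By convention the last References entry is the direct parent.
        return refs[-1]
    # Assumption: fall back to In-Reply-To when References is absent.
    reply = (msg.get("In-Reply-To") or "").split()
    return reply[0] if reply else None


def build_declared_threads(raw_messages):
    """Return (roots, children) where children maps parent ID to reply IDs."""
    roots = []
    children = defaultdict(list)
    for raw in raw_messages:
        msg = message_from_string(raw)
        message_id = msg["Message-ID"]
        parent = declared_parent(msg)
        if parent is None:
            roots.append(message_id)
        else:
            children[parent].append(message_id)
    return roots, dict(children)


if __name__ == "__main__":
    roots, children = build_declared_threads(SAMPLE)
    print(roots)
    print(children)
```

A tree like this captures only what senders announce; a reply that keeps its References chain but drifts off-topic still appears under the original root.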

So, in fact, it's 100+ Gigs of XML content.

Do you index it in one big lump or is it segmented?

-jh-

--
E. Zimmermann, BSn/Munich R&D Unit
Leopoldstrasse 53-55, D-80802 Munich,
Federal Republic of Germany
http://www.nonmonotonic.net

_______________________________________________________________________

XML-DEV is a publicly archived, unmoderated list hosted by OASIS to support XML implementation and development. To minimize spam in the archives, you must subscribe before posting.

[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
Or unsubscribe: xml-@lists.xml.org
subscribe: xml-@lists.xml.org
List archive: http://lists.xml.org/archives/xml-dev/
List Guidelines: http://www.oasis-open.org/maillists/guidelines.php