| Subject: | Re: [MarkLogic Dev General] Processing Large Documents? | |
|---|---|---|
| From: | Geert Josten (geer...@dayon.nl) | |
| Date: | Feb 19, 2012 11:08:36 pm | |
| List: | com.marklogic.developer.general | |
Hi Todd,
There are two main reasons: memory footprint and indexing.
If you don’t have fragmentation enabled in the database configuration, then the entire document is one fragment of 150Gb. Any processing on a fragment means that the entire fragment is loaded into memory. Luckily, FLWOR expressions are highly optimized, since they are also needed to sort search results, and they usually operate almost as if they were streamed.
The indexes are also fragment-based. This means that if you search for a word, the database can tell from the index (loaded in memory), within microseconds, which fragments contain that word. That is why you want your fragments to ‘match’ your records.
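To illustrate why fragment size matters for a fragment-based index, here is a toy sketch in Python. It is not how MarkLogic implements its indexes; the function `build_index` and the fragment ids are made up for the example. It only shows that a word lookup returns fragment ids, so a single huge fragment tells you nothing about which record matched.

```python
from collections import defaultdict

def build_index(fragments):
    """Map each lowercased word to the set of fragment ids containing it.

    `fragments` is a dict of {fragment_id: text}. This is a toy inverted
    index for illustration only, not MarkLogic's actual index structure.
    """
    index = defaultdict(set)
    for frag_id, text in fragments.items():
        for word in text.lower().split():
            index[word].add(frag_id)
    return index

# One fragment per record: a lookup pinpoints the matching records.
per_record = build_index({
    "row-1": "alpha beta",
    "row-2": "beta gamma",
    "row-3": "delta",
})

# The same content stored as one big fragment: a lookup can only say
# "somewhere in the document", and the whole fragment must be read.
single_fragment = build_index({
    "whole-doc": "alpha beta beta gamma delta",
})

print(sorted(per_record["beta"]))       # ids of the records containing "beta"
print(sorted(single_fragment["beta"]))  # only the id of the whole document
```

With per-record fragments the index already narrows the search to two small fragments; with one fragment the index hit is of little use and the remaining work happens on the full document.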
You can use fragmentation for this, which saves you from having to chunk the file at load time, but in practice chunked files seem to outperform a file stored with fragmentation. Besides, the container element doesn’t add value, so why insist on maintaining it? You can always add it again if you want to extract the content and write it out as a single file.
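As a rough sketch of what chunking at load time and re-adding the container could look like, here is a small Python example using only the standard library. The element names (`database`, `row`) and the helpers `split_records` and `rewrap` are assumptions for illustration; they are not part of MarkLogic or of the original file. For a really large file you would want a streaming parser rather than parsing everything at once as done here.

```python
import xml.etree.ElementTree as ET

def split_records(xml_text):
    """Return one serialized XML string per child of the root element.

    Each string could then be loaded as its own document. Parses the
    whole input in memory, so this is only suitable for small samples.
    """
    root = ET.fromstring(xml_text)
    return [ET.tostring(child, encoding="unicode") for child in root]

def rewrap(records, container_tag):
    """Rebuild a single document by putting the records back under a
    container element with the given (hypothetical) tag name."""
    root = ET.Element(container_tag)
    for record in records:
        root.append(ET.fromstring(record))
    return ET.tostring(root, encoding="unicode")

# Hypothetical sample input: a container with two records.
SAMPLE = (
    "<database>"
    "<row id='1'><name>a</name></row>"
    "<row id='2'><name>b</name></row>"
    "</database>"
)

chunks = split_records(SAMPLE)             # one string per <row>
combined = rewrap(chunks, "database")      # single document again
print(len(chunks))
print(combined)
```

The container element carries no information of its own here, which is why dropping it at load time and restoring it on export loses nothing.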
Kind regards,
Geert
*From:* gene...@developer.marklogic.com [mailto: gene...@developer.marklogic.com] *On Behalf Of* Todd Gochenour *Sent:* Monday, February 20, 2012 7:57 *To:* MarkLogic Developer Discussion *Subject:* Re: [MarkLogic Dev General] Processing Large Documents?
This advice repeats a recommendation I saw earlier tonight during some of my research, namely that with MarkLogic it's better to break up documents into smaller fragments. I guess there's a performance gain in bursting a document into small fragments, something to do with concurrency and locking or minimizing the depth of the hierarchy, perhaps?
Note that my document doesn't equate to tables but instead it equates to the entire database, which is two levels away from this recommendation to have documents equate to rows. It seems like the conventional wisdom is to burst large documents into smaller fragments so that each fragment can be handled independently. I've always felt it simpler and more accurate to load and use the XML file as is and not shred it into multiple parts. I want to replace the MySQL database with an XML database for this very reason.
So I've managed to load this large document into the database and I've done my first transformation of this document using XQuery to perform the extraction and performance seems rather impressive. I've done the same thing with both eXistDB and xDB with no problem, indexing everything including the deep hierarchical structure. Once in the database, I should be able to update fragments within the document as easily as if these fragments were burst into individual files. Is there a technical reason (I've yet to discover) for why this would not be the case?