| From | Sent On | Attachments |
|---|---|---|
| Todd Gochenour | Feb 19, 2012 4:59 pm | |
| Damon Feldman | Feb 19, 2012 6:00 pm | |
| Todd Gochenour | Feb 19, 2012 10:56 pm | |
| Geert Josten | Feb 19, 2012 11:08 pm | |
| Geert Josten | Feb 19, 2012 11:12 pm | |
| Todd Gochenour | Feb 19, 2012 11:46 pm | |
| Geert Josten | Feb 20, 2012 2:42 am | |
| Damon Feldman | Feb 20, 2012 7:25 am | |
| Todd Gochenour | Feb 20, 2012 7:53 am | |
| Todd Gochenour | Feb 20, 2012 7:57 am | |
| Michael Blakeley | Feb 20, 2012 9:14 am | |
| Todd Gochenour | Feb 20, 2012 9:22 am | |
| Todd Gochenour | Feb 20, 2012 9:39 am | |
| Tim Meagher | Feb 20, 2012 9:56 am | |
| Michael Blakeley | Feb 20, 2012 9:59 am | |
| Michael Blakeley | Feb 20, 2012 10:10 am | |
| Todd Gochenour | Feb 20, 2012 10:48 am | |
| Todd Gochenour | Feb 20, 2012 12:16 pm | |
| Todd Gochenour | Feb 21, 2012 6:59 am | |
| David Lee | Feb 21, 2012 7:01 am | |
| Todd Gochenour | Feb 21, 2012 7:51 am | |
| David Lee | Feb 21, 2012 8:02 am | |
| mcun...@comcast.net | Feb 21, 2012 8:09 am | |
| Colleen Whitney | Feb 21, 2012 9:16 am | |
| Michael Blakeley | Feb 21, 2012 10:06 am | |
| Todd Gochenour | Feb 21, 2012 10:15 am | |
| Todd Gochenour | Feb 24, 2012 10:09 pm | |
| Geert Josten | Feb 24, 2012 11:57 pm | |
| Todd Gochenour | Feb 25, 2012 9:53 am | |
| Geert Josten | Feb 25, 2012 9:59 am | |
| Todd Gochenour | Feb 25, 2012 10:05 am | |
| Geert Josten | Feb 25, 2012 12:01 pm | |
| Todd Gochenour | Feb 25, 2012 4:04 pm | |
| Geert Josten | Feb 26, 2012 2:16 am | |
| Todd Gochenour | Feb 26, 2012 3:59 pm | |
| Todd Gochenour | Feb 26, 2012 10:09 pm | |
| Subject: | Re: [MarkLogic Dev General] Processing Large Documents? | |
|---|---|---|
| From: | Damon Feldman (Damo...@marklogic.com) | |
| Date: | Feb 19, 2012 6:00:59 pm | |
| List: | com.marklogic.developer.general | |
Todd,
RecordLoader and CoRB are useful tools for bulk loading and processing,
respectively, and are on the MarkLogic developer site.
Typically, XML documents in MarkLogic correspond to rows rather than tables, so
it may be ideal to use RecordLoader's RECORD_NAME configuration property to
break each <row> into its own document. It's a Java utility, so Java will handle
streaming the document.
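For comparison, the same row-per-document split can be sketched inside the database in XQuery. This is not how RecordLoader works (it streams the file in Java before anything reaches the server); it is a rough illustration of the target layout only. The source URI `/dump.xml` and the `/<table>/<position>.xml` URI scheme are assumptions, and for a dump of this size the work would likely have to be batched (for example with CoRB) rather than run as one query.

```xquery
xquery version "1.0-ml";

(: Assumption: the whole dump was loaded as one document at this hypothetical URI. :)
let $dump := fn:doc("/dump.xml")
for $row at $i in $dump/*/*/table_data/row
(: The table name comes from the enclosing <table_data name="..."> element. :)
let $table := fn:string($row/../@name)
(: Hypothetical URI scheme: /<table>/<row position>.xml :)
return xdmp:document-insert(fn:concat("/", $table, "/", $i, ".xml"), $row)
```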
To transform poorly structured data such as an SQL export into clean XML, you
can use RecordLoader's XccModuleContentFactory to invoke your transform code
while loading, or use CoRB to run a long batch after the load is complete.
Both utilities are multi-threaded and fast.
Yours, Damon
________________________________
From: gene...@developer.marklogic.com
[gene...@developer.marklogic.com] On Behalf Of Todd Gochenour
[todd...@gmail.com]
Sent: Sunday, February 19, 2012 7:59 PM
To: MarkLogic Developer Discussion
Subject: [MarkLogic Dev General] Processing Large Documents?
I have a 154Gig file representing a data dump from MySQL that I want to load
into MarkLogic and analyze.
When I use the flow editor to collect/load this file into an empty database, it
takes 33 seconds.
When I add two delete-element transforms to the flow, the load fails with a
timeout error after several minutes. The first removes <table_structure/>, as
this schema information isn't necessary for my analysis. The second removes
elements with empty contents using the *[not(text())] XPath expression.
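The two deletions described above can also be written as a single recursive copy in plain XQuery. This is a minimal sketch, not the Flow Editor's transform: `local:prune` is a hypothetical helper, the sample input is invented, and it tests an element's whole string value rather than its direct text() children, so container elements such as <row> survive as long as something inside them has content.

```xquery
xquery version "1.0";

(: Hypothetical helper: copy a tree, dropping <table_structure> elements
   and any element whose string value is empty or whitespace-only. :)
declare function local:prune($n as node()) as node()? {
  typeswitch ($n)
    case element(table_structure) return ()
    case element() return
      if (normalize-space($n) = "") then ()
      else
        element { node-name($n) } {
          $n/@*,
          for $child in $n/node() return local:prune($child)
        }
    default return $n
};

(: Invented sample input in the shape of the dump. :)
local:prune(
  <database name="demo">
    <table_structure name="cli"><field Field="id"/></table_structure>
    <table_data name="cli">
      <row><field name="id">1</field><field name="note"/></row>
    </table_data>
  </database>
)
```

On this sample the result should keep <database>, <table_data>, <row> and the id field, and drop <table_structure> and the empty note field.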
I gather from this that the transform phase does not operate on XML files in a
streaming mode. Does there exist a custom transform that can work on a stream
of data, say by using Saxon's streaming functionality or a StAX transformation?
I would expect an ETL tool to be able to handle large files.
After loading this huge file into MarkLogic without the transform, I wrote the
following XQuery. Run in the Query Console, it deleted these elements and
renamed the remaining ones in 15 seconds, reducing the 154Gig file to 6Gigs.
This process handles the ETL functionality with great performance.
The original record reads:
<table_data name="cli"> <row> <field name="id">1</field> <field name="org_id">1</field> </row> ....
will be transformed into:
<cli> <id>1</id> <org_id>1</org_id> </cli> ...
with this XQuery:
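    (: Root element is named after the database, each row after its table. :)
    let $doc :=
      element { /*/*/@name } {
        for $row in /*/*/table_data/row
        return
          element { $row/../@name } {
            (: Fields without text are dropped. :)
            for $field in $row/field[text()]
            return element { $field/@name } { $field/text() }
          }
      }
    return xdmp:document-insert(concat(/*/*/@name, ".xml"), $doc)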
My next step in this process is to write a transform that de-normalizes the SQL
tables into a nested element structure, thereby removing all the primary/foreign
keys, which have no semantic purpose other than to identify relationships. I'd
like to be able to automate this transformation using the Information Center
Flow Editor rather than doing it manually in the Query Console.
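A de-normalizing transform of this kind might look like the sketch below, written in plain XQuery against inline sample data. Only cli, id and org_id come from the message; the org table, the name columns and the sample values are assumptions made up for illustration. Each cli row is nested under the org whose id matches its org_id, and both key columns are dropped.

```xquery
xquery version "1.0";

(: Hypothetical sample in the post-transform shape. The org table and the
   name columns are assumptions; only cli, id and org_id appear in the message. :)
let $db :=
  <db>
    <org><id>1</id><name>Acme</name></org>
    <org><id>2</id><name>Globex</name></org>
    <cli><id>1</id><org_id>1</org_id><name>Ann</name></cli>
    <cli><id>2</id><org_id>1</org_id><name>Bob</name></cli>
    <cli><id>3</id><org_id>2</org_id><name>Cy</name></cli>
  </db>
return
  element { node-name($db) } {
    for $org in $db/org
    return
      <org>{
        (: Keep the parent's columns except its primary key. :)
        $org/*[not(self::id)],
        (: Nest each child whose foreign key matches the parent's primary key,
           dropping the child's own key columns. :)
        for $cli in $db/cli[org_id = $org/id]
        return <cli>{ $cli/*[not(self::id or self::org_id)] }</cli>
      }</org>
  }
```

With real data the join would be driven by the actual foreign-key columns of each table, which the dump's <table_structure> elements describe.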