Subject: Re: [MarkLogic Dev General] Processing Large Documents?
From: Michael Blakeley (mi@blakeley.com)
Date: Feb 20, 2012 9:59:48 am
List: com.marklogic.developer.general

You can raise the time limit:

http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/xml/admin/http.xml&query=request+timeout

Default Time Limit specifies the default value for any request's time limit,
when otherwise unspecified. A request can change its time limit using
xdmp:set-request-time-limit. The time limit, in turn, is the maximum number of
seconds allowed for servicing a query request. The App Server gives up on
queries which take longer, and returns an error.
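
As a minimal sketch of the per-request override described above, the call below raises the limit for the current request only. The value 3600 is an arbitrary illustration, not a recommendation. As far as I know, a request cannot exceed the App Server's configured maximum time limit, so that setting may need raising as well.

```xquery
xquery version "1.0-ml";

(: Illustrative value only: allow this request to run for up to one hour.
   Place the call at the top of the long-running query. :)
xdmp:set-request-time-limit(3600),
"time limit raised for this request"
```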

Turning to your query, I see some repeated work that could probably be factored
out.

    for $row in /*/*/table_data/row
    let $record := element {$row/../@name} {

Let's remove that duplicate name lookup: the result will be constant for every
$row in a given table_data element, and I presume there are many of those.

    for $table at $index in /*/*/table_data
    let $table-name := $table/@name/string()
    for $row in $table/row
    let $record := element { $table-name } {
      $row/field[text()]/element { @name } { text() }
    }
    ...

This part is especially troubling and probably adds a lot of duplicated work:
aren't you going back to the entire database again?

    xdmp:document-insert(
      concat(/*/*/@name, '/', name($record), '/', name($record), '_',
             local:generate-uuid-v4(), '.xml'),
      $record)

The semicolon at the end is superfluous. I think this might do what you want:

    ...
    let $uri := concat(
      replace(xdmp:path($row), '(\[[0-9]+\])', ''),
      '/', $index)
    return xdmp:document-insert($uri, $record)

That removes the uuid functions too. But if you do want a uuid implementation
that should be slightly faster, take a look at
http://markmail.org/message/mql6teskkwb574na
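
To make the URI construction above concrete, here is a small sketch that runs without any data loaded. The path string is hypothetical: it only assumes that xdmp:path returns something of this shape for a row, and the element names mysqldump and database are guesses at the export format. It may be worth checking the result against the real data, because $index is bound in the table loop, so rows of the same table would appear to share one URI. The second expression is a hypothetical variant that also appends a row position.

```xquery
xquery version "1.0-ml";

(: Hypothetical path, shaped like what xdmp:path might return for one row. :)
let $path := '/mysqldump/database/table_data[3]/row[17]'
(: Strip every positional predicate such as [3] or [17]. :)
let $stripped := replace($path, '(\[[0-9]+\])', '')
return (
  (: URI as built in the snippet above, with 3 standing in for $index :)
  concat($stripped, '/', 3),
  (: hypothetical variant: table position plus row position :)
  concat($stripped, '/', 3, '/', 17)
)
```

The first string comes out as /mysqldump/database/table_data/row/3 and the second as /mysqldump/database/table_data/row/3/17.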

Given that all the work is in-memory except the document-insert, you might
actually be able to do this faster by not ingesting the table data first. It's
all one large document, right? You can read that from the filesystem. I don't
know if that will make the transform query faster or slower, but it will avoid
the need to insert all that table_data first.

    for $table at $index in xdmp:document-get('/tmp/export.xml')/*/*/table_data
    let $table-name := $table/@name/string()
    for $row in $table/row
    let $record := element { $table-name } {
      $row/field[text()]/element { @name } { text() }
    }
    let $uri := concat(
      replace(xdmp:path($row), '(\[[0-9]+\])', ''),
      '/', $index)
    return xdmp:document-insert($uri, $record)

Finally, (as Tim just proposed in his reply) I would probably move the actual
insert into a spawned task. This requires a little more setup, but allows you to
run the XML processing in timestamped, lock-free mode. Each doc-insert would
then run asynchronously on the task server threads. You might have to increase
the task server queue size, which defaults to 100,000 I think. Otherwise you are
likely to see a MAXTASKS error. You might also want to increase the number of
task server threads. For this workload I would try one thread per CPU,
initially.

http://docs.marklogic.com/5.0doc/docapp.xqy#search.xqy?query=xdmp:spawn

    (: task.xqy, a module on the filesystem in the app-server module root :)
    xquery version "1.0-ml";
    declare variable $URI external;
    declare variable $NEW external;
    xdmp:document-insert($URI, $NEW)

    (: query console :)
    for $table at $index in xdmp:document-get('/tmp/export.xml')/*/*/table_data
    let $table-name := $table/@name/string()
    for $row in $table/row
    let $record := element { $table-name } {
      $row/field[text()]/element { @name } { text() }
    }
    let $uri := concat(
      replace(xdmp:path($row), '(\[[0-9]+\])', ''),
      '/', $index)
    return xdmp:spawn('task.xqy',
      (xs:QName('URI'), $uri, xs:QName('NEW'), $record))
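
For the task server settings mentioned above, the following is a rough sketch using the Admin API rather than the Admin UI. It assumes that admin:taskserver-set-queue-size and admin:taskserver-set-threads are available in your release, that the group is named "Default", and that the query runs as a user with admin rights. The numbers are placeholders: pick a queue size larger than the number of rows, and a thread count that matches your CPU count.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumption: the task server belongs to the group named "Default". :)
let $config := admin:get-configuration()
let $group  := admin:group-get-id($config, "Default")
(: Placeholder values, to be adjusted for the workload and hardware. :)
let $config := admin:taskserver-set-queue-size($config, $group, 200000)
let $config := admin:taskserver-set-threads($config, $group, 8)
return admin:save-configuration($config)
```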

-- Mike

On 20 Feb 2012, at 09:23 , Todd Gochenour wrote:

The XQuery I have for performing the chunking is timing out after 9 minutes
(running in the query console). There are 156000 'rows' total in this extract.
I'm now reading the Developer's guide for Understanding Transactions to figure
out how I might optimize this query. My query reads:

    declare function local:random-hex($length as xs:integer) as xs:string {
      string-join(
        for $n in 1 to $length
        return xdmp:integer-to-hex(xdmp:random(15)),
        "")
    };

    declare function local:generate-uuid-v4() as xs:string {
      string-join(
        (local:random-hex(8), local:random-hex(4), local:random-hex(4),
         local:random-hex(4), local:random-hex(12)),
        "-")
    };

    for $row in /*/*/table_data/row
    let $record := element {$row/../@name} {
      for $field in $row/field[text()]
      return element {$field/@name} {$field/text()}
    }
    return
      xdmp:document-insert(
        concat(/*/*/@name, '/', name($record), '/', name($record), '_',
               local:generate-uuid-v4(), '.xml'),
        $record);