Subject: Re: [MarkLogic Dev General] Optimiziing for several writes
From: Michael Blakeley (mi@blakeley.com)
Date: Feb 6, 2012 6:10:04 pm
List: com.marklogic.developer.general

More I/O, of course. At some point it will become difficult to get the forests
to coordinate transactions quickly enough. The limit will depend on the network,
CPU, and disk speeds. Using more than one ingestion host (i.e., an HTTP or XCC
client) can help to push that limit out, too.

-- Mike

On 6 Feb 2012, at 17:36 , seme@hotmail.com wrote:

Golden. Thanks, Mike.

What about thousands of writes per second? Any differences?

On Feb 6, 2012, at 4:56 PM, "Michael Blakeley" <mi@blakeley.com> wrote:

That doesn't sound too challenging. The points you've already raised are good,
but you will still need whatever indexes your queries require. You might try to
avoid using property fragments, if possible (disable maintain-last-modified, for
example).
Depending on your queries, you may be able to disable some or all of the default
full-text indexing, and rely on a combination of the built-in XPath indexes and
application-specific range indexes.

Think hard about your document URIs. You will want the URIs to be such that lock
contention simply won't happen. For example, you could use xdmp:random to
generate URIs, or some combination of ids and timestamps that will guarantee
uniqueness. Let's say you receive an update for each ticker symbol once per
second, for example. You might structure your URIs as SYMBOL/TIMESTAMP, or as
TIMESTAMP/SYMBOL. Put some thought into which of those might be more useful at
query time.
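
A minimal sketch of the two URI layouts, written in Python purely for
illustration. The /ticks/ prefix, the .xml suffix, the microsecond UTC timestamp
format, and the function names are all assumptions, not anything MarkLogic
requires. Uniqueness holds only if no symbol reports twice within the same
microsecond.

```python
from datetime import datetime, timezone


def _stamp(ts: datetime) -> str:
    # Fixed-width UTC timestamp with microseconds, so the strings sort
    # chronologically. Expects a timezone-aware datetime.
    return ts.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%S.%fZ")


def symbol_first_uri(symbol: str, ts: datetime) -> str:
    # SYMBOL/TIMESTAMP layout (hypothetical prefix and suffix).
    return f"/ticks/{symbol}/{_stamp(ts)}.xml"


def timestamp_first_uri(symbol: str, ts: datetime) -> str:
    # TIMESTAMP/SYMBOL layout (hypothetical prefix and suffix).
    return f"/ticks/{_stamp(ts)}/{symbol}.xml"
```

With SYMBOL/TIMESTAMP, all ticks for one symbol share a URI prefix; with
TIMESTAMP/SYMBOL, all ticks for one instant do.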

You may want to reduce the size of your in-memory stands. This may sound
backward. Folks often try to optimize ingestion by using really large in-memory
stands, but with small documents this can be counter-productive. With
high-frequency updates and small documents, you may be better off limiting each
in-memory stand to less than 32k fragments, and reducing the in-memory limits
accordingly so that you can use that memory elsewhere.
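
To get a feel for what a fragment limit of this size means at a given insert
rate, here is a rough back-of-the-envelope sketch in Python. It assumes "32k"
means 32,768, that each document contributes one fragment (two if a properties
fragment is also written), and that the fragment count is the only thing that
fills the stand. Other in-memory limits may fill first, so treat the result as
an upper bound on the time between saves. The function name is hypothetical.

```python
def seconds_to_fill_stand(inserts_per_sec: float,
                          fragment_limit: int = 32_768,
                          fragments_per_doc: int = 1) -> float:
    # Time until an in-memory stand reaches its fragment limit, assuming a
    # steady insert rate and a fixed number of fragments per document.
    return fragment_limit / (inserts_per_sec * fragments_per_doc)
```

For example, seconds_to_fill_stand(500) is about 65.5, and passing
fragments_per_doc=2 halves that.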

After that it will mostly be a question of keeping up with the demands on CPU,
memory, and disk. Given modern Xeon CPUs and memory sizes, the disk is probably
the hardest part. You want fast sequential writes for journaling and for saving
in-memory stands as they fill up. You'll also need fairly good read performance
for merges. As a rule of thumb, try to have 10 MB/sec of read-write capacity per
1 MB/sec of incoming XML.
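
The rule of thumb can be turned into a quick estimate. The sketch below is in
Python, takes 1 MB as 1,000,000 bytes, and uses a hypothetical function name;
the factor of 10 is the only number taken from the rule itself, and the document
rate and size are inputs you would have to measure.

```python
def required_disk_mb_per_sec(docs_per_sec: float,
                             avg_doc_bytes: float,
                             io_factor: float = 10.0) -> float:
    # Incoming XML rate in MB/sec (1 MB = 1,000,000 bytes, an assumption).
    incoming_mb_per_sec = docs_per_sec * avg_doc_bytes / 1_000_000
    # Apply the rule of thumb: io_factor MB/sec of disk per 1 MB/sec incoming.
    return incoming_mb_per_sec * io_factor
```

For example, 500 documents per second at 2,000 bytes each is 1 MB/sec of
incoming XML, so required_disk_mb_per_sec(500, 2000) gives 10.0.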

You might also benefit from a little SSD storage configured as a fast data
directory for your forests (requires MarkLogic 5). But I think you can hit your
targets with spinning disks, as long as you configure them properly.

You'll probably want to have 1-2 forests per filesystem, spread out across
multiple block devices, rather than putting everything on one giant filesystem.
Consider avoiding RAID entirely, and using forest replication instead. If you do
use RAID, use RAID-1 or RAID-10. Avoid RAID-5 and RAID-6, because their write
performance is likely to be a problem.

On 6 Feb 2012, at 14:11 , seme@hotmail.com wrote:

Not sure, but let's say hundreds a second.

From: mi@blakeley.com
Date: Mon, 6 Feb 2012 14:10:42 -0800
To: gene@developer.marklogic.com
Subject: Re: [MarkLogic Dev General] Optimiziing for several writes

How many inserts/sec do you think the database will need to sustain?

On 6 Feb 2012, at 13:57 , seme@hotmail.com wrote:

So I've normally dealt with optimizing MarkLogic for few writes but many reads.
In a situation where there are several writes and fewer reads (as with reports
on stock ticks for example), are there any pointers or tips for speeding up
writes? I can imagine that reducing the number of indexes helps, as does always
writing new files rather than updating existing ones, and keeping the files small.
Anything else? I may need some indexes for reporting purposes. And I realize
that it may be better to let another system write the data while MarkLogic
ingests soon thereafter, but I am interested in truly realtime data views, not
next-day, or next-hour views into the data.

thanks, Ryan