Subject: Re: Scaling Marmotta's LDP interface
From: Sergio Fernández (wik@apache.org)
Date: Sep 22, 2015 12:22:49 am
List: org.apache.marmotta.users

Hi Mark and Tom,

Take into account that PostgreSQL's default configuration is very conservative. Have you customized it?

In the official wiki you can find some useful information:

https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

We typically modify the following settings in deployment environments, but some may need to be adapted depending on the concrete scenario and the available resources:

* max_connections: the default is 100; at least 1000 is recommended, but this goes together with the available memory (see next)
* shared_buffers: the default is just 32MB; the more you give it, the less Postgres needs to touch disk for any operation (we have installations with 16GB behaving really well)
* work_mem: the default is 4MB; you can increase it dramatically (128 or 256 MB) to improve the handling of each transaction
* maintenance_work_mem: less critical, but useful for the maintenance tasks you should run periodically (see below)
* checkpoint_segments: the default is just 3, but a much larger value improves transaction handling
* think about your vacuum strategy
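Since max_connections and work_mem both draw on the same available memory, a quick back-of-the-envelope check can help before changing them. The following is only a rough sketch in Python, not a tuning rule: the function name and the `ops_per_query` parameter are hypothetical, and the estimate assumes every connection uses its full work_mem at the same time, which is a pessimistic upper bound.

```python
def worst_case_memory_mb(shared_buffers_mb, work_mem_mb, max_connections,
                         ops_per_query=1):
    """Pessimistic upper bound on PostgreSQL memory use, in MB.

    Assumes every allowed connection is active and each one uses its full
    work_mem. ops_per_query is a hypothetical knob: work_mem can be claimed
    once per sort or hash operation, so one query may use several multiples.
    """
    return shared_buffers_mb + work_mem_mb * max_connections * ops_per_query


if __name__ == "__main__":
    # Values taken from the list above: 16GB shared_buffers, 128MB work_mem,
    # 1000 connections.
    estimate = worst_case_memory_mb(
        shared_buffers_mb=16 * 1024, work_mem_mb=128, max_connections=1000)
    print(f"worst case: {estimate} MB (~{estimate / 1024:.0f} GB)")
```

If the estimate is far above the RAM of the machine, either work_mem or max_connections probably needs to come down, or a connection pooler could be put in front of the database.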

I'm not an expert on performance tuning, but I'm pretty sure that with some time your sysadmin would manage to find the right settings for your installation.

In addition to the general Postgres stuff, for Marmotta there are a few more things that are critical:

* If you have the resources, create all the indexes (cspo, cop, cp, literals, etc.) you may need to improve performance

* If you do not use versioning, periodically (nightly) clean up deleted triples: DELETE FROM triples WHERE deleted = true;
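The nightly cleanup could be scripted and run from cron. This is a minimal sketch, assuming a Python DB-API connection (for example one opened with psycopg2 against the Marmotta database, which is an assumption and not shown here); the function name is hypothetical. The demo uses an in-memory SQLite table with only a `deleted` column as a stand-in so the snippet runs without a server; the real `triples` table has more columns.

```python
import sqlite3

# The statement from the bullet above.
PURGE_SQL = "DELETE FROM triples WHERE deleted = true"


def purge_deleted_triples(conn):
    """Delete rows flagged as deleted and return how many were removed.

    conn is any DB-API 2.0 connection; only use this when versioning is
    not enabled, since it physically removes the soft-deleted triples.
    """
    cur = conn.cursor()
    cur.execute(PURGE_SQL)
    removed = cur.rowcount
    conn.commit()
    return removed


def demo():
    """Run the purge against a stand-in SQLite table (not the real schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples (id INTEGER PRIMARY KEY, deleted BOOLEAN NOT NULL)")
    conn.executemany(
        "INSERT INTO triples (deleted) VALUES (?)", [(True,), (False,), (True,)])
    conn.commit()
    removed = purge_deleted_triples(conn)
    remaining = conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
    conn.close()
    return removed, remaining


if __name__ == "__main__":
    print(demo())
```

On a large table it is probably worth following the purge with a VACUUM, in line with the vacuum strategy mentioned earlier.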

Hope that helps. Try to implement some of those suggestions in your system and tell us how they behave and where you still need more help, and maybe code patching.

Cheers,

On Fri, Sep 18, 2015 at 9:37 PM, Mark Breedlove <mb@dp.la> wrote:

Hello, Marmotta Users,

At the Digital Public Library of America, we have a large Marmotta triplestore, with which we interact entirely over LDP.

We're looking for some advice about scaling Marmotta's LDP interface past our current size. In the short term, we are hoping that we can find ways to tune PostgreSQL to mitigate some problems we have seen; in the long term, we are open to advice about alternate backends.

A high-level overview of how we interact with our LDP Resources is documented in [1]. While we have had to do some LDP-specific tuning (especially introducing a partial index on `triples.context`) for all processes, we have seen particular trouble in cases where we GET, transform, then PUT an LDP RDFSource (see: *Enrichment* in the overview link).

That overview is part of a greater wiki that we've put together to document our installation and performance-tuning activities [2].

Our biggest problem at the moment is addressing slow updates and inserts [3], observed when we GET and PUT those RDFSources with two concurrent mapping or enrichment activities. If we run one of these activities, GETing, transforming, and PUTing in serial, performance seems to be network and CPU bound, and is not very bad. But as soon as we run a second mapping or enrichment, work performed grinds practically to a halt, as described in [3].

To give you a sense of the scale at which we're operating, we have about two million LDP-RSs, typically including about 50 triples and a handful of blank nodes (around 5 to 15). Our `triples` table has about 294M rows now and takes up 32GB for the table, and 13GB each for its two largest indices. Our entire Marmotta database takes up about 140GB. We've had some successes with improving index performance with low cardinality in `triples.context` [4] and tuning the Amazon EC2 instances that we run on [5][6]. The I/O wait problem with concurrent LDP operations, however, is the new blocker.

Some supplemental information:

* An overview of the project for which Marmotta is being used: https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Heidrun

* The application (a Rails engine) that makes all of these LDP requests: https://github.com/dpla/KriKri

* Our configuration-management project, with details on how some of our stack is configured: https://github.com/dpla/automation

We'd be grateful for any feedback that you might have that would assist us with handling large volumes of data over LDP. Thanks for your help!

[1] https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/LDP+Interactions+Overview
[2] https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Marmotta
[3] https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Addressing+slow+updates+and+inserts
[4] https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Index+performance+with+high+context+counts
[5] https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Amazon+EC2+adjustments
[6] https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Using+irqbalance+and+SMP+IRQ+affinity