|Mark Breedlove||Sep 18, 2015 12:37 pm|
|Raffaele Palmieri||Sep 19, 2015 10:27 am|
|Mark A. Matienzo||Sep 19, 2015 12:02 pm|
|Sergio Fernández||Sep 22, 2015 12:22 am|
|Mark Breedlove||Sep 23, 2015 3:53 pm|
|Sergio Fernández||Sep 24, 2015 8:47 am|
|Mark Breedlove||Sep 25, 2015 12:30 pm|
|Mark Breedlove||Oct 2, 2015 12:46 pm|
|Sergio Fernández||Oct 8, 2015 4:04 am|
|Subject:||Re: Scaling Marmotta's LDP interface|
|From:||Sergio Fernández (wik...@apache.org)|
|Date:||Oct 8, 2015 4:04:21 am|
sorry, I've been with limited connectivity for some days, but let's catch up this.
On Fri, Oct 2, 2015 at 9:46 PM, Mark Breedlove <mb...@dp.la> wrote:
I just wanted to get back to you with the status of our experimentation. I still need to gather some more information by running some more of our activities, but the activities that were bogged down last week are running now without the I/O wait that we were encountering. I'll report back when we've run the full range of our processes. (Including processes that do more inserts updates.)
That are great news! Performance tuning is very tricky.
BTW, since you are already using AWS,have you considered to give a try to Amazon RDS for PostgreSQL? I have to admit I've never had the chance to test Marmotta on it, but it'd be really interesting to compare the performance in scenarios like yours.
Well, internally each LDP-R uses its own context. Therefore all indexes
that could improve the lookup would help. Try to create also at least cop. [...]
OK, that's on my to-do list and I'll let you know ... :-)
I've created an index, `idx_triples_cop` on triples (context, object, predicate) where deleted = false.
I made note of a query last week that is executed frequently, which now uses this index:
SELECT id,subject,predicate,object,context,deleted,inferred,creator,createdAt,deletedAt FROM triples WHERE deleted = false AND context = $1 AND subject = $2 AND predicate = $3 AND object = $4
If I `explain analyze` that query, PG says "Index Scan using idx_triples_cop on triples".
Exactly, that query is executed very frequently in KiWi, so the index helps.
One query that's showing up frequently in our slow query log is now:
SELECT id,subject,predicate,object,context,deleted,inferred,creator,createdAt,deletedAt FROM triples WHERE deleted = false AND subject = $1 AND predicate = $2 AND context = $3
This is supposed to use a full index scan on idx_triples_spo, according to `explain analyze`. However, I think that the index is just so huge that a full index scan is what's still causing it to take about 6 seconds in a lot of cases.
Well, specifically the spc index, that could also be quite big.
We're apparently getting a 98% buffer cache hit rate on idx_triples_spo,
according to a query on pg_statio_user_indexes. The idx_triples_spo index is 8GB in size.
Another thing that I did was to increase work_mem to 100GB (again, before adding the memory), which didn't make any difference at the time, but which sounds like a good idea in general (and may be have been allowed to show its benefits after the filesystem cache was increased).
Well, keep observing how that working memory behaves, and maybe you can decrease it a bit and safe some memory for other components of your setup.
Since we only have about one million ore:Aggregations in our system now,
out of 11 million, I'm going to have to see how big the triples table has to get before we run into more I/O wait issues as a result of the filesystem cache / shared buffers running out. But so far so good, insofar as the I/O wait has dropped. I will follow up later with you after we try some of our harvest and mapping activities .
Based on this experience maybe we can add some of that to the project documentation, so it will help other users in the future.
Thanks for sharing all this with us!