Subject:GC stalls cause Zookeeper timeout during uninvert for facet field
From:Arend-Jan Wijtzes (ajwy@wise-guys.nl)
Date:Nov 6, 2012 3:06:31 am
List:org.apache.lucene.solr-user

Hi,

We are running a small Solr cluster with 8 cores on 4 machines. This database has about 1E9 very small documents. One of the statistics we need requires a facet on a text field with high cardinality.

During the uninvert phase of this text field the searchers experience long stalls because of garbage collection (20+ second pauses), which causes Solr to lose the Zookeeper lease. Often they do not recover gracefully and as a result the cluster becomes degraded:

"SEVERE: There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props"

This is a known open issue.
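(For reference: as far as I understand, the relevant session timeout is the zkClientTimeout on the <cores> element in solr.xml. The snippet below just mirrors the example config shipped with Solr 4.x, so treat the values as illustrative, not what we actually run. Even a generous timeout here would still be shorter than our 20+ second pauses.)

```xml
<!-- solr.xml (Solr 4.x style) - illustrative only; values are the
     example defaults, not necessarily our production settings -->
<solr persistent="true">
  <cores adminPath="/admin/cores"
         zkClientTimeout="${zkClientTimeout:15000}">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>
```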

I explored several options to try and work around this. However I'm new to Solr and need some help.

We tried running more cores: We went from 4 to 8 cores. Does it make sense to go to 16 cores on 4 machines?

GC tuning: This helped a lot but not enough to prevent the lease expirations. I'm by no means a Java GC expert and would appreciate any tips to improve this further. Current settings are:

Java HotSpot(TM) 64-Bit Server VM (20.0-b11)
-Xloggc:/home/solr/solr/log/gc.log
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintTenuringDistribution
-XX:+PrintClassHistogram
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=75
-XX:MaxGCPauseMillis=10000
-XX:+CMSIncrementalMode
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-Djava.awt.headless=true
-Xss256k
-Xmx18g
-Xms1g
-DzkHost=ds30:2181,ds31:2181,ds32:2181
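One direction I am considering myself, based on reading that incremental CMS is aimed at machines with only one or two cores (we have 24) and that letting the heap grow from -Xms1g to -Xmx18g causes extra collections along the way, is something like the following. This is an untested guess on my part; corrections welcome:

```sh
# Candidate changes (untested guesses):
# - drop -XX:+CMSIncrementalMode (i-CMS targets 1-2 core machines)
# - pin the heap so it never has to grow or shrink
-Xms18g -Xmx18g
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75
```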

Actual memory stats according to top are: 74GB virtual, 11GB resident. The GC log shows:

- age 1: 39078968 bytes, 39078968 total
: 342633K->38290K(345024K), 24.7992520 secs] 9277535K->9058682K(11687832K) icms_dc=73 , 24.7993810 secs] [Times: user=366.87 sys=26.31, real=24.79 secs]
Total time for which application threads were stopped: 24.8005790 seconds
975.478: [GC 975.478: [ParNew
Desired survivor size 19628032 bytes, new threshold 1 (max 4)
- age 1: 38277672 bytes, 38277672 total
: 343750K->37537K(345024K), 22.4217640 secs] 9364142K->9131962K(11687832K) icms_dc=73 , 22.4218650 secs] [Times: user=331.25 sys=23.85, real=22.42 secs]
Total time for which application threads were stopped: 22.4231750 seconds
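In case it is useful, here is the quick throwaway script I use to pull the stop-the-world durations out of the -XX:+PrintGCApplicationStoppedTime lines and flag the long ones (just a sketch, nothing official):

```python
import re

# Matches lines produced by -XX:+PrintGCApplicationStoppedTime, e.g.
# "Total time for which application threads were stopped: 24.8005790 seconds"
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped: "
    r"([0-9]+\.[0-9]+) seconds"
)

def long_pauses(lines, threshold=15.0):
    """Return stop-the-world pause durations (seconds) at or above threshold."""
    pauses = []
    for line in lines:
        m = STOPPED_RE.search(line)
        if m:
            secs = float(m.group(1))
            if secs >= threshold:
                pauses.append(secs)
    return pauses

if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as f:          # e.g. /home/solr/solr/log/gc.log
        for p in long_pauses(f):
            print("long pause: %.2f s" % p)
```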

etc.

Solr version: 4.0.0.2012.10.06.03.04.33

Current hardware consists of 4 machines, each with:
- 2x E5645 CPU (24 cores total)
- 48GB mem
- 8x SATA 7200RPM in RAID 10

What would be a good strategy to get this database to perform the way we need it to? Would it make sense to split it up into 16 shards? Are there ways to improve the GC behavior?

Any help would be greatly appreciated.

AJ