We have a three-node ML cluster and found that the UI environment had
become unresponsive. On analysis, we found that one node had around 100
pending threads and CPU utilization of 98.9%. The other two nodes had
logged the following errors:
2017-09-01 00:27:02.573 Error: PerfMeterTask::run: SVC-EXTIME: Time limit
2017-09-01 01:01:01.783 Error: UsageMeterTask::run: SVC-EXTIME: Time limit
We restarted the ML servers around 10 AM IST, and the UI started working
again. However, node 1 has no error log entries from 12:00 AM until the
restart at 10 AM, whereas the other two nodes do have logs for that
period. Please suggest what could have gone wrong.
Also, if node 1 had been down, what entries would appear in the error
log in that scenario?