Services hung, but no errors to track
I manage multiple servers hosting Concrete5 in a simple load-balanced configuration. Over the past few weeks, we've seen one of the servers simply stop responding: the service is up, but there are no interesting log messages or other hints of a root cause. It's not always the same server having the issue, in case that was a question in your mind.
The only indicator I've found that quickly tells me which server is in trouble is to run the "lsof" tool against the httpd processes, search its output for "Can't identify protocol", and count the occurrences of that message:
ps -ef | grep httpd | awk '{print "lsof -p "$2}' | sh | grep "identify protocol" | wc -l
The hung server always has 1027 of these entries, so I can only assume I've reached the maximum number of open (or half-open?) sockets for the port. Servers which are not hung have had fewer than 100 of these entries. While I am sure we could increase certain resource limits to reduce the frequency of this event, that wouldn't really solve the underlying issue.
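To check whether a limit really is involved, the count can be broken down per process and set against each process's open-file limit. Below is a rough sketch, not a tested diagnostic. It assumes a Linux /proc filesystem, root access, and that the Apache processes are named httpd as in the command above. The helper names count_unidentified and fd_usage are made up for illustration.

```shell
#!/bin/sh
# Count lsof lines on stdin that report an unidentifiable socket.
# "|| true" keeps the exit status at 0 when grep finds no matches.
count_unidentified() {
    grep -c "identify protocol" || true
}

# Print "pid open_fds soft_limit" for one process, read from /proc (Linux only).
fd_usage() {
    pid=$1
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    # The 4th field of the "Max open files" row is the soft limit.
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits" 2>/dev/null)
    echo "$pid $open ${limit:-unknown}"
}

# Report every httpd process separately instead of one combined total.
for pid in $(pgrep httpd); do
    unidentified=$(lsof -p "$pid" 2>/dev/null | count_unidentified)
    echo "$(fd_usage "$pid") unidentified=$unidentified"
done
```

If a single process shows an open descriptor count at or near its soft limit, that would point to a per-process file descriptor limit rather than anything tied to the port. If the entries are spread evenly across many processes, this guess is probably wrong.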
I'm not sure whether this is a bug in our own code, Concrete5, PHP, Java, or CentOS 6.5. Not having a root cause, or a definitive error from an application or the OS, is quite annoying.
Has anyone else experienced this issue lately?
Were there other key symptoms to look for?
Were you able to resolve the problem?
Any troubleshooting/debugging methods to step through would be greatly appreciated!
Cheers!