Following on from my previous blog post where I mentioned that we’ve discovered a bug in the Hyperic 5.8.4 client (on both Windows and Linux), I think it’s only fair that I share our findings. It’s a bug that we discovered whilst deploying a very large vRealize Suite (two maximum sized global clusters of vROPS, vRLI, Hyperic and vRA/vRO).
Whilst carrying out some testing in my lab surrounding the impact of replacing SSL certificates in Hyperic, I noticed that if for whatever reason authentication between the Hyperic agent and Hyperic server fails, the Hyperic agent increases CPU utilisation of the client machine it’s running on to between 85% and 100%. At first I thought that it was an anomaly, but I was then able to reproduce the symptoms a further 3 times in proving to VMware GSS that the issue really does exist. To cut a long story short, VMware GSS has opened a bug ticket with engineering, and I believe it should be resolved in a future release.
Now, if you are running Hyperic 5.8.4 and you are looking to replace SSL certificates for an implementation that is already running, the task is relatively straightforward, although I will not be covering how to replace the Hyperic SSL certificates as part of this post. The successful execution of replacing SSL certificates for the Hyperic server depends on how the Hyperic agents that are currently reporting back into the Hyperic server were configured when they were deployed.
The default Hyperic agent configuration, which can be found in the agent.properties file, contains a configuration line that will ultimately determine if the agent will accept the new SSL certificate presented by the Hyperic server or not. If the Hyperic agent was pushed out using the default settings within the agent.properties file, with the exception of the “agent.setup.<setting>” lines, you will most probably encounter the bug when replacing your SSL certificates on the Hyperic Server.
Reproducing the issue
To reproduce the issue in a lab, simply:
1. Deploy a new Hyperic Server instance
2. Deploy a few new Windows and/or Linux servers with the agent installed, using the default agent.properties configuration.
3. Confirm that the agents are monitored from within Hyperic
4. Replace the Hyperic Server SSL Certificate
5. Monitor the platforms from the Hyperic web interface and confirm that they show as red (unavailable) after a few minutes
6. Monitor the agent machines’ CPU utilisation and agent.log
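For step 6, a small script can save you from eyeballing agent.log by hand. The following is only a minimal sketch in Python: the function name count_ssl_failures is my own, and it assumes nothing more than that the failure shows up in the log as javax.net.ssl.SSLPeerUnverifiedException, as in the excerpt further down this post.

```python
import re

# Exception name as it appears in the agent.log excerpt in this post.
SSL_FAILURE = re.compile(r"javax\.net\.ssl\.SSLPeerUnverifiedException")


def count_ssl_failures(log_lines):
    """Count the log lines that mention SSLPeerUnverifiedException.

    log_lines can be any iterable of strings, for example an open file.
    """
    return sum(1 for line in log_lines if SSL_FAILURE.search(line))
```

Pass it an open file handle for agent.log (the path depends on where your agent is installed). A count that keeps climbing between runs suggests the agent is stuck in the retry loop.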
What’s the cause?
The default agent.properties file contains the following lines:
## Automatically accept unverified certificates
accept.unverified.certificates=false
With this accept.unverified.certificates (notice the plural) property set to false, the Hyperic agent will not accept the new Hyperic server certificate and will therefore log the following in the agent.log file:
[SenderThread] [AgentCallbackClient@168] javax.net.ssl.SSLPeerUnverifiedException: The authenticity of host 'vrhs01.spiesr.com' can't be established: javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated
javax.net.ssl.SSLPeerUnverifiedException: The authenticity of host 'vrhs01.spiesr.com' can't be established: javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated
    at org.hyperic.util.security.DefaultSSLProviderImpl$1.verify(DefaultSSLProviderImpl.java:139)
    at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:390)
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:148)
    at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:149)
    at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121)
    at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:561)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
    at org.hyperic.util.http.HQHttpClient.post(HQHttpClient.java:81)
    at org.hyperic.util.http.HQHttpClient.post(HQHttpClient.java:57)
    at org.hyperic.lather.client.LatherHTTPClient.invoke(LatherHTTPClient.java:111)
    at org.hyperic.hq.bizapp.client.AgentCallbackClient.invokeLatherCall(AgentCallbackClient.java:162)
    at org.hyperic.hq.bizapp.client.AgentCallbackClient.invokeLatherCall(AgentCallbackClient.java:146)
    at org.hyperic.hq.bizapp.client.MeasurementCallbackClient.measurementSendReport(MeasurementCallbackClient.java:62)
    at org.hyperic.hq.measurement.agent.server.SenderThread.sendBatch(SenderThread.java:457)
    at org.hyperic.hq.measurement.agent.server.SenderThread.sendData(SenderThread.java:645)
    at org.hyperic.hq.measurement.agent.server.SenderThread.run(SenderThread.java:630)
    at java.lang.Thread.run(Thread.java:745)
Caused by: javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated
    at sun.security.ssl.SSLSessionImpl.getPeerCertificates(SSLSessionImpl.java:421)
    at org.apache.http.conn.ssl.AbstractVerifier.verify(AbstractVerifier.java:128)
    at org.hyperic.util.security.DefaultSSLProviderImpl$1.verify(DefaultSSLProviderImpl.java:137)
    ... 19 more
This is expected behaviour. What is not expected is what happens next. The agent keeps retrying indefinitely, going into a loop of connecting -> receiving the new SSL certificate -> rejecting the new SSL certificate. With this constant retrying, the agent uses up to 100% of the available CPU power.
Whilst the agent is in this loop, you can simply edit the agent.properties file and change the lines from:
## Automatically accept unverified certificates
accept.unverified.certificates=false
to read:
## Automatically accept unverified certificates
accept.unverified.certificates=true
Once the change is made, save the agent.properties file. Without having to restart the Hyperic agent, you’ll notice that the CPU utilisation has immediately dropped to normal levels and that the platform will show as green in Hyperic after a few minutes.
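If you have more than a handful of agents to fix, you will probably want to script that edit. Here is a minimal sketch in Python of how the change could be applied to the contents of an agent.properties file. The function name set_accept_unverified is hypothetical, and the sketch assumes the property sits on its own uncommented line in key=value form, as in the default file.

```python
import re

# An uncommented "accept.unverified.certificates=<value>" line.
# Spaces and tabs around "=" are tolerated; "#" comment lines never match.
_PROPERTY = re.compile(
    r"^([ \t]*accept\.unverified\.certificates[ \t]*=[ \t]*)\S*[ \t]*$",
    re.MULTILINE,
)


def set_accept_unverified(properties_text, value=True):
    """Return properties_text with accept.unverified.certificates set to value."""
    new_value = "true" if value else "false"
    updated, count = _PROPERTY.subn(
        lambda match: match.group(1) + new_value, properties_text
    )
    if count == 0:
        # Refuse to guess where the property belongs if it is missing.
        raise ValueError("accept.unverified.certificates not found")
    return updated
```

Read the file, pass its text through the function and write the result back. The similarly named agent.setup.acceptUnverifiedCertificate line is left untouched, because the pattern only matches the plural property at the start of a line.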
This is a major issue. If you have a thousand “platforms” in Hyperic all communicating back using this version of the agent configured to not accept unverified certificates (i.e. the default configuration), you’ll probably bring down all of those platforms in a reverse-DDoS style internal attack (if a term like that even exists), simply by replacing that single SSL server certificate on the Hyperic server.
As mentioned before, we have opened a support request with VMware GSS and after having to reproduce the issue and upload DEBUG logs to GSS, they have now acknowledged that it is an issue and that a bug report has been submitted.
Deploying Hyperic?
When preparing the agent.properties file prior to rolling out the Hyperic agent to your estate, the accept.unverified.certificates property should NOT be confused with the agent.setup.acceptUnverifiedCertificate property. The agent.setup.acceptUnverifiedCertificate property is only used for the initial agent configuration, where it will accept the initial SSL certificate presented by the Hyperic server. Once this certificate has been accepted, a change to the Hyperic server certificate will only be accepted if the accept.unverified.certificates property has been set to true.
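To make the distinction concrete, here is a minimal sketch in Python of a pre-deployment check on the contents of an agent.properties file. Both function names are my own, the parser only handles simple key=value lines, and treating a missing property as false is an assumption based on the default described above.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def will_accept_replaced_certificate(text):
    """True only if accept.unverified.certificates is explicitly 'true'.

    agent.setup.acceptUnverifiedCertificate is deliberately ignored here,
    since it only applies to the initial agent configuration.
    Assumption: a missing property behaves like the default, false.
    """
    props = parse_properties(text)
    value = props.get("accept.unverified.certificates", "false")
    return value.lower() == "true"
```

Running this over the agent.properties you are about to roll out gives a quick yes/no on whether those agents would accept a replaced server certificate, regardless of what the agent.setup line says.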
I really hope that those with the default agent configuration who wish to replace their Hyperic server certificates find this blog post, or at least test the replacement in a lab first, before attempting it, as it could cause major performance problems on all their servers (platforms) with this agent configuration in place.