Rhino Troubleshooting Guide
TAS-007-Issue 2.5.0-Release 1
September 2018
Rhino Troubleshooting Guide (V2.5.0)
Notices
Copyright © 2017 Metaswitch Networks. All rights reserved.
This manual is issued on a controlled basis to a specific person on the understanding that no part of the Metaswitch Networks product code or
documentation (including this manual) will be copied or distributed without prior agreement in writing from Metaswitch Networks.
Metaswitch Networks reserves the right to, without notice, modify or revise all or part of this document and/or change product features or
specifications and shall not be responsible for any loss, cost, or damage, including consequential damage, caused by reliance on these materials.
Metaswitch and the Metaswitch logo are trademarks of Metaswitch Networks. Other brands and products referenced herein are the trademarks or
registered trademarks of their respective holders.
Contents
1 Rhino Troubleshooting Guide........................................................................................................................................ 13
2 Common Symptoms........................................................................................................................................................14
2.1 Rhino watchdog.......................................................................................................................................................................................... 14
2.1.1 Stuck Threads watchdog.............................................................................................................................................................. 14
Symptoms................................................................................................................................................................................... 14
2.1.2 Timewarp watchdog......................................................................................................................................................................16
Long Garbage Collection (GC) pause..........................................................................................................16
System clock changes................................................................................................................................................................ 17
Virtual Machine pauses...............................................................................................................................................................17
Excessive system I/O waits..........................................................................................................17
2.2 Long-running transaction warnings.............................................................................................................................................................17
2.2.1 Symptoms..................................................................................................................................................................................... 18
2.2.2 Diagnostic steps and correction....................................................................................................................................................18
2.3 Lock timeouts..............................................................................................................................................................................................18
2.3.1 Symptoms..................................................................................................................................................................................... 19
2.3.2 Diagnostic steps and correction....................................................................................................................................................19
2.4 I/O exceptions.............................................................................................................................................................................................21
2.4.1 I/O exceptions from Savanna in Rhino logs..................................................................................................................................22
2.4.2 Symptoms..................................................................................................................................................................................... 22
2.4.3 Diagnostic steps and correction....................................................................................................................................................22
2.4.4 I/O exceptions from Rhino’s logging infrastructure in Rhino logs................................................................................................. 22
2.4.5 Symptoms..................................................................................................................................................................................... 22
2.4.6 Workaround or resolution..............................................................................................................................................................22
2.4.7 I/O exceptions from Rhino’s core in Rhino logs............................................................................................................................ 22
2.4.8 Symptoms..................................................................................................................................................................................... 22
2.4.9 Workaround or resolution..............................................................................................................................................................23
2.4.10 I/O exceptions from resource adaptors or SBBs in Rhino logs...................................................................................................23
2.5 Inactive RA provider................................................................................................................................................................................... 23
2.5.1 Symptoms..................................................................................................................................................................................... 23
2.5.2 Resolution..................................................................................................................................................................................... 23
2.6 Leaked OCIO buffers..................................................................................................................................................................................23
2.6.1 Symptoms..................................................................................................................................................................................... 23
2.6.2 Diagnostic steps............................................................................................................................................................................24
2.6.3 Profile Management and/or provisioning failing............................................................................................................................ 24
Symptoms................................................................................................................................................................................... 24
Diagnostic steps..........................................................................................................................................................................24
Resolution................................................................................................................................................................................... 26
2.7 License over capacity alarms..................................................................................................................................................................... 26
2.8 Rhino logs containing exceptions............................................................................................................................................................... 27
2.8.1 Resource Adaptors throwing Exceptions...................................................................................................................................... 27
3 Environmental..................................................................................................................................................................28
3.1 Operating environment issues.................................................................................................................................................................... 28
3.1.1 Symptoms..................................................................................................................................................................................... 28
3.1.2 Diagnostic steps............................................................................................................................................................................28
3.1.3 Workaround or resolution..............................................................................................................................................................29
3.2 Java Virtual Machine heap issues.............................................................................................................................................................. 30
3.2.1 Symptoms..................................................................................................................................................................................... 30
3.2.2 Diagnostic steps............................................................................................................................................................................30
3.2.3 Workaround or resolution..............................................................................................................................................................31
3.3 Application or resource adaptor heap issues..............................................................................................................................................32
3.3.1 Symptoms..................................................................................................................................................................................... 32
3.3.2 Diagnostic steps............................................................................................................................................................................32
3.3.3 Workaround or resolution..............................................................................................................................................................34
3.4 Rhino start-up fails with 'java.io.IOException: Not Enough Space'.............................................................................................................34
3.4.1 Symptoms..................................................................................................................................................................................... 34
3.4.2 Diagnostic steps............................................................................................................................................................................35
3.4.3 Workaround or resolution..............................................................................................................................................................35
3.5 Warning about UDP buffer sizes................................................................................................................................................................ 35
3.5.1 Symptoms..................................................................................................................................................................................... 36
3.5.2 Resolution..................................................................................................................................................................................... 36
3.6 Java Virtual Machine error..........................................................................................................................................................................36
3.6.1 Symptoms..................................................................................................................................................................................... 36
3.6.2 Diagnostic steps..............................................................................................................38
3.6.3 Workaround or resolution..............................................................................................................................................................38
3.7 Multicast traffic is using the wrong network interface................................................................................................................................. 39
3.7.1 Symptoms..................................................................................................................................................................................... 39
3.7.2 Diagnostic steps............................................................................................................................................................................39
3.7.3 Workaround or resolution..............................................................................................................................................................40
4 Clustering......................................................................................................................................................................... 41
4.1 Node failure................................................................................................................................................................................................ 41
4.1.1 Configuration change messages appear in Rhino log output....................................................................................................... 41
Symptoms................................................................................................................................................................................... 41
Diagnosis and Resolution........................................................................................................................................................... 41
4.1.2 An alarm indicates a node has failed............................................................................................................................................ 41
Symptoms................................................................................................................................................................................... 41
Diagnosis and Resolution........................................................................................................................................................... 42
4.1.3 A Rhino node exits the JVM..........................................................................................................................................................42
Symptoms................................................................................................................................................................................... 42
4.1.4 Out of memory errors....................................................................................................................................................................42
Symptom..................................................................................................................................................................................... 42
Diagnosis and Resolution........................................................................................................................................................... 43
4.1.5 JVM errors.................................................................................................................................................................................... 43
4.1.6 Watchdog timeouts....................................................................................................................................................................... 43
Symptoms................................................................................................................................................................................... 43
Diagnosis and Resolution........................................................................................................................................................... 44
4.2 Cluster segmentation..................................................................................................................................................................................44
4.2.1 Symptoms..................................................................................................................................................................................... 44
4.2.2 Diagnostic steps............................................................................................................................................................................44
4.2.3 Workaround or resolution..............................................................................................................................................................44
4.3 Cluster failing to start.................................................................................................................................................................................. 45
4.3.1 No Primary Component................................................................................................................................................................ 45
Symptoms................................................................................................................................................................................... 45
Resolution................................................................................................................................................................................... 45
4.3.2 Multicast Configuration Error........................................................................................................................................................ 46
Symptoms................................................................................................................................................................................... 46
4.4 Cluster starts but stops after a few minutes................................................................................................................................................47
4.4.1 Symptoms..................................................................................................................................................................................... 47
4.4.2 Diagnostic steps............................................................................................................................................................................48
4.4.3 Workaround or resolution..............................................................................................................................................................49
4.5 Rhino SLEE fails to start cluster groups..................................................................................................................................................... 50
4.5.1 Symptoms..................................................................................................................................................................................... 50
4.5.2 Diagnostic steps............................................................................................................................................................................50
4.5.3 Workaround or resolution..............................................................................................................................................................51
4.6 Group heartbeat timeout.............................................................................................................................................................................51
4.6.1 Symptoms..................................................................................................................................................................................... 51
4.7 Scattercast endpoints out of sync...............................................................................................................................................................52
4.7.1 Symptoms..................................................................................................................................................................................... 52
5 Performance.....................................................................................................................................................................56
5.1 High Latency...............................................................................................................................................................................................56
5.1.1 Symptoms..................................................................................................................................................................................... 56
5.1.2 Diagnostic steps............................................................................................................................................................................56
Low number of available threads in Rhino statistics output........................................................................................................ 57
High staging queue size in Rhino statistics output......................................................................................................................57
Dropped staging items................................................................................................................................................................ 57
High event processing time.........................................................................................................................................................57
5.1.3 Workaround or Resolution............................................................................................................................................................ 58
5.2 Dropped Calls............................................................................................................................................................................................. 58
5.2.1 Symptoms..................................................................................................................................................................................... 58
5.2.2 Diagnostic steps............................................................................................................................................................................59
5.2.3 Rhino logs containing exceptions................................................................................................................................................. 59
Resource Adaptors throwing Exceptions.................................................................................................................................... 59
5.2.4 Lock timeout messages in Rhino logs and/or console..................................................................................................................59
5.2.5 Rate limiting.................................................................................................................................................................................. 60
5.2.6 A dependent external system is not functioning properly............................................................................................. 60
5.3 A newly started node is unable to handle full traffic load............................................................................................................................61
5.3.1 Symptoms..................................................................................................................................................................................... 61
5.3.2 Workaround or Resolution............................................................................................................................................................ 61
5.4 Uneven CPU load/memory usage across cluster nodes............................................................................................................................ 62
5.4.1 Symptoms..................................................................................................................................................................................... 62
5.4.2 Diagnostic steps and correction....................................................................................................................................................62
6 Configuration Problems..................................................................................................................................................63
6.1 Security Related Exceptions.......................................................................................................................................................................63
6.1.1 Various connection related exceptions......................................................................................................................................... 63
Symptoms................................................................................................................................................................................... 63
Resolution................................................................................................................................................................................... 65
6.1.2 Various permission related exceptions......................................................................................................................................... 65
6.1.3 Symptoms..................................................................................................................................................................................... 65
6.1.4 Diagnosis and Resolution............................................................................................................................................................. 65
6.2 Memory Database Full................................................................................................................................................................................66
6.2.1 Profile Management and/or provisioning failing............................................................................................................................ 66
Symptoms................................................................................................................................................................................... 66
Resolution and monitoring.......................................................................................................................................................... 68
6.2.2 Deployment failing........................................................................................................................................................................ 69
Symptoms................................................................................................................................................................................... 69
Resolution and monitoring.......................................................................................................................................................... 71
6.2.3 Calls not being set up successfully................................................................................................ 71
Resolution and monitoring.......................................................................................................................................................... 72
6.2.4 Resizing MemDB Instances..........................................................................................................................................................73
6.3 Resource Adaptors refuse to connect using TCP/IP.................................................................................................................................. 74
6.3.1 Diagnostic steps............................................................................................................................................................................74
6.3.2 Workaround or Resolution............................................................................................................................................................ 75
6.4 Local hostname not resolved properly........................................................................................................................................................75
6.4.1 Symptoms..................................................................................................................................................................................... 75
6.4.2 Diagnostic steps............................................................................................................................................................................75
6.4.3 Workaround or Resolution............................................................................................................................................................ 76
7 Management.....................................................................................................................................................................77
7.1 Connections Refused for the Command Console, Deployment Script or Rhino Element Manager........................................................... 77
7.1.1 Symptoms..................................................................................................................................................................................... 77
7.1.2 Diagnostic steps and correction....................................................................................................................................................78
Rhino is not listening for management connections....................................................................................................................78
Rhino refuses connections..........................................................................................................................................................78
Management client is not configured to connect to the Rhino host.............................................................................................79
7.2 A Management Client Hangs......................................................................................................................................................................79
7.2.1 Symptoms..................................................................................................................................................................................... 79
7.2.2 Workaround or Resolution............................................................................................................................................................ 80
7.3 Statistics client reports “Full thread sample containers”............................................................................................................................. 80
7.3.1 Symptoms..................................................................................................................................................................................... 80
7.4 Statistics Client Out of Memory.................................................................................................................................................................. 80
7.4.1 Symptoms..................................................................................................................................................................................... 80
7.4.2 Workaround or Resolution............................................................................................................................................................ 81
7.5 Creating a SyslogAppender gives an AccessControlException................................................................................................................. 81
7.5.1 Symptoms..................................................................................................................................................................................... 81
7.5.2 Workaround or Resolution............................................................................................................................................................ 82
7.6 Platform Alarms.......................................................................................................................................................................................... 82
7.6.1 Symptoms..................................................................................................................................................................................... 82
7.6.2 Diagnostic steps............................................................................................................................................................................82
7.6.3 Workaround or Resolution............................................................................................................................................................ 83
7.7 DeploymentException when trying to deploy a component........................................................................................................................ 84
7.7.1 Symptoms..................................................................................................................................................................................... 84
7.7.2 Diagnostic steps............................................................................................................................................................................84
7.7.3 Workaround or Resolution............................................................................................................................................................ 84
7.8 Deploying to multiple nodes in parallel fails................................................................................................................................................85
7.8.1 Symptoms..................................................................................................................................................................................... 85
7.8.2 Diagnostic steps............................................................................................................................................................................85
7.8.3 Workaround or Resolution............................................................................................................................................................ 85
7.9 Management of multiple Rhino instances...................................................................................................................................................86
7.9.1 Symptoms..................................................................................................................................................................................... 86
7.9.2 Workaround or Resolution............................................................................................................................................................ 86
7.10 Deployment problem on exceeding DB size.............................................................................................................................................86
7.10.1 Symptoms................................................................................................................................................................................... 86
7.11 Diagnostic steps....................................................................................................................................................................................... 86
7.12 BUILD FAILED when installing an OpenCloud product............................................................................................................................87
7.12.1 Symptoms................................................................................................................................................................................... 87
7.12.2 Diagnostic steps..........................................................................................................................................................................87
7.12.3 Workaround or Resolution.......................................................................................................................................................... 87
7.13 REM connection failure during management operations..........................................................................................................................88
7.13.1 Symptoms................................................................................................................................................................................... 88
7.13.2 Diagnostic steps..........................................................................................................................................................................88
7.13.3 Workaround or Resolution.......................................................................................................................................................... 88
7.14 Export error: Multiple Profile Snapshot for profiles residing in seperate memdb instances is unsupported............................................. 88
7.14.1 Symptoms................................................................................................................................................................................... 88
7.14.2 Workaround or Solution.............................................................................................................................................................. 89
7.15 Unused log keys configured in Rhino....................................................................................................................................................... 89
7.15.1 Symptoms................................................................................................................................................................................... 89
7.15.2 Workaround or Resolution.......................................................................................................................................................... 89
7.16 Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT........................................................................................ 89
7.16.1 Symptoms................................................................................................................................................................................... 89
7.16.2 Diagnostic steps..........................................................................................................................................................................90
7.16.3 Workaround or resolution............................................................................................................................................................90
7.17 Log level for trace appender not logging.................................................................................................................................................. 90
7.17.1 Symptoms................................................................................................................................................................................... 90
7.17.2 Workaround or Resolution.......................................................................................................................................................... 90
7.18 Access to REM fails with Command CHECK_CONNECTION invoked without connection ID................................................................ 91
7.18.1 Symptoms................................................................................................................................................................................... 91
7.18.2 Workaround or Resolution.......................................................................................................................................................... 91
8 Database...........................................................................................................................................................................92
8.1 Management Database Server Failure.......................................................................................................................................................92
8.1.1 Symptoms..................................................................................................................................................................................... 92
8.1.2 Resolution and Mitigation..............................................................................................................................................................93
9 Signalware........................................................................................................................................................................94
9.1 CGIN RA to Signalware Backend Connection Errors................................................................................................................................. 94
9.1.1 Symptoms..................................................................................................................................................................................... 94
9.1.2 Diagnostic steps............................................................................................................................................................................94
9.1.3 Workaround or Resolution............................................................................................................................................................ 95
9.2 CGIN RA Cannot Create Outgoing Dialogs................................................................................................................................................95
9.2.1 Symptoms..................................................................................................................................................................................... 95
9.2.2 Diagnostic steps............................................................................................................................................................................96
9.2.3 Workaround or Resolution............................................................................................................................................................ 96
9.3 CGIN RA Cannot Receive Incoming Dialogs..............................................................................................................................................96
9.3.1 Symptoms..................................................................................................................................................................................... 96
9.3.2 Diagnostic steps............................................................................................................................................................................96
9.3.3 Workaround or Resolution............................................................................................................................................................ 97
9.4 Problems with Signalware not involving the CGIN backends..................................................................................................................... 97
10 Exit Codes...................................................................................................................................................................... 98
10.1 Rhino Exit Codes...................................................................................................................................................................................... 98
10.2 JVM Exit Codes........................................................................................................................................................................................ 98
10.2.1 Internal JVM Exit Codes............................................................................................................................................................. 99
10.2.2 Signal Handler Exit Codes.......................................................................................................................................................... 99
10.3 Other exit codes......................................................................................................................................................................................101
1 Rhino Troubleshooting Guide
This document contains troubleshooting help for the Rhino TAS and other OpenCloud products.
Each section provides troubleshooting for a different area:
• Environmental: environmental issues with Rhino.
• Clustering: clustering issues with Rhino.
• Performance: Rhino performance issues.
• Configuration Problems: Rhino configuration.
• Management: Rhino management tools and utilities.
• Common Symptoms: common general issues with Rhino.
• Database: the Rhino management database persistent store.
• Signalware: using Signalware with Rhino’s CGIN Resource Adaptor.
• Exit Codes: Rhino exit codes.
2 Common Symptoms
Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for common general issues with Rhino.
2.1 Rhino watchdog
The watchdog is a thread in Rhino that monitors the system clock and threads in the SLEE for strange behaviour. There are two types of threads
monitored:
• DeliveryThreads — internal threads inside the SLEE
• StageWorkers — threads responsible for executing deployed service logic.
The watchdog thread watches for the following strange behaviour:
• threads becoming “stuck” on page 14
• threads dying on page 14
• the system clock suddenly jumping forwards from the point of view of the JVM on page 16
• the system clock suddenly jumping backwards from the point of view of the JVM on page 16
2.1.1 Stuck Threads watchdog
The stuck threads watchdog behaviour handles stuck or dying threads. Under some circumstances as discussed below, the node will leave the
cluster and exit. If configured for restarting, the node will restart, rejoin the cluster and continue operating.
Symptoms
If a thread becomes stuck, the following appears in the logs:
2016-05-10 12:34:54.487 ERROR [watchdog.threadactive] <Global watchdog thread> Thread has become stuck: StageWorker/0
If a thread dies, the following appears in the logs:
2016-05-10 12:34:54.487 ERROR [watchdog.threadactive] <Global watchdog thread> Thread has died: StageWorker/0
If the watchdog detects that a DeliveryThread has become stuck or has died, that node will exit with the following message and a stack trace:
2016-05-10 12:34:56.509 ERROR [watchdog] <Global watchdog thread> *** WATCHDOG TIMEOUT ***
2016-05-10 12:34:56.509 ERROR [watchdog] <Global watchdog thread> Failed watchdog condition: ThreadActive:DeliveryThread/0
The watchdog monitors the number of alive StageWorker threads to ensure that more than 50% of them remain alive. If fewer than 50% of the
StageWorker threads have remained alive without terminating or becoming stuck, the watchdog will cause the node to exit with an error
message similar to the above. This proportion of surviving threads is configurable; for information about configuring the watchdog, see the Rhino
Production Getting Started Guide.
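The liveness check described above can be illustrated with a small sketch. The function name and the threshold parameter are hypothetical; Rhino's actual implementation is internal and configurable as noted above.

```python
# Hypothetical illustration of the StageWorker liveness check described
# above: the node exits when half or more of its StageWorker threads
# have terminated or become stuck. Names here are illustrative only.
def should_exit(alive_workers, total_workers, min_alive_fraction=0.5):
    """True if the surviving fraction of workers is too low."""
    return alive_workers <= total_workers * min_alive_fraction

print(should_exit(4, 10))  # True: only 40% of workers remain alive
print(should_exit(6, 10))  # False: 60% remain alive
```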
A StageWorker becoming stuck or dying usually indicates that a deployed service is experiencing unexpected problems or has a faulty
implementation. In this scenario, the best approach is to examine why the service is causing threads to block or exit. Common culprits include
excessive blocking on external resources such as JDBC connections, infinite loops, and poor handling of unexpected circumstances.
The watchdog timeout values for StageWorker threads can be increased. However, this is not usually recommended, as a blocked StageWorker
indicates a session with a very long processing time. Increasing the timeout may be appropriate when a service consistently needs to perform a
blocking action that is expected to take a long time. It is also possible to increase the number of StageWorker threads; this improves
responsiveness when a large proportion of StageWorker threads are blocked on an external service or a long computation. Increasing the
StageWorker thread count is strongly recommended when increasing the timeout.
A DeliveryThread becoming stuck or dying usually indicates a fault in the hardware, operating system, libraries, or the Rhino SLEE. If this is
the case, first determine whether hardware is an issue. Refer to any available documentation on diagnosing hardware problems for the platform
in question. Look at the output of the dmesg command on Unix operating systems, and check whether the machine has a history of restarts,
kernel panics, process segmentation faults, and so forth.
In the case of a DeliveryThread becoming stuck or dying, investigate and resolve any possible hardware issues. Contact your solution provider
with the stack trace information from the Rhino logs and circumstances under which the watchdog timeout occurred for information about what
further action could be taken to resolve this issue.
2.1.2 Timewarp watchdog
The timewarp watchdog handles the system clock suddenly changing from the JVM's perspective. A warning is printed in the log if the
system clock suddenly changes. A critical alarm is raised whenever a timewarp is detected, and it is not automatically cleared. This alarm is
considered critical because all timewarps indicate severe problems in the underlying system that threaten cluster stability and the ability to
process sessions in a timely manner.
A timewarp of 5s or greater is sufficient to break the clustering mechanism used in Rhino, and may cause the node to leave the cluster. This is
done to prevent a split-brain situation or data corruption. A single-node cluster will not shut itself down due to timewarps, but this should still be
considered a critical issue, as the node is unavailable during the timewarps.
Rhino may fail if system time ever goes backwards on a single node while the cluster is running, or if it goes backwards to overlap with a time
period when the cluster was previously running. This is because the unique IDs being generated may no longer be unique.
If the watchdog observes a sudden change in the system clock, the following will be printed on the console:
2016-08-29 17:00:37.879 ERROR [watchdog] <Global watchdog thread> Forward timewarp detected! interval=8021 but should be at most 2000
2016-08-29 17:00:37.879 ERROR [watchdog] <Global watchdog thread> old timestamp=1188363629857, new timestamp=1188363637878
2016-08-29 17:00:37.879 ERROR [watchdog] <Global watchdog thread> Check for CPU overloading or stop-the-world GC.
2016-08-29 17:00:37.879 ERROR [watchdog] <Global watchdog thread> Check for external processes that could step the system clock forwards ('date -s', NTP in step mode, etc)
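The detection logic can be sketched as follows. The names and the check period are illustrative assumptions; only the 2000 ms limit and the timestamps come from the example log entry above.

```python
# Hypothetical sketch of a timewarp check: sample the wall clock at a
# fixed period and compare each observed interval against the expected
# one. Names and constants are assumptions, not Rhino's implementation.
CHECK_PERIOD_MS = 1000      # assumed watchdog sampling period
MAX_INTERVAL_MS = 2000      # forward-warp limit, as in the log above

def classify_interval(old_ts_ms, new_ts_ms, max_ms=MAX_INTERVAL_MS):
    """Return 'ok', 'forward timewarp', or 'reverse timewarp'."""
    interval = new_ts_ms - old_ts_ms
    if interval < 0:
        return "reverse timewarp"   # clock stepped backwards
    if interval > max_ms:
        return "forward timewarp"   # clock step, GC pause, or VM stall
    return "ok"

# The timestamps from the example log entry above (interval=8021):
print(classify_interval(1188363629857, 1188363637878))  # forward timewarp
```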
Several causes of timewarps exist.
Long Garbage Collection (GC) pause
The most likely cause of forward timewarps is GC; GC cannot cause a reverse timewarp. To check GC pause times, look for entries in the console
log reporting real times greater than 1s for ParNew, CMS-initial-mark, and CMS-remark. Also look for entries reporting Full GC or
concurrent-mode-failure. concurrent-mode-failure indicates a severe problem.
2016-04-14 14:07:22.387 2016-04-14T14:07:19.279+0200: [GC2016-04-14T14:07:19.279+0200: [ParNew (promotion failed): 127844K->127844K(130112K),3.1073460 secs] 1766087K->1792457K(3144768K), 3.1078370 secs] [Times: user=4.05 sys=0.00, real=3.11 secs]
2016-04-14 14:07:23.865 2016-04-14T14:07:22.387+0200: [Full GC 2016-04-14T14:07:22.387+0200: [CMS: 1664612K->254343K(3014656K), 1.4772760 secs] 1792457K->254343K(3144768K), [CMS Perm : 77532K->77327K(196608K)], 1.4775730 secs] [Times: user=1.50 sys=0.00, real=1.47 secs]
2015-09-24 22:09:13.668 (concurrent mode failure): 22907K->8498K(32768K), 0.0504460 secs] 49900K->8498K(65344K), [CMS Perm : 15238K->15237K(196608K)], 0.0796580 secs] [Times: user=0.20 sys=0.00, real=0.08 secs]
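When scanning large GC logs by hand is impractical, the suspect entries can be flagged with a short script. This is a sketch based on the log format shown above; the 1 s threshold follows the guidance in this section.

```python
import re

# Flags GC log entries whose wall-clock ("real") time exceeds a
# threshold, plus any concurrent mode failures, per the guidance above.
REAL_TIME = re.compile(r"real=([0-9.]+) secs")

def suspect_gc_lines(log_lines, threshold_secs=1.0):
    suspects = []
    for line in log_lines:
        if "concurrent mode failure" in line:
            suspects.append(("concurrent-mode-failure", line))
            continue
        m = REAL_TIME.search(line)
        if m and float(m.group(1)) > threshold_secs:
            suspects.append(("long-pause", line))
    return suspects

sample = [
    "[ParNew ...] [Times: user=4.05 sys=0.00, real=3.11 secs]",
    "[ParNew ...] [Times: user=0.20 sys=0.00, real=0.08 secs]",
]
print([kind for kind, _ in suspect_gc_lines(sample)])  # ['long-pause']
```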
Resolving long GC pauses almost always requires modifying the heap size/new size to eliminate the long pause. This must be done in tandem
with performance testing to verify the effects of changes.
System clock changes
This is most commonly caused by a misconfigured time synchronization daemon (usually NTP). Rhino requires that NTP run in slew mode; in
step mode, clock adjustments may exceed the timewarp limit. NTP steps can be either forward or backward.
Virtual Machine pauses
Virtual machines may pause processing for many reasons, and may produce both forward and reverse timewarps depending on hypervisor
configuration. To minimize VM pauses, we strongly recommend avoiding overcommitting resources.
Excessive system IO waits
Excessive system IO waits can only cause forward timewarps. These occur when the host fails to handle disk IO in a timely manner, blocking the
JVM. This can be diagnosed through OS-level logging and IO monitoring tools. Other causes of forward timewarps should be investigated first.
Note: VMs are particularly prone to this form of timewarp.
2.2 Long-running transaction warnings
Long-running transaction warning messages appear in Rhino logs if a transaction has been running for too long. Long-running transactions can be
caused when:
• a profile has been opened for editing but has not been committed or restored
• a large deployable unit is being installed; installation is done in a single transaction and can take a long time to compile
• a single node is restarting (nodes load most of their state from the database in a single transaction, which can take some time)
• a service makes blocking calls to external resources that are very slow to respond
• the cluster is running under excessive load.
2.2.1 Symptoms
Warning messages like the following in the Rhino logs or console indicate long-running transactions.
...WARN [transaction.manager] There are 1 long-running transactions currently active:WARN [transaction.manager] 43190ms: [OCTransactionImpl:TransactionId:[101:16]:ACTIVE]...
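The reported transaction ages can be extracted mechanically from the logs. This is a sketch; the regular expression is an assumption based on the message format shown above.

```python
import re

# Extracts transaction ages (in ms) from transaction.manager warnings.
# The pattern is assumed from the example message above, not a spec.
AGE = re.compile(r"(\d+)ms: \[OCTransactionImpl")

def transaction_ages(log_lines):
    ages = []
    for line in log_lines:
        for m in AGE.finditer(line):
            ages.append(int(m.group(1)))
    return ages

line = "WARN [transaction.manager] 43190ms: [OCTransactionImpl:TransactionId:[101:16]:ACTIVE]"
print(max(transaction_ages([line])))  # 43190
```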
2.2.2 Diagnostic steps and correction
A long-running transaction may be reported during installation of components during deployment or when a node is starting up. In this case the
error message is benign, and the warning will be cleared when the components have finished installing.
If this message appears only during start-up or deployment of components, it can be safely ignored.
If such messages persist or occur during normal operation, report them to your solution provider. Please gather the output of the
dumpthreads.sh command from the node directory. This output is written to the Rhino console and is not gathered as part of Rhino’s logging
system; it can usually be found in the file console.log. Provide this output to your solution provider’s support.
2.3 Lock timeouts
Lock timeouts occur when a node cannot obtain a cluster-wide lock within a certain timeout. Lock timeouts are most likely to occur when nodes
leave the cluster, when nodes are experiencing overload conditions, or when network problems are being experienced. Some possible causes
are:
1. Overloaded nodes
2. Network congestion
3. Network failure
4. Nodes which are about to leave the cluster due to some failure (e.g. garbage collection pauses)
5. Too much contention for the lock in question,
such that users of the lock are queued long enough for their requests to time out.
6. Too much load in general on the system,
such that the lock request can’t be processed by the in-memory database before it times out.
7. Deadlock,
when locking resources are acquired in different orders.
8. A transaction has somehow been "lost" and left open holding a lock;
this could be a product bug.
9. A thread has become stuck, holding its transaction open and therefore holding its lock;
this has various possible causes, including service bugs, JVM bugs, and/or product bugs.
2.3.1 Symptoms
Below are examples of warning messages in Rhino logs or console that indicate lock timeouts.
...
... Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT ... ...
...
...
... ========= Lock timeout ========= ...
...
... Unable to acquire lock on [lockId] within 5000 ms ...
...
2.3.2 Diagnostic steps and correction
The diagnostic process determines whether the failure is due to environmental issues or to a hardware, network, or software related issue.
A hardware, network, or software related issue is usually indicated by the presence of an alarm. To check for alarms, use the following
rhino-console command.
./client/bin/rhino-console listactivealarms
This command produces one line of output for each active alarm. If the list of active alarms contains a line similar to the following, it indicates that
a node has left the cluster due to hardware, network, or software failure.
...Alarm ...(Node 101, 23-Nov-05 15:10:36.636): Major [rhino.cluster] Node 102 has left the cluster...
If an alarm indicates node failure, follow the diagnostic and resolution procedures for Node Failure on page 41.
If the output of the command does not contain a line similar to the above, or indicates no alarms are active (as below), then the symptoms are
occurring because of environmental issues or software faults.
No active alarms
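A simple way to triage this output in a script can be sketched as follows; the matched phrase comes from the example alarm output above.

```python
# Classifies listactivealarms output per the two cases described above.
# The matched phrase comes from the example alarm message; this is an
# illustrative triage helper, not a Rhino tool.
def node_failure_suspected(alarm_output):
    """True if any active alarm reports a node leaving the cluster."""
    return "has left the cluster" in alarm_output

print(node_failure_suspected("No active alarms"))  # False
print(node_failure_suspected(
    "Alarm ... Major [rhino.cluster] Node 102 has left the cluster"))  # True
```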
Overload conditions are likely to cause lock timeouts. CPU usage should never exceed 70% for an extended period of time. CPU usage can be
measured on Linux using the top command.
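On Linux, overall CPU usage can also be derived from two samples of the first line of /proc/stat; a sketch, with hypothetical sample values:

```python
# Computes busy CPU percentage between two 'cpu ...' lines sampled
# from /proc/stat. Fields 4 and 5 (idle, iowait) count as idle time.
def cpu_busy_percent(sample1, sample2):
    f1 = [int(x) for x in sample1.split()[1:]]
    f2 = [int(x) for x in sample2.split()[1:]]
    total = sum(f2) - sum(f1)
    idle = (f2[3] + f2[4]) - (f1[3] + f1[4])
    return 100.0 * (total - idle) / total

# Two hypothetical samples taken roughly a second apart:
s1 = "cpu 100 0 100 700 100 0 0 0"
s2 = "cpu 180 0 150 715 105 0 0 0"
print(round(cpu_busy_percent(s1, s2), 1))  # 86.7, above the 70% ceiling
```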
Garbage collection, particularly under overload conditions, can cause lock timeouts. The behaviour of the garbage collector can be analysed by
examining the garbage collection logs. By default these are logged to the Rhino console or the console.log file in
node-???/work/log. For further information on diagnosing GC-related problems, please refer to Java Virtual Machine Heap Issues on page
30 and Application or Resource Adaptor Heap Issues on page 32.
SLEE load can also be investigated with the use of the Statistics Client. For more information about usage of the Statistics Client, please see the
rhino-stats section in the Rhino Administration and Deployment Guide. Of particular value are the time statistics for the lock manager, MemDB and
Savanna subsystems. If any of these report times in the tens of milliseconds or greater, this is usually a symptom of overload.
To resolve problems caused by overload, it is usually necessary to add more cluster nodes on new hardware, correct the configuration of the
software, or improve the efficiency of the service logic. Follow the diagnostic steps in the sections Operating Environment Issues on page
28, Java Virtual Machine Heap Issues on page 30, and Application or Resource Adaptor Heap Issues on page 32. If this does not lead
to a resolution, contact your solution provider for assistance.
If a lock timeout is caused by overloading the cluster, then an interim resolution is to enable a user-specified rate limiter to limit the number of
events processed by the cluster. Instructions for doing so are in the Rhino Administration and Deployment Guide .
Threshold rules should be set up to alert the SLEE administrator to adverse conditions within the SLEE.
Deadlock, lost transactions, and stuck threads are typically caused by software bugs. These can be in Rhino, service logic, or RAs, and are often
unrelated to call rate. To assist in diagnosing these problems, observe the statistics parameter sets StagingThreads and Events. Diagnosing
the cause will usually require setting the log keys transaction.manager and transaction.instance to the debug level. The logs
obtained should be sent to your solution provider.
Note that this writes a lot of log data and can impact service performance.
2.4 I/O exceptions
There are several causes for I/O exceptions to occur in Rhino logs. Below are the diagnostic steps and workarounds or resolution for I/O
exceptions from:
• Savanna on page 22
• Rhino’s logging infrastructure on page 22
• Rhino’s core on page 22
• resource adaptors or SBBs on page 23
For all diagnostic steps and workarounds or resolutions, first check that:
1. There is sufficient disk space on the running system.
2. Executing processes have write permissions to files in their subdirectories.
3. The ulimit for the number of open file descriptors is not too low; depending on the protocols in use, Rhino may require a higher open-files
ulimit than the OS default.
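The third check can be performed from within a process using Python's standard resource module; this is a sketch, and the 1024 floor is illustrative rather than a Rhino requirement.

```python
import resource

# Reads the open-file-descriptor limits for the current process.
# 1024 is a common OS default used here as an illustrative floor;
# the appropriate value depends on the protocols in use.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft} hard={hard}")
if soft < 1024:
    print("warning: open-file ulimit looks low for a Rhino node")
```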
2.4.1 I/O exceptions from Savanna in Rhino logs
2.4.2 Symptoms
The log key for these I/O exceptions contains any of the following phrases:
• savanna
• framework
• ocio
2.4.3 Diagnostic steps and correction
These messages typically occur if the network is unable to accept multicast traffic from Rhino. This can be caused either by inappropriate kernel
IP routing configuration or by problems with network elements such as routers, switches, ethernet cards, firewalls, etc. Start diagnosis from
Operating Environment Issues on page 28. If this does not produce a solution, continue with the Clustering section. Finally, contact your solution
provider for support.
2.4.4 I/O exceptions from Rhino’s logging infrastructure in Rhino logs
2.4.5 Symptoms
The log key for these I/O exceptions contains any of the following phrases:
• rhino.logging
• log4j.*
2.4.6 Workaround or resolution
These messages typically occur if the Rhino installation does not have appropriate disk space and/or permissions available, or if there is a
hardware error.
2.4.7 I/O exceptions from Rhino’s core in Rhino logs
2.4.8 Symptoms
The log key for these I/O exceptions contains the following phrases:
• rhino.*
• memdb
• transaction.*
2.4.9 Workaround or resolution
These messages should be provided to the solution provider unless they clearly indicate the cause.
2.4.10 I/O exceptions from resource adaptors or SBBs in Rhino logs
If the log messages do not occur on any of the log keys mentioned in the cases above, they are likely to come from a resource adaptor or SBB.
These messages should be provided to the solution provider and possibly the resource adaptor vendor.
2.5 Inactive RA provider
2.5.1 Symptoms
A service fails with an error that looks like:
java.lang.IllegalStateException: sendQuery called on inactive RA provider
2.5.2 Resolution
The above exception indicates that you need to activate the RA entity. Documentation on managing RA entities can be found in the Rhino
Administration and Deployment Guide.
2.6 Leaked OCIO buffers
2.6.1 Symptoms
Rhino logs contain the message:
2015-04-15 22:54:30.794 INFO [savanna.ocio.leaks] <OCIO Cleanup Thread> Garbage-collected 12201 leaked OCIO buffers
2015-04-15 22:54:30.794 INFO [savanna.ocio.leaks] <OCIO Cleanup Thread> Garbage-collected 12301 leaked OCIO buffers
2.6.2 Diagnostic steps
Compare the times these warnings are logged with garbage collection cycles. Cleanup of leaked OCIO buffers happens during CMS GC cycles
and all G1 collections, and it writes one log entry per 100 buffers cleaned up. The log entry reports the accumulated total of all OCIO buffers
cleaned up since the node started. If this message is occurring on every GC cycle, that may indicate an environment-related performance
problem.
Continue diagnosis of the performance problem, starting at Operating Environment Issues on page 28.
If there are no signs of overload, then the warning is harmless and indicates that a section of code in the Rhino clustering implementation has
cleaned up internal objects that would have otherwise leaked.
2.6.3 Profile Management and/or provisioning failing
Symptoms
Creating or importing profiles fails.
Diagnostic steps
If profile management and/or provisioning commands are unsuccessful, this is possibly due to the size restriction of the ProfileDatabase installed in
Rhino. The following command can be used to monitor the size of the ProfileDatabase:
./client/bin/rhino-stats -m MemDB-Replicated.ProfileDatabase
Check the Rhino log for messages with a log key profile.* similar to the following:
2016-12-01 13:41:57.081 WARN [profile.mbean] <RMI TCP Connection(2)-192.168.0.204> [foo:8] Error committing profile:
javax.slee.management.ManagementException: Cannot commit transaction
    at com.opencloud.rhino.management.TxSupport.commitTx(TxSupport.java:36)
    at com.opencloud.rhino.management.SleeSupport.commitTx(SleeSupport.java:28)
    at com.opencloud.rhino.impl.profile.GenericProfile.commitProfile(GenericProfile.java:127)
    at sun.reflect.GeneratedMethodAccessor97.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.opencloud.rhino.management.DynamicMBeanSupport.doInvoke(DynamicMBeanSupport.java:157)
    at com.opencloud.rhino.management.DynamicMBeanSupport$1.run(DynamicMBeanSupport.java:121)
    at java.security.AccessController.doPrivileged(Native Method)
    at com.opencloud.rhino.management.DynamicMBeanSupport.invoke(DynamicMBeanSupport.java:113)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
    at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
    at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.opencloud.rhino.management.BaseMBeanInterceptor.intercept(BaseMBeanInterceptor.java:12)
    at com.opencloud.rhino.management.AuditingMBeanInterceptor.intercept(AuditingMBeanInterceptor.java:245)
    at com.opencloud.rhino.management.ObjectNameNamespaceQualifierMBeanInterceptor.intercept(ObjectNameNamespaceQualifierMBeanInterceptor.java:74)
    at com.opencloud.rhino.management.NamespaceAssociatorMBeanInterceptor.intercept(NamespaceAssociatorMBeanInterceptor.java:66)
    at com.opencloud.rhino.management.RhinoPermissionCheckInterceptor.intercept(RhinoPermissionCheckInterceptor.java:56)
    at com.opencloud.rhino.management.CompatibilityMBeanInterceptor.intercept(CompatibilityMBeanInterceptor.java:130)
    at com.opencloud.rhino.management.StartupManagementMBeanInterceptor.intercept(StartupManagementMBeanInterceptor.java:97)
    at com.opencloud.rhino.management.RemoteSafeExceptionMBeanInterceptor.intercept(RemoteSafeExceptionMBeanInterceptor.java:26)
    at com.opencloud.rhino.management.SleeMBeanServerBuilder$MBeanServerInvocationHandler.invoke(SleeMBeanServerBuilder.java:38)
    at com.sun.proxy.$Proxy9.invoke(Unknown Source)
    at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
    at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
    at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1427)
    at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
    at sun.reflect.GeneratedMethodAccessor80.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
    at sun.rmi.transport.Transport$2.run(Transport.java:202)
    at sun.rmi.transport.Transport$2.run(Transport.java:199)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:198)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:567)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:828)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.access$400(TCPTransport.java:619)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:684)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:681)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:681)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.opencloud.transaction.TransientRollbackException: Unable to prepare due to size limits of the DB
    at com.opencloud.transaction.local.OCTransactionImpl.commit(OCTransactionImpl.java:107)
    at com.opencloud.transaction.local.OCTransactionManagerImpl.commit(OCTransactionManagerImpl.java:190)
    at com.opencloud.rhino.management.TxSupport.commitTx(TxSupport.java:33)
    ... 48 more
Resolution
Increase the size of the profile database as described in Memory Database Full on page 66.
2.7 License over capacity alarms
If this occurs, please contact your solution provider to purchase additional capacity licenses.
2.8 Rhino logs containing exceptions
If the exception is of the type AccessControlException, please refer to section Security Related Exceptions on page 63 to solve the issue.
If the exceptions contain the text "Event router transaction failed to commit", it is likely that the in-memory database is not sized correctly for the
load. Follow the diagnostic and resolution steps in Memory Database Full on page 71.
For other exceptions please contact your solution provider for support.
2.8.1 Resource Adaptors throwing Exceptions
If the log key for the exception log matches the tracer for a resource adaptor entity (e.g. trace.sipra), the cause of the problem is likely to be a
faulty resource adaptor. This indicates either a misconfiguration of the resource adaptors in question or a bug in their implementation. If this
occurs frequently, contact your solution provider for support, attaching the log files that contain the exception message.
3 Environmental
Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for environmental issues with Rhino.
3.1 Operating environment issues
There are several common cases where Rhino’s operating environment is not appropriately configured. These are typically related to resource
contention or undersizing for a given load; the most common causes are competition for CPU and heavy disk I/O.
3.1.1 Symptoms
• Slow event response times
• Low event throughput rate
• Dropped calls
• Periodic failures
3.1.2 Diagnostic steps
Operating system utilities such as vmstat, mpstat, and iostat are particularly helpful in determining whether or not a problem is caused by
resource contention. Run such tools for long enough to observe the problem, then review the logged output immediately before and after. Also
look at CPU, I/O, swap, and network use. You may also want to record these continuously, for analysis of long-term system performance and
transient faults.
If possible, collect system statistics at intervals no longer than 10s, covering every major subsystem. The standard UNIX sar monitor also
collects useful data; other tools may offer higher resolution or better analysis support.
Some of the specific things to look for are listed below:
Please ensure that the system is sized such that no more than around 75% of CPU is used by Rhino under peak load with one failed node. If the
machine is above this utilisation of CPU, then the machine has been incorrectly sized. If running a virtualised environment, this should be scaled to
allow for failure of one physical host.
If other processes running on the same hardware compete for CPU for long enough, the responsiveness of a Rhino node may suffer. In our
experience, operating system scheduling frequently does not provide Rhino with enough resources if there is competition for those resources.
Rhino should be installed on dedicated hardware with no competing applications. Processes that can be especially demanding of CPU and I/O
include databases and other Java servers.
Cron jobs can cause other system processes to run which may compete for system resources. Cron jobs which are not essential should be
disabled. Particular note should be taken of the locate service, which rebuilds its database as a daily or weekly task, an I/O-intensive
procedure.
CPU use of the machine should be monitored to ensure that there is more than about 25% of CPU available at all times, including with one node
failed. Ideally data should be collected no less frequently than every 10s, preferably with 1s intervals.
High levels of disk I/O can result in increased CPU use, and can cause the Rhino JVM process to be scheduled infrequently. This can also cause
writes to be scheduled synchronously, which may block threads during critical tasks such as GC. This is particularly likely on highly contended
disks such as found in virtualised environments and SAN storage arrays.
Swap use is often the cause of disk I/O. Rhino should be installed on a machine with sufficient physical RAM for the workload of the machine.
Swap will also significantly impair GC performance. If even a small amount of the JVM heap is swapped out, GC pause times can increase by up
to 300% and potentially much longer for greater swap usage. While the CMS collector will usually touch pages frequently enough to keep them in
physical memory, the G1 collector can leave regions untouched for long enough to be swapped out preemptively on Linux.
Rhino’s cluster membership services use multicast UDP. If the IP network is loaded heavily, this can cause packets to be dropped, potentially
causing cluster segmentation or failure if individual cluster nodes cannot reliably communicate.
See Cluster Segmentation on page 44, Cluster Failing to Start on page 45, and Cluster Starts but Stops After a few Minutes on page 47
for more details of multicast-related problems.
Large systems (that is, four or more sockets per host) have significant asymmetry in memory access speed and latency between CPU sockets.
This can cause performance to be significantly less than optimal.
3.1.3 Workaround or resolution
If any of the machines Rhino is running on are running over 75% CPU use, and this is not due to other system processes, then consider:
• configuring a user-specified rate limiter as a workaround
• increasing the number of available threads
• installing more CPUs into a node
• installing more nodes into a cluster
• using faster CPUs.
Processes causing a problem due to resource usage should not be used on a machine running Rhino. Move these to another host.
To prevent preemptive swapping slowing GC, reduce the value of the sysctl vm.swappiness from the default of 60 to something low enough
to prevent this; 30 is a good starting point. If using G1, vm.swappiness should be set no higher than 30, and preferably 10 or lower.
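On Linux the current value can be checked and adjusted as follows. This is a sketch using the starting values suggested above; changing the setting requires root:

```shell
# Read the current swappiness value (Linux):
cat /proc/sys/vm/swappiness
# Lower it at runtime (as root):
#   sysctl -w vm.swappiness=30
# Persist the change across reboots by adding to /etc/sysctl.conf:
#   vm.swappiness = 30
```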
When running Rhino on NUMA hosts it is usually best to run multiple nodes on each host with each JVM bound to a separate NUMA domain.
Configure this with numactl --cpunodebind=X --membind=X on Linux or Locality Groups on Solaris.
3.2 Java Virtual Machine heap issues
Poor performance or failure of a Rhino node may result from a JVM configuration which has too little heap to suit the applications deployed and
incoming traffic processed by the application.
This section describes how to diagnose incorrect heap sizing. See Application or resource adaptor heap issues on page 32 to diagnose
application or resource adaptor memory leaks.
3.2.1 Symptoms
• Large garbage collection pauses in garbage collection logs
• Longer than expected message processing and/or call setup times
• Failed or timed-out call setup or message processing attempts
• Out of memory error appearing in Rhino logs
• Watchdog errors "Over memory use limit"
3.2.2 Diagnostic steps
There are two common causes of heap issues when running Rhino:
• heap sizing may be incorrect
• something may be leaking memory.
For the latter, see Application or resource adaptor heap issues on page 32.
The heap usage of Rhino should be monitored. This includes both the Java heap and the native heap. The Java heap can be measured by
Rhino’s statistics tool rhino-stats , or one of the Java monitoring tools — jstat or jvisualvm . The native heap size can be measured using
system utilities such as ps or top .
Here’s an example of using the Rhino statistics tool to monitor the size of the Java heap:
./client/bin/rhino-stats -m JVM
For a well-sized heap, the free memory of the JVM process should gradually decrease from, say, 1.5 gigabytes to around 800
megabytes, and then increase back up to its upper limit again. This long pattern will be overlaid by a shorter pattern of peaks of, say, 128
megabytes. The actual figures will depend on the services that are active and deployed; the total heap used should never exceed 60%
of the allocated heap size for sustained periods when using the default CMS collector.
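As a rough worked example of the 60% guideline (the heap size here is illustrative, not a recommendation):

```shell
# For a 2048 MB heap under the default CMS collector, sustained usage
# should stay below about 60% of the allocation.
HEAP_MB=2048
CEILING_MB=$(( HEAP_MB * 60 / 100 ))
echo "sustained-usage ceiling: ${CEILING_MB} MB"   # 1228 MB for a 2048 MB heap
```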
If the JVM heap is sized appropriately for a given load, and a system appears to run stably at a given load and gradually degrades over a period of
at least several hours, then it is possible that the JVM, a resource adaptor or Rhino has a memory leak.
Check the number of live activities and SBBs using the Rhino statistics monitoring tool. If the number of each is within expected values, then it is
likely there is a memory leak. If there are too many, there may be a bug in the component. A strong indicator of an RA or service bug is
activities that are more than ten times the age of the average call duration, or older than one whole day.
3.2.3 Workaround or resolution
First raise the maximum heap size available to the JVM by modifying the HEAP_SIZE variable in the config/config_variables file. Ensure
that there is enough physical memory on the machine such that the system will not swap if this extra memory is used by the JVM. Also set the
NEW_SIZE based on the duration and memory usage of a typical session. The Rhino node must be restarted for this change to take effect. See the
JVM documentation for more information on heap sizing and related overheads.
Tuning garbage collection settings may improve performance, but this is NOT RECOMMENDED without prolonged performance
testing, as certain combinations of garbage collection settings may severely affect the stability of a cluster and cause JVM crashes. If
you wish to explore this avenue:
• contact your solution provider support or OpenCloud support for help
• thoroughly load test the system for at least two weeks using the new garbage collection settings before deployment to production
• consider adding more RAM to a machine, or more machines to the cluster.
Alternative garbage collectors such as G1 may improve worst-case performance at some cost to average-case performance, but they have
their own tuning requirements. G1 requires the use of Java 8 or newer to function reliably.
With some allocation patterns it may not be possible to find a heap size or new size that provides an acceptable balance between pause times and
durations for the target load. If this is the case the only solutions are to use a different cluster size or contact your solution provider for an improved
version of the service that uses memory more efficiently. This problem should only arise as a result of an error in capacity planning and testing.
A workaround that is sometimes available if the host server has sufficient free RAM is to increase the heap size until it is large enough that the
pauses can be avoided by restarting nodes during idle periods (however this is rarely possible).
Please contact your solution provider support to work through component memory leak issues.
3.3 Application or resource adaptor heap issues
A system with application or resource adaptor related heap issues displays several symptoms, which are similar to those of an inappropriately
sized JVM heap on page 30.
3.3.1 Symptoms
• Large garbage collection pauses in garbage collection logs
• Longer than expected message processing and/or call setup times
• Failed or timed-out call setup or message processing attempts
• Out of memory errors in Rhino logs
3.3.2 Diagnostic steps
When these symptoms occur it is often necessary to check both for application or resource adaptor related heap issues and for JVM heap issues
on page 30.
Start the rhino-console, and query for activities that are older than expected for the system. For example, call activities typically last the duration of
a phone call; so query for activities that are older than the majority of phone calls (say one hour). To perform this query use the findactivities
command with -created-before as a flag.
Here’s an example of finding activities created before 2:30pm:
findactivities -created-before 14:30:00
You can also query for activities older than a specified age, for example:
findactivities -created-before 1h
The resulting number of activities should be low. If it is larger than, say, 30 (depending on your application), there may be
an issue with some activities not being cleaned up properly by the service or resource adaptor. You should exclude ServiceActivities and
ProfileActivities from the set counted.
If the number of activities is small, check the number of SBB entities in Rhino. SBB entities are removed when they are no longer in use; so if
you find a large number of SBB entities that were created a long time ago, it is likely that the service has a problem. Use the findsbbs
command to query SBB entities.
Here’s an example of finding SBB entities for the SIP proxy service created before 2:30pm:
findsbbs -service SIP\ Proxy\ Service\ 1.5,\ Open\ Cloud -created-before 14:30:00
We also recommend that you use the statistics client to examine the memory usage and number of activities in the running SLEE:
$ client/bin/rhino-stats -m Activities
2006-02-27 11:56:29.610 INFO [rhinostat] Connecting to localhost:1199
2006-02-27 11:56:33.952 INFO [rhinostat] Monitoring
2006-02-27 11:56:34.957 INFO [rhinostat] Cluster has members [101]

Activities

time                     active  dropped  finished  started
-----------------------  ------  -------  --------  -------
2006-02-27 11:58:59.574      57        -         -        -
2006-02-27 11:59:00.612      57        0        19       19
2006-02-27 11:59:01.635      57        0        21       21
2006-02-27 11:59:02.657      57        0        19       19
2006-02-27 11:59:03.875      57        0        21       21
2006-02-27 11:59:05.033      57        0        20       20
2006-02-27 11:59:06.053      57        0        20       20
3.3.3 Workaround or resolution
This indicates an unhandled case in the Services or resource adaptors being used. Activities and SBB entities known to be invalid may be
removed through client/bin/rhino-console using the removeactivity and removesbb commands respectively. These commands may
be used to stop a system from running out of memory before a patched application or resource adaptor is installed.
3.4 Rhino start-up fails with 'java.io.IOException: Not Enough Space'
The most likely cause for this exception is that there is not enough free memory. When compiling deployable units, Rhino versions earlier than
2.3.0.7 fork off another JVM for running javac. This may double memory requirements during deployment. The exact behaviour depends on the
memory management of the operating system. Since version 2.3.0.7, Rhino no longer forks to run javac unless specifically configured to do so for
diagnostic purposes. The problem can still occur but is less likely.
3.4.1 Symptoms
INFO [rhino.management.deployment.builder] <main> Generating profile implementation
INFO [rhino.management.deployment.builder] <main> Compiling generated profile specification common classes
java.io.IOException: Not enough space
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:52)
    at java.lang.Runtime.execInternal(Native Method)
    at java.lang.Runtime.exec(Runtime.java:566)
    at java.lang.Runtime.exec(Runtime.java:491)
    at java.lang.Runtime.exec(Runtime.java:457)
WARN [rhino.management.deployment] <main> Installation of deployable unit failed:
com.opencloud.rhino.management.deployment.BuildException: javac failed - exit code -1
Failed to start rhino node
com.opencloud.rhino.management.deployment.BuildException: javac failed - exit code -1
Node 101 exited due to misconfiguration.
3.4.2 Diagnostic steps
Firstly, check to see how much free memory is available to the JVMs, and that swap space is available. Rhino should never run using swap space
(doing so will cause unpredictable performance); but nonetheless, sufficient swap space should be present to allow the OS to manage memory
more effectively.
Check the amount of physical memory available. Under Linux, use the free command; under Solaris, use /usr/sbin/prtconf | grep Memory.
Compare this amount to the size of the JVM specified in the HEAP_SIZE variable in node-???/config/config_variables (for production
Rhino installations). Make sure that the JVM can fit comfortably in the physical memory on the computer.
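A sketch of the comparison (the node directory name is illustrative):

```shell
# Physical and swap memory, in megabytes (Linux):
free -m
# On Solaris:
#   /usr/sbin/prtconf | grep Memory
# Then compare with the configured heap, e.g.:
#   grep HEAP_SIZE node-101/config/config_variables
```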
3.4.3 Workaround or resolution
Rhino versions earlier than 2.3.0.7 require about twice the amount of physical memory as the size of HEAP_SIZE to ensure that a fork() of the
JVM during compilation of services will always be able to run in physical memory and not incur delays by swapping memory to and from disk.
Linux systems use copy-on-write memory cloning when forking processes so the likelihood of swapping is much lower than on Solaris.
You may need to reduce the value of HEAP_SIZE by editing the file $RHINO_HOME/node-???/config/config_variables to make sure
Rhino can fit in the available physical memory.
Also make sure there are no other running processes on the computer which may be using excessive system resources, and that there is
sufficient swap space.
3.5 Warning about UDP buffer sizes
Rhino cluster communication requires large UDP send and receive buffers.
The operating system limits for socket transmit and receive buffers must be large enough to allow the buffer size to be set.
3.5.1 Symptoms
Rhino prints warnings at startup:
WARN [savanna.ocio.socket.sc] <main> Asked for a UDP receive buffer of 261120 bytes, but only got 212992 bytes. Check system-wide limits (net.core.rmem_max, udp_max_buf)
3.5.2 Resolution
Ensure that the kernel parameters net.core.rmem_max and net.core.wmem_max are large enough.
See UDP buffer sizes in the Preparing your Network section of the Rhino Production Getting Started Guide for details.
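On Linux the current limits can be inspected directly; raising them requires root. The example value below is illustrative only — see the Getting Started Guide for recommended sizes:

```shell
# Current socket receive/send buffer ceilings, in bytes (Linux):
cat /proc/sys/net/core/rmem_max
cat /proc/sys/net/core/wmem_max
# If they are below what Rhino asks for, raise them at runtime (as root), e.g.:
#   sysctl -w net.core.rmem_max=1048576
#   sysctl -w net.core.wmem_max=1048576
# and persist the settings in /etc/sysctl.conf.
```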
3.6 Java Virtual Machine error
The Sun Java Virtual Machine is typically very stable and, theoretically, should not crash even under high load. On rare occasions, however, the
JVM can crash for any of the following reasons:
• Memory on the local computer has been corrupted.
• The swap partition on the local disk has had data corruption.
• Some other aspect of the hardware in the local computer has caused a JVM crash.
• A bug or library versioning issue in the operating system has caused the JVM to crash.
• A bug in one of the included Java Native Interface (JNI) libraries has caused the JVM to crash.
• A bug in the JVM itself has caused it to crash.
3.6.1 Symptoms
When the JVM crashes, the Rhino node will fail and the cluster will reconfigure itself. On Unix systems, the node will die with either a SIGSEGV or
a SIGBUS and perhaps dump core, depending on how your system is configured.
A message similar to the following appearing in logs and/or the Rhino console indicates a JVM error.
#
# An unexpected error has been detected by HotSpot Virtual Machine:
#
# SIGSEGV (0xb) at pc=0xf968c5f9, pid=27897, tid=64
#
# Java VM: Java HotSpot(TM) Server VM (1.5.0_05-b05 mixed mode)
# Problematic frame:
# J com.opencloud.slee.resources.signalware.OutgoingBackendConnection.writePacket(Lcom/opencloud/slee/resources/signalware/PacketType;Lcom/opencloud/slee/resources/psc/Encodeable;)V
#
# An error report file with more information is saved as hs_err_pid27897.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#
27897 Abort - core dumped
The following are examples of configuration change messages.
...
2016-04-27 12:16:24.562 INFO [rhino.main] <SavannaDelivery/domain-0-rhino-db-stripe-2> Group membership change: GrpCCM: domain-0-rhino-db-stripe-2 Reg [101,18108] - Memb [ 101, 102, 103, 104, 105, 106, 107 ] { BboneCCM: Reg [101,27068] - Memb [ 101, 102, 103, 104, 105, 106, 107, 200 ] }
2016-04-27 12:16:24.569 INFO [rhino.main] <SavannaDelivery/rhino-monitoring> Group membership change: GrpCCM: rhino-monitoring Reg [101,15192] - Memb [ 101, 102, 103, 104, 105, 106, 107 ] { BboneCCM: Reg [101,27068] - Memb [ 101, 102, 103, 104, 105, 106, 107, 200 ] }
2016-04-27 12:16:24.601 WARN [rhino.membership.rhino-cluster] <SavannaDelivery/rhino-cluster> Node(s) [108] left cluster operational member set
2016-04-27 12:16:24.601 INFO [rhino.membership.rhino-cluster] <SavannaDelivery/rhino-cluster> Current membership set: [101,102,103,104,105,106,107,200]
2016-04-27 12:16:24.619 INFO [framework.scs] <SavannaDelivery/rhino-management> [rhino-management] Component change: Component(members=[101,102,103,104,105,106,107],transitioning=[101,102,103,104,105,106,107],nonPrimary=[],id=5222)
...
2016-04-27 12:22:39.041 WARN [rhino.membership.rhino-cluster] <SavannaDelivery/rhino-cluster> Node has left primary component
2016-04-27 12:22:39.041 INFO [rhino.exithandler] <SavannaDelivery/rhino-cluster> Exiting process (Node left primary component)
An alarm appears in the console if a node has failed. Use this command to query active alarms:
./client/bin/rhino-console -listActiveAlarms
This command produces a block of output for each active alarm. A block in the list of active alarms similar to the following indicates that a node
has left the cluster due to hardware, network, or software failure.
...
Alarm 101:193861858302463 [rhino.node-failure]
  Rhino node : 101
  Level      : Major
  InstanceID : 102
  Source     : (Subsystem) ClusterStateListener
  Timestamp  : 20161103 17:44:02 (active 0m 6s)
  Message    : Node 102 has left the cluster
...
3.6.2 Diagnostic steps
If any of these symptoms occur, first check the logs of all Rhino cluster members (file console.log). If a JVM error message is present in
the logs of any node, then a Java Virtual Machine error has occurred. If an error message cannot be found in the logs, please refer to Cluster
Segmentation on page 44, Cluster Failing to Start on page 45, and Cluster Starts but Stops After a few Minutes on page 47.
After checking Rhino’s logs, determine whether or not the hardware or operating system is causing the problem. Look at the logs of the local
machine and determine whether the machine has had a history of restarts, kernel panics, process segmentation faults, and so forth. On Unix
machines, system logs can be viewed using the dmesg command or by viewing logs in /var/log.
When a JVM crash occurs, a file will be left with a name like hs_err_pid*.log . View this file and try to determine which part of the JVM caused
the crash. This may provide clues as to how to resolve the situation and will be needed if a bug in the JVM is to be reported.
3.6.3 Workaround or resolution
If the crash appears to be a one-off, the node can simply be restarted.
If the problem resides in faulty hardware, then that hardware will need to be fixed or replaced.
OpenCloud has experience with configuring the Sun Java Virtual Machine to be as stable as possible. Consider consulting OpenCloud support
for information regarding known stable configurations of JVMs. Also ensure that thorough stability tests are performed on a new version of a Java
Virtual Machine before it is deployed in a production system.
Also scan through the list of dynamic libraries in the hs_err_pid*.log file and see if they are all the latest version. Keep an eye out for libraries
which should not be there, for example from other installations of older JVMs.
If it is a realistic option, upgrading the operating system to the latest version may also resolve this problem.
JVM errors that cannot be identified as being caused by system configuration or dynamic library problems should be reported to Oracle.
3.7 Multicast traffic is using the wrong network interface
3.7.1 Symptoms
The host machines for Rhino have multiple network interfaces, and we need to control which interface is used for SLEE cluster traffic. Network
monitoring shows this is currently not on the desired interface.
When the configuration differs between hosts, nodes may fail to go primary. This is indicated by repeated messages in the Rhino log indicating
that a node is waiting for the cluster to go primary, and a lack of configuration change messages.
Messages like this show up in the Rhino console and logs:
...
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
...
3.7.2 Diagnostic steps
Use a network monitoring tool such as tcpdump or snoop to detect multicast traffic on the groups configured for Rhino clustering.
Use netstat -rn, route, or ip route to examine the current multicast routing configuration.
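For example, given routing-table output like the sample below (fabricated; on the real host run ip route show or netstat -rn), the line for the 224.0.0.0/4 range shows which interface carries multicast:

```shell
# Filter a routing table for the multicast range. Sample `ip route show`
# output is inline; in this sample multicast is routed via eth1.
grep '^224' <<'EOF'
default via 192.168.4.1 dev eth0
192.168.4.0/24 dev eth0 proto kernel scope link
224.0.0.0/4 dev eth1 scope link
EOF
```

If no line matches, no explicit multicast route is configured and the kernel will pick an interface by its default rules, which may not be the one intended for cluster traffic.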
3.7.3 Workaround or resolution
The choice of interface is controlled solely by the machines' multicast routing configuration; it is set at the OS level rather than by Rhino. Use
route or ip route to modify the OS routing configuration. See Cluster Failing to Start on page 45 for how to configure the interface used for
multicast traffic.
4 Clustering
Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for clustering issues with Rhino.
4.1 Node failure
Rhino is designed to survive hardware and software failure by using a redundant cluster of nodes. There are many possible causes for node
failure; below are symptoms and diagnostic steps for the most common.
In most deployments Rhino is configured to automatically restart failed nodes if the failure was caused by network or software faults. In all cases
not caused by hardware faults, contact your solution provider's support, providing all the rhino.log and console.log files from the failed Rhino node.
Restart the failed node if it did not restart automatically.
4.1.1 Configuration change messages appear in Rhino log output.
Symptoms
A Rhino node may determine that another node has left the primary component (due to failure of hardware, network, or software). When this
happens, the output looks like this:
...
WARN [rhino.membership] Node(s) [102] left cluster admin operational member set
INFO [rhino.membership] Current membership set: [101]
...
Diagnosis and Resolution
Please see Cluster Segmentation on page 44 .
4.1.2 An alarm indicates a node has failed.
Symptoms
An alarm appears in the console if a node has failed. To query active alarms run this command:
./client/bin/rhino-console -listActiveAlarms
This produces one line of log output for each active alarm. If the list of active alarms contains a line similar to the following, it indicates that a node
has left the cluster due to hardware, network, or software failure.
...
Alarm ... (Node 101, 23-Nov-05 15:10:36.636): Major [rhino.cluster] Node 102 has left the cluster ...
...
Diagnosis and Resolution
Please see Cluster Segmentation on page 44 .
4.1.3 A Rhino node exits the JVM.
Symptoms
A Rhino node may determine it is not in the cluster majority after a cluster segmentation. If this is the case, it terminates.
An example of the log output from the terminated node is as follows.
INFO [rhino.main] Cluster backbone membership change: BboneCCM: Trans [103,30] - Memb [ 103 ]
INFO [rhino.main] Cluster backbone membership change: BboneCCM: Reg [103,32] - Memb [ 103 ]
...
WARN [rhino.membership] Cluster admin group has left primary component - exiting process
WARN [rhino.starter.loader] Node is no longer ready - exiting process
For more information please see Cluster Segmentation on page 44 .
4.1.4 Out of memory errors
Symptom
Out of memory errors in the console log (console.log).
Diagnosis and Resolution
Please refer to Java Virtual Machine Heap Issues on page 30 and Application or Resource Adaptor Heap Issues on page 32 .
4.1.5 JVM errors.
A message in the Rhino console log like the following indicates a JVM error.
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f9c742ffe25, pid=16224, tid=0x00007f9c720c6700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build 1.8.0_92-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x5c2e25]  G1ParScanThreadState::copy_to_survivor_space(InCSetState, oopDesc*, markOopDesc*)+0x45
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
...
If such a message is present, please see Java Virtual Machine Error on page 36 .
4.1.6 Watchdog timeouts.
Symptoms
A message like the following in the Rhino logs indicates a watchdog timeout.
2005-09-06 10:18:13.484 ERROR [watchdog] Waiting for thread cleanup.....
2005-09-06 10:18:14.494 ERROR [watchdog] ***** KILLING NODE *****
Diagnosis and Resolution
If such a message is present please see Rhino Watchdog on page 14 .
4.2 Cluster segmentation
Cluster segmentation occurs when part of a Rhino cluster determines that a cluster node or a set of nodes is no longer reachable. Nodes that
determine that they no longer make up a majority of the cluster (known as a quorum) automatically deactivate themselves.
Cluster segmentation can have several causes, the most common being a network failure such as a malfunctioning switch.
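The majority rule can be illustrated with simple arithmetic. This sketch assumes the standard strict-majority definition of quorum, floor(n/2)+1:

```shell
# For a few cluster sizes, print the minimum number of nodes that must
# remain connected for that component to stay primary.
for nodes in 2 3 5; do
  echo "cluster of $nodes: quorum is $((nodes / 2 + 1))"
done
```

Note that under this rule a two-node cluster cannot survive a segmentation: each half of a 1/1 split is below its quorum of 2, so both sides deactivate.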
4.2.1 Symptoms
• Configuration change messages appear in Rhino log output
• A Rhino node exits the JVM
• An alarm indicating a node has failed
• Watchdog timeouts
4.2.2 Diagnostic steps
First determine whether or not the segmentation is environmental; that is, check that the operating environment (see Operating Environment Issues
on page 28) and the JVM heap (see Java Virtual Machine Heap Issues on page 30) are suitably configured. If they are, the issue is likely to be
something in the network which interconnects the Rhino nodes. Typical network components to check for failure are machines, network interface
cards, routers, switches, and cables.
See Node failure on page 41 for other causes of node failure.
4.2.3 Workaround or resolution
If the cause of the symptoms is environmental, refer to the workaround or resolution steps for those issues. If the cause is a network issue,
repair it by replacing the failed components.
4.3 Cluster failing to start
Rhino nodes must be part of the “primary component” in order to perform any work. The primary component is the set of nodes sharing the same
cluster state. When booting up, nodes will wait to join the primary component before being able to perform work. This section describes the likely
issues which cause a node to wait on the primary component for an extended period of time (greater than several seconds).
If the following diagnostic steps do not resolve the problem, please ensure that the various cluster machines can reach each other. For example,
the ping command should indicate whether or not nodes can reach each other. If nodes cannot ping each other, then the cause is likely to be
network misconfiguration or hardware failure.
4.3.1 No Primary Component
Symptoms
Repeated messages in the Rhino log indicating that a node is waiting for the cluster to go primary.
Messages like this show up in the Rhino console and logs:
...
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
...
Resolution
If this is occurring on all nodes, run the make-primary.sh script on one node.
See the Rhino Production Getting Started Guide for more information about primary components.
4.3.2 Multicast Configuration Error
Symptoms
No configuration change messages in the Rhino logs.
A lack of configuration change messages in a clustered setup indicates that the Rhino node is not able to communicate with other Rhino nodes in
the cluster.
Diagnostic and corrective steps
This is caused by an inability of the nodes to communicate. Diagnosis and correction depend on whether the cluster communication mode is
multicast or scattercast.
Ensure that a route for the multicast IP addresses being used by Rhino is present in the operating system.
Use the route and netstat system tools for more information.
Ensure that any routers are configured to allow multicast IP packets with the configured multicast addresses to be passed to the appropriate
machines.
Ensure that the following variables in the config/config_variables files on all Rhino nodes which are part of a cluster match:
SAVANNA_CLUSTER_ID
SAVANNA_CLUSTER_ADDR
SAVANNA_MCAST_START
SAVANNA_MCAST_END
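A quick way to check this is to diff the files between nodes. The sketch below fabricates two sample files under /tmp purely for illustration; on a real system, compare the actual $RHINO_HOME/node-*/config/config_variables files across hosts.

```shell
# Fabricate two node config fragments, then diff them; any diff output means
# the nodes disagree and will not form a cluster together.
mkdir -p /tmp/node-101/config /tmp/node-102/config
printf 'SAVANNA_CLUSTER_ID=100\nSAVANNA_CLUSTER_ADDR=224.0.24.1\n' \
  > /tmp/node-101/config/config_variables
printf 'SAVANNA_CLUSTER_ID=100\nSAVANNA_CLUSTER_ADDR=224.0.24.9\n' \
  > /tmp/node-102/config/config_variables
diff /tmp/node-101/config/config_variables \
     /tmp/node-102/config/config_variables \
  || echo "MISMATCH: Savanna settings differ between nodes"
```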
Check that the host firewall is not configured to block the address/port range Savanna uses on the interface used for multicast. A simple test is to
remove all firewall rules with iptables -F. Rhino uses UDP multicast to a range of IP addresses and ports set in node-<NODE_ID>/config/
config_variables and node-<NODE_ID>/config/savanna/cluster.properties. The IGMP membership groups 224.0.0.1 and
224.0.0.2 must also be allowed on this interface.
Verify whether or not multicast IP messages are being passed from one machine to another correctly. To see the IP traffic between cluster
members, use tcpdump and/or snoop .
For example, to record packet traces for the network interface eth1 limiting to a maximum of two 500MB files:
tcpdump -C 500 -i eth1 -W 2 -w mcast_dump.pcap
The collected packet capture can be copied to the administrator’s workstation and opened with Wireshark .
Ensure that all scattercast endpoint addresses are routable from every host in the cluster. Check that the scattercast endpoints set is up to
date on nodes waiting to go primary. If a node is sufficiently out of date that it cannot receive from any other node, it cannot detect that it is out of
date and shut down.
Check that the host firewall is not configured to block the address/port pairs and ranges used by Savanna for scattercast. A simple test is to
remove all firewall rules with iptables -F . Rhino uses UDP unicast to a range of IP addresses and ports set in:
node-<NODE_ID>/config/config_variables
node-<NODE_ID>/config/savanna/cluster.properties
node-<NODE_ID>/config/savanna/scattercast.endpoints
Verify whether or not Savanna UDP IP messages are being passed from one machine to another correctly. To see the IP traffic between cluster
members, use tcpdump and/or snoop .
For example, to record packet traces for the network interface eth1 limiting to a maximum of two 500MB files:
tcpdump -C 500 -i eth1 -W 2 -w mcast_dump.pcap
The collected packet capture can be copied to the administrator’s workstation and opened with Wireshark .
4.4 Cluster starts but stops after a few minutes
This affects multicast only.
4.4.1 Symptoms
• Configuration change messages appearing in Rhino log output
• Many or all Rhino nodes exit the JVM
For example:
2016-04-27 12:29:24.352 WARN [rhino.membership.rhino-cluster] <SavannaDelivery/rhino-cluster> Node has left primary component
• Multiple alarms indicating a node has failed
• Watchdog failures for the condition "GroupHeartbeat"
For example:
2016-04-27 12:29:24.207 ERROR [watchdog] <Global watchdog thread> Failed watchdog condition: GroupHeartbeat for group rhino-monitoring (sent=33 received=23)
4.4.2 Diagnostic steps
Check the current set of registered multicast groups with netstat -gn . These should include all the groups used by Savanna and the global
IGMP membership reporting address 224.0.0.1 . Compare the reported set between nodes and with the set present immediately after a node
has started.
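That comparison can be sketched as follows. The netstat -gn sample output and the group addresses are illustrative; substitute the groups configured for your cluster and the real command output.

```shell
# Check each expected multicast group against (sample) `netstat -gn` output.
sample='eth1    1    224.0.0.1
eth1    1    224.0.24.1'
for g in 224.0.0.1 224.0.24.1 224.0.24.2; do
  if echo "$sample" | grep -q "${g}\$"; then
    echo "$g: joined"
  else
    echo "$g: MISSING"
  fi
done
```

A group reported as MISSING on one node but joined on another points at that node's kernel or firewall configuration.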
Check for "martian" packets in syslog. Sometimes a peer will be a device on the multicast network that is in a different IP range from the hosts.
To enable logging of these on Linux run:
echo 1 > /proc/sys/net/ipv4/conf/all/log_martians
When multicast networking is in use, a network device, commonly the switch or router, acts as an IGMP querier, sending regular messages to
test for active multicast devices. The OS kernel receives and responds to these queries with membership reports. If no membership reports are
received before the timeout configured on the querier, it will stop forwarding multicast traffic.
To report IGMP traffic (Multicast group membership queries and reports) for all network interfaces on a host:
tcpdump -i any igmp
Use tcpdump or snoop to capture IGMP traffic on the interface used for multicast. Look for Query messages and Membership Reports.
4.4.3 Workaround or resolution
The IGMP membership addresses 224.0.0.1 and 224.0.0.2 must be routed on the interface used for Savanna clustering as well as the
configured Savanna group multicast addresses.
Configure the local switch to act as an IGMP querier. This is especially important in virtualised environments.
Sometimes you need to force the kernel to fall back to IGMPv2, usually because the switch does not support IGMPv3. To configure
this, run:
/sbin/sysctl -w net.ipv4.conf.eth0.force_igmp_version=2
Make this change permanent by adding net.ipv4.conf.eth0.force_igmp_version=2 to /etc/sysctl.conf.
While uncommon for hosts in the cluster, switches are frequently in different IP ranges. By default the Linux kernel drops these messages from
"impossible" addresses as a DOS-prevention measure. Some switches will also send IGMP queries on the All-Hosts group instead of the specific
group Rhino joined. If your querier is on a different subnet from the nodes, disable the martian packet safety check by setting the rp_filter
sysctl to 0, or to 2 if the source address is on a network visible from another interface:
/sbin/sysctl -w net.ipv4.conf.default.rp_filter=0
/sbin/sysctl -w net.ipv4.conf.all.rp_filter=0
Make this change permanent by adding these to /etc/sysctl.conf :
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.all.rp_filter=0
For more information on rp_filter and other kernel multicast configuration, see https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt .
You will also need to configure the firewall to accept IGMP packets sent to 224.0.0.1 and 224.0.0.2 if your rules do not include these
implicitly.
4.5 Rhino SLEE fails to start cluster groups
Typically, the failure to start a cluster group indicates a network configuration problem.
Unless configured to use scattercast, Rhino distributes state between cluster members using UDP multicast. The particular multicast addresses
used are defined during installation. Multicast addresses are, by definition, in the range 224.0.0.0 through to 239.255.255.255 , otherwise
notated as 224.0.0.0/4 .
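A membership check for that range only needs the first octet (224 through 239). The helper below is an illustrative sketch, not part of Rhino's tooling:

```shell
# Return success if the first octet of an IPv4 dotted-quad is 224-239,
# i.e. the address is in 224.0.0.0/4.
is_multicast() {
  first=${1%%.*}
  [ "$first" -ge 224 ] && [ "$first" -le 239 ]
}
is_multicast 224.0.24.1  && echo "224.0.24.1: multicast"
is_multicast 192.168.4.1 || echo "192.168.4.1: not multicast"
```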
A failure will occur if Rhino tries to create a multicast socket on a non-multicast address. The Rhino install script checks to ensure that a valid
multicast address is used, so this would only occur if the Rhino configuration has been edited manually after installation.
It is possible to change the addresses used by the cluster manually, by editing the configuration on every node that has been created. Care should
be taken to ensure all nodes in the cluster are configured to use the same multicast addresses.
To change these addresses, edit the file $RHINO_HOME/node-101/config/config_variables .
4.5.1 Symptoms
Output is usually something like this:
ERROR [savanna.stack.membership.manager] Failed to start cluster groups
java.lang.RuntimeException: Failed to change addresses to /224.0.24.0:45601
4.5.2 Diagnostic steps
If the cluster is configured to use scattercast follow the diagnostic procedures for Cluster Failing to Start on page 45 and Scattercast endpoints
out of sync on page 52 . For multicast clusters follow the steps below:
The install script checks to ensure that a valid multicast range is used but this may be changed manually afterward. If the address reported is in
the multicast range described above, check that your OS kernel supports multicast:
cat /boot/config-<kernel version> | grep CONFIG_IP_MULTICAST
4.5.3 Workaround or resolution
If the configured addresses are not all within the multicast range, change the configuration to use only multicast addresses.
If the configured UDP multicast range is not used by other applications, it should work fine for Rhino. Failing that, any address range
in 224.0.0.0/4 can be used.
To change the addresses without reinstalling, edit the file $RHINO_BASE/node-101/config/config_variables and change the following
lines to use multicast addresses not taken by other programs:
SAVANNA_CLUSTER_ADDR=224.0.24.1
SAVANNA_MCAST_START=224.0.24.1
SAVANNA_MCAST_END=224.0.24.8
The master copy of this file (used when creating additional nodes) is $RHINO_BASE/etc/defaults/config/config_variables . If several
nodes are to be deployed, this can be changed before nodes are created, to save extra editing at a later stage.
4.6 Group heartbeat timeout
Rhino uses a heartbeat mechanism to ensure Savanna groups are alive and functioning correctly. This can cause a node to shut down when it
detects abnormal group behaviour.
4.6.1 Symptoms
Output is something like this:
ERROR [watchdog] <Global watchdog thread> Watchdog Failure has been notified to all listeners because condition 'GroupHeartbeat for group domain-0-rhino-db-stripe-0 (sent=20 received=10)' failed
ERROR [watchdog] <Global watchdog thread> *** WATCHDOG TIMEOUT ***
ERROR [watchdog] <Global watchdog thread> Failed watchdog condition: GroupHeartbeat for group domain-0-rhino-db-stripe-0 (sent=20 received=10)
ERROR [watchdog] <Global watchdog thread> ***** KILLING NODE *****
Diagnostic and corrective steps
This is almost always caused by a networking issue that does not affect the backbone Savanna group used for cluster membership.
Diagnosis and correction are therefore identical to Cluster Failing to Start on page 45.
Also see Rhino Watchdog on page 14 .
4.7 Scattercast endpoints out of sync
This issue can only occur on a cluster using scattercast communications mode. Scattercast does not permit autodiscovery of nodes, so
each node must know the addresses of all nodes in the cluster in advance. Whenever the persistent set of known endpoints gets out of sync on a
node, that node will not be able to successfully join the cluster.
4.7.1 Symptoms
This problem can manifest in several ways, depending on how the scattercast endpoints are out of date. One example is shown below. In almost
all cases, out-of-sync scattercast endpoints will result in the out-of-date nodes shutting down.
ERROR [savanna.stack.membership.manager] <SavannaIO/pool-thread-1> Halting due to scattercast endpoint mismatch detection:
Network advertised version: 2 does not match local version: 1
ERROR [savanna.stack.membership.manager] <SavannaIO/pool-thread-1> An up to date scattercast.endpoints file can be found on any running node
INFO [rhino.exithandler] <SavannaIO/pool-thread-1> Exiting process (Exiting due to fatal Savanna misconfiguration or conflict)
It is possible to create a situation where a node boots with an out-of-date persistent endpoints set and fails to shut down. If booted after a
scattercast management command has removed it from the endpoints set, a node will neither go primary nor shut down.
This will trigger frequent cluster membership changes in the remaining nodes. The cluster membership change messages will not change the set
of members.
INFO [rhino.main] <SavannaIO/pool-thread-1> Cluster backbone membership change: BboneCCM: Trans [101,10] - Memb [ 101 ]
INFO [rhino.main] <SavannaIO/pool-thread-1> Cluster backbone membership change: BboneCCM: Reg [101,12] - Memb [ 101 ]
INFO [rhino.main] <SavannaIO/pool-thread-1> Cluster backbone membership change: BboneCCM: Trans [101,14] - Memb [ 101 ]
INFO [rhino.main] <SavannaIO/pool-thread-1> Cluster backbone membership change: BboneCCM: Reg [101,16] - Memb [ 101 ]
INFO [rhino.main] <SavannaIO/pool-thread-1> Cluster backbone membership change: BboneCCM: Trans [101,18] - Memb [ 101 ]
INFO [rhino.main] <SavannaIO/pool-thread-1> Cluster backbone membership change: BboneCCM: Reg [101,20] - Memb [ 101 ]
Diagnostic and corrective steps
If some nodes are operational, run the Rhino console command getscattercastendpoints to learn the current cluster configuration. This
prints the live in-memory configuration; if the scattercast.endpoints file on any running node contains a different configuration, it also prints
the memory state for each node and the differences from the disk state.
If all nodes have the same configuration, the output will be similar to the example below:
[Rhino@host (#0)] getscattercastendpoints
[Consensus] Disk Mapping : Coherent
[Consensus] Memory Mapping :
    [101] Address : 192.168.4.103:18000
    [102] Address : 192.168.4.103:18001
    [103] Address : 192.168.4.103:18002
    [104] Address : 192.168.4.103:18003
    [105] Address : 192.168.4.103:18004
    [106] Address : 192.168.4.105:18005
    [107] Address : 192.168.4.105:18006
    [108] Address : 192.168.4.105:18007
    [109] Address : 192.168.4.105:18008
    [110] Address : 192.168.4.105:18009
    [111] Address : 192.168.4.105:18010
If the node configuration is consistent but does not contain the expected set of nodes, then the missing nodes should be added by running the
addscattercastendpoints command.
If the disk mapping differs from the in-memory configuration, then the nodes with the differing configuration are likely to fail when restarted. An
example output where node 101 has a malformed file that will cause a node to fail to reboot is below:
[Rhino@host (#0)] getscattercastendpoints
[101] Disk Mapping : 192.168.4.103:18001 present twice in endpoints file
[101] Memory Mapping :
    [101] Address : 192.168.4.103:18000
    [102] Address : 192.168.4.103:18001
    [103] Address : 192.168.4.103:18002
    [104] Address : 192.168.4.103:18003
    [105] Address : 192.168.4.103:18004
    [106] Address : 192.168.4.105:18005
    [107] Address : 192.168.4.105:18006
    [108] Address : 192.168.4.105:18007
    [109] Address : 192.168.4.105:18008
    [110] Address : 192.168.4.105:18009
    [111] Address : 192.168.4.105:18010
[102] Disk Mapping : Coherent
[102] Memory Mapping :
    [101] Address : 192.168.4.103:18000
    [102] Address : 192.168.4.103:18001
    [103] Address : 192.168.4.103:18002
    [104] Address : 192.168.4.103:18003
    [105] Address : 192.168.4.103:18004
    [106] Address : 192.168.4.105:18005
    [107] Address : 192.168.4.105:18006
    [108] Address : 192.168.4.105:18007
    [109] Address : 192.168.4.105:18008
    [110] Address : 192.168.4.105:18009
    [111] Address : 192.168.4.105:18010
...
If some nodes have a different configuration on disk from the live memory configuration, then copy the scattercast.endpoints file from a
node that has matching disk and memory configurations.
See also the Repair section of Scattercast Management in the Rhino Administration and Deployment Guide.
5 Performance
Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for Rhino performance issues.
5.1 High Latency
High latency can be observed in services running on the Rhino SLEE if the system is not configured appropriately for the workload or if a resource
leak is present. Services should always be performance tested before deployment; however, the tested load profile may not match the production
load, so live diagnosis is sometimes required. Always attempt to reproduce performance problems in a test environment before trying solutions on
a live system, unless the fix is both simple and obvious, e.g. a faster disk or more nodes.
5.1.1 Symptoms
• Long Garbage Collection pauses in Garbage Collection/console logs
• Longer than expected processing and/or response times
• Failed or timed out attempts in your application
5.1.2 Diagnostic steps
1. Firstly, determine whether or not the operating environment is the cause of the problem. For more information on the operating
environment refer to Operating Environment Issues on page 28.
2. Secondly, determine whether or not heap use is the cause of the problem. This should be considered the most likely cause when
garbage collection pauses are long. For more information on heap use refer to Java Virtual Machine Heap Issues on page 30 and
Application or Resource Adaptor Heap Issues on page 32.
The following steps are to be followed if the previous two stages of diagnosis were unsuccessful. All three steps use the output from
client/bin/rhino-stats monitoring the Staging Threads parameter set.
The output for the Staging Threads parameter set should typically be monitored over at least a 30-minute period, to ensure that the system
characteristics are observed through different garbage collection cycles.
An example command which connects to a Rhino cluster member on the local host and monitors the Staging Threads parameter set is as follows.
./client/bin/rhino-stats -h localhost -m Staging\ Threads
Low number of available threads in Rhino statistics output
The number of available threads is in the avail column of the rhino-stats output. Typically, in a system which has too few threads configured, the
number of available threads is close to the maximum number of threads, but may drop to a near-zero value for a brief period of time (half a
second to several seconds).
High staging queue size in Rhino statistics output
The staging queue is used by Rhino to store incoming events when there are no free threads available to process them. In a well-sized
system the staging queue is almost always empty. This can be observed by looking at the queueSize column, which should show a low number,
less than the number of available threads. If it does not, try doubling the number of threads available to Rhino.
Dropped staging items
Dropped staged items means that queued events have been discarded because the queue was too full to hold them. If this value is ever non-zero
it indicates that events are not being processed fast enough. This can indicate that the system is configured inappropriately with respect to the
number of threads, or that a Service or Resource Adaptor in Rhino has an issue. Overload conditions can also cause dropped staging items.
High event processing time
Event processing time should be monitored for each event type that makes up the application flows, particularly those flows showing poor
performance, e.g.
./client/bin/rhino-stats -h localhost -m "Events.ocsipra.[javax.sip.message.Request.INVITE net.java.slee, 1.2]"
The most important statistic in event processing is the SBB Processing Time (SBBT). This tells how long the thread spent in the service
logic, and is an indicator of the theoretical upper limit on how many events/s can be processed per thread. High SBB Processing Time statistics
indicate either too many staging threads, high CPU load, or, more frequently, a service that has slow dependencies, e.g. a database query.
The Event Router Time (ERT) statistic shows the Rhino overhead of processing the event, including transaction commit time. This is dependent
on Rhino overheads, SBB initial-event-selectors, and the SBB's data storage activity. High Event Router Time statistics usually indicate either
too many staging threads, high CPU load, or a service that stores very large amounts of data (more than 1MB per transaction) in the replicated
MemDB. It may also be a symptom of a Rhino bug.
Event Processing Time (EPT) is equal to the sum of these, and indicates the actual limit on the per-thread event processing rate with the currently
deployed set of services.
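A worked example with illustrative timings: if SBBT is 2ms and ERT is 1ms, then EPT is 3ms, giving a per-thread ceiling of roughly 333 events/s.

```shell
# EPT = SBBT + ERT; 1000/EPT approximates the per-thread events/s ceiling.
sbbt_ms=2
ert_ms=1
ept_ms=$((sbbt_ms + ert_ms))
echo "EPT: ${ept_ms}ms, per-thread ceiling: $((1000 / ept_ms)) events/s"
# prints: EPT: 3ms, per-thread ceiling: 333 events/s
```

Multiplying that ceiling by the staging thread count gives a rough upper bound on cluster-node throughput, which is why reducing SBBT (e.g. by speeding up slow dependencies) matters more than adding threads.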
5.1.3 Workaround or Resolution
If the diagnostic steps show that there are not enough threads available, try doubling the number of staging threads. This can be achieved using
the rhino-console command at runtime, or by editing the thread-count attribute of the staging queues element in config/config.xml and then
restarting that node.
Note that it is possible to set the number of staging threads too high, and in doing so cause other performance problems. We do not recommend
setting the number of staging threads higher than 200 on most systems, or approximately 10x the number of CPU cores. Setting it too high will
cause scheduler thrashing, which will increase latency and reduce throughput.
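The sizing rule above can be sketched as follows (the 10x and 200 figures come from the guidance in this section; nproc availability is assumed):

```shell
# Start from ~10x the CPU core count and cap at 200 staging threads.
cores=$(nproc)
threads=$((cores * 10))
if [ "$threads" -gt 200 ]; then
  threads=200
fi
echo "suggested staging thread-count: $threads"
```

Treat the result as a starting point only; the doubling-and-measuring approach described above should still drive the final value.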
If the problem persists, then this indicates a potential issue with the Services and/or Resource Adaptors installed in Rhino.
In the case of performance problems with services or suspected bugs in Rhino, please contact your solution provider for support. If possible, the
problem should be reproduced in a test environment so additional logging for diagnosis does not impact live traffic.
5.2 Dropped Calls
Services running on Rhino have real-time commitments, and latency for those services needs to be at an absolute minimum. "Dropped calls"
occur when an activity cannot be completed in an appropriate timeframe. This has many causes; the most common are described below, with
diagnostic steps for each.
5.2.1 Symptoms
• Longer than expected processing and/or call setup times
• Failed or timed out call setup attempts
• Timed out or dropped stage items
• Rhino logs and/or console output containing exceptions
• Resource Adaptors throwing Exceptions
• Lock timeout messages in Rhino logs and/or console
• Rate limiting
• A dependent external system is not functioning properly
5.2.2 Diagnostic steps
1. Check the Rhino log for exceptions. Exceptions in the logs often indicate a cause for dropped calls. If exceptions are found in the
logs, follow the steps in Rhino logs and/or console output containing exceptions on page 59.
2. Follow the diagnostic steps in Operating Environment Issues on page 28, Java Virtual Machine Heap Issues on page 30, and
Application or Resource Adaptor Heap Issues on page 32 to determine whether or not they are the cause.
3. Follow the diagnostic steps defined in section High Latency on page 56 .
5.2.3 Rhino logs containing exceptions
If the exception is of the type AccessControlException , please refer to section Security Related Exceptions on page 63 to solve the issue.
If the exceptions contain the text "Event router transaction failed to commit" it is likely that the in-memory database is not sized correctly for the
load. Follow the diagnostic and resolution steps in Memory Database Full on page 71 .
For other exceptions please contact your solution provider for support.
Resource Adaptors throwing Exceptions
If the log key for the exception log matches the tracer for a resource adaptor entity, e.g. trace.sipra, the cause of the problem is likely to be a
faulty resource adaptor. This is indicative of either a misconfiguration of the resource adaptor in question or a bug in its implementation. If this
is occurring frequently, contact your solution provider for support, attaching the log files that contain the exception message.
5.2.4 Lock timeout messages in Rhino logs and/or console
If lock timeout messages are occurring please refer to section Lock Timeouts on page 18 . Two examples of lock timeout messages follow.
...
... Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT ...
...
...
... ========= Lock timeout ========= ...
... Unable to acquire lock on [lockId] within 5000 ms ...
...
5.2.5 Rate limiting
A rate limiter may be rejecting input if the incoming rate exceeds its specified rate.
The rate limiter raises an alarm and outputs to the Rhino logging system if it is actively rejecting input. An example of this is shown as follows.
2011-08-02 22:02:59.931 Major [rhino.facility.alarm.manager] <Timer-1> Alarm 103:212775261552:14 [SubsystemNotification[subsystem=ThresholdAlarms],LIMITING,queue-saturation-limiter-rejecting-work] was raised at 2011-08-02 22:02:59.931 to level Major QueueSaturation limiter is rejecting work
2014-11-18 23:26:41.134 Major [rhino.facility.alarm.manager] <Timer-1> Alarm 101:316812519480:1 [SubsystemNotification[subsystem=ThresholdAlarms],LIMITING,system-input-limiter-rejecting-work] was raised at 2014-11-18 23:26:41.134 to level Major
A user rate limiter is designed to reject input above a certain threshold to avoid overloading Rhino. If this threshold is appropriately configured,
these messages should occur very rarely. If they occur frequently, the solution may not be sized appropriately for the incoming load;
in this case please contact your solution provider for support.
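To confirm whether a limiter is (or has been) rejecting work, the log can be searched for the alarm text shown above. A sketch with a stand-in log line; the path is an assumption:

```shell
# Sketch only: adjust LOG to your node's console log location.
LOG="${LOG:-/tmp/rhino-console.log}"

# Stand-in alarm entry of the form shown above:
cat > "$LOG" <<'EOF'
2011-08-02 22:02:59.931 Major [rhino.facility.alarm.manager] ... queue-saturation-limiter-rejecting-work ...
EOF

# Count limiter-rejection alarms; frequent occurrences suggest a sizing problem.
grep -c 'limiter-rejecting-work' "$LOG"
```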
5.2.6 A dependent external system is not functioning properly
This is difficult to diagnose in a general manner as it is specific to the solution running on Rhino. However, the system administrator should be
aware of any external systems with which the solution communicates. Examples include rating/charging engines, HLRs, HSSs, MSCs, IVRs,
SMSCs, MMSCs, databases, etc.
Please refer to the administrators of dependent systems and, if necessary, contact your solution provider for support.
5.3 A newly started node is unable to handle full traffic load
5.3.1 Symptoms
• A new cluster or node has been brought into service and latencies are much greater than target
• Traffic to a new node fails with timeouts
5.3.2 Workaround or Resolution
Java takes some time to settle during application startup due to Just-in-Time (JIT) compilation of code hotspots. For this reason, we recommend
slowly introducing traffic to a newly booted node. Most of the hotspots will be optimised in the first minute; however, it is better to ramp up load over
the course of up to 15 minutes to allow JIT compilation and optimisation to occur.
Cluster size should be set during provisioning to allow sufficient reserve capacity to handle peak-hour traffic with at least one failed node. This
allows a lot of flexibility in bringing a failed node back into the cluster.
The recommended approach to bringing a node back into the cluster is to use a load balancing mechanism to limit traffic to the rebooted node
(Signalware has this mechanism built in). Usually a smooth increase from zero to a fair share over the course of 5 to 15 minutes (depending on
total load) allows the JIT compiler to work without causing call failures or unacceptable latencies.
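The smooth increase described above amounts to a linear schedule. The numbers below are illustrative assumptions; the actual limiting must be applied by whatever load-balancing mechanism fronts the node:

```shell
# Hypothetical fair share for the node, in calls per second, and ramp duration.
TARGET_RATE=300
RAMP_MINUTES=15

# Print a per-minute limit rising linearly from near zero to the fair share.
for m in $(seq 1 "$RAMP_MINUTES"); do
  echo "minute $m: limit node to $(( TARGET_RATE * m / RAMP_MINUTES )) calls/sec"
done
```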
Another option, if sufficient reserve capacity exists, is to simply wait until cluster load drops to a level that a new node can cope with before
rebooting the failed node. This works, as the reserve capacity ensures that the node failure is not visible to callers.
For Java 7 update 45 and newer, enabling tiered compilation may improve performance at node boot time. This also requires a much larger
code cache due to the more aggressive compilation. Tiered compilation does not avoid the initial very slow interpreted calls; it merely begins
compilation earlier, which improves performance faster. This option is on by default in Java 8, where the default code cache size has
been increased to handle the extra data.
OPTIONS="$OPTIONS -XX:+TieredCompilation -XX:InitialCodeCacheSize=512m"
Note that to avoid a performance decrease, the code cache size must be set explicitly when using Java 7, as the default size of 48MB is not large
enough. Java 8 defaults to 240MB and fixes some performance bugs in flushing old compilation data; however, you may still see benefit from
increasing the code cache.
5.4 Uneven CPU load/memory usage across cluster nodes
5.4.1 Symptoms
CPU load or memory usage is inconsistent across the cluster.
5.4.2 Diagnostic steps and correction
Examine the service usage statistics to rule out differences in applied load.
A small difference in CPU load is expected for nodes servicing management connections. These nodes will also consume slightly more heap and
Java Permanent Generation memory. If the amount of PermGen memory used on a node is greater than 90% of the maximum, a CMS GC cycle
will be triggered to try to reclaim PermGen. When PermGen remains above 90%, CMS cycles will run almost continuously and cause a significant
increase in CPU usage. Check the heap statistics for the node or the GC log entries in the console log to determine if this is the cause of the
imbalance. Log entries will contain a section similar to the following, with a high proportion of the Perm space used and only a small change after
GC:
[CMS Perm : 186608K->186604K(196608K)]
If the problem is memory related, increase the allocated heap or PermGen so that there is sufficient free space for normal operation. For more
information on heap configuration see Java Virtual Machine Heap Issues on page 30.
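The occupancy implied by a GC log entry like the one above can be computed directly from its three figures (used before GC, used after GC, capacity). A sketch using the example line from this section:

```shell
# Example GC log fragment from this section.
line='[CMS Perm : 186608K->186604K(196608K)]'

# Extract the three numbers: used-before, used-after, capacity (all in KB).
set -- $(printf '%s\n' "$line" | grep -oE '[0-9]+')
after=$2
capacity=$3

# Occupancy after collection; values above 90 match the continuous-CMS symptom.
echo "Perm used after GC: $(( after * 100 / capacity ))%"
```

With the figures above this reports 94%, comfortably over the 90% threshold, and only 4K was reclaimed by the collection.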
If the problem is not memory related, investigate the balance of load across threads. A small number of threads consuming a large percentage
of CPU time may indicate a software bug or a load profile that is unusually serial in execution. On Linux the command top -ch will display the
busiest threads and the processes they belong to. Run the command $RHINO_HOME/dumpthreads.sh to trigger a Java thread dump. The thread
IDs in the dump can be matched to the PIDs of the threads displayed in top by converting the PID to hexadecimal. Frequently the stack traces for
busy threads will be similar; this can suggest the part of the code that contains the bug and may help identify the trigger conditions. Contact your
solution provider for assistance, attaching the thread dump and list of busy threads.
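The decimal-to-hexadecimal conversion mentioned above can be done with printf. A minimal sketch; 6429 stands in for the PID reported by top:

```shell
# Thread PID as reported by `top -ch` (illustrative value).
TID=6429

# Java thread dumps record the native thread ID as a hexadecimal nid= field.
printf 'look for nid=0x%x in the thread dump\n' "$TID"
```

Here 6429 corresponds to nid=0x191d in the thread dump.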
6 Configuration Problems
Below are troubleshooting steps for Rhino configuration: symptoms, diagnostic steps, and workarounds or resolutions.
6.1 Security Related Exceptions
Rhino provides a secured environment for its management infrastructure and for deployed components such as SBBs, Resource
Adaptors, Profiles and Mlets. If some component of this management infrastructure attempts to perform an unauthorized operation, then
`AccessControlException`s will be reported in the logging output of Rhino. The two most common forms are described in this section.
• Various connection related exceptions on page 63
• Various permission related exceptions on page 63
6.1.1 Various connection related exceptions
Symptoms
The default installation of Rhino is strict about which remote machines may connect to perform management operations. The following
exceptions are common when a machine that has not been explicitly authorized attempts to connect to Rhino. Examples of the two most common
messages are as follows.
A refused JMX-Remote connection produces the following message:
Exception in thread "RMI TCP Connection(6)-192.168.0.38" java.security.AccessControlException: access denied (java.net.SocketPermission 192.168.0.38:48070 accept,resolve)
    at java.security.AccessControlContext.checkPermission(AccessControlContext.java:264)
    at java.security.AccessController.checkPermission(AccessController.java:427)
    at java.lang.SecurityManager.checkPermission(SecurityManager.java:532)
    at java.lang.SecurityManager.checkAccept(SecurityManager.java:1157)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.checkAcceptPermission(TCPTransport.java:560)
    at sun.rmi.transport.tcp.TCPTransport.checkAcceptPermission(TCPTransport.java:208)
    at sun.rmi.transport.Transport$1.run(Transport.java:152)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:149)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:460)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:701)
    at java.lang.Thread.run(Thread.java:595)
This can be caused by client/bin/rhino-console, client/bin/rhino-stats, and external clients
such as the Rhino Element Manager or any other application which uses Rhino’s JMX-Remote connection support.
The following message is caused by the rhino-console connecting from a host which is not in the list of accepted IP addresses:
2016-12-01 12:18:38.444 ERROR [rhino.main] <RMI TCP Connection(idle)> Uncaught exception detected in thread Thread[RMI TCP Connection(idle),5,RMI Runtime]: java.security.AccessControlException: access denied ("java.net.SocketPermission" "192.168.2.33:34611" "accept,resolve")
java.security.AccessControlException: access denied ("java.net.SocketPermission" "192.168.2.33:34611" "accept,resolve")
    at java.security.AccessControlContext.checkPermission(AccessControlContext.java:372)
    at java.security.AccessController.checkPermission(AccessController.java:559)
    at java.lang.SecurityManager.checkPermission(SecurityManager.java:549)
    at java.lang.SecurityManager.checkAccept(SecurityManager.java:1170)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.checkAcceptPermission(TCPTransport.java:668)
    at sun.rmi.transport.tcp.TCPTransport.checkAcceptPermission(TCPTransport.java:305)
    at sun.rmi.transport.Transport$2.run(Transport.java:201)
    at sun.rmi.transport.Transport$2.run(Transport.java:199)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:198)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:567)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:828)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.access$400(TCPTransport.java:619)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:684)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:681)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:681)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Resolution
In this case, the components can be given permission to accept connections from certain IP addresses or hostnames by entering those addresses
in the appropriate section in config/mlet-permachine.conf for the production system and in config/mlet.conf for the SDK. This is
described in more detail in the Rhino Administrators Guide .
If the connection is local, it may be necessary to add that local interface to the LOCALIPS variable in the config/config_variables file. Both
IPv4 and IPv6 addresses will need to be specified, and that node or SDK will need to be restarted.
6.1.2 Various permission related exceptions
Symptoms
Components inside Rhino run in a security context. These components include SBBs, Resource Adaptors, Profiles and Mlets. If a component
attempts to perform some operation which requires a security permission that the component does not have, then an exception is generated. An
example of this is when a Resource Adaptor attempts to write to the filesystem when it does not have permission to do so.
...WARN [rhino.remotetx] <StageWorker/RTM/1> Prepare failed:
Node 101 failed: com.opencloud.slee.ext.resource.ResourceException: Could not create initial CDR file
    ...
    at java.lang.Thread.run(Thread.java:534)
Caused by: java.io.IOException: Access denied trying to create a new CDR logfile
    ...
    at java.security.AccessController.doPrivileged(Native Method)
    ... 14 more
Caused by: java.security.AccessControlException: access denied (java.io.FilePermission ... read)
    at java.security.AccessControlContext.checkPermission(AccessControlContext.java:269)
    at java.security.AccessController.checkPermission(AccessController.java:401)
    at java.lang.SecurityManager.checkPermission(SecurityManager.java:524)
    at java.lang.SecurityManager.checkRead(SecurityManager.java:863)
    at java.io.File.exists(File.java:678)
    ... 19 more
...
Diagnosis and Resolution
Typically this indicates that the component is not configured appropriately.
As a step in diagnosing security-related problems, the security manager can be disabled by commenting out the following line in the
read-config-variables script:
OPTIONS="$OPTIONS -Djava.security.manager"
This is not recommended as a permanent solution and should not be used on a production system. Disabling the security manager should only be
used temporarily as a means to diagnose the problem in a test environment. To determine which permissions are required uncomment this line in
$RHINO_HOME/read-config-variables:
#OPTIONS="$OPTIONS -Djava.security.debug=access:failure"
Rhino will then print the permissions that are failing security checks to the console log. The required permissions should be added to the
<security-permissions> section of the component deployment descriptor. For more information regarding granting additional permissions to
an RA or SBB, refer to the sections on RA and SBB Deployment Descriptors in the Rhino Administration Manual .
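As an illustration of the kind of entry involved, a grant for file access might look like the following. This is a sketch only: the element names follow the JAIN SLEE deployment descriptor conventions and the path is a placeholder; consult the Rhino Administration Manual for the exact form required by your component.

```xml
<!-- Illustrative sketch: grants a component read/write access under a CDR
     directory. The path is a placeholder; element names per the JAIN SLEE
     deployment descriptor schema. -->
<security-permissions>
  <security-permission-spec>
    grant {
      permission java.io.FilePermission "/var/log/cdr/-", "read,write";
    };
  </security-permission-spec>
</security-permissions>
```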
6.2 Memory Database Full
Rhino’s in-memory database (MemDB) has a specified fixed capacity. Transactions which execute against a MemDB will not commit if the
transaction would take MemDB past its fixed capacity. This typically occurs due to inadequate sizing of a Rhino installation but may indicate faulty
logic in a service or resource adaptor. Log messages containing "Unable to prepare due to size limits of the DB" or "Unable to prepare due to
committed size limits of MemDB Local" are a clear indicator of this problem. The symptoms are varied and are described below.
• Profile Management and/or provisioning failing on page 66
• Deployment failing on page 69
• Calls not being setup successfully (refused requests)
• Exceptions in Rhino logs
6.2.1 Profile Management and/or provisioning failing
Symptoms
Creating or importing profiles fails. The Rhino log contains a message with a log key profile.* similar to the following:
2016-12-01 13:41:57.081 WARN [profile.mbean] <RMI TCP Connection(2)-192.168.0.204> [foo:8] Error committing profile:
javax.slee.management.ManagementException: Cannot commit transaction
    at com.opencloud.rhino.management.TxSupport.commitTx(TxSupport.java:36)
    at com.opencloud.rhino.management.SleeSupport.commitTx(SleeSupport.java:28)
    at com.opencloud.rhino.impl.profile.GenericProfile.commitProfile(GenericProfile.java:127)
    at sun.reflect.GeneratedMethodAccessor97.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.opencloud.rhino.management.DynamicMBeanSupport.doInvoke(DynamicMBeanSupport.java:157)
    at com.opencloud.rhino.management.DynamicMBeanSupport$1.run(DynamicMBeanSupport.java:121)
    at java.security.AccessController.doPrivileged(Native Method)
    at com.opencloud.rhino.management.DynamicMBeanSupport.invoke(DynamicMBeanSupport.java:113)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
    at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
    at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.opencloud.rhino.management.BaseMBeanInterceptor.intercept(BaseMBeanInterceptor.java:12)
    at com.opencloud.rhino.management.AuditingMBeanInterceptor.intercept(AuditingMBeanInterceptor.java:245)
    at com.opencloud.rhino.management.ObjectNameNamespaceQualifierMBeanInterceptor.intercept(ObjectNameNamespaceQualifierMBeanInterceptor.java:74)
    at com.opencloud.rhino.management.NamespaceAssociatorMBeanInterceptor.intercept(NamespaceAssociatorMBeanInterceptor.java:66)
    at com.opencloud.rhino.management.RhinoPermissionCheckInterceptor.intercept(RhinoPermissionCheckInterceptor.java:56)
    at com.opencloud.rhino.management.CompatibilityMBeanInterceptor.intercept(CompatibilityMBeanInterceptor.java:130)
    at com.opencloud.rhino.management.StartupManagementMBeanInterceptor.intercept(StartupManagementMBeanInterceptor.java:97)
    at com.opencloud.rhino.management.RemoteSafeExceptionMBeanInterceptor.intercept(RemoteSafeExceptionMBeanInterceptor.java:26)
    at com.opencloud.rhino.management.SleeMBeanServerBuilder$MBeanServerInvocationHandler.invoke(SleeMBeanServerBuilder.java:38)
    at com.sun.proxy.$Proxy9.invoke(Unknown Source)
    at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
    at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
    at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1427)
    at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
    at sun.reflect.GeneratedMethodAccessor80.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
    at sun.rmi.transport.Transport$2.run(Transport.java:202)
    at sun.rmi.transport.Transport$2.run(Transport.java:199)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:198)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:567)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:828)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.access$400(TCPTransport.java:619)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:684)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:681)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:681)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.opencloud.transaction.TransientRollbackException: Unable to prepare due to size limits of the DB
    at com.opencloud.transaction.local.OCTransactionImpl.commit(OCTransactionImpl.java:107)
    at com.opencloud.transaction.local.OCTransactionManagerImpl.commit(OCTransactionManagerImpl.java:190)
    at com.opencloud.rhino.management.TxSupport.commitTx(TxSupport.java:33)
    ... 48 more
Resolution and monitoring
If profile management and/or provisioning commands are unsuccessful due to the size restriction of the profile database installed in Rhino, the size
of the database should be increased. To resize the profile database, follow the instructions in Resizing MemDB Instances on page 73 to alter
the size of the ProfileDatabase.
The following command can be used to monitor the size of the ProfileDatabase installed in Rhino.
./client/bin/rhino-stats -m MemDB-Replicated.domain-0-ProfileDatabase
6.2.2 Deployment failing
Symptoms
If deployment is unsuccessful, this may be due to the size restriction of the ManagementDatabase installed in Rhino.
The error message and Rhino log will look similar to:
javax.slee.management.ManagementException: File storage error
    at com.opencloud.rhino.node.FileManager.store(FileManager.java:119)
    at com.opencloud.rhino.management.deployment.Deployment.install(Deployment.java:400)
    at com.opencloud.rhino.management.deployment.Deployment.install(Deployment.java:267)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.opencloud.rhino.management.DynamicMBeanSupport.doInvoke(DynamicMBeanSupport.java:157)
    at com.opencloud.rhino.management.DynamicMBeanSupport$1.run(DynamicMBeanSupport.java:121)
    at java.security.AccessController.doPrivileged(Native Method)
    at com.opencloud.rhino.management.DynamicMBeanSupport.invoke(DynamicMBeanSupport.java:113)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
    at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
    at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.opencloud.rhino.management.BaseMBeanInterceptor.intercept(BaseMBeanInterceptor.java:12)
    at com.opencloud.rhino.management.AuditingMBeanInterceptor.intercept(AuditingMBeanInterceptor.java:245)
    at com.opencloud.rhino.management.ObjectNameNamespaceQualifierMBeanInterceptor.intercept(ObjectNameNamespaceQualifierMBeanInterceptor.java:74)
    at com.opencloud.rhino.management.NamespaceAssociatorMBeanInterceptor.intercept(NamespaceAssociatorMBeanInterceptor.java:66)
    at com.opencloud.rhino.management.RhinoPermissionCheckInterceptor.intercept(RhinoPermissionCheckInterceptor.java:56)
    at com.opencloud.rhino.management.CompatibilityMBeanInterceptor.intercept(CompatibilityMBeanInterceptor.java:130)
    at com.opencloud.rhino.management.StartupManagementMBeanInterceptor.intercept(StartupManagementMBeanInterceptor.java:97)
    at com.opencloud.rhino.management.RemoteSafeExceptionMBeanInterceptor.intercept(RemoteSafeExceptionMBeanInterceptor.java:26)
    at com.opencloud.rhino.management.SleeMBeanServerBuilder$MBeanServerInvocationHandler.invoke(SleeMBeanServerBuilder.java:38)
    at com.sun.proxy.$Proxy9.invoke(Unknown Source)
    at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
    at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
    at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1427)
    at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
    at sun.reflect.GeneratedMethodAccessor82.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
    at sun.rmi.transport.Transport$2.run(Transport.java:202)
    at sun.rmi.transport.Transport$2.run(Transport.java:199)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:198)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:567)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:828)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.access$400(TCPTransport.java:619)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:684)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:681)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:681)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: javax.slee.TransactionRolledbackLocalException: Cannot commit transaction
    at com.opencloud.rhino.facilities.FacilitySupport.commitTx(FacilitySupport.java:47)
    at com.opencloud.rhino.node.FileManager.store(FileManager.java:112)
    ... 49 more
Caused by: com.opencloud.transaction.TransientRollbackException$$_Safe: Unable to prepare due to size limits of the DB
    at com.opencloud.transaction.local.OCTransactionImpl.commit(OCTransactionImpl.java:107)
    at com.opencloud.transaction.local.OCTransactionManagerImpl.commit(OCTransactionManagerImpl.java:190)
    at com.opencloud.rhino.facilities.FacilitySupport.commitTx(FacilitySupport.java:44)
    ... 50 more
Resolution and monitoring
If deployment commands are unsuccessful due to the size restriction of the management database installed in Rhino, the size of the database
should be increased. To resize the management database, follow the instructions in Resizing MemDB Instances on page 73 to alter the size of
the ManagementDatabase.
The following command can be used to monitor the size of the ManagementDatabase installed in Rhino.
./client/bin/rhino-stats -m MemDB-Replicated.ManagementDatabase
6.2.3 Calls not being setup successfully
If calls are not being set up successfully, the failures may be caused by the size restriction of either or both of the LocalMemoryDatabase and
ReplicatedMemoryDatabase installed in Rhino.
If the service does not use replicated transactions the Rhino log will contain messages similar to this:
2016-11-30 09:08:01.732 WARN [rhino.er.stage.eh] <jr-74> Event router transaction failed to commit
com.opencloud.transaction.TransientRollbackException: Unable to prepare due to committed size limits of MemDB Local
    at com.opencloud.ob.Rhino.bO.commit(2.3-1.20-85576:108)
    at com.opencloud.ob.Rhino.me.commit(2.3-1.20-85576:191)
    at com.opencloud.ob.Rhino.AB.run(2.3-1.20-85576:95)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at com.opencloud.ob.Rhino.bq$a$a$a.run(2.3-1.20-85576:440)
Replicated services will produce log entries similar to this:
2016-11-30 09:08:01.732 WARN [rhino.er.stage.eh] <jr-74> Event router transaction failed to commit
com.opencloud.transaction.TransientRollbackException: Unable to prepare due to size limits of the DB
    at com.opencloud.ob.Rhino.bO.commit(2.3-1.20-85576:108)
    at com.opencloud.ob.Rhino.me.commit(2.3-1.20-85576:191)
    at com.opencloud.ob.Rhino.AB.run(2.3-1.20-85576:95)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at com.opencloud.ob.Rhino.bq$a$a$a.run(2.3-1.20-85576:440)
Resolution and monitoring
If calls are failing due to the size restriction of the in-memory database installed in Rhino, the size of the database should be increased. Data
should be collected before resizing to determine which database is reaching its size limit.
The following commands can be used to monitor the size of these databases installed in Rhino.
./client/bin/rhino-stats -m MemDB-Local.LocalMemoryDatabase
./client/bin/rhino-stats -m MemDB-Replicated.domain-0-DomainedMemoryDatabase
./client/bin/rhino-stats -m MemDB-Replicated.domain-0-ReplicatedMemoryDatabase
Note that it may be necessary to additionally monitor the output of the following memory databases in Rhino:
MemDB-Replicated.ManagementDatabase and MemDB-Replicated.ProfileDatabase.
If Rhino is configured to use striping, the DomainedMemoryDatabase will be divided into stripes that are each limited to a fraction of the size
allocated to the MemDB instance. When using a striped MemDB it is possible for individual stripes to become full without the whole database
filling. This is typically a symptom of a highly asymmetric workload or a poorly designed service. To check for full stripes, monitor all the stripe
statistics in the database displaying problems. For example, the DomainedMemoryDatabase stripes in an 8-stripe configuration are:
"MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-0"
"MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-1"
"MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-2"
"MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-3"
"MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-4"
"MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-5"
"MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-6"
"MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-7"
If the committed size of the whole MemDB instance reported is close to the maximum, increase the configured size of this instance. If the
committed size of only one stripe is close to the maximum then the service is a poor match for striping. Reduce the number of stripes.
To resize a MemDB instance follow the instructions in Resizing MemDB Instances on page 73 to alter the size of the LocalMemoryDatabase,
DomainedMemoryDatabase or ReplicatedMemoryDatabase.
When monitoring the MemDB, watch the output of the churn, committedSize and maxCommittedSize fields. The churn field is the total change
in the content of the Memory Database in bytes. The committedSize field is the current committed size of the Memory Database in kilobytes.
The maxCommittedSize field is the maximum allowed committed size of the Memory Database in kilobytes. If the difference between
committedSize and maxCommittedSize is close to the churn, or less than 100kB, it is possible that the Memory Database is failing to commit
transactions because they would grow the database past its specified limit.
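The comparison described above can be scripted against sampled statistics. A sketch with illustrative numbers only; real values come from the rhino-stats output:

```shell
# Illustrative samples: committed sizes in KB, churn in bytes per interval.
committedSize=99950
maxCommittedSize=100000
churn=1500000

headroom_kb=$(( maxCommittedSize - committedSize ))
churn_kb=$(( churn / 1024 ))

# Warn when headroom is under 100KB or within one interval's churn.
if [ "$headroom_kb" -lt 100 ] || [ "$headroom_kb" -le "$churn_kb" ]; then
  echo "WARNING: MemDB near capacity (headroom ${headroom_kb}KB, churn ~${churn_kb}KB)"
fi
```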
If the committedSize increases over time or does not approximately track the system load this may indicate a fault in a resource adaptor or
service. Increase the MemDB size to prevent call failures and contact your solution provider for assistance.
For other problems that cause dropped calls refer to sections Dropped Calls on page 58 , Java Virtual Machine heap issues on page 30 and
Application or resource adaptor heap issues on page 32 .
6.2.4 Resizing MemDB Instances
The general workaround for MemDB sizing problems is to make the appropriate Memory Database larger. To do this, edit the node-*/config/
rhino-config.xml file and look for a <memdb> or <memdb-local> entry which has a <jndi-name> with the appropriate Database name
(e.g. ProfileDatabase). Increase the <committed-size> of the database to increase its maximum committed size.
Increasing the size of a Memory Database also requires increasing the maximum heap size of the JVM to accommodate the larger database. To do
this, edit the node-*/config/config_variables file and increase the HEAP_SIZE by the same amount.
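The two edits above move in lockstep: whatever is added to the database's committed size must also be added to HEAP_SIZE. A sketch of the arithmetic with illustrative values in MB:

```shell
# Illustrative values, in MB.
OLD_MEMDB=100   # current committed-size of the MemDB instance
NEW_MEMDB=200   # proposed committed-size
OLD_HEAP=3072   # current HEAP_SIZE

# Grow the heap by the same amount as the database.
NEW_HEAP=$(( OLD_HEAP + NEW_MEMDB - OLD_MEMDB ))
echo "HEAP_SIZE=${NEW_HEAP}m"
```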
These steps will need to be performed for all nodes, and the nodes will need to be restarted for the change to take effect.
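As a sketch, the two edits above might look like the following. The database name, size, and value format here are illustrative; check the existing entries in your own rhino-config.xml for the exact syntax used by your installation:

```xml
<!-- node-*/config/rhino-config.xml: raise the committed size of the
     relevant database (name and size here are examples only) -->
<memdb-local>
    <jndi-name>ProfileDatabase</jndi-name>
    <committed-size>200M</committed-size>
</memdb-local>
```

The node-*/config/config_variables file would then have its HEAP_SIZE increased by the same amount before each node is restarted.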
Initially, try doubling the size of a database and monitor usage to determine the relationship between load and usage. Assuming usage is
proportional to call load, you may then need to alter the configured size to accommodate the highest sustained load peak. If usage does not appear
to be proportional to load, or the problem is not solved after a second increase in size, please contact your solution provider for support.
73
Rhino Troubleshooting Guide (V2.5.0)
6.3 Resource Adaptors refuse to connect using TCP/IP
Make sure that your SLEE is in the “Running” state. A SLEE that is not running will not activate resource adaptors, so it will neither listen for nor
initiate connections.
If you have IPv6 installed on your machine, then Java may be attempting to use IPv6 rather than IPv4. Depending on the configuration of the
network services and the host computer’s interfaces, Java may resolve hostnames differently from the other network components. If this is the
case then the resource adaptor may attempt to connect to the remote system using IPv6 when the system is only listening on an IPv4 address.
Less frequently, the reverse may be true.
6.3.1 Diagnostic steps
Use the network diagnostic tools provided with your OS distribution to check the current connected and listening ports for running programs. One
common tool for this purpose is netstat:
$ netstat --inet6 -anp
(Not all processes could be identified, non-owned process info will not be shown, you would have to be root to see it all.)
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State        PID/Program name
tcp6       0      0 :::1202                 :::*                    LISTEN       6429/java
tcp6       0      0 :::1203                 :::*                    LISTEN       6429/java
tcp6       0      0 :::53                   :::*                    LISTEN       -
tcp6       0      0 :::22                   :::*                    LISTEN       -
tcp6       0      0 ::1:631                 :::*                    LISTEN       -
tcp6       0      0 ::1:5432                :::*                    LISTEN       -
tcp6       0      0 :::51673                :::*                    LISTEN       -
tcp6       0      0 :::9474                 :::*                    LISTEN       6429/java
tcp6       0      0 :::1199                 :::*                    LISTEN       6429/java
tcp6       0      0 :::111                  :::*                    LISTEN       -
tcp6       0      0 :::22000                :::*                    LISTEN       6429/java
tcp6       0      0 127.0.0.1:55682         127.0.0.1:5432          ESTABLISHED  6429/java
tcp6       0      0 127.0.0.1:55684         127.0.0.1:5432          ESTABLISHED  6429/java
udp6       0      0 ::1:35296               ::1:35296               ESTABLISHED  -
udp6       0      0 :::45751                :::*                                 -
udp6       0      0 :::5353                 :::*                                 7991/plugins --disa
udp6       0      0 192.168.0.204:16100     :::*                                 6429/java
udp6       0      0 :::53                   :::*                                 -
udp6       0      0 :::111                  :::*                                 -
udp6       0      0 fe80::9eeb:e8ff:fe0:123 :::*                                 -
udp6       0      0 ::1:123                 :::*                                 -
udp6       0      0 :::123                  :::*                                 -
udp6       0      0 :::704                  :::*                                 -
raw6       0      0 :::58                   :::*                    7            -
Check the firewall configuration on both the host running Rhino and the host providing the network function the RA connects to. Also check any
intermediate network components that provide filtering functions.
6.3.2 Workaround or Resolution
Try adding -Djava.net.preferIPv4Stack=true as a command-line argument to Rhino. This option can be set by adding the line
EXTRA_OPTIONS=-Djava.net.preferIPv4Stack=true to the RHINO_HOME/config_variables file. Other Java programs that connect to Rhino may also need
this argument added. For other programs an RA connects to, consult the documentation for that program to learn how to configure the address
family to use. If a program cannot be configured appropriately you may need to disable support for IPv6 in the OS configuration.
If a firewall is configured to block the addresses used by the resource adaptor it must be reconfigured to allow connections.
6.4 Local hostname not resolved properly
In some cases the local name of a host (e.g. test-vm.example.com) is resolved incorrectly (to 127.0.0.1 instead of its external address, e.g.
192.168…). This results in remote connections being refused when management clients try to connect from a different host.
6.4.1 Symptoms
• Connections via management clients are refused
6.4.2 Diagnostic steps
First check that the proper permissions have been granted in the appropriate sections of $RHINO_HOME/config/mlet.conf (SDK) or
$RHINO_HOME/etc/default/config/mlet-permachine.conf (Production).
Whether you are trying to connect through the Standalone Webconsole or the Commandline Console application, check whether you get an
exception similar to the one below:
...
Connection failed to mycomputer:1199
Connection failed to mycomputer:1199 (unreachable at Thu Apr 26 11:32:10 NZST 2007)
javax.security.auth.login.LoginException: Could not connect
...
Caused by: com.opencloud.slee.client.SleeClientException: Could not connect to a Rhino management host
 at com.opencloud.slee.mlet.shell.spi.jmx.RhinoJmxClient.connect(RhinoJmxClient.java:186)
 at com.opencloud.slee.mlet.shell.spi.jmx.RhinoJmxClient.connect(RhinoJmxClient.java:124)
 at com.opencloud.slee.mlet.shell.spi.jmx.RhinoJmxClient.login(RhinoJmxClient.java:242)
6.4.3 Workaround or Resolution
Edit the contents of the file /etc/hosts. It should contain entries as shown below.
...
127.0.0.1       localhost
192.168.123.123 mycomputer
...
Entries that resolve the machine’s name to its local address should be commented out:
...
#127.0.0.1 localhost.localdomain localhost mycomputer
...
7 Management
Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for Rhino management tools and utilities.
7.1 Connections Refused for the Command Console, Deployment Script or Rhino Element Manager
The remote management clients cannot connect to the Rhino SLEE.
7.1.1 Symptoms
The management clients show the following error when attempting to connect to the SLEE:
user@host:~/rhino/client/bin$ ./rhino-console
Could not connect to Rhino: [localhost:1199] Connection refused
-> This normally means Rhino is not running or the client is connecting to the wrong port.
Use -D switch to display connection debugging messages.
Could not connect to Rhino: [localhost:1199] No route to host
Use -D switch to display connection debugging messages.
Could not connect to Rhino: [localhost:1199] Could not retrieve RMI stub -> This often means the m-let configuration has not been modified to allow remote connections.
Use -D switch to display connection debugging messages.
BUILD FAILED
~/rhino/client/etc/common.xml:99: The following error occurred while executing this line:
~/rhino/client/etc/common.xml:77: error connecting to rhino: Login failed
7.1.2 Diagnostic steps and correction
Rhino is not listening for management connections
First, check that there is a running Rhino node on the host the client is trying to connect to. Use the ps command to check that the Rhino
process is running, e.g. ps ax | grep Rhino . If Rhino is running, check the rhino.log to determine if the node has joined the primary
component and started fully. If the Rhino node is failing to join the primary component or otherwise failing to fully start then consult the Clustering
troubleshooting guide.
Make sure that the remote host is accessible using the ping command. Alternatively, make sure that you can log in to the remote host using ssh to
make sure the network connection is working (some firewalls block ping).
Rhino refuses connections
By default, Rhino is set up to not allow remote connections by management clients. Permissions to do so need to be manually configured before
starting the SLEE, as described in the next section.
The management clients connect to Rhino via SSL secured JMX connections. These require both a client certificate and permission to connect
configured in the Java security configuration for Rhino.
To allow remote connections to the SLEE, the MLet configuration file will need to be edited. On the SDK version of Rhino, this is in
$RHINO_HOME/config/mlet.conf and for the Production version of Rhino, this is in $RHINO_HOME/node-???/config/mlet-
permachine.conf for each node.
Edit the MLet configuration file and add the following permission to the JMXRAdaptor MLet security-permission-spec. This should already be
present but commented out in the file. You will need to replace “host_name” with either a host name or a wildcard (e.g. * ).
grant {
    permission java.net.SocketPermission "{host_name}", "accept,resolve";
};
It is also possible that the Rhino SLEE host has multiple network interfaces and has bound the RMI server to a network interface other than the
one that the management client is trying to connect to.
If this is the case then the following could be added to $RHINO_HOME/read-config-variables for the SDK, or $RHINO_HOME/
node-???/read-config-variables for the Production version of Rhino:
OPTIONS="$OPTIONS -Djava.rmi.server.hostname={public IP}"
Rhino will need to be restarted in order for any of these changes to take effect. For the SDK, this simply means restarting it. For the Production
version, this means restarting the particular node that has had these permissions added.
Management client is not configured to connect to the Rhino host
Make sure that the settings for the management clients are correct. For rhino-console, these are stored in client/etc/client.properties .
You can also specify the remote host and port to connect to using the -h <hostname> and -p <port> command-line arguments. If the SLEE
has been configured to use a different port than the standard one for management client connections (and this has not been configured in the
client/etc/client.properties files), then the port will also need to be specified on the command-line arguments.
If connecting to localhost then the problem is likely to be a misconfigured /etc/hosts file causing the system to resolve localhost to an
address other than 127.0.0.1 .
For Ant deployment scripts, run ant -v; Ant will then report the underlying exception, which provides more detail.
To run the command console or run deployment scripts from a remote machine:
1. Copy $RHINO_HOME/client to the host
2. Edit the file client/etc/client.properties and change the remote.host property to the address of the Rhino host
3. Make sure your Ant build script is using the correct client directory. The Ant property ${client.home} must be set to the location of
your client directory
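Step 2 above can be scripted. The sketch below creates a stand-in client.properties and rewrites its remote.host property with sed; the host name and the file contents are purely illustrative:

```shell
# Create a stand-in client.properties (contents illustrative)
cat > client.properties <<'EOF'
remote.host=localhost
# other client settings would follow here
EOF

# Point the management client at the Rhino host (example host name)
sed -i 's/^remote\.host=.*/remote.host=rhino1.example.com/' client.properties

# Show the updated property
grep '^remote.host=' client.properties
```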
7.2 A Management Client Hangs
The management clients use SSL connections to connect securely to Rhino. To generate keys for secure connections, they read from the
/dev/random device (and block while doing so). The /dev/random device gathers entropy from the system’s devices, but on an idle
system it is possible that there is no entropy to gather, meaning that a read from /dev/random will block.
7.2.1 Symptoms
A management client hangs for a long period of time on start-up as it tries to read from /dev/random .
7.2.2 Workaround or Resolution
The ideal resolution is to create more system entropy. This can be done by wiggling the mouse, or on a remote server by logging in
and running top or other system utilities. Refer also to the operating system’s documentation; on Linux this is the random(4) man page: man 4
random.
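On Linux, the kernel's entropy estimate can also be inspected directly; persistently low values on older kernels correlate with blocked reads from /dev/random. This is a read-only check:

```shell
# Print the kernel's current entropy estimate in bits.
# On kernels before 5.18 this can drain toward zero on idle systems;
# newer kernels report a fixed pool size (256).
cat /proc/sys/kernel/random/entropy_avail
```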
7.3 Statistics client reports “Full thread sample containers”
If statistics gathering is done at a sampling rate which is set too high, the per-thread sample containers may fill before the statistics client can read
the statistics out of those containers.
7.3.1 Symptoms
When gathering statistics, the following may appear in the logs:
2006-10-16 12:59:26.353 INFO [rhino.monitoring.stats.paramset.Events] <StageWorker/Misc/1> [Events] Updating thread sample statistics
found 4 full thread sample containers
This is a benign problem and can be safely ignored. The reported sample statistics will be slightly inaccurate. To prevent it, reduce the sampling
rate.
7.4 Statistics Client Out of Memory
When running in graphical mode, the statistics client will, by default, store 6 hours of statistical data. If there is a large amount of data, or if the
statistics client is set to gather statistics for an extended period of time, it is possible for the statistics client to fail with an OutOfMemoryError.
7.4.1 Symptoms
The statistics client will fail with an OutOfMemoryError.
7.4.2 Workaround or Resolution
When running in graphical mode, use the -k option of the statistics client to limit the number of hours of statistics kept.
If statistics must be kept for a longer period, run the statistics client in command-line mode and pipe the output
to a text file for later analysis.
For more information, run client/bin/rhino-stats without any parameters. This will print a detailed usage description of the statistics client.
7.5 Creating a SyslogAppender gives an AccessControlException
Creating a SyslogAppender using the following entry in logging.xml will not work, as the appender does not perform its operations using the
proper security privileges:
<appender appender-class="org.apache.log4j.net.SyslogAppender" name="SyslogLog">
    <property name="SyslogHost" value="localhost"/>
    <property name="Facility" value="user"/>
</appender>
7.5.1 Symptoms
The following error would appear in Rhino’s logs:
2006-10-19 15:16:02.311 ERROR [simvanna.threadedcluster] <ThreadedClusterDeliveryThread> Exception thrown in delivery thread
java.security.AccessControlException: access denied (java.net.SocketPermission 127.0.0.1:514 connect,resolve)
 at java.security.AccessControlContext.checkPermission(AccessControlContext.java:264)
 at java.security.AccessController.checkPermission(AccessController.java:427)
 at java.lang.SecurityManager.checkPermission(SecurityManager.java:532)
 at java.lang.SecurityManager.checkConnect(SecurityManager.java:1034)
 at java.net.DatagramSocket.send(DatagramSocket.java:591)
 at org.apache.log4j.helpers.SyslogWriter.write(SyslogWriter.java:69)
 at org.apache.log4j.helpers.QuietWriter.write(QuietWriter.java:39)
 at org.apache.log4j.helpers.SyslogQuietWriter.write(SyslogQuietWriter.java:45)
 at org.apache.log4j.net.SyslogAppender.append(SyslogAppender.java:245)
7.5.2 Workaround or Resolution
For the case where a SyslogAppender is required, the createsyslogappender command of the rhino-console provides a much easier user
interface to achieve this task.
Replacing the entry org.apache.log4j.net.SyslogAppender above with com.opencloud.rhino.logging.RhinoSyslogAppender
will also fix this problem. The Open Cloud version of the SyslogAppender is a simple wrapper around the Log4J version which wraps the
append(LoggingEvent event) method in a “doPrivileged” block. For custom appenders not provided with Rhino, the same method can be used:
public void append(final LoggingEvent event) {
    AccessController.doPrivileged(new PrivilegedAction() {
        public Object run() {
            RhinoSyslogAppender.super.append(event);
            return null;
        }
    });
}
7.6 Platform Alarms
Rhino raises alarms in various situations, some of which are discussed in this section for troubleshooting purposes. The full list of current Rhino
core alarms is available using the alarmcatalog command in the rhino-console .
7.6.1 Symptoms
• Alarm notification messages in Rhino logs
• Alarms appearing in network management systems
• Entries present in the output of the rhino-console command listactivealarms
7.6.2 Diagnostic steps
Active alarms may be viewed in the output of the following command:
./client/bin/rhino-console listactivealarms
1. Upon the loss of a node from the cluster an alarm with alarm type of rhino.node-failure and an alarm source of
ClusterStateListener is raised. This alarm is cleared either by the administrator or when the node rejoins the cluster. This alarm
is not raised for quorum nodes.
2. If a user rate limiter’s capacity is exceeded, an alarm with an alarm source of ThresholdAlarms is raised. This alarm is cleared when
the event rate drops below the limiter’s configured capacity.
3. If a JMX mlet cannot be started successfully an alarm with an alarm source of MLetStarter is raised. These alarms must be
cleared manually.
4. If the rule for any user-defined threshold-based alarm is met, an alarm with a user-defined alarm type and alarm source is raised.
These alarms are cleared when the rule condition is no longer met or when the administrator clears them.
5. The licenses installed on the platform are insufficient for the deployed configuration. The alarm type is “rhino.license” and the alarm
source is “LicenseManager”. This occurs when:
a. A license has expired.
b. A license is due to expire in the next seven days.
c. License units are being processed for a currently unlicensed function.
d. The consumption rate for a particular license is greater than the consumption rate which the license allows.
The alarm type used in notifications reporting that an alarm has cleared is the original alarm type plus .clear, for example
rhino.node-failure.clear.
7.6.3 Workaround or Resolution
Alarms with an alarm source of ThresholdAlarms indicate that the system is receiving more input than it has been configured to receive.
Alarms with an alarm source of LicenseManager indicate that a Rhino installation is not licensed appropriately. Other alarms are either
user-defined, or defined by an application or resource adaptor.
Alarms with an alarm source of MLetStarter or some other non-obvious key usually indicate a software issue, such as a misconfiguration of the
installed cluster.
In most of these cases, the remedy is to contact your solution support provider for a new license or for instructions on how to remedy the situation.
7.7 DeploymentException when trying to deploy a component
7.7.1 Symptoms
DeploymentException when trying to deploy a component.
7.7.2 Diagnostic steps
Native Library XXXXLib.so already loaded in another classloader
Each time you deploy the RA, it happens in a new classloader (because the code may have changed). If no class GC has happened, or if
something is holding a reference to the old classloader and keeping it alive, the old library will still be loaded as well.
See http://java.sun.com/docs/books/jni/html/design.html#8628
A deployment may also fail if the deployable unit jar has been repackaged with an archiver that uses backslashes as path separators, for
example after unpacking the jar and modifying the deployable-unit.xml file within it. In this case the output of jar -tvf looks similar to the below:
$ jar -tvf service.jar
 1987 Wed Jun 13 09:34:02 NZST 2007 events.jar
76358 Wed Jun 13 09:34:02 NZST 2007 sbb.jar
  331 Wed Jun 13 09:34:02 NZST 2007 META-INF\deployable-unit.xml
  106 Wed Jun 13 09:34:02 NZST 2007 META-INF\MANIFEST.MF
  693 Wed Jun 13 09:34:02 NZST 2007 service-jar.xml
7.7.3 Workaround or Resolution
• Restart Rhino nodes before redeployment
• Force a full GC manually before redeployment (Requires Rhino to be configured with -XX:-DisableExplicitGC )
• Change the JNI library name whenever redeploying
• Ensure the classes that use JNI are loaded by a higher-level classloader, e.g. the Rhino system classloader or a library. (of course,
that also means you can’t deploy new versions of those classes at runtime)
Jars always use forward slashes ("/") as a path separator. Repackage the DU jar with a different file archiver, preferably the jar tool.
7.8 Deploying to multiple nodes in parallel fails
7.8.1 Symptoms
You are deploying Rhino using a script that creates and deploys components to multiple nodes asynchronously. The deployment fails with one of
the following exceptions on each node. When deploying the nodes serially, one after the other, no exceptions are reported.
WARN [rhino.management.deployment]
Installation of deployable unit failed:
javax.slee.management.AlreadyDeployedException: URL already installed: file:/opt/rhino/apps/sessionconductor/rhino/dist/is41-ra-type_1.2-du.jar
 at com.opencloud.rhino.management.deployment.Deployment.install(4276)
[WARN, rhino.management.resource, RMI TCP Connection(4)-192.168.84.173] -->
Resource adaptor entity creation failed:
java.lang.IllegalStateException: Not in primary component
 at com.opencloud.ob.Rhino.runtime.agt.release(4276)
...
7.8.2 Diagnostic steps
Rhino provides a single system image for management. You do not need to deploy a DU on each node in a cluster. Installing a deployable unit on
any node in a Rhino cluster propagates that DU to all nodes in the cluster, so if the DU has already been deployed via node 102, it cannot also be
deployed via node 101.
In addition, if a new node is created and joins a running cluster, it will be automatically synchronised with the active cluster members (i.e. DUs
installed, service states, log levels, trace levels, alarms, etc.).
A Rhino cluster will only allow one management operation that modifies internal state to be executed at any one time, so you cannot, for example,
install a DU on node 101 and a DU on node 102 at the same time. One of the install operations will block until the other has finished. You can run
multiple read-only operations simultaneously, though.
7.8.3 Workaround or Resolution
Create all nodes from the same base install, optionally starting the nodes in parallel. Wait for the nodes to start, then run the component installation
script against only one node.
7.9 Management of multiple Rhino instances
7.9.1 Symptoms
You are trying to use rhino-console to talk to multiple Rhino instances, but it will not connect to the second instance.
7.9.2 Workaround or Resolution
Unfortunately it is not possible to store keys for multiple Rhino instances in the client’s keystore; they are stored using fixed aliases. With the
current implementation, there are two ways to connect to multiple Rhino instances from a single management client:
1. Copy the rhino-private.keystore to all the Rhino home directories so that all instances have the same private key on the
server. This may be adequate for test environments.
2. Create a copy of client.properties that points to a different client keystore, and tweak the scripts to parameterise the
client.properties Java system property. Example:
OPTIONS="$OPTIONS -Dclient.properties=file:$CLIENT_HOME/etc/${RMISSL_PROPERTIES:client.properties}"
If doing this you may also want to parameterise the keystore password to restrict access to authorised users.
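The parameterisation in option 2 can be sketched as below. Note that the standard shell syntax for a default value is ${VAR:-default}; the CLIENT_HOME path here is illustrative:

```shell
# Illustrative client home; a real wrapper script would have this set already
CLIENT_HOME=/opt/rhino/client
unset RMISSL_PROPERTIES   # simulate the default (unset) case for this sketch

# Let an environment variable select a per-instance properties file,
# falling back to the default client.properties
RMISSL_PROPERTIES="${RMISSL_PROPERTIES:-client.properties}"
OPTIONS="$OPTIONS -Dclient.properties=file:$CLIENT_HOME/etc/$RMISSL_PROPERTIES"

echo "$OPTIONS"
```

Exporting RMISSL_PROPERTIES before invoking the client would then select a different keystore configuration per Rhino instance.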
7.10 Deployment problem on exceeding DB size
7.10.1 Symptoms
Deployment fails with “Unable to prepare due to size limits of the DB”.
7.10.2 Diagnostic steps
See Memory Database Full on page 66 for how to diagnose and resolve problems with the size of the Rhino in-memory databases, including the
management database.
7.12 BUILD FAILED when installing an OpenCloud product
7.12.1 Symptoms
Installation fails with an error like:
$:/opt/RhinoSDK/cgin-connectivity-trial-1.5.2.19 # ant -f deploy.xml
Buildfile: deploy.xml
management-init:
    [echo] Open Cloud Rhino SLEE Management tasks defined
login:
BUILD FAILED
/opt/RhinoSDK/client/etc/common.xml:102: The following error occurred while executing this line:
/opt/RhinoSDK/client/etc/common.xml:74: No supported regular expression matcher found: java.lang.ClassNotFoundException: org.apache.tools.ant.util.regexp.Jdk14RegexpRegexp
Total time: 0 seconds
7.12.2 Diagnostic steps
Run Ant with debugging output to check which version of Ant is being used and the classpath:
ant -d -f deploy.xml > output.txt
7.12.3 Workaround or Resolution
Add the missing libraries to your Ant lib directory.
We recommend you use the Ant version shipped with Rhino to avoid this problem, e.g.:
/opt/RhinoSDK/client/bin/ant -f deploy.xml
7.13 REM connection failure during management operations
7.13.1 Symptoms
Performing a management operation, e.g. activating an RA entity, fails with the following error:
Could not acquire exclusive access to Rhino server
7.13.2 Diagnostic steps
The message is sometimes seen when Rhino is under load and JMX operations are slow to return. Check the CPU load on the Rhino servers.
REM exceptions of this type can also occur when stopping or starting the whole cluster.
When the REM auto-refresh interval is set to a low value (the default is 30 seconds) there is a high likelihood of a lock collision occurring. With
higher auto-refresh intervals the likelihood drops. With the auto-refresh interval set to "Off" the exception may not occur at all.
If the rem.interceptor.connection log key is set to DEBUG in REM’s log4j.properties , then the logs will show which operations could
not acquire the JMX connection lock.
7.13.3 Workaround or Resolution
If the CPU load on the Rhino server is high then follow the resolution advice in Operating environment issues on page 28 .
If the auto-refresh interval is low then increase it until the problem stops.
For further diagnostic and resolution assistance contact Open Cloud or your solution provider, providing the REM logs.
7.14 Export error: Multiple Profile Snapshot for profiles residing in seperate memdb instances is unsupported
7.14.1 Symptoms
Trying to export Rhino configuration with rhino-export fails with an error like:
com.opencloud.ui.snapshot.SnapshotClientException: Multiple Profile Snapshot for profiles residing in seperate memdb instances is unsupported
 at com.opencloud.ob.client.be.a(80947:202)
 at com.opencloud.ui.snapshot.SnapshotClient.performProfileTableSnapshot(80947:294)
 at com.opencloud.rhino.management.exporter.Exporter.b(80947:382)
 at com.opencloud.rhino.management.exporter.Exporter.a(80947:350)
 at com.opencloud.rhino.management.exporter.Exporter.run(80947:291)
 at com.opencloud.rhino.management.exporter.Exporter.main(80947:201)
Press any key to continue...
7.14.2 Workaround or Solution
Run rhino-export with the -J option to use JMX for exporting the profile data. This is slightly less efficient but can handle multiple profile
storage locations.
7.15 Unused log keys configured in Rhino
7.15.1 Symptoms
After installing multiple versions of a service on Rhino, listlogkeys reports a number of obsolete keys. How can these be removed?
7.15.2 Workaround or Resolution
The unused log keys are marked for removal; they should disappear when the Rhino JVM restarts. Until then they are harmless.
7.16 Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT
7.16.1 Symptoms
Any management operation done on the cluster fails with:
2014-06-12 11:40:39.156 WARN [rhino.management.trace] <RMI TCP Connection(79)-10.240.83.131> Error setting trace level of root tracer for SbbNotification[service=ServiceID[name=TST,vendor=XXX,version=1.2.0],sbb=SbbID[name=SDMSbb,vendor=XXX,version=1.2.0]]:
com.opencloud.savanna2.framework.lock2.LockUnavailableException: Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT, current owners=[TransactionId:[101:153930147780781]]
 at com.opencloud.ob.Rhino.aR.a(2.3-1.12-72630:662)
 at com.opencloud.ob.Rhino.aR.acquireExclusive(2.3-1.12-72630:65)
 at com.opencloud.rhino.management.trace.Trace.setTraceLevel(2.3-1.12-72630:256)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
7.16.2 Diagnostic steps
The management lock is acquired and held for the duration of all management operations to prevent concurrent modification of Rhino state and
on-disk data. To find out more information about the transaction holding the lock, use the gettransactioninfo console command. To find out
what method is blocking release of the lock, use jstack $(cat node-???/work/rhino.pid) or
kill -QUIT $(cat node-???/work/rhino.pid) to dump the current thread state to the Rhino console log.
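The pid-file pattern above can be exercised safely against a stand-in process. Here the real node-???/work/rhino.pid is replaced by a scratch file, and the SIGQUIT/jstack step is left as a comment since it targets a real Rhino JVM:

```shell
# Stand-in for a node process: a background sleep whose pid we record
sleep 30 &
echo $! > rhino.pid

pid=$(cat rhino.pid)

# kill -0 checks the process exists without delivering a real signal
if kill -0 "$pid" 2>/dev/null; then
    echo "process $pid is alive"
    # For a real node: jstack "$pid"   or   kill -QUIT "$pid"
fi

# Clean up the stand-in process
kill "$pid" 2>/dev/null || true
wait "$pid" 2>/dev/null || true
```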
7.16.3 Workaround or resolution
Contact your solution provider with the Rhino logs showing the problem and a list of the management operations that were performed immediately
prior to the one that timed out. If the management operation is permanently blocked, e.g. by an infinite loop in the raStopping() callback of an RA,
the cluster will need to be restarted to interrupt the stuck operation. If it is not permanently blocked you must wait until the operation has finished.
7.17 Log level for trace appender not logging
7.17.1 Symptoms
Setting the log level for a logger trace.??? does not change the level of information logged under this key. The log level was set using the
setloglevel command.
7.17.2 Workaround or Resolution
The log keys trace.??? are special. These are SLEE tracers which have their own level configuration that they feed into the logging subsystem.
Use the settracerlevel command to set tracer levels for RA entities and SBBs.
7.18 Access to REM fails with Command CHECK_CONNECTION invoked without connection ID
7.18.1 Symptoms
After updating or reinstalling REM access fails with an error in the REM log similar to:
2012-02-24 16:16:17.204 ERROR [rem.server.http.connection] <btpool0-6> Command CHECK_CONNECTION invoked without connection ID
7.18.2 Workaround or Resolution
The most common cause for these errors is that the browser-hosted part of REM does not match the server code. Refresh the browser tab to
reload the client code. You may need to clear the browser cache.
8 Database
Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for the Rhino management database persistent
store.
8.1 Management Database Server Failure
Rhino uses either the PostgreSQL or Oracle Database Server as a non-volatile storage mechanism. It is possible that the database server can
terminate due to administrative action or software failure. Failure to communicate with the database server may also occur due to a network fault.
This is a problem related to the database only and is not indicative of a problem in Rhino.
A Rhino cluster gracefully handles the failure of a management database server; however, if all nodes in the cluster fail and are restarted before
the database server is restarted, then the cluster will pause in its booting process until a server is available. If all nodes in the cluster failed during
the period the database server was unavailable, then any configuration or profile changes made after the database failure will be lost.
8.1.1 Symptoms
Warning messages in the system console from the log key memdb.storer
Messages similar to the following indicate that a PostgreSQL Database Server has terminated either due to administrative action or failure.
...
WARN [memdb.storer] <SQLStableStorage Storer thread for rhino_profiles> [ProfileDatabase/rhino_profiles] Connection failed
org.postgresql.util.PSQLException: An I/O error occured while sending to the backend.
 at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:201)
 at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:388)
 at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:329)
 at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:239)
 ...
 at java.lang.Thread.run(Thread.java:595)
Caused by: java.io.EOFException
 at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:243)
 at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1122)
 at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:175)
 ... 8 more
WARN [memdb.storer] <SQLStableStorage Storer thread for rhino_management> [ManagementDatabase/rhino_management] Connection failed
org.postgresql.util.PSQLException: An I/O error occured while sending to the backend.
 at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:201)
 at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:388)
 at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:329)
 at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:239)
 ...
 at java.lang.Thread.run(Thread.java:595)
Caused by: java.io.EOFException
 at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:243)
 at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1122)
 at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:175)
 ... 8 more
...
8.1.2 Resolution and Mitigation
The management database server should be restarted as soon as possible. Once the database has been restarted and the Rhino cluster has
reconnected, messages similar to the following should appear in the Rhino logs.
...
INFO [memdb.storer] <SQLStableStorage Storer thread for rhino_profiles> [ProfileDatabase/rhino_profiles] Connected to database
INFO [memdb.storer] <SQLStableStorage Storer thread for rhino_management> [ManagementDatabase/rhino_management] Connected to database
...
To reduce the risk of data loss, Rhino can be configured to replicate persisted data across multiple PostgreSQL instances. When
using Oracle as the persistence database, replication is provided by the Oracle RAC clustering mechanism.
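After restarting the database server, it can be useful to wait until PostgreSQL is accepting TCP connections again before checking the Rhino logs for the reconnection messages above. The following is a minimal shell sketch, assuming the default PostgreSQL port 5432; the function name and polling policy are illustrative and not part of Rhino.

```shell
# wait_for_db HOST PORT TRIES: poll until a TCP connection succeeds.
wait_for_db() {
  local host=$1 port=$2 tries=$3 i
  for ((i = 0; i < tries; i++)); do
    # bash's /dev/tcp pseudo-device attempts a TCP connection
    if bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
      echo "database accepting connections"
      return 0
    fi
    sleep 1
  done
  echo "database still down after $tries attempts"
  return 1
}

# Example: wait_for_db localhost 5432 30
```

Once the database responds, the "Connected to database" messages should appear in the Rhino logs as each storer thread reconnects.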
9 Signalware
Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for using Signalware with Rhino’s CGIN
Resource Adaptor.
9.1 CGIN RA to Signalware Backend Connection Errors
Rhino’s CGIN resource adaptor for the SS7 protocol family requires a native backend process, known as a Signalware backend, to communicate
with the Signalware SS7 stack. The Signalware backend process must run on the Signalware host systems, and the resource adaptor
communicates with it via TCP. When connection problems occur, the cause is generally configuration related, or a version mismatch between the resource adaptor and the backend.
9.1.1 Symptoms
• Resource Adaptor cannot connect to backend.
• Resource Adaptor drops connection to backend.
9.1.2 Diagnostic steps
If, when activating the resource adaptor (or starting the Rhino SLEE with the resource adaptor active), the following message immediately appears
in the Rhino logs and an alarm is raised stating that a backend connection has been lost, this indicates that the backend process is not
reachable by the resource adaptor. Generally this means the backend is either not running or, due to resource adaptor or network misconfiguration, the
resource adaptor is unable to establish a TCP connection to it.
2013-07-18 14:54:24.907 Major [rhino.facility.alarm.manager] <New I/O worker #47> Alarm 101:140617544395779 [RAEntityNotification[entity=insis-ptc-external],noconnection,localhost:10102] was raised at 2013-07-18 14:54:24.907 to level Major Lost connection to backend localhost:10102
2013-07-18 14:54:24.907 INFO [rhino.facility.alarm.csv] <New I/O worker #47> 2013-07-18 14:54:24.907,raised,2013-07-18 14:54:24.907,101:140617544395779,RAEntityNotification[entity=insis-ptc-external],noconnection,localhost:10102,Major,Lost connection to backend localhost:10102,
If the resource adaptor successfully established a connection to the backend, but one or both of the following messages later appear in the logs,
this indicates that the connection was set up successfully but was then lost due to a backend or network failure.
2011-03-29 10:20:30.942 Warning [trace.cginra.backend.[sedition:24146]] <RAContextTimer-cginra> sedition:24146#4: connection heartbeat lost. Last sent: 1301347220, last received: 1301347210
2016-11-17 16:20:26.070 Major [rhino.facility.alarm.manager] <cginra-thread-1> Alarm 101:194478654849023 [RAEntityNotification[entity=cginra],noconnection,autotests-signalware:10101] was raised at 2016-11-17 16:20:26.069 to level Major Lost connection to backend autotests-signalware:10101
2016-11-17 16:20:26.071 Warning [trace.cginra.backend.[autotests-signalware:10101]] <cginra-thread-1> Connection autotests-signalware:10101#1 lost: Remote host closed connection
9.1.3 Workaround or Resolution
First determine whether the backend process is running. If it is not running, then either the backend or Signalware itself may be incorrectly
configured.
If the backend process is running, check that the host and port it is listening on are correctly configured in the resource adaptor entity configuration
properties. Also verify that the host running the backend is reachable from the Rhino host and that the port is not blocked by a firewall on either
host.
If a connection is established but then dropped, this indicates either a software failure in the backend or a network-level failure. Check that the
backend process is running and, if not, restart it. Any error messages from a failed backend should be sent to your solution provider to determine
the cause of failure. If the backend process is still running, check for network connectivity problems between the Rhino host and the host
running the backend processes.
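The reachability check above can be scripted with bash's /dev/tcp pseudo-device; the host and port below are the examples shown in the log excerpts earlier in this section, and the function name is illustrative.

```shell
# check_backend HOST PORT: succeeds if a TCP connection can be opened.
check_backend() {
  local host=$1 port=$2
  # The timeout guards against a firewall that silently drops packets
  # instead of rejecting the connection.
  timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

if check_backend localhost 10102; then
  echo "backend reachable"
else
  echo "backend NOT reachable"
fi
```

If the check fails, confirm that the backend process is listening on the configured port and that no firewall on either host blocks it.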
9.2 CGIN RA Cannot Create Outgoing Dialogs
If one of Rhino’s Signalware Resource Adaptors has more than the configured maximum number of active dialogs with a single backend process, it
will fail to allocate new dialogs.
9.2.1 Symptoms
• Applications fail to create outgoing dialogs using the CGIN Resource Adaptor.
9.2.2 Diagnostic steps
If the backend process has reached its maximum number of active dialogs, a warning similar to the following will be logged by the Resource
Adaptor:
2016-11-17 14:49:42.410 Warning [trace.cginra.backend.[signalware:10100]] <cginra-thread-2> Unable to allocate a dialog handle: Out of dialog ids
Additionally, an error similar to the following will be visible in the output of the backend process:
2016-11-17 11:45:28.823003 rhino_252_1: Failed to allocate a new dialog ID. errno=7
9.2.3 Workaround or Resolution
By default each backend process supports up to 32,000 simultaneous dialogs. If the number of simultaneous dialogs exceeds this limit, the above
errors will occur. There are two approaches to working around this issue:
• Increase the number of dialogs per backend process using the -maxdialogs N parameter, where N is the maximum number of dialogs (up to the
supported limit of 32,000 concurrent dialogs).
• If the backend process is already configured for the maximum of 32,000 dialogs, add additional backend processes and spread the load across
them to support more active calls. See https://developer.opencloud.com/devportal/display/CGIN1v5/Using+the+Ulticom+Signalware+TCAP+Stack
for details.
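As a sizing aid, the number of backend processes needed for a given load can be computed with a ceiling division over the per-backend limit; the 100,000-dialog figure below is an illustrative assumption, not a recommendation.

```shell
MAX_PER_BACKEND=32000    # per-backend dialog limit stated above
expected_dialogs=100000  # example expected peak of concurrent dialogs

# Ceiling division: backends = ceil(expected_dialogs / MAX_PER_BACKEND)
backends=$(( (expected_dialogs + MAX_PER_BACKEND - 1) / MAX_PER_BACKEND ))
echo "backend processes required: $backends"   # prints 4 for 100000
```

In practice, allow headroom above the expected peak so that a single backend failure does not immediately exhaust the remaining dialog capacity.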
9.3 CGIN RA Cannot Receive Incoming Dialogs
9.3.1 Symptoms
• Incoming dialogs do not arrive at CGIN RA.
9.3.2 Diagnostic steps
If the backend process has reached its maximum number of active dialogs, the Signalware backend will log something similar to the
following:
2016-11-17 13:12:39.035658 rhino_252_1: cTCAPTakeMsg() failed: (ctcap_errno=12), Out of Dialog entries
9.3.3 Workaround or Resolution
The root cause and solution are identical to those for CGIN RA Cannot Create Outgoing Dialogs. Please see that section for
details.
9.4 Problems with Signalware not involving the CGIN backends
Please refer to the Signalware Installation Manual for more information about installing and configuring Signalware. If you do not find a solution for
your problem, contact your solution provider and provide the system report created by the dumpce command.
10 Exit Codes
Below is a list of the possible exit codes of the Rhino process.
10.1 Rhino Exit Codes
Exit Code  Error                                        Action
1          Unspecified error.                           See the Rhino and console logs for details.
2          Misconfiguration found while reading a       Check the config files and fix any problems
           config file. See the Rhino and console       reported in the logs.
           logs for details.
3          Watchdog shutdown.                           Check the Rhino log for the cause. Most
                                                        frequently this is caused by system overload
                                                        or a faulty service.
4          Timeout waiting for normal shutdown to       No action required.
           complete.
5          Restart of node requested by a management    No action required.
           operation.
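A wrapper script can react to these exit codes automatically; the handler below is a hypothetical sketch, not part of the Rhino distribution, and the script name in the usage comment may differ on your installation.

```shell
# handle_exit CODE: print the recommended action for a Rhino exit code.
handle_exit() {
  case $1 in
    0) echo "normal shutdown" ;;
    1) echo "unspecified error: see the Rhino and console logs" ;;
    2) echo "config error: fix problems reported in the logs" ;;
    3) echo "watchdog shutdown: check for overload or a faulty service" ;;
    4|5) echo "no action required" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}

# Example (script name may differ): ./start-rhino.sh; handle_exit $?
```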
10.2 JVM Exit Codes
The JVM reports a number of exit codes; new codes may be added from version to version. At the time of writing they include the following:
10.2.1 Internal JVM Exit Codes
No known exit codes other than 0 (normal exit).
10.2.2 Signal Handler Exit Codes
Exit codes above 128 occur as a result of the JVM receiving a signal from the OS. These are always a result of either OS process management
or a JVM bug that has triggered an OS-level protection mechanism. In many cases the console log will have information on the cause; there may
also be information in the system log. On Linux the codes are as follows (other OSes differ):
Signal Name  Signal Number         Exit Code  Common causes
HUP          1                     129        User closed the terminal running Rhino, or an
                                              operator sent the signal to the Rhino process.
INT          2                     130        Ctrl-C in the terminal running Rhino, or an
                                              operator sent the signal to the Rhino process.
QUIT         3                     N/A        JVM does not exit. Operator triggered a JVM
                                              thread stack dump.
ILL          4                     134        JVM bug.
TRAP         5                     133        Bug in a debugger attached to the Rhino JVM.
ABRT/IOT     6                     134        JVM bug.
BUS          7 (x86), 10 (SPARC)   135        JVM bug.
FPE          8                     134        JVM bug.
KILL         9                     137        Operator or OS killed the Rhino process (out of
                                              memory or system shutdown).
USR1         10 (x86), 30 (SPARC)  138        Operator sent a signal to the Rhino process.
SEGV         11                    134        JVM bug.
USR2         12 (x86), 31 (SPARC)  134        Operator sent a signal to the Rhino process.
ALRM         14                    142        Native library set an OS alarm timer.
TERM         15                    143        Operator or OS killed the Rhino process (system
                                              shutdown).
STKFLT       16 (x86 only)         144        JVM bug.
XCPU         24 (x86), 30 (SPARC)  152        Ulimit on CPU time.
VTALRM       26 (x86), 28 (SPARC)  154        Native library set an OS alarm timer.
PROF         27 (x86), 29 (SPARC)  155        Bug in a profiler attached to the Rhino JVM.
IO           29 (x86), 22 (SPARC)  157        Bug in a native library performing asynchronous
                                              IO.
SYS          31 (x86), 12 (SPARC)  140        Bug in the JVM, a native library, or the kernel.
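For signals the JVM does not intercept, these exit codes follow the usual Linux shell convention of 128 plus the signal number, which can be demonstrated with any ordinary process:

```shell
# Kill a background process with SIGTERM (signal 15) and observe
# that the shell reports exit code 128 + 15 = 143.
sleep 60 &
pid=$!
kill -TERM "$pid"
code=0
wait "$pid" || code=$?
echo "exit code: $code"   # prints "exit code: 143" on Linux
```

The exceptions in the table (e.g. SEGV reporting 134 rather than 139) arise because the JVM installs its own handlers for those signals and aborts after writing a crash report.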
10.3 Other exit codes
Some libraries may call System.exit() with a code not in this list. We are not aware of any libraries used by Rhino that do this, and we recommend
avoiding libraries that do when developing services.