Rep Server Monitoring Best Practices
Sybase, Inc. 2009 Page 1 of 51
Best Practice Recommendation Subject: Monitoring Replication System
Author(s): Deepak Upadhyay, Sr.DBA, Sybase IT
Reviewer(s): David Burgess, Staff DBA, Sybase IT
Udaya Challapalli, Sr. DBA Manager, Sybase IT
Hema Seshadri, Sr. DBA Manager, Sybase IT
Contributor(s): David Burgess, Staff DBA, Sybase IT
Abstract: The purpose of this document is to provide best practices to monitor a typical Sybase replication environment.
Table of Contents

1 Introduction
  1.1 Definitions
    1.1.1 Relational Database
    1.1.2 Primary database
    1.1.3 Replicate database
    1.1.4 Standby database
    1.1.5 Replication Server
    1.1.6 Primary replication server
    1.1.7 Replicate replication server
Best Practice Solution
  1.2 General Monitoring
    1.2.1 Status monitoring
    1.2.2 Errorlog
    1.2.3 Disk space
    1.2.4 Replication topology
  1.3 Performance monitoring
    1.3.1 Latency
    1.3.2 Throughput
    1.3.3 Statistics monitoring (i.e. monitor counters)
  1.4 Alerting/Notification
    1.4.1 RSM event monitoring
    1.4.2 Scripts
  1.5 General Troubleshooting
    1.5.1 Skipping a transaction
    1.5.2 Dumping a stable queue
    1.5.3 Disabling the secondary truncation point
    1.5.4 Enabling the secondary truncation point
References
1 Introduction
The purpose of this document is to list the best practices used to monitor a Sybase replication system.
1.1 Definitions
1.1.1 Relational Database
A type of database which organizes data into related tables; each table has two major elements, rows and columns.
1.1.2 Primary database
A database where transactions are originally performed (by an end user or process) and captured for the replication server.
1.1.3 Replicate database
A database which receives replicated transactions from the replication server and applies them to its own copy of the primary data. A replicate database may or may not be an exact copy of its primary database.
1.1.4 Standby database
A database which receives replicated transactions from the replication server and applies them to its own copy of the primary data. A standby database is generally an exact copy of its primary database.
1.1.5 Replication Server
A Sybase Open Client/Server product which performs continuous, asynchronous transfer of transactions from a primary database to replicate database(s).
1.1.6 Primary replication server
An instance of a replication server that performs continuous, asynchronous transfer of transactions from a primary database to other replication server(s). It can also replicate transactions directly to replicate databases.
1.1.7 Replicate replication server
An instance of a replication server that generally receives transactions from a primary replication server and replicates them to replicate database(s).
Best Practice Solution
In this section monitoring is categorized as follows:
• General monitoring
• Performance monitoring
• Alerting/Notification
• General troubleshooting
We will discuss each of these categories in detail in the sections below.
1.2 General Monitoring
Monitoring replication is critical. Effective monitoring is the key to maintaining a replication system, since time to fix any issue is an important factor. If connections are suspended for a long time, this can cause:
• The stable device to fill
• The Replication Agent to suspend at the primary site
• The transaction log to fill up at the primary site
• All transactions to suspend or abort (i.e. stopping all activity)
The objective of this category of monitoring is to make sure all components of the replication system are up and running, and to avoid any surprise failure of the system. Monitoring in this category is generally divided as follows:
• Status monitoring
• Log (i.e. errorlog) monitoring
• Space monitoring (i.e. disk space)
• Overall topology (i.e. is the system functioning as it is supposed to?)
o Table schema monitoring – Make sure the schema matches the replication definitions and the replicate database(s), especially after any application changes
o Marking for replication – Make sure all required tables are correctly marked for replication
o Data synchronization – Make sure data is in sync between the primary and replicate sites
(Figure: diagram of the various components of a replication system to monitor.)
1.2.1 Status monitoring
This section details the various components of a replication system and why it is important to monitor them. The following major components are considered in this section:
• Servers
• Connections/Routes
• Replication queues
• Replication Agents/LTMs
• Replication threads/modules
1.2.1.1 Servers
Generally there are two types of servers:
• Database servers
• Replication servers
1.2.1.1.1 Database servers
A replication system basically uses three types of database servers:
• Primary
• Replicate
• RSSD

Primary – The most important database of any application; from the replication system's point of view it is a critical component to monitor. Whether the server is up and running can be monitored using many different methods:
• Scripts
• RMS/RSM event monitoring
• Other third-party tools

Replicate – When used as a standby (i.e. DR), this is as critical as the primary database server. An outage of a replicate site can have a big impact on the primary site (and ultimately the application), depending on the configuration of the replication system. Several key factors define the impact of a down replicate site:
• Time to fix and bring up the replicate site(s)
• Space allocated to the stable device(s) of the replication server(s)
• Complexity of the replication system (i.e. how easy it is to rebuild the replication system, or to remove/add the replicate site from/to the replication system)
Again, whether the server is up and running can be monitored using scripts, RMS/RSM event monitoring, or other third-party tools.

RSSD – Equally important, since failure of the RSSD database server may cause failure of the replication system. Again, whether the server is up and running can be monitored using scripts, RMS/RSM event monitoring, or other third-party tools.
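The script-based checks mentioned above can be as simple as a ping through isql. Below is a minimal sketch, assuming a dedicated monitoring login (the `monuser` name and the `MON_PWD` variable are placeholders, not from the original document):

```shell
#!/bin/sh
# Ping a server (ASE or Replication Server) via isql; prints UP or DOWN.
# monuser / $MON_PWD are hypothetical monitoring credentials.
check_server() {
  server="$1"
  # -l sets a login timeout so a hung server does not block the monitor
  if echo "quit" | isql -Umonuser -P"$MON_PWD" -S"$server" -l 15 >/dev/null 2>&1
  then
    echo "UP"
  else
    echo "DOWN"
  fi
}

# Example: check primary, replicate and RSSD servers in one pass
# for s in PDSDBS1 PDSDBS5 PDSREP1; do
#   [ "$(check_server "$s")" = "DOWN" ] && echo "ALERT: $s is down"
# done
```

The same function works for database servers and replication servers alike, since both accept isql logins.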
1.2.1.1.2 Replication servers
Often there is only one replication server, which performs all replication in the system. More complex environments will have multiple replication servers of the types mentioned below. The status of each server can be monitored using scripts or RMS/RSM event monitoring.
ID server – Must be up and running at least under the following conditions:
• Adding a new replication server to the replication system
• Adding databases to the replication system
• Adding routes to the replication system
Primary – Responsible for collecting published data. Can impact the business if down for a long duration (without being monitored!).
Intermediate – Moves data between two replication servers. Monitoring the intermediate replication server is as important as monitoring the other replication servers.
Replicate – Applies replicated data. Monitoring the replicate replication server is as important as monitoring the other replication servers.
1.2.1.2 Connections/Routes
Connections and routes are critical components of the replication server; their status generally indicates whether data is replicating smoothly or there is an issue.
Logical connection – In a warm-standby environment, it is important to verify the current active and standby connections:
isql -U<user> -S<RepSrv> -P<pwd>
1> admin logical_status
2> go
Logical Connection name  Active Connection name  Active State  Standby Connection name  Standby State  Controller RS
[278] DBS.DBSfast_LC     [281] PDSDBS1.DBSfast   Active/       [1526] PDSDBS5.DBSfast   Active/        [16777358] PRLDBS1A
[283] DBS.DBSudef_LC     [284] PDSDBS1.DBSudef   Active/       [1528] PDSDBS5.DBSudef   Active/        [16777358] PRLDBS1A
[375] DBS.DBSuomm_LC     [539] PDSDBS1.DBSuomm   Active/       [1529] PDSDBS5.DBSuomm   Active/        [16777358] PRLDBS1A
[376] DBS.DBSvend_LC     [540] PDSDBS1.DBSvend   Active/       [1527] PDSDBS5.DBSvend   Active/        [16777358] PRLDBS1A
[543] DBS.u_DBScta_LC    [544] PDSDBS1.u_DBScta  Active/       [1530] PDSDBS5.u_DBScta  Active/        [16777358] PRLDBS1A

(The remaining columns – Operations in progress, State of operations in progress, and Spid – were "None" or blank for all rows.)
Physical database connection – Make sure each database connection is up and running, especially the replicate database connections.
isql –U<user> -S<RepSrv> -P<pwd>
1> admin health
2> go

Mode    Quiesce  Status
----    -------  -------
NORMAL  FALSE    HEALTHY
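The `admin health` output lends itself to scripting. A minimal sketch, assuming the output arrives on stdin exactly as isql prints it (header row, dashed separator row, then the data row):

```shell
#!/bin/sh
# Extract the Status field (e.g. HEALTHY) from "admin health" output on stdin.
# Assumes: line 1 = header, line 2 = dashes, line 3 = "Mode Quiesce Status" values.
rs_health_status() {
  awk 'NR == 3 { print $3 }'
}

# Example usage (the isql invocation is illustrative):
# printf 'admin health\ngo\n' | isql -U<user> -S<RepSrv> -P<pwd> | rs_health_status
```

A monitor would alert whenever the extracted value is anything other than HEALTHY.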
In the output of most of the "admin" commands below, the "State" column is important to observe. The following table describes its possible values.
State             Description
Active            Actively processing a command.
Awaiting Command  The thread is waiting for a client to send a command.
Awaiting I/O      The thread is waiting for an I/O operation to finish.
Awaiting Message  The thread is waiting for a message from an Open Server message queue.
Awaiting Wakeup   The thread has posted a sleep and is waiting to be awakened.
Connecting        The thread is connecting.
Down              The thread has not started or has terminated.
Getting Lock      The thread is waiting on a mutual exclusion lock.
Inactive          The status of an RSI User thread at the destination of a route when the source Replication Server is not connected to the destination Replication Server.
Initializing      The thread is being initialized.
Suspended         The thread has been suspended by the user.
1> admin who_is_down
2> go

Spid  Name      State      Info
      DSI EXEC  Suspended  414(1) PRSPS3.cons
      DSI       Suspended  414 PRSPS3.cons
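Any data row in `admin who_is_down` output is an alert condition, so a monitor can simply count them. A minimal sketch, assuming the output is piped in with its single header row:

```shell
#!/bin/sh
# Count data rows in "admin who_is_down" output on stdin (header on line 1).
# A non-zero count means at least one thread is down or suspended.
down_count() {
  awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}

# Example: alert when anything is down
# n=$(run_admin_who_is_down | down_count)   # run_admin_who_is_down is hypothetical
# [ "$n" -gt 0 ] && echo "ALERT: $n replication thread(s) down"
```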
1> admin who
2> go

Spid  Name       State             Info
58    DIST       Awaiting Wakeup   376 DBS.DBSvend_LC
67    SQT        Awaiting Wakeup   376:1 DIST DBS.DBSvend_LC
29    SQM        Awaiting Message  376:1 DBS.DBSvend_LC
28    SQM        Awaiting Message  376:0 DBS.DBSvend_LC
82    DSI EXEC   Awaiting Command  540(1) PDSDBS1.DBSvend
42    DSI        Awaiting Message  540 PDSDBS1.DBSvend
8427  REP AGENT  Awaiting Command  PDSDBS1.DBSvend
8203  DSI EXEC   Awaiting Command  1527(1) PDSDBS5.DBSvend
8202  DSI        Awaiting Message  1527 PDSDBS5.DBSvend
4646  DSI EXEC   Awaiting Command  614(1) PRSDBS1.DBSvend
4645  DSI        Awaiting Message  614 PRSDBS1.DBSvend
32    SQM        Awaiting Message  614:0 PRSDBS1.DBSvend
38    RSI        Awaiting Wakeup   PRLRMDBS1
37    SQM        Awaiting Message  16777372:0 PRLRMDBS1
86    RSI USER   Awaiting Command  PRLRMDBS1
54    dSUB       Sleeping
15    dCM        Awaiting Message
18    dAIO       Awaiting Message
62    dREC       Sleeping          dREC
63    dSTATS     Sleeping
1152  USER       Active            sa
14    dALARM     Awaiting Wakeup
Direct/Indirect routes – Based on the outbound queue size of the source replication server, it is critical to monitor the status of the route.
isql -U<user> -S<RepSrv> -P<pwd>
1> admin who,rsi
2> go
In addition to the "State" column, watch the "Locater Sent" and "Locater Deleted" columns: if their values differ, there is still data to be processed by the replication server for the RSI.
Spid  State            Info       Packets Sent  Bytes Sent  Blocking Reads
38    Awaiting Wakeup  PRLRMDBS1  2655009       426460847   383262
Locater Sent:    0x0000000000000000000000000000000000000000000000000000000000014c5900320002
Locater Deleted: 0x0000000000000000000000000000000000000000000000000000000000014c5900320002
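The locater comparison described above is easy to automate: when "Locater Sent" equals "Locater Deleted", the destination has acknowledged everything sent on the route. A minimal sketch of that comparison:

```shell
#!/bin/sh
# Compare the two locater values from "admin who,rsi"; equal values mean
# the destination has acknowledged everything the source has sent.
rsi_backlog() {
  sent="$1"
  deleted="$2"
  if [ "$sent" = "$deleted" ]; then
    echo "CAUGHT-UP"
  else
    echo "BACKLOG"
  fi
}
```

A wrapper script would extract the two hex strings from the command output and alert on "BACKLOG" persisting across several samples.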
1> admin who_is_down
2> go

Spid  Name      State  Info
      DSI EXEC  Down   418(1) DTRSPS3.cons
1.2.1.3 Queues
Inbound – During peak hours, it is important to monitor whether data is moving through the queue.
Outbound – Except in a warm-standby-only environment, it is equally important to monitor the size and movement of the data in the queue.
Materialization – Only important during materialization.
The replication command "admin who,sqm" shows about seventeen output columns, and all of them are worth monitoring carefully, but the four columns shown below can quickly provide a brief state of all replication queues. When a connection is active but data does not appear to be replicating, check whether the "Duplicates" column is rising; unique transactions may be incorrectly resolved as duplicates. Additionally, the "First Segment.Block" and "Last Seg.Block" columns can quickly tell approximately how much data in the queue remains to be processed. For example, in the output below the queue 378:1 DBS.DBSglep_LC has about 46 MB (i.e. 600520 - 600474 segments) of data to be processed.
Notice: 1 segment = 1 MB = 64 blocks (i.e. block size = 16 KB)
isql -U<user> -S<RepSrv> -P<pwd>
1> admin who,sqm
2> go

Info                         Duplicates  First Segment.Block  Last Seg.Block
16777372:0 PRLRMDBS1         167         294680.39            294680.39
615:0 PRSDBS1.DBSvmst        150         2735.24              2735.24
606:0 PRSDBS1.DBSglep        3           159995.25            159995.25
379:0 DBS.DBSvmst_LC         0           0.1                  0
378:1 DBS.DBSglep_LC         12          600474.19            600520.47
378:0 DBS.DBSglep_LC         0           0.1                  0
201:0 PDSREP1.PRLDBS1B_RSSD  9           8.9                  8.9
201:1 PDSREP1.PRLDBS1B_RSSD  5           212582.54            212582.54
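The backlog arithmetic above (segments and blocks) can be scripted. A minimal sketch, assuming the two arguments are the First Segment.Block and Last Seg.Block values, with no leading zeros in the block part:

```shell
#!/bin/sh
# Approximate queue backlog in MB from the First Segment.Block and
# Last Seg.Block values of "admin who,sqm".
# 1 segment = 1 MB = 64 blocks of 16 KB.
queue_backlog_mb() {
  first_seg=${1%.*}; first_blk=${1#*.}
  last_seg=${2%.*};  last_blk=${2#*.}
  blocks=$(( (last_seg - first_seg) * 64 + (last_blk - first_blk) ))
  awk -v b="$blocks" 'BEGIN { printf "%.1f\n", b / 64 }'
}

# The 378:1 DBS.DBSglep_LC queue above, roughly 46 MB unprocessed:
# queue_backlog_mb 600474.19 600520.47
```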
1.2.1.4 Replication Agent
Sybase Replication Agent – An internal thread in Sybase ASE which scans the transaction log of the database. Make sure the agent is collecting marked data, forwarding transactions to the replication server, and, importantly, moving the secondary truncation point. To verify the state of the replication agent, the following stored procedure can be executed:
isql -U<user> -S<PrimaryDBSrv> -P<pwd> -D<DBName>
1> sp_help_rep_agent <DBName>, process
2> go

Dbname   Spid  sleep status  retry count  last error
DBSallc  19    end of log    0            0
The "sleep status" column in the output above shows the current activity of the replication agent:
Status       Comment
not running  RepAgent is not running.
not active   RepAgent is not in recovery mode.
initial      RepAgent is initializing in recovery mode.
end of log   RepAgent is in recovery mode and has reached the end of the transaction log.
unknown      None of the above.
Further, the "Spid" of the replication agent shown above can be verified using the stored procedure "sp_who".
isql -U<user> -S<PrimaryDBSrv> -P<pwd> -D<DBName>
1> sp_who "19"
2> go

fid  spid  status      loginame  origname  hostname  blk_spid  dbname   cmd        block_xloid
0    19    background  NULL      NULL                0         DBSallc  REP AGENT  0
1> select * from master..syslogshold
2> where dbid = db_id("DBSallc")
3> and name = "$replication_truncation_point"
4> go

dbid  reserved  spid  page   xactid          masterxactid    starttime           name                           xloid
5     0         0     21762  0x000000000000  0x000000000000  Jan 26 2009 9:00AM  $replication_truncation_point  0
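Movement of the secondary truncation point can be checked by sampling the page value from the syslogshold query above twice, some minutes apart. A minimal sketch of the comparison step (`sample_trunc_page` in the usage example is a hypothetical wrapper around the isql query):

```shell
#!/bin/sh
# Compare two samples of the syslogshold truncation-point page taken some
# minutes apart; an unchanged page on a busy database suggests the Rep Agent
# is not moving the secondary truncation point.
trunc_point_status() {
  earlier_page="$1"
  later_page="$2"
  if [ "$earlier_page" -eq "$later_page" ]; then
    echo "NOT-MOVING"
  else
    echo "MOVING"
  fi
}

# Example:
# p1=$(sample_trunc_page)   # hypothetical wrapper around the isql query
# sleep 300
# p2=$(sample_trunc_page)
# [ "$(trunc_point_status "$p1" "$p2")" = "NOT-MOVING" ] && \
#   echo "ALERT: secondary truncation point appears stalled"
```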
LTMs – Used for non-Sybase primary data only; an LTM collects the primary transaction log (i.e. the delta) and transfers it to the replication server. Monitoring its errorlog is as important as monitoring the Sybase Replication Agent.
1.2.1.5 Replication threads
SQM – Make sure the difference between "First Segment.Block" and "Last Seg.Block" (from "admin who,sqm") stays at or near zero. As mentioned earlier, the output below shows a difference of about 46 MB of data in the queue.
isql -U<user> -S<RepSrv> -P<pwd>
1> admin who,sqm
2> go

Info                         Duplicates  First Segment.Block  Last Seg.Block
16777372:0 PRLRMDBS1         167         294680.39            294680.39
615:0 PRSDBS1.DBSvmst        150         2735.24              2735.24
606:0 PRSDBS1.DBSglep        3           159995.25            159995.25
379:0 DBS.DBSvmst_LC         0           0.1                  0
378:1 DBS.DBSglep_LC         12          600474.19            600520.47
378:0 DBS.DBSglep_LC         0           0.1                  0
201:0 PDSREP1.PRLDBS1B_RSSD  9           8.9                  8.9
201:1 PDSREP1.PRLDBS1B_RSSD  5           212582.54            212582.54
Notice: 1 segment = 1 MB = 64 blocks (i.e. block size = 16 KB)
SQT – Using "admin who,sqt", look for any large transaction the SQT is processing that may be affecting the replication system. If the "Full" column is often observed at "1", the SQT cache size may be too small. The "Removed" column shows the number of transactions whose messages were moved out of the SQT cache (due to their size); if there are many, or even a single one for a long time, observe other columns such as "Open" or "First Trans" (with ST: O and a large number of Cmds).
The "First Trans" column contains information in three parts:
ST: followed by O/C/R/D (Open/Closed/Read/Deleted)
Cmds: followed by the number of SQL commands in the first transaction
<qid>: followed by the exact position of the first transaction (i.e. segment:block:row)
Make sure "Cmds" is changing and <qid> is increasing.
isql -U<user> -S<RepSrv> -P<pwd>
1> admin who,sqt
2> go

Info                              Closed  Read  Open  Trunc  Removed  Full  SQLBlocked  First Trans              Parsed SQL  Reader  Change Oqids  Detect Orphans
209:1 DIST ITDSREP1.ITRLDR1_RSSD  0       0     0     0      0        0     1                                    0           0       0             0
323:1 DIST GIT.nisdb_LC2          0       0     0     0      0        0     1                                    0           0       0             0
210:1 DIST GIT.nisdb_LC           0       0     0     0      0        0     1                                    0           0       0             0
324 DTSGIT1.nisdb                 112     0     0     112    0        0     0           st:C,cmds:3,qid:15:23:0  0           0       0             1
212 DOSNIS1A.nisdb                72      0     0     72     0        0     0           st:C,cmds:3,qid:21:58:0  0           0       0             1
214 DTSGIT1.nisdbdev              112     0     0     112    0        0     0           st:C,cmds:3,qid:435:4:0  0           0       0             1
209 ITDSREP1.ITRLDR1_RSSD         0       0     0     0      0        0     0                                    0           0       0             1
DIST – The "Status" column (from "admin who,dist") provides the current status of the thread, either "Normal" or "Ignoring". Other useful columns to look at are "PendingCmds" and "Duplicates".
isql -U<user> -S<RepSrv> -P<pwd>
1> admin who,dist
2> go

Info                       PrimarySite  Type  Status  PendingCmds  SqtBlocked  Duplicates  TransProcessed  CmdsProcessed  MaintUserCmds  NoRepDefCmds
200 PDSREP1.PRLDBS1A_RSSD  200          P     Normal  0            1           0           479             1625           0              0
543 DBS.u_DBScta_LC        544          L     Normal  0            1           0           344572          5245341        0              4550199
690 PDSDBS1.logindb        690          P     Normal  0            1           0           813             2439           0              0
376 DBS.DBSvend_LC         540          L     Normal  0            1           0           62246           203586         0              70644
375 DBS.DBSuomm_LC         539          L     Normal  0            1           0           6116            18166          0              0
283 DBS.DBSudef_LC         284          L     Normal  0            1           0           6168            18274          4              0
278 DBS.DBSfast_LC         281          L     Normal  0            1           0           12698           534422         4              503479
DSI – Make sure the DSI thread is up and running; it should not be down for a long time, in order to avoid a processing backlog. There are many columns in the "admin who,dsi" output, but the following few can quickly provide a status summary.
isql -U<user> -S<RepSrv> -P<pwd>
1> admin who,dsi
2> go

Status            Info                        Maintenance User     Xacts_skipped  TriggerStatus  ReplStatus
Awaiting Message  618 PRSDBS1.u_DBScta        u_DBScta_maint       0              off            on
Awaiting Message  200 PDSREP1.PRLDBS1A_RSSD  PRLDBS1A_RSSD_maint  0              on             on
Awaiting Message  1530 PDSDBS5.u_DBScta       u_DBScta_maint       0              off            off
Awaiting Message  1529 PDSDBS5.DBSuomm        DBSuomm_maint        0              off            off
Awaiting Message  1528 PDSDBS5.DBSudef        DBSudef_maint        0              off            off
Awaiting Message  544 PDSDBS1.u_DBScta        u_DBScta_maint       0              on             on
Awaiting Message  690 PDSDBS1.logindb         logindb_maint        0              on             on
Awaiting Message  540 PDSDBS1.DBSvend         DBSvend_maint        0              on             on
Awaiting Message  539 PDSDBS1.DBSuomm         DBSuomm_maint        0              on             on
Awaiting Message  284 PDSDBS1.DBSudef         DBSudef_maint        0              on             on
Awaiting Message  281 PDSDBS1.DBSfast         DBSfast_maint        0              on             on
Awaiting Message  691 PRSDBS1.logindb         logindb_maint        20             on             on
Awaiting Message  614 PRSDBS1.DBSvend         DBSvend_maint        0              off            on
Awaiting Message  1526 PDSDBS5.DBSfast        DBSfast_maint        0              off            off
Awaiting Message  1527 PDSDBS5.DBSvend        DBSvend_maint        0              off            off
Other quick commands to check the status of DSIs are as follows:
1> admin who_is_down
2> go
1> admin who_is_up
2> go

RSI – Make sure the route's status is up.
isql -U<user> -S<RepSrv> -P<pwd>
1> admin who,rsi
2> go

Spid  State            Info       Packets Sent  Bytes Sent  Blocking Reads
38    Awaiting Wakeup  PRLRMDBS1  2655009       426460847   383262
Locater Sent:    0x0000000000000000000000000000000000000000000000000000000000014c5900320002
Locater Deleted: 0x0000000000000000000000000000000000000000000000000000000000014c5900320002
DAEMONS – Make sure to monitor the status of the daemons dALARM, dAIO, dSUB, and dCM regularly.
isql -U<user> -S<RepSrv> -P<pwd>
1> admin who
2> go
Spid  Name       State             Info
58    DIST       Awaiting Wakeup   376 DBS.DBSvend_LC
67    SQT        Awaiting Wakeup   376:1 DIST DBS.DBSvend_LC
29    SQM        Awaiting Message  376:1 DBS.DBSvend_LC
28    SQM        Awaiting Message  376:0 DBS.DBSvend_LC
82    DSI EXEC   Awaiting Command  540(1) PDSDBS1.DBSvend
42    DSI        Awaiting Message  540 PDSDBS1.DBSvend
8427  REP AGENT  Awaiting Command  PDSDBS1.DBSvend
8203  DSI EXEC   Awaiting Command  1527(1) PDSDBS5.DBSvend
8202  DSI        Awaiting Message  1527 PDSDBS5.DBSvend
4646  DSI EXEC   Awaiting Command  614(1) PRSDBS1.DBSvend
4645  DSI        Awaiting Message  614 PRSDBS1.DBSvend
32    SQM        Awaiting Message  614:0 PRSDBS1.DBSvend
38    RSI        Awaiting Wakeup   PRLRMDBS1
37    SQM        Awaiting Message  16777372:0 PRLRMDBS1
86    RSI USER   Awaiting Command  PRLRMDBS1
54    dSUB       Sleeping
15    dCM        Awaiting Message
18    dAIO       Awaiting Message
62    dREC       Sleeping          dREC
63    dSTATS     Sleeping
1152  USER       Active            sa
14    dALARM     Awaiting Wakeup
1.2.2 Errorlog
1.2.2.1 Database errorlogs
By default located at $SYBASE/$SYBASE_ASE/install.
Primary database errorlog – Look especially for errors related to the Sybase Replication Agent and/or any corruption in the primary database.

00:00000:00586:2008/12/13 14:10:13.30 server Started Rep Agent on database, 'DBSglep' (dbid = 21).
02:00000:00586:2008/12/13 14:10:13.45 server Error: 692, Severity: 20, State: 1
02:00000:00586:2008/12/13 14:10:13.45 server Uninitialized logical page '1498656' was read while accessing object '8' in database '21'. Please contact Sybase Technical Support.
02:00000:00586:2008/12/13 14:10:13.45 server Rep Agent Thread for database 'DBSglep' (dbid = 21) terminated abnormally with error. (major 0, minor 92)
Replicate database errorlog – Verify the replicate site is up and running with adequate resources (i.e. not running out of connections, locks, log space, etc.).

00:00000:00006:2007/10/13 16:21:55.88 server Error: 1105, Severity: 17, State: 4
00:00000:00006:2007/10/13 16:21:55.89 server Error: 1105, Severity: 17, State: 4
00:00000:00006:2007/10/13 16:21:55.89 server Can't allocate space for object 'syslogs' in database 'DBSbank' because 'logsegment' segment is full/has no free extents. If you ran out of space in syslogs, dump the transaction log. Otherwise, use ALTER DATABASE to increase the size of the segment.

RSSD database errorlog – A very critical database for the replication server; failure of the RSSD may cause failure of the replication system. Look at the currently available space in all segments, and make sure the server is up and running with adequate resources (i.e. not running out of connections, locks, etc.).
00:00000:00068:2008/07/31 16:27:18.07 server RepAgent(7): Received the following error message from the Replication Server: Msg 11060. CT/CS Lib function 'ct_results' failed. Retcode = 0..
00:00000:00068:2008/07/31 16:27:18.07 server Error: 9261, Severity: 20, State: 0
00:00000:00068:2008/07/31 16:27:18.07 server RepAgent(7): This Rep Agent Thread is aborting due to an unrecoverable communications or Replication Server error.
00:00000:00068:2008/07/31 16:27:18.07 server Rep Agent Thread for database 'PRLDBS1D_RSSD' (dbid = 7) terminated abnormally with error. (major 92, minor 61)
1.2.2.2 Replication errorlog
Use "admin log_name" (as shown below) to find the location of the errorlog, where the replication server records informational and error messages.
isql -U<user> -S<RepSrv> -P<pwd>
admin log_name
go

Output will look like this:
Log File Name
-------------
/cis1/PRLDBS1A/log/PRLDBS1A.log

The following message codes are important to observe:
I - Informational message
W - Warning
E - Error
H - A replication thread died
F - The replication server died due to a serious error
N - Internal error

Below is a classic error from the replication server errorlog:
E. 2007/01/27 22:45:12. ERROR #1028 DSI EXEC(1010(1) PDSDBS5.u_DBScta) - dsiqmint.c(3071)
Message from server: Message: 2601, State 2, Severity 14 -- 'Attempt to insert duplicate key row in object 'esa_invoice_job' with unique index 'pk_esa_invoice_job''.
I. 2007/01/27 22:45:12. Message from server: Message: 3621, State 0, Severity 10 -- 'Command has been aborted.'.
H. 2007/01/27 22:45:12. THREAD FATAL ERROR #5049 DSI EXEC(1010(1) PDSDBS5.u_DBScta) - dsiqmint.c(3078)
The DSI thread for database 'PDSDBS5.u_DBScta' is being shutdown. DSI received data server error #2601 which is mapped to STOP_REPLICATION. See logged data server errors for more information. The data server error was caused by output command #1 mapped from input command #2 of the failed transaction.
I. 2007/01/27 22:45:12. The DSI thread for database 'PDSDBS5.u_DBScta' is shutdown.
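The one-letter message codes above make the replication errorlog easy to scan mechanically. A minimal sketch that counts the serious codes (E, H, F, N), assuming the standard "X. " line prefix format:

```shell
#!/bin/sh
# Count serious messages (Error, thread death, Fatal, iNternal) in a
# Replication Server errorlog by their one-letter line prefixes.
serious_msg_count() {
  grep -c '^[EHFN]\. ' "$1"
}

# Example:
# n=$(serious_msg_count /cis1/PRLDBS1A/log/PRLDBS1A.log)
# [ "$n" -gt 0 ] && echo "ALERT: $n serious messages in the RS errorlog"
```

In practice the monitor should remember its last-read offset so each run only scans new log lines.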
If a stable device partition has failed (or is not available), look for failure messages in the errorlog as shown below:
I. 2008/08/22 00:38:36. Embedded database id is '101'.
E. 2008/08/22 00:38:36. ERROR #6078 GLOBAL RS(GLOBAL RS) - sun_svr4.c(139)
Could not open file '/dev/rdsk/c2t0d0s5'. System error 'No such file or directory(2)'
I. 2008/08/22 00:38:36. Unable to open partition '/dev/rdsk/c2t0d0s5'.
E. 2008/08/22 00:38:37. ERROR #6021 GLOBAL RS(GLOBAL RS) - m/sqmext.c(2086)
Stable queue '101:1' cannot be started. It is on a failed partition 'sq1'.
E. 2008/08/22 00:38:37. ERROR #6034 GLOBAL RS(GLOBAL RS) - m/sqmext.c(1259)
Cannot start the stable queue named '101:1'
W. 2008/08/22 00:38:37. WARNING #6131 GLOBAL RS(GLOBAL RS) - qm/sqmsp.c(2689)
Replication Server has no partitions.
I. 2008/08/22 00:38:38. Replication Server 'ITRLID1' is started.
I. 2008/08/22 00:38:38. DIST for 'ITDSREP1.ITRLID1_RSSD' is Starting
E. 2008/08/22 00:38:38. ERROR #30020 DIST(101 ITDSREP1.ITRLID1_RSSD) - xec/dist.c(1647)
Unable to start distributor thread for queue '101'.
I. 2008/08/22 00:38:38. The distributor for 'ITDSREP1.ITRLID1_RSSD' is shutting down
I. 2008/08/22 00:38:38. The DSI thread for database 'ITDSREP1.ITRLID1_RSSD' is started.
I. 2008/08/22 00:38:38. SQM starting: 101:0 ITDSREP1.ITRLID1_RSSD
I. 2008/08/22 00:39:28. Replication Agent for ITDSREP1.ITRLID1_RSSD connected in passthru mode.
E. 2008/08/22 00:39:28. ERROR #14023 REP AGENT(ITDSREP1.ITRLID1_RSSD) - /execint.c(3463)
SQM had an error writing to the inbound-queue.
I. 2008/08/22 00:39:55. Shutting down.
Later, after executing "rebuild queues" to recover the replication server, look for loss detection in the errorlog:
I. 2008/08/22 01:02:00. Partition 'SQ2' is added.
I. 2008/08/22 01:48:06. Partition 'sq1' is in the process of being dropped.
I. 2008/08/22 01:50:26. Rebuild Queues: Starting
I. 2008/08/22 01:50:27. Resetting Replication Agent starting log position for ITDSREP1.ITRLID1_RSSD
I. 2008/08/22 01:50:27. Shutting down the DSI thread for 'ITDSREP1.ITRLID1_RSSD'.
I. 2008/08/22 01:50:27. The DSI thread for database 'ITDSREP1.ITRLID1_RSSD' is shutdown.
I. 2008/08/22 01:50:27. DSI: enabled loss detection for 'ITDSREP1.ITRLID1_RSSD'.
I. 2008/08/22 01:50:27. Rebuild queues: deleting queue 101:1
I. 2008/08/22 01:50:27. Rebuild queues: done rebuilding queue 101:1. Restarting.
I. 2008/08/22 01:50:27. Rebuild queues: deleting queue 101:0
I. 2008/08/22 01:50:27. SQM starting: 101:1 ITDSREP1.ITRLID1_RSSD
I. 2008/08/22 01:50:27. SQM stopping: 101:0 ITDSREP1.ITRLID1_RSSD
I. 2008/08/22 01:50:27. Rebuild queues: done rebuilding queue 101:0. Restarting.
I. 2008/08/22 01:50:27. SQM starting: 101:0 ITDSREP1.ITRLID1_RSSD
I. 2008/08/22 01:50:28. Starting DIST for 101:1.
I. 2008/08/22 01:50:28. DIST for 'ITDSREP1.ITRLID1_RSSD' is Starting
I. 2008/08/22 01:50:28. Starting the DSI thread for 'ITDSREP1.ITRLID1_RSSD'.
I. 2008/08/22 01:50:28. The DSI thread for database 'ITDSREP1.ITRLID1_RSSD' is started.
I. 2008/08/22 01:50:28. Rebuild Queues: Complete
The loss can also be detected by querying RSSD tables as shown below:
isql -U<user> -S<RSSDSrv> -P<pwd> -D<RSSD>
1> select dsname, dbname from rs_databases
2> where dbid in (select distinct
3>        case when origin_lsite_id = 0 then origin_site_id
4>             else origin_lsite_id end
5>        from rs_oqid where valid > 0)
6> go
1.2.2.3 LTM errorlog
In a hybrid replication system, it is important to monitor the LTM process's log for its status and any other related errors (i.e. errors connecting to the replication server, etc.).
1.2.2.4 Dbltm (i.e. Rep Agent for ERSSD)
Make sure to monitor this process if routes are used and the respective replication servers use an ERSSD.
1.2.3 Disk space
Database segments to monitor:
• The primary database's log segment
• All segments of the replicate database(s)
• All segments of the RSSD
isql -U<user> -S<DBSrv> -P<pwd> -D<DBName>
sp_helpsegment <SegmentName>
go
segment  name        status
-------  ----------  ------
2        logsegment  0

device  size
------  -------
raw07   320.0MB

free_pages
----------
163197

table_name  index_name  indid
----------  ----------  -----
syslogs     syslogs     0

total_size  total_pages  free_pages  used_pages  reserved_pages
----------  -----------  ----------  ----------  --------------
320.0MB     163840       163197      643         0
Threshold monitoring can be set up for all required segments of the database.
Replication stable device – Use "admin disk_space" to monitor all stable devices.
isql -U<user> -S<RepSrv> -P<pwd>
admin disk_space
go
Partition                     Logical  Part.Id  Total Segs  Used Segs  State
/dev/vx/rdsk/sybase2/raw2g14  SQM14    112      2000        0          ON-LINE//
/dev/vx/rdsk/sybase2/raw2g13  SQM13    111      2000        6          ON-LINE//
/dev/vx/rdsk/sybase2/raw2g09  SQM1     110      2000        6          ON-LINE//

File system managed by the operating system (i.e. disk space for errorlogs) – Monitor the filesystem space used by all servers (i.e. database servers and replication servers). For example, a Sybase Replication Server installation on Sun Solaris 10 can be monitored using the simple "df" unix command:
hypnos-mis-/cis1/PRLDBS1A/log> df -k $SYBASE
Filesystem kbytes used avail capacity Mounted on
/dev/vx/dsk/sybase1/cis1_fs 10485760 5013734 5139665 50% /cis1
1.2.4 Replication topology
Tools like Sybase Central and Sybase PowerDesigner can generate a graphical replication topology for an organization. For a large and complex replication system, regularly monitor the topology for any changes made to the replication system. The changes include, but are not limited to:
• Enabling data replication between two sites
• Disabling data replication between two sites
• Changing the direction of data replication between two sites
1.2.4.1 Monitoring changes to table schema
The schema of all required databases/tables needs to be verified at least between:
• The primary site and the replicate site(s)
• The primary site and the respective replication definition
This step is critical especially during application upgrades, when the database schema is most likely to change. Many tools/methods can be used to find the schema of a table or the replication definition of the respective table, including but not limited to:
• Sybase Central
• Sybase PowerDesigner (for replication, requires the "Information Liquidity Model")
• Shell/Perl scripts
• Simple SQL commands (i.e. "sp_help" for the table schema and "rs_helprep" for the replication definition)
• Other third-party tools
1.2.4.2 Marking for replication
This is to make sure all required databases or database objects (mostly user tables) are marked for replication. The following sample stored procedure provides brief information:
create procedure sps_check_for_repmrk as
declare @dbnm varchar(100)
select @dbnm = db_name()
if (getdbrepstat() >= 0)
if exists (select 1 from sysobjects where type = 'U' and sysstat & -32768 = -32768)
select name from sysobjects where type = 'U' and sysstat & -32768 = -32768
else
select "No objects in database "+@dbnm+" are marked for replication"
else
select "Entire database "+@dbnm+" is marked for replication"
1.2.4.3 Data consistency
Regularly (weekly, monthly, or at least before important dates such as quarter-end or year-end closing) make sure data is consistent between the primary and replicate sites. The Sybase-provided tool "rs_subcmp" (on Windows it is called "subcmp") can be used to find data inconsistencies between primary and replicate sites. For example, to find data inconsistencies between the primary site (server name = PDSDBS1, database name = DBSCOMMON, table name = attach) and its replicate site, create a configuration file (using "vi" or another editor):
# attach.cfg – This is the file name
# PDSDBS1.DBSCOMMON.dbo.attach with
# PRSDBS1.DBSCOMMON.dbo.attach.
#
PDS = PDSDBS1
RDS = PRSDBS1
PDB = DBSCOMMON
RDB = DBSCOMMON
PTABLE = attach
RTABLE = attach
PSELECT = select wijt_location,caller_id,create_date,file_name,description,file_contents,chgstamp
from attach
order by caller_id, create_date, file_name
RSELECT = select wijt_location,caller_id,create_date,file_name,description,file_contents,chgstamp
from attach
order by caller_id, create_date, file_name
PUSER = svr_maint
RUSER = svr_maint
PPWD = forget1t
RPWD = forget1t
KEY = caller_id
KEY = create_date
KEY = file_name
RECONCILE = N
VISUAL = Y
NUM_TRIES = 3
WAIT = 10
Then simply use the above configuration file as below to find data inconsistencies:
$SYBASE/$SYBASE_REP/bin/rs_subcmp -f attach.cfg
A separate user (e.g., svr_maint) can be created and bound to its own user-defined temporary database (which can also be bound to a user-defined cache) to avoid resource competition with other users. Another option is to use command-line switches for "rs_subcmp". In order to sync the whole database, it is recommended to create a batch process consisting of one "rs_subcmp" for every user table in the database. The following script can be used to generate the required script (i.e., the script that will actually verify/sync using rs_subcmp commands); it may require a few modifications for the local environment.
isql -Usa -SPDSDBS1 -DDBSCOMMON <<EOF
create table #table_list
(
id int ,
uu int
)
GO
create table #table_def
(
colid tinyint
,name char(30)
)
GO
insert #table_list
select id,uid
from sysobjects
where type='U'
and name not like 'rs_%'
order by name
GO
declare cursor_tabs cursor for
select id,uu from #table_list
GO
declare @dbname varchar(30)
,@tabid int
,@tabname varchar(100)
,@msg varchar(255)
,@pmsg varchar(255)
,@colid tinyint
,@indid int
,@counter int
,@colname varchar(30)
,@uu int
,@uuc varchar(100)
select @tabid = 0
open cursor_tabs
fetch cursor_tabs into @tabid,@uu
while (@@sqlstatus = 0)
begin
select @uuc = user_name(@uu)
setuser @uuc
select @tabname = object_name(@tabid)
insert #table_def
select A.colid, A.name
from syscolumns A
where A.id = @tabid
order by A.colid
select @msg = 'rs_subcmp -SPDSDBS1 -DDBSCOMMON -sPRSDBS1 -dDBSCOMMON -c"select * from '+@tabname+' order by '
print @msg
select @indid = min(indid)
from sysindexes
where id = @tabid
and indid > 0
and (status & 2) = 2
select @pmsg=' '
if (@indid <> NULL)
begin
select @counter = 1
while @counter <= 16
begin
select @colname = index_col(@tabname, @indid, @counter)
if (@colname is NULL)
break
if (@counter > 1)
select @pmsg = @pmsg + ", "
select @pmsg = @pmsg + rtrim(@colname)
select @counter = @counter + 1
end
end
else
begin
select @colid = 0
while (select min(colid) from #table_def where colid > @colid) != NULL
begin
select @colid = min(colid) from #table_def where colid > @colid
select @pmsg = @pmsg + convert(varchar(30),name)
from #table_def
where colid = @colid
if (select count(*) from #table_def) > 1
and exists (select * from #table_def where colid>@colid)
begin
select @pmsg = @pmsg + ","
end
end
end
select @msg = @msg + @pmsg + '" -u'+@uuc+' -U'+@uuc+' -t'+@tabname+' -V -k'+@pmsg
print @msg
truncate table #table_def
fetch cursor_tabs into @tabid,@uu
select @uuc = user_name(@uu)
setuser
end
GO
close cursor_tabs
GO
deallocate cursor cursor_tabs
GO
EOF
1.3 Performance monitoring
The following diagram lists the various performance areas to measure/monitor for effective monitoring.
1.3.1 Latency
The difference (generally measured in seconds) between "work" done in the primary database and in the replicate database. Many methods can be used to determine this difference, depending on how exactly latency is defined. Latency can be calculated for a single transaction, a batch job, or an entire database system.
1.3.1.1 Rs_lastcommit – Replication Server maintains this table (i.e. "rs_lastcommit") in every replicate database; it stores the most recent committed transaction from each specific source/primary site. This is NOT the best method, since the timings in the table are generally NOT accurate, and because it reports ONLY the last committed transaction it is difficult to link it to the respective primary transaction. Also, in a large complex environment it is difficult to identify the latency of individual batch processes or single transactions. For example, below is the output from one replicate database:
isql -U<user> -S<Replicate_DBSrv> -P<pwd> -D<DBName>
select origin_time,dest_commit_time from rs_lastcommit
go
origin_time               dest_commit_time
Jan 28 2009  2:02PM       Jan 28 2009  2:52PM
1.3.1.2 Heartbeat – This Sybase Central feature monitors latency in a replication system by creating a replication-enabled table (called "rsm_heartbeat") and modifying it at a frequent interval. It presents latency in a convenient graphical form. Restrictions include:
• Must use Sybase Central (i.e., must be connected to both primary and replicate)
• The latency measure is only good for single-row updates
To configure Heartbeat using Sybase Central, select the database connection (i.e., the primary database connection for which heartbeat needs to be configured), right-click, and select "Heartbeats". Complete detailed steps are available in the "Help" section of the RS plug-in (see the screen shot below).
1.3.1.3 Manually managed ping/time table(s) – User-defined table(s) can be created with columns defaulting to the time on the respective database servers. Inserts into these tables can be done before/after/during (i.e., based on the application or batch job), and latency can be derived by comparing the values in the tables. For example, the table can be defined as follows in the primary database:
isql -U<user> -S<PDBSrv> -P<pwd> -D<DBName>
Create table PRLREP1_timer
( daats_id int,
p_dt datetime default getdate()
)
Go
Create unique clustered index PRLREP1_timer_idx01 on PRLREP1_timer (daats_id)
Go
Grant all on PRLREP1_timer to public
go
On the replicate database create a similar table as below:
isql -U<user> -S<RDBSrv> -P<pwd> -D<DBName>
Create table PRLREP1_timer
( daats_id int,
p_dt datetime,
r_dt datetime default getdate()
)
Go
Create unique clustered index PRLREP1_timer_idx01 on PRLREP1_timer (daats_id)
Go
Grant all on PRLREP1_timer to public
go
Once replication setup is completed for the table created above, the first two columns will be replicated from the primary values. Now, to calculate latency, insert values into the table before and after a large batch of transactions:
isql -U<user> -S<PDBSrv> -P<pwd> -D<DBName>
Insert into PRLREP1_timer select max(daats_id) +1 from PRLREP1_timer
Go
/* EXECUTE BATCH PROCESS */ Insert into attach ….
Update attach set …
……
Insert into attach ….
Update attach set …
Go
Insert into PRLREP1_timer select max(daats_id) +1 from PRLREP1_timer
go
After the batch load is completed and replicated, the following SQL can be used (on the replicate database) to calculate latency:
isql -U<user> -S<RDBSrv> -P<pwd> -D<DBName>
Select datediff(ss, min(p_dt), max(r_dt)) from PRLREP1_timer
go
Clocks for primary and replicate site must be synchronized to measure latency effectively.
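Where the timer rows are exported from the replicate database (for example via bcp), the same datediff computation can be sketched off-line. The Python below is illustrative only; the (p_dt, r_dt) string pairs and timestamp format are assumptions following the PRLREP1_timer example above:

```python
from datetime import datetime

def batch_latency(rows):
    """Latency of a replicated batch from (p_dt, r_dt) string pairs
    exported from the replicate copy of PRLREP1_timer: seconds between
    the first primary insert and the last replicated arrival."""
    fmt = "%Y-%m-%d %H:%M:%S"
    p_times = [datetime.strptime(p, fmt) for p, _ in rows]
    r_times = [datetime.strptime(r, fmt) for _, r in rows]
    return (max(r_times) - min(p_times)).total_seconds()

# Two marker rows bracketing a batch (hypothetical values)
rows = [("2009-01-28 14:02:00", "2009-01-28 14:02:05"),
        ("2009-01-28 14:02:30", "2009-01-28 14:02:41")]
print(batch_latency(rows))  # 41.0
```

As with the SQL version, this is only meaningful when the primary and replicate clocks are synchronized.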
1.3.1.4 Rs_ticket – An rs_ticket can be thought of as a message that travels from the primary database to the replicate database, hopping through the following replication threads:
• EXEC
• DIST
• DSI
At each hop (i.e., as the traveling message is handled by a particular replication thread) the time is appended to the message. Once the message arrives at the destination (i.e., the replicate database), the stored procedure "rs_ticket_report" can be modified to append the arrival time and store the entire message in a user-defined table. That table can then be used for further analysis (i.e., measuring latency). The rs_ticket_report function string must be enabled by altering the replicate connection:
isql -U<user> -S<RepSrv> -P<pwd>
1> alter connection to servername.databasename set 'dsi_rs_ticket_report' to 'on'
2> go
On the replicate site the stored procedure "rs_ticket_report" can be modified as follows:
isql -U<user> -S<RDBSrv> -P<pwd> -D<DBName>
create procedure rs_ticket_report
(@rs_ticket_param varchar(255))
as
begin
set nocount on
declare @new_cmd varchar(255),
@c_time datetime,
@c_secs numeric(6,3)
select @c_time = getdate()
select @c_secs = datepart( millisecond, @c_time)
select @c_secs = datepart( second, @c_time) + @c_secs/1000
select @new_cmd =
@rs_ticket_param + ";RDB(" + db_name()+ ")="
+ convert( varchar(2), datepart( hour, @c_time))
+ ":" + convert( varchar(2), datepart( minute, @c_time))
+ ":" + convert( varchar(6), @c_secs)
insert daats_tkt values (@new_cmd)
end
On the primary site, execute the following to gather performance-related data:
isql -U<user> -S<PrimaryASE> -P<pwd> -D<PrimaryDB>
1> exec rs_ticket 'BEGIN BATCH PROCESS'
2> EXECUTE BATCH PROCESS
3> exec rs_ticket 'COMPLETE BATCH PROCESS'
4> go
On the replicate site:
isql -U<user> -S<RepASE> -P<pwd> -D<RepDB>
select * from daats_tkt
go
The output should look like:
####################################################################
V=1;H1=BEGIN BATCH PROCESS;PDB(pdsdbs1)=09:51:49.180;EXEC(29)=09:51:49.0;B(29)=43690;DIST(20)=09:51:52.0;DSI(27)=09:51:55.0;RDB(prsdbs1)=09:51:55.413
V=1;H1=COMPLETE BATCH PROCESS;PDB(pdsdbs1)=09:51:49.193;EXEC(29)=09:51:49.0;B(29)=44894;DIST(20)=09:51:52.0;DSI(27)=09:51:55.0;RDB(prsdbs1)=09:51:55.413
####################################################################
Understanding the output from the "rs_ticket" process:
• V – Version number
• H – Header information; the string passed to "rs_ticket" at the primary site
• PDB – Primary database name and the time (from the host clock) rs_ticket was executed
• EXEC – spid of the user who executed rs_ticket in the primary, and the time
• B – Total bytes received from the Replication Agent, and the spid number. In this case total bytes received = 43690
• DIST – spid number (shown in "admin who") and the time rs_ticket passed through the DIST
• DSI – spid number (shown in "admin who") and the time rs_ticket passed through the DSI
• RDB – Replicate database name and the time rs_ticket_report was called to add the message to the results table (i.e. daats_tkt)
As shown in the above output, latency can be calculated as the difference between the time the "rs_ticket" message arrived at the replicate database and the time it was sent from the primary database, i.e., approximately six seconds (09:51:55.413 - 09:51:49.180 = ~6.2 seconds).
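For analysis across many tickets, the timestamp fields can be parsed programmatically. The sketch below (illustrative Python, not part of Replication Server) extracts the PDB and RDB times from one result line and computes the latency; it assumes both stamps fall on the same day, and the field layout follows the sample output above:

```python
import re

def ticket_latency(ticket):
    """Seconds between the PDB(...) and RDB(...) timestamps in one
    rs_ticket result line (both stamps assumed to be from the same day)."""
    stamps = dict(re.findall(r"(PDB|RDB)\([^)]*\)=(\d+:\d+:[\d.]+)", ticket))
    def secs(t):
        h, m, s = t.split(":")
        return int(h) * 3600 + int(m) * 60 + float(s)
    return secs(stamps["RDB"]) - secs(stamps["PDB"])

line = ("V=1;H1=BEGIN BATCH PROCESS;PDB(pdsdbs1)=09:51:49.180;"
        "EXEC(29)=09:51:49.0;B(29)=43690;DIST(20)=09:51:52.0;"
        "DSI(27)=09:51:55.0;RDB(prsdbs1)=09:51:55.413")
print(round(ticket_latency(line), 3))  # 6.233
```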
1.3.2 Throughput – Throughput can be calculated by measuring latency (using the methods defined above) for a given amount of "work" (for example, 1000 transactions or total bytes transferred). Many commands can be used to find out how much data (in bytes) is processed through the Replication Server; for example, in "admin who,sqm" the column "Bytes" shows the total number of bytes written. First run "admin who,sqm" and note the Bytes column for the connection of interest (only three columns are shown below):
State            Info                             Bytes
Awaiting Message 16777372:0 PRLRMDBS1             268793702
Awaiting Message 615:0 PRSDBS1.DBSvmst            24000982
Awaiting Message 606:0 PRSDBS1.DBSglep            -926256804
Awaiting Message 379:1 DBS.DBSvmst_LC             81859762
Awaiting Message 379:0 DBS.DBSvmst_LC             0
Awaiting Message 378:1 DBS.DBSglep_LC             -1344189992
Awaiting Message 378:0 DBS.DBSglep_LC             0
Awaiting Message 248:1 DBS.DBSwact_LC             -1605776577
Awaiting Message 248:0 DBS.DBSwact_LC             0
Awaiting Message 236:1 DBS.DBSarpc_LC             68229190
Awaiting Message 236:0 DBS.DBSarpc_LC             0
Awaiting Message 231:1 DBS.DBSallc_LC             120230729
Awaiting Message 231:0 DBS.DBSallc_LC             0
Awaiting Message 201:0 PDSREP1.PRLDBS1B_RSSD      257502
Awaiting Message 201:1 PDSREP1.PRLDBS1B_RSSD      1330086096
Now process the batch load as described in the previous sections (i.e., to calculate latency), and at the completion of the batch replication execute "admin who,sqm" again:
State            Info                             Bytes
Awaiting Message 16777372:0 PRLRMDBS1             268793702
Awaiting Message 615:0 PRSDBS1.DBSvmst            24002982
Awaiting Message 606:0 PRSDBS1.DBSglep            -926256804
Awaiting Message 379:1 DBS.DBSvmst_LC             81859762
Awaiting Message 379:0 DBS.DBSvmst_LC             0
Awaiting Message 378:1 DBS.DBSglep_LC             -1344189992
Awaiting Message 378:0 DBS.DBSglep_LC             0
Awaiting Message 248:1 DBS.DBSwact_LC             -1605745641
Awaiting Message 248:0 DBS.DBSwact_LC             0
Awaiting Message 236:1 DBS.DBSarpc_LC             68229190
Awaiting Message 236:0 DBS.DBSarpc_LC             0
Awaiting Message 231:1 DBS.DBSallc_LC             120230729
Awaiting Message 231:0 DBS.DBSallc_LC             0
Awaiting Message 201:0 PDSREP1.PRLDBS1B_RSSD      257502
Awaiting Message 201:1 PDSREP1.PRLDBS1B_RSSD      1330088366
As shown, a total of 2000 bytes (24002982 - 24000982) were processed by the Replication Server for this connection during the batch load. To calculate throughput, divide the total bytes transferred by the latency measured over the same window using one of the methods above (e.g., rs_ticket). Another method to obtain the total bytes transferred (in order to calculate throughput) is the "admin statistics,SQM,BytesWritten" command. Make sure to reset the counters before starting the large batch in the primary database by executing "admin statistics,reset".
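The negative values in the Bytes column above suggest the counter is a wrapping 32-bit signed value, so a raw subtraction of two samples can mislead once the counter rolls over. A hedged Python sketch of the throughput arithmetic (the 32-bit wrap width is an assumption):

```python
def bytes_delta(before, after, width=32):
    """Bytes written between two admin who,sqm samples. The Bytes column
    is treated as a wrapping signed counter (an assumption suggested by
    the negative values in the sample output), so the delta is taken
    modulo 2**width."""
    return (after - before) % (2 ** width)

def throughput(before, after, latency_secs):
    """Bytes per second over the measured window."""
    return bytes_delta(before, after) / latency_secs

# PRSDBS1.DBSvmst samples from the output above
print(bytes_delta(24000982, 24002982))  # 2000
# Divide by the latency measured over the same window, e.g.:
# throughput(24000982, 24002982, measured_latency_secs)
```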
1.3.3 Statistics monitoring (i.e., monitor counters) – Monitors and counters can be used to monitor replication in greater detail, including but not limited to:
• Finding "ignored" transactions
• Managing segments
• Finding transaction sizes
• Finding command sizes
• Calculating throughput and latency
• Calculating the read/write rate for the stable device
• Helping configure replication optimally (i.e., sizing the SQT cache, parallel DSI, etc.)
Monitoring counters for Replication Server version 12.6 was done using the following two commands:
isql -U<user> -S<RepSrv> -P<pwd>
1> admin who
2> go
.
.
1> admin statistics
2> go
Counters provide details for the following replication modules:
• CM
• DIST
• DSI
• DSIEXEC
• REPAGENT
• RSI
• SQM
• SQT
• STS
The counters can be distinctly identified for each instance (i.e., occurrence) of a module. Counters for a single-instance module can be identified by the module name alone (for example STS, CM). For multi-instance modules they can be identified in the following two ways:
• Module name and instance ID (i.e., LDBID, DBID) – for example RSI/DSI-S/DIST
• Module name, instance ID, and instance value – for example SQT/DSI-Exec
Replication monitors and counters can be assigned to one of the following groups based on their outputs (i.e., the characteristics of the generated statistics):
• Observers – report a number of occurrences
• Monitors – report their current value
• Counters – report collected statistics
Additionally, each replication monitor and counter can have one or more of the following statuses describing how its results are calculated:
• CNT_SYSMON – can be used by the "admin statistics, sysmon" command
• CNT_MUST_SAMPLE – results are always in sampled form
• CNT_NO_RESET – cannot be reset (i.e., initialized)
• CNT_DURATION – measures durations
• CNT_KEEP_OLD – keeps its current and previous values
• CNF_CONFIGURE – keeps the current value of a replication configuration parameter
1.3.3.1 Replication counters, version 12.6 – Use the stored procedure "rs_helpcounter" (in the related RSSD) to find detailed information on each counter. In this version, counters are categorized into the following types:
• Total
• Last
• Max
• Avg
The following methods can be used to monitor the counters.
Using "admin statistics" – A quick and easy way to monitor the replication counters. For example, simply executing "admin statistics,sysmon" lists all non-intrusive counters in that category. Detailed explanations of those counters can be obtained by executing "rs_helpcounter sysmon".
Configuring Replication Server to collect counters – This is the recommended method for monitoring the replication counters in order to derive performance and other useful statistics for the replication system. The high-level steps for the setup are below.
Start sampling for all counter types (intrusive/non-intrusive) – The following commands can be used:
configure replication server set "stat_sampling" to "on"
admin stats_intrusive_counter, "on"
Start collecting the counters into the RSSD (make sure to monitor the RSSD database for space) – This step collects the counters into the following RSSD tables at a regular interval:
rs_statcounters – Details about ALL counters.
rs_statdetail – Collected counters
rs_statrun – Stored statistical information for each collection (i.e. flush to RSSD)
The following commands can be used to set up the collection:
1. configure replication server set "stat_flush_rssd" to "on"
2. configure replication server set "stat_reset_after_flush" to "on"
3. configure replication server set "stat_daemon_sleep_time" to "<seconds>" (a time value in seconds, not "on"/"off")
4. admin statistics, reset (optional – resets the counters manually at any point in time)
Filter the collection – Once sampling is started the Replication Server collects many counters. Filters can be added so that only the counters specific to the required replication modules are collected into the RSSD tables. The following commands can be used to set up the filters:
1. admin stats_config_module
2. admin stats_config_connection
3. admin stats_config_route
4. admin statistics, flush_status
1.3.3.2 Replication counters, version 15.0 – Unlike version 12.6, in this version counters are not categorized into various types (i.e., Total, Last, Max, Avg); instead every counter collects:
• Number of observations
• Total of the observations
• Last observed value
• Maximum observed value
Also, starting with version 15.0 there are no intrusive counters, and collecting/monitoring replication monitors/counters can be achieved in very simple steps. Basically, using the command "admin stats" it is required to define:
• Which statistics to collect
• The final destination of the collected counters (i.e., screen or RSSD)
• How long to collect (i.e., the sampling period and number of observations)
After executing "admin stats" the user is returned to the Replication Server command prompt, from where the user can exit the session or continue with other work. Later, "admin stat,status" can be used to view the progress of a previously executed "admin stats" collection, and "admin stat,cancel" can be used at any time to stop it. Once the counters are collected, the RSSD tables (mentioned above) can be queried for further analysis of the replication system. The newly introduced RSSD stored procedure "rs_dump_stats" dumps all collected counters into a CSV-formatted file, which can then be loaded into an Excel sheet for further analysis. (Note: an Excel sheet may be limited to storing only 65K rows.)
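As a lightweight alternative to a spreadsheet, the rs_dump_stats CSV can be scanned with a short script. The sketch below is illustrative only; the column names used here are hypothetical, since the real layout depends on the Replication Server version:

```python
import csv, io

# Hypothetical extract of an rs_dump_stats CSV; the real column names
# and layout depend on the Replication Server version.
sample = """counter,instance,obs,total,last,max
CmdsWritten,SQM 103:0,4,5200,1300,2100
BytesWritten,SQM 103:0,4,268000,67000,91000
"""

def top_counters(text, n=1):
    """Rank counters by their total column: a quick first pass before
    loading the full dump into a spreadsheet."""
    rows = list(csv.DictReader(io.StringIO(text)))
    rows.sort(key=lambda r: int(r["total"]), reverse=True)
    return [(r["counter"], int(r["total"])) for r in rows[:n]]

print(top_counters(sample))  # [('BytesWritten', 268000)]
```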
1.3.3.3 Sp_sysmon – Sybase ASE's stored procedure "sp_sysmon" also provides a section specific to the Replication Agent, giving detailed statistical information for each configured replication agent. Below is sample output from "sp_sysmon". The "Log Scan Activity" section reports, among other things:
• Replicated DDL activity
• CLRs – log records that were partially or fully rolled back
Replication Agent
-----------------
Replication Agent: DBSCOMMON
Replication Server: PRLDBS1G
per sec per xact count % of total
------------ ------------ ---------- ----------
Log Scan Summary
Log Records Scanned n/a n/a 9 n/a
Log Records Processed n/a n/a 1 n/a
Log Scan Activity
Updates n/a n/a 0 n/a
Inserts n/a n/a 0 n/a
Deletes n/a n/a 0 n/a
Store Procedures n/a n/a 0 n/a
DDL Log Records n/a n/a 0 n/a
Writetext Log Records n/a n/a 0 n/a
Text/Image Log Records n/a n/a 0 n/a
CLRs n/a n/a 0 n/a
In “Transaction Activity” can make sure (approximately) total number of transactions committed and total number of transactions aborted was equal to total number of transactions opened.
Transaction Activity
Opened n/a n/a 1 n/a
Commited n/a n/a 1 n/a
Aborted n/a n/a 0 n/a
Prepared n/a n/a 0 n/a
Maintenance User n/a n/a 0 n/a
Log Extension Wait
Count n/a n/a 2 n/a
Amount of time (ms) n/a n/a 14133 n/a
Longest Wait (ms) n/a n/a 14133 n/a
Average Time (ms) n/a n/a 7066.5 n/a
Schema Cache Lookups
Forward Schema
Count n/a n/a 0 n/a
Total Wait (ms) n/a n/a 0 n/a
Longest Wait (ms) n/a n/a 0 n/a
Average Time (ms) n/a n/a 0.0 n/a
Backward Schema
Count n/a n/a 0 n/a
Total Wait (ms) n/a n/a 0 n/a
Longest Wait (ms) n/a n/a 0 n/a
Average Time (ms) n/a n/a 0.0 n/a
Truncation Point Movement
Moved n/a n/a 0 n/a
Gotten from RS n/a n/a 1 n/a
Connections to Replication Server
Success n/a n/a 0 n/a
Failed n/a n/a 0 n/a
Network Packet Information
Packets Sent n/a n/a 1 n/a
Full Packets Sent n/a n/a 0 n/a
Largest Packet n/a n/a 175 n/a
Amount of Bytes Sent n/a n/a 175 n/a
Average Packet n/a n/a 175.0 n/a
I/O Wait from RS
Count n/a n/a 2 n/a
Amount of Time (ms) n/a n/a 0 n/a
Longest Wait (ms) n/a n/a 0 n/a
Average Wait (ms) n/a n/a 0.0 n/a
--------------------------------------------------------------------------------
1.3.3.4 Measuring the Replication Agent – Various information about how well the replication agent is keeping up can be measured as follows:
• The beginning of the transaction log can be located in the master..sysdatabases table (column = logptr):
isql -U<user> -S<PrimaryDBSrv> -P<pwd>
1> select logptr from master..sysdatabases where name = "DBSCOMMON"
2> go
The output looks like:
logptr
-----------
19844
• Truncation points can be queried from master..syslogshold table (column = page)
isql -U<user> -S<PrimaryDBSrv> -P<pwd>
1> select page from master..syslogshold
2> go
page
-----------
19845
• The current position of the replication agent can be found by executing
"sp_help_rep_agent <db>" (column = current marker):
isql -U<user> -S<PrimaryDBSrv> -P<pwd>
1> sp_help_rep_agent "DBSCOMMON"
2> go
Replication Agent Recovery status
dbname connect dataserver connect database status rs servername rs username
------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------
DBSCOMMON PDSDBS1 DBSCOMMON not active PRLDBS1G PRLDBS1G_ra
Replication Agent Process status
dbname spid sleep status retry count last error
------------------------------ ----------- ------------------------------ ----------- -----------
DBSCOMMON 18 end of log 0 0
Replication Agent Scan status
dbname start marker end marker current marker log recs scanned oldest transaction
------------------------------ ------------------------------ ------------------------------ ------------------------------ ---------------- ------------------------------
DBSCOMMON (19845,18) (19845,22) (19845,22) 0 (-1,0)
• The last page of the log can be calculated using "dbcc pglinkage(<dbid>, <cur_pg>, 0, 0, 0, 1)", where <cur_pg> can be any log page (i.e., the beginning of the log, the primary/secondary truncation point, or the current replication position):
Object ID for pages in this chain = 8.
End of chain reached.
2 pages scanned. Object ID = 8. Last page in scan = 19846.
DBCC execution completed. If DBCC printed error messages, contact a user with System Administrator (SA) role.
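Putting these numbers together, a rough backlog estimate can be sketched as below. This is illustrative Python, not a Sybase tool, and the page size is an assumption (ASE servers use 2K, 4K, 8K, or 16K pages):

```python
def rep_agent_lag(current_marker_page, last_log_page, page_kb=2):
    """Rough Replication Agent backlog: log pages between the agent's
    current marker and the end of the log chain, plus an approximate
    size in KB. page_kb is an assumption (ASE uses 2K-16K pages)."""
    pages = last_log_page - current_marker_page
    return pages, pages * page_kb

# Values from the sample output above: current marker (19845,22),
# last page in the dbcc pglinkage scan = 19846
print(rep_agent_lag(19845, 19846))  # (1, 2)
```

A small, stable gap means the agent is keeping up; a steadily growing gap points to a Replication Agent or downstream bottleneck.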
1.3.3.5 Measuring queues (inbound/outbound) – Use "admin who,sqm" and the difference between "Last Seg.Block" and "Next Read" to see how the Replication Server is processing the queues. If "Next Read" is greater than or equal to (>=) "Last Seg.Block", there is nothing left in the queue for the Replication Server to process.
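The Last Seg.Block / Next Read comparison can be sketched as follows. This is illustrative Python; the 'seg.block' and 'seg.block.row' value formats are assumptions based on typical "admin who,sqm" output:

```python
def queue_caught_up(last_seg_block, next_read):
    """True when the reader has consumed everything written to the
    queue. last_seg_block is 'seg.block' and next_read is
    'seg.block.row' (formats assumed from typical admin who,sqm output)."""
    w = tuple(int(x) for x in last_seg_block.split("."))
    r = tuple(int(x) for x in next_read.split(".")[:2])
    return r >= w

print(queue_caught_up("615.3", "615.4.0"))  # True
print(queue_caught_up("615.3", "612.1.0"))  # False: a backlog of several segments
```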
1.4 Alerting/Notification – User-defined scripts (to take appropriate action, send email, etc.) can be initiated by configuring Sybase Replication Server Manager (using Sybase Central) for certain replication events/conditions (i.e., DSI DOWN, SERVER DOWN, etc.). A similar setup can also be accomplished using Sybase RMS (Replication Monitoring System). The following diagram shows commonly used alerting/notification methods.
1.4.1 RSM event monitoring
Sybase Central can be used to instruct an RSM server (which may reside on a remote host) to initiate user-defined scripts (located on the same host where RSM is running) for specific server events/conditions. An event is a change that occurs in a replication system managed by a specific RSM server. To configure RSM event monitoring:
• Install and configure RSM server (using “rsmgen” utility located in $SYBASE/$SYBASE_RSM/install)
• Using Sybase Central (having Replication Plugin) connect to the newly configured RSM server.
• Add primary, replicate, replication and RSSD server to the RSM server using Sybase Central (Make sure to add RSSD server before adding replication server)
• Then right-click the RSM server and select "Server Events". The next dialog box shows six different events (i.e., six different tabs):
o Server events – Events specific to a change in state of a monitored server (ASE or Replication Server). Select the "RSM Domain" for the specific "Server". The following events are available to monitor:
Active - indicates a server is functioning normally. This option is useful if you want to send an e-mail or pager message when a server begins functioning normally after experiencing a problem.
Quiesced - Indicates a server is quiesced. If you use RSM to quiesce a Replication Server, the Replication Server state becomes Suspect rather than Quiesced because the LTMs are suspended.
Suspect - indicates a server is still running but is experiencing a problem.
Hung - indicates RSM cannot connect to the server because of a connection timeout.
Shutdown - indicates the shutdown command was used to shut down a Replication Server or an LTM.
Dead - indicates a server was shut down using a method other than the Shutdown command; for example, you used the isql command to shut down a server.
Unknown - indicates RSM cannot communicate with another server because of a connectivity problem.
Invalid - indicates RSM encountered an error in critical information files, such as a missing or corrupt stored procedure in the RSSD of a monitored Replication Server.
Once the event is selected from the list above, select the "Servers" for which this event needs to be monitored.
o Route events – Events specific to changes in the status of a route. Select the "RSM Domain" and "Replication Server" for the specific route.
o Connection events – Events specific to changes in the status of a connection. Select the "RSM Domain" and "Replication Server" for the specific connection.
o Partition events – Events specific to partition thresholds (monitors partitions and raises an event when a partition's size equals or exceeds a specified threshold) and partition state changes (raises an event when a partition's state changes to ONLINE, OFFLINE, or DROPPED). Select the "RSM Domain" and "Replication Server" for the partition.
o Queue events – Events specific to queue thresholds (monitors queues for specified Replication Servers and raises an event when a queue's size equals or exceeds the specified threshold) and queue latency (the amount of time that the first block has remained at the beginning of each stable queue). Select the "RSM Domain" and "Replication Server" for the queue.
o Database events – Events specific to replication latency. Select the "RSM Domain" and "database" (select multiple databases using the Ctrl key).
Finally, make sure to provide a "Script Location" for each selected event. The script can, at a minimum:
• Send an email notification
• Send a page
• Write to an error log file
• Insert the error into a selected database
• Add a partition
• Restart the server
1.4.2 Scripts
Shell or Perl scripts can also be used on their own to monitor the replication system. For example, the sample scripts below monitor a primary database server, a replicate database server, and a replication server.
wrap_rep.csh
#!/bin/csh
#################################################################
# This script is a wrapper
# This script can be placed in startup script
#################################################################
ps -ef | grep -v grep | grep $0
if ($status == 0) exit
while (1)
check_srv PDSDBS1
check_srv PRSDBS1
check_rep_comp PRLDBS1A
sleep 300
end
check_srv.csh
#!/bin/csh
#################################################################
# This script checks connection to ASE/Rep Srv is OK
#################################################################
set EMAIL_LIST="[email protected]"
set usr=
set pass=
## Check whether a connection to the Srv can be established
date > /tmp/$$
isql -U$usr -w132 -S$1 <<EOF >> /tmp/$$
$pass
go
EOF
## If not, send email and loop until a connection to the Srv
## can be established
if ($status != 0) then
  cat /tmp/$$ | mailx -s "PL Check $1" $EMAIL_LIST
endif
isql -U$usr -w132 -S$1 <<EOF
$pass
go
EOF
while ($status != 0)
sleep 300
isql -U$usr -w132 -S$1 <<EOF
$pass
go
EOF
end
check_rep_comp.csh
#!/bin/csh
#######################################################################
# This script checks that all components of the Rep Server named in $1
# are UP and running
#######################################################################
set EMAIL_LIST="[email protected]"
set usr=
set pass=
## Check whether all components of the Rep Server are UP and running
check_srv $1
isql -U$usr -w132 -S$1 <<EOF | grep -i suspect
$pass
admin health
go
EOF
## If any component is DOWN, send email with the last 100 lines of the
## errorlog (the errorlog name is assumed to match the server name $1)
if ($status == 0) then
    tail -100 $SYBASE/REP-12_6/install/$1.log | mailx -s "PL Check RepSrv" $EMAIL_LIST
endif
## Loop until no component is flagged as SUSPECT any more
check_srv $1
isql -U$usr -w132 -S$1 <<EOF | grep -i suspect
$pass
admin health
go
EOF
while ($status == 0)
    sleep 300
    check_srv $1
    isql -U$usr -w132 -S$1 <<EOF | grep -i suspect
$pass
admin health
go
EOF
end
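The grep on “suspect” above depends on the format of “admin health” output, in which each component is reported as HEALTHY or SUSPECT. That check can be factored into a small function that classifies captured output, which also makes it easy to verify against saved text; the function name and save-to-file convention are assumptions:

```shell
# classify_health: read captured "admin health" output from the file named
# in $1 and print SUSPECT if any component is flagged, HEALTHY otherwise.
# (A sketch; assumes the output was saved with: isql ... > file)
classify_health() {
    if grep -qi suspect "$1"; then
        echo SUSPECT
    else
        echo HEALTHY
    fi
}
```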
1.5 General Troubleshooting
1.5.1 Skipping a transaction – This happens when the DSI goes down due to a bad transaction (i.e. a transaction that errors out in the replicate database). In order to continue (i.e. ignore the current bad transaction), the connection can be resumed using the following command:

isql -U<user> -S<RepSrv> -P<pwd>
1> resume connection to <replicate_dataserver>.<replicate_db> skip tran
2> go

Replication Server will move the first bad transaction into the exception log (located in the RSSD) and continue (i.e. resume the connection that was down/suspended) with the next transaction in the queue. To view the skipped transaction, log into the respective RSSD (the replication command “admin rssd_name” can be used to find the RSSD) and use the “rs_helpexception” stored procedure. Once the transaction has been reviewed, it can be deleted from the exception log using the “rs_delexception” stored procedure.

isql -U<user> -S<RSSDSrv> -P<pwd> -D<RSSD>
1> rs_helpexception
2> go
-- Look for the transaction that was logged most recently (i.e. its Xact_id)
1> rs_helpexception <Xact_id>, v
2> go
1.5.2 Dumping a stable queue – The entire queue, or part of a queue (by providing a particular segment:block:count), can be dumped to a file or to the screen using the “sysadmin dump_queue” command.
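For example, a session like the following dumps the first two blocks of a queue (the queue number and type here are placeholders; “admin who, sqm” lists the actual queue numbers, and the arguments after the queue type are segment, block, and block count):

```
isql -U<user> -S<RepSrv> -P<pwd>
1> sysadmin dump_queue, <queue_number>, <queue_type>, 1, 1, 2
2> go
```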
1.5.3 Disabling the secondary truncation point – To avoid the transaction log filling up in the primary due to any replication disaster, use the following commands at the primary database:

isql -U<user> -S<PrimaryDBSrv> -P<pwd> -D<PrimaryDB>
1> sp_stop_rep_agent <DBName>
2> go
1> dbcc settrunc(ltm, ignore)
2> go
1.5.4 Enabling the secondary truncation point – After fixing a replication system/server disaster, or after refreshing the primary from a backup, the following commands can be used to re-enable the secondary truncation point:

isql -U<user> -S<RSSDSrv> -P<pwd> -D<RSSD>
1> rs_zeroltm <PrimaryDBSrv>, <DBName>
2> go

isql -U<user> -S<PrimaryDBSrv> -P<pwd> -D<PrimaryDB>
1> dbcc settrunc(ltm, valid)
2> go
1> sp_start_rep_agent <DBName>
2> go
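The zeroltm/valid/start sequence in 1.5.4 lends itself to a small wrapper script. The sketch below is an assumption-laden illustration: the function name and argument order are invented, credentials are omitted, and the isql command is taken from a variable so the sequence can be exercised against a stub instead of a live server.

```shell
# Hypothetical wrapper for re-enabling the secondary truncation point.
# ISQL can be pointed at a stub for testing; all names are placeholders.
ISQL=${ISQL:-isql}

reenable_truncpt() {
    # $1: RSSD server  $2: RSSD database
    # $3: primary server  $4: primary database  $5: database name
    # Step 1: reset the locator stored in the RSSD
    "$ISQL" -S"$1" -D"$2" <<EOF
rs_zeroltm $3, $5
go
EOF
    # Step 2: re-validate the secondary truncation point and
    # restart the Rep Agent in the primary database
    "$ISQL" -S"$3" -D"$4" <<EOF
dbcc settrunc(ltm, valid)
go
sp_start_rep_agent $5
go
EOF
}
```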
References
• www.sybase.com
• Replication Server Reference Manual
• Replication Server Administration Guide
• Replication Server Troubleshooting Guide
• Replication Server Heterogeneous Replication Guide