
Page 1: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-1

Valverde Computing

The Fundamentals of Transaction Systems Part 3: Relativity shatters the Classical Delusion (Replicated Database)

C.S. Johnson <[email protected]> video: http://ValverdeComputing.Com social: http://ValverdeComputing.Ning.Com

The Open Source/Systems Mainframe Architecture

Page 2: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-2

In the final WICS (Stanford) summer class given by Jim Gray and his associates in 1999 (he received the Turing Award just before this) on Transaction Processing Concepts and Techniques, the section on Replication was given by Phil Bernstein of Microsoft

Bernstein mentioned (but not on the slides) that no system that used replication for the purpose of recovering from a disaster on the primary site had ever successfully been constructed using anything other than log replication, so we'll just take that as a proven fact

<http://research.microsoft.com/en-us/um/people/gray/WICS_99_TP/>

17. Reliable Disjoint Async Replication

Page 3: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-3

17. Reliable Disjoint Async Replication

I would further add that nothing magical is accomplished by replication, so wormholes in the primary site database and any other inconsistencies will appear in the backup site as well

The implication is that an MVCC concurrency-based database, which can only have that concurrency made safe by single-threading the application (since the database itself cannot block the applications from corruption due to write skew), will likewise only have replication consistency provided by the application, over and above this, if multiple nodes and replication streams are involved

Page 4: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-4

Since that overarching multi-node application concurrency solution is unlikely to be reliably forthcoming … multi-node primary site databases replicating to multi-node (or otherwise) backup sites can only be made ACID consistent after takeover (or failover) if they are S2PL (strict two-phase locking) or SS2PL (strong strict two-phase locking) concurrency-based databases

There are currently two S2PL replication products in production use for critical systems: GDPS on IBM databases, primarily for mainframe DB2, and RDF on top of HP Nonstop; both of these get the job done, insofar as reliably joining DBMS systems through replication

17. Reliable Disjoint Async Replication

Page 5: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-5

First, let's define the replication configuration options (Gray/Reuter 12.6.3): 1-Safe (also called asynchronous), 2-Safe (also called synchronous) and Very Safe (a small sketch of the three response policies follows this list):

1-Safe (async) responds to the client after the primary commits (even if the network or the backup are down, because they can catch up later)

2-Safe (sync) responds to the client after the backup commits, unless the network or the backup are down, in which case it responds to the client after the primary commits and the backup catches up later

Very Safe responds to the client after the backup commits, unless the network or the backup are down, in which case there is no update access at all to the primary copy of the data: it's all or nothing
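The branching among these three configurations can be made concrete with a minimal Python sketch. Everything here is illustrative: `primary`, `backup`, their `commit()` calls and `backup.reachable()` are hypothetical stand-ins, not any product's API.

```python
from enum import Enum

class Safety(Enum):
    ONE_SAFE = 1    # async: reply after the primary commits
    TWO_SAFE = 2    # sync: reply after the backup commits, unless it is unreachable
    VERY_SAFE = 3   # reply after the backup commits, or refuse updates entirely

def commit(tx, primary, backup, safety):
    """Illustrative commit path for the three Gray/Reuter 12.6.3 configurations."""
    if safety is Safety.ONE_SAFE:
        primary.commit(tx)              # reply now; the backup catches up from the log later
        return "committed"
    if not backup.reachable():
        if safety is Safety.TWO_SAFE:
            primary.commit(tx)          # degrade to 1-safe behavior while the backup is down
            return "committed (backup will catch up)"
        return "refused"                # very safe: no updates without the backup
    primary.commit(tx)
    backup.commit(tx)                   # wait for the backup before replying to the client
    return "committed on both"
```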

17. Reliable Disjoint Async Replication

Page 6: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-6

In a massively distributed transaction system containing many transactional database clusters, Disjoint Replication is the first kind of consistent replication: it is point-to-point, and it ignores and does not reconcile transactions with respect to other replicated databases

So after two primary clusters fail and their backup clusters take over, the two takeover managers will make differing decisions about which transactions commit and abort, according to how much of the two (or more) log streams made it over separately to each backup cluster before partitioning of the network or failure of the primary clusters occurred

17. Reliable Disjoint Async Replication

Page 7: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-7

This is because the input log replication streams will always break at different points: you can end up with distributed tx.A, but not distributed tx.B committing on one cluster, and the opposite on the other cluster, if the commit records replicate just so

The Nonstop RDF subsystem was originally developed in the field to solve a problem for an important account; it was then brought inside development and extended and rewritten several times

17. Reliable Disjoint Async Replication

Page 8: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-8

Because of its origins, it is (mostly) external to the transaction service (Nonstop TMF), and Nonstop RDF has (mostly) disjoint releases:

RDF essentially attaches itself to the primary database log as an RDF extractor

The extractor ships the relevant log records to an RDF receiver on the backup system, including some schema (DDL) changes from the primary DB (create/drop table, but not create index)

The receiver inserts those records into an RDF-only remote log on the backup system called an 'image trail'

17. Reliable Disjoint Async Replication

Page 9: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-9

Nonstop RDF features (continued):

The RDF updater reads the image trail and applies the changes to the backup database, including schema (DDL) changes from the primary DB

The updater uses a special high-performance bulk access method to stream-write the updates to the RM (DP2) cache, which logs the changes, only writing back to the database disk when necessary to steal cache pages (when recovering from the log, you don't need to WAL, by definition); a small end-to-end sketch of the extractor/receiver/updater pipeline follows
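Here is a minimal, purely illustrative Python sketch of that extractor, receiver, image-trail, updater pipeline. The record layout, the in-memory 'image trail', and the dictionary standing in for the backup database are all assumptions made for the sake of the example; real RDF also ships some DDL, which is elided here.

```python
from collections import deque

class ImageTrail:
    """Stand-in for the RDF-only remote log ('image trail') on the backup system."""
    def __init__(self):
        self.records = deque()

def extractor(primary_log, since_lsn):
    """Read the relevant update records off the primary database log."""
    for rec in primary_log:
        if rec["lsn"] > since_lsn and rec["kind"] == "update":
            yield rec

def receiver(records, image_trail):
    """Land the shipped records in the image trail (network transport elided)."""
    for rec in records:
        image_trail.records.append(rec)

def updater(image_trail, backup_db):
    """Apply image-trail records to the backup database in log order (redo only)."""
    while image_trail.records:
        rec = image_trail.records.popleft()
        backup_db[rec["key"]] = rec["after"]   # install the after-image

primary_log = [{"lsn": 1, "kind": "update", "key": "acct:7", "after": 100},
               {"lsn": 2, "kind": "update", "key": "acct:9", "after": 250}]
backup_db, trail = {}, ImageTrail()
receiver(extractor(primary_log, since_lsn=0), trail)
updater(trail, backup_db)
print(backup_db)   # {'acct:7': 100, 'acct:9': 250}
```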

17. Reliable Disjoint Async Replication

Page 10: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-10

Nonstop RDF features (continued):

That bulk access method has ACID properties and allows the backup database to be queried, but not updated with transactions

Takeover by the backup:

Requires no reboot

Is to the last completed transaction

Is fully automated (by command)

The takeover time is measured, in production, to be 10-120 seconds

17. Reliable Disjoint Async Replication

Page 11: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-11

IBM's GDPS (Geographically Dispersed Parallel Sysplex) is an architecture (not an out-of-the-box product like RDF) that joins sysplex clusters:

For IMS TM and CICS transaction managers

For VSAM, IMS/Fastpath/HALDB, and DB2 databases

Using PPRC synchronous (lock-stepped) mirroring (up to 25 miles, to limit the response time hit) and XRC asynchronous remote mirroring (to any distance, like RDF)

Providing physical byte replication (not transactional)

17. Reliable Disjoint Async Replication

Page 12: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-12

IBM GDPS features (continued):

Disallowing any schema (DDL) changes to the primary database, and any read-only access to the backup database

Requiring DBMS and application level byte-oriented recovery on the remote site after a crash

Accomplishing failover automation by carefully engineered scripts, maintained by IBM services, to the last completed transaction (if your DB application is transactional)

GDPS failover requires a reboot, thereby trashing any currently executing workloads on the backup sysplex, and takes between 30 and 40 minutes for PPRC, and less than two hours for XRC

Gartner T-09-9951, 1/24/01

17. Reliable Disjoint Async Replication

Page 13: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-13

Hence, for reliable disjoint async replication (point-to-point), which is typically over a sufficient distance to forestall disk mirroring, these are the apparent product choices when strict ACID consistency is a requirement after takeover or failover:

Vanilla Nonstop RDF with no frills or complication accomplishes that end

GDPS XRC asynchronous remote mirroring accomplishes that end as an architecture, but needs implementation:

XRC requires IBM services to convert the byte-oriented replication into transactional DBMS recovery

XRC requires IBM services to support the recovery of the 50%-60% of IBM applications that are not transactional

17. Reliable Disjoint Async Replication

Page 14: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-14

An Optimal RDBMS would handle disjoint async replication in a manner somewhat similar to that of Nonstop RDF, but with some major differences (more like a follow-on project sometimes referred to as RDF2, which was never pursued):

Replication would be deeply integrated into the transaction system TM, and the pattern of logging by the RM, the user interfaces, and the commit protocol and logging by the TM would be part of this integration

The receiver would also directly apply the log records to the database, where possible, instead of deferring that job to the updater at a later time

17. Reliable Disjoint Async Replication

Page 15: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-15

An Optimal RDBMS would handle replication (continued):

The updater would then have the function of applying the log to the database when that function was frozen for some reason by the receiver

Part of the reason for image files in the original Nonstop RDF is that the RMs and the TM do not write log records to the log partition in such a way as to localize the decisions of the original updater (meaning localized to that log partition), and sometimes the part of the log containing the needed information is comm-delayed in another partition's stream; this would be fixed by deeper integration (more on this topic in the joint replication section)

17. Reliable Disjoint Async Replication

Page 16: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-16

An Optimal RDBMS would handle replication (continued):

For early release, how an object is protected by replication would be dependent upon the RM holding it: RMs would be configured as 1-Safe (async) due to the configuration of the log partition they are attached to, and all of the database objects residing with that RM would log to, and replicate to, the configured places (see virtual commit by name in the continuous database and joint replication sections)

Since 1-Safe (async) commit has the same response time as a normal commit, distributed transaction commit of non-replicated and 1-Safe RMs would not differ in terms of commit response time: however, the resulting 1-Safe databases that had been modified by distributed transactions would be disjoint after a crash (that distributed magic needs serious architecture work)

17. Reliable Disjoint Async Replication

Page 17: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-17

An Optimal RDBMS would handle replication (continued):

If ESS (Enterprise Storage System) native replication functions could be made to perform reliably for the purposes of 1-Safe replication, then they could be used: ESS uses something similar to copy-on-write to file-split and stream the updates to a remote file as a mirror

If (as I suspect) this native function of some ESS cannot be made reliable enough, then Nonstop RDF-style extractor log shipping and receiver updating is just the thing

I presume that ESS native replication would not be sufficient to guarantee 2-Safe replication, but I could be wrong

17. Reliable Disjoint Async Replication

Page 18: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-18

18. Logical Redo and Volume Autonomy

In section 10.3.1 of Gray/Reuter the DO-UNDO-REDO protocol is discussed:

DO: the original application transforms the old state of the database into the new state, generating a log record

UNDO: the UNDO program transforms the new state of the database back into the old state, using the log record produced by the original application

REDO: the REDO program transforms the old state of the database into the new state, using the log record produced by the original application

This means that the log record must (at least) be capable of both UNDO and REDO, to support aborting the transaction (Backout on Nonstop) and recovering the file from an archive copy (Rollforward or File Recovery on Nonstop)
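As a minimal illustration of DO-UNDO-REDO (not any particular product's log format), a log record only needs to carry a before-image and an after-image; the Python sketch below uses a plain dictionary for both the database and the record.

```python
def do_update(db, log, key, new_value):
    """DO: transform the old state into the new state and emit a log record with both images."""
    rec = {"key": key, "before": db.get(key), "after": new_value}
    db[key] = new_value
    log.append(rec)
    return rec

def undo(db, rec):
    """UNDO: reinstall the before-image (what Backout does on Nonstop)."""
    db[rec["key"]] = rec["before"]

def redo(db, rec):
    """REDO: reinstall the after-image (what Rollforward / File Recovery replays)."""
    db[rec["key"]] = rec["after"]

db, log = {"x": 1}, []
rec = do_update(db, log, "x", 2)
undo(db, rec); assert db == {"x": 1}   # aborting the transaction
redo(db, rec); assert db == {"x": 2}   # recovering from an archive copy plus the log
```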

Page 19: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-19

18. Logical Redo and Volume Autonomy

In all databases, there can be four basic types of log records (and lots of ancillary types, like checkpoint records to limit log reading, etc.):

Transaction state: values include Active, Prepared, Committed, Aborting, Forgotten (the tense of these is crucial), along with details like other clusters/nodes involved in distributed commit and commit/abort flags

Logical undo: the "logical" or "row-oriented" contents of the database before the transaction took place, such that aborting the transaction applies the undo for that transaction against the database to satisfy atomicity and to put things back the way they were … on Nonstop the undo operations (from Backout) are logical and row-oriented and produce as a result CLRs, which are Compensating Log Records (tantamount to physical redo describing the DO, or forward motion, of the undo operation; physical redo is described last)

Page 20: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-20

18. Logical Redo and Volume Autonomy

Log records (continued):

Logical redo: the "logical" or "row-oriented" contents of the database after the transaction took place, sometimes in the same record with the logical undo; this is what the transaction tried to accomplish logically in the first place, so that when (for instance) the table's btree is broken or the file is marked bad, an archive recovery restores an online (fuzzy) dump of the file from some time before and applies the logical or physical redo from that time forward to recover the file, then applies logical undo for incomplete transactions backwards from the end of the log to put the file back together consistently

Page 21: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-21

18. Logical Redo and Volume Autonomy

Log records (continued):

Physical redo: the "physical" or "page-oriented" contents of the database after the transaction took place, the actual physical bytes of some complex change to the structure of the table's blocks; for instance, on Nonstop a btree block split requires five physical redo blocks in the log. Physical redo has the property that after simple unpacking it can be reassembled directly in the cache buffer of the RM (called DP2 on Nonstop) without a lot of re-wiring, so it's fast and efficient … on Nonstop the redo operations are physical and page-oriented

Page 22: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-22

18. Logical Redo and Volume Autonomy

In the Nonstop group of the late 1980s, there was an attempt to do away with physical redo for recovery: Jim Gray's longtime associate Franco Putzolu (one of the all-time great programmer/designer/architects) wanted to use only logical redo, for these reasons related to replication:

Fault Containment: some problems break physical redo log generation, while the logical redo is fine

Locality Issues: the compactness of a table or index is a local issue; online reorganization to compact a file can be done on each replica, and the replication of the physical redo for that reorganization is costly

Page 23: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-23

18. Logical Redo and Volume Autonomy

Why logical redo is better (continued):

Schema (DDL) Changes: Create/Drop Table, Create/Drop Index with accompanying sorts (heavy on the physical redo), etc., can be logged and done separately on both sides, and the physical layout being different avoids mirroring a Bohr-bug (deterministic) corner case on both sides, preventing some double failures

Manual Repair: when you have to manually repair a broken btree, this is easier to do when you can do it separately on both sides (tools to hack or repair physical database structures typically do not generate log records)

Page 24: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-24

18. Logical Redo and Volume Autonomy

The arguments for replicating physical redo to the backup database were as follows:

Performance: physical redo allows buffered cache to be used for every update made on the backup side, and it was unknown how to do that with any logical redo method: so using logical redo would imply a performance hit for multi-block operations on the backup side that was not encountered on the primary side, and then there would be the looming possibility of the backup falling successively further behind, unpredictably increasing the MTR (mean time to repair) after a takeover, with a linear detrimental impact on availability = MTBF/(MTBF + MTR) [for example: doubling the MTR for Nonstop from 30 minutes for a double-failure cluster crash takes Nonstop availability from 5 ½ nines to 5 nines, so MTR is important; a small worked example follows]
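The arithmetic behind that bracketed example can be checked with a few lines of Python. The MTBF used here is an assumed, illustrative value chosen so that a 30-minute MTR lands near 5 ½ nines; the point is only that doubling MTR visibly erodes availability.

```python
import math

def availability(mtbf_hours, mtr_hours):
    return mtbf_hours / (mtbf_hours + mtr_hours)

def nines(a):
    return -math.log10(1.0 - a)

MTBF = 158_000.0          # assumed illustrative MTBF in hours (roughly 18 years)
for mtr in (0.5, 1.0):    # a 30-minute MTR versus a doubled 60-minute MTR
    a = availability(MTBF, mtr)
    print(f"MTR={mtr:.1f} h  availability={a:.7f}  (~{nines(a):.1f} nines)")
# Doubling the MTR costs about a third of a nine here: MTR matters as much as MTBF
```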

Page 25: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-25

18. Logical Redo and Volume Autonomy

Why physical redo is better (continued):

Code Simplicity and Reuse: the Nonstop transaction system's TM (called TMF) uses physical redo for data volume restart and archive recovery (autorollback and rollforward or file recovery), for those very performance reasons: if replication used logical redo, then a given file recovery execution would face switching back and forth from logical to physical redo as the history of takeover changed back and forth for the files involved; basically, any state of the system that mattered during recovery would need to be captured in the log as that state changed

That Bird Had Flown: the transaction system's TM (TMF) had already, at that time, committed to physical redo for several years hence

Page 26: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-26

18. Logical Redo and Volume Autonomy

So, Nonstop passed on Franco’s ideas and went with the expediency of the status quo: physical redo became the permanent fact of life in recovery and replication, but soon enough … there were some regrets (we should have listened)

Just after that time (the late 1980s), Patrick O'Neil of UMass Boston invented the SB-tree, which solved the vast majority of these problems (IMHO): an SB-tree has a root block that points to large multi-blocks, each of which contains a number of B+-tree pages holding the index blocks (interior nodes) and record blocks (leaf nodes)

The SB-tree: An index-sequential structure for high-performance sequential access, Acta Informatica 29, 241-265 (1992)

Page 27: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-27

18. Logical Redo and Volume Autonomy

SB-tree features (continued):

The large multi-blocks were of the size that a disk head could read in a single sweep across a sector of a hard disk, so they were physically determined to be fast and efficiently sequential in writing and reading performance (one order of magnitude improvement)

When a B+-tree page insert caused a page split, the new page came from the same multi-block as the split page, and the reverse for page merges due to deletes

Same-level successive leaf scans came from the same multi-blocks, and updates to them were written in one write

Single page accesses were still single page accesses (a structural sketch in code follows)
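To make the structure concrete, here is a minimal Python sketch of the containment relationship only (root block, multi-blocks, B+-tree pages), with the page-split-stays-in-the-multi-block rule; the capacity number and class names are assumptions, and no search or key logic is attempted.

```python
class Page:
    """A single B+-tree page (interior or leaf) living inside a multi-block."""
    def __init__(self, page_id):
        self.page_id = page_id
        self.keys = []

class MultiBlock:
    """A large, contiguously written multi-block holding many B+-tree pages, so the
    whole thing can be read or written in one sequential sweep of the disk head."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.pages = []
    def allocate_page(self):
        if len(self.pages) >= self.capacity:
            raise RuntimeError("multi-block full: a multi-block split is needed instead")
        page = Page(len(self.pages))
        self.pages.append(page)
        return page

class SBTree:
    """The root block points at multi-blocks; pages never move between multi-blocks on
    an ordinary page split, which is what keeps the multi-block physical redo traffic down."""
    def __init__(self):
        self.multi_blocks = [MultiBlock()]
    def split_page(self, mb_index):
        # the new page is allocated from the same multi-block as the page being split
        return self.multi_blocks[mb_index].allocate_page()
```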

Page 28: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-28

18. Logical Redo and Volume Autonomy

SB-tree features (continued):

Writing a multi-block gave natural advantages when striping was involved: large serial writes improve the efficiency of striping

The performance positives were assumed at that time (1990) to improve as the trends in memory versus disk pricing and optimization persisted, which they have

Whatever performance shortcomings there were versus B+-tree implementations that used the disk layout carefully could be reduced by SB-tree implementations that also used the disk layout carefully

Page 29: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-29

18. Logical Redo and Volume Autonomy

SB-tree features (continued):

Performance improvements were across the board: response time, throughput, processor and disk utilization

The SB-tree would also, I believe, remove most of the multi-block physical redo instances, were it to be used in Franco's database scenario: purely logical redo could be sent to the log for the cases involving a page split where the new page stays within the same SB-tree multi-block

In the worst case, the rarity of multi-block physical redo instances in such a scenario would tilt the balance toward the view that the performance fears of falling behind on the replication backup side were groundless

Page 30: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-30

18. Logical Redo and Volume Autonomy

In the best case, there arises a real possibility of a new way of doing things (purely logical):

Let's assume that local TM recovery only stored logical redo (and of course, logical undo), but no physical redo in the log, with only logical replication being transmitted and performed on the backup side database

Now all database operations done on the primary (create/drop for tables and indexes, database reorganization, maybe even split, merge and move partitions) can be replicated across as DDL commands

Online (fuzzy) dumps could be launched on one or both sides, because the log is logical, not physical: let the file physical images diverge

Page 31: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-31

18. Logical Redo and Volume Autonomy

In the best case (purely logical), continued:

The data volume or database disk contains completed SB-tree multi-blocks or SB-tree partition multi-blocks: if there were no botched block writes (a reasonably smart disk drive with decent power loss circuitry), then there are never any broken B+-trees, because there are no interrupted multi-block operations

Even if that is not totally possible, the number of instances of broken B+-trees dwindles drastically

If the transactional cluster crashes, database volumes could be traversed without the transaction system coming up: this is what I call "Volume Autonomy"

Page 32: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-32

18. Logical Redo and Volume Autonomy

In the best case (purely logical), continued:

If all the data volumes are mirrored, with the mirrors separated by a geographical distance (say a mile or two across the East River), then after a total system obliteration on the other side of the river and loss of the logs, the database has no broken B+-trees and is still usable

There might be some logical inconsistency because of WAL (write-ahead log) not having written back to the data disk in some time, but the B+-trees are intact and traversable

That policy of writing back to the data volume after the WAL could be sped up to make the database disk more up to date (called sharp checkpoints): so the CAB-WAL-WDV sequence, which is only hurried for the first two (because locks are being held), could become hurried for the WDV as well

Page 33: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-33

18. Logical Redo and Volume Autonomy

It's not too difficult to make the case that the advent of increasing numbers of cores and miles of cheap memory means that the time for logical redo and volume autonomy is here: an optimal RDBMS could be designed for pure or nearly pure logical redo and attempt to accomplish true volume autonomy

Finally, once both undo and redo are logical, it becomes easier to give logical rollforward or archive recovery an SQL-type interface: starting from some point in the past (the correct fuzzy dump is selected, and the rollforward starts from there), apply all redo, except where managers.salary >= $200000 AND department.type = "S/W development"

and then the simple application of redo will not corrupt the table

Page 34: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-34

19. Scalable Joint Replication

So, why am I drawing a distinction between joint and disjoint replication? Most RDBMS systems today do not use distributed transactions, because they magnify the wormhole problems by making an even larger system that still must remain single-threaded to protect the defenseless MVCC-concurrency RDBMS from the applications: so strapping together a bunch of RDBMS SMPs in a cluster, or a bunch of clusters in a group, simply does not make the problems go away, and it makes the solutions you have (sharding and isolating, or complex management of single-threading by the applications) more difficult to implement and maintain

Page 35: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-35

19. Scalable Joint Replication

Joint and disjoint replication (continued):

For those MVCC-concurrent RDBMS, replication is never joint, because there are no distributed transactions with multi-phase commit between transactional clusters (nodes), frankly because distributed transactions cannot be made consistent there

Hence, the joint/disjoint property of replication is only relevant to S2PL-concurrency multi-phase commit transactions between commit coordinators that can fail separately, leaving the "where's the parent?" problem after some commit coordinators die and some survive, which was discussed with respect to virtual commit by name for continuous database

Page 36: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-36

19. Scalable Joint Replication

There are some RDBMS systems that support joint replication currently, but that are not completely scalable, for one reason or another:

(1) The Geographically Dispersed Parallel Sysplex (IBM GDPS) has a second form of replication (other than XRC, which was discussed as disjoint) called PPRC, which is synchronous and could be used (with appropriate scripts) for joint replication: this would seem to be impossible with long-haul XRC

Peer-to-Peer Remote Copy (IBM GDPS PPRC), like XRC, supports CICS and IMS transactions, and supports DB2, IMS and VSAM databases (this means the 50-60% of IBM's applications which are not transactional are supported by PPRC)

Page 37: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-37

19. Scalable Joint Replication

PPRC is lockstepped synchronous data mirroring and is a bit copy of the database to the other side: because it is lockstepped, the application waits for both writes to complete before continuing (hence, the synchrony)

This (to some degree) can guarantee the atomicity of transactions above the bit level, assuming that the scripts do the right thing after a failover and that distributed transaction management on the backup CICS does the right thing as well: even then, it's hardly conceivable that PPRC would not be disjoint after a failover/recovery; let's see why …

Page 38: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-38

19. Scalable Joint Replication

PPRC imposes some severe restrictions, yet still has problems:

PPRC is not long haul: PPRC can only mirror up to 24.9 miles away, and at that distance the round trip for a send/acknowledgement costs 200 µsec, which is 1000 times the delay of a fabric switch or SAN, and this timing limitation is crucial to their disk writing code

PPRC does not do log replication, instead doing raw hardware disk mirroring, so that delay is imposed at every disk write, instead of just at transaction group commit time as in log replication; this slows things down by a factor of a hundred, which is not so in log replication

Page 39: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-39

19. Scalable Joint Replication

PPRC (continued):

That 200 µsec cascaded delay makes everything much, much worse: this is one of the reasons that Phil Bernstein stated, at Gray's last WICS class (Stanford 1999), that all the disaster recovery methods he had seen used log replication

The other reason that the PPRC method is toxic to uncoordinated failover (disaster recovery) is that the streams that connect the various PPRC database volumes and log volumes to their mirrors all get shredded by failure at different and uncorrelated points in those replication streams: log replication streams are unified, and this shredding is easily prevented for log replication

Page 40: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-40

19. Scalable Joint Replication

PPRC (continued):

To make this kind of mirroring work, one would have to (a) generate 64-bit timestamps (any monotonic kind would do) on the primary side, (b) send them first to the database disk replicators, (c) then to the log replicators, (d) to be included in the streaming messages to the backup (and then these could not be raw physical writes)

On the backup, (e) no database disk write could be applied until a log disk write with a greater timestamp had been applied, so that after a disaster the log writes have the last word: this would work, I believe (a small sketch of rule (e) follows)
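A minimal Python sketch of that timestamp gate, steps (a) through (e), is below. The counter standing in for a 64-bit monotonic timestamp, the record shapes, and the print-based apply() are all assumptions for illustration.

```python
import heapq, itertools

_ticket = itertools.count(1)   # stand-in for a 64-bit monotonic timestamp source

def stamp(write):
    """(a)-(d): tag every replicated write, database and log alike, before it is streamed."""
    write["ts"] = next(_ticket)
    return write

class BackupApplier:
    """(e): hold database-disk writes on the backup until a log write with a greater
    timestamp has been applied, so the log has the last word after a disaster."""
    def __init__(self):
        self.last_log_ts = 0
        self.pending_data = []                 # min-heap of (ts, write)
    def on_log_write(self, write):
        self.apply(write)
        self.last_log_ts = write["ts"]
        while self.pending_data and self.pending_data[0][0] < self.last_log_ts:
            _, held = heapq.heappop(self.pending_data)
            self.apply(held)
    def on_data_write(self, write):
        if write["ts"] < self.last_log_ts:
            self.apply(write)
        else:
            heapq.heappush(self.pending_data, (write["ts"], write))
    def apply(self, write):
        print("applied", write["kind"], "ts", write["ts"])

b = BackupApplier()
b.on_data_write(stamp({"kind": "data", "page": 17}))   # buffered: no newer log write yet
b.on_log_write(stamp({"kind": "log", "lsn": 1}))        # applied, and the data write drains
```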

Page 41: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-41

19. Scalable Joint Replication

PPRC (continued):

PPRC does not do anything other than raw, out-of-the-can mirroring, so that won't work, especially when the application is not transactional, because then where's the log? And what goes first?

(2) Hitachi has an option for IBM systems called "TrueCopy" which has an async option that does continuous sorting on the backup side (which cannot scale), and a sync option which can also, I believe, ride beneath PPRC itself by some magic: that could work if they did it right, but even the method I gave will not scale up well

http://www.hds.com/assets/pdf/wp_117_02_disaster_recovery.pdf

Page 42: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-42

19. Scalable Joint Replication

PPRC (continued):

(3) EMC has "Multi-hop, Cascade Copy, Adaptive Copy" products: I cannot see how those will survive an uncoordinated failover due to disaster recovery; also, I don't know enough about the EMC async and sync replication solutions similar to PPRC (in spite of that ignorance, I remain duck-dubious of these for the same reasons as PPRC)

(4) Nonstop RDF has a fairly scalable joint asynchronous replication (1-safe) technique that supports local and distributed transactions in a disaster recovery scenario: this allows systems with multiple primary nodes to replicate with multiple streams to multiple backup nodes

Page 43: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-43

19. Scalable Joint Replication

As discussed before, RDF has an image log on the backup into which the RDF receivers dump log records, which are later applied by an RDF updater to the log and the database: in this scenario that is happening on multiple independent nodes

During normal execution on the primary side, there is a network synchronization process that executes on one node and periodically (every 15 seconds or so) updates a synchronization file that has a copy on every primary node in the network, using a single spanning distributed transaction that will thereby create a log record on each node under the same transaction

After a failure of one primary node, all of the other primary nodes essentially commit coordinated suicide by shutting down abruptly and transmitting the rest of their RDF extractor log records to the RDF collectors on the backup nodes: then a mass takeover is initiated

Page 44: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-44

19. Scalable Joint Replication

Upon takeover, on one of the backup nodes there is a network master that communicates with all the other backups and discovers how much has made it over on the backup side, after sufficient local recovery has been made on every node

From the successive and periodic spanning-transaction watermarks, a common synchronization point on all the logs is determined, and some complete transactions that actually committed on the backup-side nodes are undone, as well as many incomplete transactions, to put the network database entirely back to a consistent state according to the joint log that was separately transmitted by the RDF extractor-receiver pairs before the takeover; see the patent:

System and method for replication of distributed databases that span multiple primary nodes<http://www.google.com/patents?id=pIYSAAAAEBAJ&dq=6,785,696>
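To illustrate the idea of a common synchronization point (only the idea; the patent's actual algorithm is more involved), assume each backup node reports the spanning-transaction watermarks it has fully received, and the network master rolls everything back to the highest watermark present on every node:

```python
def common_sync_point(watermarks_per_node):
    """Highest spanning-transaction watermark that every backup node received completely."""
    received = [set(w) for w in watermarks_per_node.values()]
    common = set.intersection(*received)
    return max(common) if common else None

# backupC never saw watermark 42, so the whole network rolls back to watermark 41
reports = {"backupA": [40, 41, 42], "backupB": [40, 41, 42], "backupC": [40, 41]}
print(common_sync_point(reports))   # 41
```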

Page 45: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-45

19. Scalable Joint Replication

(5) Nonstop RDF has a lockstepped synchronous (2-safe) option which is built on top of the asynchronous replication method (1-safe), which is all that Nonstop RDF currently supports in a native way

This is implemented by a DoLockStep call, which causes the caller on the primary system to wait until the current position in the log being replicated to the backup system has traveled through the extractor to the collector and is safely stored in the image log on the backup system

Applications could first commit a mini-batch of asynchronous (1-safe) transactions that get replicated to the backup system with little or no performance hit or delay on the primary side, then execute a DoLockStep call, which amortizes the overhead of that one wait over all of the mini-batch work (much better than waiting for each disk write)
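The amortization pattern is simple enough to sketch in a few lines of Python; replicate_async() and do_lockstep() here are hypothetical stand-ins for a 1-safe commit and for RDF's DoLockStep wait, not real APIs.

```python
def replicate_async(tx):
    """1-safe commit: returns immediately; the log record travels to the backup later."""
    print("committed locally:", tx)

def do_lockstep():
    """Stand-in for DoLockStep: block until the current log position is safely stored
    in the image log on the backup system (the network round trip is elided here)."""
    print("log position durable on the backup")

def run_mini_batch(transactions):
    for tx in transactions:     # commit the whole mini-batch 1-safe, with no per-tx wait
        replicate_async(tx)
    do_lockstep()               # one wait, amortized over the whole mini-batch

run_mini_batch(["debit acct 7", "credit acct 9", "post fee"])
```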

Page 46: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-46

19. Scalable Joint Replication

The only downside is what to do when only part of the mini-batch was completed before an uncoordinated takeover in a disaster recovery scenario:

(a) If there are paper records kept and transmitted (email and fax) before the mini-batch gets done, then this last ugliness could be fixed by hand through a table maintenance interface (not so clean)

(b) The following is cleaner, but more work for the programmer:

Single-thread the lockstep mini-batches so that only one mini-batch is going on at a time, by creating a lockstep mini-batch facility (LS-MBF): then for each mini-batch create a unique mini-batch identifier (MBID)

Page 47: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-47

19. Scalable Joint Replication

Lockstep Mini-Batch Facility (LS-MBF), continued:

Put an extra column, called 'Last-MBID', in each row of each table to be updated under lockstep, and place the MBID retrieved from the LS-MBF into that column of the row being updated, for every updated or inserted row

At the start of the mini-batch, zero out the sequential undo table (also being replicated), which holds the complete undo from the last mini-batch (the LS-MBF could be programmed to do this); then before every row update, insert the undo (before image) of the row (and the table name and row position) at the end of the undo table (for inserts, just the table and row position to be deleted as undo for the insert)

Page 48: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-48

19. Scalable Joint Replication

Lockstep Mini-Batch Facility (LS-MBF), continued:

At the completion of the mini-batch, the call to the LS-MBF could insert a completed indicator at the end of the undo table and then call DoLockStep

Because of causality, after any disaster recovery on the backup node, the undo table will hold either a completed undo list of updates, or a partial undo list holding everything that needs to be undone to put the database right again

To make the LS-MBF safely single-threaded, either have it return an error or hang when the start call is made if the last completion call has not been made, and add an operator reset function to the LS-MBF (a sketch of such a facility follows)
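Below is a minimal, assumption-laden Python sketch of such an LS-MBF: tables are dictionaries, the undo table is a list, and do_lockstep is a caller-supplied stand-in for RDF's DoLockStep; none of this is an actual Nonstop interface.

```python
import itertools

class LSMBF:
    """Lockstep mini-batch facility, per the scheme above: single-threaded mini-batches,
    a Last-MBID stamp on every row touched, and a replicated sequential undo table."""
    def __init__(self, do_lockstep):
        self._next_mbid = itertools.count(1)
        self._in_batch = False
        self.undo_table = []                   # replicated along with the data tables
        self._do_lockstep = do_lockstep

    def start(self):
        if self._in_batch:
            raise RuntimeError("previous mini-batch not completed")   # single-threading guard
        self._in_batch = True
        self.undo_table.clear()                # zero out the undo from the last mini-batch
        self.mbid = next(self._next_mbid)
        return self.mbid

    def update(self, table, table_name, key, new_row):
        before = table.get(key)
        if before is None:
            self.undo_table.append(("delete", table_name, key))            # undo of an insert
        else:
            self.undo_table.append(("restore", table_name, key, before))   # before image
        table[key] = dict(new_row, last_mbid=self.mbid)                    # stamp the row

    def complete(self):
        self.undo_table.append(("completed", self.mbid))
        self._do_lockstep()                    # one lockstep wait for the whole mini-batch
        self._in_batch = False

# usage sketch; after a takeover, if the last undo-table entry is not a 'completed'
# marker, walk the undo table backwards and restore/delete rows to erase the partial batch
mbf = LSMBF(do_lockstep=lambda: None)          # no-op stand-in for the real DoLockStep
accounts = {}
mbf.start()
mbf.update(accounts, "accounts", "acct:7", {"balance": 100})
mbf.complete()
```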

Page 49: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-49

19. Scalable Joint Replication

Since there is no reason why the Nonstop RDF lockstep would malfunction or cause a problem within the context of the Nonstop multi-node joint replication takeover scheme, I have included it as an option under that umbrella

This means that under Nonstop RDF you can replicate asynchronously (1-safe) and consistently from multiple nodes to multiple nodes in a network (for any distance), and also, in the middle of that, do lockstep transactions (2-safe) in a distributed context, and this will all work

Nonstop RDF's lockstep is in the patent application:

Method and apparatus for lockstep data replication <http://www.google.com/patents?id=4r2GAAAAEBAJ&dq=2003/0050930>

Page 50: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-50

19. Scalable Joint Replication

There is no reason that I can perceive why the Nonstop RDF approach, which accomplishes lockstep 2-safe (sync) replication on top of 1-safe (async) replication, would not work perfectly well on top of anyone's async replication, thereby accomplishing 2-safe replication (in a DIY manner) with the throughput and response time of 1-safe (async), which is not that much different from non-replicated database performance

As absolutely phenomenal as Nonstop RDF is compared to everything else, it is not integrated into the transaction manager TM (Nonstop TMF), which would allow replication to be accomplished in a more seamless and sleek manner

Page 51: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-51

19. Scalable Joint Replication

An Optimal RDBMS would integrate virtual commit coordination by name and replication into the distributed TM (transaction manager) and RMs (resource managers) to accomplish something entirely new: what are the issues?

First, let's create an example application to demonstrate the issues clearly

A deep space network (DSN) sends all the communications automatically throughout the planetary system: email, voice mail, twitter tweets, news, and workflows that get driven remotely to magically do things on the replication backup clusters in a completely ACID and synchronized way

Page 52: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-52

19. Scalable Joint Replication

1 Mercurian communications satellites in Mercury's shadow (Aphelion=69.82Gm, Perihelion=46.00Gm, Semi-major axis=57.91Gm, Orbital Period=0.241y), 16.6 average round-trip light minutes from the earth-lunar system: Mercurian Orbiting Solar Observatory, Mercury's Shadow Stationary Laboratory (maintained by plasma rockets powered by Bussard electric fusion generators)

2 Venusian communications satellites (Aphelion=108.9Gm, Perihelion=107.5Gm, Semi-major axis=108.2Gm, Orbital Period=0.615y), 16.6 average round-trip light minutes from the earth-lunar system: Venusian floating city project, 50 Km above the surface (floating on the heavy carbon dioxide atmosphere), it's the most human environment in the solar system

Page 53: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-53

19. Scalable Joint Replication

3 Earth (Aphelion=152.1Gm, Perihelion=147.1Gm, Semi-major axis=149.6Gm, Orbital Period=1y)

4 Lunar polar deep space network stations - Lunar North and South Pole Command Centers, connected by the Lunar Bunker System Microwave Network (bunker habitats tunneled into the sides of craters and mountains) (Apogee=0.4057Gm, Perigee=0.3631Gm, Semi-major axis=0.3844Gm, Earth Orbital Period=0.0748y), 2.56 average round-trip light seconds from the earth station

5 Near-earth, Mars-crossing asteroid 1986 DA, 2.3 Km in diameter, metals mining in zero gravity (Aphelion=666.7Gm, Perihelion=173.7Gm, Semi-major axis=420.2Gm, Orbital Period=4.71y), 46.7 average round-trip light minutes from the earth-lunar system

Page 54: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-54

19. Scalable Joint Replication

6 Martian communications satellites (Aphelion=249.2Gm, Perihelion=206.7Gm, Semi-major axis=227.9Gm, Orbital Period=1.88y), 25.3 average round-trip light minutes from the earth-lunar system: Mars Terraforming Project

7 Main belt asteroid 216 Kleopatra - precious metal miners at Dogbone station; Kleopatra's shape (217 × 94 × 81 km) comes from a fortunate collision which exposed its metal core to surface miners (Aphelion=523.0Gm, Perihelion=312.5Gm, Semi-major axis=417.8Gm, Orbital Period=4.67y), 46.4 average round-trip light minutes from the earth-lunar system

Page 55: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-55

19. Scalable Joint Replication

8 Jovian communications satellites (Aphelion=816.5Gm, Perihelion=740.6Gm, Semi-major axis=778.5Gm, Orbital Period=11.9y), 86.6 average round-trip light minutes from the earth-lunar system: Io orbiting station studies vulcanism, occasional ground missions; Europa orbiting station studies life in the oceans under the ice, no ground missions: risk of infection

9 Saturnian communications satellites (Aphelion=1513Gm, Perihelion=1354Gm, Semi-major axis=1433Gm, Orbital Period=29.7y), 159.4 average round-trip light minutes from the earth-lunar system: Saturn floating cities construction project, to support the mining operation for Helium-3, which drives the fusion economy (same surface gravity as earth at the human-pressure visible cloud layer; the city is buoyed by hydrogen gas dirigibles heated by fusion reactors)

Page 56: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-56

19. Scalable Joint Replication

We can already see a difference that those response times reveal between the DSN and the datacenter: virtual commit coordinator glupdate will be a problem, but that can be handled by finalizing the configuration and disallowing replication takeovers

Now that we have an application, let's talk about 1-safe (asynchronous) replication (Gray/Reuter 12.6.3) first, because 2-safe (synchronous) waits for everything to propagate in replication before continuing, and that makes things simpler (although it presents its own problems in performance): 1-safe is the most difficult problem, because it loses transactions in an unpredictable way

Issue #1: The most serious problem that would remain to be solved to make 1-safe replication work is the problem of geographical dispersion of the takeover of transaction commit

Page 57: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-57

19. Scalable Joint Replication

After a cluster crash in the ‘datacenter’ role for virtual commit by name, the LSGs containing separate log partitions from the old cluster have all moved to other clusters, but the transaction states are all resolved from ESS (enterprise storage) common access to the single log root containing the transaction state records

For geographically dispersed replicated systems this same effect could be accomplished by having all the primary clusters on one ESS and all the backup clusters on a remote ESS, such that all the virtual commit coordinators would span the same network links between the two ESSs; and you could go half one way and half the other in a simple bidirectional replication scheme

Page 58: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-58

19. Scalable Joint Replication

Issue #2: The problem with this is slack: if the whole ESS on one side goes, all the load will now be executing in one datacenter, which means that previously the systems had to be running at <= 35% at peak execution to guarantee <= 70% at peak execution after the crash: this means > 65% slack before the crash at peak execution

This is not efficient utilization (an understatement)

Imagine a network of datacenters: say, eight datacenters spread around the continent, such that each had a log root and seven log partitions and attached RMs in seven LSGs that would disperse to the seven other datacenters in the case of a loss of the datacenter

Page 59: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-59

19. Scalable Joint Replication

In this scenario, the goal of 70% peak execution and 30% slack after a takeover would be accomplished by running at (70/8)*7 = 61.25% at peak execution, a performance hit of only 8.75% to accommodate the potential loading after a takeover (you could almost ignore this and run at 70%, or whatever you initially decided for slack); the arithmetic is worked below
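The 35% and 61.25% figures both fall out of the same formula: each of the n-1 surviving sites absorbs 1/(n-1) of the failed site's load, so the pre-crash utilization u must satisfy u * n/(n-1) <= 70%. A couple of lines of Python make the comparison explicit:

```python
def pre_crash_utilization(n_datacenters, post_crash_cap=0.70):
    """Largest pre-crash utilization that still keeps survivors at or below the cap
    after one site fails and its load is spread over the n-1 remaining sites."""
    return post_crash_cap * (n_datacenters - 1) / n_datacenters

for n in (2, 8):
    u = pre_crash_utilization(n)
    print(f"{n} datacenters: run at {u:.2%} before a crash "
          f"({0.70 - u:.2%} of headroom given up for takeover)")
# 2 datacenters: run at 35.00% before a crash (35.00% of headroom given up for takeover)
# 8 datacenters: run at 61.25% before a crash (8.75% of headroom given up for takeover)
```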

Issue #3: Some might suggest that we run seven TMs in a cluster, since we've undoubtedly separated out the locality of which RMs go to which datacenters: but those who say that cannot be aware of the astonishing scalability win of cluster group commit in clustered database performance

Page 60: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-60

19. Scalable Joint Replication

Throwing cluster group commit away and replacing it with 7-way distributed commit guarantees that the actual distributed commit with the remote clusters in the datacenter will be truly abominable in performance: I would not entertain this solution, even momentarily, due to my many years of working in this area (I just know better than that, you don’t rob Peter to pay Paul)

So, let's see if we can solve this without savaging an excellent architecture (in fact, we can)

Issue #4: In the geographically dispersed scenario, the toughest question is "where's the log root?"

Page 61: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-61

19. Scalable Joint Replication

If you replicated the log root to all seven geographically dispersed backup sites, you would have seven different answers after takeover as to which transactions commit and abort: that is intolerable!!!

Also intolerable is replicating the log root seven ways 2-safe, which would always have one slowest link and that would guarantee bad performance (although you would have the identical answer 7 times)

If you replicated the log root to a central site (yielding exactly one answer) and the other six takeover sites accessed that same answer there, that would work, but would create MTR problems if some sites had a link down to the central site holding the log root (better, but not solid enough)

Page 62: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-62

19. Scalable Joint Replication

Issue #5: The best answer is to put just enough transaction state into the log partitions, and that means into the replicating LSGs, to sort things out later, with minimal impact on performance beforehand

This means inserting new transaction state records into the log partition, not just the log root, in the case of a virtual commit coordinator which has the role of "geographic replication" instead of "datacenter"

Page 63: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-63

19. Scalable Joint Replication

Issue #6: These are the transaction states, which are stored as records in the log root (the tense is crucial):

Active state: only seen in the log if a working transaction gets caught by the periodic TM checkpoint

Prepared state: seen in a transaction which came in from a remote cluster (parent), and which is in the middle of a 2 or 3 phase commit

Committed state: seen in a transaction which has touched remote clusters (children), and which is at the controlling end of a 2 or 3 phase commit

Aborting state: like I said, aborting

Forgotten state: this transaction is now durably going away

Page 64: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-64

19. Scalable Joint Replication

These are the 10 possible transaction state record transitions in the log root:

Active or nil -> forgotten (hurried, lockrelease after): local commit, neither parents nor children

Active or nil -> prepared (hurried, locks held): distributed commit, definitely having parents, maybe having children

Active or nil -> committed (hurried, lockrelease after): distributed commit, no parents, definitely having children

Active or nil -> aborting (hurried, locks held): maybe local or distributed

Page 65: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-65

19. Scalable Joint Replication

Transaction state record transitions in the log root (continued):

Prepared -> forgotten (hurried, lockrelease after): distributed commit, definitely having parents, but no children

Prepared -> committed (hurried, lockrelease after): distributed commit, definitely having both parents and children

Prepared -> aborting (hurried, locks held): distributed commit, definitely having parents, maybe having children

Page 66: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-66

19. Scalable Joint Replication

Transaction state record transitions in the log root (continued):

Committed -> forgotten: distributed commit, maybe having parents, definitely having children

Aborting -> aborting (hurried, locks held): try, try again to abort

Aborting -> forgotten (hurried, lockrelease after): maybe local or distributed

(the full set of ten transitions is collected as a small table in code after this list)
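Collecting the ten transitions from the last three pages into one table makes them easy to validate mechanically. The sketch below is just that table in Python; the lock-behavior annotation for Committed -> forgotten is left blank because the list above does not give one.

```python
# (state before, state after) -> (lock behavior, situation)
TRANSITIONS = {
    ("active/nil", "forgotten"): ("hurried, lockrelease after", "local commit, no parents or children"),
    ("active/nil", "prepared"):  ("hurried, locks held",        "distributed, has parents, maybe children"),
    ("active/nil", "committed"): ("hurried, lockrelease after", "distributed, no parents, has children"),
    ("active/nil", "aborting"):  ("hurried, locks held",        "maybe local or distributed"),
    ("prepared",   "forgotten"): ("hurried, lockrelease after", "distributed, has parents, no children"),
    ("prepared",   "committed"): ("hurried, lockrelease after", "distributed, has parents and children"),
    ("prepared",   "aborting"):  ("hurried, locks held",        "distributed, has parents, maybe children"),
    ("committed",  "forgotten"): ("",                           "distributed, maybe parents, has children"),
    ("aborting",   "aborting"):  ("hurried, locks held",        "try, try again to abort"),
    ("aborting",   "forgotten"): ("hurried, lockrelease after", "maybe local or distributed"),
}
assert len(TRANSITIONS) == 10

def check_transition(old, new):
    """Reject any log-root transition that is not one of the ten above."""
    if (old, new) not in TRANSITIONS:
        raise ValueError(f"illegal transaction state transition {old} -> {new}")
    return TRANSITIONS[(old, new)]

print(check_transition("prepared", "committed"))
```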

Page 67: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-67

19. Scalable Joint Replication

Issue #7: The last-seen transaction state in the log has different lock guarantees, which have an impact on takeover at the backup cluster:

Active: these will get aborted, and you have to have locks to do that online: we want to do the aborts after the table is up for business (lock reinstatement), and not before, for better MTR

Prepared: these must remain unresolved, with locks held after the recovery, awaiting the parent's decision: our best chance of making this work, in a distributed replication scenario with presumed abort, is if the log root is 2-safe on the backup (I think, but we'll see)

Page 68: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-68

19. Scalable Joint Replication

The different transaction states … impact on takeover (continued):

Committed: these must remain committed, with locks released, until the children are all notified: presumed abort is no help here, and if the parent node or child node loses a commit decision in a takeover, this won't work!!!

(1) This seems to point to something fundamental, which touches upon relativity itself: 1-safe systems, with excellent response time, appear to require locality and centrality in coordinating commit in a single TM … and that is in contrast to 2-safe systems, which reliably allow a completely globally distributed system to coordinate commit with as many TMs as can be clustered together, but with response time issues

Page 69: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-69

19. Scalable Joint Replication

(2) It appears that having a 1-safe (asynchronous) system with global transaction integrity (one TM) requires having an implicitly 2-safe (synchronous) system in the middle of it: and this has advantages over 2-safe alone that we can mine with applications

So, we need go no further with this line of reasoning (distributed 1-safe), because a parent can tragically release locks after a commit on the primary or the backup, uncoordinated with the prepared child on its primary or its backup, so we need to make a restriction

Page 70: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-70

19. Scalable Joint Replication

Issue #8: Because of the uncontrollable problems with transaction commit coordination (and we haven't even talked about the raggedy ends of the various replication data streams yet): 1-safe transactions are all local: they cannot touch 2-safe or very-safe resources or go distributed

This can and should be handled in two ways:

The transactional file system examines the RMs it touches and will notice when a 2-safe or very-safe or distributed transaction attempts to touch a 1-safe resource, or the other way around, and will abort the transaction with the error: "Distributed, 2-safe or very-safe transactions are prohibited from accessing 1-safe replication resources"

Page 71: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-71

19. Scalable Joint Replication

The second way: if the TM Library, when it flushes local RMs for commit processing, keeps track of the infection of the transaction by 1-safe resources with a 1-safe flag, then the TM (transaction manager), when it collects the prepared responses, can instead abort any distributed or 2-safe or very-safe transactions marked with the 1-safe flag: this covers the case where two processes working on behalf of the transaction might split the accesses of 1-safe and (distributed or 2-safe or very-safe) resources: the TM can guarantee these will abort without any leakage of improper commit (a small sketch of this check follows)
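A minimal Python sketch of that second check, performed at prepare-collection time, is below; the Tx class, the touch() call from the TM Library, and the error text are illustrative stand-ins rather than any real TM interface.

```python
class Tx:
    def __init__(self):
        self.one_safe = False        # set when any 1-safe replicated resource is touched
        self.distributed = False     # set when the transaction spans clusters
        self.safety_classes = set()

def touch(tx, rm_safety):
    """Called as the TM Library flushes local RMs: record the replication class touched."""
    tx.safety_classes.add(rm_safety)
    if rm_safety == "1-safe":
        tx.one_safe = True           # the 1-safe flag: the transaction is now 'infected'

def collect_prepared(tx):
    """At prepare collection, abort any mixing of 1-safe with distributed/2-safe/very-safe."""
    mixed = tx.one_safe and (tx.distributed or
                             bool(tx.safety_classes & {"2-safe", "very-safe"}))
    if mixed:
        return ("abort", "Distributed, 2-safe or very-safe transactions are "
                         "prohibited from accessing 1-safe replication resources")
    return ("commit", None)

tx = Tx()
tx.distributed = True
touch(tx, "1-safe")
print(collect_prepared(tx)[0])   # abort: no leakage of improper commit
```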

Page 72: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-72

19. Scalable Joint Replication

That simplifies things considerably!!

Issue #9: These are the local transaction states (1-safe), which are stored as records in the log root (the tense is crucial):

Active state: only seen in the log if a working transaction gets caught by the periodic TM checkpoint

Aborting state: like I said, aborting

Forgotten state: this transaction is now durably going away

Page 73: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-73

19. Scalable Joint Replication

These are the 4 possible local transaction state record transitions (1-safe) in the log root:

Active or nil -> forgotten (hurried, lock release after): local commit, with neither parents nor children

Active or nil -> aborting (hurried, locks held): maybe local or distributed

Aborting -> aborting (hurried, locks held): try, try again to abort

Aborting -> forgotten (hurried, lock release after): maybe local or distributed

Page 74: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-74

19. Scalable Joint Replication

Our basic method will be:

(1) We will write some transaction state records into the log partition (during what was previously a group flush of the log partitions for serialization before the group commit write to the log root) … so we just throw the records into that new log partition write buffer

(2) Also, we queue some transaction state records to be written into the log partition before the next group commit write and after this one: after the log partition flush write … we will just transfer those records from the new queuing buffer into the new write buffer for the log partition
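A minimal sketch of the two buffers implied by (1) and (2) above, assuming a hypothetical per-partition pair of a current write buffer and a deferred queue (names invented for illustration):

```python
# Minimal sketch (hypothetical structure): each log partition carries a current
# write buffer and a deferred ("post-state") queue; the deferred records ride
# in the *next* group-commit flush, after this one completes.
class PartitionBuffers:
    def __init__(self):
        self.write_buffer = []      # flushed with the current group commit
        self.deferred = []          # queued for the flush after this one

    def queue_pre_state(self, record):
        self.write_buffer.append(record)

    def queue_post_state(self, record):
        self.deferred.append(record)

    def flush(self):
        flushed, self.write_buffer = self.write_buffer, []
        # after the partition flush write, promote the deferred records
        self.write_buffer, self.deferred = self.deferred, []
        return flushed

p = PartitionBuffers()
p.queue_pre_state("RM-Ending tx1")
p.queue_post_state("RM-Ended tx1")
print(p.flush())   # ['RM-Ending tx1']  -- pre-state rides this group commit
print(p.flush())   # ['RM-Ended tx1']   -- post-state rides the next one
```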

Page 75: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-75

19. Scalable Joint Replication

Call them ‘pre-state’ and ‘post-state’ log partition transaction state records

The purpose of this effort is to bracket what we already know is going to be in the log root, and give us some knowledge in the log partition of the LSG after the crash: we will mightily prune down the uncertainty related to transaction states in the LSG

That uncertainty translates into the transactional locks that must be held until we can determine the truth by accessing the log root transaction state records at some later time (after the <= 30 seconds that we need to make the LSG available for business)

Page 76: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-76

19. Scalable Joint Replication

The cost of this enormous improvement to the MTR (<= 30 seconds if lock reinstatement is enabled) and to the availability of the database system after a crash is a fairly tiny price:

If we can, through careful accounting, know which log partitions are being affected by which transaction state changes in the log root, then the flush for that log partition which is already happening (from the Basic Scalability slides) can include some log records, appended to the log partition’s input buffer before flushing that log partition, as hitchhikers

There is a case where the log pointer that we were asking the log partition to flush to might have already been exceeded by streaming writes, in which case we will get an extra write on those log partitions when we are replicating geographically

Page 77: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-77

19. Scalable Joint Replication

Some transaction state records are already implicit in the log partition: any RM generated update log record (let’s call it an RM-Update log record) means that the transaction state is implicitly Active in the log root, and any aborting compensation log record (let’s call it an RM-CLR) means that the transaction state is implicitly Aborting in the log root

We need to add three new transaction state records, specifically for the 1-safe log partitions:

RM-Ending: written a split second before writing a committing Forgotten record to the log root

RM-Ended: written a little after the committing Forgotten record is written (if we survived)

RM-Aborted: written after an aborting Forgotten record is written (if we survived)

Page 78: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-78

19. Scalable Joint Replication

The goals for RM recovery after a takeover:

In any case, if there are no transaction state records in the LSG, we will abort every transaction we see: the RM transaction state records prevent that

We want to see an RM-Ending for any Forgotten non-distributed commit, to prevent an abort and to keep the locks

We want to see an RM-Ended for any Forgotten non-distributed commit, to prevent an abort and to release the locks

We want to see an RM-Aborted for any Forgotten non-distributed abort, to prevent re-driving an abort and to release the locks

Page 79: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-79

19. Scalable Joint Replication

Let’s describe the new transaction state record transitions in the log partition:

This new pre-commit log record needs to go into the current commit flush log partition write buffer in the TM:

RM-Update -> RM-Ending: the log root is about to write a Forgotten log record (Abort flags clear, no parents and no children: this is the local commit case), this record will be written to the log partition first: there is a window for this transaction to ultimately get rolled back, before the entire log for this 1-safe transaction becomes 2-safe on the log in the backup clusters

Page 80: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-80

19. Scalable Joint Replication

This new post-commit log record needs to get queued for the log partition in the TM:

RM-Update -> RM-Ending -> RM-Ended: the log root has written a Forgotten log record (Abort flags clear, no children) by the time this is dequeued and written, otherwise we couldn’t be here: there are no network guarantees related to this transaction, so the only relationship is to other replication streams from this primary cluster, call that ‘log resolution’ (explained later)

Page 81: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-81

19. Scalable Joint Replication

This new ‘post-group-commit’ log record needs to get queued (continued):

RM-Update -> RM-CLR -> RM-Aborted

RM-Update -> RM-Ending -> RM-CLR -> RM-Aborted (corner case)

In both, the log root has written a Forgotten log record (Abort flags set): these are the aborted cases, which are not guaranteed to have completed in this LSG until the RM-Aborted log partition record is reached

The next thing we need are low and high water marks for doing some RM-restart functions (called volume recovery on Nonstop, crash recovery for the RM) after takeover on the backup

Page 82: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-82

19. Scalable Joint Replication

Let’s start with the high water marks: the root pointer is the 64-bit count of buffers written to the log root: if there are 100 writes/second (very aggressive), 64 bits (less the sign bit) gives us almost 3 billion years before that number rolls over
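A quick back-of-the-envelope check of that rollover claim, assuming the stated 100 buffer writes per second:

```python
# Rough check of the rollover arithmetic (assumed rate: 100 buffer writes/second)
writes_per_second = 100
seconds_per_year = 365.25 * 24 * 3600
years_to_rollover = (2**63 - 1) / writes_per_second / seconds_per_year
print(f"{years_to_rollover:.2e} years")   # ~2.9e9, i.e. almost 3 billion years
```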

Information regarding the log root and partitions needs to be stored in our configuration files for the transaction system in the cluster

This is not stored in Windows or other registry

It’s findable after restart by a-priori code methods

These are smallish flat files, probably segmented

The files are not transaction protected (the classic chicken and egg problem)

Duplicated by copying to attached storage and ESS, so that it’s findable at startup and remotely

These files are encrypted and check-summed

Page 83: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-83

19. Scalable Joint Replication

High water marks (continued):

The root pointer needs to be guaranteed to monotonically increase (never, ever decrease)

When the cluster TM comes up, it finds the log root and log partitions from the configuration files and by the ‘fixup’ method (discussed previously), comes up with the root pointer value

Before every buffer write (mostly group commit writes) on the primary log root, the root pointer is incremented

Page 84: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-84

19. Scalable Joint Replication

High water marks (continued):

At the end of the group commit write, a merge log record is written to the log root containing all the log pointers for the log partitions, some having changed because they just did flushes or flush writes; the merge record should also contain the root pointer, not because we don’t know it, but for use in takeover on the backup cluster

The merge log record will then have the highest address in the log root after the commit flush, a greater root pointer value than any transaction state record that was just written to the log root

Page 85: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-85

19. Scalable Joint Replication

High water marks (continued):

So, 1-Safe transaction commit writing is like this:

Every time a call is made to queue a non-distributed Forgotten record to the TM buffer for writing to the log root:

If the Forgotten record’s abort flags are clear, then at the top of this call a call is made to queue an RM-Ending pre-record into the TM buffer for writing to the 1-safe log partition

At the end of this call, if the abort flags are clear, a call is made to queue an RM-Ended record into the deferred buffer for post-group-commit writing to the 1-safe log partition, otherwise an RM-Aborted record is queued

Page 86: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-86

19. Scalable Joint Replication

High water marks (continued):

1-Safe transaction commit writing (continued):

When the time comes, a call is made to force write the TM log root buffer to the log root

At the top of this call, the root pointer is incremented and inserted in the merge record at the end of the commit buffer

Then a watermarks record (with the new root pointer value) is inserted at the end of the TM buffers for writing to the 1-safe log partitions

Then all the log partitions are flush written in parallel, and when those log partition writes complete, the deferred buffers are copied into the TM buffers for writing to the log partition next time

After all the log partitions are flush written, the log root buffer is flush written, and we’re done
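Pulling the last two slides together, here is a minimal sketch (hypothetical names and structures, not the actual TM code) of that force-write ordering: bump the root pointer, stamp the merge record, stamp a watermarks record on each 1-safe partition, flush the partitions in parallel, promote the deferred post-records, then flush the log root last:

```python
# Minimal sketch (hypothetical names) of the 1-safe group-commit force-write ordering.
from concurrent.futures import ThreadPoolExecutor

class Partition:
    def __init__(self, name):
        self.name, self.write_buffer, self.deferred, self.log = name, [], [], []
    def flush_write(self):
        self.log.extend(self.write_buffer)
        self.write_buffer = []

class TM:
    def __init__(self, partitions):
        self.root_pointer, self.root_buffer, self.root_log = 0, [], []
        self.partitions = partitions

    def force_group_commit(self):
        self.root_pointer += 1                                   # monotonically increasing
        self.root_buffer.append(("merge", self.root_pointer))    # merge record ends the commit buffer
        for p in self.partitions:
            p.write_buffer.append(("watermarks", self.root_pointer))
        with ThreadPoolExecutor() as pool:                       # partitions flushed in parallel
            list(pool.map(Partition.flush_write, self.partitions))
        for p in self.partitions:                                # deferred post-records ride next time
            p.write_buffer, p.deferred = p.deferred, []
        self.root_log.extend(self.root_buffer)                   # log root flushed last
        self.root_buffer = []

tm = TM([Partition("lsg-1"), Partition("lsg-2")])
tm.partitions[0].write_buffer.append(("RM-Ending", "tx1"))
tm.partitions[0].deferred.append(("RM-Ended", "tx1"))
tm.force_group_commit()
print(tm.partitions[0].log)   # RM-Ending then watermarks; RM-Ended waits for the next flush
```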

Page 87: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-87

19. Scalable Joint Replication

High water marks (continued):

The root pointer in the merge and watermarks records creates a spanning indication of progress for the entire merged log, which can be tracked and recorded

The replication service extractor-receiver pairs will be replying ‘safe’ log pointers and redo applied (root pointer values) from the backup clusters, and those updates will be transmitted to the TM service in the primary cluster

The ‘safe’ log pointer value is what has been received, whereas the redo applied value is what we guarantee to apply if a takeover immediately occurs: this should be equal to 30 seconds or a minute’s worth of log updates, at the current rate of transmission, beyond the last received high water mark

Page 88: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-88

19. Scalable Joint Replication

High water marks (continued):

There is absolutely no benefit (other than looking at uncommitted data, which will be discussed) to be derived from applying more redo than the transaction system can commit consistently: later it will just have to be rolled back for consistency anyway, with a potentially vastly greater online rollback, holding so many locks that lock escalation to the table or partition or index levels makes more of the database unavailable for a longer MTR

The TM service will track the joint minimum of all the redo applied values that have been replied, for all transmitting 1-safe LSGs and the log root (which will always reply that it has applied everything it received): this is the instantaneous measure of joint progress for the later log resolution after a crash (with the hopeful assumption that the backups can take over)

Page 89: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-89

19. Scalable Joint Replication

High water marks (continued):

Some 1-safe LSGs may be inactive and completely transmitted, which is considered up to date, so their transmitted redo applied is neglected from the joint minimum (exclude commands are also a way of separating odd LSGs out)

The joint minimum 1-safe redo applied becomes the monotonically increasing high water mark root pointer value for the set of 1-safe LSGs spanning the primary cluster and all of the backup clusters

The 1-safe high water mark will also be included in the watermarks record written on the 1-safe log partitions and the merge record written on the log root

Thus, each geographically dispersed 1-safe LSG on a backup cluster will know, at takeover time, the joint serialization level of 2-safe everywhere
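A minimal sketch of that joint-minimum bookkeeping, with invented names; it assumes each LSG replies a redo applied root-pointer value plus a fully-transmitted flag, and that the mark only ever moves forward:

```python
# Minimal sketch (hypothetical names): joint minimum of the replied "redo applied"
# root-pointer values across transmitting 1-safe LSGs plus the log root; inactive,
# fully transmitted LSGs and explicitly excluded LSGs are left out.
def joint_high_water_mark(replies, previous_hwm, excluded=frozenset()):
    """replies: {lsg_name: (redo_applied_root_pointer, fully_transmitted)}"""
    active = [applied for name, (applied, done) in replies.items()
              if name not in excluded and not done]
    if not active:                           # nothing outstanding: hold the mark
        return previous_hwm
    return max(previous_hwm, min(active))    # monotonically increasing

replies = {"log-root": (1042, False), "lsg-1": (1038, False), "lsg-2": (900, True)}
print(joint_high_water_mark(replies, previous_hwm=1035))   # -> 1038
```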

Page 90: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-90

19. Scalable Joint Replication

The low water marks are handled in a fairly standard way (Gray/Reuter 11.3.2):

The redo low water mark is the earliest log partition pointer that is needed in the initial redo pass of RM Restart, which reapplies redo from that point to build the RM cache after an RM failure: this is moved up in the log partition by the lazy writes to the data volume in the CAB-WAL-WDV protocol - the lazier the WDV is, the longer the restart time will be - so this is controllable by a sharper WDV policy

The undo low water mark is the earliest log partition pointer of all the transactions touching an RM that are incomplete at the time of the crash, that is needed in the undo pass to work backwards from the end of the log partition until all transactions are undone – the size of the undo pass is related to the number and size of transactions – this isn’t under TM control
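A minimal sketch of the two low water marks as just described, with hypothetical inputs (the oldest dirty-page pointer for redo, the first log pointer of each incomplete transaction for undo):

```python
# Minimal sketch (hypothetical names): the redo low water mark trails the lazy
# data-volume writes, while the undo low water mark is the earliest log-partition
# pointer of any transaction still incomplete at the crash.
def redo_low_water_mark(oldest_dirty_page_ptr, end_of_log):
    # everything at or after the oldest un-written dirty page must be re-applied
    return oldest_dirty_page_ptr if oldest_dirty_page_ptr is not None else end_of_log

def undo_low_water_mark(incomplete_txns, end_of_log):
    # incomplete_txns: {txn_id: first_log_partition_pointer}
    return min(incomplete_txns.values(), default=end_of_log)

print(redo_low_water_mark(oldest_dirty_page_ptr=4711, end_of_log=5000))  # 4711
print(undo_low_water_mark({"tx7": 3301, "tx9": 4620}, end_of_log=5000))  # 3301
```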

Page 91: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-91

19. Scalable Joint Replication

Low water marks (continued):

The redo low water mark is local and is different on both sides, so we can ignore that in replication

The undo low water mark is semi-global, in that the transactional consistency guarantees are replicated with the log updates to the backup

Thus, the undo low water mark for an RM on a 1-safe log partition needs to be included in the watermarks record written on that log partition for every commit flush

Now, everything is in place to do as much as can be done on the backup side to get the 1-safe LSG RMs restarted with the appropriate locks held, to preserve data integrity

Page 92: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-92

19. Scalable Joint Replication

During the steady state of replication, the extractors transmit the log partition updates to the receivers, who apply the updates to the log on the backup cluster, and then ship the redo to continuously rebuild the database cache in the backup RM (you might call this WAL-RDO-WDV, with the lazy writes to the database volume a little less lazy)

Three possible policies or modes of operation occur to me, either for the entire 1-safe LSG on the backup side, or for individual 1-safe RMs on the backup side (using definitions from Gray/Reuter 7.6.2-3)

Page 93: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-93

19. Scalable Joint Replication

Three possible 1-safe policies (continued):

Global committed: the replication receiver applies all the redo (RM-updates and RM-CLRs) to the log partition and the replication updater applies the log partition updates to the RM to continuously rebuild the cache, always staying <= the point of globally guaranteed committed data: this means where the watermarks record’s root pointer value <= the most recent global high water mark value… after the takeover, redo is applied to the RM to rebuild the cache all the way up to the most recently replied redo applied guarantee… then locks will be reinstated from two RM checkpoints prior to the high water mark up to <= the redo applied point… to make the database available online in preparation for the online rollback to the final resolution point

Page 94: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-94

19. Scalable Joint Replication

Three possible 1-safe policies (continued):

Global uncommitted: During and after the takeover, the replication receiver applies all of the redo received to the log partition and to the RM to rebuild the cache (beyond the high water mark to the end, such that it is always true that the redo applied = redo received), until all of the redo that will ever arrive has been applied… after the takeover, locks will be reinstated from two RM checkpoints prior to the high water mark to the end of the log partition… to make the database available online in preparation for the (potentially quite lengthy) online rollback to the final resolution point after the takeover

Page 95: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-95

19. Scalable Joint Replication

Three possible 1-safe policies (continued):

Local uncommitted: During and after the takeover, the replication receiver applies all of the redo received to the log partition and to the RM to rebuild the RM cache beyond the high water mark, until all of the redo that will ever arrive has been applied … after the takeover, the RM is recovered to the last transaction committed at the end of the log partition of this LSG

Issue #10 Although the locks can be acquired by scanning the last two RM checkpoints in the log partition, the same is not true for the final transaction states: a transaction could become inactive early in its life and become a longer-running transaction for some reason beyond our control

Page 96: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-96

19. Scalable Joint Replication

The scan for final transaction state of all the outstanding transactions must begin at the undo low water mark for the RM, which is not in the control of the TM and can be quite a ways back there in history

The solution for this is simple: the replication updater will be given the job of continuously maintaining a current list of unresolved transactions for the LSG (which is unified for all RMs by the log partition flush writing): always going up to, and never traversing beyond, the watermarks record with the root pointer value <= the last received high water mark value for the global committed or global uncommitted policies, or to the end of the log for the local uncommitted policy

Page 97: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-97

19. Scalable Joint Replication

Transactions come into existence by RM-update records, potentially go into Aborting state via RM-CLR records, attempt to commit via RM-ending records and are discarded from the list via RM-ended and RM-aborted records: at any particular moment, the remaining set (updating, aborting and ending) are the set of unresolved transactions for takeover

If the replication updater dies for some reason, simply restarting it at the undo low water mark will get the unresolved transactions list again
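A minimal sketch of that updater bookkeeping (record shapes invented for illustration): fold each log partition record into a running map and drop entries on RM-Ended or RM-Aborted:

```python
# Minimal sketch (hypothetical names): the replication updater folds each log
# partition record into a running map of unresolved transactions; RM-Ended and
# RM-Aborted discard entries, everything else records the last state seen.
TERMINAL = {"RM-Ended", "RM-Aborted"}

def apply_record(unresolved, record):
    """record: (txn_id, kind) with kind in RM-Update/RM-CLR/RM-Ending/RM-Ended/RM-Aborted"""
    txn_id, kind = record
    if kind in TERMINAL:
        unresolved.pop(txn_id, None)    # discarded from the list
    else:
        unresolved[txn_id] = kind       # still updating, aborting or ending
    return unresolved

unresolved = {}
for rec in [("t1", "RM-Update"), ("t2", "RM-Update"), ("t1", "RM-Ending"),
            ("t1", "RM-Ended"), ("t2", "RM-CLR")]:
    apply_record(unresolved, rec)
print(unresolved)   # {'t2': 'RM-CLR'} -- t2 is still aborting, t1 is resolved
```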

All three 1-safe policies retain the inner 2-safe capabilities: triggers and publish/subscribe interfaces could queue off of the edge of the high water mark for the RM identically in all three cases, allowing workflow and other database queuing mechanisms to work consistently, independent of the policy decided for the application and query access supporting advance reading of the replication stream

Page 98: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-98

19. Scalable Joint Replication

The advance reading capabilities of the local uncommitted policy would not be so important on terrestrial systems, but for readers on Saturn, this would mean 2 hours and 40 minutes earlier access to data that won’t disappear with the terrestrial system

The Global * policies would be more important for workflow, and global application consistency and fairness policies that would disallow some users from getting ahead of others in accessing data (the log partitions could be encrypted to guarantee such a policy, with keys propagated along with the high water mark that enabled the database to function at a time guaranteeing fair access to response), this could keep people in the farthest reaches of the interplanetary system feeling like they’re connected and relevant

Page 99: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-99

19. Scalable Joint Replication

Getting back to restarting the 1-safe RMs after a takeover …

After takeover and all the redo (RM-update and RM-CLR records) has been applied (according to the RM policy), and if lock reinstatement is enabled, then a scan for RM checkpoints to find the locks must be done differently, according to the policy:

Local uncommitted: all the redo available was applied to the RM cache… now the last two RM checkpoints before the end of the log partition are scanned and locks reinstated (only for transactions in the updater’s unresolved transactions list) to bring the RM back online, then a standard RM restart … online rollback (reading backwards in the log from the end and aborting any uncommitted transactions) is immediately done, because there is no coordinated resolution for this policy

Page 100: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-100

19. Scalable Joint Replication

Scan for locks, according to the policy (continued):

Global uncommitted: all the redo available was applied to the RM cache before the takeover… now, starting at two RM checkpoints before the watermarks record whose root pointer value <= the last received high water mark (only including locks from the unresolved transaction list for the region before that watermarks record), a lock scan is done to find locks all the way to the end of the log (retaining all of the locks for transactions begun after that watermarks record) and those locks are reinstated to bring the RM back online (the potential for lock escalation and reduced availability is the price for reading uncommitted data as a backup and afterwards becoming consistent as a functional primary)… later, after log and commit resolution, an online rollback will be done

Page 101: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-101

19. Scalable Joint Replication

Scan for locks, according to the policy (continued):

Global committed: only the redo before the watermarks record whose root pointer was <= the most recently received high water mark was applied to the RM cache before the takeover … now, the last redo applied value that we replied to the primary is the amount of redo that we actually guaranteed to the primary that we have, so we apply the remaining unapplied redo that is <= the last replied redo applied value … now, starting at two RM checkpoints before the watermarks record whose root pointer value <= the last received high water mark (only including locks from the unresolved transaction list for the region before that watermarks record), a lock scan is done to find locks all the way to the last replied redo applied before the takeover (retaining all of the locks after that watermarks record) and those locks are reinstated to bring the RM back online … later, after log and commit resolution, an online rollback will be done

Page 102: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-102

19. Scalable Joint Replication

The lock situation for 1-safe databases (other than the local uncommitted policy) that will be rolled back while online is complex: for the global uncommitted policy - there will be many committed transactions that get rolled back, these may be holding the same row locks and table locks, and the online rollback function must deal with this rationally, by inheriting the locks and releasing them at the appropriate times, so that wormholes due to online access by applications are not introduced in reverse (“be as careful getting out as you were careless going in” – Barack Obama)

Page 103: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-103

19. Scalable Joint Replication

All of the incomplete transactions (for global * policies) will get aborted after resolution, so their locks must be held for now, because we don’t know for certain how high the water finally got: that will wait for final log resolution and final commit resolution that comes back from the takeover master (described soon) with the globally merged and finally resolved log endpoint

If there is lock reinstatement, then the database can be brought online with transactions awaiting resolution: no aborts should be done yet, because even the already aborting transactions that were incompletely aborted on the primary may be re-driven again, and you wouldn’t want to re-drive them twice

Page 104: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-104

19. Scalable Joint Replication

The first knowledge of the takeover of the LSG on the backup system comes to the backup virtual commit coordinator (VCC), which executes in the TM of that physical cluster: that arrives either by an automated takeover or by an operator command for this particular named virtual commit coordinator

The first thing the new primary VCC does is to monitor the shutdown of the receiver for the LSG (the updater continues to operate through resolution)

Once the receiver is finished writing the log partition, it is merged into the physical log of this new primary system by the new primary VCC writing a named VCC takeover log record to the log partition, with the current root pointer for the log root on this cluster: that root pointer will be the basis for the sequencing of the high water mark and redo applied values in the watermarks records from this point on

Page 105: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-105

19. Scalable Joint Replication

Archive dumping and recovery can cross over the takeover threshold, but no functions related to takeover: RM-restarts, online rollback, high water mark and redo applied sequencing from the watermarks log record all stop at the takeover record boundary, because the root pointer in watermarks log records is based on the log root in the former primary cluster

Note that the log partition pointers for archive dumping and recovery, undo and redo low water marks and other log partition related functions contain a log partition-relative pointer and this moves portably along with the log partition: it is a 64-bit counter of buffers written to this log partition by RMs (and the occasional TM buffer write)

Page 106: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-106

19. Scalable Joint Replication

Now that the TM transaction service has completed a takeover for the virtual commit coordinator from the old primary cluster to the backup cluster, that new primary VCC is ready to reinstate the virtual transactions (for the locks we scanned), which are not foreign, but are home to this VCC

Through a trusted interface, the RM reinstates the transactions that are left to be resolved, in a special “Unresolved” transaction state (with abort flags, etc.) … after all RMs have completed this for the LSG, and after final commit resolution, aborts can begin

The cluster TM service which received the replication of the log root, is the takeover master for all the LSGs on the old primary cluster: that TM service has everything it needs to resolve final commit for takeover immediately

Page 107: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-107

19. Scalable Joint Replication

The log root only receives log writes from the TM (transaction manager) and not the RMs: for 1-safe resolution the last completed commit write (the merge record is the end marker) is the end of the log (remember, in 1-safe we always lose some transactions, so we don’t want to be scraping them out of the garbage): this is because the root pointer and the high water mark are for a completed set of transaction state changes, and not for a partial set (it’s epochal)

So, the final high water mark from the merge log record of the last completed commit write on the old primary's log root is written to the log root of the takeover master in a takeover log record

Page 108: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-108

19. Scalable Joint Replication

With that final high water mark, the takeover master scans the last two TM checkpoints (before the final high water mark) in the log root received from the old primary, coming up with the resolutions for the final transaction states of all the 1-safe transactions for the LSGs that were primaried there at the time of the takeover:

Active: abort these

Aborting: abort these again

Forgotten: these got committed or aborted

In the case of the global * policies, all the transactions that started after the final high water mark also have to be rolled back, and those are in the log in the form of RM-Update records and RM-CLR records

Page 109: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-109

19. Scalable Joint Replication

The takeover master sends these transaction resolutions along with the final high water mark to all the virtual commit coordinators involved

Once those virtual commit coordinators receive the final resolution (and advance the updaters up to that point to add to the unresolved transaction list), they can deal with the “Unresolved” state transactions:

(1) Those that were committing, but got aborted are fine, the other way around is an anomaly

(2) Those remaining for which no resolution is sent go according to the last record seen in the range (don’t read beyond the final high water mark), as shown in the sketch after this list:

RM-Aborted -> release the locks, it’s done

RM-Ended -> release the locks, it’s done

RM-Ending -> this didn’t quite make it, abort it

RM-Update -> abort it

RM-CLR -> abort it again, the abort didn’t complete

Page 110: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-110

19. Scalable Joint Replication

(3) The global uncommitted policy RMs will have transactions that were begun beyond the final high water mark and all of these must be aborted

(4) The aborts are done by an online rollback process that (5) does the undo scans and resolutions and then (6) inherits the locks and (7) does the aborts and lock releases and then the takeover is complete
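Here is the sketch referenced in step (2): a minimal mapping (hypothetical names) from the last record seen, at or below the final high water mark, to the action taken for a transaction with no explicit resolution:

```python
# Minimal sketch (hypothetical names) of the per-transaction fallback in step (2):
# when the takeover master sent no explicit resolution, act on the last record
# seen at or below the final high water mark.
def resolve_by_last_record(last_seen):
    actions = {
        "RM-Aborted": "release locks (already aborted)",
        "RM-Ended":   "release locks (already committed)",
        "RM-Ending":  "abort (the commit did not quite make it)",
        "RM-Update":  "abort",
        "RM-CLR":     "abort again (the abort did not complete)",
    }
    return actions[last_seen]

for kind in ("RM-Ended", "RM-Ending", "RM-CLR"):
    print(kind, "->", resolve_by_last_record(kind))
```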

When the dead primary revives, or simply reconnects to the network, there is a process for coming to terms with changes to the database made by the new primary, and the loss of 1-safe transactions in the original takeover: that process is called “catch-up after failure” (Gray/Reuter 12.6.4), and is also called reintegration

Page 111: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-111

19. Scalable Joint Replication

Reintegration begins when the original primary tries to connect with the group with a message, which fetches a session mismatch error due to the regroup round that declared it down; it then rejoins the group, with the group locker performing a glupdate

The notification, by the group locker, to all of the seven (in our eight cluster scenario) virtual commit coordinators that their brother is alive, causes them to send a message to their sibling to reintegrate

Page 112: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-112

19. Scalable Joint Replication

That 1-safe reintegration will start with a synchronization:

(1) the final high water mark in the log root and the fate of the transactions that were resolved in the takeover are sent by the primary virtual commit coordinator to the backup sibling

(2) this will cause a crash recovery for all the RMs on the 1-safe LSGs that are resolved at that log root restart point, using those transaction resolutions

(3) after that is complete, extractor-receiver pairs are started up

(4) the log partition updates since the takeover are transmitted to the new backups

(5) having done so the system is back to steady state replication

Page 113: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-113

19. Scalable Joint Replication

This reversed scenario, that of seven primary clusters replicating into a single backup cluster (which Gray/Reuter 12.6 calls a “hub” configuration, and which I call “fan-in”) is probably uncomfortable for the applications’ locality of reference based on “fan-out”, and a coordinated takeover (called a giveover, or giveswitch, or simply by the verb “to primary”) can bring it all back together again, one virtual commit coordinator by one

A giveover command causes the primary-side virtual commit coordinator to drain the transactions out of the system, simply by hanging the transactional file system at the transaction begin call and, after half a minute, informing the extractors to drain their queues and to send a giveover indication, which causes the receivers on the other side to stop receiving and the extractors to shut down

Page 114: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-114

19. Scalable Joint Replication

Some transactions may be hung, or taking too long to exit: these transactions will abort naturally on takeover

The primary virtual commit coordinator then passes the giveover baton to the sibling (making itself into a backup)

The backup sibling, upon receiving the baton immediately uses the interface to the group locker to declare itself the new primary virtual commit coordinator, and writes a takeover record to the log partition

Then the new primary does a coordinated takeover (simply aborting the outstanding transactions) and reestablishes replication the other way, by putting the RMs online for transaction service, starting up extractors which connect to their receivers on the backup side and putting replication into steady state: and then we’re back online, good as new

Page 115: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-115

19. Scalable Joint Replication

As was mentioned earlier, the DSN (deep space network) cannot do glupdates, due to immense response times, so virtual coordinator relationships are finalized and static, and takeovers are only necessary when the connection goes dark for long periods (so you can manipulate the database when LSG streaming is down)

A 1-safe, local uncommitted policy is indicated on Saturn for a replication stream from Earth-Lunar (159 minute round trip), but a shifting policy might make sense for the miners at asteroid 1986 DA, which has such an eccentric orbit that it crosses the orbits of Mars and the larger mining station at asteroid 216 Kleopatra (Dogbone), and at times 1986 DA has a 3 minute round trip time to the Earth-Lunar system

Page 116: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-116

19. Scalable Joint Replication

Let’s talk about 2-safe and very-safe; these are identical in some important ways:

They both require that (when the backup is connected) all of the database changes be transmitted safely to the backup before releasing the locks after the end of the transaction and notifying the application

They use the same distributed commit protocol

They use the same mechanism for replication, identical to that of 1-safe

They use a similar takeover mechanism on the backup side as used by 1-safe

They use the same facility for virtual commit coordination by name as used by 1-safe

They use a similar giveover and coordinated takeover method as used by 1-safe

Page 117: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-117

19. Scalable Joint Replication

Now, let’s talk about the differences between 2-safe and very-safe:

During the time that the replication line to the backup is up, 2-safe commit only completes when the RM database changes and the log root have successfully been transmitted to the backup

If the line is down, or goes down during a 2-safe transaction, it commits locally with the promise that when the line comes back up, that the backup system will catch-up, before it becomes takeover capable: if the primary fails before the backup is caught up, then that is a double failure and we got caught in a single failure MTR window

In that case, the 2-safe backup has to wait to talk to the primary and cannot takeover until the dead primary comes back to life and the backup catches up

Page 118: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-118

19. Scalable Joint Replication

Differences between 2-safe and very-safe (continued):

During the time that the replication line to the backup is up, very-safe commit only completes when the RM database changes and the log root have successfully been transmitted to the backup

If the line is down, or goes down during a very-safe transaction, it blocks until the line comes back up, so the backup is always caught up: if the primary fails then the backup can take over at any time, with no MTR window for double failure

In any case with very-safe, you can always access the database, acquire locks to update and then do the update, although you will wait to complete the transaction when the line is down
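To make the contrast concrete, here is a minimal sketch (hypothetical names) of the commit-completion rule in the two modes when the replication line is up or down:

```python
# Minimal sketch (hypothetical names) of the difference: with the line up both modes
# ship the changes to the backup before completing; with the line down 2-safe commits
# locally and promises catch-up, while very-safe blocks until the line returns.
import time

def commit(mode, line_up, send_to_backup, wait_for_line):
    if line_up:
        send_to_backup()                 # both modes ship changes before releasing locks
        return "committed (backup acknowledged)"
    if mode == "2-safe":
        return "committed locally (backup must catch up before it is takeover-capable)"
    if mode == "very-safe":
        wait_for_line()                  # blocks the transaction until the line is back
        send_to_backup()
        return "committed (after line recovery)"
    raise ValueError(mode)

print(commit("2-safe", line_up=False, send_to_backup=lambda: None,
             wait_for_line=lambda: time.sleep(0)))
```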

Page 119: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-119

19. Scalable Joint Replication

Differences between 2-safe and very-safe (continued):

A 2-safe database lives on the primary side: on the backup side, the database is never in control of the state … the backup only echoes the truth of the primary as best it can

A very-safe database lives on the backup side: when the backup receives the update, it is done when commit is complete on the backup side: even before the reply to the primary is complete, the backup can takeover with the truth in hand … the primary is the echo of the truth that the backup knew first

Page 120: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-120

19. Scalable Joint Replication

Differences between 2-safe and very-safe (continued):

A 2-safe transaction can commit on the primary log root if the line to the backup is down, but when it is up, the transaction can only be committed after the commit record hits the log root on both the primary and the backup: the log root must be 2-safe as well as the log partitions

A very-safe transaction can only be committed after the commit record hits the log root on both the primary and the backup: the log root must be very-safe as well as the log partitions

Can 2-safe and very-safe share the same log root on the same transaction? I’m certain that both can be distributed, but not so certain they can be distributed together, we’ll see if it’s practical

Page 121: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-121

19. Scalable Joint Replication

Differences between 2-safe and very-safe (continued):

Given that the same cluster double failures cause replication takeovers for both, we can judge the apparent availability holes in both:

2-safe carries on in the face of a comm failure, but that opens an availability hole on takeovers if the backup is unconnected or catching up when the primary dies: to maintain consistency, the backup still needs to talk to the primary to catch-up before it can takeover

Very-safe stalls on a comm failure, but never has a failure window on takeover: the exception is that after the takeover, continuing to execute a very-safe policy would block immediately until the old primary system came back up

Page 122: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-122

19. Scalable Joint Replication

Differences between 2-safe and very-safe (continued):

That very-safe takeover stall can be remedied by operating as 2-safe after a takeover, until the sibling reintegrates, and then switching back to a very-safe policy again

Employing this hybrid option would then support what is probably the best availability and consistency in database systems that recover from disasters over geographical distance, if and only if networks are good enough so as not to constantly block progress (and they are getting better)

Of course, the hybrid option would depend on 2-safe and very-safe transactions being able to share commit … if not, then no hybrid option

Page 123: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-123

19. Scalable Joint Replication

Differences between 2-safe and very-safe (continued):

As far as network availability is concerned, one could get a DSL line or T1 from the phone company, a business cable connection, a Hughes data satellite dish (contract for the business size dish and throughput), and hook into a microwave network: this would always guarantee having at least one functioning network

There is an issue regarding session timers for 2-safe systems, because of a prominent race condition affecting the loss of transactions: In Gray/Reuter 12.6.1, he talks about the inability to discern between primary failure and network partition, and the situation hasn’t improved at all

Page 124: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-124

19. Scalable Joint Replication

Differences between 2-safe and very-safe (continued):

It is most important that the primary 2-safe cluster not detect a session outage with the backup before the backup cluster detects the same partition: this could create the situation where the primary starts 2-safe committing locally, because the line is down to the backup, commits some transactions, then fails, then the backup does a takeover thinking that the connection was good, and we lose transactions because of it

The solution to this is that the primary session outage detection timer needs to be longer (twice) than the backup timer: this should preclude this race, with some anomaly detection to spot miscues

Page 125: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-125

19. Scalable Joint Replication

Differences between 2-safe and very-safe (continued):

A miscue would consist of a communication between the primary and the backup where the primary has a higher session ID, because the session had been torn down on that side whereas the backup didn’t notice: this is anathema to serialization between the two, and would mean that the timers need adjustment, maybe even auto-adjustment by incrementing the primary timer on a miscue (up to some reasonable limit)

After a takeover, remember that the roles are reversed and the timers need to reverse as well
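A minimal sketch of that timer asymmetry and the auto-adjustment on a miscue (the 2x ratio, the 5-second increment and the 120-second cap are illustrative assumptions, not prescriptions):

```python
# Minimal sketch (hypothetical names): the primary's session-outage timer is kept
# at twice the backup's, and is nudged upward whenever a miscue (primary session
# ID ahead of the backup's) is observed, up to a cap.
def primary_timer(backup_timer_s):
    return 2 * backup_timer_s            # the primary must detect the partition later

def on_message(primary_session_id, backup_session_id, primary_timer_s, cap_s=120):
    miscue = primary_session_id > backup_session_id      # primary tore the session down first
    if miscue:
        primary_timer_s = min(cap_s, primary_timer_s + 5)  # auto-adjust upward
    return miscue, primary_timer_s

t = primary_timer(backup_timer_s=15)     # 30 seconds
print(on_message(primary_session_id=7, backup_session_id=6, primary_timer_s=t))
```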

Page 126: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-126

19. Scalable Joint Replication

Differences between 2-safe and very-safe (continued):

A final issue is 2-safe and very-safe response time over distance: a solution is to place one or two matching repeater sites just outside the disaster zone (25 or 30 miles) on different paths to the backup site, with the safe response time taken from the earliest repeater reply

A 2/very-safe repeater would employ a facing backup site (using a pass-through extractor-receiver protocol) buffering the log partitions and log root (with no RM to apply redo to) and with the log being extracted to the 2/very-safe backup site: it’s a network appliance for replication, all the logs in the primary datacenter could use the same pass-through sites

Page 127: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-127

19. Scalable Joint Replication

Differences between 2-safe and very-safe (continued):

The repeater in either role would buffer the data and try to keep the primary connection alive and not lose it: log updates would be considered safe once they hit the disk on the repeater

In either 2-safe or very-safe modes, the repeater would look, for all purposes, like a backup cluster, supporting all the protocols

The relationship of the repeater with the 2/very-safe backup is more like that of an extension: the guarantee is extended out to the repeater and MTR is based on getting those log updates to the backup as quickly as possible

This is why two repeaters with geographical and communications independence are suggested: either repeater can get you the end of the log, and whichever is quickest is best

Page 128: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-128

19. Scalable Joint Replication

Now that we know some similarities and differences between 2-safe and very-safe, let’s proceed to try and implement them, starting with Issue #6 from the previous 1-safe discussion (before Issue #6 there is no divergence between 1-safe and 2/very-safe)

We will try to keep 2-safe and very-safe merged in terms of commit, this means that any differences in function must not impact commit ACID properties (atomicity, consistency=serialization, isolation=locking, or durability)

This should work: remember that 2-safe serialization is preserved by the backup playing ‘catch-up’ after a comm failure; this makes it look like very-safe on takeover (which is the critical point), where the 2-safe log root should be identical to the very-safe log root after ‘catch-up’, or 2-safe won’t take over

Page 129: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-129

19. Scalable Joint Replication

Issue #6 These are the transaction states, which are stored as records in the log root (the tense is crucial):

Active state: only seen in the log if a working transaction gets caught by the periodic TM checkpoint

Prepared state: seen in a transaction which came in from a remote cluster (parent), and which is in the middle of a 2 or 3 phase commit

Committed state: seen in a transaction which has touched remote clusters (children), and which is at the controlling end of a 2 or 3 phase commit

Aborting state: like I said, aborting

Forgotten state: this transaction is now durably going away

Page 130: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-130

19. Scalable Joint Replication

These are the 10 possible transaction state record transitions in the log root:

Active or nil -> forgotten (hurried, lock release after): local commit, neither parents nor children

Active or nil -> prepared (hurried, locks held): distributed commit, definitely having parents, maybe having children

Active or nil -> committed (hurried, lock release after): distributed commit, no parents, definitely having children

Active or nil -> aborting (hurried, locks held): maybe local or distributed

Page 131: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-131

19. Scalable Joint Replication

Transaction state record transitions in the log root (continued):

Prepared -> forgotten (hurried, lock release after): distributed commit, definitely having parents, but no children

Prepared -> committed (hurried, lock release after): distributed commit, definitely having both parents and children

Prepared -> aborting (hurried, locks held): distributed commit, definitely having parents, maybe having children

Page 132: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-132

19. Scalable Joint Replication

Transaction state record transitions in the log root (continued):

Committed -> forgotten: distributed commit, maybe having parents, definitely having children

Aborting -> aborting (hurried, locks held): try, try again to abort

Aborting -> forgotten (hurried, lock release after): maybe local or distributed
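For reference, the ten legal log-root transitions just listed, expressed as a minimal lookup-table sketch (hypothetical names) that could reject a malformed state-record sequence:

```python
# Minimal sketch (hypothetical names): the ten legal log-root transitions as a
# lookup table, so an illegal state-record sequence can be rejected.
ALLOWED = {
    ("active", "forgotten"), ("active", "prepared"),
    ("active", "committed"), ("active", "aborting"),
    ("prepared", "forgotten"), ("prepared", "committed"), ("prepared", "aborting"),
    ("committed", "forgotten"),
    ("aborting", "aborting"), ("aborting", "forgotten"),
}

def check_transition(prev, new):
    prev = prev or "active"              # nil is treated like Active
    if (prev, new) not in ALLOWED:
        raise ValueError(f"illegal log-root transition {prev} -> {new}")
    return new

state = None
for nxt in ("prepared", "committed", "forgotten"):
    state = check_transition(state, nxt)
print("ok:", state)
```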

Page 133: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-133

19. Scalable Joint Replication

Issue #7 The last seen transaction state in the log root has different locking guarantees, which have an impact for takeover on the backup cluster:

Active: these will get aborted, and you have to have locks to do that online, for better MTR

Prepared: these must remain unresolved with locks held after the recovery, awaiting the parent’s decision

Committed: these must remain committed with locks released, until the children are all notified: presumed abort is no help here, but 2-safe or very-safe parents, children or siblings will not lose a commit decision in a takeover

Page 134: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-134

19. Scalable Joint Replication

The different transaction states … impact on takeover (continued):

Aborting: these have their aborts re-driven after a takeover

Forgotten: these will be discarded, any locks will be released

The goals for RM recovery after a takeover should then be:

In any case, if there are no transaction state records in the LSG, we will abort every transaction we see: the RM transaction state records prevent that

We want to see an RM-Ending for any Forgotten local commit, to prevent an abort and to keep the locks

Page 135: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-135

19. Scalable Joint Replication

The goals for RM recovery after a takeover (continued):

We want to see an RM-Preparing for any Prepared, to prevent an abort and to keep the locks

We want to see an RM-Committing for any parentless Committed, to prevent an abort and to keep the locks

We want to see an RM-Ended (for any Committed, or any childless Prepared -> Forgotten commit, or any Forgotten local commit) to prevent an abort and to release the locks

Page 136: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-136

19. Scalable Joint Replication

The goals for RM recovery after a takeover (continued):

We want to see an RM-Aborted for any Forgotten abort, to prevent re-driving an abort and to release the locks

Let’s describe the transaction state record transitions in the log partition (remember that there are pre-records and post-records):

Writing Active or Aborting transaction state records to the log root does not require any action (they are there in the log partition as RM-Update and RM-CLR records)

Page 137: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-137

19. Scalable Joint Replication

Transaction state record transitions in the log partition (continued):

Writing a local (no parents or children) Forgotten committing record (flags clear) requires an RM-Ending pre-record for this transition:

RM-Update -> RM-Ending

Writing any Prepared record requires an RM-Preparing pre-record for this transition:

RM-Update -> RM-Preparing

Page 138: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-138

19. Scalable Joint Replication

Transaction state record transitions in the log partition (continued):

Writing a parentless Committed record requires an RM-Committing pre-record for this transition:

RM-Update -> RM-Committing

Writing any Committed record requires an RM-Ended post-record for these transitions:

RM-Update -> RM-Preparing -> RM-Committing -> RM-Ended

RM-Update -> RM-Committing -> RM-Ended

Page 139: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-139

19. Scalable Joint Replication

Transaction state record transitions in the log partition (continued):

Writing a local (no parents or children) Forgotten committing record (flags clear) requires an RM-Ended post-record for this transition:

RM-Update -> RM-Ending -> RM-Ended

Writing a parented childless Forgotten committing record (flags clear) requires an RM-Ended post-record for this transition:

RM-Update -> RM-Preparing -> RM-Ended

Page 140: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-140

19. Scalable Joint Replication

Transaction state record transitions in the log partition (continued):

Writing any Forgotten aborting record (flags set) requires an RM-Aborted post-record for these five transitions:

RM-Update -> RM-CLR -> RM-Aborted

RM-Update -> RM-Ending -> RM-CLR -> RM-Aborted (corner case)

RM-Update -> RM-Preparing -> RM-CLR -> RM-Aborted

Page 141: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-141

19. Scalable Joint Replication

Transaction state record transitions in the log partition (continued):

Writing any Forgotten aborting record (continued):

RM-Update -> RM-Preparing -> RM-Committing -> RM-CLR -> RM-Aborted (Anomaly)

RM-Update -> RM-Committing -> RM-CLR -> RM-Aborted (corner case)

Page 142: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-142

19. Scalable Joint Replication

Hence, we need to add five new transaction state records, specifically for the 2-safe and very-safe log partitions:

RM-Preparing: written a split second before writing the Prepared record to the log root

RM-Committing: written a split second before writing the Committed record to the log root

RM-Ending and RM-Ended: written a split second before, and sometime after writing the committing Forgotten record to the log root

RM-Aborted: written after the aborting Forgotten record is written

Page 143: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-143

19. Scalable Joint Replication

The root pointer for 2/very-safe is the same as for 1-safe

The high water marks for 2/very-safe are the same as for 1-safe

The replication updater handling of the continuous scanning and maintaining of the current list of unresolved transactions is the same as in 1-safe, with the exception that there are more states to deal with (as follows)

Transactions come into existence via RM-update records, abort via RM-CLR records, attempt to prepare via RM-preparing records, attempt a distributed parentless commit via RM-committing records, attempt a local commit via RM-ending records and are discarded from the list via RM-ended and RM-aborted records: at any particular moment, the remaining set of final states (updating, aborting, preparing, committing and ending) are the set of unresolved transactions for takeover

Page 144: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-144

19. Scalable Joint Replication

The merge log record written to the log root for 2/very-safe is the same as for 1-safe (thank heavens)

The 2/very-safe commit writing is quite different: more transaction state record types written into the log partition, see the next three slides

So, 2/very-Safe transaction commit writing is like this:

Every time a call is made to queue a Prepared record to the TM buffer for writing to the log root:

At the top of this call, a call is made to queue an RM-Preparing pre-record into the TM buffer for writing to the 2/very-safe log partition

Page 145: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-145

19. Scalable Joint Replication

So, 2/very-Safe transaction commit writing is like this (continued):

Every time a call is made to queue a Committed record to the TM buffer for writing to the log root:

At the top of this call, if the transaction was parentless (started here), then a call is made to queue an RM-Committing pre-record into the TM buffer for writing to the 2/very-safe log partition

At the end of this call a call is made to queue an RM-Ended post-record into the deferred buffer for post-group-commit writing to the 2/very-safe log partition

Page 146: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-146

19. Scalable Joint Replication

2/very-Safe transaction commit writing (continued):

Every time a call is made to queue a Forgotten record to the TM buffer for writing to the log root:

If the Forgotten record's (abort flags are clear and there are no parents and no children), then at the top of this call a call is made to queue an RM-Ending pre-record into the TM buffer for writing to the 2/very-safe log partition

At the end of this call, if the (abort flags are clear), then if (there are no children), a call is made to queue an RM-Ended post-record into the deferred buffer for post-group-commit writing to the 2/very-safe log partition

Otherwise (abort flags are set) an RM-Aborted post-record is queued into the deferred buffer
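
Pulling the last three slides together, here is a minimal sketch of the three queuing hooks; the Txn fields, buffer shapes and function names are hypothetical, and only the record types and the pre/post ordering rules come from the slides

from dataclasses import dataclass, field

@dataclass
class Txn:                      # hypothetical transaction descriptor
    id: int
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)
    abort_flags: bool = False

def on_queue_prepared(txn, tm_buffer):
    # pre-record for the 2/very-safe log partition, then the log-root record
    tm_buffer.append(("partition", txn.id, "RM-Preparing"))
    tm_buffer.append(("root", txn.id, "Prepared"))

def on_queue_committed(txn, tm_buffer, deferred_buffer):
    if not txn.parents:                                        # parentless (started here)
        tm_buffer.append(("partition", txn.id, "RM-Committing"))
    tm_buffer.append(("root", txn.id, "Committed"))
    deferred_buffer.append(("partition", txn.id, "RM-Ended"))  # post-group-commit write

def on_queue_forgotten(txn, tm_buffer, deferred_buffer):
    committing = not txn.abort_flags
    if committing and not txn.parents and not txn.children:
        tm_buffer.append(("partition", txn.id, "RM-Ending"))   # pre-record, local commit
    tm_buffer.append(("root", txn.id, "Forgotten"))
    if committing:
        if not txn.children:
            deferred_buffer.append(("partition", txn.id, "RM-Ended"))
    else:
        deferred_buffer.append(("partition", txn.id, "RM-Aborted"))

if __name__ == "__main__":
    tm, deferred = [], []
    on_queue_committed(Txn(id=7), tm, deferred)
    print(tm)        # [('partition', 7, 'RM-Committing'), ('root', 7, 'Committed')]
    print(deferred)  # [('partition', 7, 'RM-Ended')]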

Page 147: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-147

19. Scalable Joint Replication

2/very-Safe transaction commit writing (continued):

When the time comes, a call is made to force write the TM commit buffer to the log root

At the top of this call, the root pointer is incremented and inserted in the merge record at the end of the commit buffer

Then a watermarks record (with a new root pointer value) is inserted at the end of the TM buffers for writing to the 2/very-safe log partitions

Then all the log partitions are flush written in parallel, and when those log partition writes complete, the deferred buffers are copied into the TM buffers for writing to the log partition next time

After all the log partitions are flush written, the log root buffer is flush written, and we’re done
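
A schematic sketch of that flush ordering follows; the buffer objects, the flush stand-ins and all of the names are invented for illustration, and only the ordering of the steps comes from this slide

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

@dataclass
class Partition:                      # hypothetical 2/very-safe log partition buffer
    buffer: list = field(default_factory=list)
    deferred: list = field(default_factory=list)
    def flush(self):
        self.buffer.clear()           # stand-in for a forced disk write

@dataclass
class TMState:                        # hypothetical TM commit-writing state
    root_pointer: int = 0
    root_buffer: list = field(default_factory=list)
    partitions: list = field(default_factory=list)
    def flush_root(self):
        self.root_buffer.clear()      # stand-in for a forced disk write

def force_write_commit(state):
    state.root_pointer += 1                                    # new root pointer ...
    state.root_buffer.append(("merge", state.root_pointer))    # ... goes in the merge record
    for part in state.partitions:                              # watermarks record per partition
        part.buffer.append(("watermarks", state.root_pointer))
    with ThreadPoolExecutor() as pool:                         # flush partitions in parallel
        list(pool.map(Partition.flush, state.partitions))
    for part in state.partitions:                              # deferred post-records go into
        part.buffer.extend(part.deferred)                      # the next partition write
        part.deferred.clear()
    state.flush_root()                                         # finally, flush the log root

if __name__ == "__main__":
    st = TMState(partitions=[Partition(), Partition()])
    force_write_commit(st)
    print(st.root_pointer)            # 1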

Page 148: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-148

19. Scalable Joint Replication

High water marks (continued): The same 'safe' log pointer and redo applied values are replied for 2/very-safe as in 1-safe:

The 'safe' log pointer value is what has been received, whereas the redo applied value is what we guarantee to apply if a takeover occurs immediately; this should be roughly 30 seconds to a minute's worth of log updates, at the current rate of transmission, beyond the last transmitted high water mark

There is absolutely no benefit (other than looking at uncommitted data, which will not be supported for 2/very-safe) to be derived from applying more redo than the transaction system can commit consistently: for 2/very-safe, there will be only tiny amounts of uncommitted data

Page 149: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-149

19. Scalable Joint Replication

High water marks (continued): Some 2/very-safe LSGs may be inactive …

The joint minimum 2/very-safe redo applied becomes the high water mark for the set of 2/very-safe LSGs …

The 2/very-safe high water mark will be included in the watermarks record written on the 2/very-safe log partitions and the merge record written on the log root

Thus, each geographically dispersed 2-safe LSG on a backup cluster will know, at takeover time, the joint serialization level of 2/very-safe everywhere …

The undo and redo low water marks are the same in 2/very-safe as in 1-safe …
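
A one-line sketch of that joint minimum; the LSG naming and dictionary shape are assumptions made for the example

# Sketch: the joint 2/very-safe high water mark is the minimum redo-applied
# value replied by the *active* 2/very-safe LSGs (inactive ones are skipped).

def joint_high_water_mark(lsg_redo_applied, active):
    """lsg_redo_applied: {lsg_name: redo_applied_value}; active: set of LSG names."""
    replied = [v for name, v in lsg_redo_applied.items() if name in active]
    return min(replied) if replied else None

if __name__ == "__main__":
    print(joint_high_water_mark({"LSG-A": 1040, "LSG-B": 998, "LSG-C": 0},
                                active={"LSG-A", "LSG-B"}))     # -> 998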

Page 150: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-150

19. Scalable Joint Replication

Thus, the undo low water mark for an RM on a 2/very-safe log partition needs to be included in the watermarks record written on that log partition for every commit flush, same as in 1-safe …

During the 2/very-safe steady state of replication, the extractors transmit the log partition updates to the receivers, who apply the updates to the log on the backup cluster, and then ship the redo to continuously rebuild the database cache in the backup RM, same as in 1-safe …

Only one possible policy or mode of operation occurs to me, for the entire 2/very-safe LSG on the backup side, or for individual 2/very-safe RMs on the backup side … that is different from 1-safe …

Page 151: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-151

19. Scalable Joint Replication

Global committed: the replication receiver applies all the redo (RM-updates and RM-CLRs) to the log partition and the replication updater applies the log to the RM to continuously rebuild the cache… same as 1-safe

… after the takeover, redo is applied to the RM to rebuild cache all the way up to the most recently replied redo applied guarantee… same as 1-safe

… then locks will be reinstated from two RM checkpoints prior to the high water mark to <= the redo applied point … to make the database available online in preparation for the online rollback to the final resolution point … same as 1-safe

The lock scans for the global committed policy… same as 1-safe

Page 152: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-152

19. Scalable Joint Replication

The 2/very-safe global committed policy allows a guaranteed point of redo apply in the log: triggers and publish/subscribe interfaces could queue off of the edge of the high water mark for the RM, … same as 1-safe

Restarting the 2/very-safe RMs after a takeover … same as 1-safe

All of the incomplete 2/very-safe transactions will get aborted after resolution, so their locks must be held for now, because we don’t know for certain how high the water finally got, and even the already aborting transactions that were incomplete may be re-driven again, and you wouldn’t want to do that twice: that will wait for final log resolution and final commit resolution that comes back from the takeover master with the globally merged and finally resolved log endpoint … same as 1-safe

Page 153: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-153

19. Scalable Joint Replication

The 2/very-safe TM transaction service which has already by this time completed a takeover for the virtual commit coordinator… same as 1-safe

The 2/very-safe cluster TM service which received the replication of the log root, is the takeover master… same as 1-safe

Page 154: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-154

19. Scalable Joint Replication

The log root only receives log writes from the TM (transaction manager) and not the RMs: for 2/very-safe resolution the last completed commit write (the merge record is the end marker) is the end of the log root, in the same way that the watermarks record is the end of the log partition as far as commit is concerned: this is because the root pointer and the high water mark are for a completed set of transaction state changes, and not for a partial set (it’s epochal) … almost the same as in 1-safe

So, the final high water mark … written to the log root of the takeover master in a takeover log record… same as in 1-safe

Page 155: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-155

19. Scalable Joint Replication

With that value in hand, the takeover master scans the last two TM checkpoints (before the final high water mark) in the log root received from the old primary, coming up with the resolutions for the final transaction states of all the 2/very-safe transactions for the LSGs that were primaried there at the time of the takeover:

Active: abort these

Prepared: these hold locks, waiting for resolution from the parent

Committed: these hold no locks, waiting to inform all children

Aborting: abort these again

Forgotten: these got committed

All the transactions that started after the final high water mark also have to be rolled back, and those are in the log in the form of RM-Update records and RM-CLR records
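
As a sketch, the takeover master's resolution of those checkpointed states can be expressed as a simple mapping; the dictionary and function names are invented, and the state-to-action pairs are taken from the list above

# Sketch: how the takeover master might resolve the checkpointed transaction
# states it finds in the old primary's log root.

RESOLUTION_BY_STATE = {
    "Active":    "abort",
    "Prepared":  "hold locks, wait for resolution from the parent",
    "Committed": "no locks held, inform the children",
    "Aborting":  "abort again",
    "Forgotten": "committed",
}

def resolve(checkpointed):
    """checkpointed: {txid: state} from the last two TM checkpoints."""
    return {txid: RESOLUTION_BY_STATE[state] for txid, state in checkpointed.items()}

if __name__ == "__main__":
    print(resolve({11: "Active", 12: "Forgotten"}))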

Page 156: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-156

19. Scalable Joint Replication

The takeover master sends these transaction resolutions, along with the final high water mark, to all the virtual commit coordinators involved

Once those virtual commit coordinators receive the final resolution, they can deal with the “Unresolved” state transactions:

(1) Those that were committing but got aborted are fine; the other way around is an anomaly

(2) Those remaining for which no resolution is sent go according to the last record seen in the range (don't read beyond the final high water mark):

RM-Aborted -> release the locks, it's done

RM-Ended -> release the locks, it's done

RM-Preparing, RM-Committing, RM-Ending -> these didn't quite make it, abort them

RM-Update -> abort it

RM-CLR -> abort it again, the abort didn't complete
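
A sketch of that fallback resolution as a function of the last record seen; the function name is invented, and the record-to-action mapping is taken from the list above

def resolve_by_last_record(last_record):
    # Sketch: a virtual commit coordinator's fallback for transactions that
    # received no explicit resolution from the takeover master.
    if last_record in ("RM-Aborted", "RM-Ended"):
        return "release locks, done"
    if last_record in ("RM-Preparing", "RM-Committing", "RM-Ending"):
        return "abort (didn't quite make it)"
    if last_record == "RM-Update":
        return "abort"
    if last_record == "RM-CLR":
        return "abort again (the abort didn't complete)"
    raise ValueError("unexpected record type: " + last_record)

if __name__ == "__main__":
    print(resolve_by_last_record("RM-Committing"))   # abort (didn't quite make it)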

Page 157: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-157

19. Scalable Joint Replication

(3) The aborts are done by an online rollback process that does the above, (4) undo scans and resolutions, and then (5) inherits the locks and does the (6) aborts and lock releases, and then the takeover is complete

When the dead primary revives, or simply reconnects to the network, there is a process for coming to terms with changes to the database made by the new primary, but not the loss of 2-safe transactions in the original takeover (we don’t lose transactions here): that process is called “catch-up after failure” (Gray/Reuter 12.6.4), and is also called reintegration

Of course, very-safe never needs to catch-up or reintegrate (except during the “hybrid option”)

Page 158: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-158

19. Scalable Joint Replication

We can decide about the "hybrid option" now: it appears that it will work just fine!

The hybrid option is where after a very-safe takeover, the backup becomes 2-safe so that it can do work until the old primary is brought back online by reintegration, then they become very-safe again

2-safe Reintegration begins when the original primary tries to connect with the group with a message, which fetches a session mismatch error due to the regroup round that declared it down; then it rejoins the group, with the group locker performing a glupdate … same as 1-safe

The notification, by the group locker, to all of the seven (in our eight-cluster scenario) 2-safe virtual commit coordinators that their brother is alive causes the primary VCC to send a message to the formerly offline sibling to reintegrate … same as 1-safe

Page 159: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-159

19. Scalable Joint Replication

That 2-safe reintegration will start with a synchronization (note that any hybrid LSGs after takeover are 2-safe):

(1) the final high water mark in the log root and the fate of the transactions that were resolved in the takeover are sent by the primary virtual commit coordinator to the new backup/old primary sibling

(2) this will cause a crash recovery for all the RMs on the 2-safe LSGs on the new backup that are resolved at that log root restart point, using those transaction resolutions

(3) after that is complete, extractor-receiver pairs are started up

(4) the log partition updates since the takeover are transmitted to the new backup … same as in 1-safe up to here

(5) having done so, the replication is back to steady state

(6) the hybrid LSGs will now switch back to very-safe, and reverse the comm session outage timers

Page 160: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-160

19. Scalable Joint Replication

The post-takeover seven-way fan-in discomfort for the applications is alleviated by seven giveover operations on the virtual commit coordinators to get back to a seven-way fan-out configuration: this is the same as in 1-safe …

The giveover operations, receiving the baton, and coordinated takeover are all … the same as in the 1-safe descriptions …

An entirely different approach to lock release and transaction end notification to the applications has to be taken for 2/very-safe as opposed to 1-safe and the traditional 3-phase distributed commit: this is because additional waits have to be imposed for the RM changes to move through the extractors and into the receivers and get replied to the TM service

Page 161: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-161

19. Scalable Joint Replication

Without going into the gory details (there will be an extremely lengthy presentation on the life of a transaction that covers these details), the redo applied responses fetched from the receivers by the extractors are frequently reported to the TM via the low-level datagram service of the TM library

As the TM (1) received the flushed notifications for commit writing (which contained a bit mask for the involved log partitions, each known to be 1-safe, 2-safe, very-safe or none), then the TM (2) did the commit writing, and in the case of 2/very-safe, the TM (3) held up lock release (or commit notifications, prepared replies, etc.) until the wave of updates reached its intended targets

Page 162: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-162

19. Scalable Joint Replication

The knowledge that a given wave of updates has reached its 2/very-safe targets, for a given transaction in a commit write on a given cluster, is defined by the 2/very-safe LSGs' replied redo applied values having reached the value of the root pointer from the original commit write that wrote the log record (prepared, committed, forgotten) after those updates were flushed

That information is now in the hands of the TM for that cluster … and then distributed commit is the normal process of connecting those flushes together, with one proviso: on a given cluster, it is very much easier to sort out the bodies after a distributed disaster takeover, if the flushes for a cluster to the log and through there to the replication siblings are done before preparing the children (if they are done in parallel it can be sorted out later, but it’s much tougher … that and many other painful optimizations await)
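
A sketch of that gating condition; the LSG names and data shapes are assumptions made for the example

# Sketch: the TM can release a transaction's locks (or send commit
# notifications, prepared replies, etc.) only once every involved
# 2/very-safe LSG has replied a redo-applied value >= the root pointer
# of the commit write that carried that transaction's record.

def can_release_locks(commit_root_pointer, involved_lsgs, replied_redo_applied):
    """replied_redo_applied: {lsg_name: latest redo-applied value replied}."""
    return all(replied_redo_applied.get(lsg, -1) >= commit_root_pointer
               for lsg in involved_lsgs)

if __name__ == "__main__":
    replies = {"LSG-A": 512, "LSG-B": 509}
    print(can_release_locks(510, ["LSG-A", "LSG-B"], replies))   # False: LSG-B lags
    print(can_release_locks(509, ["LSG-A", "LSG-B"], replies))   # True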

Page 163: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-163

19. Scalable Joint Replication

By following that proviso, everything should work just perfectly

Getting back to the DSN, we have many tools to build our systems with

(1) Where there is a central station at a large planet (e.g., Jupiter) with colonies and research stations near multiple moons (the Jovian system), that central station needs to be always linked to earth-lunar … these earth-lunar links should be simple bidirectional 1-safe (local uncommitted policy) uplink and downlink streams that ship ACID-guaranteed communications driven by workflow systems using publish and subscribe interfaces from the database

Page 164: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-164

19. Scalable Joint Replication

(2) That central station has streaming to the local outposts … very-safe/hybrid is good for this, because there is no reintegration or loss of transactions: so you have simple bidirectional very-safe/hybrid uplink and downlink streams (because of the short hop), shipping data and workflows

(3) The earth-lunar network needs to survive disasters, because the Deep Space Network (DSN) is hooked together through that hub: communications dishes at the north and south lunar poles can always see (barring solar and planetary eclipse) outpost stations above and below the plane of the ecliptic, and the lunar rotation is so slow and steady that all DSN inbound/outbound communications route through there

Page 165: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-165

19. Scalable Joint Replication

(4) The earth-lunar connections are based on the satellite and ground station links to two sets of stationary dishes and antennae near the lunar poles: there are many sets of simple bidirectional very-safe/hybrid uplink and downlink streams from many systems on earth that get workflow queued into the outbound, and back from the inbound, 1-safe (local uncommitted) bidirectional streams connecting to the deep space network

(5) In the early going, it will make sense to make the lunar communications systems a simple repeater for a very-safe/hybrid system which then directly links the earth systems to the deep space network: as the lunar colony is developed, it becomes obvious that with the slow rotation speed, no atmosphere and colonists in bunkers, that a lunar-based DSN will be simpler, more continuous and survivable than the four deep space communications stations on earth [Canberra (Australia), Goldstone (CA), Madrid (Spain), Byalalu (India)]

Page 166: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-166

19. Scalable Joint Replication

(6) There are many other reasons for this transition to a lunar hub:

Lower gravity makes it easier to build the larger dishes that increase transmission rates: an increase of 10% in size can gain a 50% increase in transmission, and arrays of them can be combined

The side lobes of the antenna waveform are not a problem in transmission, but in receiving weak signals in an atmosphere nearer to the horizon they introduce noise that can cause signal loss

The uplink transmitted and downlink received signals can differ in amplitude by a factor of 1024; this requires low-noise amplifiers that like the temperature of the shadows on the moon (-250 °F)

http://history.nasa.gov/SP-4227/Chapter07/Chapter07.PDF

Page 167: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-167

19. Scalable Joint Replication

(7) When the lunar station finally becomes the hub and transmission point for the deep space network, the links to earth become bidirectional very-safe/hybrid input and output streams from the many systems on earth that get workflow queued back and forth to deep space

(8) With the potential for a three-hour round trip time for communications between opposing earth-lunar and Saturnian stations, lost comm data blocks in the stream require that the deep space receivers insert blocks into the log, allowing for gaps, with a structured redundancy (channel encoding) error-correcting data transmission strategy, and an escalating missing-block request retransmission strategy involving redundant data and multiple error-correcting codes (Turbo codes are composite codes made up of short-constraint-length convolutional codes and a data stream interleaver); of course the data is always as compressed as possible

Page 168: Fundamentals Of Transaction Systems - Part 3: Relativity shatters the Classical Delusion (Replicated Database)

3-168

20. Bi-Directional Replication: Reliable, Scalable, Atomically Consistent

TBD