IMS 6217: Distributed Databases
1Dr. Lawrence West, Management Dept., University of Central [email protected]
Distributed Databases (DDB)
• Introduction to DDB
• Three DDB Exercises
– General Electric
– J.C. Penney Stores
– UPS
• Integrating Autonomous Systems
• Replication
• Replication Exercise
• Implementing Replication
IMS 6217: Distributed Databases
2Dr. Lawrence West, Management Dept., University of Central [email protected]
Distributed Databases—Introduction
• What is a Distributed Database (DDB)?
• How many ways can a database be distributed?(And for how many reasons?)
IMS 6217: Distributed Databases
3Dr. Lawrence West, Management Dept., University of Central [email protected]
Distributed Databases—Introduction (cont.)
• A DDB runs on more than one computer (duhhhh)
– All or part of the data is distributed
– Servers may have unique data or copies of data also on other servers
– Some data may be on clients (not servers) for strictly local use
– Servers may be right next to each other (same room or even rack) or around the world
– DDB may share a strong schema (design) that applies to all instances or may have a logical schema that integrates disparate autonomous DB
IMS 6217: Distributed Databases
4Dr. Lawrence West, Management Dept., University of Central [email protected]
Distributed Databases—Introduction (cont.)
• Reasons for distributing data
– Workload distribution for scalability
– Getting data closer to users in a distributed organization
– Autonomous operating elements of a diverse organization have different data needs
• Reporting needs to central HQ
• Coordination needs
IMS 6217: Distributed Databases
5Dr. Lawrence West, Management Dept., University of Central [email protected]
Distributed Databases—Introduction (cont.)
• Three main types/reasons of/for distributing data
1. Diverse distributed organizational divisions have unique data needs with some requirements for centralized coordination and/or reporting
2. Homogeneous distributed organization requires geographically available data but strong needs for centralized management and control
3. Homogeneous organization requires workload distribution to support scalability
• Any combination of these circumstances can exist
IMS 6217: Distributed Databases
6Dr. Lawrence West, Management Dept., University of Central [email protected]
Exercises
• Three Distributed System Design Exercises
– General Electric
– J.C. Penney
– UPS
• Consider
– Corporate control
– Transaction processing
– Reporting
– Data integrity
– Data warehousing
IMS 6217: Distributed Databases
7Dr. Lawrence West, Management Dept., University of Central [email protected]
Exercise #1—The General Electric Case
• Consider the GE product line: Light bulbs, large and small consumer appliances, power generation equipment, jet engines, financial services, railroad locomotives, healthcare equipment http://www.ge.com/products_services/index.html
• Assume that each division has the databases and systems needed for its internal operations
1. What are examples of data needs for GE HQ and what technical capabilities must exist to satisfy the needs?
2. Are there any needs for cross-division data communication? If so, identify technical needs.
IMS 6217: Distributed Databases
8Dr. Lawrence West, Management Dept., University of Central [email protected]
Exercise #2—The J.C. Penney Case
• Consider the operations of the J.C. Penney retail department stores only. More than 1,000 stores throughout US & Puerto Rico.
1. What are examples of data needs for JCP HQ and what technical capabilities must exist to satisfy the needs?
2. Are there unique local needs?
IMS 6217: Distributed Databases
9Dr. Lawrence West, Management Dept., University of Central [email protected]
Exercise #3—The UPS Case
• Consider UPS customer and package tracking systems characterized by millions of daily transactions, package tracking as a competitive tool, worldwide locations, and disconnected (from the network) transactions at the driver delivery transaction.
1. What transactions (if any) are local?
2. What special data sharing must take place?
3. What is the implication of the disconnected drivers?
IMS 6217: Distributed Databases
10Dr. Lawrence West, Management Dept., University of Central [email protected]
Integrating Autonomous Systems
• Consider the GE example with additional conditions
– Diverse nodes have different DBMS software
– Similar tables in different nodes have different structures
• Different column structures
• Different data types for same/similar columns
– Different data integrity requirements
• Will most cross-division DB events be:
– Queries/retrievals?
– Inserts/updates?
IMS 6217: Distributed Databases
11Dr. Lawrence West, Management Dept., University of Central [email protected]
Integrating Autonomous Systems (cont.)
• Data Warehousing—the simple solution
• When corporate needs can be satisfied with a DW solution we can use existing tools
– Ideally suited for disparate data sources
– Transforming disparate data
• Decisions include
– Frequency of updates
– Granularity
– Dimension tables applicable to unique division data
IMS 6217: Distributed Databases
12Dr. Lawrence West, Management Dept., University of Central [email protected]
Integrating Autonomous Systems (cont.)
• Querying (and Updating)—the Not-So-Simple Solution
• The integration solution includes
– Local DB Management Component (LDBMC)
• Manages the local system
– Global Data Dictionary (GDD)
• An integrated logical data model for the entire DDB with location and structure information
– Distributed DB Management Component (DDBMS)
• Provides the interface to the global system for local users
IMS 6217: Distributed Databases
13Dr. Lawrence West, Management Dept., University of Central [email protected]
Integrating Autonomous Systems (cont.)
• Distributed DB Management Component (DDBMS)
– Provides logical DB schema interface or data dictionary
• A one-source view of the entire DDB
• Provides location transparency
– Query Processor (more to come)
– Network-wide concurrency control
• More applicable to distributed homogeneous systems with integrated operations
– Query translation for heterogeneous systems (more)
IMS 6217: Distributed Databases
14Dr. Lawrence West, Management Dept., University of Central [email protected]
Integrating Autonomous Systems (cont.)
• DDBMS Query Processor
– Determines location of data needed to process query
– For one-source queries: Sends query to appropriate local or remote location and returns results
– For multi-source queries:
• Parses query into sub-queries against individual locations
• Receives individual results
• Assembles a result set from the pieces
• May include translation of column elements in heterogeneous systems
IMS 6217: Distributed Databases
15Dr. Lawrence West, Management Dept., University of Central [email protected]
Integrating Autonomous Systems (cont.)
• A wide variety of commercial products are available for managing distributed autonomous systems
• Think of the integration packages available in our Data Warehouse loading packages
– Difference is that translations act in near real time
– Product an integrated schema with query rules
IMS 6217: Distributed Databases
16Dr. Lawrence West, Management Dept., University of Central [email protected]
Replication
• Replication consists of
– A single physical source or master database
– Replicated data in distributed locations to support transactions, reporting, and control
– Tools for distributing data
– Tools for integrating changes to data
• Typically has (mostly) homogeneous data structures and DBMS at all locations
– Not all data at every location
IMS 6217: Distributed Databases
17Dr. Lawrence West, Management Dept., University of Central [email protected]
Replication (cont.)
• Design choices consist of
– What data is distributed to which location
• Filter by row
• Filter by column
– What update rights are available at each location
– How frequently data must be synchronized
– How data is to be synchronized (what replication model is used)
– Data integrity rules at different levels
– Handling data integrity violations
IMS 6217: Distributed Databases
18Dr. Lawrence West, Management Dept., University of Central [email protected]
Replication—The Publisher/Subscriber Model
• Replication is spoken of using print publication terms
– Publisher—the system node that produces data for consumption by other nodes
– Subscriber—the node that receives data
– Article—a database object (table, view, stored procedure, etc.) that is to be published
– Publication—a collection of articles sent and received as one logical unit
• Nodes may be both publishers and subscribers
IMS 6217: Distributed Databases
19Dr. Lawrence West, Management Dept., University of Central [email protected]
Snapshot Replication
• Snapshot Replication is what the name implies—a complete copy of data sent to a subscriber
• Often used as an initial copy of data forming the basis for other kinds of replication
• May be read-only data sent to a subscriber
– Product list with prices
– Authorized vendors
• Out-of-synch data may be acceptable for the period between replications
• Data may change infrequently
IMS 6217: Distributed Databases
20Dr. Lawrence West, Management Dept., University of Central [email protected]
Transactional Replication
• Data is changed at the publisher and propagated to subscribers as changes occur
– E.g., Units in stock at a warehouse
– Customer bank balances
• Data is usually treated as read-only at subscriber but need not be
• Decisions regarding frequency of update messages are required
– Balance workload and network traffic with latency
– Consider the costs of latency
IMS 6217: Distributed Databases
21Dr. Lawrence West, Management Dept., University of Central [email protected]
Merge Replication
• Typically implemented in a Client-Server environment
• Client (subscriber) makes changes to data that are published back to server (publisher)
• Multiple subscribers may be capable of changing the same data
• Data changes may be made offline
• Update frequency also balances workload and network traffic with latency
• Subscriber changes may be legal at the client but create data integrity violations when updated to the server
IMS 6217: Distributed Databases
22Dr. Lawrence West, Management Dept., University of Central [email protected]
Replication (cont.)
• All three types of replication may be implemented in same application
– Snapshot for base-level data with transactional replication for updates
– Snapshot for initial data, merge replication by a subscriber, transactional replication to publish changes to other subscribers
• Clients may easily have data not replicated to the Server
– Employee shift schedules
– Detailed transaction data—only summaries published
IMS 6217: Distributed Databases
23Dr. Lawrence West, Management Dept., University of Central [email protected]
Replication Exercise
• Consider the University Registration DB we used in data warehousing (next slide)
• Assume a state-wide multi-branch university system with a System Central Administration and campuses
• Assume homogeneous DBMS at each location
• Identify:
– Subscription model for each table
– Update rights on each table
– Update frequency for each table
– How much latency will be allowed in each table
IMS 6217: Distributed Databases
24Dr. Lawrence West, Management Dept., University of Central [email protected]
Replication Exercise
Student
StudentIDLname :Resident Y/N
Enrollment
SectionIDStudentIDGrade
Payment
PaymentIDStudentIDPaymentDatePaymentAmt
Section
SectionIDDeptCodeCourseNumSecNumTermYearDaysTimeInstructorIDCapacityBldgIDRoomNum
Instructor
EmployeeIDLastNameFirstNameDeptIDCurrentSalary
SalaryHistory
EmployeeIDStartDateEndDateSalaryStatus
Course
DeptCodeCourseNumCrHrTitleDescriptionLecHrLabHrDeptID
Department
DeprtIDNameChairIDDeptIDOfficeBldgOfficeRm
Textbook
ISBNTitleYear :WholsalePriceListPrice
SectionBook
ISBNSectionIDQtyOrderedQtySold
Room
BldgIDRoomNumTypeCapacity
Building
BldgIDNameFloorsSqFt
FeeSchedule
AcadYearInStateUGCrHrOutStateUGCrHrInStateGrCrHrOutStateGrCrHrInStateDCrHrOutStateDCrHrHealthFeeActivityFee
BrightFutureStat
StudentIDAcadYearPercentCovered
IMS 6217: Distributed Databases
25Dr. Lawrence West, Management Dept., University of Central [email protected]
Replication—Data Integrity
• Many data integrity rules in SQL Server have "Not for replication" options available
– Foreign key
– Identity attribute values
– Triggers
– Constraints
• If you do not publish Suppliers data to individual stores you would turn off referential integrity checking from Products to Suppliers
IMS 6217: Distributed Databases
26Dr. Lawrence West, Management Dept., University of Central [email protected]
Replication—Data Integrity (cont.)
• Publisher (server) may have to detect when a subscriber (client) makes a transaction that is legal to the client but illegal to the server
– Two sales representatives sell the same product that creates a backorder situation
– Two purchasers make purchase on same account that exceeds credit limit
• Violation potential must be identified and incorporated into replication strategy
IMS 6217: Distributed Databases
27Dr. Lawrence West, Management Dept., University of Central [email protected]
Implementing Replication
• SQL Server has a series of Replication Agents that automate replication
– Identify publishers and subscribers
– Build subscriptions, articles, and publications
– Produce snapshots
– Implement transaction replication according to rules
– Implement merge replication notifications
– Manage data integrity rules
• Similar tools available in other systems