Download - IMS 6217: Distributed Databases 1 Dr. Lawrence West, Management Dept., University of Central Florida [email protected] Distributed Databases (DDB) Introduction

IMS 6217: Distributed Databases

1Dr. Lawrence West, Management Dept., University of Central [email protected]

Distributed Databases (DDB)

• Introduction to DDB

• Three DDB Exercises

– General Electric

– J.C. Penney Stores

– UPS

• Integrating Autonomous Systems

• Replication

• Replication Exercise

• Implementing Replication



Distributed Databases—Introduction

• What is a Distributed Database (DDB)?

• How many ways can a database be distributed?(And for how many reasons?)



Distributed Databases—Introduction (cont.)

• A DDB runs on more than one computer (duhhhh)

– All or part of the data is distributed

– Servers may have unique data or copies of data also on other servers

– Some data may be on clients (not servers) for strictly local use

– Servers may be right next to each other (same room or even rack) or around the world

– DDB may share a strong schema (design) that applies to all instances or may have a logical schema that integrates disparate autonomous DB




• Reasons for distributing data

– Workload distribution for scalability

– Getting data closer to users in a distributed organization

– Autonomous operating elements of a diverse organization have different data needs

• Reporting needs to central HQ

• Coordination needs




• Three main types/reasons of/for distributing data

1. Diverse distributed organizational divisions have unique data needs with some requirements for centralized coordination and/or reporting

2. Homogeneous distributed organization requires geographically available data but strong needs for centralized management and control

3. Homogeneous organization requires workload distribution to support scalability

• Any combination of these circumstances can exist



Exercises

• Three Distributed System Design Exercises

– General Electric

– J.C. Penney

– UPS

• Consider

– Corporate control

– Transaction processing

– Reporting

– Data integrity

– Data warehousing



Exercise #1—The General Electric Case

• Consider the GE product line: Light bulbs, large and small consumer appliances, power generation equipment, jet engines, financial services, railroad locomotives, healthcare equipment http://www.ge.com/products_services/index.html

• Assume that each division has the databases and systems needed for its internal operations

1. What are examples of data needs for GE HQ and what technical capabilities must exist to satisfy the needs?

2. Are there any needs for cross-division data communication? If so, identify technical needs.



Exercise #2—The J.C. Penney Case

• Consider the operations of the J.C. Penney retail department stores only. More than 1,000 stores throughout US & Puerto Rico.

1. What are examples of data needs for JCP HQ and what technical capabilities must exist to satisfy the needs?

2. Are there unique local needs?



Exercise #3—The UPS Case

• Consider UPS customer and package tracking systems characterized by millions of daily transactions, package tracking as a competitive tool, worldwide locations, and disconnected (from the network) transactions at the driver delivery transaction.

1. What transactions (if any) are local?

2. What special data sharing must take place?

3. What is the implication of the disconnected drivers?



Integrating Autonomous Systems

• Consider the GE example with additional conditions

– Diverse nodes have different DBMS software

– Similar tables in different nodes have different structures

• Different column structures

• Different data types for same/similar columns

– Different data integrity requirements

• Will most cross-division DB events be:

– Queries/retrievals?

– Inserts/updates?



Integrating Autonomous Systems (cont.)

• Data Warehousing—the simple solution

• When corporate needs can be satisfied with a DW solution we can use existing tools

– Ideally suited for disparate data sources

– Transforming disparate data

• Decisions include

– Frequency of updates

– Granularity

– Dimension tables applicable to unique division data




• Querying (and Updating)—the Not-So-Simple Solution

• The integration solution includes

– Local DB Management Component (LDBMC)

• Manages the local system

– Global Data Dictionary (GDD)

• An integrated logical data model for the entire DDB with location and structure information

– Distributed DB Management Component (DDBMS)

• Provides the interface to the global system for local users




• Distributed DB Management Component (DDBMS)

– Provides logical DB schema interface or data dictionary

• A one-source view of the entire DDB

• Provides location transparency

– Query Processor (more to come)

– Network-wide concurrency control

• More applicable to distributed homogeneous systems with integrated operations

– Query translation for heterogeneous systems (more)




• DDBMS Query Processor

– Determines location of data needed to process query

– For one-source queries: Sends query to appropriate local or remote location and returns results

– For multi-source queries:

• Parses query into sub-queries against individual locations

• Receives individual results

• Assembles a result set from the pieces

• May include translation of column elements in heterogeneous systems




• A wide variety of commercial products are available for managing distributed autonomous systems

• Think of the integration packages available in our Data Warehouse loading packages

– Difference is that translations act in near real time

– Product an integrated schema with query rules



Replication

• Replication consists of

– A single physical source or master database

– Replicated data in distributed locations to support transactions, reporting, and control

– Tools for distributing data

– Tools for integrating changes to data

• Typically has (mostly) homogeneous data structures and DBMS at all locations

– Not all data at every location



Replication (cont.)

• Design choices consist of

– What data is distributed to which location

• Filter by row

• Filter by column

– What update rights are available at each location

– How frequently data must be synchronized

– How data is to be synchronized (what replication model is used)

– Data integrity rules at different levels

– Handling data integrity violations



Replication—The Publisher/Subscriber Model

• Replication is spoken of using print publication terms

– Publisher—the system node that produces data for consumption by other nodes

– Subscriber—the node that receives data

– Article—a database object (table, view, stored procedure, etc.) that is to be published

– Publication—a collection of articles sent and received as one logical unit

• Nodes may be both publishers and subscribers



Snapshot Replication

• Snapshot Replication is what the name implies—a complete copy of data sent to a subscriber

• Often used as an initial copy of data forming the basis for other kinds of replication

• May be read-only data sent to a subscriber

– Product list with prices

– Authorized vendors

• Out-of-synch data may be acceptable for the period between replications

• Data may change infrequently



Transactional Replication

• Data is changed at the publisher and propagated to subscribers as changes occur

– E.g., Units in stock at a warehouse

– Customer bank balances

• Data is usually treated as read-only at subscriber but need not be

• Decisions regarding frequency of update messages are required

– Balance workload and network traffic with latency

– Consider the costs of latency



Merge Replication

• Typically implemented in a Client-Server environment

• Client (subscriber) makes changes to data that are published back to server (publisher)

• Multiple subscribers may be capable of changing the same data

• Data changes may be made offline

• Update frequency also balances workload and network traffic with latency

• Subscriber changes may be legal at the client but create data integrity violations when updated to the server



Replication (cont.)

• All three types of replication may be implemented in same application

– Snapshot for base-level data with transactional replication for updates

– Snapshot for initial data, merge replication by a subscriber, transactional replication to publish changes to other subscribers

• Clients may easily have data not replicated to the Server

– Employee shift schedules

– Detailed transaction data—only summaries published



Replication Exercise

• Consider the University Registration DB we used in data warehousing (next slide)

• Assume a state-wide multi-branch university system with a System Central Administration and campuses

• Assume homogeneous DBMS at each location

• Identify:

– Subscription model for each table

– Update rights on each table

– Update frequency for each table

– How much latency will be allowed in each table



Replication Exercise

Student

StudentIDLname :Resident Y/N

Enrollment

SectionIDStudentIDGrade

Payment

PaymentIDStudentIDPaymentDatePaymentAmt

Section

SectionIDDeptCodeCourseNumSecNumTermYearDaysTimeInstructorIDCapacityBldgIDRoomNum

Instructor

EmployeeIDLastNameFirstNameDeptIDCurrentSalary

SalaryHistory

EmployeeIDStartDateEndDateSalaryStatus

Course

DeptCodeCourseNumCrHrTitleDescriptionLecHrLabHrDeptID

Department

DeprtIDNameChairIDDeptIDOfficeBldgOfficeRm

Textbook

ISBNTitleYear :WholsalePriceListPrice

SectionBook

ISBNSectionIDQtyOrderedQtySold

Room

BldgIDRoomNumTypeCapacity

Building

BldgIDNameFloorsSqFt

FeeSchedule

AcadYearInStateUGCrHrOutStateUGCrHrInStateGrCrHrOutStateGrCrHrInStateDCrHrOutStateDCrHrHealthFeeActivityFee

BrightFutureStat

StudentIDAcadYearPercentCovered



Replication—Data Integrity

• Many data integrity rules in SQL Server have "Not for replication" options available

– Foreign key

– Identity attribute values

– Triggers

– Constraints

• If you do not publish Suppliers data to individual stores you would turn off referential integrity checking from Products to Suppliers



Replication—Data Integrity (cont.)

• Publisher (server) may have to detect when a subscriber (client) makes a transaction that is legal to the client but illegal to the server

– Two sales representatives sell the same product that creates a backorder situation

– Two purchasers make purchase on same account that exceeds credit limit

• Violation potential must be identified and incorporated into replication strategy



Implementing Replication

• SQL Server has a series of Replication Agents that automate replication

– Identify publishers and subscribers

– Build subscriptions, articles, and publications

– Produce snapshots

– Implement transaction replication according to rules

– Implement merge replication notifications

– Manage data integrity rules

• Similar tools available in other systems

Download - IMS 6217: Distributed Databases 1 Dr. Lawrence West, Management Dept., University of Central Florida [email protected] Distributed Databases (DDB) Introduction

Top Related