When & Why's of Denormalization

TM 1 Dr. Chen, Business Database Systems Presented By Aliya Saldanha DENORMALISATION PROS AND CONS

Upload: aliya-saldanha

Post on 05-Jul-2015


Page 1: When & Why\'s of Denormalization

TM 1 Dr. Chen, Business Database Systems

Presented By Aliya Saldanha

DENORMALISATION: PROS AND CONS


OBJECTIVES

• Definition of terms.

• Describe the denormalization design process.

• Denormalization Strategies

• A Comparative Case Study

• Know the pros and cons of denormalization

• The Dangerous Illusion

• Conclude


Introduction

• RDBMS design - conceptual and physical modeling levels.

• Conceptual diagrams - precursor to designing relational tables.

• Critical issue: the level of system performance, reflected by system response time.


Normalization

• The normalized model is a cornerstone for every database system.

• Process of decomposing large, inefficiently structured tables into smaller, more structured tables without losing any data in the process.

There are still times where we denormalize a database to enhance performance


What is normalization?

• A series of steps followed to obtain a database that is consistent and avoids duplication

• The process passes through fulfilling Normal Forms

• A table is said to be in a certain normal form if it satisfies certain constraints

KEY POINTS

• Each table represents a single subject
• Keeps redundancy to a minimum
• All attributes are dependent on the primary key
• Checks stability and integrity of E-R diagram
• Removes Insert, Update, Delete anomalies.

[Figure: ladder of normal forms leading from the relational db model to the normalized relational db model: 1st Normal Form, 2nd Normal Form, 3rd Normal Form, BCNF, 4th Normal Form, 5th Normal Form]


As normalization progresses…

• The number of relations required to represent the data of the application being normalized increases.

• The increased number of tables requires multiple JOINs to combine data from different tables (the more joins, the worse it gets).

• Queries that have a lot of complex joins require more CPU time and can adversely affect performance.


Practically speaking

• Queries run slowly.
• Reports take too long to print.
• On-screen forms take time to populate.
• Web pages take too long to populate.
• More complicated SQL required for multi-table queries and joins.
• In short, extra work for the DBMS can mean slower applications.


Other issues…

• No calculated values. Calculated values (CVs) are a fact of life for all applications, but a fully normalized database does not store them.

• Non-reproducible calculations. The application must generate calculated values on the fly as needed. If your application changes over time, you risk not being able to reproduce prior results.

• Join jungles. When each fact is stored in exactly one place, it is daunting to pull together everything needed for a certain query. This makes queries hard to code, hard to debug, and dangerous to alter.

• Performance. When you face a JOIN jungle you almost always face performance problems.


Questions to ask before denormalizing

• Can the system achieve acceptable performance without denormalizing?

• Will the performance of the system after denormalizing still be unacceptable?

• Will the system be unreliable due to denormalization?

If the answer to any of these is "yes," avoid denormalization, because any benefit that is accrued will not exceed the cost.


Denormalization and Why?

• Frequently, performance needs dictate very quick retrieval capability for data stored in relational databases.

• To accomplish this, sometimes the decision is made to denormalize the physical implementation.

• Denormalization is the process of putting one fact in numerous places. This speeds data retrieval at the expense of data modification.


Does it mean un-normalization?

• ‘Denormalization’ does not mean that anything goes. Denormalization does not mean chaos.

• An un-normalized data model is one on which little or no analysis has been performed.

• In short, seek to denormalize a data model that has already been normalized.


DENORMALIZATION PROCESS

1. Develop E-R model
2. Refine & normalize
3. Identify candidates
4. Determine integrity effects
5. Identify form
6. Map to physical schema


Development of conceptual data model (step 1)

• E-R modeling aims at identifying the entities that are part of the system, the attributes that make up these entities, and the dependencies between entities.

• The E-R model does not capture dependencies among the attributes; normalization resolves the functional dependencies between attributes.

• The E-R model shows data at rest; denormalization also considers the types of queries and their frequency.


Refinement and normalization (step 2)

• The ERD is further refined, in order to resolve the functional dependencies between the attributes of an entity.

• May lead to splitting of tables to reduce data redundancy.

Identifying candidates for denormalization (step 3)

• Application performance criteria.
• Type of queries to be executed (update/retrieve).
• Frequency of queries.
• Number of rows accessed by each transaction.
• Cardinality: 1:1, 1:M.
• Derived data, lookup data.


Determine effect on data integrity (step 4)

• The effect of denormalization is reviewed.
• Denormalizing may lead to performance degradation or unacceptable consistency issues.
• In such a case the denormalization decision must be reconsidered.


Form for denormalized entity (step 5)

• Identify what form the denormalized entity may take.

• We move down the ladder of normal forms, one step at a time.

Map conceptual schema to physical schema (step 6)

• Once the schema is tested and verified, it is implemented.


DENORMALIZATION STRATEGIES

• Pre joined Tables

• Report Tables

• Mirror Tables

• Split Tables

• Redundant Data

• Repeating Groups

• Derivable Data

• Speed Tables


Pre-joined tables

Two or more tables are joined and the result is stored as another table.

When the cost of joining is prohibitive

Example: Retail store databases

Contain only those columns absolutely necessary for application to meet processing needs.

The pre-joined table must be created periodically using SQL to join the normalized tables.
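A minimal sketch of the periodic rebuild described above, using Python's built-in sqlite3 module as a stand-in for the production DBMS. The table and column names (store, sale, sale_with_store) and the helper rebuild_prejoined are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store (store_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE sale  (sale_id INTEGER PRIMARY KEY, store_id INTEGER, amount REAL);
    INSERT INTO store VALUES (1, 'Pune'), (2, 'Goa');
    INSERT INTO sale  VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

def rebuild_prejoined(conn):
    """Drop and recreate the pre-joined table from the normalized tables."""
    conn.executescript("""
        DROP TABLE IF EXISTS sale_with_store;
        CREATE TABLE sale_with_store AS
            SELECT s.sale_id, s.amount, st.city   -- only the columns the application needs
            FROM sale s JOIN store st ON st.store_id = s.store_id;
    """)

rebuild_prejoined(conn)
# Reads against the pre-joined table need no join.
city_totals = conn.execute(
    "SELECT city, SUM(amount) FROM sale_with_store GROUP BY city ORDER BY city"
).fetchall()

# A new sale is invisible to the pre-joined table until the next rebuild.
conn.execute("INSERT INTO sale VALUES (13, 2, 5.0)")
stale_count = conn.execute("SELECT COUNT(*) FROM sale_with_store").fetchone()[0]
rebuild_prejoined(conn)
fresh_count = conn.execute("SELECT COUNT(*) FROM sale_with_store").fetchone()[0]
```

The gap between stale_count and fresh_count is the price of this strategy: between rebuilds, readers see data only as of the last refresh.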


1:1 Relationships


M:M Relationship


The normalised tables


Denormalised tables


Report Tables

• When specialized critical reports are too costly to generate

• Create a table that contains the report.
• To be viewed in online environments.
• Lot of formatting and data manipulation.


Mirror tables

When tables are required concurrently by two different types of environments.

• If online processing and decision support access the same table

• Can duplicate table, use second table for read-only use

• Example: Heavy Online Traffic

• Care must be taken to periodically migrate the foreground data to background tables.

• Performance bottlenecks are resolved.


Split tables

When distinct groups use different parts of a table, the table can be split:

• vertically
• horizontally

The original table must remain available for certain transactions.


Vertical Split

• Attributes are divided between the two tables, primary key put into both tables

• Particularly useful if one group of applications accesses some columns and another group accesses different columns

Example: Many columns of the customer table contain data specific to credit limit assessment, whereas others contain more general contact and customer profiling information

Split the table vertically, one partition containing credit limit information, and the other containing the more general customer details.
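The customer example can be sketched as follows, again with sqlite3. The column names (phone, credit_rating and so on) are assumptions made up for the illustration; the point is only that the primary key is carried into both partitions so the original rows can be reassembled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, phone TEXT,
                           credit_limit REAL, credit_rating TEXT);
    INSERT INTO customer VALUES (1, 'Asha', '555-0101', 5000, 'A'),
                                (2, 'Ben',  '555-0102', 1200, 'C');
    -- Vertical split: the primary key goes into both partitions.
    CREATE TABLE customer_contact AS SELECT cust_id, name, phone FROM customer;
    CREATE TABLE customer_credit  AS SELECT cust_id, credit_limit, credit_rating FROM customer;
""")

original = conn.execute("SELECT * FROM customer ORDER BY cust_id").fetchall()

# Joining the two partitions on the shared key reproduces the original rows.
rejoined = conn.execute("""
    SELECT c.cust_id, c.name, c.phone, k.credit_limit, k.credit_rating
    FROM customer_contact c JOIN customer_credit k ON k.cust_id = c.cust_id
    ORDER BY c.cust_id
""").fetchall()
```

Applications that only assess credit limits would read customer_credit alone, and contact-oriented applications would read customer_contact alone.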


Horizontal Split

• Rows are divided between two tables
• Usually rows are divided by range of key values

– Applying UNION ALL to the split tables later should not yield more rows than the original, un-split table contained

• Example: For a large customer table, we might split it into two tables, one for home-based customers, and the other for overseas customers.
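A small sketch of the home/overseas split and of the UNION ALL row-count check, assuming a hypothetical region column that decides which partition a row belongs to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, region TEXT NOT NULL);
    INSERT INTO customer VALUES (1, 'Asha', 'HOME'), (2, 'Ben', 'OVERSEAS'),
                                (3, 'Chen', 'HOME');
    -- Horizontal split: complementary predicates, so each row lands in exactly one table.
    CREATE TABLE customer_home     AS SELECT * FROM customer WHERE region = 'HOME';
    CREATE TABLE customer_overseas AS SELECT * FROM customer WHERE region <> 'HOME';
""")

n_original = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
n_union = conn.execute("""
    SELECT COUNT(*) FROM (SELECT * FROM customer_home
                          UNION ALL
                          SELECT * FROM customer_overseas)
""").fetchone()[0]
```

The region column is declared NOT NULL here on purpose: a NULL value would satisfy neither predicate, and the row would silently drop out of both partitions.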


Redundant Data

Some columns of another table are duplicated in a given table to reduce the number of table joins.

Use when one or more columns from one table are accessed whenever data from another table is accessed.

The original column must not be removed from its table.

Best for data that is not updated often.

Example: Consider the DEPARTMENT and EMPLOYEE tables. If queries frequently require the name of the employee's department, the department name column could be carried as redundant data in the EMPLOYEE table.
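The DEPARTMENT/EMPLOYEE example might look like the sketch below. Column names and the helper rename_department are hypothetical; the sketch shows the join-free read path and the extra work every change to the duplicated column now requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, emp_name TEXT,
                           dept_id INTEGER REFERENCES department(dept_id),
                           dept_name TEXT);  -- redundant copy of department.dept_name
    INSERT INTO department VALUES (10, 'Sales'), (20, 'Finance');
    INSERT INTO employee VALUES (1, 'Asha', 10, 'Sales'), (2, 'Ben', 20, 'Finance');
""")

# Read path: the department name comes back without a join.
rows = conn.execute("SELECT emp_name, dept_name FROM employee ORDER BY emp_id").fetchall()

def rename_department(conn, dept_id, new_name):
    """Write path: both copies of the name must change in one transaction."""
    with conn:
        conn.execute("UPDATE department SET dept_name = ? WHERE dept_id = ?",
                     (new_name, dept_id))
        conn.execute("UPDATE employee SET dept_name = ? WHERE dept_id = ?",
                     (new_name, dept_id))

rename_department(conn, 10, "Retail Sales")

# Consistency check: count employees whose copy disagrees with the source table.
mismatches = conn.execute("""
    SELECT COUNT(*) FROM employee e JOIN department d ON d.dept_id = e.dept_id
    WHERE e.dept_name <> d.dept_name
""").fetchone()[0]
```

If any code path updates department without going through rename_department, the mismatch count becomes non-zero, which is why this strategy suits data that is rarely updated.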


Repeating Groups

Another table is created that contains one column for every element of the group.

• Example:

A (Customer_No, Balance_period, Balance)

B (Customer_No, Balance_period1, Balance_period2, Balance_period3, Balance_period4, Balance_period5)

Points To Remember

• The data is rarely or never aggregated, averaged, or compared within the row
• The data has a stable number of occurrences
• The data is usually accessed collectively
• The data has a predictable pattern of insertion and deletion
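As an illustration of going from table A to table B, the sketch below pivots one row per balance period into one column per period. It assumes exactly five periods numbered 1 to 5 and uses made-up balances; a customer with fewer periods simply gets NULLs in the unused columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (customer_no INTEGER, balance_period INTEGER, balance REAL)")
conn.executemany(
    "INSERT INTO a VALUES (?, ?, ?)",
    [(1, p, 100.0 * p) for p in range(1, 6)] + [(2, 1, 50.0), (2, 2, 75.0)],
)

# One MAX(CASE ...) expression per period turns rows into columns.
pivot_cols = ", ".join(
    f"MAX(CASE WHEN balance_period = {p} THEN balance END) AS balance_period{p}"
    for p in range(1, 6)
)
conn.execute(
    f"CREATE TABLE b AS SELECT customer_no, {pivot_cols} FROM a GROUP BY customer_no"
)

rows_b = conn.execute("SELECT * FROM b ORDER BY customer_no").fetchall()
```

The fixed column list is what "a stable number of occurrences" means in practice: a sixth balance period would require altering table B, not just inserting a row.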


Derivable data

Derived data is data not directly stored in the database, but is instead calculated from the data which is stored in the database

• If the cost of deriving data using complicated formulae is prohibitive, consider storing the derived data in a column instead of calculating it.

• Example: Score Calculation

– The stored derived data must be updated whenever the underlying data it is based on is changed.
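One way to keep a stored derived value in step with its underlying data is a set of triggers. The sketch below is an assumption-laden illustration of the score example: the player and round_score tables and the trigger names are hypothetical, and the update trigger assumes player_id itself never changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE player (player_id INTEGER PRIMARY KEY, total_score INTEGER DEFAULT 0);
    CREATE TABLE round_score (player_id INTEGER, round_no INTEGER, points INTEGER);

    -- Each trigger adjusts the stored total whenever the detail rows change.
    CREATE TRIGGER round_ins AFTER INSERT ON round_score BEGIN
        UPDATE player SET total_score = total_score + NEW.points
        WHERE player_id = NEW.player_id;
    END;
    CREATE TRIGGER round_upd AFTER UPDATE OF points ON round_score BEGIN
        UPDATE player SET total_score = total_score - OLD.points + NEW.points
        WHERE player_id = NEW.player_id;
    END;
    CREATE TRIGGER round_del AFTER DELETE ON round_score BEGIN
        UPDATE player SET total_score = total_score - OLD.points
        WHERE player_id = OLD.player_id;
    END;

    INSERT INTO player (player_id) VALUES (1);
    INSERT INTO round_score VALUES (1, 1, 10), (1, 2, 7), (1, 3, 5);
    UPDATE round_score SET points = 9 WHERE player_id = 1 AND round_no = 2;
    DELETE FROM round_score WHERE player_id = 1 AND round_no = 3;
""")

# The stored value and the value derived on the fly should always agree.
stored = conn.execute("SELECT total_score FROM player WHERE player_id = 1").fetchone()[0]
derived = conn.execute("SELECT SUM(points) FROM round_score WHERE player_id = 1").fetchone()[0]
```

Reads of total_score are now a single-row lookup, while every insert, update, and delete on round_score pays for an extra update.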


Speed tables

• A speed table is a denormalized version of a hierarchy.

• Every parent has a row for every child that reports to it at any level, either directly or indirectly.

• A speed table optionally carries information such as the level within the hierarchy and whether or not the child is at the most detailed level of the hierarchy (bottom of the tree).

• Used when a tree-like hierarchy is to be stored in the database.

• Data is replicated within a speed table to increase the speed of data retrieval.
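A speed table can be derived from a normalized parent/child table with a recursive query. The sketch below uses a hypothetical five-node org hierarchy; the depth and is_leaf columns correspond to the optional level and bottom-of-tree information mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized hierarchy: each node stores only its direct parent.
    CREATE TABLE org (node TEXT PRIMARY KEY, parent TEXT);
    INSERT INTO org VALUES ('CEO', NULL), ('VP', 'CEO'), ('Mgr', 'VP'),
                           ('Dev', 'Mgr'), ('Ops', 'VP');

    -- Speed table: one row per (ancestor, descendant) pair at any level.
    CREATE TABLE speed AS
    WITH RECURSIVE reach(parent, child, depth) AS (
        SELECT parent, node, 1 FROM org WHERE parent IS NOT NULL
        UNION ALL
        SELECT r.parent, o.node, r.depth + 1
        FROM reach r JOIN org o ON o.parent = r.child
    )
    SELECT parent, child, depth,
           NOT EXISTS (SELECT 1 FROM org o WHERE o.parent = reach.child) AS is_leaf
    FROM reach;
""")

# Everyone under the CEO, directly or indirectly, without walking the tree.
under_ceo = [c for (c,) in conn.execute(
    "SELECT child FROM speed WHERE parent = 'CEO' ORDER BY child")]
pair_count = conn.execute("SELECT COUNT(*) FROM speed").fetchone()[0]
leaves = [c for (c,) in conn.execute(
    "SELECT DISTINCT child FROM speed WHERE is_leaf ORDER BY child")]
```

Four parent/child links expand to eight rows here, which illustrates the replication cost: the speed table grows with the depth of the hierarchy and must be rebuilt or maintained whenever the tree changes.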


NORMALISED

DENORMALIZED


CASE STUDY: Prejoin

A simplified retail example. Before denormalization:

SALES (sale_id, store_id, sale_dt, …)

SALES_DETAIL (tx_id, sale_id, item_id, …, item_qty, sale$)

Relationship: SALES (1) to SALES_DETAIL (M)


Prejoin Denormalization

A simplified retail example...

After denormalization:

SALES_AND_DETAILS (tx_id, sale_id, store_id, sale_dt, product_id, …, product_qty, $)


SAMPLE QUERY

• Q) What was my total volume between '06-AUG-08' and '06-AUG-09'?

• BEFORE denormalization:

select sum(sales_detail.item_qty)
from sales, sales_detail
where sales.sale_id = sales_detail.sale_id
  and sales.sale_dt between TO_DATE('06-AUG-08','DD-MON-YY')
                        and TO_DATE('06-AUG-09','DD-MON-YY');


Sample Query 2

• Q) What was my total volume between '06-AUG-08' and '06-AUG-09'?

• AFTER denormalization:

select sum(product_qty)
from sales_and_details
where sales_and_details.sale_dt between TO_DATE('06-AUG-08','DD-MON-YY')
                                    and TO_DATE('06-AUG-09','DD-MON-YY');


Sample Query 3

• What happens if we ask about the number of “sales” rather than the quantity transacted?

BEFORE denormalization:

select count(*)
from sales
where sales.sale_dt between TO_DATE('06-AUG-08','DD-MON-YY')
                        and TO_DATE('06-AUG-09','DD-MON-YY');


• What happens if we ask about the number of “sales” rather than the quantity transacted?

• AFTER denormalization:

select count(distinct sale_id)
from sales_and_details
where sales_and_details.sale_dt between TO_DATE('06-AUG-08','DD-MON-YY')
                                    and TO_DATE('06-AUG-09','DD-MON-YY');
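The four case-study queries can be tried end to end with the sketch below. It uses sqlite3 rather than Oracle, so dates are stored as ISO text instead of going through TO_DATE, and the sample rows are invented. It also includes the naive COUNT(*) on the pre-joined table to show why the deck switches to COUNT(DISTINCT sale_id).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, store_id INTEGER, sale_dt TEXT);
    CREATE TABLE sales_detail (tx_id INTEGER PRIMARY KEY, sale_id INTEGER,
                               item_id INTEGER, item_qty INTEGER);
    INSERT INTO sales VALUES (1, 7, '2008-09-01'), (2, 7, '2009-01-15'),
                             (3, 8, '2010-02-02');
    INSERT INTO sales_detail VALUES (100, 1, 501, 2), (101, 1, 502, 1),
                                    (102, 2, 501, 4), (103, 3, 503, 9);
    -- Pre-joined table: one row per detail line, sale columns repeated.
    CREATE TABLE sales_and_details AS
        SELECT d.tx_id, s.sale_id, s.store_id, s.sale_dt,
               d.item_id AS product_id, d.item_qty AS product_qty
        FROM sales s JOIN sales_detail d ON d.sale_id = s.sale_id;
""")

period = ("2008-08-06", "2009-08-06")

def one(sql):
    """Run a single-value query over the reporting period."""
    return conn.execute(sql, period).fetchone()[0]

volume_before = one("""SELECT SUM(d.item_qty)
                       FROM sales s JOIN sales_detail d ON d.sale_id = s.sale_id
                       WHERE s.sale_dt BETWEEN ? AND ?""")
volume_after = one("SELECT SUM(product_qty) FROM sales_and_details "
                   "WHERE sale_dt BETWEEN ? AND ?")

sales_before = one("SELECT COUNT(*) FROM sales WHERE sale_dt BETWEEN ? AND ?")
# COUNT(*) now counts detail lines, not sales, so it overcounts.
sales_naive = one("SELECT COUNT(*) FROM sales_and_details WHERE sale_dt BETWEEN ? AND ?")
sales_after = one("SELECT COUNT(DISTINCT sale_id) FROM sales_and_details "
                  "WHERE sale_dt BETWEEN ? AND ?")
```

The volume query gets simpler after the prejoin, but the sales-count query gets harder: the grain of the table has changed from one row per sale to one row per detail line, and every query has to account for that.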


PROS

Convenience

• With stored calculated values it is far easier for programmers to generate reports without having to write code to calculate them.
• Saves CPU time.

Simple Queries

• Each eliminated JOIN makes for a simpler query that is easier to get right the first time, easier to debug, and easier to keep correct when changed.


PROS (continued)

The Performance Argument

• We end up improving performance (speed) because we need fewer JOINs to retrieve the same number of facts.

The Storage Argument

• Data is available at the locations where it will be used.
• The number of foreign keys (which define how separate tables are related) is reduced, and so is the number of indexes (foreign keys are frequently indexed).


CONS

• Leads to data duplication and increases the storage requirement of the database.

• Documenting decisions, ensuring valid data, data migration.

• Having multiple copies leads to synchronization issues.

• Increased update time.


Physically speaking…

Performance is determined entirely at the “physical database level”:

– Storage and access methods
– Hardware
– Physical design
– DBMS implementation details
– Degree of concurrent access


AN ILLUSION

The case for denormalization rests on the understanding that:

1. The higher the normalization, the greater the number of tables.
2. A greater number of tables requires more joins.
3. Joins slow performance.
4. Denormalization reduces the number of tables and hence means fewer joins and improved performance.

The problem is that points 2 and 3 are not necessarily true, in which case point 4 does not hold; and even if they do hold true, denormalization complicates integrity enforcement.


• It is claimed that from the integrity perspective, there are two database design options:

• Fully normalize the database thereby maximizing the simplicity of integrity enforcement;

• Denormalize the database and complicate integrity enforcement.

• According to the illusion argument, the first choice is the better option.

Why, then, the prevailing insistence on the second choice? The argument for denormalization is, of course, based on performance considerations


Conclusion

• In a real-life project, you have to bring back some data redundancy for performance reasons.

• Database design is about efficient data engineering: making tradeoffs between design choices and choosing the right design for the performance requirements.

• As stated by most database practitioners, denormalization may or may not result in a better performance or a more flexible data structure for users.

• Selective denormalization is usually required. Weigh and decide whether the perceived benefits are worth the effort to maintain the database properly.

• Weighing the pros and cons of denormalization is of vital importance.


References

• [1] G. Lawrence Sanders & Seung Kyoon Shin, Denormalization Effects on Performance of RDBMS, Proceedings of the 34th Hawaii International Conference on System Sciences, 2001.

• [2] Seung Kyoon Shin, G. Lawrence Sanders, Denormalization strategies for data retrieval from data warehouses

• [3] Marsha Hanus, To Normalize or Denormalize, That is the Question, Candle Corporation

• [4] Denormalization Guidelines by Craig S. Mullins Published: PLATINUM technology, inc. June 1, 1997

• [5] Douglas B. Bock and John F. Schrage, Department of Computer Management and Information Systems, Southern Illinois University Edwardsville, published in the 1996 Proceedings of the Decision Sciences Institute, Orlando, Florida, November, 1996

• [6] The Dangerous Illusion: Denormalization, Performance and Integrity, Part 1 and Part 2, -Fabian Pascal, DM Review Magazine, July 2002

• [7] Service-Oriented Data Denormalization for Scalable Web Applications, Zhou Wei (Tsinghua University Beijing, China), Jiang Dejun (Tsinghua University), Guillaume Pierre (Vrije Universiteit Amsterdam), Chi-Hung Chi (Tsinghua Univers), Maarten van Steen(Vrije Universiteit Amsterdam);April 21-25, 2008. Beijing, China

• [8] Understanding Normalisation, by Michael J Hernandez, 2001-2003.

• [9] Hierarchical Denormalizing: A Possibility to Optimize the Data Warehouse Design, by Morteza Zaker, Somnuk Phon-Amnuaisuk, Su-Cheng Haw

• [10] How Valuable is Planned Data Redundancy in Maintaining the Integrity of an Information System through its Database, by Eghosa Ugboma, Florida Memorial University

• [11] Introduction to Databases, Database Design and SQL, Zornitsa Zaharieva, CERN

• [12] THE DATA ADMINISTRATION NEWSLETTER – TDAN.com


THANK YOU


Anomalies

• Anomalies are inconsistencies in data that occur due to unnecessary redundancy.

• Update anomaly
– Some copies of a data item are updated, but others are not.

• Insertion anomaly
– Can’t insert “real” data without also inserting unrelated or “made up” data

• Deletion anomaly
– Can’t delete some data without also deleting other, unrelated data


First Normal Form (1NF)

If a table of data meets the definition of a relation, it is in first normal form.

– Every relation has a unique name.
– Every attribute value is atomic (single-valued).
– Every row is unique.
– Attributes in tables have unique names.
– The order of the columns is irrelevant.
– The order of the rows is irrelevant.


Second Normal Form (2NF)

• 1NF and no partial functional dependencies.

• Partial functional dependency: when one or more non-key attributes are functionally dependent on part of the primary key.

• Every non-key attribute must be defined by the entire key, not just by part of the key.

• If a relation has a single attribute as its key, then it is automatically in 2NF.


Second normal form 2NF

A relation that is not in 2NF

ACTIVITY (Student_ID, Activity, Fee)

Student_ID    Activity   Fee
222-22-2020   Swimming   30
232-22-2111   Golf       100
222-22-2020   Golf       100
255-24-2332   Hiking     50

Key: Student_ID, Activity

Activity → Fee (Fee is determined by Activity)


Divide the relation into two relations that now meet 2NF

STUDENT_ACTIVITY (Student_ID, Activity)

Student_ID    Activity
222-22-2020   Swimming
232-22-2111   Golf
222-22-2020   Golf
255-24-2332   Hiking

Key: Student_ID and Activity

ACTIVITY_COST (Activity, Fee)

Activity   Fee
Swimming   30
Golf       100
Hiking     50

Key: Activity

Activity → Fee
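The decomposition can be checked mechanically. The sketch below loads the ACTIVITY rows from the slide into sqlite3, projects them into the two 2NF relations, and confirms that joining the projections gives back exactly the original rows (a lossless decomposition). Lower-case table and column names are a stylistic assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activity (student_id TEXT, activity TEXT, fee INTEGER,
                           PRIMARY KEY (student_id, activity));
    INSERT INTO activity VALUES ('222-22-2020', 'Swimming', 30),
                                ('232-22-2111', 'Golf', 100),
                                ('222-22-2020', 'Golf', 100),
                                ('255-24-2332', 'Hiking', 50);
    -- Project onto the two relations of the 2NF design.
    CREATE TABLE student_activity AS SELECT student_id, activity FROM activity;
    CREATE TABLE activity_cost    AS SELECT DISTINCT activity, fee FROM activity;
""")

original = conn.execute("SELECT * FROM activity ORDER BY 1, 2").fetchall()
rejoined = conn.execute("""
    SELECT sa.student_id, sa.activity, ac.fee
    FROM student_activity sa JOIN activity_cost ac ON ac.activity = sa.activity
    ORDER BY 1, 2
""").fetchall()
fee_rows = conn.execute("SELECT COUNT(*) FROM activity_cost").fetchone()[0]
```

The Golf fee is now stored once instead of twice, so changing it is a single-row update with no risk of an update anomaly.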


Third Normal Form (3NF)

• 2NF and no transitive dependencies

• Transitive dependency: a functional dependency between two or more non-key attributes.


A relation with a transitive dependency

HOUSING (Student_ID, Building, Fee)

Student_ID    Building    Fee
222-22-2020   Dabney      1200
232-22-2111   Liles       1000
222-22-5554   The Range   1100
255-24-2332   Dabney      1200
330-25-7789   The Range   1100

Key: Student_ID

Building → Fee

Student_ID → Building → Fee


Divide the relation into two relations that now meet 3NF

STUDENT_HOUSING (Student_ID, Building)

Key: Student_ID

Student_ID → Building

BUILDING_COST (Building, Fee)

Building    Fee
Dabney      1200
Liles       1000
The Range   1100

Key: Building

Building → Fee


Third Normal Form (3NF)

• In 2NF and every non-key column is mutually independent. In practice this means: no stored calculations.

• Solution: put calculations in queries and forms.

OrderDetails (OrderID, Item, Quantity, Price)

Put the expression in a text control or in a query: =Quantity * Price

Item     Quantity   Price   Total
Hammer   2          $10     $20
Saw      5          $40     $200
Nails    8          $1      $8


BCNF

• 3NF and every determinant is a candidate key.


A relation where a determinant is not a candidate key

STUDENT_ADVISOR (Student_ID, Major, Advisor)

Student_ID    Major        Advisor
222-22-2020   MIS          Leigh
232-22-2111   Management   Gowan
222-22-2020   MIS          Roberts
222-22-2111   Marketing    Reynolds
255-24-2332   Marketing    Reynolds

Note: Students can have a double major and have an advisor for each major. An advisor works only with students in their assigned area.

Primary Key: Student_ID, Major

Candidate Key: Student_ID, Advisor

Advisor → Major


Divide the relation into two relations that meet BCNF

STUDENT_ADVISOR (Student_ID, Advisor)

Student_ID    Advisor
222-22-2020   Leigh
232-22-2111   Gowan
222-22-2020   Roberts
222-22-2111   Reynolds
255-24-2332   Reynolds

Key: Student_ID, Advisor

ADVISOR_MAJOR (Advisor, Major)

Advisor    Major
Leigh      MIS
Gowan      Management
Roberts    MIS
Reynolds   Marketing

Key: Advisor

Advisor → Major
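Whether a functional dependency holds in a set of sample rows can be tested with a few lines of plain Python. The helper fd_holds below is a hypothetical illustration, not part of the original deck; it is applied to the STUDENT_ADVISOR rows exactly as they appear on the slide.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the columns in lhs functionally determine the columns in rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        # The first value seen for a key is remembered; any later disagreement breaks the FD.
        if seen.setdefault(key, val) != val:
            return False
    return True

student_advisor = [
    {"student_id": "222-22-2020", "major": "MIS",        "advisor": "Leigh"},
    {"student_id": "232-22-2111", "major": "Management", "advisor": "Gowan"},
    {"student_id": "222-22-2020", "major": "MIS",        "advisor": "Roberts"},
    {"student_id": "222-22-2111", "major": "Marketing",  "advisor": "Reynolds"},
    {"student_id": "255-24-2332", "major": "Marketing",  "advisor": "Reynolds"},
]

# Advisor is a determinant: each advisor maps to exactly one major.
advisor_determines_major = fd_holds(student_advisor, ["advisor"], ["major"])

# But Advisor is not a candidate key: it does not determine the student.
advisor_is_key = fd_holds(student_advisor, ["advisor"], ["student_id"])

# Check of the stated primary key against the sample rows.
pk_determines_advisor = fd_holds(student_advisor, ["student_id", "major"], ["advisor"])

# Projection used by the BCNF design: one row per advisor.
advisor_major = sorted({(r["advisor"], r["major"]) for r in student_advisor})
```

A determinant that is not a candidate key (advisor_determines_major is true while advisor_is_key is false) is exactly the BCNF violation the slide describes. Note that pk_determines_advisor comes out false for the rows as transcribed, because student 222-22-2020 appears with major MIS under two advisors; the sample data may contain a transcription error, which a check like this makes visible.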


Speed Tables
