BTM 382 Database Management
Chapter 2: Data models
Chapter 12.12-13: CAP and Hadoop
Chitu Okoli
Associate Professor in Business Technology Management
John Molson School of Business, Concordia University, Montréal
1
3
What is a model?
• A model is a simplified way to describe or explain a complex reality
• A model helps people communicate and work simply yet effectively when talking about and manipulating complex real-world phenomena
4
Scientific models
Sources:
http://www.redorbit.com/education/reference_library/space_1/universe/2574692/geocentric_model/
http://hendrianusthe.wordpress.com/2012/06/21/heliocentric-vs-geocentric/
5
Conceptual models
Sources:
http://info563.malagaclasses.info/strategy-it-2/
http://fivewhys.wordpress.com/2012/05/22/business-model-innovation/
6
Importance of Data Models
• Communication tool
• Give an overall view of the database
• Organize data for various users
• Are an abstraction for creating a good database
9
The Relational Model
• Uses key concepts from mathematical relations (tables)
– “Relational” in “relational model” means “tables” (mathematical relations), not “relationships”
• Table (relation)
– Matrix consisting of row/column intersections
• Relations have well-defined methods (queries) for combining their data members
– Selecting (reading) and joining (combining) data is defined based on rigorous mathematical principles
• Relational database management system (RDBMS)
– Relations were originally too advanced for 1970s computing power
– As computing power increased, simplicity of the model prevailed
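The selecting and joining operations described on this slide can be sketched with Python's standard-library sqlite3 module. This is a minimal illustration only: the customer and invoice tables, their columns, and the sample rows are hypothetical and do not come from the slides.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (cus_id INTEGER PRIMARY KEY, cus_name TEXT);
        CREATE TABLE invoice (inv_id INTEGER PRIMARY KEY,
                              cus_id INTEGER REFERENCES customer(cus_id),
                              inv_total REAL);
        INSERT INTO customer VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO invoice VALUES (10, 1, 99.5), (11, 1, 20.0), (12, 2, 5.0);
    """)
    return conn

def select_customers(conn):
    # Selecting: read rows from a single relation (table).
    return conn.execute(
        "SELECT cus_name FROM customer ORDER BY cus_id"
    ).fetchall()

def join_totals(conn):
    # Joining: combine two relations on the column they share.
    return conn.execute("""
        SELECT c.cus_name, SUM(i.inv_total)
        FROM customer AS c JOIN invoice AS i ON i.cus_id = c.cus_id
        GROUP BY c.cus_id ORDER BY c.cus_id
    """).fetchall()

conn = build_demo_db()
print(select_customers(conn))  # [('Ada',), ('Grace',)]
print(join_totals(conn))       # [('Ada', 119.5), ('Grace', 5.0)]
```

The database, not the program, works out how to match invoice rows to customer rows; the code only states which columns must be equal.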
10
The Entity Relationship Model
• Very detailed specification of relationships and their properties
• Enhancement of the relational model
– Relations (tables) become entities
• Entity relationship diagram (ERD)
– Uses graphic representations to model database components
• Many variations for notation exist; we will use the Crow’s Foot notation
12
The Object-Oriented Data Model (OODM)
• Addresses “impedance mismatch” problem of the ER model
– The ER model’s view of data (tables) and programmers’ view of data (objects in OOP) are completely different
– This mismatch makes database programming painful, especially for very complex data structures
• OODM uses object-oriented programming concepts to store data
– Objects represent nouns (entities or records)
– Objects have attributes (properties or fields) with values (data)
– Objects have methods (operations or functions)
– Classes group similar objects using a hierarchy and inheritance
• In an OODBMS, data retrieval and storage closely mirror the data structures that programmers use, and so programming complex objects is much easier than with the ER model
• More advanced forms support the Extended Relational Data Model, Object/Relational DBMS, and XML data structures
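To make the impedance mismatch concrete, here is a small Python sketch of the translation code a programmer has to write when an object with nested parts is stored as table rows. The Order and OrderLine classes and the two-table layout are hypothetical examples, not taken from the slides.

```python
from dataclasses import dataclass, field

@dataclass
class OrderLine:
    product: str
    quantity: int

@dataclass
class Order:
    order_id: int
    customer: str
    lines: list = field(default_factory=list)

def to_rows(order):
    """Flatten one object into rows for two hypothetical tables."""
    order_row = (order.order_id, order.customer)
    line_rows = [(order.order_id, l.product, l.quantity) for l in order.lines]
    return order_row, line_rows

def from_rows(order_row, line_rows):
    """Rebuild the object from table rows (the reverse mapping)."""
    order_id, customer = order_row
    lines = [OrderLine(p, q) for (oid, p, q) in line_rows if oid == order_id]
    return Order(order_id, customer, lines)

order = Order(1, "Ada", [OrderLine("pen", 2), OrderLine("ink", 1)])
order_row, line_rows = to_rows(order)
print(order_row, line_rows)
print(from_rows(order_row, line_rows) == order)  # True
```

Every class with nested data needs a pair of functions like to_rows and from_rows. An object-oriented database aims to remove that layer by storing the object in the shape the program already uses.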
13
OODBMS vs. RDBMS
https://youtu.be/kORTgvfHl4g
15
Explaining Big Data
https://youtu.be/7D1CQ_LOizA
16
Big Data
• Volume
– Huge amounts of data (terabytes and petabytes), especially from the Internet
• Velocity
– Organizations need to process the huge amounts of data rapidly, just as with smaller databases
• Variety
– Wide variety of data, much of it unstructured and even changing in structure
17
Big data’s solutions and RDBMS’s failure
• Scale up: use more powerful servers
– RDBMS is very computing intensive
– More data requires much faster, more capable, expensive computers, and even that’s not good enough for big data
• Scale out: use many cheap distributed servers
– RDBMS doesn’t work rapidly with distributed processing
– Consistency is the biggest problem: guaranteeing consistency (which RDBMS is great at) is slow, too slow for big data
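One common way to scale out is to split the data across servers by hashing each key. The sketch below is an assumption made for illustration (the slides do not name a partitioning technique); the three in-memory dictionaries stand in for three cheap servers, and the function names are hypothetical.

```python
import hashlib

def shard_for(key, num_servers):
    """Pick a server index for a key using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_servers

# Three dictionaries stand in for three inexpensive servers.
servers = [dict() for _ in range(3)]

def put(key, value):
    servers[shard_for(key, len(servers))][key] = value

def get(key):
    return servers[shard_for(key, len(servers))].get(key)

for name in ["ada", "grace", "linus"]:
    put("user:" + name, {"name": name})

print(get("user:grace"))
print([len(s) for s in servers])  # how many keys landed on each server
```

Each key lives on exactly one server, so reads and writes for different keys can proceed in parallel. A query that must compare data held on different servers is where the consistency cost noted on the slide appears.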
18
What is NoSQL?
https://www.youtube.com/watch?v=qUV2j3XBRHc
19
NoSQL Databases to the Big Data rescue
• “NoSQL” means:
– Non-relational or non-RDBMS
– Also “Not only SQL”: a few do support SQL
• It is not one model; it is many different models that are not relational
• High scalability
– Support distributed database architectures
• High availability
– Rapid performance for big data, including unstructured and sparse data
• Fault tolerance
– Continue to work even if some servers in the cluster fail
• Geared toward performance rather than transaction consistency
• Many store data in key-value stores
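The key-value and fault-tolerance points can be sketched together as a toy replicated store. This is a simplified assumption, not how any particular NoSQL product works; the Node and ReplicatedStore names are hypothetical.

```python
class Node:
    """One server in the cluster, holding its own copy of the data."""
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

class ReplicatedStore:
    """Toy key-value store that copies every write to all live nodes."""
    def __init__(self, nodes):
        self.nodes = nodes

    def put(self, key, value):
        for node in self.nodes:
            if node.up:
                node.data[key] = value

    def get(self, key):
        # Answer from the first live node that has the key.
        for node in self.nodes:
            if node.up and key in node.data:
                return node.data[key]
        return None

nodes = [Node("n1"), Node("n2"), Node("n3")]
store = ReplicatedStore(nodes)
# Values need no fixed schema: the two records have different fields.
store.put("user:1", {"name": "Ada", "email": "ada@example.com"})
store.put("user:2", {"name": "Grace"})

nodes[0].up = False          # simulate a server failure
print(store.get("user:1"))   # still answered by another node
```

A node that was down during a write misses that write, which is one way replicas come to disagree until they are resynchronized.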
20
Disadvantages of NoSQL
• Complex programming is required
– “NoSQL” means you lose the ease of use and structural independence of SQL
– There is often no relationship support in the database; you have to program relationships in code
• There is no transaction integrity support
– The data you retrieve at any given moment might be wrong… but it will eventually become OK
– This is the price to pay for rapid performance in a distributed database
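As a rough sketch of what "programming relationships in code" means, the following Python snippet joins two hypothetical key-value collections by hand. The collection names, keys, and fields are illustrative assumptions.

```python
# Two key-value collections; nothing in the store links them together.
customers = {"c1": {"name": "Ada"}, "c2": {"name": "Grace"}}
invoices = {
    "i10": {"customer_id": "c1", "total": 99.5},
    "i11": {"customer_id": "c1", "total": 20.0},
    "i12": {"customer_id": "c9", "total": 5.0},  # points at a missing customer
}

def invoices_with_names(customers, invoices):
    """Hand-written join: look up each invoice's customer in code."""
    result = []
    for inv_id, inv in sorted(invoices.items()):
        cust = customers.get(inv["customer_id"])
        name = cust["name"] if cust else None  # the store did not prevent this
        result.append((inv_id, name, inv["total"]))
    return result

print(invoices_with_names(customers, invoices))
# [('i10', 'Ada', 99.5), ('i11', 'Ada', 20.0), ('i12', None, 5.0)]
```

In a relational database, a foreign key would have rejected invoice i12 and a JOIN would have replaced the loop. Here both jobs fall to the application.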
21
The CAP theorem for distributed databases
• CAP stands for:
– Consistency: All nodes see the same data
– Availability: A request always gets a response (success or failure)
– Partition tolerance: Even if nodes fail or lose contact with one another, the system can still function
• A distributed database can guarantee only two of the three CAP characteristics, never all three at the same time
– However, over time, it might be able to provide all three
• NoSQL databases are distributed, and so the CAP theorem restricts them to providing BASE, not ACID
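The trade-off can be illustrated with a toy two-replica store that must decide what to do with a write while its replicas cannot talk to each other. This is a deliberately simplified model for intuition, not a proof or a real database; TwoNodeStore and demo are hypothetical names.

```python
class TwoNodeStore:
    """Toy store with replicas A and B; clients write through A."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = None  # value held by replica A
        self.b = None  # value held by replica B
        self.partitioned = False

    def write(self, value):
        if not self.partitioned:
            self.a = self.b = value  # both replicas updated together
            return True
        if self.mode == "CP":
            return False             # refuse: stay consistent, lose availability
        self.a = value               # AP: accept, replicas now disagree
        return True

    def heal(self):
        """The partition ends and B catches up with A."""
        self.partitioned = False
        self.b = self.a

def demo(mode):
    store = TwoNodeStore(mode)
    store.write("v1")
    store.partitioned = True
    accepted = store.write("v2")
    during = (store.a, store.b)
    store.heal()
    return accepted, during, (store.a, store.b)

print(demo("CP"))  # (False, ('v1', 'v1'), ('v1', 'v1'))
print(demo("AP"))  # (True, ('v2', 'v1'), ('v2', 'v2'))
```

In CP mode the write is rejected during the partition, so the replicas never disagree. In AP mode the write succeeds, the replicas disagree for a while, and they match again only after the partition heals.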
22
ACID versus BASE
• A relational database guarantees the ACID properties:
– Atomicity, Consistency, Isolation, Durability
– In short, a set of SQL statements (called a transaction) will either all work or all fail, with no halfway success, and the result will not corrupt the database
– A price to pay: results might be somewhat slow
• NoSQL databases guarantee only the BASE properties:
– Basically Available, Soft state, Eventually consistent
– In short, at any given moment, not everything might be consistent, but the database will eventually become consistent
– In return, these imperfect results are delivered fast
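The "all work or all fail" behaviour of a transaction can be sketched with Python's standard-library sqlite3 module. The account table, the balances, and the transfer function are hypothetical; the rollback relies on sqlite3's documented behaviour that using a connection as a context manager commits on success and rolls back on an exception.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move money between accounts as one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src))
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst))
            (bal,) = conn.execute(
                "SELECT balance FROM account WHERE id = ?", (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

def balances(conn):
    return conn.execute(
        "SELECT id, balance FROM account ORDER BY id").fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

print(transfer(conn, 1, 2, 30), balances(conn))   # True [(1, 70), (2, 30)]
print(transfer(conn, 1, 2, 500), balances(conn))  # False [(1, 70), (2, 30)]
```

The second transfer runs both UPDATE statements before the check fails, yet neither change survives: the rollback undoes the whole transaction rather than leaving it half applied.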
23
Table 12.8 – Distributed Database Spectrum
Sacrifices availability to ensure consistency and isolation
25
Which data model should you use?
• Hierarchical or network models
– Obsolete; no one uses these any longer
• Entity-relationship model
– Continuation or enhancement of the relational model
– 90% or more of professional database situations
• Object-oriented database
– When you have very complex data structures, you need rapid performance, and it makes business sense
– Source: Barry & Associates, Inc
– Data structures are so complex that organizing data as tables causes headaches in programming retrieval and storage
• NoSQL
– Vast amounts of unstructured data where you need rapid performance
– Speed is more important than data consistency
26
Sources
• Most of the slides are adapted from Database Systems: Design, Implementation and Management by Carlos Coronel and Steven Morris. 11th edition (2015) published by Cengage Learning. ISBN 13: 978-1-285-19614-5
• Other sources are noted on the slides themselves