nosql, xrx and functional programming version 4.0 dan mccreary monday may 6th, 2013
TRANSCRIPT
M
D
NoSQL, XRX and Functional Programming
Version 4.0
Dan McCreary
Monday May 6th, 2013
M
DKelly-McCreary & Associates, LLC
2
Topics• NoSQL
– History– Business Motivation
• XRX (XForms, REST, XQuery)• Functional Programming• NoSQL Concepts
– MapReduce/Hadoop– Key-value stores– CAP Theorem– Object-Relational Mapping
• NoSQL Product Classifications– Key-value stores, graph-stores, bigtable stores and document stores
• Sampling of Solutions– Big Data– Search– High availabily– Agility
• Analysis methods (ATAM)
M
DCopyright Kelly-McCreary & Associates, LLC
3
Background for Dan McCreary
• Bell Labs• NeXT Computer (Steve Jobs)• Owner of 75-person software
consulting firm• US Federal data integration (National
Information Exchange Model NIEM.gov)
• Native XML/XQuery for metadata management since 2006
• Advocate of web standards, NoSQL and XRX systems
M
DCopyright Kelly-McCreary & Associates, LLC
4
Making Sense of NoSQL
• Working with NoSQL since 2006• Co-founders of the NoSQL Now! conference• Authors of Manning book on NoSQL (MEAP now, print July 2013)• Guide for managers with a focus on business benefits• Focus on NoSQL architectural tradeoff analysis• http://manning.com/mccreary
M
DKelly-McCreary & Associates
• How got into NoSQL• Background as of 2006
– Worked for Steve Jobs at NeXT– Object-oriented development Objective-C, Java– Strong focus on object-relational mapping
• DBKit, WebObjects, Hibernate, Oracle, Sybase, MS-SQL Server
– 7 years doing XML (CriMNet), Metadata Registries (ISO/IEC 11179) and Semantics (RDF, OWL etc.)
My Story
5
M
DKelly-McCreary & Associates
6
M
DKelly-McCreary & Associates
7
M
DKelly-McCreary & Associates
8
M
DKelly-McCreary & Associates
9
XForms Rocks!
M
DKelly-McCreary & Associates
10
M
D
Four Translations
11Copyright 2010 Dan McCreary & Associates
• T1 – HTML into Java Objects• T2 – Java Objects into SQL Tables• T3 – Tables into Objects• T4 – Objects into HTML
T1
T4
T2
T3
Object MiddleTier
RelationalDatabaseWeb Browser
M
D
Kurt's Suggestion
store($collection, $file-name, $data)
12Copyright 2010 Dan McCreary & Associates
Web Browser
Save
Web Form
Use a Native XML
Database!
Kurt Cagle
eXist-db
M
D
Zero Translation
• XML lives in the web browser (XForms)• REST interfaces• XML in the database (Native XML, XQuery)• XRX Web Application Architecture• No translation!
13Copyright 2010 Dan McCreary & Associates
Web Browser XML database
XFormsModel
REST Interface
Native XML Database
M
DKelly-McCreary & Associates
14
Database Tradeoff Analysis
My Way• 10,000 lines of code to break
CRV XML document into Java component and use Hibernate to store each element into columns of tables
• 45 SQL inserts store• 20 joins to extract• Six months• Four developers (Java, SQL,
DBA, Project Manager)
Kurt's Way• One line of codes to store
CRB• One day to write• One week to test• 100x increase in agility
M
DKelly-McCreary & Associates
15
The Paradigm Shift
2006
RDBMSs are the onlyway to store enterprise
data
There are many ways to store enterprise data and if you pick the right one your agility can go
up x1,000
M
DKelly-McCreary & Associates
16
Reality Sets In
What I wanted• Objective architecture
analysis• Fairly weigh the pros and
cons of each alternative• Be surrounded by people
that know the strengths and weakness of many alternatives
What I got• Architecture decisions
driven by an RDBMS license• Architecture decisions
made by one person with limited exposure to alternatives
• Architecture decisions made by lack of knowledge
• Architecture by fear of the unknown
M
DKelly-McCreary & Associates
Evaluating Software Architectures: Methods and Case Studies
by Paul Clements, Rick Kazman, and Mark Klein
• Addison-Wesley, 2001
Key Book for ATAM
17
We need "Evaluating Database Architectures"
M
DCopyright Kelly-McCreary & Associates, LLC
18
Three Eras of Databases
• RDBMS for transactions, Data Warehouse for analytics and NoSQL for scalability
RDBMSRDBMS
DataWarehouse
1985-1995
1995-2010 2010-Now
DataWarehouseRDBMS
NoSQL
M
DCopyright Kelly-McCreary & Associates, LLC
19
Sample of NoSQL Jargon
Document orientation
Schema free
MapReduce
Horizontal scaling
Sharding and auto-sharding
Brewer's CAP Theorem
Consistency
Reliability
Partition tolerance
Single-point-of-failure
Object-Relational mapping
Key-value stores
Column stores
Document stores
Memcached
Indexing
B-Tree
Configurable durability
Documents for archives
Functional programming
Document Transformation
Document Indexing and Search
Alternate Query Languages
Aggregates
OLAP
XQuery
MDX
RDF
SPARQL
Architecture Tradeoff Modeling
ATAM
Note that within the context of NoSQL many of these terms have different meanings!
M
DCopyright Kelly-McCreary & Associates, LLC
20
NoSQL – The Big Tent
• "NoSQL" – a label for a "meme" that now encompasses a large body of innovative ideas on data management
• "Not Only SQL"• Focus on non-relational
databases and hybrids• A community were new ideas
are quickly recombined to create innovative new business solutions
http://www.flickr.com/photos/morgennebel/2933723145/
M
DCopyright Kelly-McCreary & Associates, LLC
21
Pressures on SQL Only Systems
SQLOLAP/BI/DataWarehouse
Social Networks
Scalability
AgileSchema
FreeDocument-D
ata
Large Data Sets Reliability
Linked Data
M
DCopyright Kelly-McCreary & Associates, LLC
22
Before NoSQL DB Selection Was Easy!
Does it look like
document?
Use MicrosoftOffice
Use theRDBMS
Start
Stop
No
Yes
M
DCopyright Kelly-McCreary & Associates, LLC
23
An evolving tree of data types
Read MostlyRead/Write
StructuredUnstructured
Transactional
RDBMS BI/DWWeb Crawlers
Documents
Log Files
XML
JSON
Binary
Open Linked Data
Graph
M
DCopyright Kelly-McCreary & Associates, LLC
24
Many Uses of Data
• Transactions (OLTP)• Analysis (OLAP)• Search and Findability• Enterprise Agility• Discovery and Insight• Speed and Reliability• Consistency and Availability
M
D
Before NoSQL
Relational Analytical (OLAP)
25
M
D
After NoSQL
Relational Analytical (OLAP) Key-Value
Column-Family DocumentGraph
key value
key value
key value
key value
26
M
DCopyright Kelly-McCreary & Associates, LLC
27
Three Eras of Enterprise Data
• NoSQL will not replace ERP or BI/DW systems – but they will complement them and also facilitate the integration of unstructured document data
1970s, 80s, 90s 1990s, 2000s
HR
Inventory Sales
FinanceERP
BI/DataWarehouse
ERP
BI/DataWarehouse
OLTP
OLAP NoSQL
Today
Siloed Systems
DocumentsDocuments
ERP Drives BI/DW NoSQL and Documents
M
DCopyright Kelly-McCreary & Associates, LLC
28
Simplicity is a Virtue
• Many modern systems derive their strength by dramatically limiting the features in their system and focus on a specific task
• Simplicity allows database designer to focus on the primary business drivers
Photo from flickr by PSNZ Images
M
DCopyright Kelly-McCreary & Associates, LLC
29
Simplicity is a Design Style
• Focus only on simple systems that solve many problems in a flexible way
• Examples:– Touch screen interfaces– Key/Value data stores
M
DCopyright Kelly-McCreary & Associates, LLC
30
Historical Context
Mainframe Era
• 1 CPU• COBOL and FORTRAN• Punchcards and flat files• $10,000 per CPU hour
Commodity Processors
• 10,000 CPUs• Functional programming• MapReduce "farms"• Pennies per CPU hour
M
D
Two Approaches to Computation
31Copyright 2010 Dan McCreary & Associates
Alonzo ChurchJohn von Neumann
Manage state with a program counter. Make computations act like math functions.
Which is simpler? Which is cheaper? Which will scale to 10,000 CPUs?
1930s and 40s
M
DCopyright Kelly-McCreary & Associates, LLC
32
Standard vs. MapReduce Prices
http://aws.amazon.com/elasticmapreduce/#pricing
John's Way Alonzo's Way
M
DCopyright Kelly-McCreary & Associates, LLC
33
MapReduce CPUs Cost Less!
0
5
10
15
20
25
30
35
40Cost Per CPU Hour (Cents)
http://aws.amazon.com/elasticmapreduce/#pricing
Cust cost from 32 to 6 cents per CPU hour!Perhaps Alanzo was right!
82% CostReduction!
Why? (hint: how "shareable" is this process)
M
DKelly-McCreary & Associates, LLC
34
Its about transformation!
34
M
DKelly-McCreary & Associates, LLC
35
Architectural Tradeoffs
I want a fast car with good mileage.
I want a scaleable database with low cost that runs well on the 1000 CPUs in our data center.
M
DKelly-McCreary & Associates, LLC
36
Recent History
• The term NoSQL became re-popularized around 2009
• Used for conferences of advocates of non-relational databases
• Became a contagious idea "meme"• First of many "NoSQL meetups" in San
Francisco organized by Jon Oskarsson• Conversion from "No SQL" to "Not Only
SQL" in recent year
M
DCopyright Kelly-McCreary & Associates, LLC
37
Key Events on NoSQL Timeline
1998
Carlo StrozziIntros NoSQL
2006
GoogleBigtable
2009
Eric EvansReintroduces Term
2005
GoogleMapReduce
XQuery 1.0Recommendation
NoSQL MeetupsAmazonDynamo
2007 2008 201020042003
LiveJournal'sMemcache
Open SourceBecomes Popular
MDX
M
DCopyright Kelly-McCreary & Associates, LLC
38
RDBMS vs. NoSQL
• NoSQL is real and it’s here to stay
http://www.google.com/trends/explore#q=nosql%2C%20rdbms&date=1%2F2009%2051m&cmpt=q
RDBMS
NoSQL
Google Trends
M
DKelly-McCreary & Associates, LLC
39
NoSQL and Web 2.0 Startups
• Many web 2.0 startups did not use Oracle or MySQL
• They built their own data stores influenced by Amazon’s Dynamo and Google’s Bigtable in order to store and process huge amounts of data
• In the social community or cloud computing applications most of these data stores became open source software
M
DCopyright Kelly-McCreary & Associates, LLC
40
2003 - Memcached
• Originally developed for LiveJournal as a "side-cache"• Allowed databases to put data in a "RAM Cache" in front
of a SQL server to speed up read-mostly operations• Cache in all front-end servers can be shared via a simple
protocol• Dramatically lowers stress on SQL server• Ideal "front end" to SQL systems to increase performance
M
DCopyright Kelly-McCreary & Associates, LLC
41
memcachedb
Partition Select and Updates
• Most frequently read data always stays in RAM• Complex update transactions go directly to the database
memcachedbSelect
SQL "System ofRecord"
InsertUpdateDelete
memcachedb
Sync with memcached protocol
M
DCopyright Kelly-McCreary & Associates, LLC
42
Google MapReduce
• 2004 paper that had huge impact of functional programming on the entire community
• Copied by many organizations, including Yahoo
M
DCopyright Kelly-McCreary & Associates, LLC
43
Google Bigtable Paper
• 2006 paper that gave focus to scaleable databases
• designed to reliably scale to petabytes of
data and thousands of machines
M
DCopyright Kelly-McCreary & Associates, LLC
44
Amazon's Dynamo Paper
• Werner Vogels• CTO - Amazon.com• October 2, 2007• Used to power Amazon's
Web Sites• One of the most
influential papers in the NoSQL movement
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall and Werner Vogels, “Dynamo: Amazon's Highly Available Key-Value Store”, in the Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007.
M
DKelly-McCreary & Associates, LLC
45
2009: the NoSQL "Revolt"
“NoSQLers came to share how they had overthrown the tyranny of slow, expensive relational databases in favor of more efficient and cheaper ways of managing data.”
Computerworld magazine, July 1st, 2009
NoSQL!
M
DCopyright 2008 Dan McCreary & Associates
46
Many Processes Today Are Driven By…
The constraints of yesterday…
Challenge:Ask ourselves the question…Do our current method of solving problems with tabular data…Reflect the storage of the 1950s…Or our actual business requirements?What structures best solve the actual business problem?
M
DCopyright 2008 Dan McCreary & Associates
47
No-Shredding!
• Relational databases take a single hierarchical document and shred it into many pieces so it will fit in tabular structures
• Document stores prevent this shredding
My FormData
M
DCopyright 2008 Dan McCreary & Associates
48
Is Shredding Really Necessary?
• Every time you take hierarchical data and put it into a traditional database you have to put repeating groups in separate tables and use SQL “joins” to reassemble the data
M
D
The Addition of XML Web Services
• T1 – HTML into Java Objects• T2 – Java Objects into SQL Tables• T3 – Tables into Objects• T4 – Objects into HTML• T5 – Objects to XML• T6 – XML to Objects
49
Copyright 2011 Kelly-McCreary & Associates
T1
T3
T2
T4
Object MiddleTier
RelationalDatabaseWeb Browser
T5
Web Service
T6
M
DCopyright Kelly-McCreary & Associates, LLC
50
"The Vietnam of Applications"
• Object-relational mapping has become one of the most complex components of building applications today
• A "Quagmire" where many projects get lost• Many "heroic efforts" have been made to solve
the problem– Java Hibernate Framework– Ruby on Rails
• But sometimes the best way to avoid complexity is to keep your architecture very simple
M
D
"Schema Free"
• Systems that automatically determine how to index data as the data is loaded into the database
• No a priori knowledge of data structure• No need for up-front logical data modeling
– …but some modeling is still critical
• Adding new data elements or changing data elements is not disruptive
• Searching millions of records still has sub-second response time
51Copyright 2010 Dan McCreary & Associates
M
DCopyright Kelly-McCreary & Associates, LLC
52
Schema-Free Integration
"We can easily store the data that we actually get, not the data we thought we would get."
XMLv1
XMLv2
XMLv3
Enterprise Messaging System NoSQLDatabase
M
DCopyright Kelly-McCreary & Associates, LLC
53
Upfront ER Modeling is Not Required
• You do not have to finish modeling your data before you insert your first records
• No Data Definition Language "DDL" is needed
• Metadata is used to create indexes as data arrives
• Modeling becomes a statistical process – write queries to find exceptions and normalize data
• Exceptions make the rules but can still be used
• Data validation can still be done on documents using tools such as XML Schema and business rules systems like Schematron
M
D
Monoculture and Mono-architecture
54Copyright 2011 Kelly-McCreary & Associates
Image Source: Wikipedia
M
DKelly-McCreary & Associates, LLC
55
Eric Evans
“The whole point of seeking alternatives [to RDBMS systems] is that you need to solve a problem that relational databases are a bad fit for.”
Eric EvansRackspace
M
D
Finding the Right Match
56Copyright 2011 Dan McCreary & Associates
Schema-Free
Mature Query Language
Standards Compliant
Use CMU's Architectural Tradeoff and Method (ATAM) Process
M
DCopyright Kelly-McCreary & Associates, LLC
57
ACID Transactions
• Atomic – transaction either complete or fail – no in-between
• Consistent – no reports can ever catch data between states
• Isolated – never allow users to touch data that is incomplete
• Durable – recover committed transactions from partial system failures
M
DCopyright Kelly-McCreary & Associates, LLC
58
BASE
• Base Availability of Soft state Eventually Consistent
• "Eventually Consistent" is not appropriate for many financial reporting systems
M
DKelly-McCreary & Associates, LLC
59
Sharding and Partition Tolerance
• When a database gets full how easy is it for your system to automatically move half the data to new systems?
Time to "Shard"
New Server
Before autoshard the node is near 100% capacity.
After autoshard the nodes are "rebalanced" with each node taking half the load.
copy 1/2 data
M
DKelly-McCreary & Associates, LLC
60
Brewer's CAP Theorem
Consistency
Availability Partition Tolerance
You can not have all three in
so pick two!
M
DKelly-McCreary & Associates, LLC
61
Proof of Brewer's CAP Theorem
• In 2002, Seth Gilbert and Nancy Lynch of MIT, formally proved Brewer's Theorem to be correct
• Focus turned away from trying to get all three to trying to pick the right two for the job
• Many systems are becoming "CAP tunable"
M
DCopyright Kelly-McCreary & Associates, LLC
62
CAP Systems
• CA – Consistent and Available– Example – a single site, single database, no auto-
sharding– If you have to partition, stop, partition, restart
• CP – Consistent with Partition Tolerance– Some data not available when partitioning– What data you have is always consistent
• AP – Available with Partition Tolerance– Always on, but during partitioning, some data may
be inconsistent– After sharding eventual consistency
M
DKelly-McCreary & Associates, LLC
63
CAP Regions
Consistency
Availability Partition Tolerance
CP
AP
CA
NoSQL Innovation
RDBMS
M
DKelly-McCreary & Associates, LLC
64
When does CAP Apply?
Consistency AND Availability
Each transaction can select between Consistency OR Availability depending on
the context
If you have…
A single processor or many processors on a
working network:
Many processors and network failures:
then you get…
M
DCopyright Kelly-McCreary & Associates, LLC
65
Scale Up vs. Scale Out
Scale Up• Make a single CPU as fast as
possible• Increase clock speed• Add RAM• Make disk I/O go faster
Scale Out• Make Many CPUs work
together• Learn how to divide your
problems into independent threads
M
DCopyright Kelly-McCreary & Associates, LLC
66
The Tunable SLA
• NoSQL systems that use many commodity processors can be precisely tuned to meet an organizations service level agreements
Max ReadTime
Max WriteTime
Reads PerSecond
DuplicateCopies
MultipleDatacenters
TransactionGuarantees
Writes PerSecond
$435.50
EstimatedPrice/Month
inputs output
M
DCopyright Kelly-McCreary & Associates, LLC
67
Example: Amazon DynamoDB
• Tune your service to the business requirements (challenging to do with a RDBMS)
TuningKnobs
M
DKelly-McCreary & Associates, LLC
68
"One Size Fits…"
"One Size Does Not Fit All"James Hamilton Nov. 3rd, 2009
http://perspectives.mvdirona.com/CommentView,guid,afe46691-a293-4f9a-8900-5688a597726a.aspx
James works for Amazon Web Services
M
DKelly-McCreary & Associates, LLC
69
NoSQL and Cloud Computing
• High scalability– Especially in the horizontal
direction (multi CPUs)
• Low administration overhead– Simple web page
administration
• Lower fees for MapReduce transformations
M
DCopyright Kelly-McCreary & Associates, LLC
70
Distribution Models
Master-Slave Peer-to-Peer
MasterStandbyMaster
Node Node
Node
Node
Node
Node
Node
requestsrequests
Used only if primary master fails
M
D
The NO-SQL Universe
71
Copyright 2010 Dan McCreary & Associates
Document StoresKey-Value Stores
Graph/Triple Stores
Object StoresColumn-Family Stores
XML
M
DCopyright Kelly-McCreary & Associates, LLC
72
Relational
• Data is usually stored in row by row manner (row store)
• Standardized query language (SQL)• Data model defined before you add
data• Joins merge data from multiple tables• Results are tables• Pros: mature ACID transactions with
fine-grain security controls• Cons: Requires up front data
modeling, does not scale well
Examples:Oracle, MySQL,PostgreSQL,Microsoft SQL Server, IBM DB/2
M
DCopyright Kelly-McCreary & Associates, LLC
73
Analytical (OLAP)
• Based on "Star" schema with central fact table for each event
• Optimized for analysis of read-analysis of historical data
• Use of MDX language to count query "measures" for "categories" of data
• Pros: fast queries for large data• Cons: not optimized for
transactions and updates
Examples:Cognos, Hyperion, Microstrategy, Pentaho, Microsoft, Oracle, Business Objects
M
DCopyright Kelly-McCreary & Associates, LLC
74
Key-Value Stores
• Keys used to access opaque blobs of data
• Values can contain any type of data (images, video)
Pros: scalable, simple API (put, get, delete)
Cons: no way to query based on the content of the value
key value
key value
key value
key value
Examples:Berkley DB, Memcache, DynamoDB, S3, Redis, Riak
M
DCopyright Kelly-McCreary & Associates, LLC
75
Key Value Stores
• A table with two columns and a simple interface– Add a key-value– For this key, give me the
value– Delete a key
• Blazingly fast and easy to scale (no joins)
Key Value
Blob datatype
string datatype
M
DCopyright Kelly-McCreary & Associates, LLC
76
The Locker Metaphor
Key:
Value:An arbitrary container data
M
DCopyright Kelly-McCreary & Associates, LLC
77
No Subset Queries in Key-Value Stores
M
DCopyright Kelly-McCreary & Associates, LLC
78
Types of Key-Value Stores
• Eventually‐consistent key‐value store• Hierarchical key‐value stores• Key-Value stores in RAM• Key-Value stores on disk• High availability key-value store• Ordered key‐value stores• Values that allow simple list operations
M
DCopyright Kelly-McCreary & Associates, LLC
79
Memcached
• Open source in-memory key-value caching system• Make effective use of RAM on many distributed web servers• Designed to speed up dynamic web applications by alleviating database
load• RAM resident key-value store for small chunks of arbitrary data (strings,
objects) from results of database calls, API calls, or page rendering• Simple interface for highly distributed RAM caches• 30ms read times typical• Designed for quick deployment, ease of development• APIs in many languages
M
DCopyright Kelly-McCreary & Associates, LLC
80
Riak
• Open source distributed key-value store with support and commercial versions by Basho
• A "Dynamo-inspired" database• Focus on availability, fault-tolerance,
operational simplicity and scalability• Support for replication and auto-sharding and
rebalancing on failures• Support for MapReduce, fulltext search and
secondary indexes of value tags• Written in ERLANG
M
DCopyright Kelly-McCreary & Associates, LLC
81
Redis
• Open source in-memory key-value store with optional durability
• Focus on high speed reads and writes of common data structures to RAM
• Allows simple lists, sets and hashes to be stored within the value and manipulated
• Many features that developers like– expiration, transactions, pub/sub, partitioning
M
DCopyright Kelly-McCreary & Associates, LLC
82
Amazon DynamoDB
• Amazon DynamoDB• Based around scalable key-value store• Fastest growing product in Amazon's
history• SSD only database service• Focus on throughput not storage and
predictable read and write times• Strong integration with S3 and Elastic
MapReduce
M
DCopyright Kelly-McCreary & Associates, LLC
83
Column-Family
• Key includes a row, column family and column name
• Store versioned blobs in one large table
• Queries can be done on rows, column families and column names
• Pros: Good scale out, versioning• Cons: Cannot query blob content,
row and column designs are critical
Examples:Cassandra, HBase, Hypertable, Apache Accumulo, Bigtable
M
DCopyright Kelly-McCreary & Associates, LLC
84
Column Family (Bigtable)
• The champion of "Big Data"• Excel at highly saleable systems• Tightly coupled with MapReduce• Technically a "sparse matrix" were most cells have
no data• Generating a list of all columns is non-trivial• Examples:
– Google Bigtable– Hadoop HBase– Hypertable
M
DCopyright Kelly-McCreary & Associates, LLC
85
Spreadsheets Use a Row/Column as a Key
• Bigtable systems use a combination of row and column information as part of their key
Column ID
Row ID
M
DCopyright Kelly-McCreary & Associates, LLC
86
Keys Include Family and Timestamps
• Bigtable systems have keys that include not just row and column ID but other attributes
• Column Families are created when a table is created
• Timestamps allows multiple versions of values• Values are just ordered bytes and have no
strongly typed data system
M
DCopyright Kelly-McCreary & Associates, LLC
87
Column Store Concepts
• Preserve the table-structure familiar to RDBMS systems
• Not optimized for "joins"• One row could have millions of
columns but the data can be very "sparse"
• Ideal for high-variability data sets• Colum families allow to query all
columns that have a specific property or properties
• Allow new columns to be inserted without doing an "alter table"
• Trigger new columns on inserts
Col1 Col100000…
M
DCopyright Kelly-McCreary & Associates, LLC
88
Column Families
• Group columns into "Column families"
• Group column families into "Super-Columns"
• Be able to query all columns with a family or super family
• Similar data grouped together to improve speed
Table
SuperCol X
SuperCol Y
Fam1 Fam2
Col-A Col-B
M
DCopyright Kelly-McCreary & Associates, LLC
89
Hadoop/Hbase
• Open source implementation of MapReduce algorithm written in Java
• Initially created by Yahoo– 300 person-years development
• Column-oriented data store• Java interface• HBase designed specifically to work with
Hadoop• High-level query language (Pig)• Strong support by many vendors
M
DCopyright Kelly-McCreary & Associates, LLC
90
Cassandra
• Apache open source column family database supported by DataStax
• Peer-to-peer distribution model• Strong reputation for linear scale out
(millions of writes/second)• Database side security• Written in Java and works well with HDFS
and MapReduce
M
DCopyright Kelly-McCreary & Associates, LLC
91
Netflix
M
DCopyright Kelly-McCreary & Associates, LLC
92
Graph Store
• Data is stored in a series of nodes, relationships and properties
• Queries are really graph traversals• Ideal when relationships between
data is key: – e.g. social networks
• Pros: fast network search, works with public linked data sets
• Cons: Poor scalability when graphs don't fit into RAM, specialized query languages (RDF uses SPARQL)
Examples:Neo4j, AllegroGraph, Bigdata triple store, InfiniteGraph, StarDog
M
DCopyright Kelly-McCreary & Associates, LLC
93
Graph Stores
• Used when the relationship and relationships types between items are critical
• Used for– Social networking queries: "friends of my friends"– Inference and rules engines– Pattern recognition– Used for working with open-linked data
• Automate "joins" of public data
M
DCopyright Kelly-McCreary & Associates, LLC
94
Nodes are "joined" to create graphs
• How do you know that two items reference the same object?
• Node identification – URI or similar structure
M
DCopyright Kelly-McCreary & Associates, LLC
95
Open Linked Data
M
DCopyright Kelly-McCreary & Associates, LLC
96
Neo4J
• Graph database designed to be easy to use by Java developers
• Dual license (community edition is GPL)
• Works as an embedded java library in your application
• Disk-based (not just RAM)• Full ACID
M
DCopyright Kelly-McCreary & Associates, LLC
97
Document Store
• Data stored in nested hierarchies
• Logical data remains stored together as a unit
• Any item in the document can be queried
• Pros: No object-relational mapping layer, ideal for search
• Cons: Complex to implement, incompatible with SQL
Examples:MarkLogic, MongoDB, Couchbase, CouchDB, eXist-db
M
DCopyright Kelly-McCreary & Associates, LLC
98
Document Stores
• Store machine readable documents together as a single blob of data
• Use JSON or XML formats to store documents• Similar to "object stores" in many ways• No shredding of data into tables• Sub-trees and attributes of documents can still be
queried XQuery or other document query languages• Quickly maturing to include ACID transaction support• Lack of object-relational mapping permits agile
development• Fastest growing revenues (MarkLogic, MongoDB,
Couchbase)
M
DCopyright Kelly-McCreary & Associates, LLC
99
Estimated Big Data and NoSQL Sales
Document Stores
M
DCopyright Kelly-McCreary & Associates, LLC
100
Document Structure
<books> is our root element
<books> contain a
sequence of one to many
<book> elements
Each <book> contains
the following sequence of
elements
Darker lines mean "required" and light lines mean optional
elements
Id and title are required
Books have 0 to many author-names
Format and license elements are codes that must be in a fixed list of choicesOnly valid URL
characters
Must be a valid decimal number
M
DCopyright Kelly-McCreary & Associates, LLC
101
MarkLogic
• Native XML database designed to scale to Petabyte data stores
• Leverages commodity hardware• ACID compliant, schema-free document
store• Heavy use by federal agencies, document
publishers and "high-variability" data• Arguably the most successful NoSQL
company
M
DCopyright Kelly-McCreary & Associates, LLC
102
MongoDB
• Open Source JSON data store created by 10gen
• Master-slave scale out model• Strong developer community• Sharding built-in, automatic• Implemented in C++ with many APIs (C++,
JavaScript, Java, Perl, Python etc.)
M
DCopyright Kelly-McCreary & Associates, LLC
103
Couchbase
• Open source JSON document store• Code base separate from CouchDB• Built around memcached• Peer to peer scale out model• Written in C++ and Erlang• Strengths in scale out, replication and
high-availability
M
DCopyright Kelly-McCreary & Associates, LLC
104
CouchDB
• Apache CouchDB• Open source JSON data store• Document Model• Written in ERLANG• RESTful JSON API• Distributed, featuring robust, incremental
replication with bi-directional conflict detection and management
• B-Tree based indexing• Mobile version
M
DCopyright Kelly-McCreary & Associates, LLC
105
eXist
• Open source native XML database• Strong support for XQuery and XQuery
extensions• Heavily used by the Text Encoding Initiative
(TEI) community and XRX/XForms communities
• Integrated Lucene search• Collection triggers and versioning• Extensive XQuery libs (EXPath)• Version 2.0 has replication
M
DCopyright Kelly-McCreary & Associates, LLC
106
Structured Search
• What happens if you retain document structure when you index your documents?
• How can use document structure to increase your precision and relevancy
M
D
Information Retrieval Textbook
Introduction to Information Retrieval
by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze
Cambridge University Press, 2008
107
http://nlp.stanford.edu/IR-book/information-retrieval-book.html
M
D
Table 10.1
RDB search unstructured retrieval
structured retrieval
objects records unstructured documents
trees with text at leaves
model relational model vector space & others ?
main data structure table inverted index ?
queries SQL free text queries ?
108
XML - Table 10.1 and structured information retrieval. SQLRDB (relational database) search, unstructured information retrieval
M
D
Table 10.1 - Revised
RDB search unstructured retrieval
structured retrieval
objects records unstructured documents
trees with text at leaves
model relational model vector space & others XML hierarchy
main data structure table inverted index
trees with node-ids for document ids
queries SQL free text queries XQuery fulltext
109
XML - Table 10.1 and structured information retrieval. SQLRDB (relational database) search, unstructured information retrieval
M
DKelly-McCreary & Associates, LLC
Two Models
"Bag of Words"
• All keywords in a single container• Only count frequencies are stored
with each word
"Retained Structure"
• Keywords associated with each sub-document component
110
'love'
'hate''new'
'fear'
keywords
keywords
keywords
keywords
keywords
keywords
doc-id
M
D
Keywords and Node IDs
• Keywords in the reverse index are now associated with the node-id in every document
111
Node-id
Node-id
Node-id
Node-id
Node-id
Node-id
keywords
keywords
keywords
keywords
keywords
keywords
document-id
M
DCopyright Kelly-McCreary & Associates, LLC
112
Hybrid architectures
• Most real world implementations use some combination of NoSQL solutions
• Example:– Use document stores for data– Use S3 for image/pdf/binary storage– Use Apache Lucene for document index stores– Use MapReduce for real-time index and
aggregate creation and maintenance– Use OLAP for reporting sums and totals
M
DCopyright Kelly-McCreary & Associates, LLC
113
Other Terms
• Hash (Merkle) Trees –where a hash function is used for each leaf and branches up to the root of a file system – fast ways to see if complex documents in tree structures have changed
• B-Trees, B+ Trees – binary trees that are used to create fast searches of large data sets
• Vector Clocks – an algorithm for detecting dependencies between remote computer systems – used to track what systems may have corrupted data
• Bloom Filters – hashing tricks to avoid unnecessary disk operations on large data sets
• Consistent Hashing – A way of converting strings of data into a unique key that can be used for fast comparisons
M
DCopyright Kelly-McCreary & Associates, LLC
114
Tools to Help You Select A System
• ATAM – Architecture Tradeoff Methodology
• CMU developed process to objectively select a system architecture based on business driven use-cases and quality metrics
M
DCopyright Kelly-McCreary & Associates, LLC
115
ATAM Process Flow
BusinessDrivers
QualityAttributes
UserStories
AnalysisArchitecture
PlanArchitecturalApproaches
ArchitecturalDecisions
Tradeoffs
SensitivityPoints
Non-Risks
RisksRisk ThemesDistilled info
Impacts
M
DCopyright Kelly-McCreary & Associates, LLC
116
Sample Use Cases/User Stories
• Effort to Load New Data• Effort to Query Data• Effort to Transform Data• Effort to Create RESTful Web Services
M
D 117
Insert/Select/Publish Comparison
Insert Query CreatePublishing
Web Service
SQL
WebDAV SQL XQuery
JavaTomcatAXISJDBC
Total Effort
XQuery
logicaldatamodeling
SQL
SQL
XQuery
Effort
M
DKelly-McCreary & Associates, LLC
118
Sample Quality Attribute Utility Tree
UtilityUtility
SearchabilitySearchability
XML ImportabilityXML Importability
TransformabilityTransformability
AffordabilityAffordability
SustainabilitySustainability
InteroperabilityInteroperability
Easy to add new XML data. (C, H)
Use OpenSource Software. (H, H)
Use long standing Standards. (VH, H)
Use W3C Standards. (VH, H)
Fulltext search on document data. (H, H)
Easy to transform XML data. (H, H)
StandardsStandards
Ease of ChangeEase of Change Use declarative languages. (VH, H)
SecuritySecurity Prevents unauthorized access. (H, H)
Fulltext SearchFulltext Search
XML SearchXML Search
Custom ScoringCustom Scoring
Drag-and-dropDrag-and-drop
Bulk ImportBulk Import
No License FeesNo License Fees
XQueryXQuery
XSLTXSLT
Web ServicesWeb Services
Fine Grain ControlFine Grain Control
Standards BasedStandards Based
No TranslationNo Translation
Key: (Importance, Score)Important to the Project C=Critical, VH=Very High, H=High, M=MediumArchitectural Score H=High, M=Medium, L=Low
Scoring via XQuery. (M, H)
Role-basedRole-based
Works with W3C Forms. (H, M)
Works with web standards. (VH, H)
No complex languages to learn. (H, H)
Fast searching. (H, H)
Staff training. (H, M)
Collection-based access. (H, M)Centralized security policy. (H, M)
Mashups wtih REST Interfaces. (VH, H)
Batch Import tools. (M, M)
Transform to HTML or PDF. (VH, H)
Customizable by non-programmers. (H, H)
M
DCopyright Kelly-McCreary & Associates, LLC
119
The Hard Part
• Assigning organizational "value" to– Ease of use– Standards– Vendor viability– Open source project viability– Portability
• Example:– WebDAV support
M
DCopyright Kelly-McCreary & Associates, LLC
120
Evolution of Ideas in OpenSource
• How quickly can new ideas be recombined into new database products?• OpenSource software has proved to be the most efficient way to quickly recombine
new ideas into new products
Product A
Product B
Product B
OpenSource
Proprietary SoftwareNew Database Ideas
Schema-free
MapReduceAuto-sharding
New Products
Cloud Computing
The Nature of Technology: What It Is and How It Evolves by W. Brian Arthur
M
DKelly-McCreary & Associates, LLC
121
Start Finish
Using the Wrong Architecture
Credit: Isaac Homelund – MN Office of the Revisor
M
DKelly-McCreary & Associates, LLC
122
Using the Right Architecture
Start Finish
Find ways to remove barriers to empoweringthe non programmers on your team.
M
D 123
If You Give a Kid a Hammer…
• People solve problems using familiar tools
• People develop specific Cognitive Styles* based on training and experience
• What are we teaching the next generation of developers?
* Source: Shoshana Zuboff: In the Age of the Smart Machine (1988)
…the whole world becomes a nail
M
DCopyright Kelly-McCreary & Associates, LLC
124
Cognitive Bias in NoSQL• Anchoring bias - the tendency to produce an estimate near a cue amount "Our managers were
expecting an RDBMS solution so that’s what we gave them"• Availability heuristic - the tendency to estimate that what is easily remembered is more likely
than that which is not. - "I hear that NoSQL does not support ACID"• Bandwagon effect - the tendency to do or believe what others do or believe - "Everyone else
here and in our local area uses RDBMS"• Confirmation bias - the tendency to seek out only that information that supports one's
preconceptions – "We only read posts from the Oracle groups"• Framing effect - the tendency to react to how information is framed, beyond its factual content
"We know of some NoSQL projects that failed"• Gambler's fallacy (aka sunk cost bias) the failure to reset one's expectations based on one's
current situation – "We already paid for our Oracle license…"• Hindsight bias, the tendency to assess one's previous decisions as more efficacious than they
were – "Our last five systems worked on RDBMS solutions".• Halo effect, the tendency to attribute unverified capabilities in a person based on an observed
capability. – "Oracle sells billions of dollars of licenses each year, how could so many people be wrong".
• Representativeness heuristic, the tendency to judge something as belonging to a class based on a few salient characteristics - "Our accounting systems work on RDBMS so why not our product search?"
http://en.wikipedia.org/wiki/Cognitive_bias_mitigation
M
DCopyright Kelly-McCreary & Associates, LLC
125
The Future of the NoSQL Movement
• Will data sets continue to grow at exponential rates?• Will new system options become more diverse?• Will new markets have different demands?• Will some ideas be "absorbed" into existing RDBMS vendors products?• Will the NoSQL community continue to be the place where new
database ideas and products are incubated?• Will the job of doing high-quality architectural tradeoffs analysis become
easier?
Growth Diversity
M
DCopyright Kelly-McCreary & Associates, LLC
126
Making Sense of NoSQL
http://manning.com/mccreary
M
DCopyright Kelly-McCreary & Associates, LLC
127
2013 NoSQL Now!
• Dataversity's NoSQL Conference• August 20-22• San Jose California
M
DCopyright Kelly-McCreary & Associates, LLC
128
Questions
Thank You!
Dan McCreary
President, Kelly-McCreary & Associates
twitter: dmccreary