History of NoSQL and Azure DocumentDB feature set

NOSQL & DOCUMENTDB

SONER ALTIN

• Soner ALTIN (@kahve)

• BizDev @T2

• soner.in

• Organise hackathons (t2hackathon.com)

• Strong interest in Led Zeppelin

• [email protected] / [email protected]



HISTORY OF DBMS AND RDBMS

Database management systems first appeared in the 1960s, as computers grew in power and speed. By the mid-1960s, several commercial systems on the market could produce “navigational” databases. These navigational databases maintained records that could only be processed sequentially, which required a lot of computer resources and time.

The relational model was first proposed by Edgar Codd in 1970. Because navigational databases could not be “searched”, Codd suggested a different model for constructing a database: the relational model, which organizes data into tables and lets users “search” them for data.

60’s 70’s 80’s 90’s 00’s

A relational database is a digital database whose organization is based on the relational model of data.

RDBMS: 40 YEARS!

1. A simple way of representing data/ business models

2. An easy-to-use language to retrieve and query that data (SQL)

3. Bulletproof data integrity and security built right into the database without having to rely on application rules and logic.

ACCESS AND STORAGE

▸ It is generally easier to access data that is stored in a relational database, because the data follows a mathematical model for categorization. Also, once a relational database is open, every element in it can be reached through queries, which is not always the case with other databases, where data elements may need to be accessed individually.

▸ Relational databases are harder to construct, but they are better structured and more secure. They follow the ACID (atomicity, consistency, isolation and durability) model when storing data. The relational database system will also impose certain regulations and conditions that may not allow you to manipulate data in a way that destabilizes the integrity of the system.

PERSISTENCE

REPORTING

TRANSACTIONS

SQL

INTEGRATION

3V - VOLUME VARIETY VELOCITY

▸ Five years ago, Amazon found that every 100ms of latency cost them 1% of sales. Google discovered that a half-second increase in search latency dropped traffic by 20%.

▸ The volume of data that must be handled today is skyrocketing. Facebook houses 1.5 PB (petabytes) of uploaded photos. Google processes 20 PB of data each day. Every 60 seconds, over 204 million emails are exchanged, 3,600 photos are shared on Instagram and 2 million search queries are processed by Google. RDBMSs struggle in the face of such huge data volumes, and RDBMS solutions capable of handling such volumes are extremely expensive.

▸ Big Data also demands collection of an extremely wide variety of data types, but RDBMSs have inflexible schemas. The problem is that Big Data primarily comprises semi-structured data, such as social media sentiment analysis and text mining data, while RDBMSs are more suitable for structured data, such as weblog, sensor and financial data.

▸ In addition, Big Data is accumulated at a very high velocity. Since RDBMSs are designed for steady data retention, rather than for rapid growth, using RDBMSs for Big Data is prohibitively expensive.

60’s 70’s 80’s 90’s 00’s 10’s

TODAY

▸ Developers are working with applications that create massive volumes of new, rapidly changing data types — structured, semi-structured, unstructured and polymorphic data.

▸ Long gone is the twelve-to-eighteen month waterfall development cycle. Now small teams work in agile sprints, iterating quickly and pushing code every week or two, some even multiple times every day.

▸ Applications that once served a finite audience are now delivered as services that must be always-on, accessible from many different devices and scaled globally to millions of users.

▸ Organizations are now turning to scale-out architectures using open source software, commodity servers and cloud computing instead of large monolithic servers and storage infrastructure.

Structured                    | Unstructured             | Semi-structured
Pre-defined                   | God knows                | Pre-defined
Relational                    | Non-relational           | So so
Constant                      | Flexible                 | Easy to change
RDBMS                         | HDFS                     | *
CRM, Travel, Phone numbers    | Web, Video, Music, Photo | Tagging, Comments
5%                            | 15%                      | 80%
No need to scale horizontally | Fully scalable           | Fully scalable

/*
 * Copyright 2007 Yusuke Yamamoto
 */

import java.util.Date;

/**
 * A data interface representing one single status of a user.
 *
 * @author Yusuke Yamamoto - yusuke at mac.com
 */
public interface Status extends Comparable<Status>, TwitterResponse, EntitySupport, java.io.Serializable {
    Date getCreatedAt();
    long getId();
    String getText();
    String getSource();
    boolean isTruncated();
    long getInReplyToStatusId();
    long getInReplyToUserId();
    String getInReplyToScreenName();
    GeoLocation getGeoLocation();
    Place getPlace();
    boolean isFavorited();
    boolean isRetweeted();
    int getFavoriteCount();
    User getUser();
    boolean isRetweet();
    Status getRetweetedStatus();
    long[] getContributors();
    int getRetweetCount();
    boolean isRetweetedByMe();
    long getCurrentUserRetweetId();
    boolean isPossiblySensitive();
    String getLang();
    Scopes getScopes();
    String[] getWithheldInCountries();
    long getQuotedStatusId();
    Status getQuotedStatus();
}

/*
 * Copyright 2007 Yusuke Yamamoto
 */

import java.util.Date;

/**
 * A data interface representing Basic user information element
 *
 * @author Yusuke Yamamoto - yusuke at mac.com
 */
public interface User extends Comparable<User>, TwitterResponse, java.io.Serializable {
    long getId();
    String getName();
    String getScreenName();
    String getLocation();
    String getDescription();
    boolean isContributorsEnabled();
    String getProfileImageURL();
    String getBiggerProfileImageURL();
    String getMiniProfileImageURL();
    String getOriginalProfileImageURL();
    String getProfileImageURLHttps();
    String getBiggerProfileImageURLHttps();
    String getMiniProfileImageURLHttps();
    String getOriginalProfileImageURLHttps();
    boolean isDefaultProfileImage();
    String getURL();
    boolean isProtected();
    int getFollowersCount();
    Status getStatus();
    String getProfileBackgroundColor();
    String getProfileTextColor();
    String getProfileLinkColor();
    String getProfileSidebarFillColor();
    String getProfileSidebarBorderColor();
    boolean isProfileUseBackgroundImage();
    boolean isDefaultProfile();
    boolean isShowAllInlineMedia();
    int getFriendsCount();
    Date getCreatedAt();
    int getFavouritesCount();
    int getUtcOffset();
    String getTimeZone();
    String getProfileBackgroundImageURL();
    String getProfileBackgroundImageUrlHttps();
    String getProfileBannerURL();
    String getProfileBannerRetinaURL();
    String getProfileBannerIPadURL();
    String getProfileBannerIPadRetinaURL();
    String getProfileBannerMobileURL();
    String getProfileBannerMobileRetinaURL();
    boolean isProfileBackgroundTiled();
    String getLang();
    int getStatusesCount();
    boolean isGeoEnabled();
    boolean isVerified();
    boolean isTranslator();
    int getListedCount();
    boolean isFollowRequestSent();
    URLEntity[] getDescriptionURLEntities();
    URLEntity getURLEntity();
    String[] getWithheldInCountries();
}

/*
 * Copyright 2007 Yusuke Yamamoto
 */

/**
 * A data interface representing one single URL entity.
 *
 * @author Mocel - mocel at guma.jp
 */
public interface URLEntity extends TweetEntity, java.io.Serializable {
    String getText();
    String getURL();
    String getExpandedURL();
    String getDisplayURL();
    int getStart();
    int getEnd();
}

/**
 * @author Yusuke Yamamoto - yusuke at mac.com
 */
public interface Place extends TwitterResponse, Comparable<Place>, java.io.Serializable {
    String getName();
    String getStreetAddress();
    String getCountryCode();
    String getId();
    String getCountry();
    String getPlaceType();
    String getURL();
    String getFullName();
    String getBoundingBoxType();
    GeoLocation[][] getBoundingBoxCoordinates();
    String getGeometryType();
    GeoLocation[][] getGeometryCoordinates();
    Place[] getContainedWithIn();
}

https://dev.twitter.com/rest/reference/get/statuses/retweets_of_me


SCALABILITY

NON RELATIONAL

A NoSQL (non-relational) database provides a mechanism for storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases.

REQUIREMENTS

▸ over 425 million unique users

▸ store 20 TB of JSON document data

▸ available globally to serve all markets

▸ store for 40+ apps / device combinations

▸ writes under 15 ms and single-digit ms reads

CONTROL OVER AVAILABILITY

HORIZONTAL SCALABILITY

SIMPLICITY OF DESIGN

BIG DATA

REAL TIME APPLICATIONS

EASIER DEVELOPMENT

SCALABILITY VS FUNCTIONALITY

[Figure: scalability & performance plotted against depth of functionality, showing memcached, key/value store, NoSQL and RDBMS]

ECONOMICS

The goal of a business, of course, is to make money, and that’s accomplished by providing more for less. NoSQL databases drastically reduce the need for insanely big machines. Typically, they use clusters of cheap commodity servers to manage exploding data and transaction volumes. The cost-per-gigabyte or transaction/second for NoSQL can be considerably lower than the cost for RDBMSs, thereby dramatically reducing the cost of data processing and storage. Another area of key savings is in manpower. By lowering administrative costs one can free up developers to code new features that will generate more revenue.


http://bit.ly/fowler-schemaless

SCHEMALESS - DATA UPDATE

The documents stored in the database can have varying sets of fields, with different types for each field. One could have the following objects in a single collection:

{ name : "Joe", x : 3.3, y : [1,2,3] }

{ name : "Kate", x : "abc" }

{ q : 456 }

Of course, when using the database for real problems, the data does have a fairly consistent structure. Something like the following would be more common:

{ name : "Joe", age : 30, interests : "football" }

{ name : "Kate", age : 25 }

One of the great benefits of dynamic objects is that schema migrations become very easy. With a traditional RDBMS, releases of code might contain data migration scripts. Further, each release should have a reverse migration script in case a rollback is necessary. ALTER TABLE operations can be very slow and result in scheduled downtime.

With a schemaless database, 90% of the time adjustments to the database become transparent and automatic. For example, if we wish to add GPA to the student objects, we add the attribute, resave, and all is well – if we look up an existing student and reference GPA, we just get back null. Further, if we roll back our code, the new GPA fields in the existing objects are unlikely to cause problems if our code was well written.
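The GPA example can be sketched in a few lines of Java. This is only an illustration under assumptions: documents are modeled as plain maps, and the class and method names (SchemalessSketch, student, gpaOf) are hypothetical, not part of any database SDK.

```java
import java.util.HashMap;
import java.util.Map;

class SchemalessSketch {
    // A document is modeled as a free-form map of field name to value (an assumption).
    static Map<String, Object> student(String name, int age) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("name", name);
        doc.put("age", age);
        return doc;
    }

    // New code reads a field that older documents never stored.
    static Double gpaOf(Map<String, Object> doc) {
        return (Double) doc.get("gpa"); // null when the field is absent
    }

    public static void main(String[] args) {
        Map<String, Object> joe = student("Joe", 30);   // saved before GPA existed
        Map<String, Object> kate = student("Kate", 25);
        kate.put("gpa", 3.7);                           // add the attribute and resave

        System.out.println(gpaOf(joe));  // prints null
        System.out.println(gpaOf(kate)); // prints 3.7
    }
}
```

No migration script runs here: old documents simply lack the field, so the application code has to treat null as a valid answer.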

NOSQL

data model | performance | scalability | flexibility | complexity
column     | high        | high        | moderate    | low
document   | high        | variable    | high        | low
key-value  | high        | high        | high        | none
graph      | variable    | variable    | high        | high

NOSQL TYPES

data model | examples
column     | Cassandra, HBase
document   | DocumentDB, MongoDB, ElasticSearch
key-value  | Redis, MemcacheDB
graph      | Neo4J, OrientDB

fully featured RDBMS

transactional processing

rich query

managed as a service

elastic scale

internet accessible http/rest

schema-free data model

arbitrary data formats

schema free query

Relational and hierarchical query of application defined JSON data. Support for SQL queries with transforms, projections and inline evaluation of user defined JavaScript functions (UDFs). Automatic and consistent indexing of all properties.

JavaScript as a modern T-SQL

Transactional execution of application defined stored procedures and triggers directly against database collections. Native JavaScript support eliminating the impedance mismatch between application and database schema.

tunable consistency

Well-defined consistency levels to achieve the optimal tradeoff between consistency and performance. Four distinct consistency levels for queries and reads: Strong, Bounded-Staleness, Session and Eventual. Granular control over consistency, availability and latency.

fully managed

Simple to provision and access databases without managing VM or cluster infrastructure. Operated with 99.95% availability and automatically backed up to protect against regional failures.


PRICING

DocumentDB collections are available in the Standard service tier. Collections are billable entities, each billed hourly, based on the performance level assigned to the collection. Collections are set to one of three performance levels – S1, S2 or S3. You can also dynamically change the performance level of a collection – for example, create an S1 collection, scale up to S3 then back to S2.

TUNABLE CONSISTENCY

type              | latency         | performance
strong            | high            | low
bounded staleness | moderate        | moderate
session           | low for session | fast for session
eventual          | low             | fast

RAPID DEVELOPMENT

No setup cost

Auto scale

Highly available

No configuration management cost

Integration with all Azure services

SDK support for JavaScript, Java, Node.js, Python, and .NET.

PREPARATION

CONFIGURATION

QUERIES

{
  "id": "AndersenFamily",
  "lastName": "Andersen",
  "parents": [
    { "firstName": "Thomas" },
    { "firstName": "Mary Kay" }
  ],
  "children": [
    {
      "firstName": "Henriette Thaulow",
      "gender": "female",
      "grade": 5,
      "pets": [ { "givenName": "Fluffy" } ]
    }
  ],
  "address": { "state": "WA", "county": "King", "city": "seattle" },
  "creationDate": 1431620472,
  "isRegistered": true
}

{
  "id": "WakefieldFamily",
  "parents": [
    { "familyName": "Wakefield", "givenName": "Robin" },
    { "familyName": "Miller", "givenName": "Ben" }
  ],
  "children": [
    {
      "familyName": "Merriam",
      "givenName": "Jesse",
      "gender": "female",
      "grade": 1,
      "pets": [ { "givenName": "Goofy" }, { "givenName": "Shadow" } ]
    },
    {
      "familyName": "Miller",
      "givenName": "Lisa",
      "gender": "female",
      "grade": 8
    }
  ],
  "address": { "state": "NY", "county": "Manhattan", "city": "NY" },
  "creationDate": 1431620462,
  "isRegistered": false
}

QUERIES

* Operator

SELECT * FROM Families f WHERE f.id = "AndersenFamily"

Result:

{
  "id": "AndersenFamily",
  "lastName": "Andersen",
  "parents": [ { "firstName": "Thomas" }, { "firstName": "Mary Kay" } ],
  "children": [ { "firstName": "Henriette Thaulow", "gender": "female", "grade": 5, "pets": [ { "givenName": "Fluffy" } ] } ],
  "address": { "state": "WA", "county": "King", "city": "seattle" },
  "creationDate": 1431620472,
  "isRegistered": true
}

Where

SELECT {"Name":f.id, "City":f.address.city} AS Family FROM Families f WHERE f.address.city = f.address.state

Result:

[ { "Family": { "Name": "WakefieldFamily", "City": "NY" } } ]

Join

SELECT c.givenName FROM Families f JOIN c IN f.children WHERE f.id = 'WakefieldFamily' ORDER BY f.address.city ASC

Result:

[ { "givenName": "Jesse" }, { "givenName": "Lisa" } ]
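In the Join query, JOIN c IN f.children iterates over the array inside each document rather than joining two tables. A minimal Java sketch of the same idea over in-memory objects is given here; the Family and Child classes and the childGivenNames helper are hypothetical stand-ins for illustration, not part of the DocumentDB SDK.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class JoinSketch {
    // Hypothetical stand-ins for documents in the Families collection.
    static final class Child {
        final String givenName;
        Child(String givenName) { this.givenName = givenName; }
    }

    static final class Family {
        final String id;
        final List<Child> children;
        Family(String id, Child... children) {
            this.id = id;
            this.children = Arrays.asList(children);
        }
    }

    static final List<Family> FAMILIES = Arrays.asList(
        // The Andersen child stores firstName, so its givenName is absent (null here).
        new Family("AndersenFamily", new Child(null)),
        new Family("WakefieldFamily", new Child("Jesse"), new Child("Lisa")));

    // Roughly: SELECT c.givenName FROM Families f JOIN c IN f.children WHERE f.id = familyId
    static List<String> childGivenNames(String familyId) {
        return FAMILIES.stream()
            .filter(f -> f.id.equals(familyId))   // WHERE f.id = ...
            .flatMap(f -> f.children.stream())    // JOIN c IN f.children
            .map(c -> c.givenName)                // SELECT c.givenName
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(childGivenNames("WakefieldFamily")); // prints [Jesse, Lisa]
    }
}
```

The flatMap step is the whole "join": each family contributes one row per child, and no second collection is involved.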

QUERIES

Nested properties

SELECT { "state": f.address.state, "city": f.address.city }, { "name": f.id } FROM Families f WHERE f.id = "AndersenFamily"

Result:

[ { "$1": { "state": "WA", "city": "seattle" }, "$2": { "name": "AndersenFamily" } } ]

Scalar expression

SELECT f.address.city = f.address.state AS AreFromSameCityState FROM Families f

Result:

[ { "AreFromSameCityState": false }, { "AreFromSameCityState": true } ]

ORDER BY

SELECT f.id, f.address.city FROM Families f ORDER BY f.address.city

Result:

[ { "id": "WakefieldFamily", "city": "NY" }, { "id": "AndersenFamily", "city": "seattle" } ]

QUERIES

Geospatial ST_WITHIN

SELECT v.Type, v.Status, v.Location FROM volcanoes v WHERE ST_WITHIN(v.Location, { "type": "Polygon", "coordinates": [[ [-124.63, 48.36], [-123.87, 46.14], [-122.23, 45.54], [-119.17, 45.95], [-116.92, 45.96], [-116.99, 49.00], [-123.05, 49.02], [-123.15, 48.31], [-124.63, 48.36]]]} )

Result:

{ "Type": "Stratovolcano", "Status": "Tephrochronology", "Location": { "type": "Point", "coordinates": [ -121.49, 46.206 ] } }

Geospatial ST_DISTANCE

SELECT v.Elevation, v.Type, v.Region, v["Volcano Name"] FROM volcanoes v WHERE ST_DISTANCE(v.Location, { "type": "Point", "coordinates": [-122.19, 47.36] }) < 100 * 1000 AND v.Type = "Stratovolcano" AND v["Last Known Eruption"] = "Last known eruption from 1800-1899, inclusive"

{ "Elevation": 4392, "Type": "Stratovolcano", "Region": "US-Washington", "Volcano Name": "Rainier" }
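ST_DISTANCE returns a distance in meters, which is why the filter compares against 100 * 1000 (100 km). To get a feel for the numbers, here is a rough Java sketch using the haversine formula on a spherical Earth. This is an approximation chosen for illustration, not the algorithm the service uses; DistanceSketch, haversineMeters and the radius constant are assumptions.

```java
class DistanceSketch {
    // Mean Earth radius in meters (spherical approximation, an assumption).
    static final double EARTH_RADIUS_M = 6_371_000.0;

    // Great-circle distance in meters between two GeoJSON-style
    // [longitude, latitude] points, using the haversine formula.
    static double haversineMeters(double lon1, double lat1, double lon2, double lat2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }

    // Mirrors the shape of the filter: ST_DISTANCE(a, b) < limitMeters
    static boolean withinMeters(double lon1, double lat1, double lon2, double lat2, double limitMeters) {
        return haversineMeters(lon1, lat1, lon2, lat2) < limitMeters;
    }

    public static void main(String[] args) {
        // Query point from the ST_DISTANCE example and the point from the ST_WITHIN sample.
        double d = haversineMeters(-122.19, 47.36, -121.49, 46.206);
        System.out.println("distance in meters: " + d);
        System.out.println("within 100 km: " + withinMeters(-122.19, 47.36, -121.49, 46.206, 100 * 1000));
    }
}
```

Note the GeoJSON argument order: longitude first, then latitude.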

LET’S TRY SOME QUERIES

JAVA SPRING APP

TWITTER STREAMING APP

MICROSERVICE

TWITTER STREAMING APP

<dependency>
    <groupId>com.microsoft.azure</groupId>
    <artifactId>azure-documentdb</artifactId>
    <version>1.5.1</version>
</dependency>


http://bit.ly/devnot-code

WE’RE HIRING C# & JAVA DEVELOPERS

[email protected]

