
Introduction to Data Science

Frank Kienle: A high-level introduction to databases

Big Data Landscape

06.09.17 Frank Kienle p. 2

Overview of data sources
•  Machine learning data: http://www.knuggets.com/datasets/index.html
•  UCI Machine Learning Repository: archive.ics.uci.edu
•  DataShop, the world's largest repository of learning interaction data: https://pslcdatashop.web.cmu.edu

Getting data is not the problem: there is a very large variety of data sources


•  Formally, a "database" refers to a set of related data and the way it is organized.
•  A database manages data efficiently and allows users to perform multiple tasks with ease. Efficient access to the data is usually provided by a "database management system" (DBMS).
•  A database management system stores, organizes, and manages a large amount of information within a single software application.
•  Use of such a system increases the efficiency of business operations and reduces overall costs.
•  Different database systems exist, designed with respect to:
   •  the data to be stored in the database
   •  the relationships between the different data elements: dependencies within the data which can be modeled by mathematical relations
   •  the logical structure imposed on the data on the basis of these relationships; the goal is to arrange the data into a logical structure which can then be mapped onto the storage objects

Database


Databases overview


Scale up: use more and more main memory (on a single machine).
Scale out: use more and more computers.

Definition (complexity order m): for N data items, an algorithm scales with N^m, e.g. polynomial complexity.
Parallelized over k nodes, the algorithm scales with N^m / k.
Goal: find algorithms with complexity N log(N), which relates e.g. to trees (each item is touched once).

Scalability in big data

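The scaling argument can be made concrete by counting operations. A minimal sketch (function names and the example sizes are illustrative, not from the slides):

```python
import math

def ops_polynomial(n, m, k=1):
    """Operation count for an O(N^m) algorithm, optionally split over k nodes."""
    return n ** m / k

def ops_nlogn(n):
    """Operation count for an O(N log N) algorithm (e.g. tree-based, one touch per item)."""
    return n * math.log2(n)

n = 1_000_000
quadratic = ops_polynomial(n, m=2)             # 1e12 operations on one node
on_100_nodes = ops_polynomial(n, m=2, k=100)   # scaling out only divides by k
tree_based = ops_nlogn(n)                      # roughly 2e7 operations

# Even 100 nodes cannot close the gap to the better algorithm:
print(on_100_nodes / tree_based)
```

This is why the goal is a better complexity class, not just more hardware: parallelization divides the constant, while N log(N) changes the growth rate.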

CAP theorem


C: consistency (do all clients see the same data?) Any data written to the database must be valid according to all defined rules.

A: availability (can I interact with the system in the presence of failures?)

P: partition tolerance: if two sections of your system cannot talk to each other, can they make forward progress on their own?
-  If not, you sacrifice availability.
-  If so, you might have to sacrifice consistency.

Typical placement in the CAP triangle:
AP (give up consistency): Dynamo, Riak, Voldemort, Cassandra, CouchDB
CP (give up availability): Bigtable, HBase, Hypertable, Megastore, Spanner, Accumulo
CA (no partition tolerance): RDBMS

Relational Data Bases

Relational databases, the key idea:
§  Storage and retrieval of large quantities of related data.
§  When creating a database you should think about which tables are needed and what relationships exist between the data in your tables.
§  Relational algebra
§  Physical/logical data independence

Think about the design in advance.

Relational Data Bases


A database is created for the storage and retrieval of data: we want to be able to INSERT data into the database, and we want to be able to SELECT data from it. A query language, the Structured Query Language (SQL), was invented for these tasks.

Structured query language (SQL)

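A minimal sketch of INSERT and SELECT, using Python's built-in sqlite3 module (the table and column names are made up for illustration):

```python
import sqlite3

# In-memory database; a real system would use a file or a server-based RDBMS.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE measurements (timestamp INTEGER, value REAL)")
cur.executemany(
    "INSERT INTO measurements (timestamp, value) VALUES (?, ?)",
    [(1, 30.0), (2, 25.0), (5, 12.0)],
)
conn.commit()

cur.execute(
    "SELECT timestamp, value FROM measurements WHERE value > 20 ORDER BY timestamp"
)
print(cur.fetchall())  # [(1, 30.0), (2, 25.0)]
```

The same two statements (INSERT, SELECT) carry over unchanged to any SQL database; only the connection setup differs.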

When a database supports JOINs, that is good for analytics. When a database does not provide joins, the work is all left to the users (on the client side).

Fundamentals of data exploration (joins)


Full outer join (on timestamp)


Room table:

  Timestamp[s]   ValueRoom[Watt]
  1              30
  2              25
  5              12

Home table:

  Timestamp[s]   ValueHome[Watt]
  1              100
  2              78
  3              99
  4              70

Result of the full outer join on Timestamp:

  Timestamp[s]   ValueRoom[Watt]   ValueHome[Watt]
  1              30                100
  2              25                78
  3              NaN               99
  4              NaN               70
  5              12                NaN
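The full outer join above can be reproduced with pandas (a sketch; `merge` fills the missing side with NaN):

```python
import pandas as pd

room = pd.DataFrame({"Timestamp": [1, 2, 5], "ValueRoom": [30, 25, 12]})
home = pd.DataFrame({"Timestamp": [1, 2, 3, 4], "ValueHome": [100, 78, 99, 70]})

# how="outer" keeps timestamps from both tables; NaN where one side is missing.
outer = room.merge(home, on="Timestamp", how="outer").sort_values("Timestamp")
print(outer)
```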

Left join (on timestamp)


Room table:

  Timestamp[s]   ValueRoom[Watt]
  1              30
  2              25
  5              12

Home table:

  Timestamp[s]   ValueHome[Watt]
  1              100
  2              78
  3              99
  4              70

Result of the left join on Timestamp:

  Timestamp[s]   ValueRoom[Watt]   ValueHome[Watt]
  1              30                100
  2              25                78
  5              12                NaN
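The left join above can likewise be reproduced with pandas (a sketch):

```python
import pandas as pd

room = pd.DataFrame({"Timestamp": [1, 2, 5], "ValueRoom": [30, 25, 12]})
home = pd.DataFrame({"Timestamp": [1, 2, 3, 4], "ValueHome": [100, 78, 99, 70]})

# how="left" keeps only the timestamps of the left (room) table.
left = room.merge(home, on="Timestamp", how="left")
print(left)
```

Timestamps 3 and 4 are dropped because they only appear on the right side; timestamp 5 survives with NaN for the missing home value.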

Storing data efficiently is all about the application:
•  schema-less vs. schema
•  write-centric vs. read-centric
•  transactional vs. analytics
•  batch vs. stream

Key-value object
•  A set of key-value pairs

Extensible record (XML or JSON)
•  Families of attributes have a schema
•  New attributes may be added
•  Many predictive-analytics tasks require this kind of record
•  Many REST APIs deliver JSON (or YAML, XML) structures, e.g. Twitter feeds

Key-value stores (a document store might be seen as a subset)
•  No schema, no exposed nesting
•  Often raw data (scalable to petabytes)
•  Simple analytics tasks on top

Different data structure


Example key-value pairs:

  key      value
  45777    Frank Kienle, Germany
  Ux_78    Please learn
  321-87   Random data
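A key-value store exposes only lookup by key: no query language, no exposed nesting. A toy sketch (the class and method names are made up for illustration):

```python
class KeyValueStore:
    """Toy key-value store: opaque values, lookup by key only, no joins or queries."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put("45777", "Frank Kienle, Germany")
store.put("Ux_78", "Please learn")
store.put("321-87", "Random data")

print(store.get("45777"))  # Frank Kienle, Germany
# Anything beyond key lookup (e.g. "all values containing 'data'")
# is left to the client, as the slides note.
```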

JSON Example


Example JSON Twitter feed

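A REST API such as Twitter's delivers extensible JSON records that parse into nested structures. The fields below are a simplified, assumed shape, not the real Twitter schema:

```python
import json

raw = """
{
  "id": 123456789,
  "user": {"screen_name": "frank_kienle", "followers_count": 42},
  "text": "High level introduction to data bases",
  "entities": {"hashtags": ["datascience"]}
}
"""

tweet = json.loads(raw)  # parse the extensible record into nested dicts/lists
print(tweet["user"]["screen_name"])      # frank_kienle
print(tweet["entities"]["hashtags"][0])  # datascience
```

New attributes can be added to such a record at any time without a schema change, which is exactly the "extensible record" property above.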

The ability to replicate and partition data over many servers
•  Sharding: horizontal partitioning of the data set

No query language: only a simple API is defined.

The ability to scale operations over many servers
•  Throughput increases
•  Due to the missing query-language layer, each operation has to be designed against the API

Operations often have restrictions on data locality.
New features can be added dynamically to data records (no fixed schema).
The consistency model is often weak (no modeling of transactions).

(Typical) NoSQL database features

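Sharding can be sketched as routing each key to one of several servers via a hash function. A simplified scheme (real systems typically use consistent hashing so that adding or removing a node reshuffles fewer keys):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Map a key deterministically to a shard id in [0, NUM_SHARDS)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Each shard stands in for one server's local store.
shards = {i: {} for i in range(NUM_SHARDS)}

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

for k in ["user:1", "user:2", "user:3", "user:4"]:
    put(k, {"name": k})

print({i: len(s) for i, s in shards.items()})  # keys spread over the shards
```

Note the API is exactly put/get, matching the "no query language, simple API" feature: any cross-shard operation (a join, a scan) would need client-side coordination.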

In-memory database
•  relies primarily on main memory for data storage
•  main purpose is faster analytics on the data
•  relational or unstructured data structures
•  memory-optimized data structures

Main memory database system (MMDB)


Advantages of column-oriented storage:
•  Reading efficiency: more efficient when an aggregate needs to be computed over many rows but only over a notably smaller subset of all columns, e.g.
     select col_1, col_2 from table where col_2 > 5 and col_2 < 45;
•  Writing efficiency: more efficient when new values of a column are supplied for all rows at once.

Advantages of row-oriented storage:
•  Reading efficiency: more efficient when many columns of a single row are required at the same time, and when the row size is relatively small.
•  Writing efficiency: more efficient when writing a new row if all of the row data is supplied at the same time, as the entire row can be written with a single disk seek.

Row vs. Column data stores

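The trade-off can be illustrated in plain Python with the same table in two physical layouts, row-wise (list of tuples) and column-wise (dict of lists). The table contents are illustrative:

```python
# Row-oriented: one tuple per record.
rows = [
    (1, 30, 100),
    (2, 25, 78),
    (5, 12, 55),
]

# Column-oriented: one list per attribute (insertion order matches the tuples).
columns = {
    "timestamp": [1, 2, 5],
    "value_room": [30, 25, 12],
    "value_home": [100, 78, 55],
}

# Column aggregate (analytics-style): the columnar layout reads one contiguous list.
total_room_col = sum(columns["value_room"])
# The row layout must touch every record and pick out field index 1.
total_room_row = sum(r[1] for r in rows)
print(total_room_col, total_room_row)

# Fetching one full record (transaction-style): trivial in the row layout...
record = rows[0]
# ...but needs one access per column in the columnar layout.
record_from_cols = tuple(col[0] for col in columns.values())
print(record, record_from_cols)
```

On disk the effect is much larger than in this sketch: the columnar aggregate reads only the pages of one column, while the row layout drags every column of every row through memory.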

Processing types


OLTP: On-Line Transaction Processing, e.g. business transactions (insert, update, delete)

OLAP: On-Line Analytical Processing, e.g. complex analytics (aggregation over historical data)

For data analytics, a column-oriented in-memory database is a must-have.


Spanner idea: a planet-scale database system. "…we believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions…"

Loose consistency is painful for predictive analytics.

Loose consistency is a no-go for prescriptive analytics (e.g. dynamic pricing).

Systems should always be designed for usability.

Many trends in databases are going back toward data consistency.
