tour of big data

Tour of Big Data Raymond Yu Socal Code Camp 2013

Upload: raymond-yu

Post on 08-Jul-2015


DESCRIPTION

Presentation at Southern California Code Camp, July 2013, in San Diego. This talk presents basic concepts in the world of big data and data science, with a focus on relational databases, noSQL, MapReduce, machine learning, and data visualization, along with demos of MapReduce in action and Pig on Hadoop. The purpose of this presentation is to familiarize you with the terminology and concepts of data science and to whet your appetite for further exploration into the world of big data. This presentation is adapted from a Coursera online course with a similar title and scope.

TRANSCRIPT

Page 1: Tour of Big Data

Tour of Big Data

Raymond Yu

Socal Code Camp 2013

Page 2: Tour of Big Data

About myself

• Sr. Database Architect @ BridgePoint

Education

• Blog www.yutechnet.com

• LinkedIn www.linkedin.com/in/raymondyu1

• @yutechnet

Page 3: Tour of Big Data

About this talk…

7/28/2013 yutechnet.com

• Inspired by “Introduction to Data Science” on Coursera (Bill Howe, UW)

•Guided tour of topics in data science

– MapReduce, Pig

– noSQL

– Machine Learning

– Information Visualization

•Goal

Page 4: Tour of Big Data

Big Data

•Volume

– Size of data

•Velocity

– The latency of data processing relative to the growing demand for interactivity

•Variety

– The diversity of sources, formats, quality, and structures

Big Data is any data that is expensive to manage and hard to extract value from. - Michael Franklin

Page 5: Tour of Big Data

Where does big data come from?

• “Data exhaust” from customers

• New sensor technologies

• Individually contributed data at massive scale

•Cheap to keep data

Page 6: Tour of Big Data

Data Science

•Data Preparation (at scale)

•Analytics

•Communication

The ability to take data, understand it, process it, extract value from it, visualize it, and communicate it

- Hal Varian, Google's Chief Economist

Page 7: Tour of Big Data

Context…

src. Introduction to Data Science course

Page 8: Tour of Big Data

Relational Databases

• SQL as Declarative Language

• Indexes

– Extract small result from big dataset

– Built easily and automatically used when appropriate

•Data consistency

• “Old-style” scalability
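The two points about SQL and indexes can be tried out with a small sketch. This is not from the talk: it uses Python's built-in sqlite3 module as a stand-in relational database, and the table name `orders`, its columns, and the index name `idx_orders_customer` are hypothetical, chosen only for illustration. The query states what rows are wanted; the planner decides on its own whether to use the index.

```python
import sqlite3

# Hypothetical table and column names, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(100_000)],
)

# Building an index is a single declarative statement.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# The query says *what* we want, not *how* to find it.
query = "SELECT id, amount FROM orders WHERE customer_id = ?"

# Ask the planner how it intends to run the query; the last column of each
# plan row is a human-readable description of the chosen access path.
plan = " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, (42,)))

# Extract a small result (100 rows) from the larger table (100,000 rows).
rows = conn.execute(query, (42,)).fetchall()

print(plan)
print(len(rows))
```

The printed plan should mention `idx_orders_customer`, even though the query text never names the index.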

Page 9: Tour of Big Data

MapReduce

•Google paper 2004

•Hadoop 2008

• High-level programming model for large-scale parallel data processing

•Divide-and-conquer

•Mapper + Reducer

Page 10: Tour of Big Data

“Hello World” of MapReduce

Count word frequency in millions of documents
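A minimal, single-process sketch of the word-count idea, assuming the usual three phases (map, shuffle, reduce). It is not the code shown in the talk's demo, and the function names `mapper`, `shuffle`, `reducer`, and `word_count` are made up for illustration. In a real framework the mapper and reducer calls would run in parallel on many machines and the framework would do the shuffle.

```python
from collections import defaultdict


def mapper(doc_id, text):
    """Map phase: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word, counts):
    """Reduce phase: combine all values for one key into a single result."""
    return word, sum(counts)


def word_count(documents):
    """Run the three phases sequentially over a dict of doc_id -> text."""
    pairs = (
        pair
        for doc_id, text in documents.items()
        for pair in mapper(doc_id, text)
    )
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())


docs = {"d1": "big data is big", "d2": "data science"}
counts = word_count(docs)
print(counts)
```

Because each mapper call sees only one document and each reducer call sees only one word, both phases can be spread over many workers without coordination, which is the divide-and-conquer point of the model.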

Page 11: Tour of Big Data

MapReduce Programming Model

src. Course slide

Page 12: Tour of Big Data

Show me the MapReduce…

•www.jsmapreduce.com

Page 13: Tour of Big Data

MapReduce in Hadoop

Page 14: Tour of Big Data

Pig

• An engine to execute programs on top of Hadoop

• Language layer Pig Latin

• An Apache open source project (http://pig.apache.org)

• Yahoo! 2009

Page 15: Tour of Big Data

Why use Pig?

Page 16: Tour of Big Data

In MapReduce…

Page 17: Tour of Big Data

In Pig Latin

Page 18: Tour of Big Data

Pig System Overview

Page 19: Tour of Big Data

Context…

src. Introduction to Data Science course

Page 20: Tour of Big Data

noSQL definitions

• A term to designate databases which differ from classic relational databases

– Transactional model

– Data model

•Not much to do with SQL

• “not only SQL”

Page 21: Tour of Big Data

Concepts

• CAP Theorem

– Consistency

– Availability

– Partition Tolerance

• Eventual consistency

Src: blog.beany.co.kr
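A toy sketch of eventual consistency, not taken from the talk. It assumes two replicas that accept writes independently and later exchange state, and it resolves conflicts with a last-write-wins rule on a logical timestamp. That rule is an assumption made here for brevity; real systems may use other schemes. The class name `Replica` and its methods are hypothetical.

```python
class Replica:
    """A toy key-value replica; each value carries a logical timestamp."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        """Accept a write only if it is newer than what this replica holds."""
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        """Return the locally known value, which may be stale."""
        entry = self.store.get(key)
        return entry[1] if entry else None

    def sync_from(self, other):
        """Anti-entropy step: pull the other replica's entries, newest wins."""
        for key, (timestamp, value) in other.store.items():
            self.write(key, value, timestamp)


a, b = Replica("a"), Replica("b")
a.write("cart", ["book"], timestamp=1)
b.write("cart", ["book", "pen"], timestamp=2)

# Before synchronization, replica a still serves the older value.
stale = a.read("cart")

# After both replicas exchange state, they converge on the same value.
a.sync_from(b)
b.sync_from(a)
print(stale, a.read("cart"), b.read("cart"))
```

Both replicas stay available for reads and writes while they are out of touch; the price is that a read in that window can return old data.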

Page 22: Tour of Big Data

noSQL One-page Overview

Page 23: Tour of Big Data

Let’s walk through a few

•Column definitions

•RDBMS

•Memcache

•Dynamo

•CouchDB

• BigTable (HBase)

Page 24: Tour of Big Data

noSQL Common Features

• The ability to replicate and partition data over many servers (scale)

• Horizontally scale simple operation throughput over many servers

• A simple API - no query language (no SQL)

• Weaker concurrency model than ACID transactions (no transaction)

• The ability to dynamically add new attributes to data records (no schema)
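Several of these features can be shown together in a small in-memory sketch. It is an illustration under stated assumptions, not any particular product's design: `ShardedStore` is a hypothetical class, each Python dict stands in for one server, and keys are assigned to servers by hashing. The API is just `put` and `get`, and records are plain dicts, so a new attribute needs no schema change.

```python
import hashlib


class ShardedStore:
    """A toy key-value store that partitions keys across several 'servers'."""

    def __init__(self, num_shards):
        # Each dict stands in for one server holding a slice of the data.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash of the key decides which shard owns it.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, record):
        """Simple API: store a record under a key, no query language."""
        self._shard_for(key)[key] = record

    def get(self, key):
        """Look up a record by key; returns None if the key is unknown."""
        return self._shard_for(key).get(key)


store = ShardedStore(num_shards=4)
store.put("user:1", {"name": "Ann"})
# A second record adds an attribute the first one lacks; no schema to alter.
store.put("user:2", {"name": "Bob", "twitter": "@bob"})

print(store.get("user:2"))
print([len(shard) for shard in store.shards])
```

Note that plain modulo hashing moves most keys when the number of shards changes, which is one reason real systems tend to use more elaborate partitioning schemes.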

Page 25: Tour of Big Data

Machine Learning

• Systems that automatically learn programs from data

• Prediction

– Given examples of inputs and outputs

– Learn the relationship between them

– Apply the relationship to a larger set

• Different from statistical models

– A large data set with a simple model trumps a small data set with a sophisticated model
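The prediction steps can be sketched with about the simplest model available, a straight line fitted by ordinary least squares. This is an illustrative choice made here, not a method named in the talk, and the helper names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Learn slope and intercept from example inputs xs and outputs ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least squares for one input variable.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned relationship to an input not seen during fitting."""
    slope, intercept = model
    return slope * x + intercept


# Examples of inputs and outputs (here generated from y = 2x + 1).
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

model = fit_line(xs, ys)
print(model)
print(predict(model, 10))
```

The same three steps (collect examples, fit, apply) carry over to richer models; only `fit_line` and `predict` would change.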

Page 26: Tour of Big Data

Bertin’s Visual Attributes

Page 27: Tour of Big Data

Data Encoding Exercise

Page 28: Tour of Big Data
Page 29: Tour of Big Data

Information Visualization

src. http://www.tableausoftware.com/public

Page 30: Tour of Big Data

Closing example

Src. http://commons.wikimedia.org/wiki/File:ElectoralCollege2012.svg

Nate Silver, fivethirtyeight.com

Obama’s Data-Driven Campaign

• Massive voter db

• Hadoop as ETL

• Vertica db for slice-and-dice

Page 31: Tour of Big Data

Questions?