Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

Internet Infrastructures for Big Data. Philippe Cudré-Mauroux, eXascale Infolab, University of Fribourg, Switzerland. VeriSign EMEA, June 26, 2014.

Upload: exascale-infolab

Post on 10-May-2015


DESCRIPTION

Internet Infrastructures for Big Data. Talk given at Verisign's Distinguished Speaker Series, 2014. Prof. Philippe Cudré-Mauroux, eXascale Infolab, http://exascale.info/

TRANSCRIPT

Page 1: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

Internet Infrastructures for Big Data

Philippe Cudré-Mauroux

eXascale Infolab, University of Fribourg, Switzerland

VeriSign EMEA, June 26, 2014

Page 2: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

eXascale Infolab

• New lab @ U. of Fribourg, Switzerland
• Financed by Swiss Federal State / companies / private foundations
• Big (non-relational) data management (Volume, Velocity, Variety) (… mostly)

Page 3: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

On the Menu Today

• Big Data!
  – Big Data Buzz
  – 3 Big Data projects w/ XI & Verisign

Page 4: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

Exascale Data Deluge

• Science
  – Biology
  – Astronomy
  – Remote Sensing
• Web companies
  – Ebay
  – Yahoo
• Financial services, retail companies, governments, etc.

© Wired 2009

➡ New data formats
➡ New machines
➡ Peta & exa-scale datasets
➡ Obsolescence of traditional information infrastructures

Page 5: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

Big Data “Central Theorem”

Data + Technology → Actionable Insight → $$

Reporting, Monitoring, Root Cause Analysis, (User) Modelization, Prediction


Page 6: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)


Big Data Buzz

Between now and 2015, the firm expects big data to create some 4.4 million IT jobs globally; of those, 1.9 million will be in the U.S. Applying an economic multiplier to that estimate, Gartner expects each new big-data-related IT job to create work for three more people outside the tech industry, for a total of almost 6 million more U.S. jobs.

Growth in the Asia Pacific Big Data market is expected to accelerate rapidly in two to three years time, from a mere US$258.5 million last year to in excess of $1.76 billion in 2016, with highest growth in the storage segment.

Page 7: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)


Big Data Everywhere!

• The Age of Big Data (NYTimes, Feb. 11, 2012)
  http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html

“Welcome to the Age of Big Data. The new megarich of Silicon Valley, first at Google and now Facebook, are masters at harnessing the data of the Web — online searches, posts and messages — with Internet advertising. At the World Economic Forum last month in Davos, Switzerland, Big Data was a marquee topic. A report by the forum, “Big Data, Big Impact,” declared data a new class of economic asset, like currency or gold.”

Page 9: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)


Big Data Infrastructures

Page 10: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

The 3-Vs of Big Data

• Volume
  – amount of data
• Velocity
  – speed of data in and out
• Variety
  – range of data types and sources

• [Gartner 2012] "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization"

Coming up: 3 examples from XI


Page 11: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

Volume: Fixing the Hadoop Distributed File System

• Hadoop (YARN): “cluster Operating System”
• Often synonymous with Big Data
• Used everywhere (… even in CH)

Page 12: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

HDFS Blocks Placement Strategy

[Figure: Rack 1, Rack 2]

● 1st replica on local node or random node
● 2nd replica on a different node in a different rack
● 3rd replica on a different node in same rack as 2nd replica

➡ Not hardware-aware
➡ Block level rather than file level
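The three placement rules above can be sketched in a few lines of Python. This is a minimal illustration, not HDFS code: the function name `place_replicas`, the `topology` dictionary, and the node names are all hypothetical, and the sketch assumes at least two racks with at least two nodes in the remote rack.

```python
import random

def place_replicas(topology, writer=None, rng=random):
    """Pick three nodes for one block, following the three rules on the slide.

    topology: dict mapping rack name -> list of node names (hypothetical layout).
    writer:   node issuing the write, or None if the client is outside the cluster.
    """
    node_rack = {n: r for r, nodes in topology.items() for n in nodes}
    all_nodes = sorted(node_rack)

    # 1st replica: the local node if the writer is a DataNode, else a random node.
    first = writer if writer in node_rack else rng.choice(all_nodes)

    # 2nd replica: a node in a different rack than the 1st.
    remote = [n for n in all_nodes if node_rack[n] != node_rack[first]]
    second = rng.choice(remote)

    # 3rd replica: a different node in the same rack as the 2nd.
    same_rack = [n for n in topology[node_rack[second]] if n != second]
    third = rng.choice(same_rack)

    return [first, second, third]

# Hypothetical two-rack cluster.
topology = {"rack1": ["A", "B", "C"], "rack2": ["D", "E", "F"]}
replicas = place_replicas(topology, writer="A", rng=random.Random(0))
print(replicas)
```

Nothing in this sketch looks at disk speed, CPU generation, or current utilization, and each block is placed independently of the file it belongs to. Those are the two limitations the slide points out.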

Page 13: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

Solution: Hadaps File Placement

• Assigns weights to DataNodes
  – I/O-bound jobs finish earlier on new media
  – CPU-bound jobs finish earlier on new CPUs
• Uses lower utilization servers first
• Moves more blocks to newer generations
• Operates on file level

Up to 300% performance improvement by activating all nodes

[Figure: blocks placed on DataNodes A–F; labels: Blocks, Weight]
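One way to picture weight-based placement is to split a file's blocks across DataNodes in proportion to their weights. The sketch below is an assumption-laden illustration, not the Hadaps implementation: the function `blocks_per_node`, the node names, the weight values, and the largest-remainder rounding rule are all hypothetical choices made for the example.

```python
def blocks_per_node(weights, n_blocks):
    """Split n_blocks across nodes in proportion to their weights.

    Uses largest-remainder rounding so the shares always sum to n_blocks.
    weights: dict mapping node name -> positive weight (hypothetical values).
    """
    total = sum(weights.values())
    exact = {node: n_blocks * w / total for node, w in weights.items()}
    shares = {node: int(q) for node, q in exact.items()}
    leftover = n_blocks - sum(shares.values())

    # Give the remaining blocks to the largest fractional parts (ties by name).
    order = sorted(weights, key=lambda node: (-(exact[node] - shares[node]), node))
    for node in order[:leftover]:
        shares[node] += 1
    return shares

# Hypothetical cluster: two older nodes (weight 1), two newer nodes (weight 2).
weights = {"old-1": 1, "old-2": 1, "new-1": 2, "new-2": 2}
shares = blocks_per_node(weights, 12)
print(shares)  # {'old-1': 2, 'old-2': 2, 'new-1': 4, 'new-2': 4}
```

Because the split is computed over all blocks of a file at once, the decision is made at file level, and nodes with higher weights (newer hardware in this example) receive proportionally more blocks.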

Page 14: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

Velocity: Real-Time Data Management

• Smart(er) Cities!

  – Electricity provisioning
  – Water Networks

Page 15: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

Example: Scalable Anomaly Detection

• Detecting leaks / pipe bursts / contamination in real-time for water distribution networks


Page 16: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

Data at each Vertex!

• Spatial + temporal statistical processing (mini-Lisas)

• Stream processing (Storm) + Array processing (SciDB)
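The slide does not say which statistic the "mini-Lisas" compute. Assuming LISA stands for Local Indicators of Spatial Association, a standard member of that family is local Moran's I, which compares each vertex's deviation from the mean with the average deviation of its neighbours. The sketch below covers only the spatial part, on a hypothetical five-sensor pipe segment; the function name, sensor names, and readings are invented for illustration, and it assumes every vertex has at least one neighbour and the readings are not all identical.

```python
def local_morans_i(values, neighbors):
    """Local Moran's I per vertex: I_i = (z_i / m2) * sum_j w_ij * z_j,
    with row-standardised weights w_ij = 1 / deg(i) over graph neighbours.

    values:    dict vertex -> reading at one time step.
    neighbors: dict vertex -> list of adjacent vertices.
    """
    n = len(values)
    mean = sum(values.values()) / n
    z = {v: x - mean for v, x in values.items()}
    m2 = sum(d * d for d in z.values()) / n  # second moment of the deviations

    scores = {}
    for v, nbrs in neighbors.items():
        lag = sum(z[u] for u in nbrs) / len(nbrs)  # mean deviation of neighbours
        scores[v] = z[v] * lag / m2
    return scores

# Hypothetical pressure readings along a pipe: sensor "c" drops sharply.
values = {"a": 5.0, "b": 5.1, "c": 1.0, "d": 4.9, "e": 5.0}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
scores = local_morans_i(values, neighbors)
suspect = min(scores, key=scores.get)
print(suspect)  # c
```

A strongly negative score marks a vertex whose reading disagrees with its neighbourhood, which is the kind of signal a leak or burst detector could flag. A temporal counterpart would apply the same idea to each vertex's own recent history.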


Page 17: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

Results (Anomalies Detected)

Page 18: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

Variety: Sharing Data Locally & Globally

• 70+% of the world’s population has no or very limited access to the Web

[Ahmed Shams 2013]


Page 19: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

Our Solution: ERS, the Entity Registry System

• Three-tier solution to deploy data-powered apps
  – Flexible
    • Seamlessly reconcile entities in local / ad-hoc / global modes
  – Collaborative
    • Transactional consistency, data versioning
  – Scalable
    • Bridges, scale-out servers, tunable consistency
  – Open-source
    • https://github.com/ers-devs

Page 20: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

Ongoing Deployments

• Entity-powered apps for the Sugar Learning Platform

• Ambient Assisted Living of elderly persons in tropical environments


Page 21: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

Special Thanks to…

• Vincenzo Russo, Benoit Perroud, Matt Thomas, Romain Cholat and the whole Verisign Fribourg office

• Burt Kaliski and his team

• Allison Mankin, Scott Hollenbeck, Debra Anderson & the Internet Infrastructures Grant team

… for their continued support

Page 22: Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)

http://exascale.info

Big thanks to the whole XI crew!

Questions?

VeriSign EMEA, June 26, 2014