the problem with data -- about data gravity in computing clouds

The Problem with Data --about Data gravity in computing clouds 2nd International workshop on Big Data, London, 9 December 2014 Coral Walker, Joerg Fritsch

Upload: joerg-fritsch

Post on 12-Jul-2015

Page 1: The Problem with Data -- about data gravity in computing clouds

The Problem with Data

-- about Data Gravity in computing clouds

2nd International Workshop on Big Data, London, 9 December 2014

Coral Walker, Joerg Fritsch

Page 2

Utility Computing requires levels of efficacy and abstraction that are not matched by modern computing clouds.

Page 3

Agenda

1. Let’s have a look at Data!

2. Functional Programming Languages, Data Flow Languages or Stream processing to the rescue?

3. Fixing Data?

Page 4

Let’s have a look at Data!

Page 5

The three dimensions of Data

• Variety
  • Range of data types and sources
  • ALL data has structure, but the structure may not yet have been discovered at the time of ingestion

• Velocity
  • For example: social networking feeds, multimedia streams (audio, video)
  • 2013: the Internet consisted of 640 TB of data in motion per minute

• Volume
  • Big Data because of impressive volume
  • Map Reduce framework, parallelized analytics, Hadoop
  • Distributed queries

Page 6

The three dimensions of Data

Data can be challenging because of any one of these dimensions or a combination of several.

Page 7

Data Gravity

• Data has gravitational pull. It pulls computation to it. For example: Map Reduce on Hadoop.

• However, Computing Clouds centralize and rationalize computation!

• The focus on computation goes “all the way down” to the CPU: no data-centric improvements in recent years (except AES-NI, maybe).

• The implications of data for the silicon (aka: CPUs, ASICs, …) have been little investigated.

• But this holds us up!

Page 8

Why is better integration with Data essential?

The lack of harmonization of data and computation is holding back computing clouds from evolving further into utility clouds.

Page 9

Functional Programming Languages, Data Flow Languages or Stream processing to the rescue?

Page 10

Stonebraker’s eight

Eight criteria to excel in processing data in motion

1. Keep data moving

2. SQL on streams

3. Handle Stream imperfections

4. Predictable outcome

5. High availability

6. Stored and Streamed data

7. Distribution and scalability

8. Instantaneous response

Page 11

In-depth assessment: Functional Programming

Stonebraker’s requirement | FPL (for example: Haskell) | Required add-ons (examples)
Keep data moving | - | Messaging, in-memory computation
SQL on streams | FPLs and SQL are declarative | Parser, tokenizer and interpreter
Handle stream imperfections | Currying potentially decouples space and time | Decoupling across all layers, Tuple Space (?)
Predictable outcome | Evaluation eventually ends | -
High availability | - | Application containers (?)
Stored and streamed data | Functional Reactive Programming, Map Reduce | Lambda Architecture (?)
Distribution & scalability | For example: currying, code maintainability | Means of coordination, Tuple Space, LINDA
Instantaneous response | - | -

Page 12

Dataflow Programming

• Started in the 1970s

• Academic research focuses on Dataflow Programming as an abstraction for modelling parallel programs as a Dataflow Graph (DFG).

• Dataflow programming languages are very close to FPLs!

• Commercial research focuses on stream processing models.
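
The Dataflow Graph idea can be illustrated with a minimal sketch. It is not taken from the talk: `run_dataflow`, the graph layout and the node names are hypothetical. A node fires as soon as all of its input tokens are available, and the nodes that become ready in the same wave do not depend on each other, so they could run in parallel.

```python
from operator import add, mul, sub


def run_dataflow(graph, tokens):
    """Evaluate a dataflow graph (hypothetical helper, for illustration only).

    graph:  {node_name: (function, [names of the inputs it waits for])}
    tokens: {name: value} initial tokens injected at the sources
    """
    values = dict(tokens)
    pending = dict(graph)
    while pending:
        # A node is ready once every one of its inputs carries a value.
        ready = [name for name, (_, deps) in pending.items()
                 if all(dep in values for dep in deps)]
        if not ready:
            raise ValueError("graph has unsatisfied inputs or a cycle")
        # Nodes in one wave are independent of each other.
        for name in ready:
            func, deps = pending.pop(name)
            values[name] = func(*(values[dep] for dep in deps))
    return values


# (a + b) * (a - b) expressed as a graph instead of a sequence of statements.
graph = {
    "sum": (add, ["a", "b"]),
    "diff": (sub, ["a", "b"]),
    "prod": (mul, ["sum", "diff"]),
}
result = run_dataflow(graph, {"a": 5, "b": 3})
print(result["prod"])  # 16
```

The order of execution follows from data availability alone, which is the property that brings dataflow languages close to FPLs.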

Page 13

Stream Processing

• Operates in real time, for example on online advertising, sensor data, multimedia streams.

• Time complexity O(N log N).

• Reduces the required hardware base and energy consumption: computation happens in transit, not where data is terminated.

• Supports recursion and machine learning (ML); Map Reduce needs workarounds to support recursion and ML.

• Not invasive, no change in programming model; Map Reduce required a change to batch mode.
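
As a rough sketch of "computation happens in transit", the example below uses a Python generator to compute a sliding-window mean while readings pass through, without storing the whole stream first. The function names and sample values are assumptions made for illustration, not part of the talk.

```python
from collections import deque
from typing import Iterable, Iterator


def sliding_mean(stream: Iterable[float], window: int) -> Iterator[float]:
    """Yield the mean of the last `window` items as each item arrives."""
    buf = deque(maxlen=window)
    total = 0.0
    for x in stream:
        if len(buf) == window:
            total -= buf[0]        # the oldest item is about to be evicted
        buf.append(x)
        total += x
        yield total / len(buf)


def sensor_readings() -> Iterator[float]:
    """Stand-in for an unbounded source such as a sensor feed."""
    for value in [10.0, 20.0, 30.0, 40.0]:
        yield value


means = list(sliding_mean(sensor_readings(), window=2))
print(means)  # [10.0, 15.0, 25.0, 35.0]
```

Only the current window is held in memory, so the same code works for a stream that never ends.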

Page 14

• Observation: All programming languages and paradigms have missing pieces and cannot match Stonebraker’s eight.

• Assumption: The eight requirements should be matched by an architecture rather than by a programming language.

Page 15

Fixing Data?

Page 16

Fixing Data by eight Principles? (1)

• Associative lookup should be preferred.

• For example, associative lookup is used in:
  • Data Flow Programming
  • Content Addressable Memory (CAM) of network switches and routers
  • Tuple Spaces

• Our architecture is based on a Tuple Space, thus we broaden the applicability of associative lookup.
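
To make associative lookup concrete, here is a minimal, single-process sketch of a Linda-style tuple space. It is an illustration only, not the architecture presented in the talk: tuples are found by matching their content against a template rather than by address or key, and `None` is used as the wildcard (an assumption of this sketch). The operations are non-blocking variants of Linda's `out`, `rd` and `in`.

```python
class TupleSpace:
    """Toy tuple space with associative (content-based) lookup."""

    def __init__(self):
        self._tuples = []

    def out(self, *fields):
        """Write a tuple into the space."""
        self._tuples.append(tuple(fields))

    @staticmethod
    def _matches(template, tup):
        # None in the template matches any value in that position.
        return len(template) == len(tup) and all(
            t is None or t == f for t, f in zip(template, tup))

    def rd(self, *template):
        """Return the first matching tuple without removing it, else None."""
        for tup in self._tuples:
            if self._matches(template, tup):
                return tup
        return None

    def in_(self, *template):
        """Return and remove the first matching tuple, else None."""
        tup = self.rd(*template)
        if tup is not None:
            self._tuples.remove(tup)
        return tup


space = TupleSpace()
space.out("event", "click", 42)
space.out("event", "scroll", 7)

print(space.rd("event", None, 7))        # ('event', 'scroll', 7)
print(space.in_("event", "click", None)) # ('event', 'click', 42)
print(space.rd("event", "click", None))  # None, the tuple was consumed
```

Producers and consumers never reference each other, only the shape of the data, which is what decouples them in space and time.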

Page 17

Fixing Data by eight Principles? (2)

• The fabric must be a dynamic scalable distributed system.

• No need to explain :D

• Key requirements:
  • Framework/architecture should be asynchronous (later we will use the UDP protocol)
  • Shared nothing
  • Elastic
  • “Green”! (not addressed in our research)

Page 18

Fixing Data by eight Principles? (3)

• Next-gen platforms must handle stored data and streamed data.

• Streamed data, for example events, needs to find an encapsulated app to get processed. In this case the encapsulated app, which is data as well, has the higher gravitational pull and attracts the event data.

• Stored data, for example larger (file) objects, has a high gravitational pull and brings services temporarily close to it for the time needed to process it.
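
The two cases can be sketched as a placement decision. Using object size in bytes as a proxy for gravitational pull is a simplifying assumption of this example, not a claim from the talk, and `place` is a hypothetical helper: whichever object is lighter travels to the heavier one.

```python
def place(data_bytes: int, app_bytes: int) -> str:
    """Decide what should move, using size as a stand-in for 'mass'."""
    if data_bytes >= app_bytes:
        return "move app to data"   # stored data attracts the service
    return "move data to app"       # the encapsulated app attracts the event


APP_SIZE = 5 * 1024 * 1024                      # assumed 5 MiB encapsulated app

event = place(data_bytes=200, app_bytes=APP_SIZE)
big_file = place(data_bytes=2 * 1024**3, app_bytes=APP_SIZE)

print(event)     # move data to app
print(big_file)  # move app to data
```

A real scheduler would weigh more than size (locality, transfer cost, how long the service needs to stay), but the direction of movement follows the same rule.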

Page 19

Fixing Data by eight Principles? (4)

• A global name space to virtualize data objects and apps must be provided.

• Dedicated name spaces isolate resources (for example: jailing apps, Linux containers, network name spaces).

• Too many dedicated name spaces may need links between them that are too expensive, for example JSON/serialization, (transmission) protocols, etc. There may not even be asynchronous interaction between name spaces!

Page 20

Fixing Data by eight Principles? (5.1)

• Transmission protocols are the main cost center.

• Data transfer and message passing should be optimistic and based on the UDP protocol preserving the asynchronous character of all components and communications.

• UDP is often disputed, but look for yourself (next slide!)

• Shared Nothing, Asynchronous, … remember what we said two slides ago?
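
A minimal sketch of optimistic, fire-and-forget message passing over UDP, using Python's standard `socket` module on the loopback interface. The payload, the port selection and the helper names are assumptions for illustration; the sender does not wait for a connection or an acknowledgement, and the receiver must tolerate a datagram that never arrives.

```python
import socket


def send_event(payload: bytes, addr) -> None:
    """Hand one datagram to the OS and return; no handshake, no ACK."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        sender.sendto(payload, addr)


def demo():
    """Send one event to a local receiver; return it, or None if it was lost."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        receiver.settimeout(1.0)          # optimistic: do not wait forever
        send_event(b"event:42", receiver.getsockname())
        try:
            data, _sender_addr = receiver.recvfrom(1024)
        except socket.timeout:
            data = None                   # loss is an expected outcome
        return data
    finally:
        receiver.close()


received = demo()
print(received)
```

Any reliability that is needed (retries, ordering, deduplication) has to be added above this layer, which keeps the transport itself asynchronous.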

Page 21

Fixing Data by eight Principles? (5.2)

Page 22

Fixing Data by eight Principles? (6)

• Declarative programming, such as in SQL and FPLs, is preferred.

• FPLs bring a lot to the table, too much to ignore (see previous section “In Depth assessment: Functional Programming”).
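
A small sketch of the contrast, written in Python rather than Haskell so that it can be run directly. The data and names are made up for illustration, and `functools.partial` performs partial application, which is only Python's closest stand-in for currying. The imperative version spells out how to iterate; the declarative version states what is wanted.

```python
from functools import partial


def scale(factor, x):
    return factor * x


# Fix the first argument to obtain a one-argument function.
double = partial(scale, 2)

readings = [1, 2, 3, 4, 5, 6]

# Imperative: explicit loop, explicit mutable accumulator.
imperative_total = 0
for r in readings:
    if r % 2 == 0:
        imperative_total += scale(2, r)

# Declarative: a pipeline of filter, map and sum with no mutable state.
declarative_total = sum(map(double, filter(lambda r: r % 2 == 0, readings)))

print(imperative_total, declarative_total)  # 24 24
```

Because the declarative pipeline has no shared mutable state, a runtime is free to split or reorder the work.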

Page 23

Fixing Data by eight Principles? (7.1)

• The emulation of traditional tier-based computing needs to be removed from computing clouds and replaced with a unified fabric.

• Modern computing clouds have no affinity to traditional tier-based computing, but they emulate it.

• Concept of tiers has been around since 1998
• Costly serialization (of data) required at every system boundary: latency!
• Often depicted with three simple tiers: web server, application server and data(base)
• Many more devices & protocols involved: redundant load balancers, spanning tree, etc.
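
The serialization cost at tier boundaries can be sketched as follows. The record, the choice of JSON as the wire format and the boundary count are assumptions for illustration: every boundary forces one encode and one decode of the same data, even though the content does not change.

```python
import json


def through_tiers(record, boundaries):
    """Pass a record across `boundaries` tier boundaries via JSON.

    Returns the record as the last tier sees it and the total bytes
    that were serialized along the way.
    """
    bytes_serialized = 0
    for _ in range(boundaries):
        wire = json.dumps(record).encode("utf-8")   # sender side: encode
        bytes_serialized += len(wire)
        record = json.loads(wire.decode("utf-8"))   # receiver side: decode
    return record, bytes_serialized


order = {"user": "alice", "items": [1, 2, 3]}

# web server -> application server -> database: two boundaries one way.
arrived, cost = through_tiers(order, boundaries=2)

print(arrived == order)  # True: same content, yet encoded and decoded twice
print(cost)              # total bytes pushed through serializers
```

In a unified fabric the record would be handed over once, so this per-boundary work and the latency it adds would disappear.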

Page 24

Fixing Data by eight Principles? (7.2)

• To date: not many alternatives

• Space-based architectures
  • GigaSpaces
  • TIBCO ActiveSpaces

• Notion of a one-stop shop
  • Networks: L2 Ethernet fabrics
  • Networks: integrated packet processing

• Space-based architectures and L2 Ethernet fabrics use associative lookup (see Principle 1)

Page 25

Fixing Data by eight Principles? (8)

• Next-generation cloud computing platforms need to deliver abstract services, not limited to web services.

• Limiting the platform to web services would mean that the future platform is merely Software as a Service (SaaS).

• Consumers need so much more!

Page 26

Pulling it all together: architecture.

Page 27

Thank You

Page 28

Spare slides

Page 29

Functional Programming (for example: Haskell)

Asynchronous operations | Immutable data. Shared nothing. Message passing (e.g. actors) available to re-synchronize processes. STM is more manageable than locks.
Parallel, multi- & many-core support | FPLs are inherently parallel. Functions, closures, currying. Declarative: the compiler has freedom to re-arrange “everything”.
Elasticity & large-scale operations | Elasticity is left to the developer or to the “app engine”. Code is easily testable & maintainable.
Secure, multi-tenancy, confidentiality | No. “Safe Haskell” may be a good start.