c-mr: continuously executing mapreduce workflows on multi-core processors

Post on 16-Jan-2015


DESCRIPTION

Paper presentation in DB reading group

TRANSCRIPT

C-MR: Continuously Executing MapReduce Workflows on Multi-Core Processors

Speaker: LIN Qian (http://www.comp.nus.edu.sg/~linqian)


Problem

• Stream applications are often time-critical
• Enabling stream support for MapReduce jobs
  – Simple for the Map operations
  – Hard for the Reduce operations

• Continuously executing MapReduce workflows requires a great deal of coordination


C-MR Workflow

• Windows: temporal subdivisions of a stream, described by
  – size (the amount of the stream each window spans)
  – slide (the interval between consecutive windows)
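The size/slide description above can be sketched in code. The following Python function (an illustrative sketch, not part of C-MR) computes which windows a timestamped record falls into, where window i covers the interval [i * slide, i * slide + size); when slide < size, windows overlap and a record belongs to several of them:

```python
def window_ids(timestamp, size, slide):
    """Return the indices of all windows containing `timestamp`.

    Window i covers the half-open interval [i * slide, i * slide + size).
    With slide < size the windows overlap, so a record may belong to
    several windows; with slide == size they tile the stream exactly.
    """
    # Last window that starts at or before the timestamp.
    last = timestamp // slide
    # First window whose interval has not yet ended at `timestamp`.
    first = max(0, (timestamp - size) // slide + 1)
    return list(range(first, last + 1))
```

For example, with size 10 and slide 5, a record at time 12 belongs to window 1 (covering [5, 15)) and window 2 (covering [10, 20)), but not to window 0 (covering [0, 10)).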

C-MR Programming Interface

• Map/Reduce operations

C-MR Programming Interface (cont.1)

• Input/Output streams

C-MR Programming Interface (cont.2)

• Create workflows of continuous MapReduce jobs
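To make the three interface pieces listed above concrete, here is a hypothetical Python analogue; the real C-MR API and its names differ, and this sketch only illustrates the shape of the interface (Map/Reduce operations, input/output streams, and composing continuous jobs into a workflow):

```python
# Hypothetical sketch, not the actual C-MR API.

def word_map(record):
    """Map operation: emit (word, 1) pairs for each word in a record."""
    for word in record.split():
        yield (word, 1)

def count_reduce(key, values):
    """Reduce operation: sum all counts for a word within one window."""
    yield (key, sum(values))

class ContinuousJob:
    """A MapReduce job executed continuously over stream windows."""
    def __init__(self, map_fn, reduce_fn, window_size, window_slide):
        self.map_fn, self.reduce_fn = map_fn, reduce_fn
        self.window = (window_size, window_slide)

# A workflow chains continuous jobs over named input/output streams.
workflow = [
    ("words_in", ContinuousJob(word_map, count_reduce, 60, 10), "counts_out"),
]
```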


C-MR vs. MapReduce

• MapReduce computing nodes receive a set of Map or Reduce tasks and each node must wait for all other nodes to complete their tasks before being allocated additional tasks.

• C-MR uses pull-based data acquisition allowing computing nodes to execute any Map or Reduce workload as they are able. Thus, straggling nodes will not hinder the progress of the other nodes if there is data available to process elsewhere in the workflow.
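The pull-based contrast can be sketched as follows (a minimal single-machine illustration, not the actual C-MR scheduler): workers pull whatever Map or Reduce task is ready from any point in the workflow, so a straggling task never idles the other workers while ready data exists elsewhere:

```python
import queue
import threading

# Shared pool of ready (operator, data) tasks from anywhere in the
# workflow; any worker may execute any Map or Reduce task.
ready_tasks = queue.Queue()

def worker():
    while True:
        op, data = ready_tasks.get()
        if op is None:          # sentinel: shut down this worker
            break
        op(data)                # an operator may enqueue downstream tasks

results = []
ready_tasks.put((lambda d: results.append(("map", d)), [1, 2]))
ready_tasks.put((lambda d: results.append(("reduce", d)), [3]))
ready_tasks.put((None, None))

t = threading.Thread(target=worker)
t.start()
t.join()
```

The key point is that the queue holds a mix of Map and Reduce work: a worker never waits for a global barrier between task waves, unlike the batch MapReduce model described above.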


C-MR Architecture

Stream and Window Management

• The merged output streams are not guaranteed to retain their original orderings.

• Solution: Replicating window-bounding punctuations


Stream and Window Management (cont.1)

A node consumes the punctuation from the sorted input stream-buffer

Stream and Window Management (cont.2)

Replicate that punctuation to the other nodes

Stream and Window Management (cont.3)

After all replicas are received at the intermediate buffer, collect data whose timestamps fall into the applicable interval and materialize them as a window
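The three steps above can be sketched as a small buffer class (a hypothetical illustration of the mechanism, not C-MR's implementation): a window is materialized only after a punctuation replica has arrived from every upstream node, which guarantees no in-window data is still in flight even though the merged streams are unordered.

```python
class WindowBuffer:
    """Collects unordered records and materializes windows on punctuation."""

    def __init__(self, num_upstream):
        self.num_upstream = num_upstream
        self.records = []      # (timestamp, value) pairs, arrival order
        self.replicas = {}     # window_end -> punctuation replicas seen

    def add(self, timestamp, value):
        self.records.append((timestamp, value))

    def punctuate(self, window_end):
        """Count one replica; materialize once all replicas have arrived."""
        self.replicas[window_end] = self.replicas.get(window_end, 0) + 1
        if self.replicas[window_end] < self.num_upstream:
            return None        # some upstream node may still send data
        # All replicas received: data before window_end is complete.
        window = [v for t, v in self.records if t < window_end]
        self.records = [(t, v) for t, v in self.records if t >= window_end]
        return window
```

For simplicity the sketch materializes everything before `window_end` (a tumbling window); the overlapping-window case would select records inside the applicable interval instead.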

Operator Scheduling

• Scheduling framework
  – Execute multiple policies simultaneously
  – Transition between policies based on resource availability

• Scheduling policies

Incremental Computation

Output1 = d1 + d2 + d3 + ... + dn

Output2 = d2 + d3 + d4 + ... + d(n+1)

Output3 = d3 + d4 + d5 + ... + d(n+2)

Output4 = d4 + d5 + d6 + ... + d(n+3)

Successive windows overlap, so the computation over their common data subset can be shared rather than recomputed.
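The sharing idea above can be shown with a sliding sum (a generic illustration of incremental computation, not C-MR-specific code): instead of re-summing each window from scratch, reuse the previous window's result by subtracting the item that left and adding the item that entered.

```python
def sliding_sums(data, n):
    """Sums over windows of n items sliding by 1, computed incrementally.

    Recomputing each window costs O(n); reusing the previous sum
    (subtract the departing item, add the arriving one) costs O(1)
    per window.
    """
    if len(data) < n:
        return []
    out = [sum(data[:n])]                  # first window: computed directly
    for i in range(1, len(data) - n + 1):
        out.append(out[-1] - data[i - 1] + data[i + n - 1])
    return out
```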


Evaluation

• Continuously executing a MapReduce job
  – Compare with Phoenix++


Evaluation (cont.1)

• Operator scheduling
  – Oldest data first (ODF)
  – Best memory trade-off (MEM)
  – Hybrid utilization of both policies


Evaluation (cont.2)

• Workflow optimization


Evaluation (cont.3)

• Workflow optimization
  – Latency and throughput

Thank you


Two Properties of Streams

• Unbounded
• Accessed sequentially

Hard to handle with a traditional DBMS


Query Operators

• Unbounded stateful operators
  – maintain state with no upper bound in size
  – can therefore run out of memory

• Blocking operators
  – read an entire input before emitting a single output
  – might therefore never produce a result

Two ways to cope:
• Never use them, or
• Use them only after refactoring


Punctuations

• Mark the end of substreams
  – allowing us to view an infinite stream as a mixture of finite streams
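Punctuations are what let a blocking operator produce output over an infinite stream. A minimal sketch (illustrative, not C-MR code): a streaming sum can never finish over an unbounded input, but it can safely emit each time a punctuation marks the end of a finite substream.

```python
END = object()   # punctuation marker: end of the current substream

def punctuated_sums(stream):
    """Yield one sum per punctuation-delimited substream."""
    total = 0
    for item in stream:
        if item is END:
            yield total      # substream finished: safe to emit
            total = 0
        else:
            total += item
```

Feeding it `[1, 2, END, 3, 4, 5, END]` treats the infinite-stream problem as two finite substreams, each summed and emitted when its punctuation arrives.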
