avid.cs.umass.edu › courses › 645 › s2020 › lectures › ...
TRANSCRIPT
Parallel Data Processing
Prof. Yanlei Diao
Slides Courtesy of R. Ramakrishnan and J. Gehrke
"In 50 years, computers will be intelligent, with a storage capacity of about 10^9."
--- Alan Turing, 1950
Moore’s Law predicts that CPU and memory capacities double about every 18 months
If exponential growth continues…
Limitations of Moore’s Law
• Concerns about whether Moore’s law can continue…
• Disk I/O bandwidth did not increase much (7,200 – 15,000 rpm)
• In some domains, data grows faster than Moore’s law
• New data processing needs grow fast (“democratization of data”)
Synergistic Challenges in Data-Intensive Science and Exascale Computing
[Figure 3.1: Projected growth rates (2010–2015) for data acquisition and processor/memory capacities; curves labeled CAGR = 72%, 60%, 36%, and 20%. (Figure courtesy of Kathy Yelick.)]
• Zero suppression, or more correctly “below threshold suppression”. Data from each collision are produced by millions of sensitive devices. The majority of the devices record an energy deposit consistent with noise and inconsistent with the passage of a charged particle. These readings are cut out at a very early stage. This is not a trivial choice. For example, a free quark might deposit 11% of the energy expected for a normal charged particle, and would very likely be rendered invisible by the zero suppression. However, discovery of a free quark would be a ground-breaking science contribution that could well be worthy of a Nobel prize.
• Selection of “interesting” collisions. This proceeds in a series of steps. The first step is to decide which collisions merit the digitization (and zero suppression) of the analog data in the detection devices. This produces hundreds of gigabytes per second of digital data flow on which further increasingly compute-intensive pattern recognition and selection is performed by embarrassingly parallel “farms” of thousands of computers.
Typically, the group responsible for each major physics analysis topic is given a budget of how many collisions per second it may collect and provides the corresponding selection algorithms. Physicists are well aware of the dangers of optimizing for the “expected” new physics and being blind to the unexpected. For real-time data reduction, all the computing is necessarily performed very close to the source of data.
3.1.2 Distributed Data Processing and Analysis
Construction of each LHC detector necessarily required that equipment costing hundreds of millions of dollars be funded and built by nations around the world and assembled in an underground pit at CERN. The motivation for this consolidation of experimental equipment is clear. However, there is no corresponding motivation to consolidate computing and storage infrastructure in the same manner. Instead, for both technical and political reasons, a distributed computing and storage infrastructure has been the only feasible path to pursue despite some of the complexities involved.
The workflow for HEP scientists involved in an experiment like ATLAS or CMS typically consists of the following tasks:
1. Reconstruction of likely particle trajectories and energy-shower deposits for each collision from the individual measurements made by the millions of sensitive devices.
Another technology breakthrough is…
Parallel Data Processing
Data Analytics for A Social Network
Web Server
Web Server
Web Server
Data Processing Backend
Click Streams: many billion rows/day, many TB/day
Data Loading: High Volume + Transformation
Analysis Queries: Ad targeting, fraud detection, resource provisioning…
User profiles: background, pics, postings, friends…
Quick lookups and updates: update your own profile, read friends’ profiles, write msgs, …
Some (Old) Numbers about Facebook
>250 billion photos; 350 million photos and 5.8 billion likes a day.
Initial software: PHP + MySQL cluster+ Memcached
One of the largest MySQL clusters (OLTP): serves images/profiles
1.5 billion daily active users
22% Internet time of Americans
4 petabytes of new data each day
>30,000 servers, growing to ~180,000 servers or more
Hive, FB’s data warehouse (OLAP), has 300 petabytes of data
Google AlphaGo
Three Forms of Parallelism
[Diagram: model parallelism (splitting the tasks of one model across nodes), data parallelism (splitting the data), and pipeline parallelism (splitting the stages).]
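As a toy illustration of data parallelism, the same task can be applied to disjoint chunks of the data and the partial results merged afterward. This is a minimal sketch with illustrative names; threads stand in for the nodes of a real shared-nothing system.

```python
# Data parallelism sketch: identical task on disjoint data chunks,
# partial results merged at the end. Threads stand in for nodes.
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """The task, applied independently to one data partition."""
    return sum(len(line.split()) for line in chunk)

def parallel_count_words(lines, workers=4):
    # One chunk per worker (data parallelism); each worker runs
    # the identical task on its own chunk.
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_words, chunks)
    return sum(partials)   # merge step
```

Model and pipeline parallelism would instead split the task itself, or its stages, across the workers.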
Topics
1. Parallel databases (80’s - 90’s)
2. MapReduce (2004 - present)
3. Relational processing on MapReduce
Parallel Databases 101
• Rise of parallel databases: late 80�s • Architecture: shared-nothing systems
– A number of nodes connected by fast Ethernet switches– Inter-machine messages are the only way for communication – But used special-purpose hardware (costly, slow to evolve)– Small scale (hence did not focus on fault tolerance)
• Typical systems– Gamma: U. of Wisconsin Madison– TeraData: Wal-Mart’s 7.5TB sales data in hundreds of machines– Tandem– IBM / DB2– Informix…
Some Parallel (||) Terminology
• Speed-Up
– Holds the problem size fixed, grows the system
– Reports serial time / ||-time
– Ideally, linear
• Scale-Up
– Grows both the system and the problem, reports running time
– Ideally, constant
[Plots: speed-up vs. degree of ||-ism (ideal: linear) and scale-up vs. degree of ||-ism (ideal: constant).]
Some Parallel (||) Terminology
• Utilization
– Speed-up / degree of ||-ism
– Ideally, constant
[Plots: speed-up and utilization vs. degree of ||-ism, each against the ideal curve.]
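The definitions above can be written down directly; the timing values below are hypothetical, just to show how the metrics relate.

```python
# Sketch of the metric definitions (timing values are hypothetical).
def speed_up(serial_time, parallel_time):
    # Speed-up: same problem on a bigger system; ideally linear in #nodes.
    return serial_time / parallel_time

def utilization(speedup, degree_of_parallelism):
    # Utilization = speed-up / degree of ||-ism; ideally constant (= 1).
    return speedup / degree_of_parallelism

s = speed_up(100.0, 12.5)   # e.g., 100 s serially, 12.5 s on 10 nodes
u = utilization(s, 10)      # 8x speed-up on 10 nodes => 0.8 utilization
```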
Data Partitioning Schemes
Partitioning a table: Range by R.a (A...E | F...J | K...N | O...S | T...Z), Hash by R.a, or Round Robin.

Range by R.a:
• good for sequential scan, associative (range) search, and sorting
o may have data skew

Hash by R.a:
• good for sequential scan, equality search, and equijoins matching the hash attribute
o poor for range search and operations that do not match the hash attr.
o can have data skew

Round Robin:
• good for sequential scan
o useless for other query operations
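The three schemes can be sketched as routing functions from a tuple to a node; the boundaries and hash function below are illustrative assumptions, not a particular system's implementation.

```python
# Minimal sketches of the three partitioning schemes for n nodes.
def range_partition(key, boundaries):
    # boundaries like ["E", "J", "N", "S"] split A..Z into 5 ranges;
    # a key goes to the first range whose upper bound covers it.
    for node, upper in enumerate(boundaries):
        if key <= upper:
            return node
    return len(boundaries)          # last range (here: T...Z)

def hash_partition(key, n):
    # Equal keys always land on the same node (good for equality search).
    return hash(key) % n

def round_robin_partition(row_number, n):
    # Tuples are dealt out in arrival order; perfectly balanced,
    # but a given key could be on any node.
    return row_number % n
```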
1) Parallel Scans & Index Lookups
• Scan in parallel, and merge.
• Selection queries may not require all sites for range or hash partitioning.
– Want to restrict selection to a few nodes, or restrict “small” queries to a few nodes.
– Indexes can be built at each partition.
– What happens during data inserts and lookups?
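Restricting a selection to a few nodes can be sketched as follows for range partitioning: only the nodes whose ranges overlap the predicate need to scan. The boundaries are illustrative assumptions.

```python
# Sketch: with range partitioning on R.a, a predicate like
# "a BETWEEN 'F' AND 'M'" only touches the nodes whose ranges
# overlap [lo, hi], rather than all sites.
def nodes_for_range(lo, hi, boundaries):
    # boundaries ["E", "J", "N", "S"] give 5 nodes: A-E, F-J, K-N, O-S, T-Z
    first = sum(b < lo for b in boundaries)   # node holding lo
    last = sum(b < hi for b in boundaries)    # node holding hi
    return list(range(first, last + 1))
```

An equality search (`lo == hi`) degenerates to a single node; with hash partitioning the same trick works for equality but not for ranges.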
2) Parallel Sorting
• Sort R by R.b, where R is currently partitioned by R.a
• Idea:
– Scan in parallel, and range-partition by R.b as you go.
– As tuples come into each node, begin “local” sorting.
– Resulting data is sorted, and range-partitioned.
– Problem: skew!
– Solution: “sample” the data at the start to determine partition boundaries.
Parallel Sort by R.b
Denote a range partitioning function as F: A...E | F...J | K...N | O...S | T...Z. The data is initially partitioned by R.a.
1. Run sampling to build an equi-depth histogram; the nodes exchange statistics to build F for R.b.
2. Partition all data by F on R.b.
3. Local sort at each node.
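The three steps above can be sketched in a few lines; lists stand in for the nodes, and the sampling and boundary choice are illustrative simplifications of a real equi-depth histogram.

```python
# Sketch of the three-step parallel sort: sample to build equi-depth
# boundaries, range-repartition by the sort key, then sort locally.
import random

def parallel_sort(partitions, n_nodes, sample_size=100):
    # Step 1: sample to pick boundaries for F (combats skew).
    pool = [x for p in partitions for x in p]
    sample = sorted(random.sample(pool, min(sample_size, len(pool))))
    step = max(1, len(sample) // n_nodes)
    boundaries = sample[step::step][: n_nodes - 1]
    # Step 2: exchange -- route every tuple to its target node via F.
    nodes = [[] for _ in range(n_nodes)]
    for p in partitions:
        for x in p:
            i = sum(b <= x for b in boundaries)   # F(x)
            nodes[i].append(x)
    # Step 3: local sort; concatenating the nodes yields a global sort.
    return [sorted(node) for node in nodes]
```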
3) Partitioned Join
• For equi-joins, partition the input relations by the join attribute on all nodes, and compute the join locally.
• Can use either range partitioning or hash partitioning on the join attribute.
– R and S are each partitioned into n partitions, denoted R0, R1, ..., Rn-1 and S0, S1, ..., Sn-1.
– Partitions Ri and Si are sent to node i.
– Each node locally computes the join using any method.
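A minimal sketch of this scheme, using hash partitioning and a simple hash join as the local method (the relation layouts and key indices are illustrative assumptions):

```python
# Partitioned equi-join sketch: hash both relations on the join
# attribute, then join each partition pair (Ri, Si) locally.
def partitioned_join(R, S, key_r, key_s, n):
    # Partition step: a tuple goes to node hash(join attr) % n, so
    # matching R and S tuples are guaranteed to land on the same node.
    Rp = [[] for _ in range(n)]
    Sp = [[] for _ in range(n)]
    for r in R:
        Rp[hash(r[key_r]) % n].append(r)
    for s in S:
        Sp[hash(s[key_s]) % n].append(s)
    # Local join at each "node" (any method works; a hash join here).
    out = []
    for i in range(n):
        table = {}
        for r in Rp[i]:
            table.setdefault(r[key_r], []).append(r)
        for s in Sp[i]:
            for r in table.get(s[key_s], []):
                out.append(r + s)
    return out
```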
Dataflow Network for Parallel Join
• Good use of split/merge makes it easier to build parallel versions of sequential join code.
4) Fragment-and-Replicate Join
• Partitioning is not possible for non-equi join conditions, e.g., R.A > S.B.
• Use the fragment-and-replicate technique:
– 1) Special case: only one relation is partitioned
• R is partitioned; any partitioning technique can be used.
• The other relation, S, is replicated across all the nodes.
• Node i then locally computes the join of Ri with all of S using any join technique.
• Works well when S is small.
– 2) General case: both R and S are partitioned
• Need to replicate all R partitions or all S partitions.
• Depicted on the next slide.
Illustrating Fragment-and-Replicate Join
[Diagram (a), asymmetric fragment and replicate: R is partitioned into r0, ..., rn-1 across nodes P0, ..., Pn-1, and S is replicated to every node. Diagram (b), general fragment and replicate: R is split into n fragments r0, ..., rn-1 and S into m fragments s0, ..., sm-1; node Pi,j joins ri with sj.]
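The asymmetric case (a) can be sketched as follows; the relation layouts and predicate are illustrative assumptions, and a nested-loops join stands in for "any local join technique".

```python
# Sketch of asymmetric fragment-and-replicate: R stays partitioned,
# small S is replicated to every node, and each node joins its R
# fragment with all of S (here: nested loops on a non-equi predicate).
def frag_replicate_join(R_fragments, S, pred):
    out = []
    for Ri in R_fragments:           # one iteration per node i
        S_copy = list(S)             # S replicated to node i
        for r in Ri:
            for s in S_copy:
                if pred(r, s):       # e.g., r.A > s.B
                    out.append((r, s))
    return out
```

This works well when S is small, since every node pays the cost of holding a full copy of S.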
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
5) Parallel Aggregates
• For each aggregate function, need a decomposition:
– Distributive: count(S) = Σ count(s(i)); ditto for sum()
– Algebraic: avg(S) = (Σ sum(s(i))) / (Σ count(s(i)))
– Holistic: e.g., median, quantiles
• For group-by aggregation:
– How would you implement a parallel group by? See later slides.
– How do you add aggregation to each group?
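The algebraic decomposition of avg() can be sketched directly from the formula: each node returns the distributive pieces (sum, count), and a coordinator combines them. The partition contents below are illustrative.

```python
# Decomposing avg(): nodes compute (sum, count) locally; the
# coordinator combines the pieces, matching
#   avg(S) = (Σ sum(s(i))) / (Σ count(s(i))).
def local_agg(partition):
    return (sum(partition), len(partition))   # runs on node i

def global_avg(locals_):
    total = sum(s for s, _ in locals_)        # Σ sum(s(i))
    count = sum(c for _, c in locals_)        # Σ count(s(i))
    return total / count

partials = [local_agg(p) for p in ([1, 2, 3], [4, 5])]
```

A holistic aggregate like median admits no such fixed-size local summary, which is why it is the hard case for parallelization.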