avid.cs.umass.edu › courses › 645 › s2020 › lectures › ...
TRANSCRIPT
Parallel Data Processing
Prof. Yanlei Diao
Slides Courtesy of R. Ramakrishnan and J. Gehrke
"In 50 years, computers will be intelligent, with a storage capacity of about 10^9."
--- Alan Turing, 1950
Moore’s Law predicts that CPU and memory capacities double about every 18 months
If exponential growth continues…
Limitations of Moore’s Law
• Concerns about whether Moore’s law can continue…
• Disk I/O bandwidth did not increase much (7,200 – 15,000 rpm)
• In some domains, data grows faster than Moore’s law
• New data processing needs grow fast (“democratization of data”)
Synergistic Challenges in Data-Intensive Science and Exascale Computing
[Figure 3.1: Projected growth rates (2010–2015) for data acquisition and processor/memory capacities; curves labeled CAGR = 72%, 60%, 36%, and 20%. (Figure courtesy of Kathy Yelick.)]
• Zero suppression, or more correctly “below threshold suppression”. Data from each collision are produced by millions of sensitive devices. The majority of the devices record an energy deposit consistent with noise and inconsistent with the passage of a charged particle. These readings are cut out at a very early stage. This is not a trivial choice. For example, a free quark might deposit 11% of the energy expected for a normal charged particle, and would very likely be rendered invisible by the zero suppression. However, discovery of a free quark would be a ground-breaking science contribution that could well be worthy of a Nobel prize.
• Selection of “interesting” collisions. This proceeds in a series of steps. The first step is to decide which collisions merit the digitization (and zero suppression) of the analog data in the detection devices. This produces hundreds of gigabytes per second of digital data flow on which further increasingly compute-intensive pattern recognition and selection is performed by embarrassingly parallel “farms” of thousands of computers.
Typically, the group responsible for each major physics analysis topic is given a budget of how many collisions per second it may collect and provides the corresponding selection algorithms. Physicists are well aware of the dangers of optimizing for the “expected” new physics and being blind to the unexpected. For real-time data reduction, all the computing is necessarily performed very close to the source of data.
3.1.2 Distributed Data Processing and Analysis
Construction of each LHC detector necessarily required that equipment costing hundreds of millions of dollars be funded and built by nations around the world and assembled in an underground pit at CERN. The motivation for this consolidation of experimental equipment is clear. However, there is no corresponding motivation to consolidate computing and storage infrastructure in the same manner. Instead, for both technical and political reasons, a distributed computing and storage infrastructure has been the only feasible path to pursue despite some of the complexities involved.
The workflow for HEP scientists involved in an experiment like ATLAS or CMS typically consists of the following tasks:
1. Reconstruction of likely particle trajectories and energy-shower deposits for each collision from the individual measurements made by the millions of sensitive devices.
Another technology breakthrough is…
Parallel Data Processing
Data Analytics for A Social Network
Web Server
Web Server
Web Server
Data Processing Backend
Click Streams: many billion rows/day, many TB/day
Data Loading: High Volume + Transformation
Analysis Queries: Ad targeting, fraud detection, resource provisioning…
User profiles: background, pics, postings, friends…
Quick lookups and updates: update your own profile, read friends’ profiles, write msgs, …
Some (Old) Numbers about Facebook
>250 billion photos; 350 million photos and 5.8 billion likes a day.
Initial software: PHP + MySQL cluster+ Memcached
One of the largest MySQL clusters (OLTP): serves images/profiles
1.5 billion daily active users
22% Internet time of Americans
4 petabytes of new data each day
>30,000 servers, growing to ~180,000 servers or more
Hive, FB’s data warehouse (OLAP), has 300 petabytes of data
Google AlphaGo
Three Forms of Parallelism
[Diagram: model parallelism (splitting the tasks of one model across nodes), data parallelism (splitting the data), and pipeline parallelism (splitting the stages).]
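As a toy illustration of data parallelism, the same task can be applied to disjoint chunks of the data and the partial results merged afterward. This is a minimal sketch with illustrative names; threads stand in for the nodes of a real shared-nothing system.

```python
# Data parallelism sketch: identical task on disjoint data chunks,
# partial results merged at the end. Threads stand in for nodes.
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """The task, applied independently to one data partition."""
    return sum(len(line.split()) for line in chunk)

def parallel_count_words(lines, workers=4):
    # One chunk per worker (data parallelism); each worker runs
    # the identical task on its own chunk.
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_words, chunks)
    return sum(partials)   # merge step
```

Model and pipeline parallelism would instead split the task itself, or its stages, across the workers.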
Topics
1. Parallel databases (80’s - 90’s)
2. MapReduce (2004 - present)
3. Relational processing on MapReduce
Parallel Databases 101
• Rise of parallel databases: late 80�s • Architecture: shared-nothing systems
– A number of nodes connected by fast Ethernet switches– Inter-machine messages are the only way for communication – But used special-purpose hardware (costly, slow to evolve)– Small scale (hence did not focus on fault tolerance)
• Typical systems– Gamma: U. of Wisconsin Madison– TeraData: Wal-Mart’s 7.5TB sales data in hundreds of machines– Tandem– IBM / DB2– Informix…
Some Parallel (||) Terminology
• Speed-Up
– Holds the problem size fixed, grows the system
– Reports serial time / ||-time
– Ideally, linear
• Scale-Up
– Grows both the system and the problem, reports running time
– Ideally, constant
[Plots: speed-up vs. degree of ||-ism (ideal: linear) and scale-up vs. degree of ||-ism (ideal: constant).]
Some Parallel (||) Terminology
• Utilization
– Speed-up / degree of ||-ism
– Ideally, constant
[Plots: speed-up and utilization vs. degree of ||-ism, each against the ideal curve.]
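The definitions above can be written down directly; the timing values below are hypothetical, just to show how the metrics relate.

```python
# Sketch of the metric definitions (timing values are hypothetical).
def speed_up(serial_time, parallel_time):
    # Speed-up: same problem on a bigger system; ideally linear in #nodes.
    return serial_time / parallel_time

def utilization(speedup, degree_of_parallelism):
    # Utilization = speed-up / degree of ||-ism; ideally constant (= 1).
    return speedup / degree_of_parallelism

s = speed_up(100.0, 12.5)   # e.g., 100 s serially, 12.5 s on 10 nodes
u = utilization(s, 10)      # 8x speed-up on 10 nodes => 0.8 utilization
```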
Data Partitioning Schemes
Partitioning a table: Range by R.a (A...E | F...J | K...N | O...S | T...Z), Hash by R.a, or Round Robin.

Range by R.a:
• good for sequential scan, associative (range) search, and sorting
o may have data skew

Hash by R.a:
• good for sequential scan, equality search, and equijoins matching the hash attribute
o poor for range search and operations that do not match the hash attr.
o can have data skew

Round Robin:
• good for sequential scan
o useless for other query operations
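The three schemes can be sketched as routing functions from a tuple to a node; the boundaries and hash function below are illustrative assumptions, not a particular system's implementation.

```python
# Minimal sketches of the three partitioning schemes for n nodes.
def range_partition(key, boundaries):
    # boundaries like ["E", "J", "N", "S"] split A..Z into 5 ranges;
    # a key goes to the first range whose upper bound covers it.
    for node, upper in enumerate(boundaries):
        if key <= upper:
            return node
    return len(boundaries)          # last range (here: T...Z)

def hash_partition(key, n):
    # Equal keys always land on the same node (good for equality search).
    return hash(key) % n

def round_robin_partition(row_number, n):
    # Tuples are dealt out in arrival order; perfectly balanced,
    # but a given key could be on any node.
    return row_number % n
```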
1) Parallel Scans & Index Lookups
• Scan in parallel, and merge.
• Selection queries may not require all sites for range or hash partitioning.
– Want to restrict selection to a few nodes, or restrict “small” queries to a few nodes.
– Indexes can be built at each partition.
– What happens during data inserts and lookups?
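Restricting a selection to a few nodes can be sketched as follows for range partitioning: only the nodes whose ranges overlap the predicate need to scan. The boundaries are illustrative assumptions.

```python
# Sketch: with range partitioning on R.a, a predicate like
# "a BETWEEN 'F' AND 'M'" only touches the nodes whose ranges
# overlap [lo, hi], rather than all sites.
def nodes_for_range(lo, hi, boundaries):
    # boundaries ["E", "J", "N", "S"] give 5 nodes: A-E, F-J, K-N, O-S, T-Z
    first = sum(b < lo for b in boundaries)   # node holding lo
    last = sum(b < hi for b in boundaries)    # node holding hi
    return list(range(first, last + 1))
```

An equality search (`lo == hi`) degenerates to a single node; with hash partitioning the same trick works for equality but not for ranges.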
2) Parallel Sorting
• Sort R by R.b, where R is currently partitioned by R.a
• Idea:
– Scan in parallel, and range-partition by R.b as you go.
– As tuples come into each node, begin “local” sorting.
– Resulting data is sorted, and range-partitioned.
– Problem: skew!
– Solution: “sample” the data at the start to determine partition boundaries.
Parallel Sort by R.b
Denote a range partitioning function as F: A...E | F...J | K...N | O...S | T...Z. The data is initially partitioned by R.a.
1. Run sampling to build an equi-depth histogram; the nodes exchange statistics to build F for R.b.
2. Partition all data by F on R.b.
3. Local sort at each node.
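The three steps above can be sketched in a few lines; lists stand in for the nodes, and the sampling and boundary choice are illustrative simplifications of a real equi-depth histogram.

```python
# Sketch of the three-step parallel sort: sample to build equi-depth
# boundaries, range-repartition by the sort key, then sort locally.
import random

def parallel_sort(partitions, n_nodes, sample_size=100):
    # Step 1: sample to pick boundaries for F (combats skew).
    pool = [x for p in partitions for x in p]
    sample = sorted(random.sample(pool, min(sample_size, len(pool))))
    step = max(1, len(sample) // n_nodes)
    boundaries = sample[step::step][: n_nodes - 1]
    # Step 2: exchange -- route every tuple to its target node via F.
    nodes = [[] for _ in range(n_nodes)]
    for p in partitions:
        for x in p:
            i = sum(b <= x for b in boundaries)   # F(x)
            nodes[i].append(x)
    # Step 3: local sort; concatenating the nodes yields a global sort.
    return [sorted(node) for node in nodes]
```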
3) Partitioned Join
• For equi-joins, partition the input relations by the join attribute on all nodes, and compute the join locally.
• Can use either range partitioning or hash partitioning on the join attribute.
– R and S are each partitioned into n partitions, denoted R0, R1, ..., Rn-1 and S0, S1, ..., Sn-1.
– Partitions Ri and Si are sent to node i.
– Each node locally computes the join using any method.
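A minimal sketch of this scheme, using hash partitioning and a simple hash join as the local method (the relation layouts and key indices are illustrative assumptions):

```python
# Partitioned equi-join sketch: hash both relations on the join
# attribute, then join each partition pair (Ri, Si) locally.
def partitioned_join(R, S, key_r, key_s, n):
    # Partition step: a tuple goes to node hash(join attr) % n, so
    # matching R and S tuples are guaranteed to land on the same node.
    Rp = [[] for _ in range(n)]
    Sp = [[] for _ in range(n)]
    for r in R:
        Rp[hash(r[key_r]) % n].append(r)
    for s in S:
        Sp[hash(s[key_s]) % n].append(s)
    # Local join at each "node" (any method works; a hash join here).
    out = []
    for i in range(n):
        table = {}
        for r in Rp[i]:
            table.setdefault(r[key_r], []).append(r)
        for s in Sp[i]:
            for r in table.get(s[key_s], []):
                out.append(r + s)
    return out
```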
Dataflow Network for Parallel Join
• Good use of split/merge makes it easier to build parallel versions of sequential join code.
4) Fragment-and-Replicate Join
• Partitioning is not possible for non-equi join conditions, e.g., R.A > S.B.
• Use the fragment-and-replicate technique:
– 1) Special case: only one relation is partitioned
• R is partitioned; any partitioning technique can be used.
• The other relation, S, is replicated across all the nodes.
• Node i then locally computes the join of Ri with all of S using any join technique.
• Works well when S is small.
– 2) General case: both R and S are partitioned
• Need to replicate all R partitions or all S partitions.
• Depicted on the next slide.
Illustrating Fragment-and-Replicate Join
[Diagram (a), asymmetric fragment and replicate: R is partitioned into r0, ..., rn-1 across nodes P0, ..., Pn-1, and S is replicated to every node. Diagram (b), general fragment and replicate: R is split into n fragments r0, ..., rn-1 and S into m fragments s0, ..., sm-1; node Pi,j joins ri with sj.]
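The asymmetric case (a) can be sketched as follows; the relation layouts and predicate are illustrative assumptions, and a nested-loops join stands in for "any local join technique".

```python
# Sketch of asymmetric fragment-and-replicate: R stays partitioned,
# small S is replicated to every node, and each node joins its R
# fragment with all of S (here: nested loops on a non-equi predicate).
def frag_replicate_join(R_fragments, S, pred):
    out = []
    for Ri in R_fragments:           # one iteration per node i
        S_copy = list(S)             # S replicated to node i
        for r in Ri:
            for s in S_copy:
                if pred(r, s):       # e.g., r.A > s.B
                    out.append((r, s))
    return out
```

This works well when S is small, since every node pays the cost of holding a full copy of S.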
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
5) Parallel Aggregates
• For each aggregate function, need a decomposition:
– Distributive: count(S) = Σ count(s(i)); ditto for sum()
– Algebraic: avg(S) = (Σ sum(s(i))) / (Σ count(s(i)))
– Holistic: e.g., median, quantiles
• For group-by aggregation:
– How would you implement a parallel group by? See later slides.
– How do you add aggregation to each group?
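The algebraic decomposition of avg() can be sketched directly from the formula: each node returns the distributive pieces (sum, count), and a coordinator combines them. The partition contents below are illustrative.

```python
# Decomposing avg(): nodes compute (sum, count) locally; the
# coordinator combines the pieces, matching
#   avg(S) = (Σ sum(s(i))) / (Σ count(s(i))).
def local_agg(partition):
    return (sum(partition), len(partition))   # runs on node i

def global_avg(locals_):
    total = sum(s for s, _ in locals_)        # Σ sum(s(i))
    count = sum(c for _, c in locals_)        # Σ count(s(i))
    return total / count

partials = [local_agg(p) for p in ([1, 2, 3], [4, 5])]
```

A holistic aggregate like median admits no such fixed-size local summary, which is why it is the hard case for parallelization.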