Timestamped Binary Association Table - IEEE Big Data Congress 2015

Post on 20-Mar-2017

Page 1: Timestamped Binary Association Table - IEEE Big Data Congress 2015

Youngstown State University

Write Optimization using Asynchronous Update on Out-of-Core Column-Store Databases in Map-Reduce

Feng Yu, Eric S. Jones

Youngstown State University, Youngstown, OH

[email protected], [email protected]

Wen-Chi Hou

Southern Illinois University, Carbondale, IL

[email protected]

Page 2: Timestamped Binary Association Table - IEEE Big Data Congress 2015

Column-Store Databases

• The column-store database is also known as a columnar database or column-oriented database.
• The column-store database fits well into a write-once-read-many environment.
– It retrieves only the necessary attributes included in the query predicate, without the need to read the entire tuple.
– It works especially well for OLAP and data mining queries.
– It can reach a higher compression rate and higher read speed than row-based databases.

Page 3: Timestamped Binary Association Table - IEEE Big Data Congress 2015

Challenge

• Optimizing write operations in a column-store database has always been a challenge.
• Data is vertically decomposed into BATs (Binary Association Tables) and randomly distributed over the storage.
• Writes to a column-store database are significantly delayed by ad hoc access to large BATs across multiple pages.
• Existing work focuses mainly on write optimization in main-memory column-store databases.

Page 4: Timestamped Binary Association Table - IEEE Big Data Congress 2015

BAT Example

Fig. 1: Customer data in row-based and column-store (BAT) format. [Figure panels: Relational Data, Mapping Rules, Column-Store]

A BUN consists of (oid, value).
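As a minimal sketch of the vertical decomposition that Fig. 1 depicts, the Python below splits row-based records into one BAT per attribute, each stored as a list of (oid, value) BUNs. The `to_bats` helper, the sample customer rows, and the oid assignment starting at 101 are illustrative assumptions; the figure's actual mapping rules are not recoverable from this transcript.

```python
from typing import Any, Dict, List, Tuple

BUN = Tuple[int, Any]  # (oid, value)


def to_bats(rows: List[Dict[str, Any]], first_oid: int = 101) -> Dict[str, List[BUN]]:
    """Vertically decompose rows into one BAT (a list of BUNs) per attribute."""
    bats: Dict[str, List[BUN]] = {}
    for offset, row in enumerate(rows):
        oid = first_oid + offset  # hypothetical oid assignment, one oid per row
        for attr, value in row.items():
            bats.setdefault(attr, []).append((oid, value))
    return bats


# Made-up sample rows, used only for illustration.
rows = [
    {"id": 1, "name": "Alice", "balance": 100.00},
    {"id": 2, "name": "Bob", "balance": 200.00},
    {"id": 3, "name": "Carol", "balance": 300.00},
]
bats = to_bats(rows)
```

Each attribute ends up in its own BAT, so a query that touches only `balance` never has to read `name`.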

Page 5: Timestamped Binary Association Table - IEEE Big Data Congress 2015

Update on BAT in Map-Reduce

• In a Map-Reduce environment, we assume the update list of OIDs is collected and submitted in a batch.

1. Map-Reduce Join
BAT LEFT OUTER JOIN UPDATE_LIST ON OID => (BAT combine UPDATE_LIST)
• Map-side join: when UPDATE_LIST is small enough to fit into memory
• Reduce-side join: when UPDATE_LIST is too large to fit into memory

2. Projection (Map-Only)
FOR each record in (BAT combine UPDATE_LIST)
    IF UPDATE_LIST attribute is not NULL: output updated value
    ELSE: output original value
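The two steps above can be sketched in plain Python. This is not Hadoop code: it runs in a single process and only mirrors the logic of the map-side case, where UPDATE_LIST fits in memory and is held as a dictionary keyed by oid. The function names `left_outer_join` and `project` and the sample data are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

BUN = Tuple[int, float]                       # (oid, value)
Joined = Tuple[int, float, Optional[float]]   # (oid, original, updated or None)


def left_outer_join(bat: List[BUN], update_list: Dict[int, float]) -> List[Joined]:
    """Step 1: BAT LEFT OUTER JOIN UPDATE_LIST ON OID (in-memory lookup)."""
    return [(oid, value, update_list.get(oid)) for oid, value in bat]


def project(joined: List[Joined]) -> List[BUN]:
    """Step 2: keep the updated value when present, else the original value."""
    return [(oid, new if new is not None else old) for oid, old, new in joined]


bat = [(101, 100.00), (102, 200.00), (103, 300.00)]
update_list = {102: 201.00}
updated_bat = project(left_outer_join(bat, update_list))
```

Note that every BUN of the BAT is read and rewritten even though only one value changed, which is the cost the later slides set out to avoid.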

Page 6: Timestamped Binary Association Table - IEEE Big Data Congress 2015

Motivation

• Focus: write optimization on column-store databases in Map-Reduce
• Principle: avoid seeking and writing on every change
• Solution: timestamp the newly updated data (TBAT)
– multi-version
– no need for an index
• Update: AMO (Asynchronous Map-Only) update
– the newly updated data is appended to the end of a TBAT slip in a map-only manner

Page 7: Timestamped Binary Association Table - IEEE Big Data Congress 2015

TBAT (Timestamped BAT)

• TBAT in HDFS:
struct TBUN {
    TIMESTAMP optime,
    ROWID oid,
    USER_DEFINED_TYPE attrv
}
struct TBAT_slip {
    TBUN[max_size_per_HDFS_slip] tbuns
}
– No need for any global pre-sorting or indexing
– ‘attrv’ can be any user-defined type, so arbitrary kinds of schema can be defined flexibly
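For readers who prefer something executable, here is a rough Python rendering of the two structs above. It is only a sketch under stated assumptions: timestamps and oids are modeled as integers, `max_size` stands in for `max_size_per_HDFS_slip`, and the `append` method with its "slip is full" behavior is a hypothetical addition that the slides do not describe.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass(frozen=True)
class TBUN:
    optime: int   # operation timestamp (assumed to be an integer here)
    oid: int      # row identifier
    attrv: Any    # attribute value of any user-defined type


@dataclass
class TBATSlip:
    max_size: int                                   # stands in for max_size_per_HDFS_slip
    tbuns: List[TBUN] = field(default_factory=list)

    def append(self, tbun: TBUN) -> bool:
        """Append a TBUN if there is room; return False when the slip is full."""
        if len(self.tbuns) >= self.max_size:
            return False
        self.tbuns.append(tbun)
        return True


slip = TBATSlip(max_size=2)
results = [
    slip.append(TBUN(optime=1, oid=101, attrv=100.00)),
    slip.append(TBUN(optime=1, oid=102, attrv=200.00)),
    slip.append(TBUN(optime=1, oid=103, attrv=300.00)),  # rejected: slip is full
]
```

TBUNs are kept in arrival order, which matches the slide's point that no global pre-sorting or indexing is required.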

Page 8: Timestamped Binary Association Table - IEEE Big Data Congress 2015

TBAT Example (logical view)

BAT customer_balance

oid    float
101    100.00
102    200.00
103    300.00

TBAT customer_balance

optime    oid    float
time1     101    100.00
time1     102    200.00
time1     103    300.00

Suppose the existing records were inserted in one batch at time1.

Page 9: Timestamped Binary Association Table - IEEE Big Data Congress 2015

AMO Update (logical)

Example: update query on the customer table:

update customer set balance=201.00 where id=2

The current timestamp is time2 (> time1).

The newest TBUN for 201.00 is appended to the end of TBAT customer_balance.

[Figure labels: Old Data, New Data]
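The append-only behavior of this example can be sketched in a few lines of Python. The sketch assumes integer timestamps (`TIME1 = 1`, `TIME2 = 2`) and a hypothetical `amo_update` helper; it models a TBAT as an in-memory list rather than an HDFS file.

```python
from typing import List, Tuple

TBUN = Tuple[int, int, float]  # (optime, oid, value)


def amo_update(tbat: List[TBUN], updates: List[Tuple[int, float]], optime: int) -> None:
    """Append one timestamped TBUN per update; existing TBUNs are never modified."""
    tbat.extend((optime, oid, value) for oid, value in updates)


TIME1, TIME2 = 1, 2  # assumed integer timestamps, TIME2 > TIME1
customer_balance: List[TBUN] = [
    (TIME1, 101, 100.00),
    (TIME1, 102, 200.00),
    (TIME1, 103, 300.00),
]

# update customer set balance=201.00 where id=2  (oid 102 in this example)
amo_update(customer_balance, [(102, 201.00)], TIME2)
```

The old TBUN for oid 102 stays in place, so the update involves no seek into the existing data.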

Page 10: Timestamped Binary Association Table - IEEE Big Data Congress 2015

Selection after AMO Update

• Data consistency remains intact in a TBAT after an AMO update.
• Example:
– Selection after AMO update:
SELECT balance FROM customer WHERE id=2
– Two tuples will be retrieved:
t1=(time1, 102, 200.00)
t2=(time2, 102, 201.00)
– Compare the timestamps: time2 > time1, so 201.00 is returned, which is consistent with the last updated value.
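The timestamp comparison described above can be written as a short Python function. This is an illustrative sketch, not the authors' Map-Reduce selection algorithm: it scans an in-memory list, assumes integer timestamps, and the name `select_latest` is hypothetical.

```python
from typing import List, Optional, Tuple

TBUN = Tuple[int, int, float]  # (optime, oid, value)


def select_latest(tbat: List[TBUN], oid: int) -> Optional[float]:
    """Return the value of the TBUN with the largest optime for the given oid."""
    matches = [tbun for tbun in tbat if tbun[1] == oid]
    if not matches:
        return None
    return max(matches, key=lambda tbun: tbun[0])[2]


TIME1, TIME2 = 1, 2  # assumed integer timestamps, TIME2 > TIME1
tbat: List[TBUN] = [
    (TIME1, 101, 100.00),
    (TIME1, 102, 200.00),
    (TIME1, 103, 300.00),
    (TIME2, 102, 201.00),  # appended by the AMO update
]

balance = select_latest(tbat, 102)
```

Because every version of an oid must be compared, selection does more work as the share of updated data grows.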

Page 11: Timestamped Binary Association Table - IEEE Big Data Congress 2015

Preliminary Experiment

• Performed on a Cloudera Distributed Hadoop (CDH) version 5.3 cluster
– 1 master and 3 slaves
– Total HDFS capacity = 310GB (block size = 64MB)
– Interconnection is Gigabit Ethernet

• Data sets: 1GB and 10GB random synthetic data in BAT and TBAT.

• Update queries: from 10% to 30% of the original data.

Page 12: Timestamped Binary Association Table - IEEE Big Data Congress 2015

Preliminary Experiment Results (cont.)

1GB Update Running Time

Page 13: Timestamped Binary Association Table - IEEE Big Data Congress 2015

Preliminary Experiment Results (cont.)

10GB Update Running Time

Page 14: Timestamped Binary Association Table - IEEE Big Data Congress 2015

Preliminary Experiment Results (cont.)

Overhead Changing over Data Sets

Page 15: Timestamped Binary Association Table - IEEE Big Data Congress 2015

Resource Usage

Page 16: Timestamped Binary Association Table - IEEE Big Data Congress 2015

Conclusion

• We introduce a new method called AMO update for write optimization on out-of-core (OOC) column-store databases in Map-Reduce.

• AMO update employs TBAT to improve update performance with data atomicity guaranteed.

• Preliminary experiment results show a significant improvement in the running speed of AMO update.

Page 17: Timestamped Binary Association Table - IEEE Big Data Congress 2015

Future Work

• Study how the performance of the Map-Reduce selection algorithm on TBAT varies after different percentages of the file are updated.

• Introduce distributed local indexing on each TBAT slip in HDFS to improve global data retrieval performance.

Page 18: Timestamped Binary Association Table - IEEE Big Data Congress 2015

THANK YOU!

Feng “George” Yu
Computer Science and Information Systems

Youngstown State University, Youngstown, [email protected]
