Huhdoop?: Uncertain Data Management on Non-Relational Database Systems

Upload: jeff-smith

Post on 17-Dec-2014

DESCRIPTION

A year of research into uncertain data management, summed up in memes. My apologies in advance.

TRANSCRIPT

Page 1: Huhdoop?: Uncertain Data Management on Non-Relational Database Systems

HUHDOOP? UNCERTAIN DATA MANAGEMENT ON NON-RELATIONAL DATABASE SYSTEMS

with memes

Page 2:

UNCERTAINTY

Y U NO KNOW CERTAIN DATA POINT?!

Uncertainty is inherent and prevalent

Page 3:

UNCERTAINTY IN DBMSES

An active area of research

Largely not discussed, IRL

Mostly focused on the relational model

(and XML)

Page 4:

HADOOP

BATSHIT CRAZY, BUT IN A GOOD WAY

Page 5:

DATABASES ON HADOOP

Still need fast random access

Don’t actually want to crunch files all the time

Page 6:

HBASE

Column-family database

Part of the stack

Dynamic-ish schemas

Page 7:

MODEL OF UNCERTAINTY

Page 8:

1-D SENSOR UNCERTAINTY

[Figure: an uncertain interval running from a lower bound to an upper bound, with a probability density function over it]
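As a minimal sketch of this model, assuming a uniform PDF (the only PDF named later in the deck), an uncertain reading can be held as an interval plus a density. The class name `UncertainValue` and its methods are illustrative and do not come from the original project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UncertainValue:
    """A 1-D uncertain sensor reading: the true value lies in [lower, upper]."""
    lower: float
    upper: float
    pdf: str = "uniform"  # assumption: only the uniform case is sketched

    def _require_uniform(self) -> None:
        if self.pdf != "uniform":
            raise NotImplementedError("only the uniform PDF is sketched")

    def density(self, x: float) -> float:
        """Probability density at x under the uniform assumption."""
        self._require_uniform()
        if self.lower <= x <= self.upper:
            return 1.0 / (self.upper - self.lower)
        return 0.0

    def probability_between(self, a: float, b: float) -> float:
        """P(a <= value <= b): overlap length divided by interval length."""
        self._require_uniform()
        overlap = min(self.upper, b) - max(self.lower, a)
        return max(0.0, overlap) / (self.upper - self.lower)


# Example values taken from the sample row later in the deck (42 to 63).
reading = UncertainValue(lower=42.0, upper=63.0)
```

The three fields line up with the LOWER, UPPER, and PDF columns used in the rest of the deck.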

Page 9:

SIMPLE SENSOR UNCERTAINTY MODEL

Table: SENSORS

Column families: FIXED | UNCERTAIN

ROW KEY | LOWER | UPPER | PDF

Page 10:

UNCERTAIN QUERIES

Page 11:

DIMENSIONS

            | VALUE-BASED        | ENTITY-BASED
INDEPENDENT | VALUE SINGLE QUERY | ENTITY RANGE QUERY
DEPENDENT   | VALUE SUM QUERY    | ENTITY MINIMUM QUERY

Page 12:

VALUE SINGLE QUERY

LIKE SPELLING YOUR NAME RIGHT ON THE SATS

Page 13:

VALUE SINGLE QUERY

Just grab a single record

In HBase (shell): get 'Sensors', '1', 'Uncertain'

Or in HiveQL: SELECT Lower, Upper, PDF FROM hive_sensors WHERE id=1;

Page 14:

VALUE SUM QUERY

ONLY HARD IF YOU CAN'T ADD

Page 15:

VALUE SUM QUERY

Simple

In HiveQL: SELECT SUM(Upper), SUM(Lower) FROM hive_sensors;

Scalable!

But…

Page 16:

VSUMQ PDFS

Single threaded Java app took 4 hours 23 minutes over only 1,000 records!

10,000 records proved impossible
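The slides do not say how the PDFs were combined. One standard approach for independent variables is to convolve them, and the sketch below does that on discretised uniform PDFs. The function names and the bin width `step` are assumptions made for illustration; this is not the Java application from the talk. It also shows why the bounds are cheap (two sums) while the PDF is not: each record folded in lengthens the distribution being convolved.

```python
def uniform_masses(lower, upper, step=1.0):
    """Approximate a uniform PDF on [lower, upper] by equal masses per bin."""
    bins = max(1, round((upper - lower) / step))
    return [1.0 / bins] * bins


def convolve(p, q):
    """Discrete convolution: distribution of the sum of two independent variables."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def sum_distribution(intervals, step=1.0):
    """Fold the convolution over every (lower, upper) interval."""
    total = [1.0]  # distribution of an empty sum: all mass at zero
    for lower, upper in intervals:
        total = convolve(total, uniform_masses(lower, upper, step))
    return total


# Hypothetical intervals; only the first matches the deck's sample row.
intervals = [(42, 63), (10, 20), (5, 8)]

# The bounds of the sum are just the sums of the bounds.
bounds = (sum(lo for lo, _ in intervals), sum(hi for _, hi in intervals))

# The PDF of the sum needs one convolution per record.
masses = sum_distribution(intervals)
```

Here `bounds` plays the role of the `SUM(Upper), SUM(Lower)` query, while `masses` is the part that has no one-line HiveQL equivalent.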

Page 17:

VSUMQ STRATEGIES

Just calculate regularly

Cache it in Hive

Reduces latency from 1048 seconds to 8 seconds

Data staleness likely irrelevant for an aggregate of uncertain records

Page 18:

ENTITY RANGE QUERY

4 TIMES THE WORK, SAME NUMBER OF CREDIT HOURS

Page 19:

ENTITY RANGE QUERY

[Figure: four classes of uncertain interval (Class 1 to Class 4), shown relative to the query range's lower bound and upper bound]

Page 20:

ERQ IN HIVEQL

Class 1
SELECT Sensor_id, (Upper-10)/(Upper-Lower) AS probability
FROM hive_sensors
WHERE Upper>=10 AND Upper<=20 AND Lower<=10;

Class 2
SELECT Sensor_id, 1 AS probability
FROM hive_sensors
WHERE Upper<=20 AND Lower>=10;

Class 3
SELECT Sensor_id, (20-Lower)/(Upper-Lower) AS probability
FROM hive_sensors
WHERE Lower>=10 AND Upper>=20 AND Lower<=20;

Class 4
SELECT Sensor_id, (20-10)/(Upper-Lower) AS probability
FROM hive_sensors
WHERE Lower<=10 AND Upper>=20;
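Under the uniform-PDF assumption, the four class formulas share one form: the length of the overlap between the sensor interval and the query range, divided by the length of the sensor interval. A minimal sketch follows; the function name `range_probability` is hypothetical and the default range of 10 to 20 is the one used in the queries.

```python
def range_probability(lower, upper, range_lo=10.0, range_hi=20.0):
    """Probability that a uniform value on [lower, upper] lies in [range_lo, range_hi]."""
    if upper <= lower:
        raise ValueError("interval must have positive width")
    overlap = min(upper, range_hi) - max(lower, range_lo)
    return max(0.0, overlap) / (upper - lower)


# One hypothetical interval per class.
class_1 = range_probability(5, 15)    # (Upper-10)/(Upper-Lower)
class_2 = range_probability(12, 18)   # entirely inside the range
class_3 = range_probability(15, 25)   # (20-Lower)/(Upper-Lower)
class_4 = range_probability(0, 40)    # (20-10)/(Upper-Lower)
outside = range_probability(30, 40)   # no overlap with the range
```

The guard on zero-width intervals matters in the queries too, since `Upper-Lower` is the divisor in three of the four classes.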

Page 21:

AGGREGATE ERQ

SELECT Sensor_id, (Upper-10)/(Upper-Lower) AS probability
FROM hive_sensors
WHERE Upper>=10 AND Upper<=20 AND Lower<=10
UNION ALL
SELECT Sensor_id, 1 AS probability
FROM hive_sensors
WHERE Upper<=20 AND Lower>=10
UNION ALL
SELECT Sensor_id, (20-Lower)/(Upper-Lower) AS probability
FROM hive_sensors
WHERE Lower>=10 AND Upper>=20 AND Lower<=20
UNION ALL
SELECT Sensor_id, (20-10)/(Upper-Lower) AS probability
FROM hive_sensors
WHERE Lower<=10 AND Upper>=20;

Page 22:

SIMPLIFIED ERQ

SELECT Sensor_id
FROM hive_sensors
WHERE (Upper>=10 AND Upper<=20 AND Lower<=10)
OR (Upper<=20 AND Lower>=10)
OR (Lower>=10 AND Upper>=20 AND Lower<=20)
OR (Lower<=10 AND Upper>=20);

Reduces to:

SELECT * FROM hive_sensors WHERE Upper>=10 AND Lower<=20;

* Just the intervals
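The reduction can be checked by brute force. The sketch below compares the four-class predicate with the two-condition overlap predicate over a small integer grid. It assumes well-formed intervals (Lower <= Upper); without that assumption the Class 2 condition alone would admit rows the reduced query rejects. The function names are illustrative.

```python
from itertools import product

A, B = 10, 20  # query range used in the queries


def four_class_predicate(lower, upper):
    """The WHERE clause with one disjunct per class."""
    return (
        (upper >= A and upper <= B and lower <= A)      # Class 1
        or (upper <= B and lower >= A)                  # Class 2
        or (lower >= A and upper >= B and lower <= B)   # Class 3
        or (lower <= A and upper >= B)                  # Class 4
    )


def overlap_predicate(lower, upper):
    """The reduced WHERE clause."""
    return upper >= A and lower <= B


# Collect every well-formed interval on the grid where the two disagree.
mismatches = [
    (lo, hi)
    for lo, hi in product(range(0, 31), repeat=2)
    if lo <= hi and four_class_predicate(lo, hi) != overlap_predicate(lo, hi)
]
```

An empty `mismatches` list on this grid is consistent with the reduction; it is a spot check, not a proof.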

Page 23:

ENTITY RANGE QUERY OPTIMIZATIONS

Page 24:

ARBITRARY ROW KEYS

Table: SENSORS

Column families: FIXED | UNCERTAIN

ROW KEY | LOWER | UPPER | PDF
1234    | 42    | 63    | UNIFORM

Page 25:

NON-ARBITRARY ROW KEYS

Table: SENSORS

Column families: FIXED | UNCERTAIN

ROW KEY  | LOWER | UPPER | PDF
42631234 | 42    | 63    | UNIFORM
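The sample row key appears to be the lower bound, the upper bound, and the sensor id joined together. A hedged sketch of why that helps: if the parts are zero-padded to fixed widths, sorted keys are ordered by lower bound, so a condition on the lower bound becomes a contiguous key range. The widths, the function names, and the list-based scan are all assumptions; nothing here calls HBase.

```python
from bisect import bisect_left


def make_row_key(lower, upper, sensor_id, bound_width=2, id_width=4):
    """Concatenate zero-padded lower bound, upper bound, and sensor id.

    Assumes non-negative integers that fit the given widths, so that
    lexicographic key order matches numeric order of the lower bound.
    """
    return (str(lower).zfill(bound_width)
            + str(upper).zfill(bound_width)
            + str(sensor_id).zfill(id_width))


def scan_lower_at_most(sorted_keys, max_lower, bound_width=2):
    """Emulate a key-range scan: all keys whose lower bound is <= max_lower.

    Assumes max_lower + 1 still fits in bound_width digits.
    """
    stop_row = str(max_lower + 1).zfill(bound_width)  # exclusive stop row
    return sorted_keys[:bisect_left(sorted_keys, stop_row)]


# The first tuple is the deck's sample row; the rest are hypothetical.
sensors = [(42, 63, 1234), (5, 15, 7), (18, 30, 99), (25, 40, 3)]
keys = sorted(make_row_key(lo, hi, sid) for lo, hi, sid in sensors)
hits = scan_lower_at_most(keys, max_lower=20)
```

Only the leading part of the key benefits from this ordering, so in this sketch the upper-bound condition would still need a filter over the scanned rows.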

Page 26:

PERFORMANCE

Page 27:

DATA IN COLUMN FAMILIES

Table: SENSORS

Column families: FIXED | UNCERTAIN_LOWER | UNCERTAIN_UPPER | UNCERTAIN

ROW KEY | LOWER | LOWER_40 | UPPER | UPPER_60 | PDF
1234    | 42    | 1        | 63    | 1        | UNIFORM

Page 28:

DATA IN COLUMN FAMILIES

Have to use column-families, not just columns

Does handle 2-dimensional uncertainty

Bloom filters obviously help

Query syntax gets complicated

Page 29:

ENTITY MINIMUM QUERY

LIKE THE KING OF THE DOWNVOTED

Page 30:

EMINQ HIVE + JYTHON IMPLEMENTATION

...
r1 = statement.executeQuery("SELECT MIN(Upper) FROM hive_sensors")
r1.next()  # move the ResultSet to its single row
min_upper = r1.getFloat(1)
result = statement.executeQuery(
    "SELECT * FROM hive_sensors WHERE Lower < {0}".format(min_upper))
...
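The two-pass logic can be sketched without a database: find the smallest upper bound, then keep every sensor whose lower bound falls below it, since only those intervals can hold the minimum value. The function name and the tuple layout are assumptions for illustration.

```python
def eminq_candidates(sensors):
    """Return the sensors whose interval could contain the minimum value.

    sensors: iterable of (sensor_id, lower, upper) tuples.
    """
    sensors = list(sensors)
    if not sensors:
        return []
    # Pass 1: the smallest upper bound, as in SELECT MIN(Upper).
    min_upper = min(upper for _, _, upper in sensors)
    # Pass 2: keep rows with Lower < min_upper, as in the second query.
    return [s for s in sensors if s[1] < min_upper]


# Hypothetical readings; sensor 2 sets the smallest upper bound (20).
readings = [(1, 42, 63), (2, 10, 20), (3, 15, 50), (4, 25, 30)]
candidates = eminq_candidates(readings)
```

The comparison is strict, matching the `Lower < ...` condition in the second query.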

Page 31:

EMINQ PIG IMPLEMENTATION

test_sensors = load 'hbase://u_1'
    using org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'Fixed:Sensor_id, Uncertain:Upper, Uncertain:Lower, Uncertain:PDF',
        '-loadKey true')
    as (ID:bytearray, Sensor_id:int, up_val:float, down_val:float, pdf:chararray);

grouped = GROUP test_sensors ALL;
minup = FOREACH grouped GENERATE MIN(test_sensors.up_val);

inrange = FILTER test_sensors BY (down_val < minup.$0);
dump inrange;

Page 32:

EMINQ PERFORMANCE

Page 33:

CASSANDRA

NOT HADOOP, JUST USEFUL

Page 34:

SECONDARY INDEXES ON CASSANDRA

CREATE TABLE sensors (
    Sensor_id int,
    Lower float,
    Upper float,
    PDF text,
    PRIMARY KEY (Sensor_id)
);

CREATE INDEX sensors_down ON sensors (Lower);

CREATE INDEX sensors_up ON sensors (Upper);

Page 35:

OPEN QUESTIONS

I SPENT A YEAR OF MY LIFE ON THIS STUFF: AMA!

Page 36:

REPOST: EMINQ PIG IMPLEMENTATION

test_sensors = load 'hbase://u_1'
    using org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'Fixed:Sensor_id, Uncertain:Upper, Uncertain:Lower, Uncertain:PDF',
        '-loadKey true')
    as (ID:bytearray, Sensor_id:int, up_val:float, down_val:float, pdf:chararray);

grouped = GROUP test_sensors ALL;
minup = FOREACH grouped GENERATE MIN(test_sensors.up_val);

inrange = FILTER test_sensors BY (down_val < minup.$0);
dump inrange;

Page 37:

EMINQ FILE-BASED REWRITE

Using HBase

test_sensors = load 'hbase://u_1'
    using org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'Fixed:Sensor_id, Uncertain:Upper, Uncertain:Lower, Uncertain:PDF',
        '-loadKey true')
    as (ID:bytearray, Sensor_id:int, up_val:float, down_val:float, pdf:chararray);

Using Files

test_sensors = load 'uncertain_data_file'
    as (ID:bytearray, Sensor_id:int, up_val:float, down_val:float, pdf:chararray);

Page 38:

EMINQ & ALL FULL TABLE QUERIES

Page 39:

CREDITS

University of Hong Kong Computer Science Department

Reynold Cheng - research supervision, academic instruction

Ben Kao - research evaluation

Liu Lu - research, software implementation

Wang Zuyao - research, software implementation

Page 40:

ME

@jeffksmithjr

toromon.com