HUHDOOP?
UNCERTAIN DATA MANAGEMENT ON NON-RELATIONAL DATABASE SYSTEMS
with memes
UNCERTAINTY
Y U NO KNOW CERTAIN DATA POINT?!
Uncertainty is inherent and prevalent
UNCERTAINTY IN DBMSES
An active area of research
Largely not discussed, IRL
Mostly focused on the relational model
(and XML)
HADOOP
BATSHIT CRAZY, BUT IN A GOOD WAY
DATABASES ON HADOOP
Still need fast random access
Don’t actually want to crunch files all the time
HBASE
Column-family database
Part of the stack
Dynamic-ish schemas
MODEL OF UNCERTAINTY
1-D SENSOR UNCERTAINTY
UNCERTAIN INTERVAL
[Figure: Probability Density Function over the interval between the Lower Bound and the Upper Bound]
SIMPLE SENSOR UNCERTAINTY MODEL
Table SENSORS (column families: FIXED, UNCERTAIN)
ROWKEY | LOWER | UPPER | PDF
UNCERTAIN QUERIES
DIMENSIONS
            | VALUE-BASED        | ENTITY-BASED
INDEPENDENT | VALUE SINGLE QUERY | ENTITY RANGE QUERY
DEPENDENT   | VALUE SUM QUERY    | ENTITY MINIMUM QUERY
VALUE SINGLE QUERY
LIKE SPELLING YOUR NAME RIGHT ON THE SATS
VALUE SINGLE QUERY
Just grab a single record
In HBase (shell): get 'Sensors', '1', 'Uncertain'
Or in HiveQL: SELECT Lower, Upper, PDF FROM hive_sensors WHERE id=1;
VALUE SUM QUERY
ONLY HARD IF YOU CAN’T ADD
VALUE SUM QUERY
Simple
In HiveQL: SELECT SUM(Upper), SUM(Lower) FROM hive_sensors;
Scalable!
But…
VSUMQ PDFS
A single-threaded Java app took 4 hours 23 minutes over only 1,000 records!
10,000 records proved impossible
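The slides do not show how the PDF of the sum was computed, so the following is only a rough sketch of one common approach: approximate each record's PDF on a grid and combine the records by discrete convolution. It assumes every record is uniform on its interval and that records are independent. The `STEP` grid width and all function names are hypothetical choices for illustration.

```python
# Sketch only: not the implementation from the slides.
# Each sensor value is assumed uniform on [lower, upper] and independent
# of the others; PDFs are approximated as probability mass per grid cell.

STEP = 0.5  # grid resolution (hypothetical choice)


def uniform_pdf(lower, upper, step=STEP):
    """Return (start, masses): equal probability mass per grid cell."""
    cells = max(1, round((upper - lower) / step))
    return lower, [1.0 / cells] * cells


def convolve(a, b):
    """Discrete convolution of two lists of probability masses."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def sum_pdf(intervals, step=STEP):
    """Approximate PDF of the sum of the records, as (start, masses)."""
    start, masses = 0.0, [1.0]
    for lower, upper in intervals:
        s, m = uniform_pdf(lower, upper, step)
        start += s                    # lower bounds simply add up
        masses = convolve(masses, m)  # each record widens the distribution
    return start, masses


start, masses = sum_pdf([(42, 63), (10, 20), (5, 8)])
print(start, len(masses), round(sum(masses), 6))
```

The result grows by one record's worth of grid cells at every step, so each further convolution is more expensive than the one before it.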
VSUMQ STRATEGIES
Just calculate regularly
Cache it in Hive
Reduces latency from 1048 seconds to 8 seconds
Data staleness likely irrelevant for an aggregate of uncertain records
ENTITY RANGE QUERY
4 TIMES THE WORK, SAME NUMBER OF CREDIT HOURS
ENTITY RANGE QUERY
[Figure: Class 1, Class 2, Class 3, and Class 4 intervals shown against the Lower Bound and the Upper Bound]
Class 1: SELECT Sensor_id, (Upper-10)/(Upper-Lower) AS probability FROM hive_sensors WHERE Upper>=10 AND Upper<=20 AND Lower<=10;
Class 2: SELECT Sensor_id, 1 AS probability FROM hive_sensors WHERE Upper<=20 AND Lower>=10;
Class 3: SELECT Sensor_id, (20-Lower)/(Upper-Lower) AS probability FROM hive_sensors WHERE Lower>=10 AND Upper>=20 AND Lower<=20;
Class 4: SELECT Sensor_id, (20-10)/(Upper-Lower) AS probability FROM hive_sensors WHERE Lower<=10 AND Upper>=20;
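For a uniform PDF, the four class formulas are all instances of one expression: the length of the overlap between the sensor interval and the query range [10, 20], divided by the length of the sensor interval. A minimal sketch follows. The function name and the handling of zero-width intervals are assumptions, not something the slides specify.

```python
def range_probability(lower, upper, q_low=10.0, q_high=20.0):
    """Probability that a value uniform on [lower, upper] lies in [q_low, q_high]."""
    if upper <= lower:
        # Degenerate interval: treated as a point value (assumption).
        return 1.0 if q_low <= lower <= q_high else 0.0
    overlap = min(upper, q_high) - max(lower, q_low)
    return max(0.0, overlap) / (upper - lower)


print(range_probability(5, 15))   # Class 1: (Upper-10)/(Upper-Lower)
print(range_probability(12, 18))  # Class 2: entirely inside the range
print(range_probability(15, 25))  # Class 3: (20-Lower)/(Upper-Lower)
print(range_probability(0, 40))   # Class 4: (20-10)/(Upper-Lower)
print(range_probability(30, 40))  # no overlap with the range
```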
ERQ IN HIVEQL
SELECT Sensor_id, (Upper-10)/(Upper-Lower) AS probability FROM hive_sensors WHERE Upper>=10 AND Upper<=20 AND Lower<=10
UNION ALL
SELECT Sensor_id, 1 AS probability FROM hive_sensors WHERE Upper<=20 AND Lower>=10
UNION ALL
SELECT Sensor_id, (20-Lower)/(Upper-Lower) AS probability FROM hive_sensors WHERE Lower>=10 AND Upper>=20 AND Lower<=20
UNION ALL
SELECT Sensor_id, (20-10)/(Upper-Lower) AS probability FROM hive_sensors WHERE Lower<=10 AND Upper>=20;
AGGREGATE ERQ
SELECT Sensor_id FROM hive_sensors WHERE (Upper>=10 AND Upper<=20 AND Lower<=10) OR (Upper<=20 AND Lower>=10) OR (Lower>=10 AND Upper>=20 AND Lower<=20) OR (Lower<=10 AND Upper>=20);
Reduces to: SELECT * FROM hive_sensors WHERE Upper>=10 AND Lower<=20;
* Just the intervals
SIMPLIFIED ERQ
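The reduction can be sanity-checked by brute force. The sketch below compares the four-clause predicate with the two-comparison predicate over a small integer grid. It assumes every record satisfies Lower <= Upper, which the reduction relies on. The function names are illustrative.

```python
from itertools import product


def four_class_predicate(lower, upper):
    """The WHERE clause of the aggregate ERQ, transcribed clause by clause."""
    return ((upper >= 10 and upper <= 20 and lower <= 10)
            or (upper <= 20 and lower >= 10)
            or (lower >= 10 and upper >= 20 and lower <= 20)
            or (lower <= 10 and upper >= 20))


def simplified_predicate(lower, upper):
    """The reduced WHERE clause: the interval overlaps [10, 20]."""
    return upper >= 10 and lower <= 20


# Check every well-formed integer interval with bounds in 0..30.
mismatches = [(lo, up)
              for lo, up in product(range(0, 31), repeat=2)
              if lo <= up
              and four_class_predicate(lo, up) != simplified_predicate(lo, up)]
print(mismatches)
```

An empty list means the two predicates agree on every interval tested. Rows with Lower > Upper would break the equivalence, so malformed records should be rejected before relying on the short form.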
ENTITY RANGE QUERY OPTIMIZATIONS
ARBITRARY ROW KEYS
Table SENSORS (column families: FIXED, UNCERTAIN)
ROWKEY | LOWER | UPPER | PDF
1234 | 42 | 63 | UNIFORM
NON-ARBITRARY ROW KEYS
Table SENSORS (column families: FIXED, UNCERTAIN)
ROWKEY | FIXED | LOWER | UPPER | PDF
4263 | 1234 | 42 | 63 | UNIFORM
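The example row key 4263 appears to be built from the bounds 42 and 63. The slides do not say how such keys are scanned, so the sketch below is only one plausible reading: if keys sort by lower bound, a scan can stop at the first key whose lower bound exceeds the query's upper bound. The zero-padded layout, the `WIDTH` constant, the appended sensor ID (added here for uniqueness), and all function names are assumptions. It only handles non-negative integer bounds.

```python
from bisect import bisect_left

WIDTH = 4  # hypothetical fixed digit width so that keys sort numerically


def make_row_key(lower, upper, sensor_id):
    """Concatenate zero-padded lower, upper and sensor ID (assumed layout)."""
    return "".join(str(v).zfill(WIDTH) for v in (lower, upper, sensor_id))


def decode(key):
    """Split a row key back into (lower, upper, sensor_id)."""
    return tuple(int(key[i:i + WIDTH]) for i in range(0, 3 * WIDTH, WIDTH))


def scan_candidates(sorted_keys, q_high):
    """Emulate a range scan that stops once lower bounds exceed q_high."""
    stop = str(q_high + 1).zfill(WIDTH)  # smallest prefix with lower > q_high
    return sorted_keys[:bisect_left(sorted_keys, stop)]


rows = [(42, 63, 1234), (5, 15, 1), (12, 18, 2),
        (15, 25, 3), (0, 40, 4), (3, 8, 5)]
keys = sorted(make_row_key(*r) for r in rows)

# Query range [10, 20]: the scan prunes on Lower <= 20 and a filter
# on the decoded upper bound applies Upper >= 10.
candidates = scan_candidates(keys, q_high=20)
matches = [decode(k) for k in candidates if decode(k)[1] >= 10]
print(matches)
```

Only the lower-bound half of the predicate benefits from key order in this layout. The upper-bound check still has to be applied to every row the scan returns.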
PERFORMANCE
DATA IN COLUMN FAMILIES
Table SENSORS (column families: FIXED, UNCERTAIN_LOWER, UNCERTAIN_UPPER, UNCERTAIN)
ROWKEY | LOWER | LOWER_40 | UPPER | UPPER_60 | PDF
1234 | 42 | 1 | 63 | 1 | UNIFORM
DATA IN COLUMN FAMILIES
Have to use column-families, not just columns
Does handle 2-dimensional uncertainty
Bloom filters obviously help
Query syntax gets complicated
ENTITY MINIMUM QUERY
LIKE THE KING OF THE DOWNVOTED
EMINQ HIVE + JYTHON IMPLEMENTATION
...
r1 = statement.executeQuery("SELECT MIN(Upper) FROM hive_sensors;")
r1.next()  # move to the single result row
min_upper = r1.getFloat(1)  # executeQuery returns a ResultSet, not a number
result = statement.executeQuery(
    "SELECT * FROM hive_sensors WHERE Lower < {0};".format(min_upper))
...
EMINQ PIG IMPLEMENTATION
test_sensors = load 'hbase://u_1'
    using org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'Fixed:Sensor_id, Uncertain:Upper, Uncertain:Lower, Uncertain:PDF',
        '-loadKey true')
    as (ID:bytearray, Sensor_id:int, up_val:float, down_val:float, pdf:chararray);

grouped = GROUP test_sensors ALL;
minup = FOREACH grouped GENERATE MIN(test_sensors.up_val);

inrange = FILTER test_sensors BY (down_val < minup.$0);
dump inrange;
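Both implementations follow the same two-pass idea: find the smallest upper bound, then keep every record whose lower bound is below it, since only those records could hold the minimum. A small in-memory sketch of that logic follows. The function name and the sample records are made up for illustration, and the strict `<` mirrors the comparison used in the slides.

```python
def entity_minimum_candidates(records):
    """records: iterable of (sensor_id, lower, upper) tuples.

    Returns the records that could possibly hold the minimum value.
    """
    records = list(records)
    if not records:
        return []
    # Pass 1: the smallest upper bound caps the true minimum.
    min_upper = min(upper for _, _, upper in records)
    # Pass 2: a record can only be the minimum if its lower bound is below that cap.
    return [r for r in records if r[1] < min_upper]


# Hypothetical sample data: (sensor_id, lower, upper)
sensors = [(1, 42, 63), (2, 50, 70), (3, 65, 80), (4, 60, 61)]
print(entity_minimum_candidates(sensors))
```

Sensor 3 is pruned because its lower bound (65) is above the smallest upper bound (61), so it cannot be the minimum.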
EMINQ PERFORMANCE
CASSANDRA
NOT HADOOP, JUST USEFUL
SECONDARY INDEXES ON CASSANDRA
CREATE TABLE sensors (
    Sensor_id int,
    Lower float,
    Upper float,
    PDF text,
    PRIMARY KEY (Sensor_id)
);

CREATE INDEX sensors_down ON sensors (Lower);

CREATE INDEX sensors_up ON sensors (Upper);
OPEN QUESTIONS
I SPENT A YEAR OF MY LIFE ON THIS STUFF: AMA!
REPOST: EMINQ PIG IMPLEMENTATION
test_sensors = load 'hbase://u_1'
    using org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'Fixed:Sensor_id, Uncertain:Upper, Uncertain:Lower, Uncertain:PDF',
        '-loadKey true')
    as (ID:bytearray, Sensor_id:int, up_val:float, down_val:float, pdf:chararray);

grouped = GROUP test_sensors ALL;
minup = FOREACH grouped GENERATE MIN(test_sensors.up_val);

inrange = FILTER test_sensors BY (down_val < minup.$0);
dump inrange;
EMINQ FILE-BASED REWRITE
Using HBase:
test_sensors = load 'hbase://u_1'
    using org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'Fixed:Sensor_id, Uncertain:Upper, Uncertain:Lower, Uncertain:PDF',
        '-loadKey true')
    as (ID:bytearray, Sensor_id:int, up_val:float, down_val:float, pdf:chararray);
Using files:
test_sensors = load 'uncertain_data_file'
    as (ID:bytearray, Sensor_id:int, up_val:float, down_val:float, pdf:chararray);
EMINQ & ALL FULL TABLE QUERIES
CREDITS
University of Hong Kong Computer Science Department
Reynold Cheng: research supervision, academic instruction
Ben Kao: research evaluation
Liu Lu: research, software implementation
Wang Zuyao: research, software implementation