Exploring the Enron Email Dataset with Kiji and Hive

Upload: wibidata

Post on 30-Jun-2015

DESCRIPTION

Talk given at the September 2013 SF Hadoop Users Group by Lee Sheng: http://www.meetup.com/hadoopsf/events/136499862/

TRANSCRIPT

Page 1: Exploring the Enron Email Dataset with Kiji and Hive

Page 9: Exploring the Enron Email Dataset with Kiji and Hive

CREATE EXTERNAL TABLE emails (
  mid STRUCT<ts: TIMESTAMP, value: STRING>,
  dateLong STRUCT<ts: TIMESTAMP, value: BIGINT>,
  fromStr STRUCT<ts: TIMESTAMP, value: STRING>,
  toStr STRUCT<ts: TIMESTAMP, value: STRING>,
  subject STRUCT<ts: TIMESTAMP, value: STRING>,
  body STRUCT<ts: TIMESTAMP, value: STRING>
)
STORED BY 'org.kiji.hive.KijiTableStorageHandler'
WITH SERDEPROPERTIES (
  'kiji.columns' = 'info:mid[0],info:date[0],info:from[0],info:to[0],info:subject[0],info:body[0]'
)
TBLPROPERTIES (
  'kiji.table.uri' = 'kiji://.env/enron_email/emails'
);

Page 10: Exploring the Enron Email Dataset with Kiji and Hive

SELECT
  fromStr.value AS fromStr,
  count(1) AS count
FROM emails
GROUP BY fromStr.value
ORDER BY count DESC
LIMIT 10;

Page 12: Exploring the Enron Email Dataset with Kiji and Hive

SELECT
  fromStr.value AS fromStr,
  trim(splitToStr) AS toStr,
  count(1) AS count
FROM emails
LATERAL VIEW explode(split(toStr.value, ',')) tos AS splitToStr
GROUP BY fromStr.value, trim(splitToStr)
ORDER BY count DESC
LIMIT 10;
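
For readers less familiar with LATERAL VIEW explode, the following is a minimal Python sketch of what the query above computes: each email's comma-separated To list is split into one row per recipient, trimmed, and counted per (sender, recipient) pair. The sample rows and the `count_pairs` helper are hypothetical and only illustrate the logic; they are not part of the talk.

```python
from collections import Counter

# Hypothetical sample rows: (sender, comma-separated recipients).
# These are made up for illustration and are not taken from the Enron data.
rows = [
    ("a@enron.com", "b@enron.com, c@enron.com"),
    ("a@enron.com", "b@enron.com"),
    ("d@enron.com", "a@enron.com"),
]

def count_pairs(rows):
    """Count (sender, recipient) pairs, one per address in the To list."""
    pairs = Counter()
    for sender, to_list in rows:
        # split(toStr.value, ',') followed by explode: one entry per recipient
        for recipient in to_list.split(","):
            # trim(splitToStr); note str.strip() removes all surrounding whitespace
            pairs[(sender, recipient.strip())] += 1
    return pairs

# ORDER BY count DESC LIMIT 10
top = count_pairs(rows).most_common(10)
print(top[0])
```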

Page 16: Exploring the Enron Email Dataset with Kiji and Hive

[Diagram: User Emails; Emails Table; Sentiment Producer]

Page 17: Exploring the Enron Email Dataset with Kiji and Hive

SELECT
  ((year(datelong.ts) - 1999) * 52 + weekofyear(datelong.ts)) AS weeknum,
  avg(sentiment.value) AS avgsentiment,
  stddev(sentiment.value) AS stddevsentiment,
  count(1) AS nummessages
FROM emails
WHERE regexp_replace(fromStr.value, ".*@", "") == "enron.com"
GROUP BY ((year(datelong.ts) - 1999) * 52 + weekofyear(datelong.ts));
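
The week-number arithmetic in the query above can be checked by hand with a small Python sketch. It assumes that Hive's weekofyear follows ISO 8601 week numbering, which Python exposes through `date.isocalendar()`; the `weeknum` helper name is hypothetical.

```python
from datetime import date

def weeknum(d, base_year=1999):
    """Mirror (year(ts) - 1999) * 52 + weekofyear(ts).

    Assumption: weekofyear is the ISO 8601 week number.
    """
    iso_week = d.isocalendar()[1]
    return (d.year - base_year) * 52 + iso_week

# 16 October 2001 falls in ISO week 42, so the bucket is 2 * 52 + 42.
print(weeknum(date(2001, 10, 16)))  # 146
```

Under that assumption, dates near a year boundary may land in an unexpected bucket: 31 December 2001 belongs to ISO week 1 of 2002 while its calendar year is still 2001, so it receives the same weeknum as 1 January 2001.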

Page 22: Exploring the Enron Email Dataset with Kiji and Hive

SELECT
  lword AS word,
  sum(sentiment) AS totalsentiment
FROM (
  SELECT
    mid.value AS mid,
    lower(word) AS lword,
    sentiment.value AS sentiment
  FROM emails
  LATERAL VIEW explode(sentences(body.value)[0]) wds AS word
  WHERE regexp_replace(fromStr.value, ".*@", "") == "enron.com"
) subquery
GROUP BY lword
ORDER BY totalsentiment ASC;
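
As a rough illustration of the query above, this Python sketch sums each email's sentiment score over the lowercased words of its first sentence, restricted to senders at enron.com, and sorts ascending. The regex tokenizer is only a crude stand-in for Hive's sentences() function, and the sample rows, scores, and helper names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical rows: (sender, body, sentiment score). Not taken from the Enron data.
rows = [
    ("a@enron.com", "Terrible news today. More later.", -2.0),
    ("b@enron.com", "Great news!", 1.5),
    ("x@other.com", "Terrible idea.", -3.0),
]

def domain(address):
    # regexp_replace(fromStr.value, ".*@", "")
    return re.sub(r".*@", "", address)

def first_sentence_words(body):
    # Crude stand-in for sentences(body.value)[0]: words of the first sentence only.
    first = re.split(r"[.!?]", body, maxsplit=1)[0]
    return re.findall(r"[A-Za-z']+", first)

def word_sentiment(rows):
    """Total sentiment per lowercased word, most negative first."""
    totals = defaultdict(float)
    for sender, body, sentiment in rows:
        if domain(sender) != "enron.com":
            continue
        for word in first_sentence_words(body):
            totals[word.lower()] += sentiment
    # ORDER BY totalsentiment ASC
    return sorted(totals.items(), key=lambda kv: kv[1])

result = word_sentiment(rows)
print(result)
```

As in the query, a word that appears in several emails accumulates the whole-message sentiment of each one, so very common words can end up with large totals in either direction.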
