Lancaster UCREL Summer School 2017 – Big Data and NLP
TRANSCRIPT
UCREL Summer School |
Daniel Kershaw, Senior Data Scientist, Recommender Systems
@danjamker
www.danjamker.com
About
• Part 1 – 30 minutes
  • Big Data (what is it?)
  • MapReduce
  • Spark
  • Document similarity
• Part 2 – 1 hour
  • Downloading Zeppelin with Docker
  • Read the document set and extract data
  • Tokenize
  • Implement document similarity
  • Cosine similarity between documents
Outline
Set up Docker:
sudo docker pull epahomov/docker-zeppelin

Download the Zeppelin notebook: https://www.dropbox.com/s/161hpz02cafblsg/SDOA.json?dl=0
First
“There were 5 exabytes of information created between the dawn of civilization and 2003, but that much information is now created every 2 days.”
Eric Schmidt, Google, 2010
Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100 TB/month (3/2009)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN’s Large Hadron Collider (LHC) generates 15 PB a year
How much data?
What is Big Data
“Too big to fit in an Excel spreadsheet”
Professor Steven Weber, UC Berkeley School of Information
What is Big Data
“Big data means data that cannot fit easily into a standard relational database”
Hal Varian, Chief Economist, Google
What is Big Data
“The term ‘Big Data’ applies to information that can’t be processed or analysed using traditional processes or tools”
Professor Steven Weber, UC Berkeley School of Information
• Volume
• Velocity
• Variety
• Exhaustive
• Veracity
• Relational & Indexical
• Flexible
The Big V’s
Wikipedia
Hansard
Enron Email Corpus
Reddit Data Release
Twitter Data Set

Examples of Big / Large Data (NLP)

ScienceDirect Corpus
Mendeley User Catalogs
Engineering Village
User interaction logs
Funding data
EVISE
Scaling up Computation
Server: CPUs (Xeon), RAM (32 GB), disks (2 × 1 TB)
Rack: 40–80 servers, networked together, UPS (power supply)
• How do we split computation across nodes?
• Network and data locality
• How do we deal with failures?
• If 1 server fails every 3 years, 10k nodes would see about 10 failures a day
• How do we deal with slow machines?
Programming at Scale
Map Reduce - Mapper
Takes a series of <key, value> tuples
Processes each tuple
Outputs 0 or more <key, value> tuples
Map Reduce - Reducer
Called once for each unique <key, [value]>
Iterates through each value
Outputs 0 or more results as <key, value>
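The two roles can be sketched in plain Python; `mapper`, `reducer`, and `map_reduce` here are hypothetical stand-ins for the framework, which really shuffles data across the network between the two phases. Word count is the classic example:

```python
from collections import defaultdict

def mapper(key, value):
    # Takes one <key, value> tuple; outputs 0 or more <key, value> tuples.
    for word in value.split():
        yield (word, 1)

def reducer(key, values):
    # Called once per unique key with all its values; outputs 0 or more results.
    yield (key, sum(values))

def map_reduce(records, mapper, reducer):
    # The "shuffle": group mapper output by key before reducing.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    results = []
    for k, vs in sorted(groups.items()):
        results.extend(reducer(k, vs))
    return results

counts = map_reduce([(1, "big data"), (2, "big ideas")], mapper, reducer)
# counts == [("big", 2), ("data", 1), ("ideas", 1)]
```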
• Applications need more than one step
• The Google pipeline was 22 steps
• Analytic queries, e.g. k-means: 2–5 steps
• Iterative queries, e.g. PageRank: 10–20 steps
• Problems with performance and ease of development
Issues with Hadoop - Complexity
• Multiple map and reduce classes
• A lot of boilerplate code
• Easy to combine incorrectly
Issues with Hadoop - Usability
• One pass at a time
• Must write to HDFS between jobs
• Expensive to reuse data
• Must hand-optimize code to combine steps
Issues with Hadoop - Performance
• Resilient distributed datasets (RDDs)
  • Immutable, partitioned collections of objects
  • Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
  • Can be cached for efficient reuse
• Actions on RDDs
  • count, reduce, collect, save, …
Spark Model
SparkML
val train_data = // RDD of Vector
val model = KMeans.train(train_data, k=10)

// evaluate the model
val test_data = // RDD of Vector
test_data.map(t => model.predict(t)).collect().foreach(println)
• Interact with data like a table
• Inbuilt functions to:
  • Tokenize
  • Remove stop words
  • Apply a TF-IDF transformation
Spark Dataframes
Name | Age | Gender | Abstract
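As a rough plain-Python sketch of what those inbuilt functions do (the two documents and the stop-word list are made up; in Spark, Tokenizer, StopWordsRemover, and IDF do this per DataFrame column at scale):

```python
import math

docs = ["The apple computer", "The apple orchard"]
stopwords = {"the", "a", "an"}

# Tokenize: lowercase and split on whitespace.
tokens = [d.lower().split() for d in docs]
# Stop-word removal.
tokens = [[t for t in doc if t not in stopwords] for doc in tokens]

def tf_idf(doc, corpus):
    # Term frequency weighted by inverse document frequency.
    scores = {}
    for term in set(doc):
        tf = doc.count(term) / len(doc)
        df = sum(1 for d in corpus if term in d)
        scores[term] = tf * math.log(len(corpus) / df)
    return scores

weights = tf_idf(tokens[0], tokens)
# "apple" occurs in every document, so its weight is 0;
# "computer" is document-specific, so it scores higher.
```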
Part 2 – Document Similarity: Technical Workshop
Daniel Kershaw, 29th June 2017
• Download Apache Zeppelin
• Download datasets
• Read datasets
• Tokenize and remove stop words
• Read word vectors
Outline
• Clone the Docker image:
  docker pull epahomov/docker-zeppelin
• Run the Docker image:
  docker run -d -p 8080:8080 -p 7077:7077 -p 4040:4040 epahomov/docker-zeppelin
• Go to:
  localhost:8080
Install Apache Zeppelin
Document Embedding Similarity
A word is represented as a dense vector:

Apple → [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]

A document is represented as the sum (mean) of its words’ dense vectors:

  Apple    → [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
+ Mac      → [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
+ Computer → [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
= Document → [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
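That construction, sketched in plain Python with toy 3-dimensional vectors (real embeddings have hundreds of dimensions):

```python
vectors = {
    "apple":    [0.5, 0.6, 0.3],
    "mac":      [0.1, 0.6, 0.5],
    "computer": [0.9, 0.3, 0.4],
}

def document_vector(tokens, vectors):
    # Sum the word vectors element-wise, then divide by the count (the mean).
    known = [vectors[t] for t in tokens if t in vectors]
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]

doc = document_vector(["apple", "mac", "computer"], vectors)
# doc is approximately [0.5, 0.5, 0.4]
```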
Science Direct Open Access Corpus
Contains all content seen on the ScienceDirect front end
Available on GitHub

Extract PII (document ID)
Extract abstract

Use the Elsevier open-source XML parser
Extract fields with XPath & XQuery
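A minimal sketch of the extraction, using Python's standard-library XML parser on a made-up article record; the element names and PII value here are illustrative, not the real ScienceDirect schema:

```python
import xml.etree.ElementTree as ET

xml = """
<article>
  <pii>S0000000000000000</pii>
  <abstract>We study document similarity at scale.</abstract>
</article>
"""

root = ET.fromstring(xml)
pii = root.find("./pii").text            # the document ID
abstract = root.find("./abstract").text  # the abstract text
```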
Load Word Vectors
word      vector
apple     [0.2, 0.4, 0.8]
computer  [0.2, 0.4, 0.8]
mac       [0.2, 0.4, 0.8]
Google    [0.2, 0.4, 0.8]
this      [0.2, 0.4, 0.8]
DocID  Tokens
1      [apple, computer, mac]
2      [apple, computer, mac]
3      [apple, computer, mac]
4      [apple, computer, mac]
5      [apple, computer, mac]

Explode the tokens:

DocID  Tokens
1      apple
1      computer
1      mac
2      apple
2      computer
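The explode step, sketched in plain Python (in Spark this is the explode function applied to the Tokens column):

```python
rows = [
    (1, ["apple", "computer", "mac"]),
    (2, ["apple", "computer", "mac"]),
]

# Explode: emit one (DocID, token) row per token in each list.
exploded = [(doc_id, token) for doc_id, tokens in rows for token in tokens]
# exploded[:2] == [(1, "apple"), (1, "computer")]
```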
DocID  word
1      apple
1      computer
1      mac
2      apple
2      computer

word      vector
apple     [0.2, 0.4, 0.8]
computer  [0.2, 0.4, 0.8]
mac       [0.2, 0.4, 0.8]
Google    [0.2, 0.4, 0.8]
this      [0.2, 0.4, 0.8]

Join on words:

DocID  word      vector
1      apple     [0.2, 0.4, 0.8]
1      computer  [0.2, 0.4, 0.8]
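The join step, sketched with plain Python structures (the values are illustrative; in Spark this is an inner DataFrame join on the word column):

```python
exploded = [(1, "apple"), (1, "computer"), (2, "apple")]
vectors = {"apple": [0.2, 0.4, 0.8], "computer": [0.1, 0.3, 0.5]}

# Inner join on the word column: keep only rows whose word has a vector.
joined = [(doc_id, word, vectors[word])
          for doc_id, word in exploded if word in vectors]
# joined[0] == (1, "apple", [0.2, 0.4, 0.8])
```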
DocID  word      vector
1      apple     [0.2, 0.4, 0.8]
1      computer  [0.2, 0.4, 0.8]

Group by document ID, mean the vectors:

DocID  vector
1      [0.2, 0.4, 0.8]
2      [0.2, 0.4, 0.8]
3      [0.2, 0.4, 0.8]
4      [0.2, 0.4, 0.8]
5      [0.2, 0.4, 0.8]
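Group-and-mean, sketched in plain Python with toy values (in Spark: groupBy on DocID, then an element-wise average of each group's vectors):

```python
from collections import defaultdict

joined = [
    (1, "apple",    [0.2, 0.4, 0.8]),
    (1, "computer", [0.4, 0.2, 0.0]),
    (2, "mac",      [0.6, 0.6, 0.6]),
]

# Group the joined rows by document ID.
groups = defaultdict(list)
for doc_id, _word, vec in joined:
    groups[doc_id].append(vec)

# Mean the vectors element-wise within each group.
doc_vectors = {
    doc_id: [sum(col) / len(vecs) for col in zip(*vecs)]
    for doc_id, vecs in groups.items()
}
# doc_vectors[1] is approximately [0.3, 0.3, 0.4]
```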
• Cartesian join of documents
• Compute cosine similarity between each pair of documents
Identify similar documents
       1    2    3
1      0.4  0.6  0.6
2      0.5  0.4  0.7
3      0.6  0.1  0.3

DocID  vector
1      [0.2, 0.4, 0.8]
2      [0.2, 0.4, 0.8]
3      [0.2, 0.4, 0.8]
4      [0.2, 0.4, 0.8]
5      [0.2, 0.4, 0.8]

Join to self:

DocID  vector
1      [0.2, 0.4, 0.8]
2      [0.2, 0.4, 0.8]
3      [0.2, 0.4, 0.8]
4      [0.2, 0.4, 0.8]
5      [0.2, 0.4, 0.8]
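The self-join plus cosine similarity, sketched in plain Python with toy vectors (in Spark the pairs come from a Cartesian, i.e. cross, join of the document-vector table with itself):

```python
import math
from itertools import product

doc_vectors = {
    1: [0.2, 0.4, 0.8],
    2: [0.8, 0.4, 0.2],
    3: [0.2, 0.4, 0.8],
}

def cosine(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Cartesian join of the documents with themselves, then score each pair.
similarities = {
    (i, j): cosine(doc_vectors[i], doc_vectors[j])
    for i, j in product(doc_vectors, repeat=2)
}
# Identical vectors score ~1.0, e.g. similarities[(1, 3)].
```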