Lancaster UCREL Summer School 2017 – Big Data and NLP
TRANSCRIPT
UCREL Summer School |
Daniel Kershaw, Senior Data Scientist, Recommender Systems
@danjamker
www.danjamker.com
About
• Part 1 – 30 minutes
  • Big Data (what is it?)
  • MapReduce
  • Spark
  • Document similarity
• Part 2 – 1 hour
  • Downloading Zeppelin with Docker
  • Read the document set and extract data
  • Tokenize
  • Implement document similarity
  • Cosine similarity between documents
Outline
Set up Docker:
sudo docker pull epahomov/docker-zeppelin

Download the Zeppelin notebook: https://www.dropbox.com/s/161hpz02cafblsg/SDOA.json?dl=0
First
“There were 5 exabytes of information created between the dawn of civilization and 2003, but that much information is now created every 2 days.”
Eric Schmidt, Google, 2010
Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100 TB/month (3/2009)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN’s Large Hadron Collider (LHC) generates 15 PB a year
How much data?
What is Big Data
“Too big to fit in an Excel spreadsheet”
Professor Steven Weber, UC Berkeley School of Information
What is Big Data
“Big data means data that cannot fit easily into a standard relational database”
Hal Varian, Chief Economist, Google
What is Big Data
“The term ‘Big Data’ applies to information that can’t be processed or analysed using traditional processes or tools”
Professor Steven Weber, UC Berkeley School of Information
• Volume
• Velocity
• Variety
• Exhaustive
• Veracity
• Relational & Indexical
• Flexible
The Big V’s
Wikipedia
Hansard
Enron Email Corpus
Reddit Data Release
Twitter Data Set

Examples of Big / Large Data (NLP)

ScienceDirect Corpus
Mendeley User Catalogs
Engineering Village
User interaction logs
Funding data
EVISE
Scaling up Computation
Server: CPUs (Xeon), RAM (32 GB), disks (2 × 1 TB)
Rack: 40–80 servers, networked together, UPS (power supply)
• How do we split computation across nodes?
• Network and data locality
• How do we deal with failures?
• If 1 server fails every 3 years, 10k nodes would see about 10 failures a day
• How do we deal with slow machines?
Programming at Scale
Map Reduce - Mapper
Takes a series of <key, value> tuples
Processes each tuple
Outputs 0 or more <key, value> tuples
Map Reduce - Reducer
Called once for each unique <key, [value]>
Iterates through each value
Outputs 0 or more results as <key, value>
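The two roles can be sketched in plain Python; `mapper`, `reducer`, and `map_reduce` here are hypothetical stand-ins for the framework, which really shuffles data across the network between the two phases. Word count is the classic example:

```python
from collections import defaultdict

def mapper(key, value):
    # Takes one <key, value> tuple; outputs 0 or more <key, value> tuples.
    for word in value.split():
        yield (word, 1)

def reducer(key, values):
    # Called once per unique key with all its values; outputs 0 or more results.
    yield (key, sum(values))

def map_reduce(records, mapper, reducer):
    # The "shuffle": group mapper output by key before reducing.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    results = []
    for k, vs in sorted(groups.items()):
        results.extend(reducer(k, vs))
    return results

counts = map_reduce([(1, "big data"), (2, "big ideas")], mapper, reducer)
# counts == [("big", 2), ("data", 1), ("ideas", 1)]
```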
• Applications need more than one step
• The Google pipeline was 22 steps
• Analytic queries, e.g. k-means: 2–5 steps
• Iterative queries, e.g. PageRank: 10–20 steps
• Problems with performance and ease of development
Issues with Hadoop - Complexity
• Multiple map and reduce classes
• A lot of boilerplate code
• Easy to combine incorrectly
Issues with Hadoop - Usability
• One pass at a time
• Must write to HDFS between jobs
• Expensive to reuse data
• Must hand-optimize code to combine steps
Issues with Hadoop - Performance
• Resilient distributed datasets (RDDs)
  • Immutable, partitioned collections of objects
  • Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
  • Can be cached for efficient reuse
• Actions on RDDs
  • count, reduce, collect, save, …
Spark Model
SparkML
val train_data = // RDD of Vector
val model = KMeans.train(train_data, k=10)

// evaluate the model
val test_data = // RDD of Vector
test_data.map(t => model.predict(t)).collect().foreach(println)
• Interact with data like a table
• Inbuilt functions to:
  • Tokenize
  • Remove stop words
  • Apply a TF-IDF transformation
Spark Dataframes
Name | Age | Gender | Abstract
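As a rough plain-Python sketch of what those inbuilt functions do (the two documents and the stop-word list are made up; in Spark, Tokenizer, StopWordsRemover, and IDF do this per DataFrame column at scale):

```python
import math

docs = ["The apple computer", "The apple orchard"]
stopwords = {"the", "a", "an"}

# Tokenize: lowercase and split on whitespace.
tokens = [d.lower().split() for d in docs]
# Stop-word removal.
tokens = [[t for t in doc if t not in stopwords] for doc in tokens]

def tf_idf(doc, corpus):
    # Term frequency weighted by inverse document frequency.
    scores = {}
    for term in set(doc):
        tf = doc.count(term) / len(doc)
        df = sum(1 for d in corpus if term in d)
        scores[term] = tf * math.log(len(corpus) / df)
    return scores

weights = tf_idf(tokens[0], tokens)
# "apple" occurs in every document, so its weight is 0;
# "computer" is document-specific, so it scores higher.
```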
Part 2 – Document Similarity: Technical Workshop
Daniel Kershaw, 29th June 2017
• Download Apache Zeppelin
• Download datasets
• Read datasets
• Tokenize and remove stop words
• Read word vectors
Outline
• Clone the Docker image:
  docker pull epahomov/docker-zeppelin
• Run the Docker image:
  docker run -d -p 8080:8080 -p 7077:7077 -p 4040:4040 epahomov/docker-zeppelin
• Go to:
  localhost:8080
Install Apache Zeppelin
Document Embedding Similarity
A word is represented as a dense vector:

Apple → [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]

A document is represented as the sum (mean) of its words’ dense vectors:

  Apple    → [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
+ Mac      → [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
+ Computer → [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
= Document → [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
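That construction, sketched in plain Python with toy 3-dimensional vectors (real embeddings have hundreds of dimensions):

```python
vectors = {
    "apple":    [0.5, 0.6, 0.3],
    "mac":      [0.1, 0.6, 0.5],
    "computer": [0.9, 0.3, 0.4],
}

def document_vector(tokens, vectors):
    # Sum the word vectors element-wise, then divide by the count (the mean).
    known = [vectors[t] for t in tokens if t in vectors]
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]

doc = document_vector(["apple", "mac", "computer"], vectors)
# doc is approximately [0.5, 0.5, 0.4]
```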
Science Direct Open Access Corpus
Contains all content seen on the ScienceDirect front end
Available on GitHub

Extract PII (document ID)
Extract abstract

Use the Elsevier open-source XML parser
Extract fields with XPath & XQuery
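A minimal sketch of the extraction, using Python's standard-library XML parser on a made-up article record; the element names and PII value here are illustrative, not the real ScienceDirect schema:

```python
import xml.etree.ElementTree as ET

xml = """
<article>
  <pii>S0000000000000000</pii>
  <abstract>We study document similarity at scale.</abstract>
</article>
"""

root = ET.fromstring(xml)
pii = root.find("./pii").text            # the document ID
abstract = root.find("./abstract").text  # the abstract text
```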
Load Word Vectors
word      vector
apple     [0.2, 0.4, 0.8]
computer  [0.2, 0.4, 0.8]
mac       [0.2, 0.4, 0.8]
Google    [0.2, 0.4, 0.8]
this      [0.2, 0.4, 0.8]
DocID  Tokens
1      [apple, computer, mac]
2      [apple, computer, mac]
3      [apple, computer, mac]
4      [apple, computer, mac]
5      [apple, computer, mac]

Explode the tokens:

DocID  Tokens
1      apple
1      computer
1      mac
2      apple
2      computer
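The explode step, sketched in plain Python (in Spark this is the explode function applied to the Tokens column):

```python
rows = [
    (1, ["apple", "computer", "mac"]),
    (2, ["apple", "computer", "mac"]),
]

# Explode: emit one (DocID, token) row per token in each list.
exploded = [(doc_id, token) for doc_id, tokens in rows for token in tokens]
# exploded[:2] == [(1, "apple"), (1, "computer")]
```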
DocID  word
1      apple
1      computer
1      mac
2      apple
2      computer

word      vector
apple     [0.2, 0.4, 0.8]
computer  [0.2, 0.4, 0.8]
mac       [0.2, 0.4, 0.8]
Google    [0.2, 0.4, 0.8]
this      [0.2, 0.4, 0.8]

Join on words:

DocID  word      vector
1      apple     [0.2, 0.4, 0.8]
1      computer  [0.2, 0.4, 0.8]
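The join step, sketched with plain Python structures (the values are illustrative; in Spark this is an inner DataFrame join on the word column):

```python
exploded = [(1, "apple"), (1, "computer"), (2, "apple")]
vectors = {"apple": [0.2, 0.4, 0.8], "computer": [0.1, 0.3, 0.5]}

# Inner join on the word column: keep only rows whose word has a vector.
joined = [(doc_id, word, vectors[word])
          for doc_id, word in exploded if word in vectors]
# joined[0] == (1, "apple", [0.2, 0.4, 0.8])
```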
DocID  word      vector
1      apple     [0.2, 0.4, 0.8]
1      computer  [0.2, 0.4, 0.8]

Group by document ID, mean the vectors:

DocID  vector
1      [0.2, 0.4, 0.8]
2      [0.2, 0.4, 0.8]
3      [0.2, 0.4, 0.8]
4      [0.2, 0.4, 0.8]
5      [0.2, 0.4, 0.8]
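Group-and-mean, sketched in plain Python with toy values (in Spark: groupBy on DocID, then an element-wise average of each group's vectors):

```python
from collections import defaultdict

joined = [
    (1, "apple",    [0.2, 0.4, 0.8]),
    (1, "computer", [0.4, 0.2, 0.0]),
    (2, "mac",      [0.6, 0.6, 0.6]),
]

# Group the joined rows by document ID.
groups = defaultdict(list)
for doc_id, _word, vec in joined:
    groups[doc_id].append(vec)

# Mean the vectors element-wise within each group.
doc_vectors = {
    doc_id: [sum(col) / len(vecs) for col in zip(*vecs)]
    for doc_id, vecs in groups.items()
}
# doc_vectors[1] is approximately [0.3, 0.3, 0.4]
```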
• Cartesian join of documents
• Compute cosine similarity between each pair of documents
Identify similar documents
       1    2    3
1      0.4  0.6  0.6
2      0.5  0.4  0.7
3      0.6  0.1  0.3

DocID  vector
1      [0.2, 0.4, 0.8]
2      [0.2, 0.4, 0.8]
3      [0.2, 0.4, 0.8]
4      [0.2, 0.4, 0.8]
5      [0.2, 0.4, 0.8]

Join to self:

DocID  vector
1      [0.2, 0.4, 0.8]
2      [0.2, 0.4, 0.8]
3      [0.2, 0.4, 0.8]
4      [0.2, 0.4, 0.8]
5      [0.2, 0.4, 0.8]
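The self-join plus cosine similarity, sketched in plain Python with toy vectors (in Spark the pairs come from a Cartesian, i.e. cross, join of the document-vector table with itself):

```python
import math
from itertools import product

doc_vectors = {
    1: [0.2, 0.4, 0.8],
    2: [0.8, 0.4, 0.2],
    3: [0.2, 0.4, 0.8],
}

def cosine(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Cartesian join of the documents with themselves, then score each pair.
similarities = {
    (i, j): cosine(doc_vectors[i], doc_vectors[j])
    for i, j in product(doc_vectors, repeat=2)
}
# Identical vectors score ~1.0, e.g. similarities[(1, 3)].
```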