
  • Data Science: Data Analysis Boot Camp
    Textual Analysis

    Chuck Cartledge, PhD

    8 February 2020

    1/35

  • 2/35

    Table of contents (1 of 1)

    1 Intro.
    2 Background
        Contextualize
    3 Hands-on
        Examples from the text
        A little silliness
    4 Q & A
    5 Conclusion
    6 References
    7 Files
    8 Misc.
        Equations

  • 3/35

    What are we going to cover?

    We’re going to talk about:

    Differences between numerical and textual data analysis.

    Define common textual data analysis terms and ideas.

    Use different textual analysis tools (knn, naïve Bayes, logit, and support vector machines).

  • 4/35

    Contextualize

    Processing textual data is messy.

    With numerical data, there are a limited number of ways to get data ready for analysis:

    1 Ignore records that are missing/incomplete

    2 Fill in missing values (mean, mode, estimated)

    3 Accept incomplete records and adjust the uncertainties

    Textual data is harder. Data may be complete, but very hard to get ready for analysis.

  • 5/35

    Contextualize

    Textual “data wrangling”

    There are a few “normal” processing steps to prepare textual data for analysis:

    1 Change all text to the same case (usually lower case)

    2 Remove all non-textual glyphs (punctuation marks and so on)

    3 Remove all numbers

    4 Remove all “stop words” (stop words are language and domain specific)

    5 Remove all “white space”

    6 Apply stemming techniques to what remains

  • 6/35

    Contextualize

    What does all this mean?

    A sentence that starts like this [6]:

    Text Mining (or Text Analytics) applies analytic tools to learn from collections of text data, like social media, books, newspapers, emails, etc.

    Ends up like this:

    text min text analyt appli analyt tool learn collect text data like social media book newspap email etc

  • 7/35

    Contextualize

    A few definitions [2]

    TF: Term Frequency, which measures how frequently a term occurs in a document.

        tf(t, d) = (number of times the term t appears in document d) / (total number of terms in document d)

    IDF: Inverse Document Frequency, which measures how important a term is (whether the term is common or rare across all documents).

        idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

        D : the corpus, a collection of documents
        N : total number of documents in the corpus, N = |D|
        |{d ∈ D : t ∈ d}| : number of documents in which the term t appears (i.e., tf(t, d) ≠ 0)

  • 8/35

    Contextualize

    TF and IDF for the sample string.

    The terms:

    [1] "analyt" "appli" "book" "collect" "data" "email" "etc"
    [8] "learn" "like" "media" "mine" "newspap" "social" "text"
    [15] "tool"

    The frequency of each term:

    [1] 2 1 1 1 1 1 1 1 1 1 1 1 1 3 1

    IDF is not a very useful metric with only one document:

    weightTfIdf(TermDocumentMatrix(corp1))$v
    ==> named numeric(0)

    (See next slide for code.)
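    A quick worked check using these counts: the stemmed term “text” appears 3 times among the 18 stemmed tokens, so tf(text, d) = 3/18 ≈ 0.17. With a single document, N = 1 and every term appears in that one document, so idf(t, D) = log(1/1) = 0 for every term. Every tf-idf weight is therefore zero, and since zeros are not stored in the sparse representation, the weight vector above comes back empty.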

  • 9/35

    Contextualize

    R script to create sample text “normalization”

    library(NLP)

    library(tm)

    a
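    Everything after the assignment to a was lost in extraction. A minimal sketch that follows the slide’s steps, assuming the input is the quoted sentence from [6] (a reconstruction, not the original script):

    library(NLP)
    library(tm)

    a <- "Text Mining (or Text Analytics) applies analytic tools to learn from collections of text data, like social media, books, newspapers, emails, etc."

    corp1 <- VCorpus(VectorSource(a))                          # one-document corpus
    corp1 <- tm_map(corp1, content_transformer(tolower))       # same case
    corp1 <- tm_map(corp1, removePunctuation)                  # non-textual glyphs
    corp1 <- tm_map(corp1, removeNumbers)                      # numbers
    corp1 <- tm_map(corp1, removeWords, stopwords("english"))  # stop words
    corp1 <- tm_map(corp1, stripWhitespace)                    # white space
    corp1 <- tm_map(corp1, stemDocument)                       # stemming (needs SnowballC)

    inspect(TermDocumentMatrix(corp1))                         # terms and frequencies
    weightTfIdf(TermDocumentMatrix(corp1))$v                   # ==> named numeric(0)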

  • 10/35

    Examples from the text

    What’s happening in the beginning?

    We gather up a predefined set of documents, save them locally, and create a term frequency object:

    tempFile
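    The code after tempFile was truncated in extraction. A sketch of the described steps, assuming the 2,000-document movie review corpus summarized on the following slides; the URL is a placeholder, not the original source:

    library(tm)

    tempFile <- tempfile(fileext = ".zip")
    download.file("http://example.com/reviews.zip", tempFile)   # placeholder URL
    unzip(tempFile, exdir = "reviews")

    processed <- VCorpus(DirSource("reviews"))
    processed <- tm_map(processed, content_transformer(tolower))
    processed <- tm_map(processed, removePunctuation)
    processed <- tm_map(processed, removeWords, stopwords("english"))
    processed <- tm_map(processed, stemDocument)                # terms come out stemmed

    dtm <- DocumentTermMatrix(processed)
    Frequencies <- colSums(as.matrix(dtm))                      # corpus-wide term counts
    DocFrequencies <- colSums(as.matrix(dtm) > 0)               # documents containing each term
    head(Frequencies[order(Frequencies, decreasing = TRUE)], 5)
    head(DocFrequencies[order(DocFrequencies, decreasing = TRUE)], 5)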

  • 11/35


    Examples from the text

    Afterwards we look at the corpus:

    [1] " -- Dumping the object: processed (of type: list, class: VCorpus)"

    [2] " -- Dumping the object: processed (of type: list, class: Corpus)"

    Metadata: corpus specific: 0, document level (indexed): 0

    Content: documents: 2000

    [1] " -- Dumping the object:

    head(Frequencies[order(Frequencies, decreasing = T)], 5)

    (of type: double, class: numeric)"

    film movi one like charact

    11109 6857 5759 3998 3855

    [1] " -- Dumping the object:

    head(DocFrequencies[order(DocFrequencies, decreasing = T)], 5)

    (of type: double, class: numeric)"

    film one movi like charact

    1797 1763 1642 1538 1431

    We now know the most common terms across the 2,000 documents in the corpus.

  • 12/35


    Examples from the text

    Gathering a few corpus statistics.

    It is easy to think about how terms and documents create a 2-dimensional array.

    [1] " -- Dumping the object: moreThanOnce (of type: integer, class: integer)"

    [1] 9748

    [1] " -- Dumping the object: total (of type: integer, class: integer)"

    [1] 30585

    [1] " -- Dumping the object: prop (of type: double, class: numeric)"

    [1] 0.3187183

    [1] " -- Dumping the object: ncol(SparseRemoved) (of type: integer, class: integer)"

    [1] 202

    [1] " -- Dumping the object: sum(rowSums(as.matrix(SparseRemoved)) == 0)

    (of type: integer, class: integer)"

    [1] 0

    [1] " -- Dumping the object: colnames(SparseRemoved) (of type: character, class: character)"

    [1] "act" "action" "actor" "actual" "almost" "along"

    Columns that have only one entry are assumed to not be too interesting.

  • 13/35

    Examples from the text

    Create a dataframe with all the data

    quality
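    The assignment to quality was truncated. A minimal sketch, assuming quality is the 0/1 review label used by the later confusion matrices, SparseRemoved is the cleaned document-term matrix from the previous slide, and lengths (used on the logit slide) is the document length; the label order, seed, and 50/50 split are guesses:

    quality <- c(rep(0, 1000), rep(1, 1000))    # hypothetical 0/1 review labels
    lengths <- rowSums(as.matrix(dtm))          # total terms per document
    DF <- as.data.frame(as.matrix(SparseRemoved))
    DF$quality <- quality
    DF$lengths <- lengths

    set.seed(1234)                              # arbitrary seed
    train <- sample(nrow(DF), nrow(DF) / 2)
    TrainDF <- DF[train, ]
    TestDF  <- DF[-train, ]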

  • 14/35

    Examples from the text

    How well do knn classifiers do? (1 of 2)

    The code:

    Class3n
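    The code was truncated after Class3n. A sketch, assuming class::knn() over the term columns with the training data classified against itself (which is what the confusion matrices on the next slide compare); k = 3 and k = 5 are inferred from the names Class3n and Class5n:

    library(class)
    library(caret)

    wordCols <- setdiff(colnames(TrainDF), c("quality", "lengths"))

    Class3n <- knn(train = TrainDF[, wordCols], test = TrainDF[, wordCols],
                   cl = as.factor(TrainDF$quality), k = 3)
    Class5n <- knn(train = TrainDF[, wordCols], test = TrainDF[, wordCols],
                   cl = as.factor(TrainDF$quality), k = 5)

    confusionMatrix(Class3n, as.factor(TrainDF$quality))
    confusionMatrix(Class5n, as.factor(TrainDF$quality))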

  • 15/35


    Examples from the text

    How well do knn classifiers do? (2 of 2)

    The results:

    [1] " -- Dumping the object: confusionMatrix(Class3n,

    as.factor(TrainDF$quality)) (of type: list,

    class: confusionMatrix)"

    Confusion Matrix and Statistics

    Reference

    Prediction 0 1

    0 358 126

    1 134 382

    Accuracy : 0.74

    ...

    [1] " -- Dumping the object: confusionMatrix(Class5n,

    as.factor(TrainDF$quality)) (of type: list,

    class: confusionMatrix)"

    Confusion Matrix and Statistics

    Reference

    Prediction 0 1

    0 336 162

    1 156 346

    Accuracy : 0.682

  • 16/35

    Examples from the text

    How well will a naïve Bayes classifier do? (1 of 2)

    The code:

    model
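    Truncated after model. A sketch assuming e1071::naiveBayes() on the term columns, matching the confusionMatrix() calls dumped on the next slide:

    library(e1071)
    library(caret)

    model <- naiveBayes(x = TrainDF[, wordCols], y = as.factor(TrainDF$quality))

    classifNB <- predict(model, TrainDF[, wordCols])
    confusionMatrix(as.factor(TrainDF$quality), classifNB)

    classifNB <- predict(model, TestDF[, wordCols])
    confusionMatrix(as.factor(TestDF$quality), classifNB)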

  • 17/35


    Examples from the text

    How well will a naïve Bayes classifier do? (2 of 2)

    The results:

    [1] " -- Dumping the object: confusionMatrix(

    as.factor(TrainDF$quality), classifNB)

    (of type: list, class: confusionMatrix)"

    Confusion Matrix and Statistics

    Reference

    Prediction 0 1

    0 353 139

    1 74 434

    Accuracy : 0.787

    ...

    [1] " -- Dumping the object: confusionMatrix(

    as.factor(TestDF$quality), classifNB)

    (of type: list, class: confusionMatrix)"

    Confusion Matrix and Statistics

    Reference

    Prediction 0 1

    0 335 173

    1 120 372

    Accuracy : 0.707

  • 18/35

    Examples from the text

    How well will logistic regression (logit) do? (1 of 2)

    The code:

    model
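    Truncated after model. The summary on the next slide confirms a logistic regression of quality on review length; the classification threshold is inferred, and the higher accuracies in the later confusion matrices suggest a richer model may also have been fit:

    model <- glm(quality ~ lengths, family = binomial, data = TrainDF)
    summary(model)

    TrainDF$classif <- as.numeric(predict(model, TrainDF, type = "response") > 0.5)
    TestDF$classif  <- as.numeric(predict(model, TestDF,  type = "response") > 0.5)

    library(caret)
    confusionMatrix(as.factor(TrainDF$quality), as.factor(TrainDF$classif))
    confusionMatrix(as.factor(TestDF$quality), as.factor(TestDF$classif))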

  • 19/35

    Examples from the text

    How well will logistic regression (logit) do? (2 of 2)

    The results:

    [1] " -- Dumping the object: summary(model) (of type: list, class: summary.glm)"

    glm(formula = quality ~ lengths, family = binomial)

    ...

    Estimate Std. Error z value Pr(>|z|)

    (Intercept) -0.6383373 0.1171536 -5.449 5.07e-08 ***

    lengths 0.0018276 0.0003113 5.871 4.32e-09 ***

    ...

    [1] " -- Dumping the object: tbl (of type: integer, class: table)"

    quality

    classif 0 1

    0 614 507

    1 386 493

    ...

    [1] " -- Dumping the object: confusionMatrix(TrainDF$quality, TrainDF$classif) (of type: list, class: confusionMatrix)"

    Confusion Matrix and Statistics

    Reference

    Prediction 0 1

    0 418 74

    1 69 439

    Accuracy : 0.857

    ...

    [1] " -- Dumping the object: confusionMatrix(TestDF$quality, TestDF$classif) (of type: list, class: confusionMatrix)"

    Confusion Matrix and Statistics

    Reference

    Prediction 0 1

    0 377 131

    1 145 347

    Accuracy : 0.724

  • 20/35

    Examples from the text

    How well will a support vector machine (svm) do? (1 of 2)

    “The support vector classifier is a natural approach for classification in the two-class setting, if the boundary between the two classes is linear.”

    James, et al. [1]

    The code:

    modelSVM
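    Truncated after modelSVM. A sketch assuming e1071::svm() with its default kernel on the term columns (kernel and cost are not shown on the slide):

    library(e1071)
    library(caret)

    trainSVM <- TrainDF[, c(wordCols, "quality")]
    trainSVM$quality <- as.factor(trainSVM$quality)
    modelSVM <- svm(quality ~ ., data = trainSVM)

    classifSVMtrain <- predict(modelSVM, TrainDF)
    classifSVMtest  <- predict(modelSVM, TestDF)

    confusionMatrix(as.factor(TrainDF$quality), classifSVMtrain)
    confusionMatrix(as.factor(TestDF$quality), classifSVMtest)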

  • 21/35


    Examples from the text

    How well will a support vector machine (svm) do? (2 of 2)

    The results:

    [1] " -- Dumping the object: confusionMatrix(

    TrainDF$quality, classifSVMtrain)

    (of type: list, class: confusionMatrix)"

    Confusion Matrix and Statistics

    Reference

    Prediction 0 1

    0 449 43

    1 38 470

    Accuracy : 0.919

    ...

    [1] " -- Dumping the object: confusionMatrix(

    TestDF$quality, classifSVMtest)

    (of type: list, class: confusionMatrix)"

    Confusion Matrix and Statistics

    Reference

    Prediction 0 1

    0 378 130

    1 146 346

    Accuracy : 0.724

  • 22/35

    A little silliness

    Looking at term frequency in a PDF.

    We will do a few things:

    1 Read text directly from a PDF.

    2 “Normalize” the text.

    3 Look at the text in different ways.

    (From the file: chapter-13-textual-silliness.R)

    Attached file.
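    A minimal sketch of step 1, using tm’s readPDF reader as in the book chapter reproduced later in this transcript; the folder name is a placeholder and the pdftotext utility must be installed (the actual script is the attached chapter-13-textual-silliness.R):

    library(tm)
    silly <- Corpus(DirSource("pdfDir", pattern = "[.]pdf$"),   # placeholder folder
                    readerControl = list(reader = readPDF))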

  • 23/35


    A little silliness

    Same image.

    Attached file.

  • 24/35


    A little silliness

    Look at text frequency as a B&W word cloud

    Attached file.

  • 25/35


    A little silliness

    Look at text frequency as a color word cloud

    Attached file.

  • 26/35


    A little silliness

    More colorful examples from Romeo and Juliet

    Attached file (wordCloud.pdf).

  • 27/35

    Q & A time.

    Q: How do you catch a unique rabbit?
    A: Unique up on it!

    Q: How do you catch a tame rabbit?
    A: The tame way!

  • 28/35

    What have we covered?

    Compared and contrasted numerical and textual data analysis
    Provided a few numerical definitions (TF, IDF) that are fundamental to textual analysis
    Applied different textual analysis tools and techniques (knn, naïve Bayes, logit, and support vector machine)
    Looked at different graphical ways textual data could be displayed

    Next: Serial vs. parallel processing

  • 29/35

    References (1 of 2)

    [1] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning, vol. 6, Springer, 2013.

    [2] TF-IDF Staff, What does tf-idf mean?, http://www.tfidf.com/, 2017.

    [3] Wikipedia Staff, Logistic function, https://en.wikipedia.org/wiki/Logistic_function, 2017.

    [4] Wikipedia Staff, Naive Bayes classifier, https://en.wikipedia.org/wiki/Naive_Bayes_classifier, 2017.

  • 30/35

    References (2 of 2)

    [5] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Pearson Education India, 2006.

    [6] G. Williams, Hands-On Data Science with R: Text Mining, 2016.

  • 31/35

    Files of interest

    1 Revised textual analysis script
    2 Silly textual analysis script
    3 PDF file used with silly textual analysis script
    4 R library script file
    5 Other ways to display word clouds
    6 Code snippets

    rm(list=ls())

    library(lattice)
    library(ggplot2)
    library(NLP)
    library(tm)
    library(class)
    library(caret)
    library(e1071)
    library(topicmodels)
    library(qdapDictionaries)
    library(qdapRegex)
    library(qdapTools)
    library(RColorBrewer)
    library(qdap)
    library(psych)

    source("library.R")

    assignBinary threshold]

  • Hands-On Data Science with R

    Text Mining

    [email protected]

    10th January 2016

    Visit http://HandsOnDataScience.com/ for more Chapters.

    Text Mining (or Text Analytics) applies analytic tools to learn from collections of text data, like social media, books, newspapers, emails, etc. The goal can be considered to be similar to humans learning by reading such material. However, using automated algorithms we can learn from massive amounts of text, very much more than a human can. The material could consist of millions of newspaper articles to perhaps summarise the main themes and to identify those that are of most interest to particular people. Or we might be monitoring twitter feeds to identify emerging topics that we might need to act upon, as they emerge.

    The required packages for this chapter include:

    library(tm) # Framework for text mining.

    library(qdap) # Quantitative discourse analysis of transcripts.

    library(qdapDictionaries)

    library(dplyr) # Data wrangling, pipe operator %>%().

    library(RColorBrewer) # Generate palette of colours for plots.

    library(ggplot2) # Plot word frequencies.

    library(scales) # Include commas in numbers.

    library(Rgraphviz) # Correlation plots.

    As we work through this chapter, new R commands will be introduced. Be sure to review the command’s documentation and understand what the command does. You can ask for help using the ? command as in:

    ?read.csv

    We can obtain documentation on a particular package using the help= option of library():

    library(help=rattle)

    This chapter is intended to be hands on. To learn effectively, you are encouraged to have R running (e.g., RStudio) and to run all the commands as they appear here. Check that you get the same output, and you understand the output. Try some variations. Explore.

    Copyright © 2013-2015 Graham Williams. You can freely copy, distribute, or adapt this material, as long as the attribution is retained and derivative work is provided under the same license.

    http://HandsOnDataScience.com/

    http://creativecommons.org/licenses/by-nc-sa/4.0/


    1 Getting Started: The Corpus

    The primary package for text mining, tm (Feinerer and Hornik, 2015), provides a framework within which we perform our text mining. A collection of other standard R packages add value to the data processing and visualizations for text mining.

    The basic concept is that of a corpus. This is a collection of texts, usually stored electronically, and from which we perform our analysis. A corpus might be a collection of news articles from Reuters or the published works of Shakespeare. Within each corpus we will have separate documents, which might be articles, stories, or book volumes. Each document is treated as a separate entity or record.

    Documents which we wish to analyse come in many different formats. Quite a few formats are supported by tm (Feinerer and Hornik, 2015), the package we will illustrate text mining with in this module. The supported formats include text, PDF, Microsoft Word, and XML.

    A number of open source tools are also available to convert most document formats to text files. For our corpus used initially in this module, a collection of PDF documents were converted to text using pdftotext from the xpdf application, which is available for GNU/Linux, MS/Windows, and others. On GNU/Linux we can convert a folder of PDF documents to text with:

    system("for f in *.pdf; do pdftotext -enc ASCII7 -nopgbrk $f; done")

    The -enc ASCII7 ensures the text is converted to ASCII since otherwise we may end up with binary characters in our text documents.

    We can also convert Word documents to text using antiword, which is another application available for GNU/Linux.

    system("for f in *.doc; do antiword $f; done")


    1.1 Corpus Sources and Readers

    There are a variety of sources supported by tm. We can use getSources() to list them.

    getSources()

    ## [1] "DataframeSource" "DirSource" "URISource" "VectorSource"

    ## [5] "XMLSource" "ZipSource"

    In addition to different kinds of sources of documents, our documents for text analysis will come in many different formats. A variety are supported by tm:

    getReaders()

    ## [1] "readDOC" "readPDF"

    ## [3] "readPlain" "readRCV1"

    ## [5] "readRCV1asPlain" "readReut21578XML"

    ## [7] "readReut21578XMLasPlain" "readTabular"

    ## [9] "readTagged" "readXML"


    1.2 Text Documents

    We load a sample corpus of text documents. Our corpus consists of a collection of research papers all stored in the folder we identify below. To work along with us in this module, you can create your own folder called corpus/txt and place into that folder a collection of text documents. It does not need to be as many as we use here, but a reasonable number makes it more interesting.

    cname
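    The assignments were truncated in extraction; a sketch of the described setup (folder name taken from the text above):

    cname <- file.path(".", "corpus", "txt")
    length(dir(cname))                  # how many documents we have
    docs <- Corpus(DirSource(cname))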


    ## ai02.txt 2 PlainTextDocument list

    ## ai03.txt 2 PlainTextDocument list

    ## ai97.txt 2 PlainTextDocument list

    ....


    1.3 PDF Documents

    If instead of text documents we have a corpus of PDF documents, then we can use the readPDF() reader function to convert PDF into text and have that loaded as our Corpus.

    docs
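    The assignment was truncated; the usual tm pattern for this (a sketch; readPDF needs the pdftotext utility installed) is:

    docs <- Corpus(DirSource(cname, pattern = "[.]pdf$"),
                   readerControl = list(reader = readPDF))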


    1.4 Word Documents

    A simple open source tool to convert Microsoft Word documents into text is antiword. The separate antiword application needs to be installed, but once it is available it is used by tm to convert Word documents into text for loading into R.

    To load a corpus of Word documents we use the readDOC() reader function:

    docs
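    Again truncated; a sketch of the corresponding call:

    docs <- Corpus(DirSource(cname, pattern = "[.]doc$"),
                   readerControl = list(reader = readDOC))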


    2 Exploring the Corpus

    We can (and should) inspect the documents using inspect(). This will assure us that data has been loaded properly and as we expect.

    inspect(docs[16])

    ##

    ## Metadata: corpus specific: 0, document level (indexed): 0

    ## Content: documents: 1

    ##

    ## [[1]]

    ##

    ## Metadata: 7

    ## Content: chars: 44776

    viewDocs <- function(d, n) {d %>% extract2(n) %>% as.character() %>% writeLines()}  # extract2() comes from magrittr
    viewDocs(docs, 16)

    ## Hybrid weighted random forests for

    ## classifying very high-dimensional data

    ## Baoxun Xu1 , Joshua Zhexue Huang2 , Graham Williams2 and

    ## Yunming Ye1

    ## 1

    ##

    ....


    3 Preparing the Corpus

    We generally need to perform some pre-processing of the text data to prepare for the text analysis. Example transformations include converting the text to lower case, removing numbers and punctuation, removing stop words, stemming, and identifying synonyms. The basic transforms are all available within tm.

    getTransformations()

    ## [1] "removeNumbers" "removePunctuation" "removeWords"

    ## [4] "stemDocument" "stripWhitespace"

    The function tm_map() is used to apply one of these transformations across all documents within a corpus. Other transformations can be implemented using R functions and wrapped within content_transformer() to create a function that can be passed through to tm_map(). We will see an example of that in the next section.

    In the following sections we will apply each of the transformations, one-by-one, to remove unwanted characters from the text.


    3.1 Simple Transforms

    We start with some manual special transforms we may want to do. For example, we might want to replace “/”, used sometimes to separate alternative words, with a space. This will avoid the two words being run into one string of characters through the transformations. We might also replace “@” and “|” with a space, for the same reason.

    To create a custom transformation we make use of content_transformer() to create a function to achieve the transformation, and then apply it to the corpus using tm_map().

    toSpace
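    The definition was truncated; the transform the text describes (a sketch) is:

    toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
    docs <- tm_map(docs, toSpace, "/")
    docs <- tm_map(docs, toSpace, "@")
    docs <- tm_map(docs, toSpace, "\\|")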

    3.2 Conversion to Lower Case

    docs

    3.3 Remove Numbers

    docs

    3.4 Remove Punctuation

    docs

    3.5 Remove English Stop Words

    docs

    3.6 Remove Own Stop Words

    docs

    3.7 Strip Whitespace

    docs

    3.8 Specific Transformations

    We might also have some specific transformations we would like to perform. The examples here may or may not be useful, depending on how we want to analyse the documents. This is really for illustration using the part of the document we are looking at here, rather than suggesting this specific transform adds value.

    toString

    3.9 Stemming

    docs
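    Each assignment above was truncated in extraction. A consolidated sketch of sections 3.2 through 3.9 using tm’s standard transforms; the own-stop-word list and the toString() example strings are placeholders, not necessarily the book’s:

    docs <- tm_map(docs, content_transformer(tolower))           # 3.2
    docs <- tm_map(docs, removeNumbers)                          # 3.3
    docs <- tm_map(docs, removePunctuation)                      # 3.4
    docs <- tm_map(docs, removeWords, stopwords("english"))      # 3.5
    docs <- tm_map(docs, removeWords, c("department", "email"))  # 3.6 (placeholder list)
    docs <- tm_map(docs, stripWhitespace)                        # 3.7

    # 3.8: map specific phrases to a canonical form (illustrative only).
    toString <- content_transformer(function(x, from, to) gsub(from, to, x))
    docs <- tm_map(docs, toString, "harbin institut technolog", "HIT")

    docs <- tm_map(docs, stemDocument)                           # 3.9 (needs SnowballC)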


    4 Creating a Document Term Matrix

    A document term matrix is simply a matrix with documents as the rows and terms as the columns, and a count of the frequency of words as the cells of the matrix. We use DocumentTermMatrix() to create the matrix:

    dtm
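    Truncated; a minimal sketch:

    dtm <- DocumentTermMatrix(docs)
    inspect(dtm[1:5, 1000:1005])        # peek at a corner of the matrix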


    5 Exploring the Document Term Matrix

    We can obtain the term frequencies as a vector by converting the document term matrix into a matrix and summing the column counts:

    freq
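    Truncated; a sketch matching the description:

    freq <- colSums(as.matrix(dtm))     # term frequencies across the corpus
    length(freq)
    ord <- order(freq)
    freq[tail(ord)]                     # the most frequent terms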


    6 Distribution of Term Frequencies

    # Frequency of frequencies.

    head(table(freq), 15)

    ## freq

    ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

    ## 2381 1030 503 311 210 188 134 130 82 83 65 61 54 52 51

    tail(table(freq), 15)

    ## freq

    ## 483 544 547 555 578 609 611 616 703 709 776 887 1366 1446 3101

    ## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

    So we can see here that there are 2381 terms that occur just once.


    7 Conversion to Matrix and Save to CSV

    We can convert the document term matrix to a simple matrix for writing to a CSV file, for example, for loading the data into other software if we need to do so. To write to CSV we first convert the data structure into a simple matrix:

    m
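    Truncated; a sketch:

    m <- as.matrix(dtm)
    dim(m)
    write.csv(m, file = "dtm.csv")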


    8 Removing Sparse Terms

    We are often not interested in infrequent terms in our documents. Such “sparse” terms can be removed from the document term matrix quite easily using removeSparseTerms():

    dim(dtm)

    ## [1] 46 6508

    dtms
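    Truncated; a sketch (the 0.1 is illustrative: keep only terms with sparsity below 10%, i.e. terms appearing in at least 90% of the documents):

    dtms <- removeSparseTerms(dtm, 0.1)
    dim(dtms)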


    9 Identifying Frequent Items and Associations

    One thing we often want to do first is to get an idea of the most frequent terms in the corpus. We use findFreqTerms() to do this. Here we limit the output to those terms that occur at least 1,000 times:

    findFreqTerms(dtm, lowfreq=1000)

    ## [1] "data" "mine" "use"

    So that only lists a few. We can get more of them by reducing the threshold:

    findFreqTerms(dtm, lowfreq=100)

    ## [1] "accuraci" "acsi" "adr" "advers" "age"

    ## [6] "algorithm" "allow" "also" "analysi" "angioedema"

    ## [11] "appli" "applic" "approach" "area" "associ"

    ## [16] "attribut" "australia" "australian" "avail" "averag"

    ## [21] "base" "build" "call" "can" "care"

    ## [26] "case" "chang" "claim" "class" "classif"

    ....

    We can also find associations with a word, specifying a correlation limit.

    findAssocs(dtm, "data", corlimit=0.6)

    ## $data

    ## mine induct challeng know answer

    ## 0.90 0.72 0.70 0.65 0.64

    ## need statistician foundat general boost

    ## 0.63 0.63 0.62 0.62 0.61

    ## major mani come

    ....

    If two words always appear together then the correlation would be 1.0, and if they never appear together the correlation would be 0.0. Thus the correlation is a measure of how closely associated the words are in the corpus.


    10 Correlations Plots

    [Figure: correlation network graph of the 50 most frequent terms, with links where correlation ≥ 0.5.]

    plot(dtm,
         terms=findFreqTerms(dtm, lowfreq=100)[1:50],
         corThreshold=0.5)

    Rgraphviz (Hansen et al., 2016) from the BioConductor repository for R (bioconductor.org) is used to plot the network graph that displays the correlation between chosen words in the corpus. Here we choose 50 of the more frequent words as the nodes and include links between words when they have at least a correlation of 0.5.

    By default (without providing terms and a correlation threshold) the plot function chooses a random 20 terms with a threshold of 0.7.


    11 Correlations Plot—Options

    [Figure: the same correlation network graph, plotted with the options shown below.]

    plot(dtm,
         terms=findFreqTerms(dtm, lowfreq=100)[1:50],
         corThreshold=0.5)


    12 Plotting Word Frequencies

    We can generate the frequency count of all words in a corpus:

    # The head of this pipeline was truncated in extraction; the first two
    # lines and the frequency cutoff are reconstructed, not original.
    freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
    wf <- data.frame(word=names(freq), freq=freq)

    subset(wf, freq>500) %>%
      ggplot(aes(word, freq)) +
      geom_bar(stat="identity") +
      theme(axis.text.x=element_text(angle=45, hjust=1))

    [Figure: bar chart of the most frequent words (counts up to about 3,000); x axis word, y axis freq.]


    13 Word Clouds

    [Figure: word cloud of corpus terms with min.freq=40.]

    We can generate a word cloud as an effective alternative to providing a quick visual overview of the frequency of words in a corpus.

    The wordcloud package provides the required function.

    library(wordcloud)

    set.seed(123)

    wordcloud(names(freq), freq, min.freq=40)

    Notice the use of set.seed() only so that we can obtain the same layout each time; otherwise a random layout is chosen, which is not usually an issue.


    13.1 Reducing Clutter With Max Words

    [Figure: word cloud limited to the 100 most frequent words.]

    To increase or reduce the number of words displayed we can tune the value of max.words=. Here we have limited the display to the 100 most frequent words.

    set.seed(142)

    wordcloud(names(freq), freq, max.words=100)


    13.2 Reducing Clutter With Min Freq

    [Figure: word cloud of words occurring at least 100 times.]

    A more common approach to increase or reduce the number of words displayed is by tuning the value of min.freq=. Here we have limited the display to those words that occur at least 100 times.

    set.seed(142)

    wordcloud(names(freq), freq, min.freq=100)


    13.3 Adding Some Colour

    [Figure: the same word cloud, coloured with the Dark2 palette.]

    We can also add some colour to the display. Here we make use of brewer.pal() from RColorBrewer (Neuwirth, 2014) to generate a palette of colours to use.

    set.seed(142)

    wordcloud(names(freq), freq, min.freq=100, colors=brewer.pal(6, "Dark2"))


    13.4 Varying the Scaling

    [Figure: word cloud with an increased scale range, scale=c(5, .1).]

    We can change the range of font sizes used in the plot using the scale= option. By default the most frequent words have a scale of 4 and the least have a scale of 0.5. Here we illustrate the effect of increasing the scale range.

    set.seed(142)

    wordcloud(names(freq), freq, min.freq=100, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))


    13.5 Rotating Words

    [Figure: word cloud with 20% of the words rotated 90 degrees.]

    We can change the proportion of words that are rotated by 90 degrees from the default 10% to, say, 20% using rot.per=0.2.

    set.seed(142)

    dark2
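    The assignment was truncated; the pattern the section describes (a sketch):

    dark2 <- brewer.pal(6, "Dark2")
    wordcloud(names(freq), freq, min.freq=100, rot.per=0.2, colors=dark2)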


    14 Quantitative Analysis of Text

    The qdap (Rinker, 2015) package provides an extensive suite of functions to support the quantitative analysis of text.

    We can obtain simple summaries of a list of words, and to do so we will illustrate with the terms from our Term Document Matrix tdm. We first extract the shorter terms from each of our documents into one long word list. To do so we convert tdm into a matrix, extract the column names (the terms) and retain those shorter than 20 characters.

    # "words <-" and the piped object were truncated in extraction; dtm is
    # assumed here (its column names are the terms).
    words <- dtm %>%
      as.matrix %>%
      colnames %>%
      (function(x) x[nchar(x) < 20])

    We can then summarise the word list. Notice, in particular, the use of dist_tab() from qdap to generate frequencies and percentages.

    length(words)

    ## [1] 6456

    head(words, 15)

    ## [1] "aaai" "aab" "aad" "aadrbhtm" "aadrbltn"

    ## [6] "aadrhtmliv" "aai" "aam" "aba" "abbrev"

    ## [11] "abbrevi" "abc" "abcd" "abdul" "abel"

    summary(nchar(words))

    ## Min. 1st Qu. Median Mean 3rd Qu. Max.

    ## 3.000 5.000 6.000 6.644 8.000 19.000

    table(nchar(words))

    ##

    ## 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

    ## 579 867 1044 1114 935 651 397 268 200 138 79 63 34 28 22

    ## 18 19

    ## 21 16

    dist_tab(nchar(words))

    ## interval freq cum.freq percent cum.percent

    ## 1 3 579 579 8.97 8.97

    ## 2 4 867 1446 13.43 22.40

    ## 3 5 1044 2490 16.17 38.57

    ## 4 6 1114 3604 17.26 55.82

    ## 5 7 935 4539 14.48 70.31

    ## 6 8 651 5190 10.08 80.39

    ## 7 9 397 5587 6.15 86.54

    ## 8 10 268 5855 4.15 90.69

    ## 9 11 200 6055 3.10 93.79

    ## 10 12 138 6193 2.14 95.93

    ....


    14.1 Word Length Counts

    [Figure: histogram of word lengths (Number of Letters vs Number of Words), with the mean marked.]

    A simple plot is then effective in showing the distribution of the word lengths. Here we create a single column data frame that is passed on to ggplot() to generate a histogram, with a vertical line to show the mean length of words.

    data.frame(nletters=nchar(words)) %>%
      ggplot(aes(x=nletters)) +
      geom_histogram(binwidth=1) +
      geom_vline(xintercept=mean(nchar(words)), colour="green", size=1, alpha=.5) +
      labs(x="Number of Letters", y="Number of Words")


    14.2 Letter Frequency

    [Figure: horizontal bar chart of letter proportions (0% to 12%), from E (most frequent) down to Z, J, Q, X (least frequent).]

    Next we want to review the frequency of letters across all of the words in the discourse. Some data preparation will transform the vector of words into a list of letters, which we then construct a frequency count for, and pass this on to be plotted.

    We again use a pipeline to string together the operations on the data. Starting from the vector of words stored in words we split the words into characters using str_split() from stringr (Wickham, 2015), removing the first string (an empty string) from each of the results (using sapply()). Reducing the result into a simple vector, using unlist(), we then generate a data frame recording the letter frequencies, using dist_tab() from qdap. We can then plot the letter proportions.

library(dplyr)
library(stringr)

words %>%
    str_split("") %>%
    sapply(function(x) x[-1]) %>%
    unlist %>%
    dist_tab %>%
    mutate(Letter=factor(toupper(interval),
                         levels=toupper(interval[order(freq)]))) %>%
    ggplot(aes(Letter, weight=percent)) +
    geom_bar() +
    coord_flip() +
    labs(y="Proportion") +
    scale_y_continuous(breaks=seq(0, 12, 2),
                       label=function(x) paste0(x, "%"),
                       expand=c(0,0), limits=c(0,12))


    14.3 Letter and Position Heatmap

[Figure: heatmap of letter proportions by position within a word, Letter (A-Z) versus Position (1-19), shaded by Proportion from 0.000 to 0.020.]

The qheat() function from qdap provides an effective visualisation of tabular data. Here we transform the list of words into a position count of each letter, and construct a table of the proportions that is passed on to qheat() to do the plotting.

words %>%
    lapply(function(x) sapply(letters, gregexpr, x, fixed=TRUE)) %>%
    unlist %>%
    (function(x) x[x!=-1]) %>%
    (function(x) setNames(x, gsub("\\d", "", names(x)))) %>%
    (function(x) apply(table(data.frame(letter=toupper(names(x)),
                                        position=unname(x))),
                       1, function(y) y/length(x))) %>%
    qheat(high="green", low="yellow", by.column=NULL,
          values=TRUE, digits=3, plot=FALSE) +
    labs(y="Letter", x="Position") +
    theme(axis.text.x=element_text(angle=0)) +
    guides(fill=guide_legend(title="Proportion"))


    14.4 Miscellaneous Functions

We can derive the likely gender of each name in a list, using the genderdata package.

    devtools::install_github("lmullen/gender-data-pkg")

    name2sex(qcv(graham, frank, leslie, james, jacqui, jack, kerry, kerrie))

## The genderdata package needs to be installed.

## Error in install_genderdata_package(): Failed to install the genderdata package.

## Please try installing the package for yourself using the following command:

## install.packages("genderdata", repos = "http://packages.ropensci.org", type = "source")
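Once the data are available, the gender package itself can also be queried directly. A minimal sketch, where the choice of method (and its default year range) is an assumption to adapt:

library(gender)

# Look up first names against the US Social Security Administration data.
gender(c("graham", "frank", "leslie", "jacqui"), method="ssa")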


    15 Word Distances

Continuous bag of words (CBOW): Word2Vec associates each word in a vocabulary with a unique vector of real numbers of length d. Words that have a similar syntactic context appear closer together within the vector space. The syntactic context is based on a set of words within a specific window size.

    install.packages("tmcn.word2vec", repos="http://R-Forge.R-project.org")

    ## Installing package into ’/home/gjw/R/x86 64-pc-linux-gnu-library/3.2’

    ## (as ’lib’ is unspecified)

    ##

    ## The downloaded source packages are in

    ## '/tmp/Rtmpt1u3GR/downloaded_packages'

    library(tmcn.word2vec)

    model
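Once a model associates each word with a vector, similarity between two words is measured by the angle between their vectors, most commonly cosine similarity. A minimal sketch in base R, with the two vectors as made-up values:

# Cosine similarity between two word vectors: values near 1 indicate
# similar contexts, values near 0 indicate unrelated words.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

v.king  <- c(0.52, -0.13, 0.88, 0.07)   # assumed 4-dimensional vectors
v.queen <- c(0.48, -0.09, 0.91, 0.12)

cosine(v.king, v.queen)                 # close to 1 for these similar vectors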


    16 Review—Preparing the Corpus

Here in one sequence is collected the code to perform a text mining project. Notice that we would not necessarily do all of these steps, so pick and choose as appropriate to your situation. A sketch of a typical sequence follows the loading code below.

    # Required packages

    library(tm)

    library(wordcloud)

    # Locate and load the Corpus.

    cname
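A minimal sketch of such a sequence, assuming the corpus is a directory ./corpus/txt of plain-text files; the path and the particular transforms retained are assumptions to adapt:

library(SnowballC)                                   # provides stemDocument()

cname <- file.path(".", "corpus", "txt")             # assumed location
docs  <- Corpus(DirSource(cname))

# Prepare the Corpus (pick and choose as appropriate).
docs <- tm_map(docs, content_transformer(tolower))   # same case throughout
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)                   # stem what remains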


    17 Review—Analysing the Corpus

    # Document term matrix.

    dtm
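A minimal sketch of the usual analysis steps, assuming docs is the prepared corpus from the previous section; the frequency thresholds and the probe term "data" are assumptions to tune:

dtm  <- DocumentTermMatrix(docs)

# Term frequencies, most frequent terms first.
freq <- colSums(as.matrix(dtm))
head(sort(freq, decreasing=TRUE), 10)

findFreqTerms(dtm, lowfreq=100)            # frequent items
findAssocs(dtm, "data", corlimit=0.6)      # associated terms

wordcloud(names(freq), freq, min.freq=40)  # visualise as a word cloud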


    18 LDA

Topic models such as latent Dirichlet allocation (LDA) have been popular for text mining over the last 15 years, applied with varying degrees of success. Text is fed into LDA to extract the topics underlying the documents. Examples are the AP corpus and the Science corpus 1880-2002 (Blei and Lafferty 2009).

When is LDA applicable? It will fail on some data, and we need to choose the number of topics to find and how many documents are needed. How do we know the topics learned are the correct topics?

Two fundamental papers, independently discovered: Blei, Ng, and Jordan (NIPS 2001), with 11k citations; and Pritchard, Stephens, and Donnelly (Genetics, June 2000), with 14k citations. The models are exactly the same except for minor differences: topics versus population structures.

There is no theoretical analysis as such. How do we guarantee correct topics, and how efficient is the learning procedure?

    Observations:

    LDA won’t work on many short tweets or very few long documents.

    We should not liberally over-fit the LDA with too many redundant topics...

    Limiting factors:

We should use as many documents as we can, and short documents of fewer than 10 words won't work even if there are many of them. We need sufficiently long documents.

A small Dirichlet parameter helps, especially if we overfit. See Long Nguyen's keynote at PAKDD 2015 in Vietnam.

The number of documents is the most important factor.

Document length plays a useful role too.

Avoid overfitting: with too many topics you don't really learn anything, and a human then needs to cull the topics.

    New work detects new topics as they emerge.

    library(lda)

    ## Error in library(lda): there is no package called ’lda’

    # From demo(lda)

    library("ggplot2")

    library("reshape2")

    data(cora.documents)

    ## Warning in data(cora.documents): data set ’cora.documents’ not found

    data(cora.vocab)

    ## Warning in data(cora.vocab): data set ’cora.vocab’ not found


    theme_set(theme_bw())

    set.seed(8675309)

    K
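The fitting step follows the lda package's own demo; as a minimal sketch, with K and the sampler settings as assumed values (cora.documents and cora.vocab ship with the lda package):

# Fit LDA by collapsed Gibbs sampling (settings are assumptions).
K <- 10
result <- lda.collapsed.gibbs.sampler(cora.documents, K, cora.vocab,
                                      num.iterations=25,
                                      alpha=0.1, eta=0.1,
                                      compute.log.likelihood=TRUE)

# Top five words for each learned topic.
top.topic.words(result$topics, 5, by.score=TRUE)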


    19 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This chapter is one of many chapters available from http://HandsOnDataScience.com. In particular follow the links on the website with a *, which indicates the generally more developed chapters.

    Other resources include:

The Journal of Statistical Software article, Text Mining Infrastructure in R, is a good start: http://www.jstatsoft.org/v25/i05/paper

    Bilisoly (2008) presents methods and algorithms for text mining using Perl.

    Thanks also to Tony Nolan for suggestions of some of the examples used in this chapter.

    Some of the qdap examples were motivated by http://trinkerrstuff.wordpress.com/2014/10/31/exploration-of-letter-make-up-of-english-words/.


    20 References

Bilisoly R (2008). Practical Text Mining with Perl. Wiley Series on Methods and Applications in Data Mining. Wiley. ISBN 9780470382851. URL http://books.google.com.au/books?id=YkMFVbsrdzkC.

Feinerer I, Hornik K (2015). tm: Text Mining Package. R package version 0.6-2. URL https://CRAN.R-project.org/package=tm.

Hansen KD, Gentry J, Long L, Gentleman R, Falcon S, Hahne F, Sarkar D (2016). Rgraphviz: Provides plotting capabilities for R graph objects. R package version 2.12.0.

Neuwirth E (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. URL https://CRAN.R-project.org/package=RColorBrewer.

R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Rinker T (2015). qdap: Bridging the Gap Between Qualitative Data and Quantitative Analysis. R package version 2.2.4. URL https://CRAN.R-project.org/package=qdap.

Wickham H (2015). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.0.0. URL https://CRAN.R-project.org/package=stringr.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery. Use R! Springer, New York.

This document, sourced from TextMiningO.Rnw bitbucket revision 76, was processed by KnitR version 1.12 of 2016-01-06 and took 41.3 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04.3 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 8 cores and 12.3GB of RAM. It completed the processing 2016-01-10 09:58:57.



Creating Shaped Wordclouds Using R

Tidewater Big Data Enthusiasts

Chuck Cartledge, Developer

November 3, 2016 at 11:04pm

Contents

1 Introduction
2 Discussion
3 Conclusion
A Misc. files

List of Figures

1 A sample word cloud based on Romeo and Juliet.
2 A more interesting word cloud based on Romeo and Juliet.
3 An empty word cloud figure.
4 A filled word cloud figure.
5 A filled USA word cloud figure.
6 A collection of sample word clouds.

1 Introduction

The R library wordcloud provides an easy way to create an image showing how often a word (or tag) appears in a corpus (see Figure 1). In a word cloud, the size of a word indicates how often that word appears. Word cloud words can be colored as well.

While word clouds are easy to create, often the clouds could be shaped differently to create a more lasting and profound impression (see Figure 2).
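A minimal sketch of the basic call, assuming a prepared tm corpus docs as in the text-mining notes above:

library(tm)
library(wordcloud)

# Count how often each term appears across the corpus.
tdm  <- TermDocumentMatrix(docs)
freq <- rowSums(as.matrix(tdm))

# Size each word by its frequency; most frequent words in the centre.
wordcloud(names(freq), freq, min.freq=25, random.order=FALSE)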

    2 Discussion

The R library wordcloud2 (available from https://github.com/Lchiffon/wordcloud2) provides the capability of creating a word cloud that takes the shape of an image, or the shape of letters. The collection of predefined shapes includes:

• 'cardioid' – a heart shape

• 'circle' – the default

• 'diamond' – an alias for a square

• 'pentagon' – the five sided object

• 'star' – a five pointed star

• 'triangle' – a triangle with the wide base at the bottom

• 'triangle-forward' – a triangle with the wide base at the left

This collection of shapes (when combined with a user specified background color) may be enough to satisfy a wide variety of needs. But it is the figPath option that offers the most potential.

The figPath option can point to a figure that contains the image the cloud should fill.

Here are the steps to create an "interesting" shape to fill with a word cloud (a short sketch of the corresponding calls follows the steps):

1. Download/create an image with only two items (see Figure 3):

   • A white background, and
   • A black outline of the shape.

2. Fill the interior of the shape with the same color as the outline (see Figure 4).

3. Pass the location of the filled image as the figPath parameter (see Figure 5).
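A minimal sketch of the corresponding calls; demoFreq is a small word-frequency data frame shipped with wordcloud2, and usa.png is assumed to be a filled silhouette prepared as in the steps above:

library(wordcloud2)

wordcloud2(demoFreq, shape="star")           # one of the predefined shapes

# Fill a user-supplied silhouette via figPath.
wordcloud2(demoFreq, figPath="usa.png", color="darkblue")

letterCloud(demoFreq, word="USA")            # shape the cloud as letters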

Figure 1: A sample word cloud based on Romeo and Juliet. The image was created using the wordcloud function in the wordcloud library and the text from "Romeo and Juliet."

Figure 2: A more interesting word cloud based on Romeo and Juliet. The image was created using the wordcloud2 function in the wordcloud2 library and the text from "Romeo and Juliet."

Figure 3: An empty word cloud figure.

Figure 4: A filled word cloud figure.

Figure 5: A filled USA word cloud figure.

The wordcloud2 function behaves slightly differently than most of the other R plot functions that I've used. The result from both wordcloud2 and letterCloud is not displayable within R. These functions actually create an HTML page in a temporary directory with embedded JavaScript that performs the placement of the words within the shape, and provides a level of interaction after the page is displayed. R "understands" that the product from these functions is an HTML widget and starts up the default browser to show the page. The page and its sub-directories are removed when R ends.

The fact that the page uses JavaScript introduces some interesting aspects. Buried in the JavaScript used by the page to place the words in the cloud are a plethora of Math.random() calls. The JavaScript specification says that the Math.random() function has to return a value greater than or equal to 0, and less than 1, which is reasonable for a random function. The specification also says that the implementation of the random function is up to the JavaScript application, and does not specify how the numbers are to be generated. This means that the same HTML page, viewed in two different browsers, may generate two different sequences of random numbers. Most random number generators have the capability of setting a seed value so that a repeatable sequence can be generated; JavaScript does not support the idea of a random number seed. The HTML page and collection of directories can be moved to a server where they are available for use and support.

All of this means that each loading and viewing of the page will generate a different image, and there is no practical way to "get back" to an image that was good.

In the Files section (see Section A) is an R script and support files to work with. The R script was used to create various images (see Figure 6).

    3 Conclusion

The wordcloud2 library enables you to create word clouds of arbitrary shape inside an HTML page, using JavaScript to position and orient each word. Each HTML page and its associated library files are placed in individual directories that are removed when the creating R process terminates. Pages and files can be moved, or copied for safe keeping if desired. Because the pages use the Math.random() JavaScript function, each time the page is loaded, words will be positioned differently in the cloud. If the desired shape has an internal hole, then it is possible that some words may not be placed in the cloud.

    wordcloud2 allows you to create word clouds to support your data visualization needs.


(a) A heart. (b) The letters "USA". (c) A star. (d) The USA.

Figure 6: A collection of sample word clouds. These images were created with the attached R script.

A Misc. files

    The files used to create all these figures are attached to this report. They are:

    1. romeoAndJuliet.base64 – default text used to demonstrate the software

    2. heart.png – a heart shape with a hole

    3. usa.png – an outline of the continental United States

    4. wordCloud.R – an R script to demonstrate making word clouds
