
  • Data Science: Data Analysis Boot Camp
    Textual Analysis

    Chuck Cartledge, PhD

    8 February 2020

    1/35

  • 2/35

    Table of contents (1 of 1)

    1 Intro.
    2 Background
        Contextualize
    3 Hands-on
        Examples from the text
        A little silliness
    4 Q & A
    5 Conclusion
    6 References
    7 Files
    8 Misc.
        Equations

  • 3/35

    What are we going to cover?

    We’re going to talk about:

    Differences between numerical and textual data analysis.

    Define common textual data analysis terms and ideas.

    Use different textual analysis tools (knn, naïve Bayes, logit, and support vector machines).

  • 4/35

    Contextualize

    Processing textual data is messy.

    With numerical data, there are a limited number of ways to get data ready for analysis:

    1 Ignore records that are missing/incomplete

    2 Fill in missing values (mean, mode, estimated)

    3 Accept incomplete records and adjust the uncertainties

    Textual data is harder. Data may be complete, but very hard to get ready for analysis.

  • 5/35

    Contextualize

    Textual “data wrangling”

    There are a few “normal” processing steps to prepare textual data for analysis:

    1 Change all text to the same case (usually lower case)

    2 Remove all non-textual glyphs (punctuation marks and so on)

    3 Remove all numbers

    4 Remove all “stop words” (stop words are language and domain specific)

    5 Remove all “white space”

    6 Apply stemming techniques to what remains

  • 6/35

    Contextualize

    What does all this mean?

    A sentence that starts like this [6]:

    Text Mining (or Text Analytics) applies analytic tools to learn from collections of text data, like social media, books, newspapers, emails, etc.

    Ends up like this:

    text min text analyt appli analyt tool learn collect text data like social media book newspap email etc

  • 7/35

    Contextualize

    A few definitions [2]

    TF: Term Frequency, which measures how frequently a term occurs in a document.

        tf(t, d) = (number of times the term t appears in document d) / (total number of terms in document d)

    IDF: Inverse Document Frequency, which measures how important a term is (whether the term is common or rare across all documents).

        idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

        D : the corpus, a collection of documents
        N : total number of documents in the corpus, N = |D|
        |{d ∈ D : t ∈ d}| : number of documents in which the term t appears (i.e., tf(t, d) ≠ 0)

  • 8/35

    Contextualize

    TF and IDF for the sample string.

    The terms:

    [1] "analyt" "appli" "book" "collect" "data" "email" "etc"
    [8] "learn" "like" "media" "mine" "newspap" "social" "text"
    [15] "tool"

    The frequency of each term:

    [1] 2 1 1 1 1 1 1 1 1 1 1 1 1 3 1

    IDF is not a very useful metric with only one document:

    weightTfIdf(TermDocumentMatrix(corp1))$v
    ==> named numeric(0)

    (See next slide for code.)
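    A quick worked check using these counts: the stemmed term “text” appears 3 times among the 18 stemmed tokens, so tf(text, d) = 3/18 ≈ 0.17. With a single document, N = 1 and every term appears in that one document, so idf(t, D) = log(1/1) = 0 for every term. Every tf-idf weight is therefore zero, and since zeros are not stored in the sparse representation, the weight vector above comes back empty.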

  • 9/35

    Contextualize

    R script to create sample text “normalization”

    library(NLP)

    library(tm)

    a
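    Everything after the assignment to a was lost in extraction. A minimal sketch that follows the slide’s steps, assuming the input is the quoted sentence from [6] (a reconstruction, not the original script):

    library(NLP)
    library(tm)

    a <- "Text Mining (or Text Analytics) applies analytic tools to learn from collections of text data, like social media, books, newspapers, emails, etc."

    corp1 <- VCorpus(VectorSource(a))                          # one-document corpus
    corp1 <- tm_map(corp1, content_transformer(tolower))       # same case
    corp1 <- tm_map(corp1, removePunctuation)                  # non-textual glyphs
    corp1 <- tm_map(corp1, removeNumbers)                      # numbers
    corp1 <- tm_map(corp1, removeWords, stopwords("english"))  # stop words
    corp1 <- tm_map(corp1, stripWhitespace)                    # white space
    corp1 <- tm_map(corp1, stemDocument)                       # stemming (needs SnowballC)

    inspect(TermDocumentMatrix(corp1))                         # terms and frequencies
    weightTfIdf(TermDocumentMatrix(corp1))$v                   # ==> named numeric(0)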

  • 10/35

    Examples from the text

    What’s happening in the beginning?

    We gather up a predefined set of documents, save them locally, and create a term frequency object:

    tempFile
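    The code after tempFile was truncated in extraction. A sketch of the described steps, assuming the 2,000-document movie review corpus summarized on the following slides; the URL is a placeholder, not the original source:

    library(tm)

    tempFile <- tempfile(fileext = ".zip")
    download.file("http://example.com/reviews.zip", tempFile)   # placeholder URL
    unzip(tempFile, exdir = "reviews")

    processed <- VCorpus(DirSource("reviews"))
    processed <- tm_map(processed, content_transformer(tolower))
    processed <- tm_map(processed, removePunctuation)
    processed <- tm_map(processed, removeWords, stopwords("english"))
    processed <- tm_map(processed, stemDocument)                # terms come out stemmed

    dtm <- DocumentTermMatrix(processed)
    Frequencies <- colSums(as.matrix(dtm))                      # corpus-wide term counts
    DocFrequencies <- colSums(as.matrix(dtm) > 0)               # documents containing each term
    head(Frequencies[order(Frequencies, decreasing = TRUE)], 5)
    head(DocFrequencies[order(DocFrequencies, decreasing = TRUE)], 5)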

  • 11/35


    Examples from the text

    Afterwards we look at the corpus:

    [1] " -- Dumping the object: processed (of type: list, class: VCorpus)"

    [2] " -- Dumping the object: processed (of type: list, class: Corpus)"

    Metadata: corpus specific: 0, document level (indexed): 0

    Content: documents: 2000

    [1] " -- Dumping the object:

    head(Frequencies[order(Frequencies, decreasing = T)], 5)

    (of type: double, class: numeric)"

    film movi one like charact

    11109 6857 5759 3998 3855

    [1] " -- Dumping the object:

    head(DocFrequencies[order(DocFrequencies, decreasing = T)], 5)

    (of type: double, class: numeric)"

    film one movi like charact

    1797 1763 1642 1538 1431

    We now know the most common terms across the 2,000 documents in the corpus.

  • 12/35


    Examples from the text

    Gathering a few corpus statistics.

    It is easy to think about how terms and documents create a 2-dimensional array.

    [1] " -- Dumping the object: moreThanOnce (of type: integer, class: integer)"

    [1] 9748

    [1] " -- Dumping the object: total (of type: integer, class: integer)"

    [1] 30585

    [1] " -- Dumping the object: prop (of type: double, class: numeric)"

    [1] 0.3187183

    [1] " -- Dumping the object: ncol(SparseRemoved) (of type: integer, class: integer)"

    [1] 202

    [1] " -- Dumping the object: sum(rowSums(as.matrix(SparseRemoved)) == 0)

    (of type: integer, class: integer)"

    [1] 0

    [1] " -- Dumping the object: colnames(SparseRemoved) (of type: character, class: character)"

    [1] "act" "action" "actor" "actual" "almost" "along"

    Columns that have only one entry are assumed to not be too interesting.

  • 13/35

    Examples from the text

    Create a dataframe with all the data

    quality
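    The assignment to quality was truncated. A minimal sketch, assuming quality is the 0/1 review label used by the later confusion matrices, SparseRemoved is the cleaned document-term matrix from the previous slide, and lengths (used on the logit slide) is the document length; the label order, seed, and 50/50 split are guesses:

    quality <- c(rep(0, 1000), rep(1, 1000))    # hypothetical 0/1 review labels
    lengths <- rowSums(as.matrix(dtm))          # total terms per document
    DF <- as.data.frame(as.matrix(SparseRemoved))
    DF$quality <- quality
    DF$lengths <- lengths

    set.seed(1234)                              # arbitrary seed
    train <- sample(nrow(DF), nrow(DF) / 2)
    TrainDF <- DF[train, ]
    TestDF  <- DF[-train, ]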

  • 14/35

    Examples from the text

    How well do knn classifiers do? (1 of 2)

    The code:

    Class3n
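    The code was truncated after Class3n. A sketch, assuming class::knn() over the term columns with the training data classified against itself (which is what the confusion matrices on the next slide compare); k = 3 and k = 5 are inferred from the names Class3n and Class5n:

    library(class)
    library(caret)

    wordCols <- setdiff(colnames(TrainDF), c("quality", "lengths"))

    Class3n <- knn(train = TrainDF[, wordCols], test = TrainDF[, wordCols],
                   cl = as.factor(TrainDF$quality), k = 3)
    Class5n <- knn(train = TrainDF[, wordCols], test = TrainDF[, wordCols],
                   cl = as.factor(TrainDF$quality), k = 5)

    confusionMatrix(Class3n, as.factor(TrainDF$quality))
    confusionMatrix(Class5n, as.factor(TrainDF$quality))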

  • 15/35


    Examples from the text

    How well do knn classifiers do? (2 of 2)

    The results:

    [1] " -- Dumping the object: confusionMatrix(Class3n,

    as.factor(TrainDF$quality)) (of type: list,

    class: confusionMatrix)"

    Confusion Matrix and Statistics

    Reference

    Prediction 0 1

    0 358 126

    1 134 382

    Accuracy : 0.74

    ...

    [1] " -- Dumping the object: confusionMatrix(Class5n,

    as.factor(TrainDF$quality)) (of type: list,

    class: confusionMatrix)"

    Confusion Matrix and Statistics

    Reference

    Prediction 0 1

    0 336 162

    1 156 346

    Accuracy : 0.682

  • 16/35

    Examples from the text

    How well will a naïve Bayes classifier do? (1 of 2)

    The code:

    model
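    Truncated after model. A sketch assuming e1071::naiveBayes() on the term columns, matching the confusionMatrix() calls dumped on the next slide:

    library(e1071)
    library(caret)

    model <- naiveBayes(x = TrainDF[, wordCols], y = as.factor(TrainDF$quality))

    classifNB <- predict(model, TrainDF[, wordCols])
    confusionMatrix(as.factor(TrainDF$quality), classifNB)

    classifNB <- predict(model, TestDF[, wordCols])
    confusionMatrix(as.factor(TestDF$quality), classifNB)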

  • 17/35


    Examples from the text

    How well will a naïve Bayes classifier do? (2 of 2)

    The results:

    [1] " -- Dumping the object: confusionMatrix(

    as.factor(TrainDF$quality), classifNB)

    (of type: list, class: confusionMatrix)"

    Confusion Matrix and Statistics

    Reference

    Prediction 0 1

    0 353 139

    1 74 434

    Accuracy : 0.787

    ...

    [1] " -- Dumping the object: confusionMatrix(

    as.factor(TestDF$quality), classifNB)

    (of type: list, class: confusionMatrix)"

    Confusion Matrix and Statistics

    Reference

    Prediction 0 1

    0 335 173

    1 120 372

    Accuracy : 0.707

  • 18/35

    Examples from the text

    How well will logistic regression (logit) do? (1 of 2)

    The code:

    model
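    Truncated after model. The summary on the next slide confirms a logistic regression of quality on review length; the classification threshold is inferred, and the higher accuracies in the later confusion matrices suggest a richer model may also have been fit:

    model <- glm(quality ~ lengths, family = binomial, data = TrainDF)
    summary(model)

    TrainDF$classif <- as.numeric(predict(model, TrainDF, type = "response") > 0.5)
    TestDF$classif  <- as.numeric(predict(model, TestDF,  type = "response") > 0.5)

    library(caret)
    confusionMatrix(as.factor(TrainDF$quality), as.factor(TrainDF$classif))
    confusionMatrix(as.factor(TestDF$quality), as.factor(TestDF$classif))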

  • 19/35

    Examples from the text

    How well will logistic regression (logit) do? (2 of 2)

    The results:

    [1] " -- Dumping the object: summary(model) (of type: list, class: summary.glm)"

    glm(formula = quality ~ lengths, family = binomial)

    ...

    Estimate Std. Error z value Pr(>|z|)

    (Intercept) -0.6383373 0.1171536 -5.449 5.07e-08 ***

    lengths 0.0018276 0.0003113 5.871 4.32e-09 ***

    ...

    [1] " -- Dumping the object: tbl (of type: integer, class: table)"

    quality

    classif 0 1

    0 614 507

    1 386 493

    ...

    [1] " -- Dumping the object: confusionMatrix(TrainDF$quality, TrainDF$classif) (of type: list, class: confusionMatrix)"

    Confusion Matrix and Statistics

    Reference

    Prediction 0 1

    0 418 74

    1 69 439

    Accuracy : 0.857

    ...

    [1] " -- Dumping the object: confusionMatrix(TestDF$quality, TestDF$classif) (of type: list, class: confusionMatrix)"

    Confusion Matrix and Statistics

    Reference

    Prediction 0 1

    0 377 131

    1 145 347

    Accuracy : 0.724

  • 20/35

    Examples from the text

    How well will a support vector machine (svm) do? (1 of 2)

    “The support vector classifier is a natural approach for classification in the two-class setting, if the boundary between the two classes is linear.”

    James, et al. [1]

    The code:

    modelSVM
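    Truncated after modelSVM. A sketch assuming e1071::svm() with its default kernel on the term columns (kernel and cost are not shown on the slide):

    library(e1071)
    library(caret)

    trainSVM <- TrainDF[, c(wordCols, "quality")]
    trainSVM$quality <- as.factor(trainSVM$quality)
    modelSVM <- svm(quality ~ ., data = trainSVM)

    classifSVMtrain <- predict(modelSVM, TrainDF)
    classifSVMtest  <- predict(modelSVM, TestDF)

    confusionMatrix(as.factor(TrainDF$quality), classifSVMtrain)
    confusionMatrix(as.factor(TestDF$quality), classifSVMtest)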

  • 21/35


    Examples from the text

    How well will a support vector machine (svm) do? (2 of 2)

    The results:

    [1] " -- Dumping the object: confusionMatrix(

    TrainDF$quality, classifSVMtrain)

    (of type: list, class: confusionMatrix)"

    Confusion Matrix and Statistics

    Reference

    Prediction 0 1

    0 449 43

    1 38 470

    Accuracy : 0.919

    ...

    [1] " -- Dumping the object: confusionMatrix(

    TestDF$quality, classifSVMtest)

    (of type: list, class: confusionMatrix)"

    Confusion Matrix and Statistics

    Reference

    Prediction 0 1

    0 378 130

    1 146 346

    Accuracy : 0.724

  • 22/35

    A little silliness

    Looking at term frequency in a PDF.

    We will do a few things:

    1 Read text directly from a PDF.

    2 “Normalize” the text.

    3 Look at the text in different ways.

    (From the file: chapter-13-textual-silliness.R)

    Attached file.
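    A minimal sketch of step 1, using tm’s readPDF reader as in the book chapter reproduced later in this transcript; the folder name is a placeholder and the pdftotext utility must be installed (the actual script is the attached chapter-13-textual-silliness.R):

    library(tm)
    silly <- Corpus(DirSource("pdfDir", pattern = "[.]pdf$"),   # placeholder folder
                    readerControl = list(reader = readPDF))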

  • 23/35


    A little silliness

    Same image.

    Attached file.

  • 24/35


    A little silliness

    Look at text frequency as a B&W word cloud

    Attached file.

  • 25/35


    A little silliness

    Look at text frequency as a color word cloud

    Attached file.

  • 26/35


    A little silliness

    More colorful examples from Romeo and Juliet

    Attached file (wordCloud.pdf).

  • 27/35

    Q & A time.

    Q: How do you catch a unique rabbit?
    A: Unique up on it!

    Q: How do you catch a tame rabbit?
    A: The tame way!

  • 28/35

    What have we covered?

    Compared and contrasted numerical and textual data analysis
    Provided a few numerical definitions (TF, IDF) that are fundamental to textual analysis
    Applied different textual analysis tools and techniques (knn, naïve Bayes, logit, and support vector machine)
    Looked at different graphical ways textual data could be displayed

    Next: Serial vs. parallel processing

  • 29/35

    References (1 of 2)

    [1] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning, vol. 6, Springer, 2013.

    [2] TF-IDF Staff, What does tf-idf mean?, http://www.tfidf.com/, 2017.

    [3] Wikipedia Staff, Logistic function, https://en.wikipedia.org/wiki/Logistic_function, 2017.

    [4] Wikipedia Staff, Naive Bayes classifier, https://en.wikipedia.org/wiki/Naive_Bayes_classifier, 2017.

  • 30/35

    References (2 of 2)

    [5] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Pearson Education India, 2006.

    [6] G. Williams, Hands-On Data Science with R: Text Mining, 2016.

  • 31/35

    Files of interest

    1 Revised textual analysis script
    2 Silly textual analysis script
    3 PDF file used with silly textual analysis script
    4 R library script file
    5 Other ways to display word clouds
    6 Code snippets

    rm(list=ls())

    library(lattice)
    library(ggplot2)
    library(NLP)
    library(tm)
    library(class)
    library(caret)
    library(e1071)
    library(topicmodels)
    library(qdapDictionaries)
    library(qdapRegex)
    library(qdapTools)
    library(RColorBrewer)
    library(qdap)
    library(psych)

    source("library.R")

    assignBinary threshold]

  • Hands-On Data Science with R

    Text Mining

    [email protected]

    10th January 2016

    Visit http://HandsOnDataScience.com/ for more Chapters.

    Text Mining (or Text Analytics) applies analytic tools to learn from collections of text data, like social media, books, newspapers, emails, etc. The goal can be considered to be similar to humans learning by reading such material. However, using automated algorithms we can learn from massive amounts of text, very much more than a human can. The material could consist of millions of newspaper articles to perhaps summarise the main themes and to identify those that are of most interest to particular people. Or we might be monitoring twitter feeds to identify emerging topics that we might need to act upon, as they emerge.

    The required packages for this chapter include:

    library(tm) # Framework for text mining.

    library(qdap) # Quantitative discourse analysis of transcripts.

    library(qdapDictionaries)

    library(dplyr) # Data wrangling, pipe operator %>%().

    library(RColorBrewer) # Generate palette of colours for plots.

    library(ggplot2) # Plot word frequencies.

    library(scales) # Include commas in numbers.

    library(Rgraphviz) # Correlation plots.

    As we work through this chapter, new R commands will be introduced. Be sure to review the command’s documentation and understand what the command does. You can ask for help using the ? command as in:

    ?read.csv

    We can obtain documentation on a particular package using the help= option of library():

    library(help=rattle)

    This chapter is intended to be hands on. To learn effectively, you are encouraged to have R running (e.g., RStudio) and to run all the commands as they appear here. Check that you get the same output, and you understand the output. Try some variations. Explore.

    Copyright © 2013-2015 Graham Williams. You can freely copy, distribute, or adapt this material, as long as the attribution is retained and derivative work is provided under the same license.

    http://HandsOnDataScience.com/

    http://creativecommons.org/licenses/by-nc-sa/4.0/


    1 Getting Started: The Corpus

    The primary package for text mining, tm (Feinerer and Hornik, 2015), provides a framework within which we perform our text mining. A collection of other standard R packages add value to the data processing and visualizations for text mining.

    The basic concept is that of a corpus. This is a collection of texts, usually stored electronically, and from which we perform our analysis. A corpus might be a collection of news articles from Reuters or the published works of Shakespeare. Within each corpus we will have separate documents, which might be articles, stories, or book volumes. Each document is treated as a separate entity or record.

    Documents which we wish to analyse come in many different formats. Quite a few formats are supported by tm (Feinerer and Hornik, 2015), the package we will illustrate text mining with in this module. The supported formats include text, PDF, Microsoft Word, and XML.

    A number of open source tools are also available to convert most document formats to text files. For our corpus used initially in this module, a collection of PDF documents were converted to text using pdftotext from the xpdf application, which is available for GNU/Linux, MS/Windows, and others. On GNU/Linux we can convert a folder of PDF documents to text with:

    system("for f in *.pdf; do pdftotext -enc ASCII7 -nopgbrk $f; done")

    The -enc ASCII7 ensures the text is converted to ASCII since otherwise we may end up with binary characters in our text documents.

    We can also convert Word documents to text using antiword, which is another application available for GNU/Linux.

    system("for f in *.doc; do antiword $f; done")


    1.1 Corpus Sources and Readers

    There are a variety of sources supported by tm. We can use getSources() to list them.

    getSources()

    ## [1] "DataframeSource" "DirSource" "URISource" "VectorSource"

    ## [5] "XMLSource" "ZipSource"

    In addition to different kinds of sources of documents, our documents for text analysis will come in many different formats. A variety are supported by tm:

    getReaders()

    ## [1] "readDOC" "readPDF"

    ## [3] "readPlain" "readRCV1"

    ## [5] "readRCV1asPlain" "readReut21578XML"

    ## [7] "readReut21578XMLasPlain" "readTabular"

    ## [9] "readTagged" "readXML"


    1.2 Text Documents

    We load a sample corpus of text documents. Our corpus consists of a collection of research papers all stored in the folder we identify below. To work along with us in this module, you can create your own folder called corpus/txt and place into that folder a collection of text documents. It does not need to be as many as we use here, but a reasonable number makes it more interesting.

    cname
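    The assignments were truncated in extraction; a sketch of the described setup (folder name taken from the text above):

    cname <- file.path(".", "corpus", "txt")
    length(dir(cname))                  # how many documents we have
    docs <- Corpus(DirSource(cname))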


    ## ai02.txt 2 PlainTextDocument list

    ## ai03.txt 2 PlainTextDocument list

    ## ai97.txt 2 PlainTextDocument list

    ....


    1.3 PDF Documents

    If instead of text documents we have a corpus of PDF documents, then we can use the readPDF() reader function to convert PDF into text and have that loaded as our Corpus.

    docs
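    The assignment was truncated; the usual tm pattern for this (a sketch; readPDF needs the pdftotext utility installed) is:

    docs <- Corpus(DirSource(cname, pattern = "[.]pdf$"),
                   readerControl = list(reader = readPDF))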


    1.4 Word Documents

    A simple open source tool to convert Microsoft Word documents into text is antiword. The separate antiword application needs to be installed, but once it is available it is used by tm to convert Word documents into text for loading into R.

    To load a corpus of Word documents we use the readDOC() reader function:

    docs
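    Again truncated; a sketch of the corresponding call:

    docs <- Corpus(DirSource(cname, pattern = "[.]doc$"),
                   readerControl = list(reader = readDOC))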


    2 Exploring the Corpus

    We can (and should) inspect the documents using inspect(). This will assure us that data has been loaded properly and as we expect.

    inspect(docs[16])

    ##

    ## Metadata: corpus specific: 0, document level (indexed): 0

    ## Content: documents: 1

    ##

    ## [[1]]

    ##

    ## Metadata: 7

    ## Content: chars: 44776

    viewDocs <- function(d, n) {d %>% extract2(n) %>% as.character() %>% writeLines()}  # extract2() comes from magrittr
    viewDocs(docs, 16)

    ## Hybrid weighted random forests for

    ## classifying very high-dimensional data

    ## Baoxun Xu1 , Joshua Zhexue Huang2 , Graham Williams2 and

    ## Yunming Ye1

    ## 1

    ##

    ....


    3 Preparing the Corpus

    We generally need to perform some pre-processing of the text data to prepare for the text analysis. Example transformations include converting the text to lower case, removing numbers and punctuation, removing stop words, stemming, and identifying synonyms. The basic transforms are all available within tm.

    getTransformations()

    ## [1] "removeNumbers" "removePunctuation" "removeWords"

    ## [4] "stemDocument" "stripWhitespace"

    The function tm_map() is used to apply one of these transformations across all documents within a corpus. Other transformations can be implemented using R functions and wrapped within content_transformer() to create a function that can be passed through to tm_map(). We will see an example of that in the next section.

    In the following sections we will apply each of the transformations, one-by-one, to remove unwanted characters from the text.


    3.1 Simple Transforms

    We start with some manual special transforms we may want to do. For example, we might want to replace “/”, used sometimes to separate alternative words, with a space. This will avoid the two words being run into one string of characters through the transformations. We might also replace “@” and “|” with a space, for the same reason.

    To create a custom transformation we make use of content_transformer() to create a function to achieve the transformation, and then apply it to the corpus using tm_map().

    toSpace
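    The definition was truncated; the transform the text describes (a sketch) is:

    toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
    docs <- tm_map(docs, toSpace, "/")
    docs <- tm_map(docs, toSpace, "@")
    docs <- tm_map(docs, toSpace, "\\|")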

    3.2 Conversion to Lower Case

    docs

    3.3 Remove Numbers

    docs

    3.4 Remove Punctuation

    docs

    3.5 Remove English Stop Words

    docs

    3.6 Remove Own Stop Words

    docs

    3.7 Strip Whitespace

    docs

    3.8 Specific Transformations

    We might also have some specific transformations we would like to perform. The examples here may or may not be useful, depending on how we want to analyse the documents. This is really for illustration using the part of the document we are looking at here, rather than suggesting this specific transform adds value.

    toString

    3.9 Stemming

    docs
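    Each assignment above was truncated in extraction. A consolidated sketch of sections 3.2 through 3.9 using tm’s standard transforms; the own-stop-word list and the toString() example strings are placeholders, not necessarily the book’s:

    docs <- tm_map(docs, content_transformer(tolower))           # 3.2
    docs <- tm_map(docs, removeNumbers)                          # 3.3
    docs <- tm_map(docs, removePunctuation)                      # 3.4
    docs <- tm_map(docs, removeWords, stopwords("english"))      # 3.5
    docs <- tm_map(docs, removeWords, c("department", "email"))  # 3.6 (placeholder list)
    docs <- tm_map(docs, stripWhitespace)                        # 3.7

    # 3.8: map specific phrases to a canonical form (illustrative only).
    toString <- content_transformer(function(x, from, to) gsub(from, to, x))
    docs <- tm_map(docs, toString, "harbin institut technolog", "HIT")

    docs <- tm_map(docs, stemDocument)                           # 3.9 (needs SnowballC)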


    4 Creating a Document Term Matrix

    A document term matrix is simply a matrix with documents as the rows and terms as the columns, and a count of the frequency of words as the cells of the matrix. We use DocumentTermMatrix() to create the matrix:

    dtm
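    Truncated; a minimal sketch:

    dtm <- DocumentTermMatrix(docs)
    inspect(dtm[1:5, 1000:1005])        # peek at a corner of the matrix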


    5 Exploring the Document Term Matrix

    We can obtain the term frequencies as a vector by converting the document term matrix into a matrix and summing the column counts:

    freq
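    Truncated; a sketch matching the description:

    freq <- colSums(as.matrix(dtm))     # term frequencies across the corpus
    length(freq)
    ord <- order(freq)
    freq[tail(ord)]                     # the most frequent terms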


    6 Distribution of Term Frequencies

    # Frequency of frequencies.

    head(table(freq), 15)

    ## freq

    ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

    ## 2381 1030 503 311 210 188 134 130 82 83 65 61 54 52 51

    tail(table(freq), 15)

    ## freq

    ## 483 544 547 555 578 609 611 616 703 709 776 887 1366 1446 3101

    ## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

    So we can see here that there are 2381 terms that occur just once.


    7 Conversion to Matrix and Save to CSV

    We can convert the document term matrix to a simple matrix for writing to a CSV file, for example, for loading the data into other software if we need to do so. To write to CSV we first convert the data structure into a simple matrix:

    m
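    Truncated; a sketch:

    m <- as.matrix(dtm)
    dim(m)
    write.csv(m, file = "dtm.csv")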


    8 Removing Sparse Terms

    We are often not interested in infrequent terms in our documents. Such “sparse” terms can be removed from the document term matrix quite easily using removeSparseTerms():

    dim(dtm)

    ## [1] 46 6508

    dtms
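    Truncated; a sketch (the 0.1 is illustrative: keep only terms with sparsity below 10%, i.e. terms appearing in at least 90% of the documents):

    dtms <- removeSparseTerms(dtm, 0.1)
    dim(dtms)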


    9 Identifying Frequent Items and Associations

    One thing we often want to do first is to get an idea of the most frequent terms in the corpus. We use findFreqTerms() to do this. Here we limit the output to those terms that occur at least 1,000 times:

    findFreqTerms(dtm, lowfreq=1000)

    ## [1] "data" "mine" "use"

    So that only lists a few. We can get more of them by reducing the threshold:

    findFreqTerms(dtm, lowfreq=100)

    ## [1] "accuraci" "acsi" "adr" "advers" "age"

    ## [6] "algorithm" "allow" "also" "analysi" "angioedema"

    ## [11] "appli" "applic" "approach" "area" "associ"

    ## [16] "attribut" "australia" "australian" "avail" "averag"

    ## [21] "base" "build" "call" "can" "care"

    ## [26] "case" "chang" "claim" "class" "classif"

    ....

    We can also find associations with a word, specifying a correlation limit.

    findAssocs(dtm, "data", corlimit=0.6)

    ## $data

    ## mine induct challeng know answer

    ## 0.90 0.72 0.70 0.65 0.64

    ## need statistician foundat general boost

    ## 0.63 0.63 0.62 0.62 0.61

    ## major mani come

    ....

    If two words always appear together then the correlation would be 1.0, and if they never appear together the correlation would be 0.0. Thus the correlation is a measure of how closely associated the words are in the corpus.


    10 Correlations Plots

    [Figure: correlation network graph of the 50 most frequent terms, with links where correlation ≥ 0.5.]

    plot(dtm,
         terms=findFreqTerms(dtm, lowfreq=100)[1:50],
         corThreshold=0.5)

    Rgraphviz (Hansen et al., 2016) from the BioConductor repository for R (bioconductor.org) is used to plot the network graph that displays the correlation between chosen words in the corpus. Here we choose 50 of the more frequent words as the nodes and include links between words when they have at least a correlation of 0.5.

    By default (without providing terms and a correlation threshold) the plot function chooses a random 20 terms with a threshold of 0.7.


    11 Correlations Plot—Options

    [Figure: the same correlation network graph, plotted with the options shown below.]

    plot(dtm,
         terms=findFreqTerms(dtm, lowfreq=100)[1:50],
         corThreshold=0.5)


    12 Plotting Word Frequencies

    We can generate the frequency count of all words in a corpus:

    # The head of this pipeline was truncated in extraction; the first two
    # lines and the frequency cutoff are reconstructed, not original.
    freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
    wf <- data.frame(word=names(freq), freq=freq)

    subset(wf, freq>500) %>%
      ggplot(aes(word, freq)) +
      geom_bar(stat="identity") +
      theme(axis.text.x=element_text(angle=45, hjust=1))

    [Figure: bar chart of the most frequent words (counts up to about 3,000); x axis word, y axis freq.]


    13 Word Clouds

    [Figure: word cloud of corpus terms with min.freq=40.]

    We can generate a word cloud as an effective alternative to providing a quick visual overview of the frequency of words in a corpus.

    The wordcloud package provides the required function.

    library(wordcloud)

    set.seed(123)

    wordcloud(names(freq), freq, min.freq=40)

    Notice the use of set.seed() only so that we can obtain the same layout each time; otherwise a random layout is chosen, which is not usually an issue.


    13.1 Reducing Clutter With Max Words

    [Figure: word cloud limited to the 100 most frequent words.]

    To increase or reduce the number of words displayed we can tune the value of max.words=. Here we have limited the display to the 100 most frequent words.

    set.seed(142)

    wordcloud(names(freq), freq, max.words=100)


    13.2 Reducing Clutter With Min Freq

    [Figure: word cloud of words occurring at least 100 times.]

    A more common approach to increase or reduce the number of words displayed is by tuning the value of min.freq=. Here we have limited the display to those words that occur at least 100 times.

    set.seed(142)

    wordcloud(names(freq), freq, min.freq=100)


    13.3 Adding Some Colour

    [Figure: the same word cloud, coloured with the Dark2 palette.]

    We can also add some colour to the display. Here we make use of brewer.pal() from RColorBrewer (Neuwirth, 2014) to generate a palette of colours to use.

    set.seed(142)

    wordcloud(names(freq), freq, min.freq=100, colors=brewer.pal(6, "Dark2"))


    13.4 Varying the Scaling

    [Figure: word cloud with an increased scale range, scale=c(5, .1).]

    We can change the range of font sizes used in the plot using the scale= option. By default the most frequent words have a scale of 4 and the least have a scale of 0.5. Here we illustrate the effect of increasing the scale range.

    set.seed(142)

    wordcloud(names(freq), freq, min.freq=100, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))


    13.5 Rotating Words

    [Figure: word cloud with 20% of the words rotated 90 degrees.]

    We can change the proportion of words that are rotated by 90 degrees from the default 10% to, say, 20% using rot.per=0.2.

    set.seed(142)

    dark2
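    The assignment was truncated; the pattern the section describes (a sketch):

    dark2 <- brewer.pal(6, "Dark2")
    wordcloud(names(freq), freq, min.freq=100, rot.per=0.2, colors=dark2)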


    14 Quantitative Analysis of Text

    The qdap (Rinker, 2015) package provides an extensive suite of functions to support the quantitative analysis of text.

    We can obtain simple summaries of a list of words, and to do so we will illustrate with the terms from our Term Document Matrix tdm. We first extract the shorter terms from each of our documents into one long word list. To do so we convert tdm into a matrix, extract the column names (the terms) and retain those shorter than 20 characters.

    # "words <-" and the piped object were truncated in extraction; dtm is
    # assumed here (its column names are the terms).
    words <- dtm %>%
      as.matrix %>%
      colnames %>%
      (function(x) x[nchar(x) < 20])

    We can then summarise the word list. Notice, in particular, the use of dist_tab() from qdap to generate frequencies and percentages.

    length(words)

    ## [1] 6456

    head(words, 15)

    ## [1] "aaai" "aab" "aad" "aadrbhtm" "aadrbltn"

    ## [6] "aadrhtmliv" "aai" "aam" "aba" "abbrev"

    ## [11] "abbrevi" "abc" "abcd" "abdul" "abel"

    summary(nchar(words))

    ## Min. 1st Qu. Median Mean 3rd Qu. Max.

    ## 3.000 5.000 6.000 6.644 8.000 19.000

    table(nchar(words))

    ##

    ## 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

    ## 579 867 1044 1114 935 651 397 268 200 138 79 63 34 28 22

    ## 18 19

    ## 21 16

    dist_tab(nchar(words))

    ## interval freq cum.freq percent cum.percent

    ## 1 3 579 579 8.97 8.97

    ## 2 4 867 1446 13.43 22.40

    ## 3 5 1044 2490 16.17 38.57

    ## 4 6 1114 3604 17.26 55.82

    ## 5 7 935 4539 14.48 70.31

    ## 6 8 651 5190 10.08 80.39

    ## 7 9 397 5587 6.15 86.54

    ## 8 10 268 5855 4.15 90.69

    ## 9 11 200 6055 3.10 93.79

    ## 10 12 138 6193 2.14 95.93

    ....


    14.1 Word Length Counts

    [Figure: histogram of word lengths (Number of Letters vs Number of Words), with the mean marked.]

    A simple plot is then effective in showing the distribution of the word lengths. Here we create a single column data frame that is passed on to ggplot() to generate a histogram, with a vertical line to show the mean length of words.

    data.frame(nletters=nchar(words)) %>%
      ggplot(aes(x=nletters)) +
      geom_histogram(binwidth=1) +
      geom_vline(xintercept=mean(nchar(words)), colour="green", size=1, alpha=.5) +
      labs(x="Number of Letters", y="Number of Words")


    14.2 Letter Frequency

    [Figure: horizontal bar chart of letter proportions (0% to 12%), from E (most frequent) down to Z, J, Q, X (least frequent).]

    Next we want to review the frequency of letters across all of the words in the discourse. Some data preparation will transform the vector of words into a list of letters, which we then construct a frequency count for, and pass this on to be plotted.

    We again use a pipeline to string together the operations on the data. Starting from the vector of words stored in words we split the words into characters using str_split() from stringr (Wickham, 2015), removing the first string (an empty string) from each of the results (using sapply()). Reducing the result into a simple vector, using unlist(), we then generate a data frame recording the letter frequencies, using dist_tab() from qdap. We can then plot the letter proportions.

library(dplyr)
library(stringr)

words %>%
    str_split("") %>%
    sapply(function(x) x[-1]) %>%
    unlist %>%
    dist_tab %>%
    mutate(Letter=factor(toupper(interval),
                         levels=toupper(interval[order(freq)]))) %>%
    ggplot(aes(Letter, weight=percent)) +
    geom_bar() +
    coord_flip() +
    labs(y="Proportion") +
    scale_y_continuous(breaks=seq(0, 12, 2),
                       label=function(x) paste0(x, "%"),
                       expand=c(0,0), limits=c(0,12))


    14.3 Letter and Position Heatmap

[Figure: heatmap of letter proportions by position within a word, Letter (A-Z) versus Position (1-19), shaded by Proportion from 0.000 to 0.020.]

The qheat() function from qdap provides an effective visualisation of tabular data. Here we transform the list of words into a position count of each letter, and construct a table of the proportions that is passed on to qheat() to do the plotting.

words %>%
    lapply(function(x) sapply(letters, gregexpr, x, fixed=TRUE)) %>%
    unlist %>%
    (function(x) x[x!=-1]) %>%
    (function(x) setNames(x, gsub("\\d", "", names(x)))) %>%
    (function(x) apply(table(data.frame(letter=toupper(names(x)),
                                        position=unname(x))),
                       1, function(y) y/length(x))) %>%
    qheat(high="green", low="yellow", by.column=NULL,
          values=TRUE, digits=3, plot=FALSE) +
    labs(y="Letter", x="Position") +
    theme(axis.text.x=element_text(angle=0)) +
    guides(fill=guide_legend(title="Proportion"))


    14.4 Miscellaneous Functions

We can derive the likely gender of each name in a list, using the genderdata package.

    devtools::install_github("lmullen/gender-data-pkg")

    name2sex(qcv(graham, frank, leslie, james, jacqui, jack, kerry, kerrie))

## The genderdata package needs to be installed.

## Error in install_genderdata_package(): Failed to install the genderdata package.

## Please try installing the package for yourself using the following command:

## install.packages("genderdata", repos = "http://packages.ropensci.org", type = "source")
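Once the data are available, the gender package itself can also be queried directly. A minimal sketch, where the choice of method (and its default year range) is an assumption to adapt:

library(gender)

# Look up first names against the US Social Security Administration data.
gender(c("graham", "frank", "leslie", "jacqui"), method="ssa")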


    15 Word Distances

Continuous bag of words (CBOW): Word2Vec associates each word in a vocabulary with a unique vector of real numbers of length d. Words that have a similar syntactic context appear closer together within the vector space. The syntactic context is based on a set of words within a specific window size.

    install.packages("tmcn.word2vec", repos="http://R-Forge.R-project.org")

    ## Installing package into ’/home/gjw/R/x86 64-pc-linux-gnu-library/3.2’

    ## (as ’lib’ is unspecified)

    ##

    ## The downloaded source packages are in

    ## '/tmp/Rtmpt1u3GR/downloaded_packages'

    library(tmcn.word2vec)

    model
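Once a model associates each word with a vector, similarity between two words is measured by the angle between their vectors, most commonly cosine similarity. A minimal sketch in base R, with the two vectors as made-up values:

# Cosine similarity between two word vectors: values near 1 indicate
# similar contexts, values near 0 indicate unrelated words.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

v.king  <- c(0.52, -0.13, 0.88, 0.07)   # assumed 4-dimensional vectors
v.queen <- c(0.48, -0.09, 0.91, 0.12)

cosine(v.king, v.queen)                 # close to 1 for these similar vectors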


    16 Review—Preparing the Corpus

Here in one sequence is collected the code to perform a text mining project. Notice that we would not necessarily do all of these steps, so pick and choose as appropriate to your situation. A sketch of a typical sequence follows the loading code below.

    # Required packages

    library(tm)

    library(wordcloud)

    # Locate and load the Corpus.

    cname
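A minimal sketch of such a sequence, assuming the corpus is a directory ./corpus/txt of plain-text files; the path and the particular transforms retained are assumptions to adapt:

library(SnowballC)                                   # provides stemDocument()

cname <- file.path(".", "corpus", "txt")             # assumed location
docs  <- Corpus(DirSource(cname))

# Prepare the Corpus (pick and choose as appropriate).
docs <- tm_map(docs, content_transformer(tolower))   # same case throughout
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)                   # stem what remains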


    17 Review—Analysing the Corpus

    # Document term matrix.

    dtm
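A minimal sketch of the usual analysis steps, assuming docs is the prepared corpus from the previous section; the frequency thresholds and the probe term "data" are assumptions to tune:

dtm  <- DocumentTermMatrix(docs)

# Term frequencies, most frequent terms first.
freq <- colSums(as.matrix(dtm))
head(sort(freq, decreasing=TRUE), 10)

findFreqTerms(dtm, lowfreq=100)            # frequent items
findAssocs(dtm, "data", corlimit=0.6)      # associated terms

wordcloud(names(freq), freq, min.freq=40)  # visualise as a word cloud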


    18 LDA

Topic models such as latent Dirichlet allocation (LDA) have been popular for text mining over the last 15 years, applied with varying degrees of success. Text is fed into LDA to extract the topics underlying the documents. Examples are the AP corpus and the Science corpus 1880-2002 (Blei and Lafferty 2009).

When is LDA applicable? It will fail on some data, and we need to choose the number of topics to find and how many documents are needed. How do we know the topics learned are the correct topics?

Two fundamental papers, independently discovered: Blei, Ng, and Jordan (NIPS 2001), with 11k citations; and Pritchard, Stephens, and Donnelly (Genetics, June 2000), with 14k citations. The models are exactly the same except for minor differences: topics versus population structures.

There is no theoretical analysis as such. How do we guarantee correct topics, and how efficient is the learning procedure?

    Observations:

    LDA won’t work on many short tweets or very few long documents.

    We should not liberally over-fit the LDA with too many redundant topics...

    Limiting factors:

We should use as many documents as we can, and short documents of fewer than 10 words won't work even if there are many of them. We need sufficiently long documents.

A small Dirichlet parameter helps, especially if we overfit. See Long Nguyen's keynote at PAKDD 2015 in Vietnam.

The number of documents is the most important factor.

Document length plays a useful role too.

Avoid overfitting: with too many topics you don't really learn anything, and a human then needs to cull the topics.

    New work detects new topics as they emerge.

    library(lda)

    ## Error in library(lda): there is no package called ’lda’

    # From demo(lda)

    library("ggplot2")

    library("reshape2")

    data(cora.documents)

    ## Warning in data(cora.documents): data set ’cora.documents’ not found

    data(cora.vocab)

    ## Warning in data(cora.vocab): data set ’cora.vocab’ not found


    theme_set(theme_bw())

    set.seed(8675309)

    K
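The fitting step follows the lda package's own demo; as a minimal sketch, with K and the sampler settings as assumed values (cora.documents and cora.vocab ship with the lda package):

# Fit LDA by collapsed Gibbs sampling (settings are assumptions).
K <- 10
result <- lda.collapsed.gibbs.sampler(cora.documents, K, cora.vocab,
                                      num.iterations=25,
                                      alpha=0.1, eta=0.1,
                                      compute.log.likelihood=TRUE)

# Top five words for each learned topic.
top.topic.words(result$topics, 5, by.score=TRUE)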


    19 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This chapter is one of many chapters available from http://HandsOnDataScience.com. In particular follow the links on the website with a *, which indicates the generally more developed chapters.

    Other resources include:

The Journal of Statistical Software article, Text Mining Infrastructure in R, is a good start: http://www.jstatsoft.org/v25/i05/paper

    Bilisoly (2008) presents methods and algorithms for text mining using Perl.

    Thanks also to Tony Nolan for suggestions of some of the examples used in this chapter.

    Some of the qdap examples were motivated by http://trinkerrstuff.wordpress.com/2014/10/31/exploration-of-letter-make-up-of-english-words/.


    20 References

Bilisoly R (2008). Practical Text Mining with Perl. Wiley Series on Methods and Applications in Data Mining. Wiley. ISBN 9780470382851. URL http://books.google.com.au/books?id=YkMFVbsrdzkC.

Feinerer I, Hornik K (2015). tm: Text Mining Package. R package version 0.6-2. URL https://CRAN.R-project.org/package=tm.

Hansen KD, Gentry J, Long L, Gentleman R, Falcon S, Hahne F, Sarkar D (2016). Rgraphviz: Provides plotting capabilities for R graph objects. R package version 2.12.0.

Neuwirth E (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. URL https://CRAN.R-project.org/package=RColorBrewer.

R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Rinker T (2015). qdap: Bridging the Gap Between Qualitative Data and Quantitative Analysis. R package version 2.2.4. URL https://CRAN.R-project.org/package=qdap.

Wickham H (2015). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.0.0. URL https://CRAN.R-project.org/package=stringr.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery. Use R! Springer, New York.

This document, sourced from TextMiningO.Rnw bitbucket revision 76, was processed by KnitR version 1.12 of 2016-01-06 and took 41.3 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04.3 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 8 cores and 12.3GB of RAM. It completed the processing 2016-01-10 09:58:57.



Creating Shaped Wordclouds Using R

Tidewater Big Data Enthusiasts

Chuck Cartledge, Developer

November 3, 2016 at 11:04pm

Contents

1 Introduction
2 Discussion
3 Conclusion
A Misc. files

List of Figures

1 A sample word cloud based on Romeo and Juliet.
2 A more interesting word cloud based on Romeo and Juliet.
3 An empty word cloud figure.
4 A filled word cloud figure.
5 A filled USA word cloud figure.
6 A collection of sample word clouds.

1 Introduction

The R library wordcloud provides an easy way to create an image showing how often a word (or tag) appears in a corpus (see Figure 1). In a word cloud, the size of a word indicates how often that word appears. Word cloud words can be colored as well.

While word clouds are easy to create, often the clouds could be shaped differently to create a more lasting and profound impression (see Figure 2).
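A minimal sketch of the basic call, assuming a prepared tm corpus docs as in the text-mining notes above:

library(tm)
library(wordcloud)

# Count how often each term appears across the corpus.
tdm  <- TermDocumentMatrix(docs)
freq <- rowSums(as.matrix(tdm))

# Size each word by its frequency; most frequent words in the centre.
wordcloud(names(freq), freq, min.freq=25, random.order=FALSE)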

    2 Discussion

The R library wordcloud2 (available from https://github.com/Lchiffon/wordcloud2) provides the capability of creating a word cloud that takes the shape of an image, or the shape of letters. The collection of predefined shapes includes:

• 'cardioid' – a heart shape

• 'circle' – the default

• 'diamond' – an alias for a square

• 'pentagon' – the five sided object

• 'star' – a five pointed star

• 'triangle' – a triangle with the wide base at the bottom

• 'triangle-forward' – a triangle with the wide base at the left

This collection of shapes (when combined with a user specified background color) may be enough to satisfy a wide variety of needs. But it is the figPath option that offers the most potential.

The figPath option can point to a figure that contains the image the cloud should fill.

Here are the steps to create an "interesting" shape to fill with a word cloud (a short sketch of the corresponding calls follows the steps):

1. Download/create an image with only two items (see Figure 3):

   • A white background, and
   • A black outline of the shape.

2. Fill the interior of the shape with the same color as the outline (see Figure 4).

3. Pass the location of the filled image as the figPath parameter (see Figure 5).
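A minimal sketch of the corresponding calls; demoFreq is a small word-frequency data frame shipped with wordcloud2, and usa.png is assumed to be a filled silhouette prepared as in the steps above:

library(wordcloud2)

wordcloud2(demoFreq, shape="star")           # one of the predefined shapes

# Fill a user-supplied silhouette via figPath.
wordcloud2(demoFreq, figPath="usa.png", color="darkblue")

letterCloud(demoFreq, word="USA")            # shape the cloud as letters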

Figure 1: A sample word cloud based on Romeo and Juliet. The image was created using the wordcloud function in the wordcloud library and the text from "Romeo and Juliet."

Figure 2: A more interesting word cloud based on Romeo and Juliet. The image was created using the wordcloud2 function in the wordcloud2 library and the text from "Romeo and Juliet."

Figure 3: An empty word cloud figure.

Figure 4: A filled word cloud figure.

Figure 5: A filled USA word cloud figure.

The wordcloud2 function behaves slightly differently than most of the other R plot functions that I've used. The result from both wordcloud2 and letterCloud is not displayable within R. These functions actually create an HTML page in a temporary directory with embedded JavaScript that performs the placement of the words within the shape, and provides a level of interaction after the page is displayed. R "understands" that the product from these functions is an HTML widget and starts up the default browser to show the page. The page and its sub-directories are removed when R ends.

The fact that the page uses JavaScript introduces some interesting aspects. Buried in the JavaScript used by the page to place the words in the cloud are a plethora of Math.random() calls. The JavaScript specification says that the Math.random() function has to return a value greater than or equal to 0, and less than 1, which is reasonable for a random function. The specification also says that the implementation of the random function is up to the JavaScript application, and does not specify how the numbers are to be generated. This means that the same HTML page, viewed in two different browsers, may generate two different sequences of random numbers. Most random number generators have the capability of setting a seed value so that a repeatable sequence can be generated; JavaScript does not support the idea of a random number seed. The HTML page and collection of directories can be moved to a server where they are available for use and support.

All of this means that each loading and viewing of the page will generate a different image, and there is no practical way to "get back" to an image that was good.

In the Files section (see Section A) is an R script and support files to work with. The R script was used to create various images (see Figure 6).

    3 Conclusion

The wordcloud2 library enables you to create word clouds of arbitrary shape inside an HTML page, using JavaScript to position and orient each word. Each HTML page and its associated library files are placed in individual directories that are removed when the creating R process terminates. Pages and files can be moved, or copied for safe keeping if desired. Because the pages use the Math.random() JavaScript function, each time the page is loaded, words will be positioned differently in the cloud. If the desired shape has an internal hole, then it is possible that some words may not be placed in the cloud.

    wordcloud2 allows you to create word clouds to support your data visualization needs.


(a) A heart. (b) The letters "USA". (c) A star. (d) The USA.

Figure 6: A collection of sample word clouds. These images were created with the attached R script.

A Misc. files

    The files used to create all these figures are attached to this report. They are:

    1. romeoAndJuliet.base64 – default text used to demonstrate the software

    2. heart.png – a heart shape with a hole

    3. usa.png – an outline of the continental United States

    4. wordCloud.R – an R script to demonstrate making word clouds
