Introduction

Language Variation Suite

Visual Analytics for Digital Humanities

Interactive Text Mining Suite

Conclusion

References

Data Visualization: Language Variation Suite and Interactive Text Mining Suite

Olga Scrivner

Indiana University

LSU, April 2016

Data Analysis and Visualization

“As our collective knowledge continues to be digitized and stored (...) it becomes more difficult to find and discover what we are looking for.” (Blei 2012)

“Mastery of quantitative methods is increasingly becoming a vital component of linguistic training” (Johnson, 2008)

Data Types

1 Structured Data

2 Unstructured Data

Quantitative Analysis for Structured Data

Traditional Tools                   Linguistic Data

a. Categorical variable             a. Categorical, continuous, multivariate, ordinal
b. Independence of observations     b. Correlated data
c. Normally distributed data        c. Unbalanced data
d. Large corpus size                d. Small corpus size

Data Visualization

Word Order in Latin (Passarotti et al., 2013)

Visual Analytics - “The science of analytical reasoning facilitated by visual interactive interfaces” (Thomas et al., 2005)

New Tools of Linguistic Analysis (Baayen 2008, Tagliamonte 2014, Gries 2015)

1 Mixed Model:

A statistical regression model containing fixed effects (independent variables) and random effects (e.g., individual- or word-specific effects).

Measures variability between subjects and correlation of observations within subjects

Can handle unbalanced data
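The split between between-subject variability and within-subject correlation can be illustrated with a small variance-components sketch. This is not how LVS or an R mixed-model package fits its models: it assumes a balanced design (the same number of observations per speaker) and uses the classical one-way ANOVA estimator. The function name `intraclass_correlation` and the toy measurements are hypothetical.

```python
from statistics import mean


def intraclass_correlation(groups):
    """Estimate the share of variance due to between-subject differences.

    `groups` is a list of equally sized lists, one per subject
    (balanced design assumed). Uses the one-way ANOVA estimator.
    """
    k = len(groups)      # number of subjects
    n = len(groups[0])   # observations per subject
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch requires a balanced design")

    grand_mean = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]

    # Mean squares between and within subjects.
    ms_between = n * sum((m - grand_mean) ** 2 for m in group_means) / (k - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    ) / (k * (n - 1))

    # Between-subject variance component, floored at zero.
    var_between = max((ms_between - ms_within) / n, 0.0)
    return var_between / (var_between + ms_within)


# Hypothetical measurements: three speakers, three tokens each.
speakers = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
icc = intraclass_correlation(speakers)
```

A value close to 1 means that tokens from the same speaker resemble each other far more than tokens from different speakers, which is the situation where a random effect for speaker matters most.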

New Tools of Linguistic Analysis (Baayen 2008, Tagliamonte 2014, Gries 2015)

2 Conditional inference trees and Random Forests

Uses predictive modeling

“Proves to be more stable than stepwise variable selection approaches available for logistic regression” (Strobl 2009:325)

Can handle skewed data that often violate the assumptions of regression approaches

RStudio and Shiny Application

1 R - a free programming language for statistical computing and graphics

2 RStudio - Integrated Development Environment: a source code editor, an executor and a debugger

3 Shiny App - a web application framework for R

Computational power of R + Web interactivity

Language Variation Suite (LVS) - a Statistical Shiny Application

1 From https://languagevariationsuite.wordpress.com/ download Labov’s New York 1966 data (LabovData.csv) and the Caracas data from Bentivoglio & Sedano 1993 (CaracasData.csv)

2 Open the LVS application: https://languagevariationsuite.shinyapps.io/Pages

Language Variation Suite - Introduction

1 Data in csv format (no spaces in column names)
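Column names can be checked and repaired before upload. The sketch below is one possible way to do it, assuming that replacing whitespace with underscores is acceptable for your dataset; the helper `clean_column_names` and the sample content are hypothetical and not part of LVS.

```python
import csv
import io
import re


def clean_column_names(header):
    """Replace runs of whitespace in each column name with an underscore."""
    return [re.sub(r"\s+", "_", name.strip()) for name in header]


# Hypothetical CSV content standing in for a file on disk.
raw = "store,lexical item,r use\nSaks,fourth,retention\n"
rows = list(csv.reader(io.StringIO(raw)))
rows[0] = clean_column_names(rows[0])

# Write the repaired table back out as CSV text.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
cleaned = out.getvalue()
```

For a real file, the same two steps apply with `open(path, newline="")` in place of the in-memory buffers.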

Files

2 Upload your file

Descriptive Data Analysis - Table

Table displays your dataset and lets you filter rows by a search word or sort columns in ascending or descending order.
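The two table operations, search and sort, amount to the following. This is only an illustrative sketch, not the LVS implementation; the function names and the sample rows are hypothetical.

```python
def filter_rows(rows, word):
    """Keep rows in which any cell contains `word` (case-insensitive)."""
    needle = word.lower()
    return [r for r in rows if any(needle in str(v).lower() for v in r.values())]


def sort_rows(rows, column, descending=False):
    """Return the rows ordered by the values of one column."""
    return sorted(rows, key=lambda r: r[column], reverse=descending)


# Hypothetical rows; column names are illustrative only.
table = [
    {"store": "Saks", "word": "floor", "r": "retention"},
    {"store": "Klein", "word": "fourth", "r": "deletion"},
    {"store": "Macys", "word": "floor", "r": "deletion"},
]
floor_rows = filter_rows(table, "floor")
by_store_desc = sort_rows(table, "store", descending=True)
```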

Summary

Summary provides a quantitative summary for each variable, e.g., frequency count, mean, median.
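A per-variable summary of this kind can be sketched as follows, assuming categorical columns get frequency counts and numeric columns get a mean and median. The function `summarize` and the sample columns are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def summarize(values):
    """Return counts for categorical data, mean and median for numeric data."""
    if all(isinstance(v, (int, float)) for v in values):
        return {"mean": mean(values), "median": median(values)}
    return {"counts": dict(Counter(values))}


# Hypothetical columns.
store_summary = summarize(["Saks", "Klein", "Saks", "Macys"])
age_summary = summarize([20, 30, 70])
```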

Data Structure

1 factor - categorical values, e.g., m/f (gender), 20-34/65+ (age), low/high (economic level)

2 num - numerical values, e.g., 0.95, 1.53

3 int - integer values, e.g., 1, 2, 10
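How a column ends up as factor, num, or int can be approximated with a simple rule: try the strictest type first. The sketch below is an assumption about the general idea, not a description of R's actual type inference; `infer_type` is a hypothetical helper.

```python
def infer_type(values):
    """Classify a column of strings as 'int', 'num', or 'factor'."""
    def parses(cast):
        # True only if every value in the column converts cleanly.
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if parses(int):
        return "int"
    if parses(float):
        return "num"
    return "factor"
```

A value range such as 20-34 is text rather than a number, so an age column coded this way is treated as a factor.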

Data Subset

Cross-Tabulation

Cross-tabulation is a useful feature to examine the distribution of your dependent variable.
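A cross-tabulation counts observations for each combination of two categorical variables. A minimal sketch, with a hypothetical function name and invented tokens rather than the Labov data:

```python
from collections import Counter


def cross_tabulate(rows, row_var, col_var):
    """Count observations for every combination of two categorical variables."""
    counts = Counter((r[row_var], r[col_var]) for r in rows)
    row_levels = sorted({r[row_var] for r in rows})
    col_levels = sorted({r[col_var] for r in rows})
    return {
        rl: {cl: counts[(rl, cl)] for cl in col_levels} for rl in row_levels
    }


# Hypothetical tokens.
tokens = [
    {"store": "Saks", "r": "retention"},
    {"store": "Saks", "r": "retention"},
    {"store": "Saks", "r": "deletion"},
    {"store": "Klein", "r": "deletion"},
]
table = cross_tabulate(tokens, "store", "r")
```

Empty cells are reported as 0 rather than omitted, which makes unbalanced designs easy to spot.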

Cross-Tabulation

Saks (upper middle-class store), Macy’s (middle-class store), Klein (working-class store)

Cluster

Cluster Analysis allows you to classify your data into sub-groups (clusters), which are defined by your data. Items in the same cluster will be very similar to one another.

Saks (upper middle-class store), Macy’s (middle-class store), Klein (working-class store)

LVS - Inferential Analysis

Fixed Regression Model - ignoring individual variations (speakers or words) may lead to Type I Error: “a chance effect is mistaken for a real difference between the populations”

Mixed Regression Model - prone to Type II Error: “if speaker variation is at a high level, we cannot discern small population effects without a large number of speakers” (Johnson 2009, 2015)

Regression Model Selection

Model Output

Interpretation

Dependent Variable: deletion and retention

By default, deletion is the reference value (first in alphabetical order)

Results are interpreted for retention
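The alphabetical default can be made concrete with a small coding sketch: the level that sorts first becomes the reference (coded 0), and the model describes the other level (coded 1). The helper `code_binary_outcome` is hypothetical and only mimics this default ordering.

```python
def code_binary_outcome(values):
    """Code a two-level outcome as 0/1, reference level first alphabetically."""
    levels = sorted(set(values))
    if len(levels) != 2:
        raise ValueError("expected exactly two outcome levels")
    reference, modeled = levels
    coded = [0 if v == reference else 1 for v in values]
    return reference, modeled, coded


reference, modeled, coded = code_binary_outcome(
    ["retention", "deletion", "deletion", "retention"]
)
```

Renaming a level changes which one sorts first, and with it the direction in which every coefficient is read.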

Interpretation

Lexical item Fourth has a negative effect on retention and is significant

Normal style has a slightly negative effect on retention but its coefficient is not significant

Macy’s and Saks have a positive and significant effect on retention. Saks (upper middle class store) is more significant than Macy’s (middle class store)
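The sign of a logistic regression coefficient translates into a direction of change in the probability of the modeled outcome (here, retention). The sketch below shows the conversion from log-odds to odds ratios and probabilities. The coefficient values are hypothetical placeholders and are not taken from the LVS output.

```python
import math


def odds_ratio(coefficient):
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(coefficient)


def probability(log_odds):
    """Convert log-odds into a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients, not taken from the LVS output.
intercept = 0.0
negative_effect = -1.0
positive_effect = 1.0

p_baseline = probability(intercept)
p_lower = probability(intercept + negative_effect)
p_higher = probability(intercept + positive_effect)
```

A negative coefficient gives an odds ratio below 1 and a lower probability than the baseline; a positive coefficient does the opposite.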

Conditional Tree

Conditional Tree

Store is the most significant factor for R-use: Klein (working-class store) shows more R-deletion; Macy’s and Saks have a higher rate of R-retention, which also depends on the lexical item (Floor shows more retention than Fourth).
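The first split of a tree goes to the predictor most strongly associated with the outcome. Conditional inference trees make this choice with permutation-test p-values; the sketch below substitutes a plain Pearson chi-square statistic as a simplified stand-in, which is only directly comparable across predictors with the same number of levels. Function names and tokens are hypothetical.

```python
from collections import Counter


def chi_square(rows, predictor, outcome):
    """Pearson chi-square statistic for two categorical variables."""
    n = len(rows)
    observed = Counter((r[predictor], r[outcome]) for r in rows)
    row_totals = Counter(r[predictor] for r in rows)
    col_totals = Counter(r[outcome] for r in rows)
    stat = 0.0
    for p_level, p_total in row_totals.items():
        for o_level, o_total in col_totals.items():
            expected = p_total * o_total / n
            stat += (observed[(p_level, o_level)] - expected) ** 2 / expected
    return stat


def strongest_predictor(rows, predictors, outcome):
    """Pick the predictor most strongly associated with the outcome."""
    return max(predictors, key=lambda p: chi_square(rows, p, outcome))


# Hypothetical tokens in which store separates the outcome and word does not.
tokens = [
    {"store": "Klein", "word": "floor", "r": "deletion"},
    {"store": "Klein", "word": "fourth", "r": "deletion"},
    {"store": "Saks", "word": "floor", "r": "retention"},
    {"store": "Saks", "word": "fourth", "r": "retention"},
]
first_split = strongest_predictor(tokens, ["store", "word"], "r")
```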

Random Forest

Random Forest

The variable importance score demonstrates that Store is the most important predictor, followed by Lexical Item. A variable is irrelevant if its importance falls around zero, near the cut-off value (red dotted line).
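Variable importance of this kind is commonly computed by permutation: shuffle one predictor and measure how much prediction accuracy drops. The sketch below illustrates that idea only. A simple lookup table stands in for the forest, accuracy is measured on the training rows rather than on out-of-bag samples, and all names and data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_lookup(rows, predictors, outcome):
    """Majority outcome per predictor combination (stand-in for a forest)."""
    votes = defaultdict(Counter)
    for r in rows:
        votes[tuple(r[p] for p in predictors)][r[outcome]] += 1
    overall = Counter(r[outcome] for r in rows).most_common(1)[0][0]
    table = {key: c.most_common(1)[0][0] for key, c in votes.items()}
    return lambda r: table.get(tuple(r[p] for p in predictors), overall)


def accuracy(model, rows, outcome):
    """Share of rows whose outcome the model predicts correctly."""
    return sum(model(r) == r[outcome] for r in rows) / len(rows)


def permutation_importance(model, rows, predictor, outcome, rng):
    """Drop in accuracy after shuffling one predictor's values."""
    shuffled = [r[predictor] for r in rows]
    rng.shuffle(shuffled)
    permuted = [dict(r, **{predictor: v}) for r, v in zip(rows, shuffled)]
    return accuracy(model, rows, outcome) - accuracy(model, permuted, outcome)


# Hypothetical data: the outcome depends on store only.
rows = [
    {"store": s, "word": w, "r": "retention" if s == "Saks" else "deletion"}
    for s in ("Saks", "Klein") for w in ("floor", "fourth") for _ in range(10)
]
predictors = ["store", "word"]
model = fit_lookup(rows, predictors, "r")
rng = random.Random(0)
importance = {
    p: permutation_importance(model, rows, p, "r", rng) for p in predictors
}
```

In this toy data, shuffling word leaves every prediction unchanged, so its importance is zero, while shuffling store breaks the predictions.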

Data with Token Frequency

Upload CaracasData.csv from https://languagevariationsuite.wordpress.com/

Tokens

Let’s Have a Short Break

Visual Analytics for Digital Humanities

The “epic transformation of archives” - shifting from print to digital archival form (Folsom, 2007)

Digital Humanities Manifesto 2.0 (2009) and Berry (2011)

1st Wave: “The first wave of digital humanities work was quantitative, mobilizing the search and retrieval powers of the database, automating corpus linguistics, stacking hypercards into critical arrays”

2nd Wave: “The second wave is qualitative, interpretive”, concentrating on new tools for creating and curating digital repositories (Berry, 2011)

3rd Wave: Concentration on the computationality, search, retrieval and analysis originated in humanities-based work

New Ways of Exploring Data Collections

Graphs, maps and trees for literature analysis (Moretti, 2005)

Visual Analytics

Word clouds to analyze a novel (Vuillemot et al., 2009)

Visual Analytics

Social network graphs of characters in Greek tragedies (Rydberg-Cox, 2011)

Visual Analytics

Literary fingerprint and summaries (Oelke et al., 2012)

Visual Analytics

Tracking emotion and sentiment in fairy tales (Mohammad, 2012)

Topic Modeling

Discovering the underlying themes of a collection from Science magazine, 1990-2000 (Blei, 2012)

For more information on topic modeling: http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/

Interactive Text Mining Suite - Introduction

1 Download 3 text files (dante01.txt, dante02.txt, dante03.txt) from https://languagevariationsuite.wordpress.com/ (workshop)

2 ITMS Application: https://languagevariationsuite.shinyapps.io/TextMining/

Upload Files - txt

Explore

Metadata

Fields: ID, Date, Title, Author, Other

Extract from pdf files

Upload from csv file

Stopwords

Frequency
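Stopword removal followed by a frequency count can be sketched in a few lines. This is an illustration of the general technique under simple assumptions (lowercasing, letter-only tokens), not the ITMS implementation; the function name, the sample text, and the stopword list are hypothetical.

```python
import re
from collections import Counter


def word_frequencies(text, stopwords):
    """Count lowercase word tokens, skipping any listed stopwords."""
    # Letter-only tokens; the pattern also matches accented letters.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


# Hypothetical text and stopword list.
sample = "The way was lost, and the dark forest was dark."
freq = word_frequencies(sample, stopwords={"the", "and", "was"})
most_frequent = freq.most_common(1)
```

If uninformative words still dominate the counts, they can be added to the stopword set and the count repeated.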

Frequency Visualization

More Stopwords

Topic Modeling

Selection of topics (how many different themes)

Selection of words per theme (how many words per topic)

Identification of the best topic number
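The two settings play different roles: the number of topics is fixed when the model is fitted, while the number of words per topic only controls how much of each fitted topic is displayed. The sketch below shows the second step, assuming a model has already produced per-topic word weights. The function `top_words` and the weights are hypothetical, and the model fitting itself is not shown.

```python
def top_words(topic_word_weights, words_per_topic):
    """Return the highest-weighted words for each topic."""
    summary = []
    for weights in topic_word_weights:
        ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
        summary.append([word for word, _ in ranked[:words_per_topic]])
    return summary


# Hypothetical weights for a model fitted with two topics.
weights = [
    {"love": 0.5, "heart": 0.3, "road": 0.1, "star": 0.1},
    {"road": 0.6, "star": 0.2, "love": 0.1, "heart": 0.1},
]
summary = top_words(weights, words_per_topic=2)
```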

Models

LDA (Latent Dirichlet Allocation)

STM (Structural Topic Model)

Chronological topic visualization (LDA): requires metadata

Cluster Analysis

Punctuation Analysis

Future Directions

1 New LVS features:

(a) Traditional Rbrul analysis (for comparison)

(b) Variable re-coding and dataset modification

2 New ITMS features:

(a) Network graphs

(b) Dynamic graphs

Acknowledgements

I would like to thank Professor Rafael Orozco, Professor Irina Shport and LSU Linguistics for inviting me and organizing this workshop.

References I

[1] Baayen, Harald. 2008. Analyzing linguistic data: A practical introduction to statistics. Cambridge: Cambridge University Press

[2] Bentivoglio, Paola and Mercedes Sedano. 1993. Investigación sociolingüística: sus métodos aplicados a una experiencia venezolana. Boletín de Lingüística 8. 3-35

[3] Gries, Stefan Th. 2015. Quantitative designs and statistical techniques. In Douglas Biber & Randi Reppen (eds.), The Cambridge Handbook of English Corpus Linguistics. Cambridge: Cambridge University Press

[4] Jockers, Matthew. 2014. Text Analysis with R for Students of Literature. Quantitative Methods in the Humanities and Social Sciences. Springer International Publishing, Cham

[5] Labov, W. 1966. The Social Stratification of English in New York City. Washington: Center for Applied Linguistics

[6] Moretti, Franco. 2005. Graphs, Maps, Trees: Abstract Models for a Literary History. Verso

[7] Oelke, Daniella, Dimitrios Kokkinakis, and Mats Malm. 2012. Advanced visual analytics methods for literature analysis. Proceedings of the 6th EACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 35-44

[8] Passarotti, Marco, Barbara McGillivray, and David Bamman. 2013. A Treebank-based Study on Latin Word Order. In Proceedings of the 16th International Colloquium on Latin Linguistics, Uppsala, Sweden, 340-352

[9] Schnapp, Jeffrey, and Peter Presner. 2009. Digital Humanities Manifesto 2.0.

[10] http://blog.kandu.com/post/57065268403/book-reading-gif

[11] http://cdn.business2community.com/wp-content/uploads/2014/09/archives01.jpg
