Introduction

Language Variation Suite

Visual Analytics for Digital Humanities

Interactive Text Mining Suite

Conclusion

References

Data Visualization: Language Variation Suite and Interactive Text Mining Suite

Olga Scrivner

Indiana University

LSU, April 2016

Data Analysis and Visualization

“As our collective knowledge continues to be digitized and stored (...) it becomes more difficult to find and discover what we are looking for.” (Blei 2012)

“Mastery of quantitative methods is increasingly becoming a vital component of linguistic training” (Johnson, 2008)

Data Types

1 Structured Data

2 Unstructured Data

Quantitative Analysis for Structured Data

Traditional Tools:

a. Categorical variable

b. Independence of observation

c. Normally distributed data

d. Large corpus size

Linguistic Data:

a. Categorical, continuous, multivariate, ordinal

b. Correlated data

c. Unbalanced data

d. Small corpus size

Data Visualization

Word Order in Latin (Passarotti et al., 2013)

Visual Analytics - “The science of analytical reasoning facilitated by visual interactive interfaces” (Thomas et al., 2005)

New Tools of Linguistic Analysis (Baayen 2008, Tagliamonte 2014, Gries 2015)

1 Mixed Model:

A statistical regression model containing fixed effects (independent variables) and random effects (e.g., individual- or word-specific effects).

Measures variability between subjects and correlation of observations within subjects

Can handle unbalanced data
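One common way to write such a model, sketched here for a binary outcome with a random intercept per speaker. The symbols are assumptions chosen for illustration and do not come from the slides: i indexes observations, j indexes speakers, x is a single fixed-effect predictor, and u_j is the speaker-specific random effect.

```latex
\operatorname{logit} P(y_{ij} = 1) = \beta_0 + \beta_1 x_{ij} + u_j,
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2)
```

In this formulation, the variance \sigma_u^2 captures variability between speakers, and the shared u_j is what makes observations from the same speaker correlated.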

New Tools of Linguistic Analysis (Baayen 2008, Tagliamonte 2014, Gries 2015)

2 Conditional inference trees and Random Forests

Uses predictive modeling

“Proves to be more stable than stepwise variable selection approaches available for logistic regression” (Strobl 2009:325)

Can handle skewed data that often violate the assumptions of regression approaches

RStudio and Shiny Application

1 R - a free programming language for statistical computing and graphics

2 RStudio - Integrated Development Environment: a source code editor, an executor and a debugger

3 Shiny App - a web application framework for R

Computational power of R + Web interactivity

Language Variation Suite (LVS) - a Statistical Shiny Application

1 From https://languagevariationsuite.wordpress.com/ download Labov’s data New York 1966 (LabovData.csv) and Caracas data Bentivoglio & Sedano 1993 (CaracasData.csv)

2 Open LVS application https://languagevariationsuite.shinyapps.io/Pages

Language Variation Suite - Introduction

1 Data in csv format (no spaces in column names)
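Before uploading, the column-name requirement can be checked locally. The following is a minimal Python sketch, not part of LVS; the function names and the sample header are hypothetical, and the sample does not reproduce the contents of LabovData.csv.

```python
import csv
import io


def columns_with_spaces(csv_text):
    """Return header names that contain any whitespace."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in header if any(ch.isspace() for ch in name)]


def underscore_header(csv_text):
    """Return the header with runs of whitespace replaced by underscores."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return ["_".join(name.split()) for name in header]


# Hypothetical sample, not the contents of LabovData.csv.
sample = "speaker id,store,r\n1,Saks,retention\n"
print(columns_with_spaces(sample))
print(underscore_header(sample))
```

Replacing spaces with underscores is only one possible convention; any renaming that leaves no spaces in the header would satisfy the stated requirement.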

Files

2 Upload your file

Descriptive Data Analysis - Table

Table displays your dataset and allows filtering by a search word or sorting columns in descending/ascending order.
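The same two operations, filtering by a search word and sorting by a column, can be sketched in plain Python. This is an illustration under assumed data, not the code behind the LVS table; the rows and helper names are hypothetical.

```python
import csv
import io

# Hypothetical rows in the spirit of the department-store data.
SAMPLE = """store,word,r
Saks,floor,retention
Klein,fourth,deletion
Macys,floor,retention
Saks,fourth,deletion
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def filter_rows(rows, term):
    """Keep rows in which any cell contains the search term (case-insensitive)."""
    term = term.lower()
    return [row for row in rows
            if any(term in value.lower() for value in row.values())]


def sort_rows(rows, column, descending=False):
    """Sort rows by one column, ascending by default."""
    return sorted(rows, key=lambda row: row[column], reverse=descending)


rows = load_rows(SAMPLE)
saks_rows = filter_rows(rows, "saks")
by_store_desc = sort_rows(rows, "store", descending=True)
print(len(saks_rows))
print([row["store"] for row in by_store_desc])
```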

Summary

Summary provides a quantitative summary for each variable, e.g., frequency count, mean, median.
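A rough Python sketch of this kind of per-variable summary follows. It assumes that numeric columns get a mean and median while all other columns get frequency counts; the column values are hypothetical and the function is not LVS code.

```python
from collections import Counter
from statistics import mean, median


def summarize(values):
    """Numeric columns get mean and median; other columns get frequency counts."""
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        return {"counts": dict(Counter(values))}
    return {"mean": mean(numbers), "median": median(numbers)}


# Hypothetical columns as they might be read from a CSV file (all strings).
store = ["Saks", "Macys", "Saks", "Klein"]
age = ["25", "40", "31", "62"]

store_summary = summarize(store)
age_summary = summarize(age)
print(store_summary)
print(age_summary)
```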

Data Structure

1 factor - categorical values, e.g., m/f (gender), 20-34/65+ (age), low/high (economic level)

2 num - numerical values, e.g., 0.95, 1.53; int - integer values, e.g., 1, 2, 10
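The three labels can be mimicked with a small type-inference helper. This Python sketch only imitates the labels shown above; it is an assumption about how such a classification could work, not a description of how R or LVS assigns types.

```python
def infer_type(values):
    """Label a column as 'int', 'num', or 'factor' from its string values."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "int"
    if all_parse(float):
        return "num"
    return "factor"


print(infer_type(["1", "2", "10"]))
print(infer_type(["0.95", "1.53"]))
print(infer_type(["20-34", "65+"]))
```

Note that an age range such as 20-34 is treated as a category, which matches the factor example on the slide.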

Data Subset

Cross-Tabulation

Cross-tabulation is a useful feature to examine the distribution of your dependent variable.
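As an illustration, a cross-tabulation of a dependent variable against one factor can be sketched in Python as below. The counts are invented for the example and are not Labov's figures; the function names are hypothetical.

```python
from collections import Counter


def crosstab(rows, row_key, col_key):
    """Count observations for each combination of two variables."""
    counts = Counter((row[row_key], row[col_key]) for row in rows)
    table = {}
    for (row_value, col_value), n in counts.items():
        table.setdefault(row_value, {})[col_value] = n
    return table


def row_percentages(table):
    """Convert each row of counts into percentages of that row's total."""
    return {
        row_value: {c: 100 * n / sum(cols.values()) for c, n in cols.items()}
        for row_value, cols in table.items()
    }


# Invented observations: store is the factor, r is the dependent variable.
rows = (
    [{"store": "Saks", "r": "retention"}] * 3
    + [{"store": "Saks", "r": "deletion"}] * 1
    + [{"store": "Klein", "r": "retention"}] * 1
    + [{"store": "Klein", "r": "deletion"}] * 3
)

table = crosstab(rows, "store", "r")
percent = row_percentages(table)
print(table)
print(percent)
```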

Cross-Tabulation

Saks (upper middle-class store), Macy’s (middle-class store), Klein (working-class)

Cluster

Cluster Analysis allows you to classify your data into sub-groups (clusters), which are defined by your data. Items in the same cluster will be very similar to one another.

Saks (upper middle-class store), Macy’s (middle-class store), Klein (working-class)

LVS - Inferential Analysis

Fixed Regression Model - ignoring individual variations (speakers or words) may lead to Type I Error: “a chance effect is mistaken for a real difference between the populations”

Mixed Regression Model - prone to Type II Error: “if speaker variation is at a high level, we cannot discern small population effects without a large number of speakers” (Johnson 2009, 2015)

Regression Model Selection

Model Output

Interpretation

Dependent Variable: deletion and retention

By default, deletion is the reference value (the first level alphabetically)

Results are interpreted for retention
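The reference-level logic can be sketched in Python: pick the alphabetically first level as the reference, then express results as the log-odds of the other level. The outcome counts are invented for illustration, and the helpers are hypothetical rather than taken from LVS.

```python
import math


def reference_level(levels):
    """Return the alphabetically first level, used here as the reference."""
    return sorted(set(levels))[0]


def log_odds(outcomes, reference):
    """Log-odds of the non-reference outcome in a two-level variable."""
    n_reference = sum(1 for outcome in outcomes if outcome == reference)
    n_other = len(outcomes) - n_reference
    return math.log(n_other / n_reference)


# Invented outcomes: three retentions and one deletion.
outcomes = ["retention"] * 3 + ["deletion"] * 1

ref = reference_level(outcomes)
value = log_odds(outcomes, ref)
print(ref)
print(value)
```

A positive value means the non-reference outcome (retention) is more likely than the reference (deletion), which is why positive coefficients in the model output are read as favoring retention.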

Interpretation

Lexical item Fourth has a negative effect on retention and is significant

Normal style has a slightly negative effect on retention but its coefficient is not significant

Macy’s and Saks have a positive and significant effect on retention. Saks (upper middle class store) is more significant than Macy’s (middle class store)

Conditional Tree

Conditional Tree

Store is the most significant factor for R-use: Klein (working-class store) shows more R-deletion; Macy’s and Saks have a higher rate of R-retention, which also depends on the lexical item (Floor shows more retention than Fourth)

Random Forest

Random Forest

The variable importance score demonstrates that Store is the most important predictor, followed by Lexical Item. A variable is irrelevant if its importance is close to zero and to the cut-off value (red dotted line).

Data with Token Frequency

Upload CaracasData.csv from https://languagevariationsuite.wordpress.com/

Tokens

Let’s Have a Short Break

Visual Analytics for Digital Humanities

The “epic transformation of archives” - shifting from print to digital archival form (Folsom, 2007)

Digital Humanities Manifesto 2.0 (2009) and Berry (2011)

1st Wave: “The first wave of digital humanities work was quantitative, mobilizing the search and retrieval powers of the database, automating corpus linguistics, stacking hypercards into critical arrays”

2nd Wave: “The second wave is qualitative, interpretive”, concentrating on new tools for creating and curating digital repositories (Berry, 2011)

3rd Wave: Concentration on the computationality, search, retrieval and analysis originating in humanities-based work

New Ways of Exploring Data Collections

Graphs, maps and trees for literature analysis (Moretti, 2005)

Visual Analytics

Word clouds to analyze a novel (Vuillemot et al., 2009)

Visual Analytics

Social network graphs of characters in Greek tragedies (Rydberg-Cox, 2011)

Visual Analytics

Literary fingerprint and summaries (Oelke et al., 2012)

Visual Analytics

Tracking emotion and sentiment in fairy tales (Mohammad, 2012)

Topic Modeling

Discovering the underlying themes of a collection from Science magazine, 1990-2000 (Blei, 2012)

For more information on topic modeling: http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/

Interactive Text Mining Suite - Introduction

1 Download 3 text files (dante01.txt, dante02.txt, dante03.txt) from https://languagevariationsuite.wordpress.com/ (workshop)

2 ITMS Application: https://languagevariationsuite.shinyapps.io/TextMining/

Upload Files - txt

Explore

Metadata

ID, Date, Title, Author, Other

Extract from pdf files

Upload from csv file

Stopwords

Frequency
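Stopword removal followed by a word-frequency count can be sketched in Python as follows. The stopword list and the sample sentence are hypothetical (the sentence is not taken from the Dante files), and ITMS may tokenize text differently.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real list would be language-specific.
STOPWORDS = {"the", "of", "and", "a", "in", "to"}


def word_frequencies(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into word tokens, and count non-stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(token for token in tokens if token not in stopwords)


# Invented sample sentence.
text = "The wood and the dark wood and the way"
freq = word_frequencies(text)
print(freq.most_common(3))
```

Adding more words to the stopword set and re-running the count mirrors the iterative refinement that the later stopword step suggests.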

Frequency Visualization

More Stopwords

Topic Modeling

Selection of topics (how many different themes)

Selection of words per theme (how many words per topic)

Identification of the best topic number

Models

LDA (Latent Dirichlet Allocation)

STM (Structural Topic Model)

Chronological topic visualization (LDA): requires metadata

Cluster Analysis

Punctuation Analysis

Future Directions

1 New LVS features:

(a) Traditional Rbrul analysis (for comparison)

(b) Variable re-coding and dataset modification

2 New ITMS features:

(a) Network graphs

(b) Dynamic graphs

Acknowledgements

I would like to thank Professor Rafael Orozco, Professor Irina Shport and LSU Linguistics for inviting me and organizing this workshop.

References

[1] Baayen, Harald. 2008. Analyzing linguistic data: A practical introduction to statistics. Cambridge: Cambridge University Press

[2] Bentivoglio, Paola and Mercedes Sedano. 1993. Investigación sociolingüística: sus métodos aplicados a una experiencia venezolana. Boletín de Lingüística 8. 3-35

[3] Gries, Stefan Th. 2015. Quantitative designs and statistical techniques. In Douglas Biber and Randi Reppen (eds.), The Cambridge Handbook of English Corpus Linguistics. Cambridge: Cambridge University Press

[4] Jockers, Matthew. 2014. Text Analysis with R for Students of Literature. Quantitative Methods in the Humanities and Social Sciences. Springer International Publishing, Cham

[5] Labov, W. 1966. The Social Stratification of English in New York City. Washington: Center for Applied Linguistics

[6] Moretti, Franco. 2005. Graphs, Maps, Trees: Abstract Models for a Literary History. Verso

[7] Oelke, Daniella, Dimitrios Kokkinakis, and Mats Malm. 2012. Advanced visual analytics methods for literature analysis. Proceedings of the 6th EACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 35-44

[8] Passarotti, Marco, Barbara McGillivray, and David Bamman. 2013. “A Treebank-based Study on Latin Word Order.” In Proceedings of the 16th International Colloquium on Latin Linguistics, Uppsala, Sweden, 340–352

[9] Schnapp, Jeffrey, and Peter Presner. 2009. Digital Humanities Manifesto 2.0.

[10] http://blog.kandu.com/post/57065268403/book-reading-gif

[11] http://cdn.business2community.com/wp-content/uploads/2014/09/archives01.jpg