Capitalizing on Machine Reading to Engage “Big(ger) Data”
FOUR USE CASES FROM HIGHER ED
Summer Institute on Distance Learning and Instructional Technology (SIDLIT 2016)
August 4 - 5, 2016
Overview
What are some ways to select, say, 200 research articles to “close read” from a set of 2,000 PDF articles* gleaned from library databases and Google Scholar? How can a researcher make sense of a trending issue in the flood of Tweets and retweets (RTs) based on a particular hashtag (#) or keyword search, or an especially lively Tweetstream from a particular social media account? People are dealing with ever more prodigious amounts of information from a number of sources. Those who are savvy about using computers to aid their reading (through “distant reading” or “not-reading”) may find that they are able to cover much more ground. This presentation introduces the use of NVivo 11 Plus (matrix queries, word frequency counts, text searches and dendrograms, cluster analyses, topic modeling, and others) for multiple cases of distant reading to aid in academic and research work.
*Note: In NVivo 11 Plus, it seems easier to process .docx and .txt textual data than .PDF, so some file transcoding may be necessary for smoother processing.
Presentation Topic Headings
What is Big(ger) Data?
What is Machine Reading?
Some Machine Reading Sequences
Four “Use Cases” from Higher Education
◦ Use Case #1: Topic Modeling and Article Histograms (latent structure in documents and document sets)
◦ Use Case #2: Engaging with Social Media (Microblogging and Social Network) Data
◦ Use Case #3: Exploring Manual and Machine Coding
◦ Use Case #4: Machine Reading for Sentiment Analysis
What is Big(ger) Data?
STRUCTURED DATA
Thousands of individual records extracted from a microblogging or social networking platform
…and others
UNSTRUCTURED DATA
Hundreds to thousands of scraped images from a content-sharing platform
Hundreds of videos
Complex mixed media (heterogeneous) sets of data
…and others
Where Does the “Bigger Data” Come From?
Social media platforms (crowd-sourced encyclopedias, social networking sites, microblogging sites, blogging sites, image sharing sites, video sharing sites, email sites, collaborative work sites, and others)
Databases of published contents
World Wide Web (WWW) and Internet
Synthetic data from various software tools or sites or other processes
…and others
What is Machine Reading?
Use of computers to decode text (which generally was created from natural language processes)
◦ Also sometimes called “distant reading” (Moretti, 2000) or “not-reading”
Capturing of quantitative measures of a text
◦ Reliable and consistent, reproducible / repeatable
◦ Patterned
◦ May be generalizable and transferable to other contexts
What is Machine Reading? (cont.)
Some common types of machine-based text analytics include the following:
◦ Linguistic analysis (based on word counts of function words, punctuation, and other parts of written and spoken language)
◦ Style / stylometry
◦ Sentiment analysis
◦ Emotion analysis
◦ Cognitive analysis
◦ Social standing analysis
◦ Deception analysis, and others
What is Machine Reading? (cont.)
◦ Word frequency counts (and resulting word clouds)
◦ Word searches (and resulting word trees)
◦ Word network analysis (and resulting network graphs)
◦ Word similarity clustering (and resulting 2D and 3D word clusters)
◦ Word proximity clustering (and resulting 2D and 3D word clusters)
◦ Topic modeling or theme/subtheme extraction through unsupervised machine learning
A Brief History of Machine Reading
Technologies originated in the 1960s
◦ Assistive technology track (Haskins Laboratories, 1970s; Kurzweil Computer Products, 1975; and others)
◦ Natural language processing track (Bobrow, 1964; Weizenbaum, 1965; Schank, 1969; Woods, 1970; Winograd, 1971; Hendrix, 1982; and others)
Recently
◦ applied to Web and Internet scale texts
◦ becoming popular for individual academic researcher-level applications
Supervised and Unsupervised Machine Learning from Text / Text Corpora
SUPERVISED OR SEMI-SUPERVISED (WITH DIRECT HUMAN INPUTS)
Coding by existing pattern (computer emulates human coding over part of a text set and codes the rest of the text set, based on human coding examples)
XML coding and analysis of coded segments of text
◦ Often manual coding
◦ Sometimes machine-enhanced XML coding
UNSUPERVISED (WHOLLY BASED ON COMPUTER ALGORITHM)
◦ Sentiment, emotion, cognitive, and other types of analysis of text data
◦ Word frequency counts with stopwords lists (and resulting word clouds, treemaps)
◦ Word searches (and resulting word trees)
◦ Word network analysis (and resulting network graphs)
Supervised and Unsupervised Machine Learning from Text / Text Corpora (cont.)
SUPERVISED OR SEMI-SUPERVISED (WITH DIRECT HUMAN INPUTS)
Uses of human-labeled data
UNSUPERVISED (WHOLLY BASED ON COMPUTER ALGORITHM)
◦ Word similarity clustering (and resulting dendrogram visualizations, 2D and 3D word cluster diagrams, ring lattice graphs)
◦ Word proximity clustering (and resulting dendrogram visualizations, 2D and 3D word cluster diagrams, ring lattice graphs)
◦ Clustering from factor analysis
◦ Topic modeling or theme/subtheme extraction through unsupervised machine learning (with intensity matrices, bar charts, hierarchy diagrams like treemaps and sunbursts)
Supervised and Unsupervised Machine Learning from Text / Text Corpora (cont.)
All with extracted data tables (from which data visualizations are created)
All with extracted word lists (from which data visualizations are created)
Features and Affordances of Machine Reading
FEATURES
Still some level of human interpretation needed
Extracted clusters, but the human has to apply the interpretation of what those clusters represent (factor analysis, principal components analysis, and others)
AFFORDANCES
Speed
Scale
Reproducibility
Under the Hood…
Bag of words (parsing: removal of punctuation, breaking writing apart into words) vs. structure and context-preserving methods
Sliding “windows” for co-occurrence and proximity captures
Languages are fairly predictably structured, so statistical methods and counting can be applied based on those patterns
Matrices are a common tool to capture relationships between words and parts of a text and documents; all relational matrices may be re-visualized as network graphs and other types of data visualizations
Pre-coded word sets or dictionaries are often used for sentiment analysis, emotion analysis, and the extraction of psychometric features
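As a rough illustration of the bag-of-words and sliding-window ideas above, here is a minimal Python sketch. It is not how NVivo 11 Plus or any other named tool works internally; the function names, the tokenizing rule, and the window size are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text, drop punctuation, and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(text):
    """Count each word, ignoring order and context."""
    return Counter(tokenize(text))

def cooccurrence(text, window=3):
    """Count unordered word pairs that fall inside the same sliding window."""
    tokens = tokenize(text)
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Pair the current word with the next (window - 1) words.
        for right in tokens[i + 1 : i + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs

# Made-up sample text for illustration only.
sample = "Machine reading helps researchers. Researchers use machine reading."
bag = bag_of_words(sample)
pairs = cooccurrence(sample, window=2)
```

The pair counts form a word-by-word relational matrix in sparse form, which is the kind of table that could then be redrawn as a network graph.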
Under the Hood… (cont.)
Often a sequence of algorithmic procedures (and often black box except for the few software makers who are highly focused on documenting accurately)
Some Software and Capabilities
Linguistic Inquiry and Word Count (LIWC2015): application of psychometrics and other constructs across over 100 dimensions; two-plus decades of empirical, lab, and other research; dictionaries available in multiple languages; customized dictionaries may be applied
AutoMap and NetScenes: extraction of content networks, and others (http://www.casos.cs.cmu.edu/projects/automap/)
Coh-Metrix (http://cohmetrix.com/)
DICTION (http://www.dictionsoftware.com/)
Latent Semantic Analysis @ CU Boulder (http://lsa.colorado.edu/)
(The presenter has only tried the top two. MS Word has a brief lexical element which enables the extraction of readability statistics based on the Flesch Reading Ease test and the Flesch-Kincaid Grade Level test.)
NVivo 11 Plus
Primary focus: Qualitative research data analysis suite
◦ Digital data curation
◦ Manual coding
◦ Data queries
◦ Auto coding
◦ Data visualizations, and others
Some built-in data analytics tools that enable machine reading of texts
This presentation will only focus on this particular tool for the use cases.
A Conceptualization of a Recursive and Semi-Linear Sequence
Review of the Literature*
Research Design / Exploration and Discovery*
Multimedia Collection / Text Collection*
Close Human Reading
Data Cleaning
Text and Multimedia Data Curation
Data Runs with Machine Reading Software
Extractions of Data Tables and Text Sets
Data Visualizations
Write-up Presentation
*all potential start points
Power in Combinations and in Sequences
Many points-of-entry to machine reading
Ranges of tactics and strategies and capabilities based on available texts, various software tools, and researcher capabilities
Text processing is sensitive to sequential time, so it is important to be very clear about what is happening at each processing phase (and how the data changes)
◦ Need clear documentation of what changes happened in each step
Why Text?
May be human-collected data sets, sometimes computer program-scraped data sets, or a mix
Text of various types based on conventions
Text as a lowest common denominator in terms of multimedia objects (images, audio, video, and others)
Text Curation and Data Cleaning
Selection of texts to include and those to exclude
Consistent and informative file naming protocols
Rendering of multimedia files to searchable text ones
Ensuring texts are machine readable
◦ Rendering of texts to searchability
De-duplication of files
Normalizing textual data, and others
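The de-duplication and normalization steps above might be scripted before the files are brought into an analysis tool. The following is a minimal Python sketch under stated assumptions: the file names and contents are hypothetical, and "normalizing" is taken here to mean Unicode normalization, lowercasing, and whitespace collapsing, which is only one possible reading.

```python
import hashlib
import re
import unicodedata

def normalize(text):
    """Apply Unicode NFKC normalization, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(documents):
    """Keep the first document for each distinct normalized content hash."""
    seen = set()
    kept = {}
    for name, text in documents.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept[name] = text
    return kept

# Hypothetical file names and contents, for illustration only.
corpus = {
    "smith_2015.txt": "Machine  reading at scale.",
    "smith_2015_copy.txt": "machine reading at scale.",
    "lee_2016.txt": "Topic modeling for corpora.",
}
unique = deduplicate(corpus)
```

Hashing the normalized text rather than the raw bytes means that copies differing only in case or spacing are treated as duplicates; near-duplicates with real wording changes would still need a similarity check.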
dendrogram hierarchy showing similarity clustering
Note: Start with the leaves at the far right and read “up” the tree to branches and then the trunk, in order to understand clustering relationships…from the most granular to the most broad or coarse.
Setup of Machine Reading Processes
There are a half-dozen widely-available software programs that may be used for machine reading of texts. Those who use high-level computer languages also have packages that enable various types of text analysis.
It helps to know what the capabilities are for the various software tools.
It helps to know how to set the parameters for the various processes.
It helps to know how the respective data visualizations are to be interpreted and to account for the fact that data visualizations are summary data. It is important to mitigate negative learning.
Assertability
It helps to know what may be asserted from the respective machine reading processes, so the findings may be accurately represented.
Also, many researchers use multiple processes and outside understandings in order to contextualize the data from machine reading. Those insights should be properly couched as well.
Four “Use Cases” from Higher Education
• APPLICATION OF MACHINE READING FOR:
• Learning
• Awareness
• Research
Overview of Use Cases
Use Case #1: Topic Modeling and Article Histograms (latent structure in documents and document sets)
Use Case #2: Engaging with Social Media (Microblogging and Social Network) Data
Use Case #3: Exploring Manual and Machine Coding
Use Case #4: Machine Reading for Sentiment Analysis
Use Case #1: Topic Modeling and Article Histograms (latent structure in documents and document / text sets)
Using a computer to read a large number of articles or contents to extract topics
◦ Can be a “knowledge poor” approach in which no prior information about the domain is applied to the topic extraction (as in the case of NVivo 11 Plus)
Topic modeling may be used to identify which works should be read using human “close reading”
◦ Article histograms through theme / subtheme extraction (topic modeling)
◦ Understanding of topics as a finite feature set
◦ Classification of articles by their main named contents
Using a computer to extract themes and sub-themes to understand the general gist of an article or a text set
Can auto-code at three levels of granularity: sentence, paragraph, and cell (depending on the structure of the data)
Use Case #1 Work Sequence
Collecting the documents as separate items in a text corpus
Cleaning the text sets
Running the theme and sub-theme extraction against the set (NVivo 11 Plus)
Selecting the appropriate themes and subthemes
Auto-coded at the more granular sentence level (vs. paragraph level)
Exporting the text-document intensity matrix
Mapping the entire set as a line graph (Excel)
Mapping separate articles as article histograms
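To make the last steps of this sequence concrete, here is a minimal Python sketch of a theme-by-document intensity matrix and a per-article histogram. In the presentation these come from NVivo 11 Plus and Excel; this sketch instead assumes the themes are already chosen and simply counts sentences that contain each theme term. The documents, theme list, and function names are hypothetical.

```python
import re

def intensity_matrix(documents, themes):
    """Count, per document, the sentences that mention each theme term."""
    matrix = {}
    for name, text in documents.items():
        # Naive sentence split on ., !, or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", text.lower())
        matrix[name] = {
            theme: sum(1 for s in sentences if theme in s) for theme in themes
        }
    return matrix

def histogram(row, width=20):
    """Render one document's theme counts as rows of text bars."""
    peak = max(row.values()) or 1
    return [
        f"{theme:<10} {'#' * round(width * n / peak)}" for theme, n in row.items()
    ]

# Hypothetical two-article corpus and theme list.
docs = {
    "a.txt": "Machine reading scales. Close reading does not. Reading matters.",
    "b.txt": "Topic models find themes. Themes guide reading.",
}
matrix = intensity_matrix(docs, ["reading", "themes"])
bars = histogram(matrix["a.txt"], width=10)
```

Each row of `matrix` corresponds to one article's histogram, so articles whose bars concentrate on the themes of interest are candidates for close reading.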
Assumptions of the Topic Modeling Approach
Word counts in respective documents are usually normalized to account for varying document lengths (so as not to overweight the words appearing in longer documents)
Each article is represented as a list of identified “important” words
◦ Important words are likely semantic meaning-bearing terms
◦ Important words are sometimes “rare” ones
◦ Important words are not function words (articles, pronouns, etc.), which are usually on a “stopwords” list
◦ If the TF-IDF (term frequency-inverse document frequency) calculation is used,
◦ words are upweighted if they occur a fair amount in a local document but occur rarely globally in the set, and
◦ words are downweighted if they occur a lot across documents (globally), in order to lower the influence of frequently appearing words over rarer-occurring ones
◦ Words directly linked to the identified themes are placed as sub-themes (in bigrams, trigrams, and so on)
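The TF-IDF weighting described above can be sketched in a few lines of Python. This is one common textbook variant (length-normalized term frequency times the natural log of the inverse document frequency), offered as an assumption; the presentation does not state which formula, if any, NVivo 11 Plus applies. The sample documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Term frequency is normalized by document length.
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical three-document corpus, already tokenized.
docs = [
    "topic models extract topics from text".split(),
    "sentiment models score text".split(),
    "word clouds summarize text".split(),
]
scores = tf_idf(docs)
```

In this toy corpus, "text" appears in every document, so its weight drops to zero, while "topic" (local to the first document) outweighs "models" (shared by two documents).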
Visualizations
A dataset of articles about “machine reading” from Springer, IEEE Xplore Digital Library, ACM and Google Scholar
Kept at document size and stored in one folder
An Integrated File of All Articles
1751 pp., > a million words from articles based on “machine reading” from the academic literature
Binder1 file
63 MB
Saved out as a .txt file from .pdf
Saved out as an .rtf file from .pdf
combining of article set of “machine reading” articles for corpus-based summaries
bar chart showing autocoded sentiment in four classifications from synthesized article set
intensity matrix
Use Case #2: Engaging with Social Media (Microblogging and Social Network) Data
Using NCapture web browser add-on (on Microsoft’s Internet Explorer or Google Chrome) to extract a Twitter Tweetstream
Using NCapture web browser add-on to extract a Facebook wall of postings
Ability to capture text messages, URLs, geographical coordinates, thumbnail images, and other data
Ability to run sentiment analysis over the collected data
Ability to run theme and sub-theme extraction over the collected data
Ability to map social networks (sociograms) based on interaction data
Use Case #2 Work Sequence
Capture a microblogging text set (Tweetstream, #hashtag conversation, keyword conversations, or other data extractions) from Twitter’s API
Run a text frequency count to capture a gist of the focus of the target text set (NVivo 11 Plus)
Run a word or phrase or name search to capture a sense of the gist of the targeted word use (NVivo 11 Plus)
Process the data and extract social networks to identify main communicators in the #hashtag network or the keyword network
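As a rough sketch of the network-extraction step, the Python below builds directed edges from @mentions and counts in-degree and out-degree to surface the main communicators. It assumes the captured data has already been exported as (author, message) pairs; the account names and messages are invented, and the approach is a simplification rather than NVivo 11 Plus's own procedure.

```python
import re
from collections import Counter

def mention_edges(tweets):
    """Build (author, mentioned_account) edges from (author, text) pairs."""
    edges = []
    for author, text in tweets:
        for target in re.findall(r"@(\w+)", text):
            if target.lower() != author.lower():
                edges.append((author.lower(), target.lower()))
    return edges

def degree_counts(edges):
    """Return out-degree and in-degree counters over distinct edges."""
    distinct = set(edges)
    out_degree = Counter(src for src, _ in distinct)
    in_degree = Counter(dst for _, dst in distinct)
    return out_degree, in_degree

# Hypothetical records; the account names are made up for illustration.
tweets = [
    ("alice", "Great session with @bob and @carol #sidlit"),
    ("bob", "Thanks @alice!"),
    ("carol", "RT @alice: Great session with @bob"),
]
edges = mention_edges(tweets)
out_degree, in_degree = degree_counts(edges)
```

Accounts with high in-degree are mentioned by many others, while accounts with high out-degree reach out to many others; either measure could be used to rank communicators in a #hashtag or keyword network.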
Use Case #2 Work Sequence (cont.)
Map the microblogging data to a geographical map
Capture the URLs from the data set for more analysis
Capture related imagery from the data set for more analysis
Look at the postings over time for more analysis
Conduct a theme and sub-theme analysis of the text set to capture meaning
Map the text set for sentiment to capture the general sentiment of the text set
◦ Examine the extracted textual data for deeper insights
dendrogram showing user names clustered by word similarity (suggestive of intercommunications and topical interests)
ring lattice graph / circle graph showing Twitter user accounts and clustering by word similarity from exchanged microblogged messaging
sociogram depicting the out-degree ego neighborhood (1 deg.) of a target user account on Twitter
sunburst hierarchy chart showing the “educator” theme and its related subthemes from Twitter Tweetstream
bar chart of extracted themes (in alphabetical order) from the Twitter account Tweetstream; subthemes not depicted here
least frequent word count to capture the “long tail” from the Tweetstream
often alphanumeric garble, misspellings, and a few nuggets of insight in the long tail
power law curve frequency distribution
An Example of the Long Tail (from @CDCGov Tweetstream)
frequency bar chart from Excel 2016 (x-axis: Words and String Data in the Long Tail; y-axis: Frequency Count, 0 to 30)
In the Long Tail… (with a somewhat arbitrary raw count to indicate the start of the long tail)
NOT USEFUL
Misspelled words
Alphanumeric garble
POSSIBLY USEFUL
URLs (uniform resource locators) or web addresses
Rare topics
Concepts
Unusual terms
Names, proper nouns, named entities
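One way to pull out the long tail is to count every token and keep those at or below a chosen raw count. The Python sketch below does this; the cutoff is as arbitrary here as the slide says it is, and the sample messages are invented rather than taken from the @CDCGov Tweetstream.

```python
import re
from collections import Counter

def long_tail(messages, max_count=1):
    """Return tokens whose raw count is at or below max_count, sorted."""
    counts = Counter()
    for message in messages:
        # Keep #hashtags and @mentions intact as single tokens.
        counts.update(re.findall(r"[#@]?\w+", message.lower()))
    return sorted(tok for tok, n in counts.items() if n <= max_count)

# Made-up messages for illustration only.
tweets = [
    "Flu season is here. Get your flu shot!",
    "Flu shot clinics open this weekend #fluvax",
    "Get the facts about flu season",
]
tail = long_tail(tweets, max_count=1)
```

The returned list still mixes the useful and the not useful, so a human pass (or further filters for URLs, hashtags, and named entities) is needed to separate rare topics from garble.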
bar chart of auto-extracted sentiments in four categories or a binary split:
• very negative, moderately negative, moderately positive, and very positive (and “neutral”);
• also as negative-positive polarity (and “neutral”)
Use Case #3: Exploring Manual and Machine Coding
Exploring human-coded (and / or auto-coded or machine-coded) nodes for pattern identification: interrelationships, similarity clustering, and others
Data queries to enable coding exploration: word frequency count, text search, and others
Matrix coding query to explore interrelationships between codes (nodes)
Use Case #3 Work Sequence
Raw source data ingested
Manual (and / or automated) coding of that data
Data queries of the coding to observe patterns in coding of that data
◦ Ability to set up text frequency counts based on various parameters: exact matches, stemmed words, synonyms, specializations, and generalizations
Text search with special character capabilities
Proximity searches for words occurring near a certain term
Special Features for Text Searches
◦ Wildcard searches (?): any one character
◦ Wildcard searches (*): any characters
◦ AND
◦ OR
◦ NOT
◦ Required
◦ Prohibit
◦ Fuzzy
◦ Near… (proximity searches) / an extension of memory
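The wildcard and proximity operators can be approximated with regular expressions. The Python sketch below is an illustration only, not NVivo's query engine: it assumes `?` matches exactly one word character, `*` matches any run of word characters within a single word, and "near" means within a given number of words. The function names and sample sentence are hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Translate ? (one character) and * (any characters) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(r"\w")
        elif ch == "*":
            parts.append(r"\w*")
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)

def near(text, first, second, distance=3):
    """True if the two words occur within `distance` words of each other."""
    words = re.findall(r"\w+", text.lower())
    first_at = [i for i, w in enumerate(words) if w == first]
    second_at = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= distance for i in first_at for j in second_at)

# Made-up sentence for illustration only.
text = "Players coded the text; one player recoded it by machine."
star_matches = wildcard_to_regex("play*").findall(text)
mark_matches = wildcard_to_regex("cod?d").findall(text)
```

AND, OR, and NOT could be layered on top by combining the results of several such searches with set intersection, union, and difference.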
Grouping
Enabling text searches, word frequency counts, and other queries from most specific to increasing gradations of generality (based on placement of the slider)
Exact matches
With stemmed words
With synonyms
With specializations
With generalizations
interactive word tree based on “player” seeding term from coded data (whether manual-coded or auto/machine-coded or combined)
Use Case #4: Machine Reading for Sentiment Analysis
Ability to conduct a fast extraction of sentiment
◦ either as a polarity (positive-negative), with uncoded neutral text, or as
◦ a four-category set of sentiment (very negative, moderately negative, moderately positive, and very positive) and one of neutrality
Ability to explore the sentiment-coded text sets for textual contents
◦ Ability to query the coded text sets for additional word relationships and patterns
Ability to uncode or re-code autocoded text for sentiment to increase accuracy (through human oversight)
Use Case #4: Machine Reading for Sentiment Analysis (cont.)
Can auto-code at three levels of granularity: sentence, paragraph, and cell (depending on the structure of the data)
◦ Sentences (granular) and paragraphs (coarser) are common in documents
◦ Cells are common in data tables, and many contain structured data but also phrases, URLs, and thumbnail imagery (from social media); data tables from online surveys may contain whole sentences and paragraphs
Use Case #4 Work Sequence
Run sentiment analysis on the text set
Auto-coded at paragraph level (coarser than at the granular sentence level)
Review the data visualizations (intensity matrix, bar chart)
Explore the text in each sentiment sub-category (very negative, moderately negative, moderately positive, very positive)
◦ Run word frequency count
◦ Run selected text searches
◦ Run theme and sub-theme extraction
◦ Create additional related data visualizations
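A dictionary-based sentiment pass of the kind described in this sequence can be sketched in Python as follows. The tiny lexicon, its scores, and the category thresholds are all hypothetical; NVivo 11 Plus uses its own sentiment dictionary, whose contents and scoring are not described in this presentation.

```python
import re

# Hypothetical mini-lexicon: word -> score from -2 (very negative) to +2.
LEXICON = {"terrible": -2, "bad": -1, "good": 1, "excellent": 2}

def classify(paragraph):
    """Assign one paragraph to a sentiment category from its mean word score."""
    words = re.findall(r"[a-z']+", paragraph.lower())
    scores = [LEXICON[w] for w in words if w in LEXICON]
    if not scores:
        return "neutral"
    mean = sum(scores) / len(scores)
    if mean <= -1.5:
        return "very negative"
    if mean < 0:
        return "moderately negative"
    if mean == 0:
        return "neutral"
    if mean < 1.5:
        return "moderately positive"
    return "very positive"

def code_text(text):
    """Auto-code at the paragraph level (paragraphs split on blank lines)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [(classify(p), p) for p in paragraphs]

# Made-up sample text for illustration only.
sample = "The workshop was excellent.\n\nThe audio was bad.\n\nIt ran two days."
coded = code_text(sample)
```

Because each coded paragraph stays attached to its category, the text in any one category can be fed straight into a word frequency count or text search, and a human reviewer can re-code items the dictionary got wrong.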
auto-coded nodes based on sentiment dictionary in either a four-category mode (top) or a binary two-category mode (bottom)
access to exportable coded text sets in each of the sentiment categories, with ability to re-code and un-code
Value-added Aspects to Machine Reading
Machine reading augments human capabilities for knowing. It decodes (“reads”) beyond a surface level of understanding of words in various contexts.
◦ It enables efficient access to latent insights.
◦ There are ways to chain processes and compare / contrast information in a way that illuminates new insights.
NVivo 11 Plus enables ways to save “macros” of the various data queries and autocoding sequences to enhance re-runs of the macro sequences. This may be helpful with continuing data collection, in order to update query results after the acquisition of new data.
There is value in comparing the outcomes from one software program to another…and in seeing what happens with different settings for the different machine reading approaches.
◦ Researchers may export data tables for additional analytics and data visualizations outside of the software.
Conclusion and Contact
Dr. Shalin Hai-Jew
◦ iTAC, Kansas State University
◦ 212 Hale / Farrell Library
◦ [email protected]
◦ 785-532-5262
No ties: The presenter has no formal tie to QSR International, the maker of NVivo 11 Plus, nor to Microsoft, the maker of Excel.
About the data visualizations: Some of the data extractions and processes used only a few dozen source items, in part because of the need for coherent data visualizations and also to enable processing of the data on a local Windows machine.
◦ However, in a server-hosted context, text sets closer to big(ger) data may be run using NVivo 11 Plus.
A sampling: The four “use cases” presented here are by no means comprehensive. These offer a taste of some of the possibilities.