New Software and Tools for Analyzing Communication
http://ica-cm.org/ica2017-tools/
Collecting and Analyzing Social Media Data Using SocialMediaLab
Timothy John Graham, The Australian National U
Robert Ackland, Australian National U
SocialMediaLab R Package
Aims to be the “Swiss army knife” for collecting social media data via free APIs and constructing datasets for network and text analysis
• Tim Graham (ANU, @TimothyJGraham)
• Rob Ackland (ANU, @RobAckland)
• Chung-hong Chan (Univ. of Hong Kong, @chainsawriot); new UI using magrittr
Download: https://cran.r-project.org/web/packages/SocialMediaLab/index.html
Tutorials / help: http://vosonlab.net/SocialMediaLab
SocialMediaLab data typology
Workflow: code to data to network
• Collect the 500 latest tweets from #ica17 and construct an “actor” network showing replies, mentions, and retweets between users
Try it yourself! https://goo.gl/8Rb1sA
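To make the "data to network" step concrete, here is a minimal Python sketch of how an actor network could be derived from tweet records. It is not SocialMediaLab's implementation or API: the record fields (`user`, `reply_to`, `mentions`, `retweet_of`) and the sample tweets are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical tweet records; the field names are illustrative,
# not the Twitter API schema or SocialMediaLab's data format.
tweets = [
    {"user": "alice", "reply_to": None, "mentions": ["bob"], "retweet_of": None},
    {"user": "bob", "reply_to": "alice", "mentions": [], "retweet_of": None},
    {"user": "carol", "reply_to": None, "mentions": ["alice", "bob"], "retweet_of": None},
    {"user": "dave", "reply_to": None, "mentions": [], "retweet_of": "alice"},
]


def actor_edges(tweets):
    """Return a Counter mapping (source, target, tie_type) to the number of ties."""
    edges = Counter()
    for t in tweets:
        if t["reply_to"]:
            edges[(t["user"], t["reply_to"], "reply")] += 1
        for mentioned in t["mentions"]:
            edges[(t["user"], mentioned, "mention")] += 1
        if t["retweet_of"]:
            edges[(t["user"], t["retweet_of"], "retweet")] += 1
    return edges


edges = actor_edges(tweets)
for (source, target, tie_type), weight in sorted(edges.items()):
    print(source, "->", target, tie_type, weight)
```

Each edge is directed from the tweeting user to the user being replied to, mentioned, or retweeted, so the resulting edge list can be loaded into any network analysis tool.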
Virtual Observatory for the Study of Online Networks (VOSON)
Robert Ackland, Australian National U
Virtual Observatory for the Study of Online Networks (VOSON) software
Web-based tool – originally for hyperlink network construction and analysis – from June 2017 VOSON includes Twitter collection
Created at the Australian National University - VOSON Lab http://vosonlab.net
Since 2010 VOSON has been commercially hosted and developed by Uberlink http://www.uberlink.com
– used by academics, students, analysts worldwide
– over 2500 user accounts issued
Uberlink VOSON development team:
– Rob Ackland (ANU, Uberlink Founder & CEO)
– Jamsheed Shorish (Uberlink CTO)
– Francisca Borquez (Uberlink Communication Officer & Research Assistant)
VOSON 2.5 will be released 6 June 2017
– Improved user interface/workflow
– More flexibility with database naming (e.g. special characters)
– Collect Twitter data from the real-time stream of tweets matching your search criteria (e.g. hashtag use) over a scheduled time period.
1. Scheduling a 1 hour collection on hashtag #ica17
2. While the collection runs, the number of nodes (Twitter users) collected is updated.
3. @mention tie network and key SNA metrics
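As an illustration of the kind of SNA metrics that can be read off an @mention tie network, the following Python sketch computes in-degree, out-degree, and density for a small directed network. The ties are made-up sample data, and the choice of metrics is an assumption; the slides do not specify which metrics VOSON reports or how it computes them.

```python
from collections import defaultdict

# Hypothetical directed @mention ties: (source, target) means source mentions target.
mention_ties = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("carol", "alice"),
    ("bob", "alice"),
]


def sna_metrics(ties):
    """Compute in-degree, out-degree, and density for a directed network without self-loops."""
    unique_ties = {(s, t) for s, t in ties if s != t}
    nodes = {node for tie in unique_ties for node in tie}
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for source, target in unique_ties:
        out_degree[source] += 1
        in_degree[target] += 1
    n = len(nodes)
    # Density: observed ties divided by the n * (n - 1) possible directed ties.
    density = len(unique_ties) / (n * (n - 1)) if n > 1 else 0.0
    return {
        "nodes": n,
        "in_degree": dict(in_degree),
        "out_degree": dict(out_degree),
        "density": density,
    }


metrics = sna_metrics(mention_ties)
print(metrics)
```

In-degree here counts how many distinct users mention an account, which is one common way to identify prominent actors in a mention network.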
Same, Same? Ensuring Comparative Equivalence in the Semantic Analysis of Heterogeneous, Multilingual Corpora
Christian Baden, Hebrew U of Jerusalem
Christian Baden | Noah Mozes Department of Communication & Journalism | Ensuring Comparative Equivalence
67th ICA Annual Conference | San Diego, CA, USA | 26-05 11 2017
PUNCHING THE BAG OF WORDS
…and some related approaches
BAG: Assumptions of relation homogeneity
WORDS: Assumptions about meaning uniqueness
JAMCODE
IN THE BEGINNING, THERE WAS THE WORD…
Lexical units are different from unique meanings:
Languages, jargons, etc.
Synonymy, Nicknames, Acronyms, etc.
Polysemy
Roots & Inflections
Partial meaning: Metonymy, Metaphor, etc.
Nested meanings
Polygrams: named entities, standing expressions
Anaphora, Coreference & Exophora
Need to map words as pointers onto meanings
DICTIONARY
‘Trump’ or ‘The Donald’ or ‘US President’ or ‘Agent Orange’ or …
…but not ‘Trump card’, ‘Ivanka Trump’, …
…only after 20 January 2017
Trump’s Valueless Foreign Policy
www.nytimes.com | Roger Cohen | 2 May 2017
So the threats were no more than bluster, and all is well. That is one view of President Trump’s foreign policy at the 100-day-or-so mark.
Wrong.
Yes, there’s no sign of the Wall, and NATO is no longer “obsolete,” and the Iran nuclear deal is still in place, […]
Defense Secretary Jim Mattis and H.R. McMaster, the national security adviser, have ring-fenced Trump’s recklessness and bellicosity. They have neutralized his ignorance even if nobody can help the president grasp its extent. Some of the loonier members of the president’s entourage have been fired or marginalized. Adults have taken charge. There is still a lot of noise, but “America First” has not upended the world.
The Wall: Music album by Pink Floyd
wall (n.): Continuous vertical brick or stone structure
fired (1): job contract ended
fired (2): launched projectile
The Wall: Trump’s policy proposal to protect border with Mexico
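The dictionary rule from the slide (include ‘Trump’, ‘The Donald’, ‘Agent Orange’; exclude ‘Trump card’ and ‘Ivanka Trump’; accept ‘US President’ only after 20 January 2017) could be sketched in Python roughly as follows. This is an illustrative approximation using regular expressions, not JAMCODE's actual dictionary syntax. Treating the inauguration day itself as included is an assumption.

```python
import re
from datetime import date

# Terms that point to the concept "Donald Trump" regardless of date.
INCLUDE = re.compile(r"\b(Trump|The Donald|Agent Orange)\b")
# Phrases that contain an included word but carry a different meaning.
EXCLUDE = re.compile(r"\b(Trump card|Ivanka Trump)\b", re.IGNORECASE)
# A term that only points to the concept under a date condition.
PRESIDENT = re.compile(r"\bUS President\b")
# Assumption: the condition "only after 20 January 2017" includes that day.
INAUGURATION = date(2017, 1, 20)


def count_trump_mentions(text, published):
    """Count references to the concept in text, given the publication date."""
    # Blank out excluded phrases first so their words cannot match INCLUDE.
    cleaned = EXCLUDE.sub(" ", text)
    count = len(INCLUDE.findall(cleaned))
    if published >= INAUGURATION:
        count += len(PRESIDENT.findall(cleaned))
    return count


sample = "Ivanka Trump said the US President played his Trump card; Trump agreed."
print(count_trump_mentions(sample, date(2017, 5, 2)))
print(count_trump_mentions(sample, date(2016, 11, 1)))
```

A rule of this kind handles synonymy and some exclusions, but not polysemy such as ‘fired’ or ‘the Wall’, which needs the surrounding context to resolve.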
Content Analysis Tool
LETTING THE CAT OUT OF THE BAG
Co-presence is related to relatedness,
but the relationship is complicated.
Macrosyntax: Headlines, Turns, Lists, etc.
Syntax: Clauses, Parentheses, Sentences, etc.
Conjunctions, Grammatical Roles & POS
Anaphora, Coreference & Exophora
Sequence & Proximity
Stylistic devices: Rhymes, Puns, Alliterations
Register & Contextual Knowledge
JAMCODE
Shallow parsing of Syntax and Macrosyntax
Scoring based on Proximity & Sequence
Imputation based on Contextual Discourse
Multi-feature Probabilistic Relatedness
Micro & Macro Order: clause, sentence, paragraph, title, …
Local Coherence: proximity probability
Intertextual Inference: Context Model
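To illustrate the idea of combining several features into one probabilistic relatedness score, here is a toy Python sketch. It is not JAMCODE's algorithm: the exponential distance decay, the fixed same-sentence weight, and the noisy-OR combination are all assumptions made for the example, and sequence (word order) and contextual imputation are not modelled.

```python
import math


def relatedness(tokens, i, j, decay=5.0, same_sentence_weight=0.5):
    """Toy relatedness score in [0, 1] for the tokens at positions i and j.

    tokens is a list of (word, sentence_index) pairs. Two features are combined:
    proximity (closer tokens score higher) and shared sentence membership.
    """
    distance = abs(i - j)
    p_proximity = math.exp(-distance / decay)
    p_syntax = same_sentence_weight if tokens[i][1] == tokens[j][1] else 0.0
    # Noisy-OR: the pair counts as related if at least one feature signals it.
    return 1.0 - (1.0 - p_proximity) * (1.0 - p_syntax)


tokens = [
    ("Adults", 0), ("have", 0), ("taken", 0), ("charge", 0),
    ("There", 1), ("is", 1), ("still", 1), ("noise", 1),
]

within = relatedness(tokens, 0, 3)   # "Adults" and "charge": same sentence
across = relatedness(tokens, 3, 6)   # "charge" and "still": same distance, different sentences
print(round(within, 3), round(across, 3))
```

Both pairs are three tokens apart, so a pure proximity measure would score them equally; adding the sentence boundary as a second feature separates them.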
Automatic Text Analysis Made Easy: Using AmCAT, NLPipe, and R
For Corpus Management, Linguistic Processing, and Automatic Text Analysis
Wouter van Atteveldt, VU Amsterdam
Kasper Welbers, U Leuven
et al.
AmCAT - Easy document management and querying
● Manage large text collections
  ○ Rights management for multiple users
  ○ Upload plain text, csv, PDF, LexisNexis, …
● Complex keyword queries
● Quantitative manual coding
● Available for use!
  ○ Free and open source
  ○ Use amcat.nl, set up your own server, or use the Docker image
An API for power users
● All functionality available through API
● Use python/R to manage and analyse data
● Scrape and upload articles
● Conduct automatic queries
● Download text, metadata
● Upload new article sets
● Create projects, users, etc.
● Workflow:
  ○ Corpus/project management and explorative analysis using website
  ○ Reproducible queries using API
  ○ Download text or results and connect to other tools (topicmodels etc)
NLPipe - easy NLP processing
● Setting up NLP tools can be challenging
  ○ Lemmatizing, POS tagging, parsing
  ○ Need to download tools, fix prerequisites, install
● NLPipe provides a simple interface to multiple tools
  ○ CoreNLP, Alpino, Frog
  ○ Connect from R, python
  ○ Works on local computer or distributed (server/worker/clients)
● Can be installed as docker image for server/workers
● Easy to connect to quanteda, corpustools, tm, ...
Facebook Page Data Extraction for Nonprogrammers: Introducing the Netvizz and Facepager Tools
Michael Che Ming Chan, Chinese U of Hong Kong
NETVIZZ (Rieder, 2013)
Online access through https://apps.facebook.com/netvizz/. Must have Facebook account.
Use Facebook Page ID
What posts to extract
Data to output
FACEPAGER (Keyling & Jünger, 2016)
Download program from https://github.com/strohne/Facepager. PC or Mac version. Must have Facebook account.
Customize field output to database file
User Facebook credentials
Data request commands
SOME ANALYTICAL POSSIBILITIES
Corpustools: An R Package for Text Analysis Beyond Bags of Words
Kasper Welbers, U of Leuven
Wouter van Atteveldt, VU Amsterdam
Kasper Welbers, KU Leuven & Wouter van Atteveldt, VU
corpustools: An R package for text analysis beyond bag-of-words
Why another R corpus package?
- Focus on maintaining token data
Full text → tokens → bag-of-words
doc_id     token       token_index  lemma       pos
111541965  It          1            it          O
111541965  is          2            be          V
111541965  our         3            we          O
111541965  unfinished  4            unfinished  A
111541965  task        5            task        N
111541965  to          6            to          ?
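The difference between maintaining token data and reducing it to a bag of words can be shown with the rows from the table above. The following Python sketch is only an illustration of the data structures, not corpustools code: the function names are hypothetical.

```python
from collections import Counter

# Token rows from the slide: (doc_id, token, token_index, lemma, pos)
tokens = [
    ("111541965", "It", 1, "it", "O"),
    ("111541965", "is", 2, "be", "V"),
    ("111541965", "our", 3, "we", "O"),
    ("111541965", "unfinished", 4, "unfinished", "A"),
    ("111541965", "task", 5, "task", "N"),
    ("111541965", "to", 6, "to", "?"),
]


def bag_of_words(rows):
    """Collapse token rows to lemma counts per document, discarding order and position."""
    bags = {}
    for doc_id, _token, _index, lemma, _pos in rows:
        bags.setdefault(doc_id, Counter())[lemma] += 1
    return bags


def text_from_tokens(rows, doc_id):
    """Rebuild the running text of one document from its token rows."""
    ordered = sorted((r for r in rows if r[0] == doc_id), key=lambda r: r[2])
    return " ".join(r[1] for r in ordered)


print(bag_of_words(tokens)["111541965"])
print(text_from_tokens(tokens, "111541965"))
```

The token table can always be reduced to a bag of words, but the reverse is not possible: once positions are dropped, the original text, word order, and proximity information cannot be recovered.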
The tCorpus class
An R6 class
– Reference class, to prevent unnecessary copies
– Intuitive syntax for methods
– Clear distinction public/private
Token and meta data use the data.table package
– Memory efficient and fast
Some cool features:
- Basic preprocessing
- Word co-occurrence
- Document similarity and fuzzy deduplication
- Complex Boolean queries
- Keyword + condition queries
- KWIC that also supports co-occurrence
- Vocabulary comparison
- Annotating the token data based on various analyses, e.g., LDA
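As an illustration of one feature in the list, keyword-in-context (KWIC) lookup depends on token order and therefore cannot be done on a bag of words. The Python sketch below shows the general technique only. It is not the corpustools implementation or its function signature, and the `kwic` helper and `window` parameter are hypothetical.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each case-insensitive match."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


words = "There is still a lot of noise but America First has not upended the world".split()
hits = kwic(words, "noise", window=2)
for left, keyword, right in hits:
    print(f"{left} [{keyword}] {right}")
```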
In progress:
- dealing with data that doesn't fit in memory
- making fancy text browsers with annotations (being developed as the tokenbrowser package)
MPPA
Code: github.com/kasperwelbers/corpustools
CRAN: planned for this summer
Related packages:
- RNewsflow
- tokenbrowser (formerly topicbrowser)
- semnet
Interactive Tool demos
http://ica-cm.org/ica2017-tools/